What is toil, and why are SREs obsessed with it?

This article originally appeared in “97 Things Every Cloud Engineer Should Know,” edited by Emily Freeman and Nathen Harvey (O’Reilly Media, 2021).

Site Reliability Engineers love to hate toil, but what is toil? Why are SREs obsessed with removing toil? Site Reliability Engineering is what happens when you treat operations like a software problem. How do you treat ops like a software problem?

SRE can feel opaque, but in practice it is the essence of engineering: remove inefficiencies in one component so that other components can perform quantifiably better. Software engineers want their code to be simple, fast, and reliable: free of bugs and cruft. SREs want operations to be free of bugs and cruft, too! Cruft and bugs in ops and infrastructure can be described in one word: toil. Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Put simply, toil is any engineering effort devoid of meaningful value.

If a piece of software is going to be used, we should make commitments, at a minimum to ourselves, that it is reliable, secure, and observable, but there is NO such thing as 100% reliable or 100% secure. When issues occur, software engineers need to be able to identify the issue, remediate or recover, and restore service. Slowing down to find all potential problems before release isn’t the answer. If we slow down releases, we sacrifice velocity, and the features we spent engineering effort on don’t get released. Increased velocity is what we want: we want to ship new features quickly and release often. The answer lies in automation, in removing as much toil as we can from the process of getting software deployed. We need automated testing in CI/CD pipelines, automated infrastructure provisioning and control via infrastructure as code, and automated monitoring and alerting for when bad things happen. We need to remove as much manual, repetitive, low-return work as possible, so we can spend our efforts engineering new features and new software.
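To make the automated alerting idea concrete, here is a minimal sketch of a check that a monitoring job might run instead of a person watching a dashboard. Everything in it is an assumption for illustration: the `Window` shape, the `should_alert` name, and the 1% default threshold are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one monitoring interval (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def should_alert(window: Window, max_error_rate: float = 0.01) -> bool:
    """Return True when the observed error rate exceeds the threshold.

    The 1% default is an illustrative assumption, not a recommendation.
    """
    if window.total_requests == 0:
        # No traffic in this interval, so there is no error rate to judge.
        return False
    error_rate = window.failed_requests / window.total_requests
    return error_rate > max_error_rate


if __name__ == "__main__":
    # 25 failures out of 1000 requests is 2.5%, above the 1% threshold.
    print(should_alert(Window(total_requests=1000, failed_requests=25)))
    # 5 failures out of 1000 requests is 0.5%, below the threshold.
    print(should_alert(Window(total_requests=1000, failed_requests=5)))
```

The point is less the arithmetic than where it runs: once a check like this is code, it can be tested, versioned, and executed on every interval without anyone doing it by hand.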

Toil shows up not only while we are building and shipping features. When things go wrong, toil gets in the way of remediation and recovery. Debugging a broken deployment script or manually managing environment drift takes us away from positive work and forces us to focus on negatives. If we automate as many negatives as we can out of the equation, we get to spend more of our time on the positives. Removing toil from the entire software lifecycle makes that lifecycle quantifiably more efficient and effective, more reliable and secure. Removing toil makes the development experience more enjoyable. It makes deployments more enjoyable. Removing toil makes error remediation and incident response faster. Removing toil from the lifecycle makes engineers happier, and happy engineers create better software!