Recommended Reading
SRE Recommended Reading

Foundational Reading
  Site Reliability Engineering (Google)
  Site Reliability Engineering Workbook (Google)
  Designing Distributed Systems by Brendan Burns
  Building Secure and Reliable Systems (Google)

DevOps and SRE Popular titles
  The Phoenix Project by Gene Kim
  The Unicorn Project by Gene Kim
  Accelerate by Nicole Forsgren, Jez Humble, Gene Kim
  The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations by Gene Kim

Programming Language References and Introductions
  Python for DevOps by Gift, Behrman, Deza, & Gheorghiu
  Introducing Go by Caleb Doxsey
  Think Java by Downey and Mayfield

SRE Concept Deep Dive
  Implementing Service Level Objectives by Alex Hidalgo
  Chaos Engineering by Casey Rosenthal
  Seeking SRE by Blank-Edelman

Kubernetes
  Kubernetes Up and Running (2nd Edition) by Burns, Beda, and Hightower
  Kubernetes Best Practices by Michael Elder
  Cloud Native DevOps with Kubernetes by John Arundel

Observability and Monitoring Systems
  Production Kubernetes by Rosso, Lander, Brand, & Harris
  97 Things Every SRE Should Know edited by Emil Stolarsky and Jaime Woo
  Distributed Tracing in Practice by Austin Parker

Engineering Systems for Production Workloads
  97 Things Every Cloud Engineer Should Know edited by Emily Freeman and Nathen Harvey
  Prometheus Up and Running by Brian Brazil
  Cloud Native by Scholl and Swanson
  Cloud Native Transformation by Reznik, Dobson, and Gienow

Math Foundations for Advanced Reliability
  Think Bayes by Allen Downey
  40 Algorithms Every Programmer Should Know
  Hands-On Data Analysis with Pandas
  Practical Applications of Bayesian Reliability
  Bayesian Statistics the Fun Way: Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks by Will Kurt

Resilience vs Reliability vs Antifragility
  Resilience Engineering Association: a good resource for an introduction to resilience engineering (RE) and a gateway to more depth in safety science.
Treat your Infrastructure like Software
This article originally appeared in “97 Things Every Cloud Engineer Should Know,” edited by Emily Freeman and Nathen Harvey (O’Reilly Media, 2021).
Infrastructure is important. Infrastructure and application code are equally critical to your success as a cloud engineer. Most engineers either choose the correct runtime environment up front or iterate through runtime environments until they find the appropriate one for their application. How you provision, deploy, and recover whatever infrastructure you use is just as critical as choosing the appropriate runtime.
What is toil, and why are SREs obsessed with it?
This article originally appeared in “97 Things Every Cloud Engineer Should Know,” edited by Emily Freeman and Nathen Harvey (O’Reilly Media, 2021).
Site Reliability Engineers love to hate toil, but what is toil? Why are SREs obsessed with removing toil? Site Reliability Engineering is what happens when you treat operations like a software problem. How do you treat ops like a software problem?
SRE can feel opaque, but in practice it is the essence of engineering: remove inefficiencies in one component so that other components can perform quantifiably better.