Takeaways from Google's Site Reliability Engineering Book
28/Jul 2019
This popular book was on my list for a while until I recently found the time to read it. The book is about how software systems are managed throughout their lifecycle at Google's massive scale. Here I will jot down some key takeaways.
SRE enables a better balance between innovation and reliability of products. (Chapter 3)
Introducing planned outages may help identify components that make false assumptions about reliability. (Chapter 4)
Toil exhausts engineers and hampers the speed of innovation, and therefore should be eliminated. (Chapter 5)
A healthy monitoring system and alerting pipeline should give proper alerts without introducing noise, and should ease debugging and root cause analysis. (Chapter 6)
Automation enables success and failure at scale. (Chapter 7)
Trust but verify. Perfect algorithms may not have perfect implementations. Hope is not a strategy. Backups are useful only when recovery works. Recovery won’t work without practice. (Chapter 26)
Best practices for production services: fail sanely, progressive rollouts, define SLOs like a user, error budgets, monitoring, postmortems, capacity planning, overloads and failure, SRE teams. (Appendix B)
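To make the "error budgets" item a bit more concrete, here is a minimal sketch of the arithmetic as I understand it: the budget is whatever unreliability the SLO leaves over. The function names, the 99.9% target, and the request counts are my own illustrative assumptions, not something taken from the book.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaking the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    """
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all, so nothing can remain.
        return 0.0
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))           # allowed failures
    print(budget_remaining(0.999, 1_000_000, 250))  # share of budget left
```

The point of the number is that it turns the innovation-versus-reliability trade-off into something a team can act on: while budget remains, risky launches can go ahead, and once it is spent, the focus shifts back to stability.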