10.1 SLIs, SLOs, and SLAs
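A Service Level Indicator (SLI) is a measured quantity that reflects service health, such as the fraction of requests served successfully. A Service Level Objective (SLO) is a target value or range for an SLI, such as 99.9% of requests succeeding over 30 days. A Service Level Agreement (SLA) is a contract with customers that specifies the consequences of missing agreed service levels. Together they define how reliable a service is expected to be.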
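To make the relationship between an indicator and an objective concrete, here is a minimal sketch, assuming an availability-style SLI computed as good events divided by total events. The function names and the request counts are hypothetical and chosen only for illustration; real services often use latency or other indicators instead.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good (an SLI between 0 and 1)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical numbers: 999,500 successful requests out of 1,000,000.
sli = availability_sli(999_500, 1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is what gets measured and the SLO is the threshold it is compared against. An SLA would sit outside this code, as the agreement describing what happens when the objective is missed.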
10.2 Error Budgets
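An error budget is the amount of unreliability a service is allowed over a given period before it misses its reliability target, for example the minutes of downtime permitted by an SLO or SLA. It balances innovation against reliability: while budget remains, teams can ship changes and take risks; once it is spent, work shifts to stabilizing the service.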
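The arithmetic behind an error budget can be sketched as follows, assuming a time-based budget computed as (1 - target) multiplied by the length of the window. The function names, the 30-day window, and the downtime figure are assumptions for illustration; some teams budget in failed requests rather than minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime allowed, in minutes, for a target over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Return the unspent budget in minutes; negative means it is overspent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Hypothetical case: a 99.9% target over 30 days with 10 minutes of downtime.
budget = error_budget_minutes(0.999, 30)
left = budget_remaining(0.999, 30, downtime_minutes=10.0)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

Under these assumptions a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, which shows how quickly the budget shrinks as the target approaches 100%.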
10.3 Incident Management
Incident management is the process of responding to an unplanned event or service interruption and restoring the service to its normal operational state.
10.4 Post-Mortems
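A post-mortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, and the root cause(s). In a blameless post-mortem, the analysis focuses on systemic and process failures rather than on assigning fault to individuals.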
🎯 Practical Exercise
Draft a post-mortem for a hypothetical outage, identifying the root cause and action items to prevent recurrence.