Mean time to recovery
DevOps is a practice that blurs the line between development & operations. In a DevOps practice, four main metrics are used to measure the success of a DevOps pipeline:
- Lead time for changes
- Change failure rate
- Deployment frequency
- Mean Time To Recovery
Of these four metrics, Mean Time To Recovery (MTTR) shows how quickly an organization bounces back from failure. Organizations should aim to reduce MTTR to keep their systems stable.
So what is MTTR? MTTR is the average time taken to recover from a system or product failure. It covers the entire time a product or service is unavailable, from the start of the outage to the point where it is recovered.
MTTR is calculated by adding up all the downtime for a specific period and dividing it by the number of outages in that period. For example, if you had 2 outages in a period of 7 days that amount to a total of 1 hour of downtime, the MTTR would be 1 hour / 2 = 30 minutes. We are making 2 assumptions here:
1. Outages and recovery happen sequentially
2. Recovery is handled by trained professionals
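The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name mean_time_to_recovery and the 40/20 minute split between the two outages are assumptions made for the example, only the 1 hour total comes from the text.

```python
from datetime import timedelta


def mean_time_to_recovery(downtimes):
    """Return the average downtime per outage.

    `downtimes` holds one timedelta per outage in the period.
    """
    if not downtimes:
        raise ValueError("at least one outage is required")
    # Total downtime divided by the number of outages.
    return sum(downtimes, timedelta()) / len(downtimes)


# Two outages in a 7-day period, 1 hour of downtime in total.
# The 40 + 20 minute split is hypothetical.
outages = [timedelta(minutes=40), timedelta(minutes=20)]
print(mean_time_to_recovery(outages))  # 0:30:00
```

Note that the length of the period (7 days) does not appear in the formula. It only decides which outages are counted.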
Now we know how to calculate MTTR. As a rule of thumb, try to keep the MTTR measured over a year below 6 hours. Now let’s have a look at the factors that can increase your organization’s MTTR:
- Delay in alerting on the outage
- Delay in diagnosing the issue that causes the outage
- Time taken to apply a fix
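To see which of these factors dominates, it can help to split each outage into phases. Below is a minimal sketch, assuming you can record four timestamps per incident. The Incident class, its field names and the sample times are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # outage begins
    alerted: datetime    # on-call support is notified
    diagnosed: datetime  # cause of the outage is identified
    recovered: datetime  # fix is applied and service is restored


def phase_breakdown(incident):
    """Split the total downtime into the three delays listed above."""
    return {
        "alerting": incident.alerted - incident.started,
        "diagnosis": incident.diagnosed - incident.alerted,
        "fix": incident.recovered - incident.diagnosed,
    }


# Hypothetical 40-minute outage.
incident = Incident(
    started=datetime(2024, 1, 1, 10, 0),
    alerted=datetime(2024, 1, 1, 10, 10),
    diagnosed=datetime(2024, 1, 1, 10, 25),
    recovered=datetime(2024, 1, 1, 10, 40),
)
for phase, duration in phase_breakdown(incident).items():
    print(f"{phase}: {duration}")
```

The three phases always add up to the total downtime of the incident, so shrinking any one of them lowers the MTTR.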
In my current organization, we use “OpsGenie” for alerting & on-call management. OpsGenie provides tools to effectively alert organizations about issues and escalate them to on-call support as necessary. Its tools & integrations include Slack messages on errors, SMS & voice calls to responders & incident progress tracking.
Interestingly, MTTR has other meanings as well. Those are:
- Mean Time To Repair
This gives us the average time taken to fix the issue, test the fix & apply the fix. For example, if you had 2 outages in a period of 7 days that amount to a total of 1 hour of downtime, of which 30 minutes were spent fixing, testing & applying the fix, the MTTR for that period would be 30 minutes / 2 = 15 minutes.
With this metric, you can get a rough idea of how effective a team is at repairing an issue.
- Mean Time To Resolve
This is an extension of Mean Time To Recovery. The calculation also includes the time spent on safeguarding your product against further failures. So it covers the time spent on diagnosing, fixing & restoring, plus the time spent on further fixes to prevent the failure from happening again. For example, if you had 2 outages in a period of 7 days that amount to a total of 1 hour of downtime & took an additional 2 hours to set up protective measures to prevent this from happening in the future, the MTTR for that period would be (1 hour + 2 hours) / 2 = 1.5 hours.
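The same arithmetic as a short Python sketch. The function name mean_time_to_resolve and its parameters are made up for this example. The inputs are the period totals from the text.

```python
from datetime import timedelta


def mean_time_to_resolve(total_downtime, total_prevention, outage_count):
    """Average of downtime plus follow-up prevention work per outage."""
    if outage_count <= 0:
        raise ValueError("outage_count must be positive")
    return (total_downtime + total_prevention) / outage_count


# 1 hour of downtime + 2 hours of protective measures across 2 outages.
result = mean_time_to_resolve(timedelta(hours=1), timedelta(hours=2), 2)
print(result)  # 1:30:00
```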
- Mean Time To Respond
This leaves out any delay in alerting operations about the outage and only counts the actual time spent on recovery & fixing. With this, a team’s success in tackling outages can be measured.
For example, if you had 2 outages in a period of 7 days that amount to a total of 1 hour of downtime & the team spent 30 minutes fixing them, the MTTR for that period would be 30 minutes / 2 = 15 minutes.
Apart from these MTTR measurements, there are a number of other metrics available to calculate & visualize how stable a system is. This write-up only looked into MTTR as a metric.
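The sketch below puts Mean Time To Respond next to Mean Time To Recovery for the same two outages. The per-outage numbers are hypothetical: only the totals (1 hour of downtime, 30 minutes of hands-on work) come from the example above, and the helper mean is introduced just for this illustration.

```python
from datetime import timedelta

# (alert delay, hands-on recovery time) per outage.
# The split per outage is an assumption made for this example.
outages = [
    (timedelta(minutes=20), timedelta(minutes=20)),
    (timedelta(minutes=10), timedelta(minutes=10)),
]


def mean(durations):
    """Average of an iterable of timedeltas."""
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


# Recovery counts the whole outage, including the alert delay.
mean_recovery = mean(delay + work for delay, work in outages)
# Respond counts only the time after the team was alerted.
mean_respond = mean(work for _, work in outages)

print(mean_recovery)  # 0:30:00
print(mean_respond)   # 0:15:00
```

The gap between the two numbers is the average alerting delay, which is the part that alerting tools are meant to shrink.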