Why, what, and how to measure success in DevOps
The old adage that you can’t improve what you don’t measure is as true for DevOps as for any other practice. To fulfill the promise of DevOps (shipping higher-quality products, faster), teams need to collect and analyze a range of metrics. These metrics give DevOps teams the visibility and control they need over their development pipeline.
What are DevOps metrics?
DevOps metrics are data points that directly reveal the performance of a DevOps software development pipeline and help quickly identify and remove any bottlenecks in the process. These metrics can be used to track both technical capabilities and team processes.
At its core, DevOps focuses on blurring the line between development and operations teams, enabling greater collaboration between developers and system administrators. Metrics allow DevOps teams to measure and assess collaborative workflows and to track progress toward high-level goals, including increased quality, faster release cycles, and improved application performance.
Four critical DevOps metrics
Though there are numerous metrics used to measure DevOps performance, the following are four key metrics every DevOps team should measure.
1. Lead time for changes
One of the critical DevOps metrics to track is lead time for changes. Not to be confused with cycle time (discussed below), lead time for changes is the length of time between when a code change is committed to the trunk branch and when it is in a deployable state, for example, when the code has passed all necessary pre-release tests.
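As a rough illustration of the definition above, the following Python sketch computes lead time per change and reports the median. The record layout (a commit timestamp paired with the time the change became deployable) and the function names are assumptions made for this example, not part of any particular tool.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample data: (committed to trunk, reached a deployable state).
changes = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 11, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 9, 10, 0)),
    (datetime(2024, 1, 9, 16, 0), datetime(2024, 1, 9, 17, 0)),
]


def lead_times_hours(records):
    """Return the lead time of each change in hours."""
    return [
        (deployable - committed).total_seconds() / 3600
        for committed, deployable in records
    ]


def median_lead_time_hours(records):
    """Median is used here because a few slow changes can skew the mean."""
    return median(lead_times_hours(records))


print(median_lead_time_hours(changes))  # 2.5
```

Reporting the median rather than the mean is a design choice for this sketch; either can be tracked as long as the team uses it consistently.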
2. Change failure rate
The change failure rate is the percentage of code changes that require hot fixes or other remediation after being deployed to production. It does not count failures that are caught by testing and fixed before the code is deployed.
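The calculation can be sketched in a few lines of Python. The `needed_remediation` flag is a hypothetical field; in practice a team would have to decide how to label a deployment as failed, for instance by linking it to a hot fix or rollback.

```python
def change_failure_rate(deployments):
    """Percentage of production deployments that needed remediation.

    Each deployment is a dict with a boolean 'needed_remediation' key
    (a hypothetical field name used only for this example).
    """
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["needed_remediation"])
    return 100.0 * failed / len(deployments)


# Hypothetical sample: 20 production deployments, 3 of which needed a fix.
sample = [{"needed_remediation": i < 3} for i in range(20)]
print(change_failure_rate(sample))  # 15.0
```

Only production deployments belong in the input list; defects caught before release are excluded by definition.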
3. Deployment frequency
Understanding how often new code is deployed into production is critical to understanding DevOps success. Many practitioners use the term “delivery” to mean code changes that are released into a pre-production staging environment, and reserve “deployment” to refer to code changes that are released into production.
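One simple way to express deployment frequency is the average number of production deployments per day over a reporting window. The sketch below assumes the team can export a list of production deployment dates; the function name and the sample data are illustrative only.

```python
from datetime import date


def deployments_per_day(deploy_dates, start, end):
    """Average production deployments per calendar day in [start, end]."""
    days_in_window = (end - start).days + 1
    in_window = [d for d in deploy_dates if start <= d <= end]
    return len(in_window) / days_in_window


# Hypothetical sample: 14 production deployments in one week.
deploys = (
    [date(2024, 1, 8)] * 3
    + [date(2024, 1, 9)] * 2
    + [date(2024, 1, 10)] * 4
    + [date(2024, 1, 12)] * 5
)
print(deployments_per_day(deploys, date(2024, 1, 8), date(2024, 1, 14)))  # 2.0
```

Following the distinction drawn above, releases to a staging environment would not be included in `deploy_dates`.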
4. Mean time to recovery
Mean time to recovery (MTTR) measures how long it takes to recover from a partial service interruption or total failure. This is an important metric to track, regardless of whether the interruption is the result of a recent deployment or an isolated system failure.
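MTTR is the average of the recovery durations across incidents. The following Python sketch assumes each incident is recorded as a pair of timestamps (when the interruption began and when service was restored); that layout and the function name are assumptions for illustration.

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to recovery in minutes over (started, restored) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum(
        (restored - started).total_seconds() for started, restored in incidents
    )
    return total_seconds / len(incidents) / 60


# Hypothetical incidents lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 30)),
    (datetime(2024, 1, 10, 22, 0), datetime(2024, 1, 10, 23, 30)),
    (datetime(2024, 1, 12, 3, 0), datetime(2024, 1, 12, 4, 0)),
]
print(mttr_minutes(incidents))  # 60.0
```

Incidents are included regardless of cause, whether they stem from a recent deployment or an isolated system failure.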
How to measure, use, and improve DevOps metrics
Like other elements of the DevOps lifecycle, DevOps metrics benefit from a culture of continuous improvement. The ability to receive fast feedback at each phase of development, coupled with the skill and authority to act on that feedback, is a hallmark of high-performing teams. In the DevOps book “Accelerate”, the authors note that the four core metrics listed above are supported by 24 capabilities that high-performing software teams adopt. We cover most of these capabilities below (CI/CD, test automation, working in small batches, monitoring, and continuous learning), but it is worth reading “Accelerate” for a deeper dive into the research that supports these practices.
Lead time for changes
High-performing teams typically measure lead times in hours, whereas medium- and low-performing teams measure lead times in days, weeks, or even months.
Test automation, trunk-based development, and working in small batches are key to improving lead time. These practices give developers fast feedback on the quality of the code they commit so they can identify and remediate any defects. Long lead times are almost guaranteed if developers work on large changes that live on separate branches and rely on manual testing for quality control.
Change failure rate
High-performing teams have change failure rates in the 0-15 percent range.
The same practices that enable shorter lead times — test automation, trunk-based development, and working in small batches — correlate with a reduction in change failure rates. All these practices make defects much easier to identify and remediate.
Tracking and reporting on change failure rates is important not only for identifying and fixing bugs, but also for ensuring that new code releases meet security requirements.
Deployment frequency
High-performing teams can deploy changes on demand, and often do so many times a day. Lower-performing teams are often limited to deploying weekly or monthly.
The ability to deploy on demand requires an automated deployment pipeline that incorporates the automated testing and feedback mechanisms referenced in the previous sections, and minimizes the need for human intervention.
Mean time to recovery
High-performing teams recover from system failures quickly — usually in less than an hour — whereas lower-performing teams may take up to a week to recover from a failure.
The ability to recover quickly from a failure depends on the ability to quickly identify when a failure occurs, and then deploy a fix or roll back any changes that led to the failure. This is usually done by continuously monitoring system health and alerting operations staff in the event of a failure. The operations staff must have the necessary processes, tools, and permissions to resolve incidents.
The focus on MTTR is a shift away from the historical practice of focusing on mean time between failures (MTBF). It reflects the increased complexity of modern applications and thus, an increased expectancy of failure. It also reinforces the practice of continuous learning and improvement. Instead of waiting until the deploy is “perfect” to avoid any failure (and thus, resetting the old MTBF scoreboard), teams continuously deploy. Instead of placing blame for ruining a “perfect” MTBF record, MTTR encourages blameless retrospectives to help teams improve their upstream processes and tooling.
Other related metrics
Another relevant metric is cycle time, which is the time a team spends working on an item until it is ready for shipment. In the development world, cycle time is the time from when developers make a commit to the moment it is deployed to production. This key DevOps metric helps project leads and engineering managers better understand what works well in the development pipeline. As a result, they can better align their work with the expectations of stakeholders and customers, and help their teams ship faster.
Cycle time reports allow project leads to establish a baseline for the development pipeline that can be used to evaluate future processes. When teams optimize for cycle time, developers typically have less work in progress and fewer inefficient workflows.
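A baseline of this kind can be approximated by averaging the cycle times of recent work items and flagging the items that took longer. In the sketch below, the item keys, timestamps, and the choice of the mean as the baseline are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical work items: key -> (first commit, deployed to production).
work_items = {
    "ITEM-1": (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 9, 9, 0)),
    "ITEM-2": (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 10, 10, 0)),
    "ITEM-3": (datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 13, 12, 0)),
}


def cycle_time_hours(committed, deployed):
    """Hours between the commit and the production deployment."""
    return (deployed - committed).total_seconds() / 3600


cycle_times = {
    key: cycle_time_hours(committed, deployed)
    for key, (committed, deployed) in work_items.items()
}

# Use the mean of past items as the baseline for future comparisons.
baseline_hours = mean(list(cycle_times.values()))
slower_than_baseline = [
    key for key, hours in cycle_times.items() if hours > baseline_hours
]

print(baseline_hours)        # 56.0
print(slower_than_baseline)  # ['ITEM-3']
```

Items that exceed the baseline are candidates for a closer look at where time was spent, such as waiting for review or manual testing.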
In Lean product management, there is a focus on value stream mapping, which is a visualization of the flow from product or feature concept to delivery. DevOps metrics provide many of the essential data points for effective value stream mapping and management but should be enhanced with other business and product metrics for a true end-to-end evaluation. For example, sprint burndown charts give insight into the efficacy of estimation and planning processes, while a Net Promoter Score indicates whether the final deliverable meets customers’ needs.
Continuous improvement is a core tenet of teams practicing DevOps. The ability to measure and track performance across lead time for changes, change failure rate, deployment frequency, and MTTR allows teams to accelerate velocity and increase quality.
Atlassian’s Open DevOps provides everything teams need to develop and operate software. Teams can build the DevOps toolchain they want, thanks to integrations with leading vendors and marketplace apps.