Quantifying the Costs of Builds

In this document, I want to provide a model for calculating the costs of your builds and the return you get on improving them.

Developers and build engineers are under constant pressure to ship faster and more frequently. This requires fast feedback cycles and thus fast builds. At the same time, cloud, microservices, and mobile have made our software stacks more complex. Together with growing code bases, and without countermeasures, this will slow you down: builds become slower and build failures become harder to debug.

Improving this is a huge organizational challenge for build engineers and development teams. To trigger organizational change, it is important to have quantitative arguments. Inefficient builds do not just affect your ability to ship fast; they also waste a lot of your R&D bandwidth.

Some of our customers have more than 100,000 Gradle build executions a day. Most medium to large engineering teams will have at least thousands of builds a day. For many organizations, this is a multi-million dollar developer productivity problem that is right under your nose. And every effort to improve it should start with assessing its impact.

Meet our example team

Our example team consists of 200 engineers with the following parameters:

CM                      $1          Cost per minute of engineering time
DE                      230 days    Working days of an engineer per year
CE = DE * CM * 8 * 60   $110,400    Cost per engineering year
BL                      2,000       Local builds per day
BCI                     2,000       CI builds per day
DW                      250 days    Number of days the office is open per year
BYL = BL * DW           500,000     Local builds per year
BYCI = BCI * DW         500,000     CI builds per year
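As a sanity check, the derived values in the table can be recomputed directly. Here is a small Python sketch; the variable names mirror the table's symbols, and all values are the example team's assumptions, not measurements:

```python
# Example team parameters (the article's assumptions, not measurements).
CM = 1      # $ cost per minute of engineering time
DE = 230    # working days of an engineer per year
BL = 2000   # local builds per day
BCI = 2000  # CI builds per day
DW = 250    # days the office is open per year

CE = DE * CM * 8 * 60   # cost per engineering year (8 hours * 60 minutes)
BYL = BL * DW           # local builds per year
BYCI = BCI * DW         # CI builds per year

print(CE, BYL, BYCI)    # 110400 500000 500000
```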

The example numbers above and later in the article reflect what we typically see in the wild. But the numbers out there also vary a lot, and for some we have better averages than for others. The example numbers are helpful for getting a feeling for the potential magnitude of the hidden costs that come with builds. Your numbers might be similar or very different. You need your own data to get a good understanding of what your situation is and what your priorities should be. The primary purpose of this document is to provide a model with which you can quantify costs based on your own numbers.

The number of builds depends a lot on how your code is structured. If your code is distributed over many source repositories, you have more build executions than if the code lived in a single repository, which instead results in fewer but longer builds. But as a rule of thumb, one can say that successful software teams have many builds per day. It is a number that you want to see go up within your organization. As we evolve our model, we want to switch in the future to a lines-of-code-built-per-day metric.

Waiting Time for Builds

WL     80%    Average fraction of a local build that is unproductive waiting time
WCI    20%    Average fraction of a CI build that is unproductive waiting time

Local Builds

When developers execute local builds, waiting for the build to finish is pretty much idle time. Especially anything shorter than 10 minutes does not allow for meaningful context switching. That is why we assume WL is 80%. It could even be more than 100%: say people check and engage on Twitter while the local build is running; that distraction might take longer than the build itself.

Here is the cost for our example team of unproductive waiting time for each minute the local build takes:

BYL * WL * CM = 500000 * 0.8 * $1 = $400,000 per year

Vice versa, every saved minute is worth $400,000, and every saved second $6,667. A local build that is 17 seconds faster gives you additional R&D resources worth one engineer. If you make the build 5 minutes faster, which is often possible for teams of that size, you get back the resources of 18 engineers, which is 9% of your engineering team, or $2,000,000 worth of R&D costs!
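The per-minute value of faster local builds follows directly from the model. A Python sketch, using the example team's numbers (CE is the cost per engineering year from the parameter table):

```python
# Yearly cost of unproductive waiting per minute of local build time.
# All inputs are the example team's assumed values from the tables above.
BYL = 500_000   # local builds per year
WL = 0.8        # fraction of a local build that is unproductive waiting
CM = 1          # $ per minute of engineering time
CE = 110_400    # $ cost per engineering year

cost_per_build_minute = BYL * WL * CM       # waiting cost per minute of build time
cost_per_build_second = cost_per_build_minute / 60
engineers_per_minute_saved = cost_per_build_minute / CE

print(round(cost_per_build_minute))          # 400000
print(round(cost_per_build_second))          # 6667
print(round(engineers_per_minute_saved, 1))  # 3.6
```

Every minute shaved off the local build is thus worth roughly 3.6 engineering years per year for this team.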

The reality for most organizations is that builds take longer and longer as code bases grow, builds are not well maintained, outdated build systems are used, and the technology stack becomes more complex. If the local build time grows by a minute per year, our example team needs roughly 3.6 additional engineers just to maintain its output. Furthermore, talent is hard to come by, and anything you can do to make your existing talent more productive is worth gold.

If local builds are so expensive, why do them at all? ;) Actually, some organizations have come to that conclusion. But without the quality gate of a local build (including pull request builds), the quality of the merged commits drastically deteriorates, leading to a debugging and stability nightmare on CI and many other problems for teams that consume output from upstream teams. You would be kicking the can down the road at significantly higher cost.

It is our experience that the most successful developer teams build very often, both locally and on CI. The faster your builds are, the more often you can build, the faster you get feedback, and the more often you can release. So you want your developers to build often, and you want to make your builds as fast as possible.

CI Builds

The correlation between CI builds and waiting time is more complicated. Depending on how you model your CI process and what type of CI build is running, sometimes you are waiting and sometimes not. We don’t have good data for what the typical numbers are in the wild. But it is usually a significant aspect of the build cost, so it needs to be in the model. For this example we assume WCI is 20%, which results in:

BYCI * WCI * CM = 500000 * 0.2 * $1 = $100,000 per year costs of waiting time for developers for each minute the CI build takes.

Long CI feedback is very costly beyond the waiting cost:

  • Context switching for fixing problems on CI will be more expensive
  • The number of merge conflicts for pull request builds will be higher
  • The average number of changes per CI build will be higher, so finding the root cause of a failure takes longer and often requires all the people involved with the changes

We are working on quantifying the costs associated with those activities and they will be part of a future version of our cost model. The CI build time is a very important metric to measure and minimize.

Potential Investments to reduce waiting time

  • Only rebuild files that have changed (Incremental Builds)
  • Reuse build output across machines (Build Cache)
  • Collect build metrics to optimize performance (Build Performance Management)

The cost of debugging build failures

One of the biggest time sinks for developers is to figure out why a build is broken (see the challenge of the build engineer for more detail). When we say the build failed, it can mean two things. Something might be wrong with the build itself, e.g., an out of memory exception when running the build. We will talk about those kinds of failures in the next section. In this section, we talk about build failures caused by the build detecting a problem with the code (e.g., a compile, test or code quality failure). We’ve seen roughly these statistics for a team of that size:

FL     20%        Percentage of local builds that fail
FCI    10%        Percentage of CI builds that fail
IL     5%         Percentage of failed local builds that require an investigation
ICI    20%        Percentage of failed CI builds that require an investigation
TL     20 mins    Average investigation time for failed local builds
TCI    60 mins    Average investigation time for failed CI builds

Such failure rates for FL and FCI come with the territory of changing the code base and creating new features. If the failure rate is much lower I would be concerned about low test coverage or low development activity.

For many failed builds the root cause is obvious and does not require any investigation, but there are enough where you need to investigate which is expressed by IL and ICI. CI builds usually include changes from multiple sources. They are harder to debug, and multiple people might need to be involved. That is why TCI is larger than TL.

Costs

Debugging local build failures:

BYL * FL * IL * TL * CM = 500000 * 0.2 * 0.05 * 20 * $1 = $100,000 per year

Debugging CI build failures:

BYCI * FCI * ICI * TCI * CM = 500000 * 0.1 * 0.2 * 60 * $1 = $600,000 per year

Overall this is $700,000 per year.
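Both figures come from the same formula, which can be captured as a small helper. This is a sketch of the model; the parameter names follow the table above:

```python
# Yearly cost of debugging build failures:
# builds * failure rate * investigation rate * investigation time * cost per minute.
def debugging_cost(builds_per_year, failure_rate, investigation_rate,
                   investigation_minutes, cost_per_minute=1):
    return (builds_per_year * failure_rate * investigation_rate
            * investigation_minutes * cost_per_minute)

local = debugging_cost(500_000, 0.20, 0.05, 20)  # local builds: FL, IL, TL
ci = debugging_cost(500_000, 0.10, 0.20, 60)     # CI builds: FCI, ICI, TCI

print(round(local), round(ci), round(local) + round(ci))  # 100000 600000 700000
```

The same helper also reproduces the faulty-build-logic numbers later in the article; for example, debugging_cost(500_000, 0.002, 1, 240) rounds to the $240,000 figure.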

People often underestimate their actual failure rate. At the same time, there is quite a bit of variation in those numbers out in the wild. You may have teams with very long-running builds. Because the builds are slow, developers don't run them that often, and there are also fewer CI builds. Fewer builds mean a lower absolute number of build failures. Hey, long-running builds are saving money after all 😉 Not so fast: a small number of builds means a lot of changes accumulate until the next build is run. This increases the likelihood of a failure, so the failure rates go up. As many changes might be responsible for the failure, the investigation is more complex and the average investigation times go up. I have seen quite a few companies with average investigation times for CI failures of a day or more. This is expensive debugging, but the costs of such long-living CI failures go beyond that: they kill your very capability to ship software regularly and fast.

The basic rule is that investigation time grows exponentially with how late a failure shows up.

So, following up on the section on local build times: if developers don't run a pre-commit build, the failure rate and investigation time on CI will go up. Everything is connected. If you have very poor test coverage, your failure rate might be low, but that pushes the problems with your code to manual QA or production.

Potential Investments for reducing debugging costs

  • Tools that make debugging build failures more efficient
  • Everything that makes builds faster

Faulty build logic

If the build itself is faulty, those failures are particularly toxic. Such problems are often very hard to investigate and frequently look to the developer like a problem with the code.

FL     0.2%       Percentage of local builds that fail due to bugs in the build logic
FCI    0.1%       Percentage of CI builds that fail due to bugs in the build logic
IL     100%       Percentage of those failed local builds that require an investigation
ICI    100%       Percentage of those failed CI builds that require an investigation
TL     240 mins   Average investigation time for such failed local builds
TCI    120 mins   Average investigation time for such failed CI builds

FL is usually larger than FCI, as the local environment is less controlled and more optimizations are used to reduce build time, like incremental builds. If not properly managed, these often introduce some instability. Such problems usually require an investigation, which is why the investigation rate is 100%. They are hard to debug, for local builds even more so, as most organizations don't have any durable records of local build executions. So a build engineer needs to work together with a developer, trying to reproduce and debug the issue in her environment. For CI builds there is at least some primitive form of durable record that might give you an idea of what happened, like the console output. We have seen organizations with much higher rates for FL and FCI than 0.2% and 0.1%. But as this is currently very hard to measure, we don't have good averages and are therefore conservative with the numbers we assume for the example team.

Cost

Debugging local build failures:

BYL * FL * IL * TL * CM = 500000 * 0.002 * 1 * 240 * $1 = $240,000 per year

Debugging CI build failures:

BYCI * FCI * ICI * TCI * CM = 500000 * 0.001 * 1 * 120 * $1 = $60,000 per year

Overall this is $300,000 per year.

There is a side effect caused by those problems: if developers regularly run into faulty builds, they might stop using certain build optimizations like caching or incremental builds. This will reduce the number of faulty build failures but at the cost of longer build times. Also when it is expensive to debug reliability issues, it means they will often not get fixed. Investing in reliable builds is key.

Potential Investments

  • Collect build metrics that allow you to find the root causes effectively
  • Reproducible Builds
  • Disposable Builds

CI Infrastructure Cost

Often, half of the operational DevOps costs are spent on R&D, and the CI hardware is a big part of that. For our example team, a typical number would be $200K.

Potential Investments to reduce CI infrastructure cost

  • Reuse build output across machines (Build Cache)
  • Collect build metrics to optimize performance (Build Performance Management)

Overall Costs

We assume the following average build times for our example team:

AL     3 mins    Average build time for local builds
ACI    8 mins    Average build time for CI builds

This results in the following overall cost:

Waiting time for local builds    $1,200,000
Waiting time for CI builds         $800,000
Debugging build failures           $700,000
Debugging faulty build logic       $300,000
CI hardware cost                   $200,000
Absolute cost                    $3,200,000
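The summary above can be assembled from the model's pieces. A Python sketch, where all inputs are the example team's assumed values and the two debugging figures are carried over from the earlier sections:

```python
# Overall yearly build costs for the example team.
CM = 1                   # $ per minute of engineering time
BYL = BYCI = 500_000     # local / CI builds per year
AL, ACI = 3, 8           # average local / CI build time in minutes
WL, WCI = 0.8, 0.2       # unproductive waiting fractions

waiting_local = BYL * AL * WL * CM   # waiting time for local builds
waiting_ci = BYCI * ACI * WCI * CM   # waiting time for CI builds
debugging_failures = 700_000         # from the build-failure section
faulty_logic = 300_000               # from the faulty-build-logic section
ci_hardware = 200_000

total = (round(waiting_local) + round(waiting_ci)
         + debugging_failures + faulty_logic + ci_hardware)
print(total)   # 3200000
```

Plugging in your own build counts, build times, and failure rates into the same structure yields your organization's number.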

While this cost will never be zero, for almost every organization it can be significantly improved, for example by migrating from Maven to Gradle (see performance comparison). Cutting it in half would give you R&D resources worth 15 engineers for our example team of 200. And keep in mind that if you don't do anything about it, it will increase year by year as your code base and the complexity of your software stack grow.

There are a lot of other costs that are not quantified in the scenarios above. For example, the frequency of production failures due to ineffective build quality gates or very expensive manual testing for similar reasons. They add to the costs and the potential savings.

Why these opportunities stay hidden

I frequently encounter two primary obstacles that prevent organizations from realizing the benefits of investing in this area.

Immediate customer needs always come first

Especially when talking to teams with many small repositories, I regularly hear the statement that “build performance and efficiency is not a problem”. What do they mean by “not a problem”? In their case, it simply means that developers do not complain about build performance. For those teams, unless the developers are really yelling, nothing is a priority. While developer input is very important for build engineering, anecdotal input from developers should not be the sole source of prioritization. You might be leaving millions of dollars of lost R&D on the table. Build engineering should operate more professionally and in a more data-driven way.

Benefit is understated

For other teams, there is a lot of pain awareness, e.g., around long build times. So much, in fact, that the impact of incremental steps is underestimated, as they do not take the pain away completely. With a cost and impact model, the value of incremental steps would be much better appreciated. Such a model is also helpful for demonstrating progress and is an important driver for prioritizing further improvements.

Conclusions

We haven’t seen a company yet where investing into more efficient builds was not leading to a significant return on investment. The most successful software teams on the planet are the ones with an efficient build infrastructure.

Hans Dockter is founder and CEO of Gradle, the company behind the Gradle open source build tool and commercial products that help development teams accelerate software delivery. You can follow him on Twitter at @hans_d.