A modern and productive development environment should include both binary repositories and build caches. These terms can be defined as follows:
Binary repositories store binary artifacts along with the metadata in a defined directory structure, conceptually similar to a source code repository. The metadata describes the binary software artifact and includes information such as dependencies, versioning, and build promotions.
Build caches store intermediate build outputs at a lower level of granularity and allow builds to fetch these outputs from the cache when it is determined that inputs have not changed, avoiding the expensive work of regenerating them.
Because they share the goal of making the build process more efficient and because certain features overlap, it is not uncommon to assume that one can fully replace the other or mistakenly conclude that both target the same issues.
As you will see, binary repositories make the build process more efficient by improving reliability, while build caches focus on accelerating build speed. Both can play a needed and complementary role. Let’s analyse both tools so that we can better understand how exactly each tool achieves these objectives and contributes to an efficient development environment.
Note: It’s also common for developers to confuse build caching with the local .m2 directory that Apache Maven™ provides. The .m2 directory is simply an imperfect local mirror of a number of remote repositories. A build cache goes beyond that.
Binary Repositories Improve Build Reliability
Binary repositories from vendors like Sonatype (Nexus) or JFrog (Artifactory) provide long-lived storage for artifacts produced by a build. In most cases, these are the binaries produced by the build, or “deliverables”. By extension, they can also store things like Docker images produced by the build. In general, a remote repository stores final artifacts, which are produced at the end of the build lifecycle.
The goal of a binary repository is primarily to provide reliable and consistent access to binary artifacts of other software projects during the build process of a software project. Specifically, a binary repository provides long-term, reliable storage of binary artifacts which can be used by downstream consumers. It does this by acting as a “proxy” between different software components leveraging the concept of dependency management.
If we use the Java ecosystem to better understand what this means, a binary artifact is typically a JAR file generated during the packaging step of a build. This step is natively supported by most Java build tools (e.g. the ‘package’ phase in Maven or the ‘jar’ task in Gradle). These build tools also natively support interacting with local or remote binary repositories to download and consume already-built binary artifacts, typically referred to as dependencies.
Because repositories can be distributed and are not always co-located, build tools like Gradle and Maven offer optimizations to avoid repeatedly downloading the same binary dependencies from the same external repositories. This is the role of the .m2 (Maven) and modules-2 (Gradle) directories. While the strategies for “caching” differ, both tools try to reduce the impact of using a remote repository in a build by reducing the number of remote calls.
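The idea of reducing remote calls can be illustrated with a minimal sketch. It is not how Maven or Gradle implement their local directories; the class `LocalRepository`, the function `fake_remote_download`, and the coordinates `org.example:library:1.0.0` are all hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class LocalRepository:
    """Hypothetical local copy of artifacts, sitting in front of a remote repository."""

    def __init__(self, download: Callable[[str], bytes]) -> None:
        self._download = download              # performs the (expensive) remote call
        self._artifacts: Dict[str, bytes] = {}
        self.remote_calls = 0

    def resolve(self, coordinates: str) -> bytes:
        # Serve the local copy when present; otherwise download once and keep it.
        if coordinates not in self._artifacts:
            self.remote_calls += 1
            self._artifacts[coordinates] = self._download(coordinates)
        return self._artifacts[coordinates]


def fake_remote_download(coordinates: str) -> bytes:
    # Stand-in for an HTTP request to a remote binary repository.
    return f"bytes of {coordinates}".encode()


repo = LocalRepository(fake_remote_download)
for _ in range(3):
    repo.resolve("org.example:library:1.0.0")

print(repo.remote_calls)  # the dependency was resolved three times
```

Note that this only avoids downloading finished artifacts again. It says nothing about the intermediate work of the build itself.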
Is it correct to assume that a binary repository is all that is needed to ensure maximum build assembly efficiency? Let’s see why not.
Build Caches Improve Build & Test Performance
Building software consists of a number of steps, like compiling sources, executing tests, and linking binaries. We’ve seen that when such a step requires an external component, a binary repository helps by letting the build download the artifact from the repository rather than build it locally.
However, there are many additional steps in this build process which can be optimized to reduce the build time. An obvious strategy is to avoid executing the build steps which dominate the total build time when they are not needed. Most build times are dominated by the testing step. While binary repositories cannot capture the outcome of a test build step (only the test reports, when these are included in binary artifacts), build caches are designed to eliminate redundant executions of every build step. More generally, they avoid the work associated with any intermediate step of the build, including test execution, compilation and resource processing. To some extent, a cacheable build step is comparable to a pure function: given the same inputs, such as source files and environment parameters, we know that the output is always going to be the same. As a result, we can cache it.
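The pure-function analogy can be sketched in a few lines. This is an illustration under simplifying assumptions, not the key computation any particular build tool uses: the helper `cache_key`, the class `BuildCache`, and the stand-in `compile_step` are hypothetical, and a real build cache also takes things like tool versions and classpaths into account.

```python
import hashlib
from typing import Callable, Dict

Inputs = Dict[str, bytes]


def cache_key(step_name: str, inputs: Inputs) -> str:
    """Derive a key from the step name and the content of every input."""
    digest = hashlib.sha256(step_name.encode())
    for name in sorted(inputs):                # sorted: key must not depend on dict order
        digest.update(hashlib.sha256(name.encode()).digest())
        digest.update(hashlib.sha256(inputs[name]).digest())
    return digest.hexdigest()


class BuildCache:
    def __init__(self) -> None:
        self._outputs: Dict[str, bytes] = {}
        self.executions = 0                    # how often real work was performed

    def run(self, step_name: str, inputs: Inputs, work: Callable[[Inputs], bytes]) -> bytes:
        key = cache_key(step_name, inputs)
        if key in self._outputs:               # inputs unchanged: reuse the stored output
            return self._outputs[key]
        self.executions += 1
        output = work(inputs)
        self._outputs[key] = output
        return output


def compile_step(inputs: Inputs) -> bytes:
    # Stand-in for an expensive, deterministic build step.
    return b"compiled:" + b"|".join(inputs[name] for name in sorted(inputs))


cache = BuildCache()
sources = {"Main.java": b"class Main {}"}
first = cache.run("compile", sources, compile_step)
second = cache.run("compile", sources, compile_step)    # same inputs: cache hit
changed = cache.run("compile", {"Main.java": b"class Main { int x; }"}, compile_step)
```

The key depends on the content of the inputs rather than on timestamps, which is what allows an output produced on one machine to be reused on another.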
Build caching is supported natively by some build tools. For example, local and remote build caching are core features of the Gradle build tool. In the case of Maven, there is no native concept of build caching. However, local and remote build caching is available to Maven users through extensions available in Gradle Enterprise. The remote build cache implementation in Gradle Enterprise for Gradle and Maven users reduces build times dramatically across teams and organizations, regardless of whether builds are executed in a continuous integration environment or on a remote developer’s machine.
Improve CI builds with a remote build cache
When analysing the role of a build cache it is important to take into account the granularity of the outputs that it caches. Imagine a full build for a project with 40 to 50 modules which fails at the last step (deployment) because the staging environment is temporarily unavailable. Although the vast majority of the build steps (potentially thousands) succeed, the change cannot be deployed to the staging environment. Without a build cache, one typically either relies on a very complex CI configuration to reuse build step outputs or has to repeat the full build once the environment is available.
Some build tools don’t support incremental builds properly: the outputs of a build started from scratch can differ from the outputs of a second build that reuses the results of a first one. With such tools, rebuilding from scratch is crucial for correctness.
With a build cache, the build can be re-triggered when the environment is back online, and only the last step needs to be executed. This automatically saves all of the time and resources required by the build steps which were already executed successfully. Instead of executing the intermediate steps, the build tool downloads the outputs from the build cache, avoiding a lot of redundant work.
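The failed-deployment scenario can be simulated with a small sketch. The step names, the `StagingUnavailable` exception, and the `environment` dictionary are hypothetical, and the cache key here is just the step name plus its single input, where a real build cache would hash all inputs of the step.

```python
from typing import Callable, Dict, List, Tuple


class StagingUnavailable(Exception):
    pass


executed: List[str] = []                     # steps that actually ran in the current build
cache: Dict[Tuple[str, str], str] = {}       # survives between builds, like a remote cache
environment = {"staging_online": False}


def cached_step(name: str, step_input: str, work: Callable[[str], str]) -> str:
    key = (name, step_input)
    if key not in cache:                     # miss: do the work and store the output
        executed.append(name)
        cache[key] = work(step_input)
    return cache[key]


def deploy(artifact: str) -> str:
    # Deployment has side effects, so this sketch never caches it.
    executed.append("deploy")
    if not environment["staging_online"]:
        raise StagingUnavailable("staging environment is down")
    return f"deployed {artifact}"


def build(source: str) -> str:
    classes = cached_step("compile", source, lambda s: f"classes({s})")
    tested = cached_step("test", classes, lambda c: f"tested({c})")
    artifact = cached_step("package", tested, lambda t: f"jar({t})")
    return deploy(artifact)


try:
    build("rev-1")                           # fails at the very last step
except StagingUnavailable:
    pass
first_run = list(executed)

executed.clear()
environment["staging_online"] = True
result = build("rev-1")                      # re-triggered build
second_run = list(executed)

print(first_run)   # all four steps ran
print(second_run)  # only the deployment ran
```

The CI configuration stays trivial in this sketch: the second build is the same `build` call as the first, and the reuse comes entirely from the cache.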
Share outputs with a remote build cache
Those are not the only advantages of a remote build cache. One of the most important is sharing build outputs. In most CI configurations, for example, a number of pipelines are created. These may include one for building the sources, one for testing, one for publishing the outcomes to a remote repository, and other pipelines to test on different platforms. There are even situations where CI builds only part of a project (e.g. some modules and not others).
However, most of those pipelines actually share a lot of intermediate build steps. All builds which perform testing require the binaries to be ready. All publishing builds require all previous steps to be executed. And because modern CI infrastructure also means executing everything in containerized (isolated) environments, significant resources are wasted by repeatedly building the same intermediate artifacts.
A remote build cache can reduce this overhead by orders of magnitude because it provides a way for all those pipelines to share their outputs. After all, there is no point building something that is already available in the cache.
Because there are inherent dependencies between the software components of a build, introducing a build cache also dramatically reduces the cost of splitting a component into multiple pieces.
Make local developers more efficient with remote build caches
It is common for different teams within a company to work on different modules of a single large application. In this case, most teams don’t care about building the other parts of the software. By introducing a remote cache, developers immediately benefit from pre-built outputs when checking out code. Because the code has already been built on CI, they don’t have to build it locally.
Introducing a remote cache is a huge benefit for those developers. Consider that a typical developer’s day begins by performing a code checkout. Most likely the checked-out code has already been built on CI. Therefore, no time is wasted running the first build of the day: the remote cache provides all of the intermediate artifacts needed. And when local changes are made, the build still gets cache hits for the projects which are unaffected by those changes.
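A minimal sketch of this interplay between CI and a developer machine follows. It assumes a common setup in which CI pushes to the remote cache and developers only read from it; that policy, the classes `RemoteCache` and `Builder`, and the module names are assumptions made for illustration, and the modules are treated as independent of each other.

```python
from typing import Dict, List


class RemoteCache:
    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}


class Builder:
    """A CI agent or a developer machine, each with its own local cache."""

    def __init__(self, remote: RemoteCache, push: bool) -> None:
        self.remote = remote
        self.push = push                       # assumption: only CI pushes to the remote cache
        self.local: Dict[str, str] = {}
        self.executed: List[str] = []          # modules that were really built here

    def run(self, module: str, revision: str) -> str:
        key = f"{module}@{revision}"           # stand-in for a hash of the module's inputs
        if key in self.local:                  # 1. local cache
            return self.local[key]
        if key in self.remote.entries:         # 2. remote cache
            output = self.remote.entries[key]
        else:                                  # 3. miss everywhere: do the work
            self.executed.append(module)
            output = f"output of {key}"
            if self.push:
                self.remote.entries[key] = output
        self.local[key] = output
        return output


remote = RemoteCache()

ci = Builder(remote, push=True)
for module in ("app", "billing", "search"):
    ci.run(module, "rev-1")

developer = Builder(remote, push=False)
developer.run("app", "rev-1")                  # already built on CI
developer.run("billing", "rev-1+local-edit")   # changed locally
developer.run("search", "rev-1")               # already built on CI

print(ci.executed)         # CI built everything once
print(developer.executed)  # the developer only built the edited module
```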
In addition, Gradle Enterprise provides features like cache replication, which allows build cache nodes to be closer to the developer location. This reduces the overhead of fetching from the cache, which is particularly important for distributed teams.
Summary and Conclusion
Remote binary repositories and build caches are two different, complementary parts of a modern development infrastructure. The main focus of a remote binary repository is to facilitate long-term integration of software by providing reliable storage for the binaries produced by a build.
| | Binary repositories | Build cache |
|---|---|---|
| Interoperability | Consumer Agnostic | Build Tool Specific |
| Output Granularity | Final Build Step | All Build Steps |
| Output Consumer | Another Project/Dependencies | Next Build/Same Project |
Build caches focus on optimizing the whole development experience. Although there is no long-term storage guarantee, build caches store all the intermediate steps of a build and allow finer-grained composition of CI builds.
A great way to learn more about build caches is to check out this video.