Developer Productivity Engineering Blog

Comparing the Roles of Binary Artifact Repositories & Build Caches in Making Build Processes More Efficient

Introduction

A modern and productive development environment should include both binary artifact repositories, sometimes referred to as dependency caches, and build caches. These terms can be defined as follows:

Binary artifact repositories store binary artifacts along with their metadata in a defined directory structure, conceptually similar to a source code repository. The metadata describes the binary software artifact and includes information such as transitive dependencies, versioning, and build promotions.

Build caches store the outputs of individual build steps at a finer level of granularity and allow builds to fetch these outputs from the cache when the inputs have not changed, avoiding the expensive work of regenerating them.

Because they share the goal of making the build process more efficient and because certain features overlap, it is not uncommon to assume that one can fully replace the other, or to conclude mistakenly that both target the same issues. These solutions are by no means mutually exclusive and in fact should be viewed as complementary.

As you will see, artifact repositories improve the build process by increasing security and reliability, while build caches focus on accelerating build speed. Let’s analyze both tools so that we can better understand how exactly each tool achieves these objectives and contributes to an efficient development environment.

Note: It’s also common for developers to confuse build caching with the local .m2 directory that Apache Maven™ provides. The .m2 directory is simply a local mirror of a number of remote repositories, sometimes called a local dependency cache. A build cache goes beyond that pattern.

Binary Repositories Improve Build Security and Reliability

Binary artifact repositories from vendors like Sonatype (Nexus) or JFrog (Artifactory) are essentially long-lived storage for artifacts that are produced by a build and can be deployed anywhere. In most cases, these repositories house the binaries produced by the build, called “deliverables.” By extension, they can also store things like Docker images produced by the build. In general, a binary artifact repository stores the final artifacts produced at the end of the build lifecycle.

The primary goals of a binary artifact repository are two-fold. First, many organizations’ security policies do not allow software dependencies to be gathered from sources that are external to the corporate network. Since public dependency repositories are outside the control and scope of most security policies, using these dependencies inside build automation increases a business’s attack surface. These repositories can be (and sometimes are) hijacked, and malware can be put in place of legitimate dependency code. By locating these repositories inside a corporate network and scanning all contents according to policy, security engineers can enforce tighter control over the software assets produced for customers.

Second, artifact repositories give a project's build reliable and consistent access to the binary artifacts of other software projects. Specifically, an artifact repository provides long-term, reliable storage of binary artifacts which can be used by downstream consumers. It does this by acting as a “proxy” between the different software components that rely on dependency management. This can help tremendously with distributed developer teams at scale, as multiple repository instances can be distributed regionally across development teams to keep network transfers short.

If we use the Java ecosystem to illustrate, a binary artifact is typically a .jar file generated during the packaging step of a build. This step is natively supported by most Java build tools (e.g. the ‘package’ phase in Maven or the ‘jar’ task in Gradle). These build tools also natively support interacting with local or remote binary repositories to download and consume already-built artifacts, typically referred to as dependencies.
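To make the "defined directory structure" concrete, here is a minimal sketch of how Maven-style coordinates map to a path inside a repository. The layout rule (group ID with dots turned into slashes, then artifact ID, version, and file name) is the standard Maven repository layout; the coordinates and the base URL are made-up values for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinates:
    """Maven-style coordinates identifying one published artifact."""
    group_id: str
    artifact_id: str
    version: str
    packaging: str = "jar"


def repository_path(c: Coordinates) -> str:
    """Return the artifact's path relative to the repository root."""
    group_dir = c.group_id.replace(".", "/")
    file_name = f"{c.artifact_id}-{c.version}.{c.packaging}"
    return f"{group_dir}/{c.artifact_id}/{c.version}/{file_name}"


def artifact_url(base_url: str, c: Coordinates) -> str:
    """Join a repository base URL with the artifact's relative path."""
    return f"{base_url.rstrip('/')}/{repository_path(c)}"


# Hypothetical coordinates and repository URL, not taken from any real project.
lib = Coordinates("com.example", "billing-core", "1.4.2")
print(repository_path(lib))
# prints: com/example/billing-core/1.4.2/billing-core-1.4.2.jar
print(artifact_url("https://repo.example.com/maven/", lib))
```

Because the path is derived purely from the coordinates, any consumer that knows the coordinates can locate the artifact without knowing which build produced it.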

Because repositories can be distributed and are not always co-located, build tools like Gradle and Maven offer optimizations to avoid repeatedly downloading the same binary dependencies from the same external repositories. This is the role of the .m2 (Maven) and modules-2 (Gradle) directories. While the strategies for “caching” differ, both tools try to limit the cost of using a remote repository in a build by reducing the number of remote calls.
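The idea behind these local directories can be sketched as a read-through cache: look on disk first and call the remote repository only on a miss. This is an illustrative simplification, not how Maven or Gradle are actually implemented; the class name, the fetch callback, and the example path are all hypothetical, and real tools also deal with checksums, snapshot versions, and expiry.

```python
import tempfile
from pathlib import Path
from typing import Callable


class LocalDependencyCache:
    """Read-through cache: contact the remote repository only on a local miss."""

    def __init__(self, root: Path, fetch_remote: Callable[[str], bytes]):
        self.root = root
        self.fetch_remote = fetch_remote
        self.remote_calls = 0  # counts how often the remote was contacted

    def resolve(self, relative_path: str) -> Path:
        local_file = self.root / relative_path
        if not local_file.exists():
            # Miss: download once and keep the file for later builds.
            self.remote_calls += 1
            local_file.parent.mkdir(parents=True, exist_ok=True)
            local_file.write_bytes(self.fetch_remote(relative_path))
        return local_file


# Stand-in for a network download from a remote repository.
def fake_remote(path: str) -> bytes:
    return f"contents of {path}".encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = LocalDependencyCache(Path(tmp), fake_remote)
    jar = "com/example/billing-core/1.4.2/billing-core-1.4.2.jar"
    cache.resolve(jar)  # first build: downloads the artifact
    cache.resolve(jar)  # later builds: served from disk
    print(cache.remote_calls)  # prints: 1
```

Note that this only avoids downloads of finished artifacts. Nothing here avoids re-running compilation or tests, which is the gap a build cache fills.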

While certainly a recommended pattern, use of a binary artifact repository is just part of what’s needed to ensure maximum build assembly efficiency.

Build Caches Improve Build & Test Performance

Building software consists of a number of steps, like compiling sources, executing tests, and linking binaries. We’ve seen that a binary artifact repository helps when such a step requires an external component by downloading the artifact from the repository rather than building it locally.

However, there are many additional steps in this build process which can be optimized to reduce the build time. An obvious strategy is to avoid executing build steps which dominate the total build time when these build steps are not needed.

In many projects, build time is dominated by the testing step. While binary repositories cannot capture the outcome of a test build step (only the test reports, when these are included in binary artifacts), build caches are designed to eliminate redundant executions of every build step. Moreover, they generalize the concept of avoiding work to any incremental step of the build, including test execution, compilation, and resource processing. The mechanism itself is comparable to a pure function. That is, given the same inputs, such as source files and environment parameters, we know that the output is always going to be the same. As a result, we can cache the output and retrieve it using a key derived from a cryptographic hash of the inputs.
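The pure-function analogy can be sketched in a few lines. This is a simplified model under stated assumptions, not the key computation used by Gradle or Develocity: the function names, the choice of SHA-256, and the use of an in-memory dictionary as the cache are all illustrative.

```python
import hashlib
from typing import Callable


def cache_key(step_name: str, input_files: dict[str, bytes],
              params: dict[str, str]) -> str:
    """Derive a key from everything that can influence the step's output."""
    h = hashlib.sha256()
    h.update(step_name.encode() + b"\0")
    for path in sorted(input_files):  # sorted so ordering cannot change the key
        h.update(path.encode() + b"\0")
        h.update(hashlib.sha256(input_files[path]).digest())
    for name in sorted(params):
        h.update(f"{name}={params[name]}".encode() + b"\0")
    return h.hexdigest()


def run_cached(step_name: str, input_files: dict[str, bytes],
               params: dict[str, str],
               action: Callable[[dict[str, bytes]], object],
               cache: dict) -> tuple[object, bool]:
    """Return (output, was_cache_hit); run `action` only on a miss."""
    key = cache_key(step_name, input_files, params)
    if key in cache:
        return cache[key], True
    output = action(input_files)
    cache[key] = output
    return output, False


# Hypothetical "compile" step: the action just lists the files it was given.
cache: dict = {}
sources = {"src/App.java": b"class App {}"}
env = {"jdk": "17"}
print(run_cached("compile", sources, env, lambda f: sorted(f), cache)[1])  # False
print(run_cached("compile", sources, env, lambda f: sorted(f), cache)[1])  # True
```

The sketch only works if every input that affects the output is part of the key. An input left out of the key (an undeclared environment variable, for example) would let the cache return a stale result.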

Build caching is supported natively by some build tools. For example, local and remote build caching are core features of the Gradle build tool. In the case of Maven, there is no native concept of build caching. However, local and remote build caching are available to Maven users through extensions available in Develocity. The remote build cache implementation in Develocity for Gradle and Maven users can reduce build times dramatically across teams and organizations, regardless of whether builds run in a continuous integration environment or on a remote developer’s machine.

Improve CI builds with a remote build cache

When analyzing the role of a build cache, it is important to take into account the granularity of the outputs that it caches. Imagine a full build for a project with 40 to 50 modules which fails at the last step (deployment) because the staging environment is temporarily unavailable. Although the vast majority of the build steps (potentially thousands) succeed, the change cannot be deployed to the staging environment. Without a build cache, one typically either relies on a very complex CI configuration to reuse build step outputs or has to repeat the full build once the environment is available.

Some build tools don’t support incremental builds properly. For example, outputs of a build started from scratch may vary when compared to subsequent builds that rely on the initial build’s output. As a result, to preserve build integrity, it’s crucial to rebuild from scratch, or ‘cleanly,’ in this scenario.

With a build cache, the build can be re-triggered when the environment is back online, and only the last step needs to be executed. This saves the time and resources that were already spent on the build steps that executed successfully. Instead of executing the intermediate steps again, the build tool pulls their outputs from the build cache, avoiding a lot of redundant work.
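The failed-deployment scenario can be modeled with a toy build. The step names, the string "outputs", and the staging flag are all hypothetical; the point is only that a re-triggered build with unchanged inputs executes nothing but the deployment.

```python
import hashlib
from typing import Callable


def cached_step(name: str, inputs: str, cache: dict, ran: list,
                action: Callable[[str], str]) -> str:
    """Run `action` only if no output is cached for this step and these inputs."""
    key = hashlib.sha256(f"{name}\0{inputs}".encode()).hexdigest()
    if key not in cache:
        ran.append(name)
        cache[key] = action(inputs)
    return cache[key]


def build(source: str, cache: dict, staging_up: bool) -> list:
    """Return the names of the steps that actually executed."""
    ran: list = []
    classes = cached_step("compile", source, cache, ran, lambda s: f"classes({s})")
    cached_step("test", classes, cache, ran, lambda c: f"report({c})")
    cached_step("package", classes, cache, ran, lambda c: f"jar({c})")
    # Deployment has side effects, so it is never served from the cache.
    if not staging_up:
        raise RuntimeError(f"staging unavailable; steps executed: {ran}")
    ran.append("deploy")
    return ran


cache: dict = {}
try:
    build("src-v1", cache, staging_up=False)
except RuntimeError as err:
    print(err)  # compile, test, and package ran before the failure
print(build("src-v1", cache, staging_up=True))  # prints: ['deploy']
```

If the source had changed between the two runs, the keys would differ and the affected steps would execute again, which is the behavior you want.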

Share outputs with a remote build cache

One of the most important advantages of a remote build cache is the ability to share build outputs. In most CI configurations, for example, a number of pipelines are created. These may include one for building the sources, one for testing, one for publishing the outcomes to a remote repository, and other pipelines to test on different platforms. There are even situations where CI builds only part of a project (e.g. some modules and not others).

Most of those pipelines share a lot of intermediate build steps. All builds which perform testing require the binaries to be ready. All publishing builds require all previous steps to be executed. And because modern CI infrastructure means executing everything in containerized (isolated) environments, significant resources are wasted by repeatedly building the same intermediate artifacts.

A remote build cache can reduce this overhead dramatically, in some cases by orders of magnitude, because it provides a way for all those pipelines to share their outputs. After all, there is no point recreating an output that is already available in the cache.

Because there are inherent dependencies between software components of a build, introducing a build cache dramatically reduces the impact of exploding a component into multiple pieces, allowing for increased modularity without increased overhead.

Make local developers more efficient with remote build caches

It is common for different teams within a company to work on different modules of a single large application. In this case, most teams don’t care about building the other parts of the software. By introducing a remote cache, developers immediately benefit from pre-built artifacts when checking out code. Because the code has already been built on CI, they don’t have to build it locally.

Introducing a remote cache is a huge benefit for those developers. Consider that a typical developer’s day begins by performing a code checkout. Most likely the checked-out code has already been built on CI. Therefore, little time is wasted running the first build of the day, because the remote cache provides the intermediate artifacts needed. And when local changes are made, the build still gets cache hits for the projects that are independent of those changes. As other developers in the organization trigger CI builds, the remote cache continues to populate, increasing the likelihood of remote cache hits across team members.
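The developer workflow described above can be sketched as a two-tier lookup: check the local cache, then the remote one, and rebuild only what neither has. Everything here is an illustrative assumption, including the class and module names, the policy that only CI pushes to the remote cache while developers only pull, and the simplification that modules are independent of each other.

```python
import hashlib


class TwoTierCache:
    """Look in the local cache first, then fall back to the shared remote one."""

    def __init__(self, remote: dict, push_to_remote: bool):
        self.local: dict = {}
        self.remote = remote
        self.push_to_remote = push_to_remote

    def get(self, key: str):
        if key in self.local:
            return self.local[key]
        if key in self.remote:
            self.local[key] = self.remote[key]  # keep a local copy for next time
            return self.local[key]
        return None

    def put(self, key: str, value: str) -> None:
        self.local[key] = value
        if self.push_to_remote:
            self.remote[key] = value


def build_modules(sources: dict, cache: TwoTierCache) -> list:
    """Build each module; return the names of modules that had to be rebuilt."""
    rebuilt = []
    for module, source in sorted(sources.items()):
        key = hashlib.sha256(f"{module}\0{source}".encode()).hexdigest()
        if cache.get(key) is None:
            cache.put(key, f"output({source})")
            rebuilt.append(module)
    return rebuilt


remote: dict = {}
ci = TwoTierCache(remote, push_to_remote=True)
developer = TwoTierCache(remote, push_to_remote=False)

checkout = {"api": "v1", "db": "v1", "web": "v1"}
print(build_modules(checkout, ci))         # CI builds everything once
print(build_modules(checkout, developer))  # first build of the day: nothing to rebuild
print(build_modules({**checkout, "web": "v2-local-edit"}, developer))  # only 'web'
```

In a real multi-module build, a change to one module also invalidates the modules that depend on it, so the set of rebuilt modules is typically larger than the single edited one.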

In addition, Develocity provides features like cache replication, which allows build cache nodes to be closer to the developer location. This reduces the overhead of fetching from the cache, which is particularly important for distributed teams.

Summary and Conclusion

Binary artifact repositories and build caches are two different, complementary parts of a modern development infrastructure. The main focus of a remote binary repository is to facilitate long-term integration of software, by providing secure and reliable storage for binaries produced by a build.

Aspect              Artifact repositories     Build cache
Focus               Security and reliability  Performance
Interoperability    Consumer agnostic         Build tool specific
Output granularity  Final build step          All build steps
Output consumer     Dependency user           Build task with matching inputs
Storage             Long-term, guaranteed     Configurable


Build caches focus on optimizing the whole development experience. Build caches store all the incremental steps of a build and allow finer-grained composition of CI builds.

Next Step

A great way to learn more about build caches is to check out this video.