How do we create a fast and reliable CI process and pipeline using the tools that we built ourselves?
In this recorded webcast, Gradle Enterprise team leads Etienne Studer and Luke Daley demonstrate how we speed up our own CI pipeline by using our own products, including the build cache, build scans, performance dashboard, and integration with major CI providers like TeamCity.
For your convenience we included the webinar transcript below.
Transcript: Optimize your CI Pipeline
Etienne: Hi, my name is Etienne. I’m from Switzerland. I am one of the Gradle Enterprise leads together with Luke. I’ve been with Gradle for close to four years. I’ve been doing software development for about 20 years.
Luke: I’m Luke. I’m all the way from Australia. I’ve been an engineer at Gradle for about seven years now. I spent a lot of time working on Gradle itself in the early years. For the last few years, I’ve been working on the Gradle Enterprise project with Etienne.
Today’s webinar is all about how we approach having a fast and reliable CI pipeline and process. We actually do it with the tools that we build ourselves.
Etienne: So let me tell you just a little bit more about our development process, so you have a bit more context of why it is so important for us to have an optimized CI pipeline.
On our team, we have close to 15 developers that all push to master. We push a lot, and we push several times a day. All of these changes are continuously deployed to one or more development environments. So it is important for a developer once he pushes changes to get quick feedback about whether that change was sound or not, so he can fix it if needed before he has moved or she has moved on to something else. It’s also important for the other developers to know if changes that are pushed by somebody else are sound so they know whether they can pull from master or push to master. So in short, quick feedback is really essential for us to stay in the flow and to make quick progress. In fact, quick feedback we can only achieve if we have a very fast and reliable CI environment.
Etienne: So what do we actually build? We are not surprisingly building Gradle Enterprise. Gradle Enterprise consists of various components that we’re all building in an automated fashion. We have several apps that we build and deploy. We have documentation. We have supporting tools that we’re building. We also have the website. All these components live in a mono repo. That mono repo has about 85 projects, about two and a half thousand files, and around six thousand tests. A lot of these tests are integration and functional tests to make sure the components play well together. We also have, and we’ll come back to that throughout the webinar, a lot of cross version testing that we need to do.
Etienne: Because we have different Gradle versions to support against different plugin versions, the build scan plugin version, and even different JDKs. So all this together gives us a matrix of more than 20,000 tests that we need to run for every commit to verify things are working properly across this matrix. In general, we have a lot of automation in place. So all the changes are deployed automatically to different environments. Also, build environments themselves are set up in an automated fashion. We also have an automated IDE setup. So Luke, why don’t you give a bit of an overview of our general approach to CI?
Luke: So we have one key principle with how we approach CI and managing it, which is to keep it simple. Inside the Gradle build, in the build logic, we have the definition of what depends on what and what needs what. That’s the right place for it. So in how we manage CI downstream, we try to keep it as simple as possible and keep the complexity out of it. We have many small builds. So even though we have a single mono repo, we have many, many CI plans or jobs. They’re all relatively small. We build those in parallel. So we massively parallelize our CI pipeline. We have lots of build agents.
So we have many machines that can build the stuff. That’s essential for making it fast. We also heavily rely on the build cache. That’s probably an understatement. This whole approach wouldn’t work if we didn’t use the build cache. We’ll come back and explain why that is and how that works for us. So that really makes us fast. We use build scans all day every day to help us debug when there are problems, fix things when we break things, and optimize this process and keep it fast. So that’s generally how we approach it.
What is a build cache? If you haven’t come across this before in Gradle, the general principle is it makes your builds faster by reusing outputs from previous builds. So if you don’t have something like this, what you end up doing is building the same thing multiple times. The build cache avoids that: if the same thing is already there, it just reuses it. So we avoid a lot of redundant work, which frees up build time to build other useful things. What’s important about this: if you’ve been using Gradle for a while, this may sound familiar in the form of what we call incremental build. Gradle will avoid rebuilding something that you’ve just built. But the build cache takes this a step further and reuses things from other machines, as well. So we’re going to build it once on one machine, then every other place that it is needed downstream in the CI pipeline, we reuse it. So you can think about it as incremental build, but taking it even further and more effective when you’re distributing work across multiple machines.
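To make the principle concrete, here is a minimal sketch in Python of a cache that keys a task's output on a hash of its inputs. It is an illustration only, not Gradle's implementation: the names `cache_key`, `BuildCache`, and `run_task` are hypothetical, and a real cache key covers more than file contents (for example the task implementation and its classpath).

```python
import hashlib


def cache_key(task_path, input_files):
    """Hash the task path plus the name and content of every input file."""
    digest = hashlib.sha256()
    digest.update(task_path.encode())
    for name in sorted(input_files):  # sorted, so dict ordering cannot change the key
        digest.update(name.encode())
        digest.update(hashlib.sha256(input_files[name]).digest())
    return digest.hexdigest()


class BuildCache:
    """Stand-in for a shared cache; a real one would be a remote service."""

    def __init__(self):
        self._store = {}

    def load(self, key):
        return self._store.get(key)

    def store(self, key, output):
        self._store[key] = output


def run_task(cache, task_path, input_files, action):
    """Reuse the stored output when the inputs are unchanged, else run the action."""
    key = cache_key(task_path, input_files)
    cached = cache.load(key)
    if cached is not None:
        return cached, "FROM-CACHE"
    output = action(input_files)
    cache.store(key, output)
    return output, "EXECUTED"
```

Because the key depends only on the inputs, any machine that shares the cache and sees the same inputs gets a hit, which is what separates this from a purely local incremental build.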
So what are build scans? If you haven’t come across build scans yet, build scans are a shareable record of what happened during a build. So that contains a lot of diagnostic information. If the build fails, there is a lot of information about why it failed and all kinds of things to help reason about that. The performance profile is a significant part of it. There are many aspects to build performance, and a build scan represents a lot of those and lets you really dig in and make things fast and keep things fast, which is critical, and which we’ll come back to. There’s much more, as well. So build scans are created by having a Gradle plug-in, which you add to your build.
Then at the end of every build, this build scan is uploaded to a central server and stored. You can then access it later and share it. We’re going to be looking at build scans in depth and talking more about exactly how we use them. So we mentioned that we build Gradle Enterprise and work on Gradle Enterprise. Well, Gradle Enterprise is build cache and build scans. So while this is what we’re effectively building, we are also using build cache and build scans in our own development process. One, for dogfooding reasons, so we can use it and get early feedback. But it’s also, just as a generic software team trying to build a product, critical to our approach. It’s very important there. Gradle Enterprise is server software that you install on premises. So if you’re using Gradle, Gradle Enterprise gives you build cache and build scans for your builds. OK, so let’s see it all in action.
Luke: First thing I want to mention is that for our CI tool, we use JetBrains TeamCity. We really like it. We’ve been using it for a long time. Generally, we’re quite big fans of the products and tools that JetBrains make. We use a lot of them and we like TeamCity. It works well for us. So here is effectively the root of our CI tree for the Gradle Enterprise project. As you can see, there are quite a few builds here. They’re all relatively small. Going to expand some of these out and see the different sections here. So roughly, each project within the tree has its own build plan and then even further as well than that. So what I’m looking at here with these groups is that we break up test suites into multiple groups and split them up and run them across multiple machines in parallel to keep things fast. So that’s one of the techniques we have.
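The webinar does not say how the test suites are assigned to groups. One common approach, sketched here in Python purely as an assumption, is a greedy longest-first split that keeps the groups roughly equal in duration. The function name and the suite names in the usage note are hypothetical.

```python
import heapq


def split_by_duration(suite_durations, group_count):
    """Assign each suite, longest first, to the group with the least total time."""
    heap = [(0.0, index) for index in range(group_count)]  # (total seconds, group index)
    groups = [[] for _ in range(group_count)]
    # Sort by descending duration; the name breaks ties so the result is stable.
    ordered = sorted(suite_durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        total, index = heapq.heappop(heap)
        groups[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return groups
```

For example, six suites taking 60, 50, 40, 30, 20, and 10 seconds split into three groups of 70 seconds each, so no single build agent becomes the long pole.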
Etienne mentioned that we do a lot of cross version testing. With this particular product, we support many different Gradle versions, different versions of the build scan plug-in over time, different operating systems, JDKs. So we have a big matrix of tests that we have to run. It’s a lot of tests. Those are real functional tests that are quite slow. So breaking them up and distributing them is key. Making them fast with the build cache is also key. This is roughly what it looks like. We’re going to talk a little more about how we deal with all these projects in a second. Let me just go into this one here. So here’s a particular build. You can see here in that status box in TeamCity, we’re using a plug-in for TeamCity that integrates with build scans and gives you that build scan link very prominently here. So if we run a build and we want to go and have a look in depth at the results or debug or something like that, we’ll come in here and go straight to the build scan. So for all of our CI builds, we have these scans. It’s really detailed information that we go and use to optimize and debug.
One thing I want to point out is that we also have these build scans for our local builds, as well. All of the builds that we’re running day to day produce these scans. We can go and look at it with the same tooling, the same interface, the same approaches for debugging and optimization there. So with that many builds and things to configure, potentially, I’m just going to talk a bit about how we manage that aspect of it.
Etienne: So as you’ve learned from Luke, we have a few things that are key to us. We have a lot of builds, and they’re very small, and we can run them in parallel. We need them to be fast. So let’s look at these two things. Like you’ve seen here on TeamCity, we have a lot of builds. It would be impossible for us to configure this all via UI. It’s more than 120 builds and if we also take branches into account, we’re ending up with more than 250, almost 300, builds. There’s no way we could manage all these builds in a UI clickety click fashion. So what TeamCity offers us is a DSL that allows us to programmatically configure and describe the builds that we want to run on TeamCity.
Luke: Let me show you that in action. For that, let’s take a look at some build scan builds. We have here some cross-version tests. It’s actually 18 of them, 9 that run against Linux and 9 that run against Windows. Each of those builds runs a certain set of Gradle versions. We have to support Gradle versions from 2.0 all the way to 4.10. So we group all these Gradle versions into nine groups and we then run those Gradle versions, those cross-version tests, on two different operating systems. So how do we configure this using the Kotlin DSL approach that we have available in TeamCity? You can see an extract here. I’ll get us a bit closer here. We’re just iterating over the operating systems we want to support, Linux and Windows in this case. Then within each operating system, we create nine builds. For each TeamCity build that we create, we then configure what we want to do, what we want to run, and what parameters to pass in. With this little amount of code, and with code completion, et cetera available, as well, we can describe 18 builds. But we could as easily, of course, describe 25 builds if we wanted to split this up into more groups. Or 26 groups if you wanted to have 13 for each operating system. It’s a very concise way, a very maintainable way, and also a way that allows us to be very consistent in how we describe those builds.
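The extract shown on screen is TeamCity Kotlin DSL and is not reproduced in this transcript. The shape of the idea, a nested loop over operating systems and test groups, can be sketched in Python as a stand-in. Everything named here is hypothetical: the `BuildType` record, the `crossVersionTest` task, and the `-PtestGroup` parameter are illustrations, not the team's actual configuration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class BuildType:
    """Hypothetical stand-in for one generated CI build configuration."""
    name: str
    os: str
    gradle_args: list


def cross_version_builds(operating_systems, group_count):
    """Describe one build per (operating system, test group) combination."""
    return [
        BuildType(
            name=f"Cross-version tests ({os_name}, group {group})",
            os=os_name,
            # Assumed task name and parameter, for illustration only.
            gradle_args=["crossVersionTest", f"-PtestGroup={group}"],
        )
        for os_name, group in product(operating_systems, range(1, group_count + 1))
    ]
```

Calling `cross_version_builds(["Linux", "Windows"], 9)` describes 18 builds, and changing the second argument to 13 describes 26, which is the kind of one-line change being described.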
Luke: This has been almost revolutionary for us using this approach. Apart from just the sheer power of being able to express so many builds concisely, having it be in code, we actually keep this in our same source repository as our code, having this defined. So when we make a change to the build to add a new module, change something, dependencies split up some more test groups, that same code review that is introducing those changes contains our CI configuration changes, as well. So we review that as one thing. Plus having all of the version control and all the rest of the tooling around this. So whatever CI tool you’re using, check if it has this kind of capability of being able to express its configuration in some kind of data or code format. If your tool doesn’t, move to one that does. Because this is just so much better than having to configure things manually through the user interface and trying to manage that. I cannot imagine going back.
Etienne: This definitely was a big step forward for us. Then we apply common coding practices. So we extract things into multiple files, extract things into different classes. So ultimately, it all lives in our TeamCity folder. It’s part of the repo like Luke said. It’s also checked in and TeamCity picks that up as it configures the builds.
So we have described a lot of builds, more than 120. But now we also need the agents that can actually run those builds. Otherwise, we have a bottleneck of running those builds that would potentially be paralyzing.
Etienne: The way we do this, we use Salt from SaltStack. It’s an open-source tool that allows us to provision agents in a very concise form. Again, you can describe the configuration, the deployment, and then we can easily spin up more agents. So right now I think we have a bit more than 100 agents live. But if we want to add more because we have more builds to run at the same time, we can easily spin up more. So we both scale on how we described the builds, but also on running those builds on agents.
Luke: This is another critical aspect of running builds at scale and fast builds. You just need to be able to scale horizontally with little effort. Deploying another 20 build agents isn’t a big deal for us. It’s so much cheaper to add more machines than have the software developers waiting. It just makes great economic sense. So you really want to remove the bottleneck of having any manual work in creating and configuring those machines. So you can just add as many as you need to make it fast.
There are various scenarios where we wanted to upgrade the agents: a new Postgres version, something like that. Doing that manually across 100 agents, we wouldn’t do it or it would take us a long time. We would miss some agents, have inconsistent behavior. All right, so the second aspect, once we have a lot of agents that can run a lot of builds, is how do we make those builds fast. There is a key concept to that that I would like to explain here a little bit. The way we have structured our CI, it’s simple. But it starts with a seed build. The purpose of that seed build is to run those Gradle tasks that we know are also called by multiple downstream builds.
Etienne: Why is this beneficial? So when those seed builds run those tasks, they put their output into a cache, the Gradle build cache. Then the downstream builds that run the same tasks for the same commit, so it’s the same sources to compile for example, can reuse what has been put into the cache by the seed build. So instead of all those downstream builds rebuilding the same things, recompiling for example the same things, they can just reuse what’s in the cache. So the only overhead that is left is just getting it from the cache and extracting it on the local machine. But there’s even a bit more that we can do.
Etienne: So the seed build not only pushes to the cache for downstream builds to consume. The seed build itself can reuse what it has put into the cache. So if you make another change, you push it, and the task you’re running is not affected by those changes, the previous build has already put the artifact into the cache. The seed build can, as well, reuse it. The same is true for the downstream builds. If they run some tests and the next time that build is asked to run the tests didn’t change, the sources didn’t change, and the dependencies didn’t change, it can also reuse what has been put into the cache.
Luke: Just one point I want to make there that may not be clear is that we don’t have any complex trigger rules or anything like that with that CI setup. So any commit triggers the entire pipeline with all of those 20,000 tests. But any given change is unlikely to run all of those. Because we’re using the build cache to avoid doing things like executing tests on a component that hasn’t changed, we run far fewer of those. So we don’t have to reason about, OK, if we change this thing, then we have to run these tests or any of that. We just do everything and allow the build cache to avoid doing work that actually isn’t necessary. Because it hasn’t actually changed.
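The seed build and the "trigger everything, let the cache sort it out" approach can be simulated in a few lines of Python. This is a deliberately simplified sketch: the task names and the two-component layout are hypothetical, and each task's cache key here depends only on its own component's sources, whereas a real test task would also depend on the outputs of the components it uses.

```python
import hashlib

# Hypothetical pipeline: the seed build runs the tasks that several
# downstream builds share; the downstream builds add their own tests.
SEED = ["server:compile", "client:compile"]
SERVER_TESTS = ["server:compile", "server:test"]
CLIENT_TESTS = ["client:compile", "client:test"]


def cache_key(task, source_snapshot):
    """Key a task on its name plus the state of its component's sources."""
    return hashlib.sha256(f"{task}|{source_snapshot}".encode()).hexdigest()


def run_build(tasks, sources, cache):
    """Request every task; return only the ones that really had to execute."""
    executed = []
    for task in tasks:
        component = task.split(":")[0]
        key = cache_key(task, sources[component])
        if key not in cache:
            cache.add(key)  # stands in for storing the task output
            executed.append(task)
    return executed


def run_pipeline(sources, cache):
    """Every commit triggers every build; the seed build always runs first."""
    return {
        "seed": run_build(SEED, sources, cache),
        "server": run_build(SERVER_TESTS, sources, cache),
        "client": run_build(CLIENT_TESTS, sources, cache),
    }
```

On a first commit the seed build compiles both components and each downstream build executes only its tests. If the next commit touches only the client, the server build executes nothing at all, even though it was triggered, and no trigger rule had to encode that.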
Etienne: We used to do this differently in the beginning. What I remember is we had all these trigger rules and sometimes they were wrong, or they were right at some point but they weren’t over time. So we wouldn’t pick up changes or we picked up changes that were not necessary. So again, it was like not easy to maintain.
Luke: They may be right at some point, but our code is always evolving. We’re changing the structure and adding new dependencies. We couldn’t keep those trigger rules up. So this is part of what I meant at the start was that we just keep it simple at that level. Every change builds everything. But the build cache allows us to do that reasonably and not have every change take days to get through the CI pipeline.
Etienne: What we also stopped doing once we switched to this approach is we’re not promoting artifacts anymore the way we used to. We used to build one artifact early on once and then just pass it on through the pipeline to avoid rebuilding it and to make sure we’re testing against exactly the same artifact. But now the cache has become our medium to promote the artifact, so to speak.
Etienne: But there are even more benefits to this approach. We’ve seen the whole CI process. We can make the builds in CI very fast, but we can also benefit as a local developer. So if we have somebody in the USA or in Australia, it doesn’t matter.
Luke: Our team is actually geographically distributed. So we’ve got developers in Europe, in the US, in Australia. That’s where I come from, as you may have picked up from my accent. So what Etienne’s about to get onto here really relates to that.
Etienne: That’s true, a totally remote setup. So if I’m building locally or any developer is building locally, they can also benefit from consuming artifacts that have been put into the cache by that seed build. We’ll see that in action in a second. We’re not actually pushing to the cache as local developers, because typically nobody else will benefit from the changes you made locally. It’s very unlikely that the other developers made exactly the same changes.
Now we can also have a whole network of build-cache nodes. Where they can replicate cache artifacts from one cache node to another. That becomes very useful, I would say especially for the poor people in Australia with low bandwidth and high latency. Because we can now have a remote cache node in Australia in closer proximity to where Luke is living or our other Australian people, rather than having to connect to a cache node that is, for example, in Germany or in San Francisco.
That’s offered by Gradle Enterprise, this cache replication. They can also have a preemptive cache, meaning if the seed build pushes something to the cache, from that cache node it’s automatically published to all the other cache nodes. So overnight while Luke might be sleeping, seed builds are running, cache artifacts are propagated all the way to his cache node in close proximity. I guess when you’re making coffee in the morning, you run your build. Before you even finish your coffee, the build’s already ready.
Luke: By the time I then pull the changes and go to build, that is in the cache and I get it straight away. It’s quite a big file. So with my poor internet connection at home, it takes about two or three seconds to download. But that’s a significant saving: instead of 60 seconds, I’m paying 3 seconds. This is happening multiple times a day. So that’s an example of it. It makes a real difference. With the replication, the latency here is critical. So that’s why having the different nodes in the different geographical areas is important. We’re running many, many builds a day.
Luke: Even saving 200 milliseconds off a download or connection if you’re doing it that many times a day, it adds up to seconds. Speed is everything. It’s incredibly cheap. We just have some cheap AWS instances running the cache node. You don’t need big servers or anything like that to have as many of these as you need.
Etienne: All right, let’s see that in action from TeamCity. We have at the very top here the seed build. We call it build and sanity check. That’s the build that populates the cache for those tasks that we know are also being run downstream. We also have a second seed build in this case: development build cache seed. I guess you can derive from the name that it is a build focused on populating the cache with the outputs of tasks that we know local developers are also running, so they can consume and reuse those outputs.
Luke: We also have some optimizations in our build to make things slightly faster when doing local builds. So we sacrifice a small amount of accuracy to make things faster there. That produces different cache artifacts as well. So, due to that situation, we make sure that we are building those exact artifacts on CI and then seeding with that, as well.
Etienne: All right, and then we can look at some downstream build. For example, I’ll take the server. So these are all going to be called downstream of the seed build. So let’s take a look at one explicit one, which is the server component that has been run here. So let’s see what happened in that build. Of course, I’m going to use the build scan to do this, because it gives me a much richer model and much deeper insights into what happened.
If I just go to the timeline, I can see all the tasks that we’re running as part of that server build. Now let’s see what was actually taken from the cache. Because we expect this downstream build to be leveraging things that were put into the cache by the seed build. We can see all these tasks here were using output taken from the cache. Either because they put it in themselves that build or because the seed build put it in there. So let’s take a look at a specific one. If I open it up, see some more details. I can see it was very short to run, around 25 milliseconds. That’s true for almost all these tasks in here that were taken from the cache. They take pretty much no time to run, because they just have to get the artifact from the cache.
Etienne: But now let’s take a look, how did it get there? So just via build scans, I can click here and I end up on the Gradle build that was actually producing that output and putting it into the cache. So we can see here, we’re again on the same task naturally. Because it’s the same task that we’re consuming from now. We can see it took actually nine seconds to run this task. So if we run this task once in the seed build for nine seconds and then we have 5, 10 builds that are consuming that, we’re saving 10 times 9 seconds already on the total duration to get that change through the pipeline.
Luke: Can you just go back, Etienne, to the producer scan? This is really a key point. One thing that people often question or wonder about when using the build cache instead of doing something like promotion is: how do I know what’s going on? Let’s say I have some concern or question about a particular artifact or output and then it’s being reused from cache, how do I get back to the root cause and trace that?
Now with build scans, because build scans understand when a build reads something from the cache, they create that link to, OK, this is the build scan for that build. I can now go and find out everything about the build that originally created the thing I was reusing. So I can trace back to the source of that. So we’ve not had an issue on our project where we’ve actually needed this, but it’s really good peace of mind. If we did have a question about something, one of the artifacts that we are reusing, we have that information. With build scans, we can trace that back. That’s a key thing. Like I said, we haven’t actually had to use it, but it’s a necessary part of relying so much on the build cache that we can really understand what happened and where this was reused from.
Etienne: Of course, having so much information about the origin build is really important, too. So for example, if you go through to there and say, “Well, which machine did this run on?” we just go to a separate section of the build scan, infrastructure, here: it was on that CI agent, that machine, had that operating system. We have really rich information about the origin there. That’s a key part of scans. So we have the concept of custom links, which we’re going to get back to a little bit later. You can also see that this was indeed a seed build that ran this task, put it into the cache, and then the downstream server build was able to consume from the cache. All right, so we know how to make things fast now. But now the challenge is how do we keep it fast.
Luke: Right, so as you already mentioned, our build and our project are always evolving. We’re adding new functionality, new modules, things are changing, adding new tests. Getting it fast once doesn’t mean it’s fast forever. So keeping on top of things and understanding is key. So one of the tools that we use to do this is what we call a performance dashboard. So here’s another part of the build scan application where you can see all of the builds that are happening, all the incoming scans. What I want to do is have a look at all of the CI builds for Dot Com Project, which is Gradle Enterprise. We needed a name before we had actually sorted out the name, so that’s why it’s called Dot Com. I’m going to say, well, here are my CI builds. So everything that is built on CI, we give it a certain tag, a build scan tag. Using this, we can find all those builds.
Luke: Up here, I can say I want to have a look at the performance of those things over time. So here’s the performance dashboard for all of our CI builds. So just a bit about what we’re seeing here, this main graph here, we’ve got a couple of things. The white circle here is showing us the actual build time. The gray bar here is showing effectively the potential build time. But that’s all time that we avoided or saved due to different avoidance measures, mostly the build cache. We’ll come back and talk about that again in a second.
Luke: The green bar here is how much time we actually spent executing tasks in a serial manner. You can’t quite see it, but right at the tips here there are some little yellow bits. That’s indicating how much overhead the build cache is adding. So there are some pathological cases with the build cache if you have a really bad network connection, the overhead can be non-negligible. That’s something you want to do something about. Or if you’re producing artifacts with millions of files, the time spent to actually package those things up and send them around can be considerable. So this is something you want to keep an eye on and ensure it’s low. What we’re seeing here is that there’s quite a bit of variation.
Luke: Also a couple of things I want to point out. This is the entire set of builds. You can see I’m moving this little window here. We have a sort of detail looking here. So let me go here. If I look at one of these builds, I’m thinking, OK that’s interesting. Something about that is something I want to know more about. I click on that particular entry, it tells me what the build was, what tasks it ran, and the tags here. When it started and a breakdown of the actual individual values for that build. I can, of course then, if I want to find out more about that, go through to the actual build scan for that build. So for each of these metrics, so build time, how much time is spent executing tasks, we can actually dig in a little bit deeper. So if I’m just interested in how long did things actually take. So have a look at the build time here and this is giving me an idea of what’s going on in CI. So you see by the general shape of this graph where many of the builds are very, very fast. I know from looking at this data over a lot of time, that fastness is caused by being able to avoid the work by using the build cache. So if we look in here, I can see, OK, there are some spikes here. Have a look in the section here.
Luke: We suddenly get some builds up around 15 minutes in this case, rather high. So I can see the shape of this anomaly, see spikes, and then potentially choose to do something with that. We’ll come back to what we would then do to look into this and make things faster. But I just want to dig a little bit more into the performance dashboard first. So I can break up build time into two different sections: the whole configuration phase of the Gradle build and the task execution. The configuration is something that runs at the start of every build and something you really want to optimize and get down, so I can isolate that aspect and have a look at it. Also just the task execution, as well, and separate that one.
Luke: Within the task execution, I can break that down into different categories, as well. So I can see how much time are we spending on tasks that aren’t cache-able. How much are we spending on tasks that are cache-able but that we missed, there wasn’t a pre-built version of that in the cache. How much time are we spending on cache hits, which is roughly how much time a task is taking where we’re able to reuse that from cache.
Luke: Also just the things that are already up to date or non-actionable tasks in Gradle as a separate category. We can also break down the avoidance savings. The avoidance savings are a measure of, an estimate of time that we reduce the build by using different avoidance measures. So one of those is Gradle’s incremental build functionality, which has been around a long time. Then we can look at how much time are we saving from the local build cache and the remote build cache.
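The task-execution breakdown described above can be sketched in Python as a simple grouping of task records. The record shape (a dict with `outcome`, `cacheable`, and `duration_ms`) is an assumption made for this illustration, and the category labels are paraphrased from the talk rather than taken from the dashboard.

```python
CATEGORIES = (
    "not cacheable",
    "cacheable, cache miss",
    "cache hit",
    "up-to-date or no action",
)


def categorize(task):
    """Map one task record to one of the four categories."""
    outcome = task["outcome"]
    if outcome == "FROM-CACHE":
        return "cache hit"
    if outcome in ("UP-TO-DATE", "NO-SOURCE", "SKIPPED"):
        return "up-to-date or no action"
    if outcome == "EXECUTED":
        # An executed cacheable task means no stored output was available.
        return "cacheable, cache miss" if task["cacheable"] else "not cacheable"
    raise ValueError(f"unknown outcome: {outcome}")


def time_by_category(tasks):
    """Sum task durations (milliseconds) per category."""
    totals = {category: 0 for category in CATEGORIES}
    for task in tasks:
        totals[categorize(task)] += task["duration_ms"]
    return totals
```

A large "not cacheable" total is the number to watch, since that is time the build cache can never win back.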
Luke: One more thing I want to point out is that these numbers here are mean numbers. So we’re taking the average of the entire set. But to really get an understanding, you want to have a deeper look at those numbers. So if you hover over it here, we see a percentile breakdown.
Luke: Let’s have a look at the P 75 value here. That’s telling me that 75% of builds are saving less than seven minutes. But 25% of builds are saving more than seven minutes of their total build time. That’s pretty considerable. It’s an interesting thing to look at the total time saved here. I’d say over this period we’re looking at, which I know is about a week, we’ve saved 38 days of execution time via avoidance savings mechanisms. This is a week’s worth of work. If it took us 38 days to build all that stuff, then we wouldn’t be getting very far.
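Why look at percentiles rather than only the mean? A small Python sketch shows how far the two can diverge on skewed data. The nearest-rank method used here is an assumption (the talk does not say how the dashboard computes percentiles), and the sample numbers in the usage note are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def mean(values):
    return sum(values) / len(values)
```

Take eight builds that saved 0, 0, 1, 1, 2, 3, 7, and 30 minutes: the mean saving is 5.5 minutes, yet the median build saved only 1 minute and the P75 is 3 minutes. A single large saving pulls the mean up, which is why the percentile breakdown gives a truer picture of a typical build.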
Etienne: Luke, I think it’s worth mentioning for those not so familiar, that even if we don’t run a task, we know how long it took when it actually ran. So for up to dates as well as from cache.
Luke: Very good point. By adding that up, we have that. So we saw before that when we reuse something from the cache, we have the link to know which build it came from. So, therefore, knowing that, we can say well, how long did it take when that was originally built. The difference between how long that took originally and how long it took when it was reused is effectively what we’ve saved. That’s how that’s measured.
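The measurement just described, original duration minus reuse duration, is easy to state as code. This Python sketch uses a hypothetical `TaskExecution` record; the field names are assumptions, not the build scan data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskExecution:
    """Hypothetical record of one task in one build."""
    path: str
    outcome: str                              # e.g. "EXECUTED", "FROM-CACHE", "UP-TO-DATE"
    duration_ms: int                          # time spent in this build
    origin_duration_ms: Optional[int] = None  # time the task took when it really ran


def avoidance_savings_ms(tasks):
    """Sum, over avoided tasks, the original duration minus the time spent reusing."""
    saved = 0
    for task in tasks:
        if task.origin_duration_ms is None:
            continue  # the task ran for real, so nothing was avoided
        saved += max(0, task.origin_duration_ms - task.duration_ms)
    return saved
```

With the numbers from the earlier demo, a task that took about 9 seconds in the seed build and about 25 milliseconds to load from the cache saves roughly 8,975 ms per consuming build, so ten consumers save close to 90 seconds of execution time.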
Luke: So we can see if we break that down, we’re using the local build cache and the remote build cache for our CI builds. If you’re using disposable builds and you don’t have any kind of persistent disk across builds, you’re going to be using just the remote cache. It’s really kind of personal preference. We use a local build cache. It doesn’t make a significant performance difference for us, because the network connection between where we run the CI builds and our remote build cache is very fast. Being that we’re people who also develop Gradle, using the local build cache in the CI builds just helps us kick the tires on that a little bit. So it’s not something you necessarily need to do if you have a reasonable connection between your CI builds and your remote build cache. You can use just the remote build cache.
Luke: We can see that we’re saving an average of two minutes here. If we look at the percentile distributions, that’s how it works out. Now a couple of things about our particular build. If I wanted to think about whether I can make these builds faster, I’d refine this search a little bit so we can look at more homogeneous data, because this is all the different types of our CI builds, which have different performance profiles. So let me just go and have a look at one of the components. This starts to look a bit more homogeneous now. This is the same thing being built over time as changes are coming through. We see that the total potential build time is about 10 minutes. The total height gives us an indication of that. We’re really saving quite a bit by these avoidance mechanisms. This graph is mostly gray, which means it’s dominated by the savings, which is great.
Luke: In the cases where we actually are executing things, what can we do here? What opportunities do we have? So especially when you’re getting started with the build cache, and even over time as things change, what you really want to make sure of is that you’re not spending a lot of time on tasks that aren’t potentially cache-able. Because then you just have no chance of getting the benefits of the build cache. So this is something we keep an eye on. In our particular case, we’ve spent effort optimizing this build and made our expensive tasks cache-able and worked on that. But as I said, things change over time. Regressions are introduced that change a situation. So by looking at these numbers regularly and keeping an eye on what’s going on, we can keep it optimized and keep it working well. Of course, we can spend time looking at how much time we’re spending on build cache misses. If this number was particularly high, we could then go and dig in and look at this and say, well, can we break things up into smaller chunks that change less. There’s a bunch of things you can potentially do there, as well. How much time are we spending on build cache hits?
Luke: If this was really high, it’d be an indication we have some kind of cache overhead problem. But this is really quite reasonable. Then there’s the time spent on up-to-date tasks. One other thing I want to point out is something else this graph gives us an indication of: how effective parallelization is for this build. This number here also indicates that. This metric of serial task execution, you can think about it as all of the tasks the build had to execute just flattened out, and how long that would have taken. So if you de-parallelize it, how long would those tasks have taken? This number here, 1.4x, is telling us that on average, if we weren’t using parallel builds, the builds would be 1.4 times longer. We’ll see why that’s the case in a second. This graph is also giving us an indication that that number is not so big here, in the fact that these indicators of total build time line up very well with the serial task execution time.
So there’s not a lot of parallelism going on. But this is a single component. If I wanted to have a bit more of a look at this, I could go through to the build scan, have a look at the timeline, and start looking at the tasks that were executed and reasoning about why they are running in parallel or not. Let’s just have a look at something that is more parallel. If we look at the seed build, which actually builds many more components, the situation is a little bit different. Here we have the build time marker being quite a lot less than the serial task execution time, which is indicating we’re getting quite good parallelism. We see here the mean number for this metric is 3.6x, so the build time is a little over a quarter of what it would be otherwise. We can again see on the timeline here that we’re getting reasonable parallelization.
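As a rough illustration of the serial task execution metric discussed above, the factor can be thought of as the flattened task time divided by the time the build actually took. The sketch below is an assumption about how such a ratio could be computed, with made-up durations; it ignores configuration time and other non-task overhead that a real build also includes.

```python
def serial_execution_factor(task_durations_s, build_time_s):
    """Ratio of flattened (serial) task time to the actual build time."""
    serial_time_s = sum(task_durations_s)
    return serial_time_s / build_time_s


# Hypothetical build: four tasks totalling 360 s of work, finished in 100 s.
durations = [60, 90, 120, 90]
factor = serial_execution_factor(durations, 100)

print(f"{factor:.1f}x")            # 3.6x
print(f"{1 / factor:.0%} of the serial time")  # 28% of the serial time
```

A factor close to 1x means the tasks ran essentially one after another; the larger the factor, the more work overlapped.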
Luke: Especially when you’re starting out with the build cache, a key metric to pay attention to is this non-cacheable metric here. So how much time am I spending on non-cacheable tasks? If this number is high, that’s a good indication that there’s some fruit to be harvested by going and making those things cacheable. Let’s take an example build here and the process I might go through with a build scan to try and work out what’s happening, what to do next, and how to optimize. In the performance section of a build scan, I get many things. But we start with a high-level breakdown of the build time of this build. Overall, this took four minutes. I can see that the time to actually bootstrap Gradle here was rather high. I know that in this case, we’re probably spending quite a bit of time on compiling Kotlin scripts and some other things. I can also reason that this was actually the first build of this project with the daemon, so we’re building with a cold daemon here. So I know those numbers are about right for that particular scenario.
Luke: This whole configuration profile section is really detailed information. I can dig into how long each build script or plugin took, which is a whole talk in itself. I won’t go into that now, because we’re talking about optimizing for the build cache. So we get into this task execution section here. We get a breakdown of the tasks that were executed during the build and what kind of category they had with regards to performance. Here it’s saying that out of the tasks that we executed in this case, 111 tasks weren’t cacheable. This was 24% of the tasks and that took up a minute, which is actually interesting. This is a build with a higher number for this than what we’re used to. So I want to see exactly what those tasks are and try and work out whether there is any juice in trying to make these things faster. If I click this link, it’s going to take me back through to the task timeline, which we’ve seen a couple of times, with some filters applied. It’s saying show me all the tasks that weren’t cacheable. I can see these highlighted in the representation here. They’re in this list. In a typical process, I would then follow it and say show me, out of those tasks, the longest task that wasn’t cacheable. The build scan tells me we have a task here that ran for 40 seconds and it wasn’t cacheable because of some overlapping outputs. Actually, this is interesting. This is a new problem that must have popped up quite recently.
I’m discovering this as I talk to you. Which is good. If any of the rest of the team is watching, can someone log an issue to get this fixed, please. What I’m seeing here is that we’re spending 40 seconds on a task that might otherwise be cacheable, but isn’t because of a configuration reason. Using the build cache requires tasks to be configured a certain way. It takes some fine tuning. The scan is telling you there’s a problem here. There are some overlapping outputs: we’ve got two tasks writing to the same location. It’s also telling us which property of that task is problematic. So we can then go back to the build, look at that, separate these two tasks, and potentially save ourselves quite a lot of time, 40 seconds for each of these builds. There are many other aspects to performance optimization with scans. Those are some of the key ones. You can dig into aspects of the build cache and how it was used, the dependency downloads and how long those were taking, dependency resolution, whether some sections are taking longer, et cetera. That’s a whole other session we can do just in itself. Having things fast is great and essential, but like everybody I hope, we’re not perfect and we make mistakes. We break things. So we also need to fix those, as well. Etienne, talk a bit more about how we ensure that we fix things fast.
Etienne: In an ideal world, we would have an optimized CI and just have fast builds. But the reality is things go wrong. They often go wrong in ways that we didn’t expect and anticipate. Then it really comes down to how fast we can identify the problem and fix it. Let’s take a look at what we need to really be fast at debugging and fixing those issues. What we need is more than just a log to figure out what went wrong in CI. We need a rich model. We need deeper insight so we can quickly reason about and understand what went wrong, draw conclusions, and fix it. Also, when we have a CI issue that doesn’t happen locally, which I guess everybody has experienced in their career, it can be quite a pain, because then you have to log into the box, rerun the build, hope that you can reproduce it, maybe add some debug information. If you’re unlucky, you won’t even have access to the box. So we want to be able to still figure out what’s going on or what happened if we cannot log into that box. We also want to add specific data about that build. Having that extra context about that specific build will help us when something goes wrong. Of course, the best help is oftentimes just asking a colleague for another pair of eyes. We need a way to easily share with somebody else what happened, what we need, and how we can fix it. That’s where build scans really shine, as well.
Etienne: Because, like Luke said in the beginning, they capture what happened in the build, and we can use all that data to then reason about what went wrong and how to fix it. Let me give you a few concrete examples. I have picked out a few builds that went wrong. I’ll just go through them and highlight different things. You’ve already seen that we can quickly go to the build scan from the CI build. In our team, we really never use the CI tool, in this case TeamCity, to investigate issues, right? The first thing we share is the build scan. TeamCity really handles the scheduling and execution for us, but the feedback and diagnostics we get from build scans. Yeah, and I find it’s a really good split, as well. TeamCity does what it’s best at and build scans do what they’re best at. So here we have a build. It failed, something went wrong in the compilation of Groovy. We can easily see that. Sometimes the log does help us.
In this case, it’s not too bad to go to the log and find out what went wrong. But what’s really nice here is we don’t have to dig through the whole log, which might be tens of thousands of lines. We can just go to that task and say give me the log for explicitly that task. That also works in the case of parallelism, where we have intertwined log output, which is pretty much impossible to detangle just by looking at pure console output. You can get to the root of the problem faster and then start solving it. Of course, in this case, it’s an easy fix. But it’s also about easily understanding what went wrong. Here we have another build that failed. Typically we don’t even check what went wrong here. We just go to the scan and get a rich view of what went wrong. We can see some functional tests failed for the plugin for 4.10 and also some tests for 4.9. OK, so let’s take a look at what went wrong with these tests. We get a very rich view here of how many tests were run and how long they took. We can also filter for certain tests. But in this case, what we are interested in is the failed tests. So we can easily see here what went wrong with those tests.
We can choose a single test, and again see the output just for that particular test. It’s not part of the general log, but it’s really the log for that test. We can then understand what happened and fix it. In this case, something didn’t work with the dependency resolution. Now we have different ways to address this one. We go through the dependencies and we start investigating how a dependency was pulled in. Let me just take one example, just taking any dependency here. How was Jackson pulled in? We can see who required it. It didn’t pull in any more dependencies itself, but others depended on it. We can also see from what repository it was pulled. That can also help us when we want to debug dependency resolution issues. So that’s another example of how we can approach debugging an issue. Again, something went wrong. In this case, we see something went wrong with Checkstyle. This is a code check tool that we have set up in our build with certain rules. What’s pretty visible here right away is that the first thing the report we get from the log is telling us is, well, go to this other file. So now we have to go to this other file, which is not that easy on TeamCity, and figure out what went wrong. We want to have a faster way to identify the issue and fix it. So what we have here is a link we added that takes us directly to the Checkstyle report. What happens is we run the build on CI, we create that report, and we capture it as an artifact in TeamCity. Then we just link to that artifact.
The way we link to this is via a custom link. That’s something we add to every scan that is published. I’m just going to say the Gradle build scan plugin that you apply to your build provides an API so that anytime during the build, you can say here’s a link to something, here’s a tag, here’s a custom value. It’s a tool kit, and whatever kind of linking makes sense for you, you can easily add that to your build. So take a look. I click on that link, and now I see the Checkstyle report that was created and captured by TeamCity as an artifact. I can now go to the file that had an error and I can see what happened. That’s already a bit better than having to look at the log and potentially even going to the box to find out what was written in that file. What I’m showing here for Checkstyle would work with any other code checking tool or even any other tool you’re running that captures some kind of metrics. But we can do even better, and that is with custom values.
Etienne: Like Luke said, with the plugin we can capture not just links, but also key-value pairs and tags. What we do is whenever we run Checkstyle and something goes wrong, a warning or an error, we capture that as a custom value. Here we have only one issue. If we have multiple ones, we list them all as individual key-value pairs.
Here we can see right away which file it was and what went wrong. So we’re now at the fastest way to identify the problem. We saw Checkstyle failed, we go to custom values, and we can see what went wrong. These Checkstyle issues are relatively common for us. We have fairly strict code formatting rules and this tool enforces them. It happens quite a lot, and knowing quickly what the actual problem is obviously makes it faster for us to fix it. We first did the report link, and that was an improvement. But that report is still pretty noisy. I still have to work pretty hard to actually find the thing that I want. By putting it in the custom values, we know, OK, go to that. We’ve concisely represented it the way we need to be able to then go and action it. It was one of those things that was happening often, and we were spending too much time resolving it. Using custom values and the extensibility of build scans made it faster for us to resolve it.
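To make the idea concrete, here is a minimal sketch of turning a Checkstyle XML report into one key-value pair per issue. It is written in Python purely for illustration and is not the code used in the build described here; the sample report content, the "Checkstyle issue" key, and the function name are assumptions. In a real Gradle build, each pair would instead be handed to the build scan plugin's custom value API.

```python
import xml.etree.ElementTree as ET

# Hypothetical report in Checkstyle's XML output format.
SAMPLE_REPORT = """<checkstyle version="8.12">
  <file name="src/main/java/com/example/Foo.java">
    <error line="12" column="5" severity="error"
           message="Unused import - java.util.List."
           source="com.puppycrawl.tools.checkstyle.checks.imports.UnusedImportsCheck"/>
  </file>
  <file name="src/main/java/com/example/Bar.java"/>
</checkstyle>
"""


def checkstyle_custom_values(report_xml):
    """Return one (key, value) pair per reported issue."""
    root = ET.fromstring(report_xml)
    values = []
    for file_el in root.iter("file"):
        for error_el in file_el.iter("error"):
            value = "{}:{} {}".format(
                file_el.get("name"), error_el.get("line"), error_el.get("message")
            )
            values.append(("Checkstyle issue", value))
    return values


for key, value in checkstyle_custom_values(SAMPLE_REPORT):
    print(f"{key} = {value}")
```

Files without issues, like `Bar.java` in the sample, produce no entries, so the resulting list contains only what someone needs to act on.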
The mechanism we applied here, I think we applied for some other tool, as well. Of course, whatever tool we use, like I said, as long as you can get to the information of what went wrong, you can add it to custom values and then quickly surface that data. Then the last example I want to show you. What can be very powerful is this: you run a build, it fails. There is a strong inclination to say, well, let me check the last time it passed and compare what was different from the last time I ran it when it was successful. To briefly show that, I’m taking two builds here. We have one test that failed here. Then the next build succeeded. So obviously something changed between those two builds. In this case, we want to figure out what it was. So we can mark those builds and we can compare.
Then we get a comparison of the different data that we capture, like infrastructure and the switches, different custom values, different dependencies, and also different task inputs. Armed with that information, we can then find out what was different between those two builds, and hopefully that gives us an indication of how to fix it. So these are just some samples of what helps us to really quickly identify, when something went wrong on CI, why it went wrong and how we can then fix it. So with that said... You want to wrap up? By using these tools and these approaches, what we get out of it is fast feedback. That’s the most important thing.
When I commit a change and I get it into Master where we’re integrating all the time, if that has broken something or impacted something in a negative way, I find out about that fast. Because of course, the longer that takes, the longer that it takes to resolve. So if I find out about that in five minutes, I’m much more likely to fix it faster than if it’s 30 minutes later when I’ve already moved on and context switched. So fast feedback, as we all know, in software development is absolutely critical. Another thing which we didn’t touch on too much, but this has also been really transformative for us as well, is having a consistent debugging experience for CI builds and local builds.
What we see a lot is that people have their tools in front of them for local builds and they know how to deal with certain things there. Then it goes wrong on CI and they’re lost. Having build scans, having that same way of getting into the information and understanding it, and having it work consistently in both environments makes us much faster and more effective at doing that. So I think that’s a key point. Maybe to that point, personally when I have a build that fails locally, I just click the scan and look at what went wrong. I don’t even use the command line means that Gradle gives me to explore. Yeah, I do the same. Actually, there is one thing we had touched on, but didn’t go into in depth. These build scan links are so shareable. We’re often in the Slack channel that we use in the development team, passing these links around all the time to get help.
Being able to do that for local builds and CI builds in the same way has the same benefits. With all the information that the build scans give us, having a real understanding of what’s going on with build performance is also key. Operating on intuition, and having conversations where we’re guessing about how things are going rather than basing those conversations on facts, can just lead you to making wrong decisions. So we have a really good understanding of what the performance profile of our build, and of CI in particular, is like at any point in time. We can feel confident that we’re on top of it, or, if we’re not, understand the situation and then make good prioritization decisions about optimization. The simple CI configuration, we talked a bit about this at the start. But this is something I really value, in that we don’t have to think much about how our CI works. A change goes through, we build everything, I’m confident that that works. The logic is very simple. I know that it adapts and evolves with us, because it’s so simple. We don’t have to change it. I really value that about this approach. Also the thorough verification.
Part of that is we test everything, every commit. So I’m not concerned that we’re missing things, because we don’t have complex triggers set up or anything like that. So I feel it’s very well tested. Then there are the faster local builds, as Etienne and I were talking about with how we arrange our use of the build cache. As somebody who runs a lot of local builds every day, I particularly appreciate saving all that time when I’m building stuff. That’s what that gives us. At the end of the day, that allows us to develop better software and do it faster. So just a bit of a recap on the key points. Our whole approach to CI testing and delivery is predicated on the build cache now. We wouldn’t be able to do this approach of just building everything on every change if we didn’t have that. We would just be waiting days for every change, which is just not going to work. It’s not enough to just make it fast once. You have to keep it fast. Optimization and a clean house is a continuous investment and a very worthwhile investment. We know that it makes us faster and more effective. It gets our product to market faster. Then there’s being faster at fixing, with all that information and all those techniques that Etienne pointed out. We just resolve issues better and have less guessing about what’s going on.
So the first question that we had is: how do you manage the build distribution when you have dependencies? I’ll have to guess here. My interpretation of this question is, if you have many small repositories and you’re not using a mono repo, how does that work? You will still need a way to express those many builds and the configuration between them. But the principle of using the build cache to be able to reuse the artifacts that were previously built still applies. You’ll still be saving all of that build time. It’s just that the interface of sharing things is slightly different. In terms of the performance aspect, the difference between mono repo and multiple repos isn’t particularly significant. The reason why we choose a mono repository is that we really value the early integration.
There’s no latency between changes in one component affecting the other. If somebody checked in a change to an upstream component 10 seconds ago, when I build, it’s right there. So it’s just reducing the latency for us, and because we can also do it fast with the build cache, that’s why we chose that approach. I would say that in a multi-repository situation, I would be even more reliant on build scans because I’m probably going to be less familiar with all of those different projects. So when I am looking at issues, I’m going to want to be able to find out more about those. But the general answer is it’s largely the same. Whether you have a mono repo or multiple repositories doesn’t fundamentally change the dynamic. We have another question here. Can you give an example of a seed build and a downstream build?
What is the difference between those? I tried to give an example during my demo. We run all the compile tasks in the seed build. Then the downstream builds that also need to compile those same components to, for example, test against them, can just reuse what has been put into the cache. That is really the purpose of the seed build. Because that was the question, what is the difference. The seed build’s purpose is to push stuff into the cache, while the downstream builds are then the consumers of what has been put into the cache. So the seed build isn’t strictly necessary. We could just make it completely flat and just build everything.
The reason why we use the seed build is it can shave a minute, a minute and a half off those downstream builds in the case where they run at the same time and the compilation hasn’t already happened, so you’ve got two things doing the same work at the same time. So we do a lot of the common work upfront, and in our case, we’ve optimized that particular build and we do it in parallel on a single machine. It’s about 3 and 1/2 minutes. Because we reuse that so much downstream, that’s worthwhile for us to do. So it’s not quite a micro-optimization. It’s a bit more significant than that. But it’s not a requirement if you want to get started with the build cache. Another question related to that: what typically goes into grouping and splitting them up for parallelization? Sure, so we have two basic strategies here, depending on the type of test suite. If it’s a test suite where it’s cross-version testing, so we’re testing a bunch of different Gradle versions or plugin versions, we will curate some groups. We have a group that is the latest Gradle versions, because they’re the ones most likely to break for us.
Then the older versions we will break up into other groups based on size. So there is a kind of mix: there is hand curation, and then we have some code in our build which just says, OK, we have 30 old Gradle versions, we want to have five buckets, break them up and give them these task names. For cases like browser tests, which aren’t really cross-version tests, we just have a lot of them and they’re slow. We have a small amount of Gradle code that says look at the test files. Same thing: we want to have six buckets, distribute those, create a task for each bucket. If we notice some of the buckets of a particular type of test are getting a bit big, say we’re now seeing times of 15 minutes or so, which is too much for us, we need to make them a bit smaller. We’ll go in and just change the number of buckets. I think now we have six. Probably jumped up to 15. That’s a one-commit change: OK, we now have 15 buckets.
We change the number in CI as well, in the same commit, and from the next run on it’s like that. So yeah, we split on factors like that and also just arbitrarily break them up into buckets. There’s a nice side effect that we discussed also with flaky tests. If you split up your tests into multiple builds and you run them and one build fails because of a flaky test, what do you do? You rerun the build. In our case, we rerun all the builds. But because nothing has changed, those that were successful will just reuse the artifacts that were put into the cache already. So rerunning the whole chain will only re-trigger the one build that actually had that failing test in it, which is just a subset of the whole suite of tests. We have quite a lot of browser tests, and browser tests are difficult to stabilize. We’re constantly working on it, but that’s a problem we’re constantly battling.
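The bucket splitting described above can be sketched as follows. This is a Python illustration of one possible approach, a simple round-robin distribution over a sorted list, and not the actual Gradle code from the build discussed here; the function names, the `oldVersionsTest` task name prefix, and the version labels are hypothetical.

```python
def split_into_buckets(items, bucket_count):
    """Distribute items round-robin over a fixed number of buckets.

    Sorting first keeps the assignment deterministic, so the same item
    lands in the same bucket on every run as long as the inputs are unchanged.
    """
    buckets = [[] for _ in range(bucket_count)]
    for index, item in enumerate(sorted(items)):
        buckets[index % bucket_count].append(item)
    return buckets


def bucket_task_names(prefix, bucket_count):
    """Hypothetical naming scheme: one task name per bucket."""
    return [f"{prefix}{number}" for number in range(1, bucket_count + 1)]


# 30 made-up version labels split into five buckets of six.
old_versions = [f"version-{n:02d}" for n in range(1, 31)]
buckets = split_into_buckets(old_versions, 5)

for name, bucket in zip(bucket_task_names("oldVersionsTest", 5), buckets):
    print(name, len(bucket), bucket[0], "...", bucket[-1])
```

With this shape, changing the number of buckets is a one-line change to the `bucket_count` argument, which matches the "one commit" adjustment mentioned above.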
How is the build cache different from using Nexus or Artifactory with snapshots? That’s a good question. There are a couple of key differences. One, if you’re using a binary repository manager like Artifactory or Nexus, you can configure things in such a way, using multiple builds and sequencing builds, that you’re reusing things from upstream. But you can really only reuse certain things. You can reuse the jars as they’re passed down. It doesn’t help you with avoiding tests or avoiding anything else that doesn’t propagate by that mechanism. So while it’s similar, for the case of build performance optimization it’s a much more restricted version.
It allows you to avoid a particular type of thing. It also requires modeling some of the dependency information that you already have in your build, about what depends on what, out to another tool as well, whether that’s your CI tool in order to sequence your builds or something else. At a high level, the build cache is a more comprehensive performance strategy in that any Gradle task that produces files as it does its work can potentially be avoided and reused downstream. Test tasks are a good example, too. If CI runs the tests, they might take 15 minutes, as you say. I run them again locally, but I have no changes. Well, I don’t need to rerun the tests. Modern builds are about a lot more than just compiling and pushing jars around. There’s source code generation, there are UI assets to build, all of that stuff. It’s simple with the build cache. It’s just Gradle doing its thing and then this thing making it fast on the side, almost. It doesn’t require a lot of extra machinery.
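The general idea of reusing task outputs when nothing has changed can be illustrated with a toy cache keyed by a hash of the task's inputs. This is a deliberately simplified sketch in Python and not Gradle's implementation, whose cache key also takes things like the task implementation into account; the `BuildCache` class, `cache_key` function, and the fake compile action are all invented for the example.

```python
import hashlib


def cache_key(task_name, inputs):
    """Hash the task name plus every input name and content, in a stable order."""
    digest = hashlib.sha256(task_name.encode("utf-8"))
    for name in sorted(inputs):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(inputs[name])
        digest.update(b"\0")
    return digest.hexdigest()


class BuildCache:
    """Toy in-memory cache: maps a cache key to previously produced outputs."""

    def __init__(self):
        self._entries = {}

    def run(self, task_name, inputs, action):
        key = cache_key(task_name, inputs)
        if key in self._entries:
            return self._entries[key], "from-cache"
        outputs = action(inputs)
        self._entries[key] = outputs
        return outputs, "executed"


def fake_compile(inputs):
    """Stand-in for expensive work; produces one output per input."""
    return {name.replace(".java", ".class"): b"bytecode" for name in inputs}


cache = BuildCache()
sources = {"Foo.java": b"class Foo {}"}

print(cache.run("compileJava", sources, fake_compile)[1])  # executed
print(cache.run("compileJava", sources, fake_compile)[1])  # from-cache

changed = {"Foo.java": b"class Foo { int x; }"}
print(cache.run("compileJava", changed, fake_compile)[1])  # executed
```

The same mechanism applies to any kind of task, which is why a rerun of an unchanged test task can be skipped just like an unchanged compilation.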
We don’t have a single cache node. You can have any number of cache nodes. You have the replication, you have the preemptive replication. It just allows you to leverage the performance benefits even more. The next question we have is, how can I make the most out of the build cache on a big Android project? Good question and a very hot question. A couple of things here. I’m not a particular expert in this area, but I can share what I do know. We’ve been working very hard with Google for a while now on improving the cacheability of the Android tooling. They’re working very hard, and we’re working hard, as well. If you’re using the build cache with the currently available versions of the Android tooling, definitely check out the Build Cache Fix plugin that we provide.
That patches some of the issues in the Android plugin to make it more cacheable. Of course, those fixes are being inlined into the Android plugin. I can’t remember the version numbers, but the version that’s coming out now, which is currently in beta or RC, has those fixes inlined. So the situation is improving significantly with the cacheability of Android builds. I would say keep an eye on those developments. Apart from that, get started with build scans. One thing we didn’t mention is that while build scans are available in Gradle Enterprise, we also have a free service, scans.gradle.com, where you can go and create build scans for free. A lot of the features for looking at your performance profile and seeing which tasks were cacheable, which ones weren’t, and why are available in that. So create a build scan and share it with the Android community. Or, if you want, share it with us on the Gradle forums, discuss.gradle.org.
Point it out and we can help you further from there. At a high level, that’s the best advice I can give. Another question here is from someone who has many small repositories instead of one large one. This is the multi repo versus mono repo situation. I kind of touched on this in an earlier question. The question is whether you can benefit from these tools and these approaches in that same situation, and the answer is yes. Some things are different. But largely, the issues of wanting to build everything on every change, or at least verify everything on every change, and do that fast are fundamentally the same. Just having the build split over different repositories doesn’t really change that; as those things are integrating, it’s largely the same. As I said, we use a mono repo because it’s fast enough for us. We’ve been able to optimize it so that it’s fast and we get that immediate feedback on changes. That’s why we take that approach. There’s a question. Do downstream builds trigger publishing to the binary repository or external publishing whatsoever? We still have some publishing, like the build scan plugin that we build. But the publishing is not motivated anymore by other builds needing to consume it.
We also have some Docker images that we publish. But like I said, that publishing is not motivated by other builds consuming them, but just because there are external parties of some form that will consume those published artifacts. Those things are only being published to then actually be used by whatever they were going to be used by, the plugin be used by Gradle builds or the development grants or not. That publishing happens right at the end of our pipeline. We don’t publish at the start before we’ve done all the verification. We only publish when we’re ready to share with the world and it’s passed all the tests. So the next question is, does the build cache cache the configuration of the tasks or the actual execution? Good question. It’s the execution. What it caches is the work that the task does.
The compiled classes, the source code that was generated if it’s something that generates code from WSDL or something, the test results if it’s running tests. So it’s effectively caching the result of running the tests. It’s the expensive work of a build. That’s what the build cache stores. Another question. Does Gradle Build Scan have a plugin for Jenkins? Yes, there is one. There were some contributions from a person actually working at Gradle. Use it, it works well. Does TeamCity also show the same information about failed tests? I cannot give a conclusive answer. It definitely shows some information, as well. But you definitely have more structure when you look at it in build scans. You have the searching capabilities. You have the per-test log. And you have the metrics around how many tests there were, how long they took, and so on.
There’s more we want to do in terms of aggregate analysis and cross tests. So the next question is, do you publish the final artifacts to some binary repository like Nexus or use the Gradle cache all the way? Yeah, I think we addressed that in the previous question. So yes, we do, but right at the end of the build, not before. The performance optimizations are what gets reused during the build. For those who have heard a lot about build scans now but have never used them, if you just want to give it a first spin, go to the command line of your Gradle project, whether it’s an Android project or Java doesn’t matter. Run your Gradle command, the build command, whatever it is. Add --scan, and it’s going to publish a first build scan to scans.gradle.com.
You will see exactly the information that we showed in scans here. We’re not hiding any information from the non-enterprise version. So I think that’s a great way to get started. Take it from there. Well, that’s all the questions we received. So thank you for watching. If you want to learn more about Gradle Enterprise, you can go to gradle.com for more information about build scans, the build cache, and how all that works. The next webcast that Gradle will be doing is about maximizing developer productivity with Gradle Enterprise, which is really a training session. A lot of the things that we touched on in scans today and kind of breezed past, we’ll be going into in much more depth, really getting into the nitty-gritty of how all that works. So if you found this interesting today, it’s definitely a webinar you will get something out of, on how to really get more out of scans.
Yeah, I really recommend it. It’s very use case-oriented. So people can really see how they can actually benefit in their everyday lives from build scans and Gradle Enterprise in general. So thank you again.
Luke: Thank you very much. Goodbye.