March 27, 2026

From 0.1% green builds to near-zero flakiness: How Apache Kafka overhauled its CI

By Lindsey Bonner

A little over a year ago, fewer than 0.1% of Apache Kafka's CI builds came back fully green. That's not a typo. With 60 committers, hundreds of contributors, and up to 100 commits landing per week, the project's CI was a patchwork of hope and retry buttons. Builds took anywhere from 1.5 to 8 hours, flaky tests were everywhere, and developers had collectively learned to stop looking at the results.

David Arthur—Apache Kafka PMC member, Senior Staff Software Engineer at Confluent, and the person who originally wrote Kafka's Gradle build—finally decided he'd had enough. What followed was a systematic overhaul with Develocity, compressing build times to a predictable 2-hour window, reducing build failures caused by flaky tests from 90% to just 1%, and shifting ownership of build health back to the community.

If you prefer watching a video to reading a post, check out our webinar recording on this topic.

Kafka's test suite is massive—over 34,000 JUnit tests, split between unit and integration tests. Run them serially on an average machine, and you're waiting 4 to 7 hours.

Before August 2024, everything ran on ASF-managed Jenkins infrastructure with a classic controller-worker setup. Hoping to compress build times, the first thing the team did was crank up Gradle parallelism. While it helped in some cases, in most it actually made things worse. Build times became wildly inconsistent, with the fastest around 1.5 hours, the average still at 4 hours, and some of the slowest builds hitting 8-hour timeouts. Increased flakiness also crept in with the concurrency, leading to timing issues, garbage collection hiccups, and race conditions during shutdown.
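The post doesn't show the exact settings the team tried, but "cranking up Gradle parallelism" typically means knobs like these. A minimal sketch, not Kafka's actual configuration:

```groovy
// gradle.properties — illustrative parallelism settings
// org.gradle.parallel=true        // run tasks from independent projects concurrently
// org.gradle.workers.max=8       // cap Gradle's worker pool (defaults to CPU core count)

// build.gradle — fork multiple test JVMs so test classes run concurrently
tasks.withType(Test).configureEach {
    // More forks means more throughput, but also more contention for
    // CPU, memory, and I/O — which is exactly what bit the team here.
    maxParallelForks = Runtime.runtime.availableProcessors().intdiv(2) ?: 1
}
```

On shared hardware, each extra fork competes not only with its siblings but with every other build on the machine, which is why higher parallelism amplified the noisy-neighbor problem rather than fixing it.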

So what was going on? Noisy neighbors. Jenkins relies on operating-system-level resource isolation, which means that other builds running on the same hardware compete for CPU, memory, and I/O resources. You can't set a reasonable timeout when "healthy" builds take anywhere from 80 minutes to 8 hours.

"Totally green builds were actually really rare."

— David Arthur (watch at 8:05)

Kafka's PR workflow required three Jenkins jobs to pass. When David looked at the data, fewer than 0.1% of PR builds achieved that. In practice, contributors would re-run builds five or six times, cobble together one green result from each job across separate runs, and merge on trust. For a project processing 40-100 commits per week, this wasn't sustainable.

The catalyst was David's work on KRaft—the effort to replace ZooKeeper with Kafka's own internal metadata management. After months of submitting pull requests and fighting broken builds, he reached a breaking point:

"I came out of that process just absolutely frustrated and confounded with the state of our build, and I decided to fix it."

— David Arthur (watch at 12:06)

His first instinct was caching. Flip it on to skip redundant work and builds get faster. But Gradle's build cache needs persistence between builds, and that wasn't feasible on the ASF Jenkins infrastructure. That constraint pushed him to experiment with GitHub Actions, primarily to test whether its built-in caching could work with the Gradle cache.

David came up with a plan to prototype GitHub Actions:

  • Build an MVP
  • Run it alongside Jenkins for 30 days
  • Collect data
  • Let the community decide which to continue using

But David didn't yet have a way to collect that data. That's where Develocity came in. As an Apache project, Kafka has free access to Develocity. Develocity's Build Scan® feature provided the team with a structured, persistent record of every build, including performance trends, test outcomes, and system-level metrics. By using the tags within Develocity to filter Jenkins builds from GitHub Actions builds, they could draw a direct comparison. No more manually clicking through Jenkins builds and jotting down times!

The results were dramatic. Jenkins build times ranged from 1 hour 20 minutes to 8 hours. On GitHub Actions, everything compressed into a narrow band between 1 hour 37 minutes and 1 hour 59 minutes.

Counterintuitively, the GitHub Actions runners were weaker machines with only 4 GB of memory and 4 CPU cores, compared with Jenkins workers with hundreds of GB of RAM and enterprise-grade processors. The team hadn't even realized how small the machines were until Build Scan data from Develocity actually revealed the runner specs.

So what was really going on? The answer was container-level resource isolation. Each GitHub Actions job ran in its own container with dedicated resources, eliminating the "noisy neighbor" problem entirely. The team didn't change anything about the build itself, only where it ran. Stability and predictability proved far more valuable than raw compute power.

Consistent 2-hour builds on a small machine beat unpredictable 1- to 8-hour builds on a powerful one.

With builds now stabilized, the team turned to caching to further cut build times. Their strategy was to populate the Gradle cache from trunk builds and reuse it on pull requests, so only the changed code needed to be rebuilt. After some trial and error on how best to ensure deterministic outputs to avoid cache misses, the team achieved a consistent, predictable improvement.

But there was more work to be done. In serial terms, caching removed 2 hours 3 minutes of total execution time from a 7-hour 45-minute build, roughly a 25% savings, yet wall-clock build times improved only about 15%. David dug into the discrepancy using Build Scan data and found that Kafka's modules were too coarse-grained: a handful of very large test tasks dominated overall build time, a classic bin-packing problem. With only four Gradle workers, one always ended up overloaded.

Build times dropped from about 2 hours 3 minutes to 1 hour 58 minutes, with the median improvement around 15%. In the best cases, such as documentation-only changes, the cache saved 75-80% of build time.

The team is now actively decomposing modules, separating API contracts from runtime implementations so changes to internal behavior don't force downstream modules to rebuild. This exercise is good software design in its own right, plus it makes caching significantly more effective. They also don't use caching on the mainline branch, running all tests on trunk to catch behavioral regressions that contract-level caching might miss.

After the migration from Jenkins to GitHub Actions, flaky test failures on the mainline branch dropped from roughly 90% to 23%. This signified a massive improvement, but it wasn't good enough. Nearly a quarter of trunk builds were failing, and failing builds couldn't populate the cache, creating a vicious cycle.

Here are some hard-won lessons from what the team did next:

The first step was enabling automatic retries through the Develocity Gradle plugin, with a new policy that each test gets one retry, capped at 10 total retried failures per build. Without the cap, a systemic problem could grow and silently double build time.

David believes that when developers get accustomed to clicking retry, they stop examining failures altogether. Automated retries keep builds green while generating the data needed to identify problem tests. The system does the retrying; humans do the analysis.

Once retries were generating data, Develocity's flaky test reports made it straightforward to sort tests by flakiness rate and find the worst offenders. The team adopted an aggressive disable-and-fix policy: any test with flakiness above 10% is immediately quarantined, documented with a Jira ticket, and moved to a separate quarantine suite.

This approach does three important things:

  1. Removes the impact from the main build immediately
  2. Gives developers time to write a proper fix without racing against the clock (no more quick patches that don't actually resolve the root cause)
  3. Keeps the quarantined tests running, so Develocity can confirm when a fix actually works, because "It works on my machine" doesn't cut it for flaky test verification

Even if you fix every flaky test today, new code introduces new flaky tests. That's the cycle that had plagued Kafka for years: periodic bug bashes, followed by gradual regression.

The team successfully broke this cycle with a new test staging workflow:

  • Every newly added test is automatically quarantined for 7 days and run as part of a separate "new test" suite.
  • During that window, the team monitors for flakiness.
  • If a new test proves flaky, it gets redirected to the quarantine suite before it ever reaches the main build.

The results have been positive: the rate of builds failing due to flaky tests dropped from 23% to under 1%. In the last 90 days, only 3 of 600 mainline builds have failed due to flaky tests. Each failure is now also treated as an incident to be investigated and resolved, something that was impossible when failures numbered in the hundreds.

David learned that tooling alone doesn't sustain build health. It takes the whole community.

"I very strongly believe in no broken windows."

— David Arthur (watch at 44:20)

The broken windows theory applies directly to CI. When 90% of builds failed, everyone ignored failures. Now that green builds are the norm, the community actively maintains that standard. Explicit, documented policies for handling flaky tests and failed builds mean every contributor knows the process, not just the handful of people who work on infrastructure.

Equally important to David is a blameless culture. When someone breaks the build, the response is, "Can you help me fix this?" rather than blame. This fosters shared ownership, and David hasn't had to touch the build in nearly a year because the community has taken over to keep everything running.

To sum it all up, here are David's most valuable lessons learned from Kafka's CI modernization, which likely apply to any project you manage or contribute to:

  • Resource isolation matters more than raw compute
  • Caching effectiveness depends on project structure
  • Flaky tests require systematic workflows rather than periodic heroics
  • Sustained build health depends on community culture as much as tooling

None of this would have been possible without Develocity as the observability layer throughout. Build Scan data made the Jenkins-to-GitHub Actions comparison objective and undeniable. Task-level cache insights revealed exactly where Build Cache misses were happening and why. And Flaky Test Detection surfaced the worst offenders while providing the evidence needed to confirm that fixes actually worked, because "it works on my machine" doesn't cut it when you're managing 34,000 tests across a distributed team.

If your builds are where Kafka's were a year ago, Develocity gives you the same data-driven foundation to turn things around.
