This blog post is inspired by my DPE Summit 2025 presentation: Measure, Don’t Guess: Observability as the Key to Performance Tuning Software Delivery.
If you’re senior engineering staff or leadership in a large software organization, you know the drill. We are constantly seeking ways to make our engineering teams faster, more efficient, and generally less miserable. But how often do we actually stop and measure before we start tuning?
I’ve asked engineers this question before: Have you ever tried to performance-tune an application without benchmarking or profiling first? We all nod—we’ve done it. But the real question is, have you ever done it successfully?
The risk of optimizing without data is severe. It’s an old quote, but it still rings true today:
“More computing sins are committed in the name of efficiency without necessarily achieving it than for any other single reason, including blind stupidity.” (A Case Against GoTo, William Wulf, 1972)
My old boss, Martin Thompson (creator of the high-throughput, low-latency technology Aeron), used to tell me, “Measure, don’t guess”. And that principle—taking measurements first to establish a base state before attempting to improve performance—applies not just to optimizing application performance in production, but to optimizing the entire developer experience.
Let’s apply that powerful performance engineering mindset to our internal processes and our path to production.
The pain of the path to production
Before we fix anything, we need to acknowledge the pain points frustrating our development teams. These aren’t just minor irritations; they’re major productivity killers that overload our continuous delivery pipelines.
The pain often manifests in the following ways (and we know this firsthand, because we see it at Gradle):
- Waiting for builds: We spend too much time waiting for builds to pass. Or worse, to fail. “Fail fast” is key to high performance.
- Troubleshooting failures: We waste developer time by manually digging through logs to pinpoint the root cause of a failure.
- Flaky tests: This is one of my pet peeves. Flaky tests drive me insane.
- Security debt: We leave security until late in the process, hoping someone else will flag what needs to be fixed, by which point the fixes are time-consuming, difficult, and expensive.
When faced with slow delivery cycles, long queues, and frustrated engineers, leadership often turns to the “quick fix.” So let’s talk about what happens next.
The myth of throwing hardware at the problem
We’ve all been there: the CI pipeline is clogged, builds are slow, and the instinctive solution is simply to throw more hardware at the problem.

Jeff Atwood, co-founder of Stack Overflow, once argued:
“…when does it make sense to throw hardware at a programming problem? As a general rule, I’d say almost always.” (Jeff Atwood, Hardware is Cheap, Programmers are Expensive, 2008)
He may have had a valid point in the pre-cloud era—buying a server was cheaper than wasting developer time. But in the age of hyperscale cloud infrastructure, throwing hardware at the problem is almost too easy. We have to ask: did it solve the problem? Was it cost-effective? And most critically, how can you tell?
The data often shows that resource constraints aren’t the real issue. Hans Dockter’s keynote at last year’s DPE Summit revealed a staggering truth: 90% of CPUs in CI are unused, yet we still have queuing issues. If 90% of your resources are sitting idle, your bottleneck isn’t hardware! It’s efficiency or process, and you need visibility to determine which.
We’re living in an era where software complexity has outpaced our understanding. As Charity Majors, co-founder and CTO of the observability platform Honeycomb.io, noted:
“We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood.” (Charity Majors, Observability is a Many-Splendored Definition, 2020)
This lack of understanding is precisely why we need to stop relying on guesswork and instead adopt observability across the entire continuous delivery pipeline.
Moving beyond monitoring with observability
We generally monitor performance in production—that’s where we focus our metrics, looking at application performance and user impact. But monitoring production only covers a small portion of the deployment loop, and our modern CD pipelines are enormously complicated matrices running on many different agents.
We need to observe the path to production as well in order to improve productivity. Any time code is sitting around, waiting to be delivered, it’s simply waste in Lean terms.
This is where the distinction between monitoring and observability becomes vital. As Charity Majors states in her Observability Manifesto:
- Monitoring is about known unknowns. We already know there’s an issue, or potential for an issue, so we’re going to check in on that particular thing by setting actionable alerts.
- Observability is about unknown unknowns. We don’t know where the issues are because we’ve never been able to look deeply enough to find them. This empowers you to ask brand new questions and then explore wherever the data takes you.
With true observability, you don’t need to know the questions ahead of time—instead, you can use the data you’ve gathered to explore and discover hidden inefficiencies.
The scientific approach to developer experience
The solution to our pipeline pains is straightforward: If you cannot measure it, you cannot improve it. We need to use the scientific method to improve developer experience, recognizing that the ability to experiment is critical for innovation and business value.

The scientific approach for performance tuning your pipeline is simple:
- Measure: Establish your base state (a minimal sketch of this step follows the list).
- Form a hypothesis: Based on your measurements, hypothesize the bottleneck.
- Implement a change: Crucially, implement one change. If you change five things at once, you won’t know which one was actually effective.
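Even before you reach for any tooling, that first measurement step can start very simply. The sketch below is illustrative only: it assumes a Unix-like CI agent where `./gradlew build` is the command being measured, and the CSV file name and run labels are made up for the example. It wraps one build invocation and appends its wall-clock duration to a file, so the baseline and each subsequent single change can be compared across many runs.

```kotlin
// measure-build.main.kts: a minimal, illustrative way to establish a base state.
// Assumptions (not from any particular toolchain): "./gradlew build" is the command
// under measurement; the CSV path and run label are placeholders.
import java.io.File
import java.time.Instant

val label = args.firstOrNull() ?: "baseline"   // e.g. "baseline", "after-remote-cache"
val start = System.nanoTime()

// Run the build exactly as CI would, streaming its output through.
val exitCode = ProcessBuilder("./gradlew", "build")
    .inheritIO()
    .start()
    .waitFor()

val seconds = (System.nanoTime() - start) / 1_000_000_000.0

// One row per run: trends across many runs, not a single data point,
// are what should drive the hypothesis about where the bottleneck is.
File("build-timings.csv")
    .appendText("${Instant.now()},$label,$exitCode,${"%.1f".format(seconds)}\n")
```

Crude as it is, a dozen rows of that file already tell you whether a change moved the needle, which is the whole point of implementing one change at a time.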
How do we measure?
Tools like Develocity offer the ability to track metrics across your continuous delivery pipeline. This includes tracking build counts, trends in build time, failure rates over time, and identifying those frustrating flaky tests. More importantly, you gain visibility across multiple projects, helping technical management spot issues like unused or rogue dependencies in CI projects you didn’t even realize existed.
We’re also moving into an era where Agentic AI can query observability data, turning complex build failure data into actionable explanations—an incredible boost to troubleshooting speed.
Seeing what the data reveals
Once you begin measuring, you start seeing startling things that immediately contradict some of your long-held assumptions:
1. The build is “just big”
When large, complicated enterprise applications have slow builds, the assumption is often: “It’s just big. That’s why it takes a long time”.
But when you measure at a granular level, you often find that dependency resolution is actually dominating your build time.

Data shows that 30% to 40% of all CI build time at large organizations is spent downloading dependencies. This is a massive chunk of time dedicated to a task that should be relatively easy to speed up by implementing effective caching.
To fix the problem, you first have to actually understand the nature of the problem itself. Better observability is what delivers that understanding. Otherwise, you’re writing it off as “just big”, or you’re wasting effort trying to solve the wrong problem.
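As a rough sketch of what “effective caching” can look like in a Gradle build, here is one possible shape in the Kotlin DSL. The repository proxy and build cache URLs below are hypothetical placeholders for whatever your organization actually runs, and this is one option rather than the only fix.

```kotlin
// settings.gradle.kts: a minimal caching sketch. The URLs are hypothetical placeholders.
import org.gradle.caching.http.HttpBuildCache

dependencyResolutionManagement {
    repositories {
        // Resolve dependencies through a repository proxy close to the CI agents
        // instead of pulling them from public repositories on every build.
        maven(url = "https://repo.internal.example.com/maven-central-proxy")
    }
}

buildCache {
    local {
        isEnabled = true
    }
    remote<HttpBuildCache> {
        url = java.net.URI.create("https://build-cache.internal.example.com/cache/")
        // Only CI populates the shared cache; everything else just reads from it.
        isPush = System.getenv("CI") != null
    }
}
```

Task-output caching itself still has to be switched on, for example with `org.gradle.caching=true` in `gradle.properties` or the `--build-cache` flag, and the measurement from the previous section is what tells you whether any of this actually reduced the dependency-download share of your build time.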
2. Flaky tests are “random”
Another frustrating assumption is that flaky tests are simply random failures—an unavoidable, unpredictable consequence of a complex system. So we just shrug and rerun the build, right?
No! Observability proves otherwise:
- Measurement can reveal that a small number of tests are responsible for most flakiness, giving you a focused target for repair.
- Failures might align not with randomness, but with increased load or specific, faulty infrastructure.
Flaky tests are hard to fix precisely because you have to find the underlying problem. Unfortunately, the human consequence of chronic flakiness is far worse than the wasted cycles, as a study on the developer experience of flaky tests found:
“…our analysis revealed that rerunning the failing build and attempting to repair the flaky tests were the most common actions. Our findings also suggested that developers who experience flaky tests more often are more likely to take no action in response to them.” (Surveying the Developer Experience of Flaky Tests, Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn, 2022)
When developers lose trust in the tests themselves, you’re then investing significant effort in running tests that are ultimately ignored, so what’s the point?
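One way to turn “a focused target” into an actual list of test names is to look for tests that both passed and failed against the same commit. The sketch below is illustrative only: it assumes you can export results as CSV rows of commit, test name, and outcome from wherever your CI stores them; the file name and format are made up for the example.

```kotlin
// rank-flaky-tests.main.kts: an illustrative sketch, not tied to any particular tool.
// Assumption: test outcomes can be exported as CSV rows of "commit,testName,outcome",
// where outcome is "passed" or "failed".
import java.io.File

data class Run(val commit: String, val test: String, val outcome: String)

val runs = File("test-outcomes.csv").readLines()
    .map { it.split(",") }
    .filter { it.size == 3 }
    .map { (commit, test, outcome) -> Run(commit, test, outcome) }

// A test is behaving flakily on a commit if it both passed and failed
// against that same, unchanged code.
val flakyCounts = runs
    .groupBy { it.test to it.commit }
    .filterValues { results ->
        results.any { it.outcome == "failed" } && results.any { it.outcome == "passed" }
    }
    .keys
    .groupingBy { (test, _) -> test }
    .eachCount()

// Print the handful of tests responsible for most of the flakiness.
flakyCounts.entries
    .sortedByDescending { it.value }
    .take(10)
    .forEach { (test, commits) -> println("$test was flaky on $commits commit(s)") }
```

If the ranking keeps pointing at the same handful of tests, that is your repair backlog; if the failures cluster by agent or by time of day instead, the problem is more likely infrastructure than test code.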
3. Security is “fine”

Finally, we often assume our pipeline is secure until disaster strikes. Without proper visibility, you won’t know if an unexpected dependency is sneaking into a project that you didn’t even realize was being built in CI.
The critical truth is that without observability, you can’t know your pipeline is secure. Observability allows you to verify that dependencies are what you expect, providing the kind of pervasive security required today.
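As one concrete mechanism, and only as a minimal sketch, Gradle’s dependency locking records the exact set of resolved dependencies so that later drift fails the build instead of slipping through silently:

```kotlin
// build.gradle.kts: a minimal dependency-locking sketch.
// Lock state is generated once with: ./gradlew dependencies --write-locks
// Afterwards, any resolution that doesn't match the recorded gradle.lockfile fails
// the build, so an unexpected or rogue dependency surfaces as a hard error rather
// than a surprise.
dependencyLocking {
    lockAllConfigurations()
}
```

Gradle’s dependency verification metadata (checksums and, optionally, signatures recorded in gradle/verification-metadata.xml) goes a step further and checks that each artifact’s content is what you expect, not just its coordinates. Both only pay off if the observability is there to tell you which projects and pipelines are missing them.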
Observability is the foundation for DORA metrics
Ultimately, all this measurement relates directly to business outcomes that engineering teams care deeply about—specifically, DORA metrics.
Observability of your continuous delivery pipeline can help you improve all four key DORA metrics, and more:
- Shorter builds lead to faster Lead Time.
- Reliable pipelines lead to higher Deployment Frequency.
- Catching and fixing flakiness and failures lowers your Change Failure Rate.
- Faster troubleshooting lowers your Mean Time to Recover (MTTR).
- Complete visibility into your builds, tests, and dependencies delivers the Pervasive Security and audit readiness we need.
On the Develocity team here at Gradle, we used to talk about acceleration and troubleshooting as pillars of improvement, but we now understand that observability is the essential foundation for everything.

Start measuring today
If you’re serious about performance, productivity, and compliance in your large software organization, you need to start by measuring what is actually there. So if you’re not already doing it, my key takeaway is this: Start measuring things that matter to your path to production today.
Stop guessing and start measuring. It all starts with observability.
Learn more about our observability platform: Develocity 360.