Developer Productivity Engineering Blog

Preventing Flaky Tests from Ruining your Test Suite

Do you hate flaky tests? If your application interacts with browsers, external services, or has asynchronous behavior, it’s likely your team has suffered from flaky tests.

Non-deterministic tests are a ruinous infection that wastes developer time and reduces confidence in your test suite.

In this webinar Eric Wendelin, Analytics lead at Gradle, shows you:

  • How to prevent flaky tests from hindering your development lifecycle
  • New flaky test detection in Gradle and Maven build scans
  • Reports and analysis in Develocity to aid you in prioritizing and eliminating flaky tests

Thanks to everyone who attended the webinar!

Transcript

My name is Eric Wendelin. I am the lead Analytics engineer for Gradle. Today I’m going to give a 45-minute session on “Preventing flaky tests from ruining your test suite”. After that I will take questions for 15 minutes.

If you suffer from flaky tests, you are not alone. Even Google has reported that they have a “continual rate of 1.5% of all test runs reporting a flaky result”. Everyone is suffering from flaky tests, but not many good or standard solutions for rectifying them exist yet. So today…

Objectives

I want to show you tools and techniques to help you avoid wasted time and reduced confidence in your testing that arises from flaky tests.

I will show you methods to prevent unnecessary failed builds, but that won’t be enough, so we’ll look at tools we’ve developed at Gradle that help you proactively detect, prioritize, and fix flaky tests.

But first, I want to explain why I believe that flaky tests are an area worth investing in.

Flaky Test Definition

A Flaky Test is a test that reports success and failure given the “same” execution environment.

Flaky tests are costly

Now obviously, flaky tests waste time directly by causing unnecessary build failures. It saddens me to see generally excellent engineers in rerun-and-suffer mode, because their flaky tests often hold their CI changes hostage.

Less obvious is the degree to which flaky tests poison your culture by reducing confidence in your test suite. Even great engineers and teams are not immune to flaky tests if they are not tracked or fixed.

By the way, this is a chart I made from Gradle’s own release pipeline 4 years ago.

Reporting is the vehicle through which we can change our culture

As soon as we could quantify the problem, engineers and leaders stepped up to make changes.

Immediately thereafter we worked to reduce flaky tests, and we saw significant improvement after fixing just the worst 10% of tests.

Visibility informed our decisions to act

And you know what? After we saw improvement, the team’s culture of testing also improved. New flaky tests became quick targets, and as a result the team now builds 4x more often.

Alright, I’ve said my piece about why you should eliminate flaky tests; now I’ll begin to tell you how.

Detecting flaky tests

I want to note up front that I’m taking a JVM-oriented approach today because I think that fits the audience best. But many of the concepts apply to most sophisticated applications using mainstream tech stacks.

Common causes of test flakiness

We will start by studying what makes tests non-deterministic.

  • Time: use of a non-monotonic clock
  • Isolation: failure to clean up DB tables or the filesystem
  • Isolation: other resource leaks, such as file handles or server connections
  • Infrastructure: flaky services, browsers, even dynamic library dependencies

Poll on build failure emails

Flaky test mitigation strategies

When you encounter a flaky test, what do you do? You get an email and you know the test had nothing to do with your changes. People just want to ship their thing. Very few can be bothered to investigate “someone else’s” test failure. Ignoring the failure is what most engineers do, but it’s not a mitigation strategy.

Now, if the change is really blocking you, you might “temporarily” disable the test and you might even file an issue to fix it. Disabling the test will surely unblock you, but what you’ve probably done is reduce test coverage that your team will need in the future.

Some teams use JUnit Retry and similar test-framework-specific mechanisms. This is likely to unblock you, but it is just as reactive as disabling the test, because you have to make a change for every flaky test you discover.
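
For illustration, here is a minimal sketch of that approach as a JUnit 4 rule written in Kotlin; the RetryRule class, its attempt count, and its logging are hypothetical rather than taken from any particular library:

```kotlin
import org.junit.rules.TestRule
import org.junit.runner.Description
import org.junit.runners.model.Statement

// Hypothetical JUnit 4 rule that re-runs a failing test a few times before
// reporting the failure. Attach it to a test class with @Rule.
class RetryRule(private val maxAttempts: Int = 3) : TestRule {
    override fun apply(base: Statement, description: Description): Statement =
        object : Statement() {
            override fun evaluate() {
                var lastFailure: Throwable? = null
                repeat(maxAttempts) { attempt ->
                    try {
                        base.evaluate() // run the test body
                        return          // it passed, stop retrying
                    } catch (t: Throwable) {
                        lastFailure = t
                        println("${description.displayName}: attempt ${attempt + 1} failed, retrying")
                    }
                }
                throw lastFailure!! // all attempts failed, surface the last failure
            }
        }
}
```

Note that every test that needs this treatment has to opt in, which is exactly the per-test overhead described above.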

Other teams execute retries at the build level. This is likely to prevent a blocked CI pipeline because it proactively retries newly-introduced flaky tests, but is very dangerous if your team does not track and fix flakiness. Later I will talk about tracking and fixing flaky tests.

Common methods of detecting flakiness

In my research I have found the following to be common methods engineers and teams use to detect a flaky test.

  • Sometimes it’s obvious to engineers that a test is flaky just by looking at the exception type and message
  • Some systems track git commits and will mark a test as flaky if a test executed in multiple builds against the same commit has different outcomes
  • Specialized systems can take a git patch and a failed test, and determine whether the patch could have caused the test failure
  • Most organizations run a test multiple times and compare the test outcomes
  • Finally, some systems try to guess at flakiness by analyzing the test outcome history regardless of other factors

Flaky test detection mechanisms should avoid false positives

It is paramount that, whatever detection method you employ, you avoid false positives as much as possible. False negatives are undesirable but not deal-breakers.

What happens when your flaky detection system is itself flaky? You ignore it!

Analyzing test outcome history is not a reliable heuristic for predicting flakiness.

Similarly, multiple builds from the same git SHA are often not sufficiently similar to reliably make a “flaky” determination. A CI machine itself may cause flakiness in a number of tests if it has a bad disk or bad network configuration.

An ideal solution may be a combination of static analysis, stack trace parsing, and rerunning tests.

  • Proving that a test failure is unrelated to a given changeset is extremely difficult to do reliably, since teams and projects are highly diverse and always changing.
  • Similarly, semantic analysis of exception type and message is difficult to do broadly.
  • Nearly every flaky test detection system I’ve come across has test retry in some form. It’s simple to understand and its predictions are reliable.

Detecting flakiness using retries

That is why I advocate for detecting flakiness using retries, and for tracking it.

New Gradle Test Retry Plugin

Test retry has been built into Maven Surefire and Failsafe for a while, and Gradle has recently introduced a Test Retry plugin.

There are 3 especially neat things about Gradle’s new Test Retry plugin (a configuration sketch follows this list):

  1. You can control whether the build passes or fails after a flaky test is encountered. This means that you can detect flaky tests AND NOT silently ignore them, if you so choose.
  2. You can disable test retry after a discrete number of test failures. You won’t waste time if there is a major problem causing many tests to fail.
  3. Tests are retried at the method level or finer, whereas Maven retries all failed test classes.
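
As a rough sketch, here is how those three options might be configured in a Gradle Kotlin DSL build script; the plugin version and the specific numbers are assumptions, so check the plugin documentation for current values:

```kotlin
plugins {
    java
    id("org.gradle.test-retry") version "1.5.8" // version is an assumption; use the latest release
}

tasks.test {
    retry {
        maxRetries.set(2)                // retry a failed test up to 2 more times
        maxFailures.set(10)              // give up retrying once 10 tests have failed (likely a real breakage)
        failOnPassedAfterRetry.set(true) // still fail the build when a test only passed on retry
    }
}
```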

Flaky test reporting in build scans

Reporting for multiple test executions isn’t handled well in most test reports yet, however.

JUnit, Jenkins, and even the Gradle CLI will show duplicate test executions and do not have a concept of a flaky test, so it is difficult to report on flaky tests.

While we work with other tools to make flaky tests a first-class concept, we have improved Gradle and Maven build scans to collate multiple executions of the same test and mark a test as FLAKY when there is at least one failed and at least one passed outcome.

You can click on failed test executions to inspect their streams.

Test retry by itself is just a fancy way of ignoring tests.

As I said before you also have to prioritize fixing the root causes of flakiness.

So let’s move on to tracking and fixing flaky tests.

Poll: What data do you currently collect about your flaky tests?

Prioritizing flaky tests using Develocity

I’m going to demonstrate to you how Develocity will help you track and fix flaky tests for Maven and Gradle projects.

This is Develocity. It allows you to collect and analyze persistent and sharable build scans for all Maven and Gradle builds, including local ones. With this data you can drastically improve your build performance and reliability. I’m going to take a deep dive into the test analytics we’ve recently introduced and show you just how effective test retry paired with analysis can be.

Let’s start by looking at all of the build failures on CI for the past 10 days or so by going to the Failures Dashboard in Develocity. There have been several hundred failures but they have significantly reduced the past few days. Develocity does semantic analysis on failures and clusters failures by their root cause. You can see that the overwhelming majority of failures are caused by test failures.

Now let’s look at the Tests Dashboard in order to understand why that is. The first thing we see is about 100 builds with test failures per day up until a few days ago, at which point they nearly stop. But then if you look at the chart on the right you see that flaky tests were at 0 but all of a sudden there are several dozen builds with flaky outcomes per day. Can you guess what happened?

The overwhelming majority of build failures were caused by flaky tests! When the Test Retry plugin was adopted, all of these builds provisionally passed and now we see a prioritized list of test classes that had the most flaky outcomes in our selected set of builds. Can you just imagine: 3% of all of your builds failing randomly due to flaky tests. Remember that CI pipelines typically have many stages… the likelihood of you getting an email for a test failure you didn’t cause was very high. No more.

But now the real work begins because we have to fix these flaky tests. Let’s click on a class to investigate. Flaky tests in the same class are often related, so we group them together. Now we see 2 methods here are flaky the same number of times. I would guess that they are related, and something common to both is causing flakiness. Let’s click on one to analyze the builds it’s participated in.

Guidance for fixing test flakiness

I wish I had a silver bullet for actually fixing flaky tests, but I don’t.

What I do have for you are approaches to common causes of flakiness that I hope will inspire you when you are fixing flaky tests.

Run unstable tests separately

Once you discover flaky tests, you still want them to run, but perhaps you run them in separate jobs to avoid disrupting otherwise good changes.

Using Gradle, you can quarantine tests using test filters. Here’s an example of how a Gradle Kotlin DSL script can exclude quarantined tests on normal test runs but run them in a separate task.
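
A minimal sketch of that idea, assuming the java plugin is applied and quarantined test classes follow a hypothetical naming convention ending in QuarantinedTest:

```kotlin
// Exclude quarantined tests from the normal test task.
tasks.test {
    filter {
        excludeTestsMatching("*QuarantinedTest")
    }
}

// Run only the quarantined tests in a separate task, so they still execute
// without blocking otherwise good changes.
val quarantinedTest by tasks.registering(Test::class) {
    testClassesDirs = sourceSets["test"].output.classesDirs
    classpath = sourceSets["test"].runtimeClasspath
    filter {
        includeTestsMatching("*QuarantinedTest")
    }
}
```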

An advantage of this is that you could run a small set of quarantined tests very, very often and more quickly detect when a test has stabilized.

Improper time handling

System.currentTimeMillis() is not monotonic! Use System.nanoTime() for calculating durations. Keep in mind, though, that System.nanoTime() is a more expensive call.
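
A tiny runnable illustration of measuring a duration with the monotonic clock; doWork() is just a hypothetical stand-in for whatever is being timed:

```kotlin
// Hypothetical stand-in for the work being timed.
fun doWork() = Thread.sleep(25)

fun main() {
    // Wall-clock time can jump backwards (NTP corrections, manual changes), so
    // durations should come from the monotonic System.nanoTime(), never currentTimeMillis().
    val start = System.nanoTime()
    doWork()
    val elapsedMillis = (System.nanoTime() - start) / 1_000_000
    println("took ${elapsedMillis}ms")
}
```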

Use and wrap java.time.Clock during testing for deterministic time management.
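
For example, a class that reads the time from an injected Clock can be tested against a fixed clock; the TokenService below is hypothetical:

```kotlin
import java.time.Clock
import java.time.Duration
import java.time.Instant
import java.time.ZoneOffset

// Hypothetical service that reads the current time from an injected Clock
// instead of calling Instant.now() directly.
class TokenService(private val clock: Clock = Clock.systemUTC()) {
    fun isExpired(issuedAt: Instant, ttl: Duration): Boolean =
        Instant.now(clock).isAfter(issuedAt.plus(ttl))
}

fun main() {
    // In tests, a fixed clock makes the outcome fully deterministic.
    val fixedClock = Clock.fixed(Instant.parse("2020-06-01T00:00:00Z"), ZoneOffset.UTC)
    val service = TokenService(fixedClock)
    val issuedAt = Instant.parse("2020-05-31T23:00:00Z")
    println(service.isExpired(issuedAt, Duration.ofMinutes(30))) // always true
}
```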

Lack of isolation

Failure to clean up the database — run your tests inside a DB transaction and roll back at the end of the test.
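
A framework-agnostic sketch of the idea using plain JDBC; the in-memory H2 URL is just a placeholder:

```kotlin
import java.sql.Connection
import java.sql.DriverManager

// Run a test's database work inside a transaction that is always rolled back,
// so no rows leak into the next test.
fun <T> inRolledBackTransaction(block: (Connection) -> T): T =
    DriverManager.getConnection("jdbc:h2:mem:testdb").use { conn -> // placeholder JDBC URL
        conn.autoCommit = false
        try {
            block(conn)
        } finally {
            conn.rollback() // discard everything the test wrote
        }
    }
```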

Other types of leaks — poor error handling in setup/teardown is all too common. In-memory file systems may help prevent file leaks that would otherwise affect other tests.

Dependency on flaky infrastructure

Flaky external services — Run tests that integrate with external services separately, or consider using WireMock to record and play back requests/responses.
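
A minimal sketch using WireMock’s stubbing API; the endpoint and payload are made up for illustration:

```kotlin
import com.github.tomakehurst.wiremock.WireMockServer
import com.github.tomakehurst.wiremock.client.WireMock.aResponse
import com.github.tomakehurst.wiremock.client.WireMock.get
import com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo
import com.github.tomakehurst.wiremock.core.WireMockConfiguration.options

fun main() {
    // Stand up a local stub in place of the flaky external service.
    val server = WireMockServer(options().dynamicPort())
    server.start()
    server.stubFor(
        get(urlEqualTo("/rates/latest")) // hypothetical endpoint
            .willReturn(aResponse().withStatus(200).withBody("""{"usd": 1.08}"""))
    )
    // Point the code under test at the stub instead of the real service.
    println("Stubbed service running at http://localhost:${server.port()}")
    server.stop()
}
```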

Dynamic library dependencies — Use Gradle dependency locking or otherwise avoid dynamic dependency declarations. This also has the benefit of making your builds reproducible.
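
Enabling dependency locking in a Gradle Kotlin DSL build script is a one-liner:

```kotlin
// Lock every configuration so dynamic versions (e.g. "1.+") resolve identically on every build.
dependencyLocking {
    lockAllConfigurations()
}
```

Lock files are then written by running a resolution task with --write-locks, for example ./gradlew dependencies --write-locks, and checked into version control.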

Performance profiling tests — Use the Mann-Whitney U test with a high confidence threshold. Be aware of non-obvious differences between test runs/instances.
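
As a sketch of what that comparison could look like with Apache Commons Math; the sample timings below are made up:

```kotlin
import org.apache.commons.math3.stat.inference.MannWhitneyUTest

fun main() {
    // Hypothetical timings (ms) from a baseline run and a candidate run.
    val baseline = doubleArrayOf(101.0, 98.5, 103.2, 99.8, 102.1, 100.4)
    val candidate = doubleArrayOf(118.9, 121.3, 117.5, 120.0, 119.2, 122.8)

    // Two-sided p-value for the null hypothesis that both samples come from the
    // same distribution; only flag a regression at high confidence and only if
    // the candidate is actually slower.
    val pValue = MannWhitneyUTest().mannWhitneyUTest(baseline, candidate)
    val regression = candidate.average() > baseline.average() && pValue < 0.01
    println("p = $pValue, regression detected: $regression")
}
```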

Funny story: we once experienced performance test instability due to high CPU temperatures.

Conclusion

In summary, retrying your tests in the exact same environment is a simple and effective way to ease the direct pain caused by unnecessary failed builds due to flaky tests.

To ensure app stability and a good culture around testing, you need to prioritize and fix flaky tests.

Develocity will help you by collecting test history across all builds, not just CI, and by giving you powerful analysis tools to understand how many flaky tests you have, how often each one is unstable, what flaky runs have in common, and how they differ from stable runs.

Don’t forget about Gradle’s new officially supported Test Retry plugin. If you’d like to learn more, check out my blog posts on flaky test analysis on the Gradle blog.

I hope you found this session inspiring and informative. Thanks for your time.