Developer Productivity Engineering Blog

Do You Regularly Schedule ‘Flaky Test Days’?

Each Develocity engineering project team is responsible for monitoring its own flaky tests. When enough flaky tests have accumulated in the test suite, the team meets for a ‘Flaky Test Day’ to identify and fix them. In this post, we explain why flaky test days matter, when to schedule them and some of the Develocity development team’s favorite practices for identifying, prioritizing and fixing flaky tests.

What’s a flaky test?

A flaky test is a non-deterministic test: in practice, it produces inconsistent results for the same code across different developers and/or build environments. Flaky tests are a bane to developer productivity, especially when they linger undetected in an organization’s test suite, causing stress and frustration for engineers that often goes unreported. They make it harder for developers to separate signal from noise, which means that real bugs are harder to detect. In some cases, an unreliable test is the result of flaky production code, which widens the area that needs to be investigated to find the root cause.
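To make the definition concrete, here is a minimal Python sketch that runs the same test several times against unchanged code and calls it flaky when the outcomes disagree. The helper `classify_test` and the sample test are hypothetical names invented for this illustration; this is not how Develocity detects flakiness.

```python
from collections import Counter
from typing import Callable


def classify_test(test: Callable[[], None], runs: int = 10) -> str:
    """Run one test repeatedly against unchanged code and classify the result."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outcomes = Counter()
    for _ in range(runs):
        try:
            test()
            outcomes["passed"] += 1
        except AssertionError:
            outcomes["failed"] += 1
    # Mixed outcomes for identical code are the mark of a flaky test.
    if outcomes["passed"] and outcomes["failed"]:
        return "flaky"
    return "passed" if outcomes["passed"] else "failed"


# Hypothetical test that depends on hidden shared state: every third call fails.
_calls = {"count": 0}


def every_third_call_fails() -> None:
    _calls["count"] += 1
    assert _calls["count"] % 3 != 0


print(classify_test(every_third_call_fails))  # flaky
```

The sample test fails deterministically on every third call so the sketch stays reproducible. Real flakiness usually comes from timing, ordering or environment differences, which are much harder to trigger on demand.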

Why is a Flaky Test Day needed?

There’s only so much time in a day and an engineer’s time is highly valuable, so it makes sense to address the most impactful flaky tests first. However, this can result in an accumulation of lower-priority flaky tests. Flaky tests can be time-intensive to fix, and fixing them can become a burdensome chore for development teams. As with most burdens, sharing the load with a team, along with the knowledge and the fixes discovered, makes the work a lot easier.

This is where the flaky test day comes in. By scheduling a day to focus on nothing else, the team makes an investment that pays off in productivity for many future development cycles. The goal of flaky test day is to burn down this technical debt before it grows further and to give the dev teams a reliable starting point for ongoing development and testing.

“A flaky test does not always mean a problem with what the test does. It can also mean a problem in your production code.”

Etienne Studer, VP Engineering, Develocity

When should a dedicated flaky test day be scheduled?

This depends on your engineering team’s release cycle. The Develocity team schedules flaky test days as needed, when accumulated flaky tests as a whole begin to noticeably impact developer productivity. This is usually done between release cycles and is an eight-hour session that focuses exclusively on identifying and fixing flaky tests that couldn’t be addressed or simply weren’t prioritized before the last release.

By keeping this cadence, the team ensures that all flaky tests are addressed before starting a new release cycle. This timing is very important as the elimination of flaky tests from the build prior to working on another release ensures that the team is continuously working from a reliable starting point.

“Flakiness reduces your trust in the test” 

Jim Hurne, Develocity Engineering Team

The role of flaky test day in the overall flaky test management strategy

While a Gradle Build Tool or Develocity release is being worked on under normal conditions, the tests that get prioritized for troubleshooting are the ones that are very flaky or have a significant impact on developer productivity. If a test is ‘very flaky,’ has become flaky recently, or has an immediate and noticeable impact on productivity, the team will address it immediately. If a test is not flaky often enough, or its flakiness is not felt in an impactful way, it may or may not get noticed.

The Develocity team determines the impact/severity of such flaky tests based on the following criteria: 

  • A test exhibiting new flaky behavior
  • Surface area of code that is being impacted
  • Frequency that the test is executed 
  • Complaints from engineers on flakiness

So, what happens to flaky tests that are only flaky 5% of the time? Or what about one that’s 20% flaky, but is not frequently executed or not in an impactful area of the code? These tend to get put on the back burner, where they accumulate until flaky test day.
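The triage described above can be sketched as a small decision function. Everything in this Python example is an assumption made for illustration: the `FlakyTestRecord` fields, the `triage` function and especially the threshold value are invented, and the Develocity team has not published a formula like this.

```python
from dataclasses import dataclass

# Hypothetical threshold: expected flaky runs per week tolerated in a
# high-impact area before the test is fixed right away.
MAX_TOLERATED_FLAKY_RUNS_PER_WEEK = 20.0


@dataclass(frozen=True)
class FlakyTestRecord:
    name: str
    flaky_rate: float        # fraction of runs that were flaky, 0.0 to 1.0
    runs_per_week: int       # how frequently the test is executed
    high_impact_area: bool   # large surface area of code is affected
    newly_flaky: bool = False
    complaints: int = 0      # reports from engineers about this test


def triage(record: FlakyTestRecord) -> str:
    """Return 'fix now' or 'flaky test day' for one flaky test."""
    # New flaky behavior and engineer complaints are handled immediately.
    if record.newly_flaky or record.complaints > 0:
        return "fix now"
    # Otherwise weigh how often the flakiness is actually felt.
    expected_flaky_runs = record.flaky_rate * record.runs_per_week
    if record.high_impact_area and expected_flaky_runs >= MAX_TOLERATED_FLAKY_RUNS_PER_WEEK:
        return "fix now"
    return "flaky test day"


# The two cases from the text, with made-up execution counts.
print(triage(FlakyTestRecord("only 5% flaky", 0.05, 100, True)))
print(triage(FlakyTestRecord("20% flaky, rarely run", 0.20, 40, False)))
```

With these assumed numbers, both examples land in the flaky test day backlog, which matches the accumulation the text describes. A team adopting something like this would need to pick its own threshold and signals.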

Using Develocity to prioritize and fix flaky tests 

Jim Hurne, Senior Software Engineer from the Develocity Engineering Team, says that his favorite feature in Develocity is the ability to search for the test class you are working on and use the Test Dashboard filtering to view all tests that have failed or are identified as flaky within that test class. Then you can prioritize those based on percentage of flakiness and start the investigation and fixing process. 
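The same workflow (narrow down to one test class, keep the tests that failed or were flaky, then order them by percentage of flakiness) can be approximated offline. The Python sketch below works on hand-written sample records; `OutcomeStats`, `rank_flaky_tests` and the data are hypothetical and do not represent the Develocity Test Dashboard or its API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class OutcomeStats:
    test_class: str
    test_name: str
    total_runs: int
    flaky_runs: int
    failed_runs: int

    @property
    def flaky_percent(self) -> float:
        # Guard against tests that have not run at all.
        if self.total_runs == 0:
            return 0.0
        return 100.0 * self.flaky_runs / self.total_runs


def rank_flaky_tests(stats: Iterable[OutcomeStats], test_class: str) -> List[OutcomeStats]:
    """Keep failed or flaky tests of one class, most flaky first."""
    selected = [
        s for s in stats
        if s.test_class == test_class and (s.flaky_runs > 0 or s.failed_runs > 0)
    ]
    return sorted(selected, key=lambda s: s.flaky_percent, reverse=True)


# Made-up sample data for illustration only.
sample = [
    OutcomeStats("CheckoutTest", "appliesDiscount", 200, 10, 0),
    OutcomeStats("CheckoutTest", "rendersSummary", 200, 40, 2),
    OutcomeStats("CheckoutTest", "calculatesTax", 200, 0, 0),
    OutcomeStats("LoginTest", "rejectsBadPassword", 150, 30, 0),
]

for s in rank_flaky_tests(sample, "CheckoutTest"):
    print(f"{s.test_name}: {s.flaky_percent:.1f}% flaky")
```

Sorting by percentage rather than by raw count keeps rarely executed tests comparable with frequently executed ones, which is one reasonable reading of "percentage of flakiness" in the text.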

“What gets measured gets improved.”

Peter Drucker, Business Luminary

To illustrate this process, here is the Test Failure Analytics dashboard in Develocity:

If flakiness spikes, it can be investigated and attacked. It’s easy to start drilling down into a set of flaky tests:

You can then drill down on one flaky test, and identify the probable root cause (highlighted):

From there you can search for a particular error and discover useful information or insights on how best to troubleshoot the flaky test:

In the above example, the test was flaky because of the viewport on CI agents. Keep in mind that all of this feedback came from troubleshooting a single flaky test. Imagine if your whole engineering team focused on troubleshooting flaky tests for one day! What improvements would your engineering team see if your organization introduced a dedicated flaky test day?

The Bottom Line: Flaky Test Days Pay Off

Flaky tests are poison to the overall productivity of a team, and individual ones are often overlooked when they don’t create enough impact to be addressed. Over time, this debt can accumulate into a test corpus that’s simply not trusted by the development team, leading to developer frustration and even a reduction in code quality when these tests are ignored. Addressing only the most impactful flaky tests is usually not a comprehensive enough strategy. The overall flakiness of the build system will compound, so taking what amounts to just a few days a year to address these issues proactively has consistently proven to be well worth the time.
