There is nothing more frustrating than a build that fails for no apparent reason, only to pass five minutes later after you hit ‘Re-run’. I’ve spent countless hours staring at GitHub Actions logs, trying to figure out why a test failed in the CI environment but passed perfectly on my local MacBook Pro. This is the hallmark of a flaky test—a test that provides non-deterministic results without any change to the code.

If you’re looking for a guide to resolving flaky tests in CI/CD, you’ve likely already realized that ignoring them is a dangerous game. Flaky tests create a ‘crying wolf’ effect: developers start ignoring real regressions because they assume the failure is just ‘the usual flakiness’. This guide is based on my experience managing pipelines for distributed systems, where race conditions and network latency are constant enemies.

The Fundamentals: Why Tests Flake

Before we can fix the problem, we have to understand the ‘why’. In my experience, flakiness almost always stems from one of three categories: asynchronicity, shared state, or environmental instability.

1. Asynchronicity and Race Conditions

This is the most common culprit in frontend and integration tests. You’re waiting for an API call to return or a DOM element to appear, and you’ve used a fixed timeout (e.g., sleep(2000)). On a fast local machine, 2 seconds is plenty. In a congested CI runner, the API might take 2.1 seconds, and your test fails.
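
To make the anti-pattern concrete, here is a minimal sketch of the kind of test that passes locally but fails on a congested runner (the render and DOM helpers are hypothetical, for illustration only):

// Anti-pattern: a fixed sleep assumes the API always responds within 2 seconds.
test('shows the user profile after loading', async () => {
  renderProfilePage();                                      // hypothetical helper that kicks off the API call
  await new Promise(resolve => setTimeout(resolve, 2000));  // fails whenever the call takes 2.1 seconds
  expect(getProfileName()).toBe('Ada Lovelace');            // hypothetical helper that reads the DOM
});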

2. Shared State and Order Dependency

Tests should be atomic. However, if Test A creates a user in a database and Test B expects that database to be empty, the outcome depends entirely on the order in which the test runner executes them. If you start sharding tests in GitHub Actions to speed up your suite, you often uncover these hidden dependencies because tests are now running in parallel across different machines.
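
As a rough illustration (the db helper and table are made up), this is the kind of coupling that parallel sharding exposes: each test passes in the ‘right’ order, but Test B breaks the moment it runs first or lands on a different shard:

// Test A: leaves a row behind in the shared database.
test('creates a user', async () => {
  await db.insert('users', { email: 'a@example.com' }); // hypothetical db helper
  expect(await db.count('users')).toBe(1);
});

// Test B: silently depends on Test A never having run (or having cleaned up after itself).
test('starts with an empty user table', async () => {
  expect(await db.count('users')).toBe(0); // fails if Test A ran first on this shard
});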

3. Environmental Leaks

CI runners are often resource-constrained. CPU throttling or memory limits can cause timeouts that don’t happen locally. I’ve seen cases where a test fails only when it runs on a specific runner image because of a missing system dependency or a slight difference in the Node.js runtime version.
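
When I suspect an environmental leak, the first step is making the environment visible. A minimal sketch (wired up as a Jest globalSetup file, which is an assumption about your setup) simply prints the runtime details so you can diff the CI run against your laptop:

// jest.globalSetup.js (assumed filename, referenced from the "globalSetup" option in jest.config.js):
// log the runtime so CI and local runs can be compared side by side.
const os = require('os');

module.exports = async () => {
  console.log(`Node: ${process.version}`);
  console.log(`Platform: ${process.platform} (${os.release()})`);
  console.log(`CPUs: ${os.cpus().length}, free memory: ${Math.round(os.freemem() / 1024 / 1024)} MB`);
};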

Deep Dive: Strategies for Isolation

The Quarantine Pattern

The first rule of flaky test resolution is: do not let flaky tests block the pipeline. Once a test is identified as flaky, I move it to a ‘quarantine’ suite. This suite still runs, but its failure does not break the build. This allows the team to maintain velocity while a developer investigates the root cause.
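
There are several ways to wire this up; a minimal sketch with Jest (assuming you adopt a `__quarantine__` directory convention, which is my own naming) is to exclude quarantined specs from the blocking run and execute them in a separate, non-blocking CI step:

// jest.config.js: the main (blocking) run skips anything under __quarantine__.
module.exports = {
  testPathIgnorePatterns: ['/node_modules/', '/__quarantine__/'],
};

// A separate CI step, marked as allowed to fail, can then run only the quarantined specs,
// e.g.: npx jest __quarantine__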

Stress Testing (The Loop Method)

How do you prove a test is flaky? I use a simple bash loop to run the suspect test 100 times in a row. If it fails even once, you’ve confirmed the flakiness.

# Example: Running a specific Jest test 100 times
for i in {1..100}; do
  # The extra "--" passes the file path through npm to Jest.
  npm test -- tests/user_service.test.js || { echo "Attempt $i: Fail"; break; }
  echo "Attempt $i: Pass"
done
Terminal output showing a flaky test failing intermittently during a bash loop

If the loop breaks, the test is flaky. If it passes 100 times, the issue might be related to the CI environment rather than the test logic itself.

Implementation: Fixing the Flakiness

Replacing Sleep with Polling

Stop using sleep(). Instead, use a polling mechanism that checks for a condition and times out only after a reasonable ceiling. Most modern frameworks like Playwright or Cypress do this by default, but for custom integration tests, I recommend a utility like this:

// Polls conditionFn every 100ms and resolves as soon as it returns true,
// failing only once the overall timeout ceiling is reached.
async function waitForCondition(conditionFn, timeout = 5000) {
  const start = Date.now();
  while (Date.now() - start < timeout) {
    if (await conditionFn()) return true;
    await new Promise(resolve => setTimeout(resolve, 100)); // short pause between polls
  }
  throw new Error("Condition not met within timeout");
}
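
Used in a test, it replaces the fixed sleep with an explicit condition (the API helpers here are hypothetical):

test('user appears in the list after creation', async () => {
  await createUser({ name: 'Ada' });                                      // hypothetical API helper
  await waitForCondition(async () => (await fetchUsers()).length === 1);  // polls instead of sleeping
});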

Database Isolation per Test

To solve shared state issues, avoid sharing a single database instance. I prefer using database transactions that roll back after every test or using unique IDs (UUIDs) for every piece of data created during a test run. This ensures that Test A and Test B never touch the same row.
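
Here is a minimal sketch of the unique-ID approach (the `db` helper and the function under test are hypothetical; `crypto.randomUUID()` is built into modern Node.js). Every test works only against rows it created itself, so ordering and parallelism stop mattering:

const { randomUUID } = require('crypto');

test('activates a user', async () => {
  const email = `user-${randomUUID()}@example.com`;               // unique per test run, so no collisions
  const user = await db.insert('users', { email });               // hypothetical db helper
  await activateUser(user.id);                                     // hypothetical function under test
  expect((await db.find('users', user.id)).active).toBe(true);
});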

When you implement these fixes, you’ll likely notice your total test time increasing. To mitigate this, I highly recommend learning how to reduce CI/CD build time for tests by optimizing your setup scripts and caching dependencies.

Principles for Long-Term Stability

Tooling Recommendations

Depending on your stack, these tools can help you track and resolve flakiness:

| Tool | Best For | Key Feature |
| --- | --- | --- |
| TestRetry | General CI | Automatically retries failed tests (use sparingly!) |
| Playwright | E2E Testing | Auto-waiting and trace viewers for debugging |
| Datadog CI Visibility | Enterprise | Tracks flakiness trends across thousands of builds |

While retrying tests is a common “quick fix,” be careful. If you rely on retries too much, you’re just hiding the flakiness rather than resolving it, which eventually leads to a bloated, slow pipeline.
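
If you do reach for retries, scope them to the individual quarantined spec rather than the whole suite. In Jest (with the default jest-circus runner), that looks roughly like this; the helpers in the test body are hypothetical:

// Only this known-flaky spec gets retried; everything else still fails fast.
jest.retryTimes(2);

test('search index eventually returns the new document', async () => {
  await indexDocument({ id: 'doc-1' });               // hypothetical helpers for illustration
  expect(await searchFor('doc-1')).toHaveLength(1);
});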

Final Thoughts

Resolving flaky tests is a game of detective work. It requires a shift in mindset from “it works on my machine” to “it must work deterministically under load.” By isolating the tests, using polling instead of sleeps, and ensuring state isolation, you can reclaim the trust your team has in your CI/CD process.