I’ve been in a few situations where a performance test looked great on my local machine, only for the production environment to buckle under actual load. The problem wasn’t the code—it was the test. My single-machine Locust setup had hit a CPU bottleneck, meaning I wasn’t testing the server; I was testing the limits of my own laptop. This is where distributed load testing with Locust becomes non-negotiable.
If you’ve already compared Locust vs. JMeter for performance testing, you know that Locust’s event-driven nature makes it incredibly efficient. However, even with gevent, a single Python process can only handle so many concurrent users before it starts dropping requests or reporting skewed latency. To simulate truly massive scale, you need to distribute the load.
The Challenge: The Single-Node Bottleneck
In a standard Locust run, one process handles both the web UI and the simulation of users. As you scale to thousands of users, the overhead of managing those simulated users consumes the CPU. When the CPU hits 100%, the ‘think time’ between requests increases, and your response times appear higher than they actually are. You’re no longer measuring your API’s latency; you’re measuring your test runner’s struggle.
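A quick way to confirm whether you’re in this trap is to watch the load generator’s own CPU while the test runs. Here’s a minimal sketch, assuming a Linux box with the test already in progress:

# Watch CPU usage of every running Locust process, refreshing each second.
# If these sit near 100%, your latency numbers are suspect.
top -p "$(pgrep -d, -f locust)"

Locust itself also logs a warning when one of its processes climbs above roughly 90% CPU, which is a strong hint to add capacity before trusting the results.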
To solve this, Locust uses a Master-Worker architecture. The Master node manages the state, collects statistics, and provides the Web UI, while multiple Worker nodes execute the actual Python code and generate the traffic. This decoupling allows you to scale horizontally by simply adding more workers.
Solution Overview: The Master-Worker Pattern
The beauty of distributed load testing with Locust is its simplicity. You don’t need complex configuration files for every node; you just need the same locustfile.py present on all machines. The Master coordinates the workers via ZeroMQ, ensuring that the total user count is split evenly across the fleet.
Core Architecture Components
- The Master: The brain. It doesn’t generate load itself; it tracks failures and response times, and manages the total user count.
- The Workers: The muscle. They run the User classes and send the actual HTTP requests to your target.
- The Target: The application or Kubernetes cluster you’re trying to stress test.
Implementation: Setting Up Distributed Load
I’ll walk you through a basic setup using two terminal windows (simulating two different machines), but the logic remains identical whether you’re using AWS EC2 instances or a local cluster.
Step 1: The Locustfile
Create a simple locustfile.py. Ensure this file is identical on the Master and all Worker nodes.
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # Each simulated user pauses 1-5 seconds between tasks.
    wait_time = between(1, 5)

    @task
    def load_homepage(self):
        self.client.get("/")

    @task
    def load_about(self):
        self.client.get("/about")
Step 2: Start the Master
On your primary machine, start Locust in master mode. Notice the --master flag:
locust -f locustfile.py --master
Step 3: Start the Workers
On your worker machines (or separate terminals), run the following command. Replace <master-ip> with the actual IP address of your Master node:
locust -f locustfile.py --worker --master-host=<master-ip>
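Keep in mind that each worker is a single Python process, so it can only use one CPU core. To saturate a multi-core worker machine, the usual advice is to start one worker process per core; a minimal sketch:

# Launch one worker per CPU core on this machine (<master-ip> as above).
for _ in $(seq "$(nproc)"); do
  locust -f locustfile.py --worker --master-host=<master-ip> &
done
wait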
The workers will now check in with the master. Once you open the Web UI at http://localhost:8089 on the master node, you’ll see the number of active workers listed on the main page.
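If you’d rather codify this topology than juggle terminals, Docker Compose works well. A sketch, assuming the official locustio/locust image and a locustfile.py in the current directory:

services:
  master:
    image: locustio/locust
    ports:
      - "8089:8089"          # Web UI
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  worker:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master

Scale the fleet with docker compose up --scale worker=4; Compose’s service discovery lets the workers resolve the master by its service name.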
Advanced Techniques for High-Scale Testing
Once you move past a few workers, you’ll run into network and OS limits. In my experience, these are the three most critical optimizations for distributed load testing with Locust:
1. FastHttpUser for Maximum Throughput
If you don’t need the full feature set of the requests library (which HttpUser uses), switch to FastHttpUser. It uses geventhttpclient and is significantly faster, often allowing a single worker to handle 5-10x more users.
from locust import FastHttpUser, task

class HighPerformanceUser(FastHttpUser):
    # FastHttpUser's client is geventhttpclient-based, not requests-based.
    @task
    def fast_request(self):
        self.client.get("/")
2. Tuning the OS (ulimit)
By default, Linux limits the number of open file descriptors (sockets). When running thousands of users, you’ll hit the ‘Too many open files’ error. I always run this on my workers before starting the test:
ulimit -n 65535
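Note that ulimit only affects the current shell session. For a persistent limit (assuming a standard Linux setup using PAM), raise the cap in /etc/security/limits.conf:

# /etc/security/limits.conf — takes effect at next login
*  soft  nofile  65535
*  hard  nofile  65535

If your workers run as systemd services, set LimitNOFILE= in the unit file instead, since systemd units don’t read limits.conf.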
3. Headless Execution for CI/CD
For automated pipelines, avoid the UI. Run the master in headless mode to define the duration and user count explicitly:
locust -f locustfile.py --master --headless -u 1000 -r 100 --run-time 10m
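One flag I always add in distributed headless runs is --expect-workers, which makes the master wait until that many workers have connected before starting, so the ramp-up isn’t split across a partial fleet. Assuming four workers:

locust -f locustfile.py --master --headless -u 1000 -r 100 --run-time 10m --expect-workers 4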
Case Study: Scaling to 50k Concurrent Users
I recently helped a client test a promotional landing page expected to hit 50k concurrent users. Using a single large instance wasn’t enough. We deployed a distributed cluster of 10 t3.medium instances on AWS.
The result? We discovered that while the app servers held up, the Load Balancer’s connection draining was misconfigured, causing 502 errors during scale-up. We wouldn’t have seen this with a single-node test because the ramp-up wasn’t aggressive enough to trigger the LB’s threshold.
Common Pitfalls to Avoid
- Network Saturation: Ensure your worker nodes aren’t all on the same subnet if you’re testing network throughput. You might be saturating the VPC gateway rather than the app.
- The ‘Master’ Bottleneck: While the master doesn’t generate load, it does process all the stats. If you have 100+ workers, the master’s CPU might spike. Consider increasing the master’s resources.
- Mismatched Code: If the locustfile.py on a worker differs from the master’s, the worker will fail to connect or produce erratic results. Use a git submodule or a shared S3 bucket to sync files (see the sketch below).
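For the file-sync problem specifically, even a simple push script beats manual copying. A minimal sketch, assuming SSH access; worker1, worker2, and the destination path are placeholders for your own fleet:

# Push the current locustfile to every worker before the run.
for host in worker1 worker2; do
  rsync -az locustfile.py "$host:/opt/locust/locustfile.py"
done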