When I first started working with microservices, debugging a single failed request felt like trying to find a needle in a haystack, except the haystack was spread across five different servers and three different programming languages. Standard logs weren’t enough because they lacked a common thread. This is why I shifted to distributed tracing. In this step-by-step guide to distributed tracing with Jaeger, I’ll show you how to implement a system that lets you visualize the entire lifecycle of a request as it travels through your architecture.
Distributed tracing isn’t just about finding errors; it’s about understanding latency. If a page load takes 2 seconds, is the bottleneck in the database query, the authentication middleware, or a slow third-party API? Jaeger gives you the answer visually.
Prerequisites
Before we dive into the implementation, ensure you have the following installed and configured on your machine:
- Docker & Docker Compose: We’ll use these to run Jaeger without manual installation.
- A basic Microservices Project: I’ll use a simple Node.js and Python setup, but the concepts apply to any language.
- OpenTelemetry SDKs: Since Jaeger is fully compatible with OpenTelemetry, we will use OTel for instrumentation. If you’re new to this, I highly recommend reading my introduction to OpenTelemetry for developers to understand the standard.
Step 1: Deploying the Jaeger Backend
The fastest way to get started is using the “all-in-one” Docker image. This combines the agent, collector, and UI into a single container, which is perfect for development and testing.
```bash
docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest
```
Once the container is running, you can access the Jaeger UI at http://localhost:16686. At this stage, the dashboard will be empty because we haven’t sent any traces yet.
Step 2: Instrumenting Your Application
To make your apps “traceable,” you need to instrument them. While you can use Jaeger-specific libraries, I always use OpenTelemetry (OTel) because it prevents vendor lock-in. Here is how I set up a Node.js service to send spans to Jaeger.
```js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  // Identifies this service in the Jaeger UI; every service needs a unique name.
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
  }),
  // Send spans over Thrift/HTTP to the collector endpoint on port 14268
  // (port 14250 is the collector's gRPC port, which this exporter does not use).
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
});

sdk.start();
```
In my experience, the most critical part of this setup is the SERVICE_NAME. If you give multiple services the same name, your traces will overlap in the UI, making it impossible to distinguish between them.
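One caveat: the snippet above registers the SDK but doesn’t tell it what to trace, so no spans will be created for incoming requests on their own. A common approach, sketched below assuming the @opentelemetry/auto-instrumentations-node package is installed, is to pass auto-instrumentations so HTTP, Express, and database calls produce spans without manual code:

```js
// A sketch assuming @opentelemetry/auto-instrumentations-node is installed.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  // ...resource and traceExporter exactly as above...
  // Automatically create spans for HTTP servers/clients, Express, databases, etc.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```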
Step 3: Propagating Context Across Services
Tracing only works if the trace ID is passed from Service A to Service B. This is called Context Propagation. When Service A calls Service B via HTTP, it injects a header (usually `traceparent`). Service B then extracts this header and starts its own span as a child of the original trace.
This creates a parent-child relationship that Jaeger visualizes as a Gantt-style timeline, allowing you to see exactly where time is being spent.
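Here is a minimal sketch of doing this propagation by hand with the @opentelemetry/api package; the service and span names are illustrative, and with auto-instrumentation (as in Step 2) this header handling happens automatically:

```js
// A minimal sketch of manual context propagation with @opentelemetry/api.
// Assumes a registered SDK with the default W3C TraceContext propagator.
const { context, propagation, trace } = require('@opentelemetry/api');

// Service A: inject the active trace context into outgoing request headers.
function outgoingHeaders() {
  const headers = {};
  propagation.inject(context.active(), headers); // writes the traceparent header
  return headers;
}

// Service B: extract the incoming context and start a child span under it.
function handleRequest(req) {
  const parentContext = propagation.extract(context.active(), req.headers);
  const span = trace
    .getTracer('payment-service') // hypothetical service name
    .startSpan('process-payment', {}, parentContext);
  // ... do the actual work ...
  span.end();
}
```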
Step 4: Analyzing Traces in the Jaeger UI
Now, trigger a few requests in your application. Go back to http://localhost:16686, select your service from the dropdown, and click “Find Traces.”
When you click on a trace, you’ll see the timeline view. Look for “long bars”—these represent slow operations. You can click on a span to see the tags and logs associated with it, such as the HTTP status code or specific error messages. This integrates perfectly with best practices for centralized logging in microservices, as you can use the trace_id to jump from a log entry directly to the visual trace.
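As a rough sketch of that jump (assuming the SDK from Step 2 is running; logWithTrace is a hypothetical helper, not a library function), you can stamp every log line with the active trace ID:

```js
// A minimal sketch of log correlation, assuming the SDK from Step 2 is running.
const { trace } = require('@opentelemetry/api');

function logWithTrace(message) {
  const span = trace.getActiveSpan();
  // Stamp the log line with the active trace ID so you can paste it into the Jaeger UI.
  const traceId = span ? span.spanContext().traceId : 'no-active-trace';
  console.log(JSON.stringify({ message, trace_id: traceId }));
}

// Example output: {"message":"payment failed","trace_id":"4bf92f3577b34da6..."}
```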
Pro Tips for Jaeger Production Setups
- Sampling Rates: Don’t trace 100% of requests in production. It will overwhelm your storage and add latency. Start with a 1% or 5% sampling rate (see the sketch after this list).
- Use a Collector: In production, don’t send spans directly from the app to the Jaeger backend. Use the Jaeger Collector or the OpenTelemetry Collector as a middleman to buffer and batch data.
- Custom Tags: Add business-specific tags (e.g., `customer_id` or `order_type`) to your spans. This allows you to search for traces affecting a specific high-value customer.
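Here is a hedged sketch of the sampling and custom-tag tips, building on the Step 2 setup; the 5% ratio and the attribute values are purely illustrative:

```js
// A sketch of production-leaning tweaks, building on the NodeSDK setup from Step 2.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const { trace } = require('@opentelemetry/api');

const sdk = new NodeSDK({
  // Sample 5% of new traces at the root; child spans follow their parent's decision.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.05),
  }),
  // ...resource and traceExporter as in Step 2...
});

// Inside a request handler: attach business-specific tags to the active span.
const span = trace.getActiveSpan();
if (span) {
  span.setAttribute('customer_id', 'cust-42');     // hypothetical values
  span.setAttribute('order_type', 'subscription');
}
```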
Troubleshooting Common Issues
| Issue | Likely Cause | Solution |
|---|---|---|
| No traces appearing in UI | Incorrect exporter endpoint or firewall blocking port 14268. | Verify connectivity with curl and check the app logs for export errors. |
| Traces are fragmented (multiple traces for one request) | Context propagation is missing between services. | Ensure the OTel propagation headers are being sent and received in HTTP calls. |
| High CPU usage on app | Tracing overhead due to 100% sampling. | Implement a ParentBased sampler to reduce data volume. |
What’s Next?
Now that you have basic distributed tracing with Jaeger implemented, you should look into integrating your traces with metrics. While Jaeger tells you why a request was slow, a tool like Prometheus tells you how many requests are slow. Combining the three pillars of logs, metrics, and traces is the gold standard of modern observability.
Ready to level up your infrastructure? Check out my other guides on OpenTelemetry and log aggregation to build a bulletproof monitoring stack.