Jaeger is an open-source distributed tracing system that helps you monitor and troubleshoot requests as they flow through a microservices architecture. In simple terms, Jaeger shows you where time is spent in a request and which services were involved.
Distributed tracing is crucial in microservices environments because it allows you to:
- Identify performance bottlenecks
- Debug and troubleshoot issues across services
- Understand the flow of requests through your system
- Optimize your application’s overall performance
Jaeger’s compatibility with OpenTelemetry — a collection of tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data — makes it an even more powerful choice for developers. This compatibility ensures that you can easily integrate Jaeger with a wide range of applications and services.
While Jaeger does add some overhead to your application, it’s designed to be lightweight. The impact can be minimized through proper sampling strategies and configuration. In most cases, the performance impact is negligible compared to the benefits of distributed tracing.
With Jaeger you can:
- Monitor and troubleshoot distributed workflows
- Identify performance bottlenecks
- Track down root causes
- Analyze service dependencies
Jaeger vs Zipkin
Jaeger and Zipkin are both open-source distributed tracing systems. Zipkin is the older of the two and is simple to run as a single service, while Jaeger offers adaptive sampling, a horizontally scalable architecture, and strong OpenTelemetry support, which makes it a common choice for new projects.
This guide walks you through running Jaeger with Docker or Docker Compose, instrumenting an app with OpenTelemetry, and using the Jaeger UI.
Prerequisites
- Docker (or Docker Engine + Docker Compose) installed. See How to install and use Docker in Ubuntu 22.04 if needed.
- For the Python example: Python 3 and pip.
Setting up Jaeger (all-in-one)
Jaeger provides an all-in-one container image (jaegertracing/all-in-one) which bundles the collector, query, and in-memory storage. This is perfect for local development and demos (not production).
Confirm Docker is installed and running:
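Any equivalent check works; for example:

```bash
docker --version   # the Docker CLI is installed
docker info        # the Docker daemon is running (this errors if it is not)
```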
Pull the Jaeger all-in-one image:
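```bash
docker pull jaegertracing/all-in-one:latest
```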
Run the container in the foreground:
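A typical invocation maps the UI port and the OTLP receiver ports; trim the port list to what you actually use:

```bash
docker run --rm --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 6831:6831/udp \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest
```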
To run it in the background instead, add -d before the image name.
Run with Docker Compose
Create docker-compose.yml:
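A minimal definition along these lines is enough for local use; adjust the published ports as needed:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    restart: unless-stopped
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      - "14268:14268"   # Jaeger collector HTTP (Thrift)
      - "14250:14250"   # Jaeger collector gRPC
      - "9411:9411"     # Zipkin-compatible endpoint
```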
Start Jaeger with Docker Compose v2 (docker compose) or v1 (docker-compose):
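```bash
docker compose up -d      # Compose v2
# or
docker-compose up -d      # Compose v1
```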
This runs the all-in-one configuration: collector, query service, and in-memory storage in one process. Data is not persisted across restarts; use a proper storage backend for production.
- Open the Jaeger UI at http://localhost:16686 (or http://<server-ip>:16686 if remote). Allow port 16686 in your firewall if you need external access.
- Send test spans using the OpenTelemetry example below or your own instrumented app.
Jaeger User Interface
Open http://localhost:16686 in your browser. The UI includes:
- Service Map: A visual representation of the services and their interactions.
- Trace view: A detailed view of an individual trace, including the spans it contains and their timing.
- Span details: A detailed view of an individual span, including its tags and logs.
- Search: A query interface for finding traces by service, operation, tags, and duration.
Tip: After you send a few spans, select your service in the dropdown and click Find Traces.
Ports exposed by the Jaeger all-in-one container
The all-in-one container exposes several ports. The most commonly used ones are:
- 16686/tcp: Jaeger UI (Query service)
- 4317/tcp: OTLP gRPC receiver (OpenTelemetry)
- 4318/tcp: OTLP HTTP receiver (OpenTelemetry)
- 14268/tcp: Jaeger collector HTTP (Thrift)
- 14250/tcp: Jaeger collector gRPC
- 6831/udp and 6832/udp: Jaeger agent receivers (legacy Jaeger clients)
- 9411/tcp: Zipkin compatibility endpoint
If you’re using OpenTelemetry, you’ll typically send spans to 4317 or 4318 and view traces in the UI on 16686.
Jaeger Architecture
Jaeger’s architecture consists of several key components:
- Client Libraries: Instrument your application code; they create spans and send them to the Jaeger Agent.
- Agent: A network daemon that listens for spans sent by the client libraries, batches them, and forwards them to the Collector.
- Collector: Receives traces from the Agent, runs them through a processing pipeline, and stores them in a storage backend.
- Query: A service that retrieves traces from storage and hosts a UI to display them.
- UI: A web interface for searching and analyzing traces.
Data flows from your instrumented application through the client libraries to the Agent, then to the Collector, and finally to storage. The Query service retrieves this data from storage to display in the UI.
Jaeger supports multiple storage options, including Cassandra, Elasticsearch, and in-memory storage (for development). The choice of storage depends on your scalability needs and existing infrastructure.
Sampling plays a crucial role in Jaeger’s architecture. It allows you to control the amount of tracing data you collect, which is essential for managing performance and storage costs in high-traffic systems.
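As a concrete illustration, here is a minimal sketch of head-based probabilistic sampling with the OpenTelemetry Python SDK (the same SDK used in the instrumentation example below); the 10% ratio is just an example value:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of traces; child spans follow their parent's sampling
# decision, so a trace is either fully recorded or fully dropped.
sampler = ParentBased(TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```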
Instrumenting an application (recommended: OpenTelemetry)
To instrument your application, prefer using the OpenTelemetry SDK and export spans to Jaeger via OTLP.
First, install the OpenTelemetry dependencies:
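```bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```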
Example (Python) exporting spans to Jaeger over OTLP gRPC (localhost:4317). A minimal sketch, assuming the all-in-one container is running locally; the service and span names are illustrative:
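```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify this application as "my-service" in Jaeger.
resource = Resource.create({"service.name": "my-service"})

# Export spans to the all-in-one container's OTLP gRPC receiver on localhost:4317.
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("parent-operation"):
    with tracer.start_as_current_span("child-operation"):
        print("Doing some traced work...")

# Flush any buffered spans before the process exits.
provider.shutdown()
```

Run the script; you should then see my-service in the Jaeger UI, and Find Traces will show the parent and child spans.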
Advanced Instrumentation Techniques
Advanced techniques you can use with Jaeger:
- Custom Samplers: Create samplers that make intelligent decisions about which traces to sample based on your specific needs.
- Baggage: Use baggage to pass data along the entire trace, which can be useful for correlating information across services (see the sketch after this list).
- Multiple Spans: Create and manage multiple spans within a single trace to represent different operations or sub-operations.
- Logging Integration: Integrate Jaeger with your logging system to enhance debugging capabilities.
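A short sketch of the baggage and multiple-spans techniques with the OpenTelemetry Python SDK, assuming a tracer provider is already configured as in the earlier example; the key user.id and the span names are illustrative:

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer(__name__)

# Attach a baggage entry to the current context so it travels with the trace
# (and across services, once context propagation is set up).
token = context.attach(baggage.set_baggage("user.id", "42"))
try:
    with tracer.start_as_current_span("checkout") as span:
        # Copy the baggage value onto the span so it is searchable in Jaeger.
        span.set_attribute("user.id", str(baggage.get_baggage("user.id")))
        # Multiple child spans within one trace represent sub-operations.
        with tracer.start_as_current_span("reserve-inventory"):
            pass
        with tracer.start_as_current_span("charge-card"):
            pass
finally:
    context.detach(token)
```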
Jaeger in a Production Environment
When deploying Jaeger in production, consider the following:
- Scalability: Each component (Agent, Collector, Query) can be scaled independently. Use load balancers to distribute traffic.
- Storage: Implement a production-ready storage backend like Elasticsearch or Cassandra, sized and configured for your expected data volume (a sketch follows this list).
- Security: Set up secure communication between Jaeger components using TLS. Implement authentication and authorization for the Jaeger UI.
- Sampling: Implement appropriate sampling strategies based on your traffic patterns and tracing needs. Dynamic sampling can help balance data collection and system performance.
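As a rough illustration of splitting the components, a Compose sketch might look like the following. It uses the classic Jaeger v1 collector and query images with Elasticsearch as the backend; treat it as a starting point only, since a real deployment needs proper Elasticsearch sizing, TLS, and authentication:

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.10
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false

  jaeger-collector:
    image: jaegertracing/jaeger-collector:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "4317:4317"     # OTLP gRPC from instrumented apps
      - "4318:4318"     # OTLP HTTP
    depends_on:
      - elasticsearch

  jaeger-query:
    image: jaegertracing/jaeger-query:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"   # Jaeger UI
    depends_on:
      - elasticsearch
```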
Analyzing Traces with Jaeger UI
The Jaeger UI provides powerful tools for analyzing traces:
- Use the search functionality to find relevant traces based on service, operation, tags, or duration.
- Examine the trace timeline to understand the relationships between spans and identify long-running operations.
- Inspect span details, including tags and logs, to gather context about each operation.
- Use the comparison view to analyze multiple traces side by side and identify patterns or anomalies.
Best Practices for Effective Jaeger Implementation
To get the most out of Jaeger:
- Follow consistent naming conventions for services and operations to make searching and filtering easier.
- Implement appropriate sampling strategies to balance data collection and system performance.
- Ensure proper error handling and logging within instrumented code to provide context for issues.
- Regularly review and optimize your tracing implementation to ensure it continues to meet your needs as your system evolves.
Service Performance Monitoring (SPM) with Jaeger
Service Performance Monitoring (SPM) is a feature in Jaeger that provides service-level and operation-level aggregation of key metrics like request rates, error rates, and durations (latencies). It helps identify interesting traces (e.g. high QPS, slow or erroneous requests) without needing to know the service or operation names up-front.
How does SPM work?
SPM works by aggregating span data from Jaeger to produce RED (Request, Error, Duration) metrics. The OpenTelemetry Collector’s Span Metrics Processor is used to generate these metrics from the incoming traces.
Key features of SPM
- Service-level and operation-level aggregation of request rates, error rates, and latencies (P95, P75, P50)
- “Impact” metric computed as the product of latency and request rate to highlight high-impact operations
- Pre-populated Trace search with relevant service, operation and lookback period for interesting traces
Accessing SPM in Jaeger
The SPM feature can be accessed from the “Monitor” tab in the Jaeger UI. It requires Jaeger 1.57 or later.
Limitations
SPM is still an experimental feature and may have further changes in the future.
It typically requires an OpenTelemetry Collector configured with the span metrics processor, and a metrics backend to store/query the generated metrics.
Verifying the setup
- Container: docker ps (or docker compose ps) should show the Jaeger container with ports 16686, 4317, 4318, etc. mapped.
- UI: Opening http://localhost:16686 should load the Jaeger UI; the Service dropdown may be empty until you send traces.
- Traces: After running the Python example (or any OTLP sender), select your service name (e.g. my-service) and click Find Traces to see spans.
Summary
You ran Jaeger using the all-in-one Docker image (standalone docker run or Docker Compose), with the UI on port 16686 and OTLP receivers on 4317 (gRPC) and 4318 (HTTP). You instrumented a small Python app with the OpenTelemetry SDK and exported spans to Jaeger. For production, deploy the collector, query, and storage as separate components and use a persistent backend (e.g. Elasticsearch, Cassandra).
For more monitoring stack options, see