There is nothing more frustrating than running a critical production debug query and watching the loading spinner rotate for thirty seconds. When you’re dealing with terabytes of data, optimizing Elasticsearch query performance for large logs isn’t just a ‘nice-to-have’—it’s the difference between resolving an incident in five minutes and letting it drag on for an hour.

In my experience managing ELK stacks for microservices, I’ve found that most performance bottlenecks aren’t caused by a lack of hardware, but by how the data is indexed and how the queries are structured. Whether you’re using a self-managed cluster or Elastic Cloud, these ten tips will help you slash your latency.

1. Use Filter Context Instead of Query Context

One of the most common mistakes I see is using match or term queries inside a must clause for everything. In Elasticsearch, query context calculates a relevance score (_score), which is computationally expensive and useless for logs. Logs are binary: either they match the criteria or they don’t.

Always wrap your log filters in a filter block. Filters are cached by Elasticsearch, making subsequent identical queries nearly instantaneous.

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
[Figure: Comparison of query context vs. filter context in Elasticsearch, showing cached filter results]

2. Implement Data Streams and ILM

Large log volumes lead to massive indices, which slow down everything. Instead of one giant index, use Data Streams combined with Index Lifecycle Management (ILM). By rolling over indices based on size or age, you ensure that the active index remains small enough to fit in the filesystem cache.

This approach is a cornerstone of best practices for centralized logging in microservices, as it allows you to move older logs to ‘warm’ or ‘cold’ nodes with cheaper storage.
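
As a sketch, a minimal ILM policy could look like this (the policy name logs-ilm and all thresholds are placeholders; tune them to your retention requirements):

PUT _ilm/policy/logs-ilm
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Attach the policy to your data stream’s index template via the index.lifecycle.name setting, and the rollovers happen automatically.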

3. Avoid Leading Wildcards

Querying *error is a performance killer. Elasticsearch must scan every single term in the inverted index to find matches. If you need to find patterns within strings, consider the wildcard field type (introduced in Elasticsearch 7.9) or, better yet, analyze your logs at ingestion time to extract key terms into their own fields.
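
If you genuinely do need substring matching, here is a minimal mapping sketch (assuming the raw line lives in a field called message):

PUT logs-wildcard-demo
{
  "mappings": {
    "properties": {
      "message": { "type": "wildcard" }
    }
  }
}

The wildcard type indexes n-grams behind the scenes, so even patterns like *error* avoid a brute-force scan of the terms dictionary.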

4. Optimize Your Mappings (Disable Dynamic Mapping)

Dynamic mapping is convenient, but it often creates unnecessary text and keyword multi-fields for every single log key. For large logs, explicitly define your mappings. If you don’t need to perform full-text searches on a field, map it as keyword only. This reduces disk usage and speeds up aggregations.
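
As an illustrative sketch (the field names here are assumptions, not a standard schema), an explicit mapping for a log index might look like this:

PUT logs-explicit
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" },
      "trace_id": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}

Setting "dynamic": "strict" rejects documents with unmapped fields; use "dynamic": false if you would rather keep unknown keys in _source without indexing them.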

5. Use Runtime Fields Sparingly

Runtime fields are powerful for adding logic without re-indexing, but they are calculated at query time, for every document the query touches. Whenever I’ve traced a slow query back to a runtime script, the fix has almost always been to move that logic into the ingestion pipeline (Logstash or an ingest node) so the value is indexed physically.
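
For example, rather than parsing a status code out of the message with a runtime script on every search, a hypothetical ingest pipeline (the field names and grok pattern are illustrative) can do the work once at write time:

PUT _ingest/pipeline/parse-status
{
  "description": "Extract the HTTP status code once, at ingest time",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{WORD:http.method} %{URIPATH:http.path} %{NUMBER:http.status:int}"],
        "ignore_failure": true
      }
    }
  ]
}

Point your index template’s index.default_pipeline setting (or your Logstash output) at the pipeline, and the query-time cost disappears.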

6. Leverage the exists Query

When searching for logs that contain a specific metadata field, don’t search for a wildcard or a null value. Use the exists query. It is highly optimized to check the inverted index for the presence of a field without scanning the document values.
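
Combined with tip #1, the query stays fully cacheable (trace_id here is just an assumed metadata field):

{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "trace_id" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}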

7. Limit the Number of Returned Fields

Fetching the full _source for thousands of large log documents puts a massive strain on network I/O and memory. Use _source filtering to return only the fields you actually need for your analysis.

{
  "_source": ["message", "timestamp", "trace_id"],
  "query": { ... }
}

8. Optimize Shard Size and Count

Too many small shards create overhead; too few large shards make recovery slow and queries sluggish. I generally aim for shard sizes between 10GB and 50GB. If your shards are 200GB+, your query performance will tank as Elasticsearch struggles to merge results from massive segments.
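
To see where you stand, the _cat API will list your shards sorted by on-disk size:

GET _cat/shards/logs-*?v&h=index,shard,prirep,store&s=store:desc

If an index is already oversized, the _shrink API (or a reindex into a better-sized target) is the escape hatch; going forward, let your ILM rollover thresholds keep shards in the 10GB–50GB sweet spot.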

9. Use Point-in-Time (PIT) for Deep Pagination

Using from and size for pagination in large log sets is a recipe for an OutOfMemoryError. As the offset increases, Elasticsearch must load and sort all preceding documents. Instead, use the PIT API combined with search_after to maintain a consistent view of the data without the performance penalty.
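
Here is a sketch of the flow (the PIT ID below is a placeholder returned by the first call):

POST logs-*/_pit?keep_alive=1m

POST _search
{
  "size": 1000,
  "pit": { "id": "<pit-id>", "keep_alive": "1m" },
  "query": { "term": { "level": "ERROR" } },
  "sort": [
    { "@timestamp": "asc" },
    { "_shard_doc": "asc" }
  ]
}

For each subsequent page, pass the previous page’s last sort values as search_after; the _shard_doc tiebreaker keeps the ordering stable across pages.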

10. Monitor Your Slow Logs

You can’t optimize what you can’t measure. Enable the index.search.slowlog thresholds to capture queries that exceed a given duration. This allows you to identify the exact problematic queries and analyze them with the search Profile API.
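
A sketch of enabling it on your log indices (the thresholds are placeholders; set them just above your latency SLOs):

PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "800ms",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Once a culprit shows up in the slow log, re-running it with "profile": true in the request body breaks down exactly where the time goes.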

If you find that even after these optimizations, Elasticsearch is too heavy for your log volume, you might want to explore Grafana Loki vs ELK stack for logs to see if a label-based indexing approach fits your needs better.

Common Mistakes to Avoid

Even after applying these tips, a few habits can quietly undo your gains:

- Scoring your filters: putting exact-match conditions in must clauses and paying for a _score nobody reads.
- Leading wildcards: queries like *error that force a full scan of the terms dictionary.
- Deep from/size pagination: use PIT with search_after instead.
- Unbounded dynamic mappings: every new log key silently becomes indexed text and keyword fields.
- Oversized shards: anything well beyond 50GB slows both queries and recovery.

Measuring Success

To verify that your changes worked, I recommend tracking three key metrics in your monitoring tool:

- Query latency: the p95/p99 took time of your most frequent searches should drop noticeably.
- Cache hit ratio: the query cache and request cache node stats tell you whether your filter blocks are actually being reused.
- Search rejections: the search thread pool’s rejected count should trend toward zero as queries get cheaper.

By focusing on how data is stored and accessed, you can keep your logging infrastructure lean and responsive. Ready to dive deeper into your infrastructure? Check out my other guides on Cloud and Infrastructure optimization.