When I first started dealing with high-velocity telemetry data, I tried using a traditional relational database. It worked until I hit 100 million rows. Suddenly, simple aggregations that should have taken milliseconds took minutes. That’s when I moved to ClickHouse. If you are looking for a ClickHouse tutorial for time series data, you’ve come to the right place.
ClickHouse isn’t just another database; it’s a columnar store designed specifically for OLAP (Online Analytical Processing). For time series data—where you have a timestamp, a few dimensions, and many metrics—this architecture is a game-changer. While you might be considering the best database for real-time analytics, ClickHouse usually wins on raw read speed for massive datasets.
Prerequisites
- Basic knowledge of SQL.
- A running instance of ClickHouse (Docker is the fastest way to start).
- A dataset or a stream of events (JSON or CSV).
Step 1: Designing the Schema for Time Series
In ClickHouse, the way you define your table determines your performance. The most critical part is the MergeTree engine family. For time series, we almost always use MergeTree or its specialized variants.
Here is how I set up a typical metrics table for server monitoring:
CREATE TABLE server_metrics (
    timestamp DateTime64(3, 'UTC'),
    metric_name LowCardinality(String),
    host_id UInt32,
    value Float64
) ENGINE = MergeTree()
ORDER BY (metric_name, host_id, timestamp);
Why this design?
- LowCardinality(String): I use this for metric_name because there are only a few dozen unique metric names but millions of rows. This saves massive amounts of disk space.
- ORDER BY: This is the primary key. By putting metric_name and host_id first, ClickHouse can skip entire granules of data when you filter by a specific metric or host.
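If you want to verify that the primary key is actually pruning data, recent ClickHouse versions can report index usage via EXPLAIN. A minimal sketch against the table above (host_id 42 is just an example value):

EXPLAIN indexes = 1
SELECT count()
FROM server_metrics
WHERE metric_name = 'cpu_usage'
  AND host_id = 42;

The PrimaryKey section of the output lists selected parts and granules against the totals; the bigger the gap, the more work the index is saving you.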
Step 2: Efficient Data Ingestion
One mistake I see often is inserting data row-by-row. Never do this in ClickHouse. It creates too many small parts on disk, leading to “Too many parts” errors.
Instead, use batch inserts. If you are using Python, use the clickhouse-connect library to send chunks of 10k-100k rows. If you have a high-volume stream, I recommend using the Kafka engine. For those exploring local alternatives, a DuckDB vs MotherDuck comparison might show you how to handle smaller local files before pushing them to a production ClickHouse cluster.
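For the Kafka route, the usual pattern is a Kafka engine table that consumes the stream plus a materialized view that drains it into server_metrics. Here is a sketch assuming a local broker and a JSON-per-line topic; the broker address, topic, and consumer group names are placeholders for your own setup:

CREATE TABLE metrics_queue (
    timestamp DateTime64(3, 'UTC'),
    metric_name String,
    host_id UInt32,
    value Float64
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',   -- placeholder broker
    kafka_topic_list = 'server_metrics',    -- placeholder topic
    kafka_group_name = 'clickhouse_ingest', -- placeholder consumer group
    kafka_format = 'JSONEachRow';

-- Acts like a continuously running INSERT ... SELECT: ClickHouse consumes
-- the topic in large blocks, which avoids the small-parts problem entirely.
CREATE MATERIALIZED VIEW metrics_consumer TO server_metrics AS
SELECT timestamp, metric_name, host_id, value
FROM metrics_queue;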
Step 3: Querying Time Series Data
The power of ClickHouse is in its specialized functions. For time series, you’ll spend most of your time with toStartOfHour(), toStartOfMinute(), and aggregations like avg() and max().
Here is a query to get the average CPU usage per hour for the last 24 hours:
SELECT
    toStartOfHour(timestamp) AS hour,
    avg(value) AS avg_cpu
FROM server_metrics
WHERE metric_name = 'cpu_usage'
  AND timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
ClickHouse answers this by reading only the columns the query touches (metric_name for the filter, then timestamp and value) and skipping everything else, which is why it’s so fast.
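One gotcha with this query: hours that recorded no rows simply vanish from the result, which can leave holes in a chart. Recent ClickHouse versions can pad the gaps with WITH FILL; a sketch:

SELECT
    toStartOfHour(timestamp) AS hour,
    avg(value) AS avg_cpu
FROM server_metrics
WHERE metric_name = 'cpu_usage'
  AND timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour WITH FILL STEP INTERVAL 1 HOUR;

Filled rows carry default values (0 for avg_cpu), so treat zeros from fill rows as “no data” rather than a real reading.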
Pro Tips for Production
- Use TTLs: Time series data grows indefinitely. Use TTL timestamp + INTERVAL 30 DAY to automatically delete old data.
- Materialized Views: If you have a dashboard that refreshes every second, don’t query the raw table. Create a SummingMergeTree materialized view to pre-aggregate data.
- Compression: Try the ZSTD(1) codec for columns with high redundancy to save more space.
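Here is how those three tips can look in DDL. Treat this as a sketch: the table names ending in _v2 and _hourly are hypothetical, and the TTL window and codec level should match your own retention and CPU budget:

-- Raw table with a TTL and a per-column ZSTD codec.
CREATE TABLE server_metrics_v2 (
    timestamp DateTime64(3, 'UTC'),
    metric_name LowCardinality(String),
    host_id UInt32,
    value Float64 CODEC(ZSTD(1))              -- better ratio, slightly more CPU
) ENGINE = MergeTree()
ORDER BY (metric_name, host_id, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;  -- rows deleted after 30 days

-- Hourly rollup target for dashboards.
CREATE TABLE server_metrics_hourly (
    hour DateTime('UTC'),
    metric_name LowCardinality(String),
    host_id UInt32,
    value_sum Float64,
    value_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY (metric_name, host_id, hour);

-- Populated automatically on every insert into server_metrics.
CREATE MATERIALIZED VIEW server_metrics_hourly_mv TO server_metrics_hourly AS
SELECT
    toStartOfHour(timestamp) AS hour,
    metric_name,
    host_id,
    sum(value) AS value_sum,
    count() AS value_count
FROM server_metrics
GROUP BY hour, metric_name, host_id;

When you query the rollup, still aggregate (sum(value_sum) / sum(value_count) with a GROUP BY) rather than reading rows raw, because SummingMergeTree only collapses duplicates during background merges.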
Troubleshooting Common Issues
Issue: “Too many parts” error
This happens when you insert data too frequently in small batches. Solution: Increase your batch size or use a buffer table.
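If you cannot fix batching on the client side, a Buffer table in front of the target can absorb small inserts and flush them in bulk. A sketch, assuming server_metrics lives in the default database; the thresholds mirror the style of the documented example and are starting points to tune:

CREATE TABLE server_metrics_buffer AS server_metrics
ENGINE = Buffer(default, server_metrics,
    16,                    -- num_layers
    10, 100,               -- min/max seconds before flush
    10000, 1000000,        -- min/max rows
    10000000, 100000000);  -- min/max bytes

-- Writers insert into server_metrics_buffer; ClickHouse flushes to
-- server_metrics when all min thresholds are met or any max is exceeded.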
Issue: Slow queries on large tables
Check if your WHERE clause matches your ORDER BY key. If you filter by timestamp but it’s the last element in your ORDER BY, ClickHouse has to scan more data than necessary.
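As a concrete illustration against the Step 1 schema:

-- Slow pattern: the leading key columns (metric_name, host_id) are
-- unconstrained, so the timestamp filter scans most granules.
SELECT count()
FROM server_metrics
WHERE timestamp >= now() - INTERVAL 1 HOUR;

-- Fast pattern: pinning the first key column lets ClickHouse prune
-- granules before the timestamp range is even evaluated.
SELECT count()
FROM server_metrics
WHERE metric_name = 'cpu_usage'
  AND timestamp >= now() - INTERVAL 1 HOUR;

If most of your queries are time-first across all metrics, that is a sign your ORDER BY should lead with timestamp instead.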
What’s Next?
Now that you’ve mastered the basics of handling time series data in ClickHouse, I recommend looking into ClickHouse Keeper for clustering and high availability. You can also explore integrating Grafana to visualize your ClickHouse data in real time.