For years, the ‘correct’ way to handle business intelligence was to shove everything into a massive cloud data warehouse. I’ve spent countless hours configuring clusters and watching credit consumption climb in Snowflake or BigQuery, only to realize that for 80% of my daily analysis, I was paying for compute power I didn’t actually need. This is why I started experimenting with DuckDB for business intelligence.
DuckDB is an in-process SQL OLAP database. Unlike PostgreSQL or MySQL, it doesn’t run as a separate server process; it lives inside your application. For anyone building a modern data stack for startups, this shift in architecture is a game-changer. It allows you to perform complex aggregations on millions of rows of data directly on your laptop or a small VPS, without the latency or cost of a cloud round-trip.
The Challenge: The ‘Cloud Warehouse Tax’
The primary friction in traditional BI is the movement of data. To get a simple report, data usually flows from a production DB → ETL tool → Cloud Warehouse → BI Tool. Every step introduces a point of failure and a cost center. When I first looked at my monthly bills, I realized I was spending more on storing static historical data than on the actual analysis.
This is where the need to reduce Snowflake storage costs becomes a priority. But the real solution isn’t just optimizing storage—it’s questioning whether you need a permanent, always-on cluster for intermittent analytical queries.
Solution Overview: Why DuckDB?
DuckDB brings the power of a columnar store (like Redshift or ClickHouse) to a single-file format. In my experience, the magic of using DuckDB for business intelligence lies in three specific areas:
- Zero Installation: It’s a pip install or a binary download. No server to manage.
- Parquet Integration: It can query Parquet files directly on S3 or locally without ‘importing’ them into a table first.
- Vectorized Execution: It processes data in batches, making it incredibly fast for the types of aggregations (SUM, AVG, GROUP BY) that define BI.
Techniques for High-Performance BI
To get the most out of DuckDB, you have to move away from the ‘row-based’ mindset. Here is how I implement a high-performance BI layer using Python and DuckDB.
Direct Parquet Querying
Instead of loading data into a database, I treat my S3 bucket as the database. With the httpfs extension loaded, you can query remote files with minimal overhead:
```python
import duckdb

# Initialize an in-memory connection
con = duckdb.connect()

# Install and load the httpfs extension for S3 access
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

# Query a Parquet file directly from S3
result = con.execute("""
    SELECT
        category,
        SUM(revenue) AS total_rev
    FROM read_parquet('s3://my-bi-bucket/sales_2023/*.parquet')
    GROUP BY 1
    ORDER BY 2 DESC
""").df()

print(result)
```
Integrating with Evidence or Streamlit
DuckDB doesn’t have a built-in UI, but it pairs perfectly with lightweight BI tools. I’ve found that combining DuckDB with Evidence.dev (which uses SQL and Markdown) creates a ‘BI-as-code’ workflow that is significantly faster to iterate on than dragging and dropping widgets in Tableau.
Implementation Strategy
If you’re moving toward using DuckDB for business intelligence, I recommend a hybrid approach. Don’t delete your warehouse immediately. Instead, follow this migration path:
- The ‘Local First’ Tier: Use DuckDB for ad-hoc exploration of raw Parquet/CSV exports.
- The ‘Edge’ Tier: Deploy DuckDB inside a Lambda function or a small container to serve pre-aggregated data to a dashboard.
- The ‘Archive’ Tier: Move historical data out of your expensive warehouse and into Parquet files on S3, querying them via DuckDB when needed for year-over-year audits.
Here is a comparison of how this changes the typical data flow:
| Feature | Traditional Cloud BI | DuckDB-Powered BI |
|---|---|---|
| Infrastructure | Managed Cluster (High Cost) | In-process (Near Zero Cost) |
| Data Movement | Heavy ETL/ELT Pipelines | Direct Query on Files |
| Latency | Network dependent | Local disk/RAM speed |
| Maintenance | DBA/Data Engineer required | Dev-centric (SQL/Python) |
Pitfalls to Avoid
DuckDB is powerful, but it’s not a silver bullet. In my tests, I encountered a few roadblocks you should be aware of:
- Concurrency: DuckDB is designed for analytical workloads, not transactional ones. It’s not meant to be the backend for a website with 1,000 simultaneous users writing data.
- Memory Limits: While it can handle datasets larger than RAM using ‘out-of-core’ processing, performance drops significantly if you’re constantly swapping to disk.
- Persistence: Since it’s in-process, you need to be mindful of how you manage the .db file if you aren’t just querying raw files.
For those of you managing complex environments, remember that the goal is efficiency. Whether you are optimizing a startup data stack or managing enterprise assets, the key is placing the compute as close to the data as possible.
Ready to optimize your analytics? If you’re still struggling with cloud costs, check out my guide on reducing Snowflake storage costs to save money while you transition to a more modular architecture.