For years, the standard workflow for local data analysis in Python was simple: load a CSV into Pandas and hope you didn’t run out of RAM. But as my datasets grew into the tens of millions of rows, Pandas started to crawl. That’s when I discovered DuckDB.

In this Python DuckDB tutorial, I’ll show you how to leverage this “SQLite for Analytics” to run complex SQL queries directly on your local files, without the overhead of setting up a full Postgres or Snowflake instance. Whether you’re weighing Python for data engineering versus data science or just trying to speed up a local script, DuckDB is a game-changer.

Prerequisites

Before we dive in, make sure you have Python 3.8+ installed. You don’t need to install a separate database server because DuckDB is an in-process library. You’ll only need a few packages, which we’ll install in the first step.

Step 1: Installation and Connection

Getting started is incredibly simple. Since DuckDB is an embedded database, installation is just a pip command.

pip install duckdb pandas polars

Now, let’s establish a connection. You can choose between an in-memory database (volatile) or a persistent database file (stored on disk).

import duckdb

# In-memory connection (everything is lost when the script ends)
con_mem = duckdb.connect(database=':memory:')

# Persistent connection (saved to a file)
con_file = duckdb.connect(database='my_analytics.db')
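
A quick sanity check that the engine is alive, plus the explicit close I’d use for the file-backed connection:

# Run a trivial query and fetch the result as Python tuples
print(con_mem.execute("SELECT 42 AS answer").fetchall())  # [(42,)]

# Close the persistent connection when you're done so the file is flushed
con_file.close()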

Step 2: Querying Files Without Importing

One of the most powerful features I’ve used in DuckDB is the ability to run SQL directly on top of CSV or Parquet files without formally “importing” them into a table first. This bypasses the slow ETL process entirely.

import duckdb

# Query a CSV file directly
results = duckdb.query("SELECT * FROM 'data/sales_2025.csv' WHERE region = 'North' LIMIT 5")
print(results.df()) # Convert to Pandas DataFrame immediately

If you’re working with massive datasets, I highly recommend using Parquet files. As I noted in my 2025 Polars vs Pandas comparison, columnar storage is essential for performance, and DuckDB reads Parquet files with incredible efficiency.
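
For instance, you can convert a CSV to Parquet with a single COPY statement and then query the Parquet file directly. Here’s a sketch reusing the hypothetical sales file from above (I’m assuming it has region and amount columns):

import duckdb

# One-off conversion: materialize the CSV as a compressed, columnar Parquet file
duckdb.execute("COPY (SELECT * FROM 'data/sales_2025.csv') TO 'data/sales_2025.parquet' (FORMAT PARQUET)")

# Queries on Parquet only read the columns they actually need
totals = duckdb.query("SELECT region, SUM(amount) AS total FROM 'data/sales_2025.parquet' GROUP BY region").df()
print(totals)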

Step 3: Integrating with Pandas and Polars

You don’t have to choose between SQL and DataFrames. DuckDB can “see” Pandas DataFrames and Polars DataFrames in your Python memory and query them as if they were SQL tables.

import duckdb
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# DuckDB treats the variable 'df' as a table name
res = duckdb.query("SELECT name FROM df WHERE age > 28").df()
print(res)
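
The same replacement-scan trick works for Polars (assuming you installed polars back in Step 1); calling .pl() instead of .df() keeps the result in Polars:

import duckdb
import polars as pl

pl_df = pl.DataFrame({'name': ['Carol', 'Dave'], 'age': [35, 22]})

# DuckDB resolves the variable 'pl_df' just as it did with the Pandas df
res = duckdb.query("SELECT name FROM pl_df WHERE age > 30").pl()
print(res)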

As shown in the image below, this seamless integration allows you to use SQL for heavy lifting (aggregations/joins) and Pandas for final formatting or visualization.

Visual representation of DuckDB querying a Pandas DataFrame in memory
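
Concretely, the division of labor looks something like this (a small self-contained sketch; the sales figures are made up):

import duckdb
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'South', 'North'],
    'amount': [120.0, 80.0, 200.0],
})

# SQL handles the heavy lifting: grouping and aggregating
summary = duckdb.query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).df()

# Pandas handles the final formatting
summary['total'] = summary['total'].map('${:,.2f}'.format)
print(summary)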

Step 4: Creating Persistent Tables and Views

While querying files directly is great for exploration, creating tables is better for repeated analysis. I usually use CREATE TABLE AS to materialize a cleaned version of my raw data.

con = duckdb.connect('my_analytics.db')

# Create a table from a CSV
con.execute("CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")

# Create a view for a common report (does not store data, just the query)
con.execute("CREATE VIEW monthly_summary AS SELECT date_trunc('month', order_date), sum(amount) FROM sales GROUP BY 1")

Troubleshooting Common Issues

Error: “Out of Memory”
Even though DuckDB is efficient, very large joins can still eat RAM. I solve this by setting a memory limit: con.execute("SET max_memory = '8GB'").
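
DuckDB can also spill intermediate results to disk when it hits that limit, as long as it has a temp directory to use. A minimal sketch (the spill path is just an example):

import duckdb

con = duckdb.connect('my_analytics.db')

# Cap RAM usage; larger-than-memory operations can spill to the temp directory
con.execute("SET max_memory = '8GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # example path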

CSV Parsing Errors
If your CSV has weird encoding or delimiters, skip read_csv_auto and define the parameters explicitly with read_csv, e.g. read_csv('file.csv', delim='|', header=true).
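
Here’s the explicit version in full, with one extra knob I reach for on messy files (the pipe-delimited file name is hypothetical, and ignore_errors skips rows that fail to parse instead of aborting):

import duckdb

rows = duckdb.query("""
    SELECT *
    FROM read_csv('exports/legacy_dump.csv',
                  delim = '|',
                  header = true,
                  ignore_errors = true)
""").df()
print(rows.head())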

What’s Next?

Now that you’ve mastered the basics of this Python DuckDB tutorial, I suggest exploring DuckDB’s extensions. You can install the httpfs extension to query files directly from an S3 bucket (or any HTTPS URL) without downloading them first.
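
Installing and loading the extension takes two statements; after that, remote paths behave like local ones. A sketch (the URL is a placeholder, not a real dataset):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Query a remote Parquet file over HTTPS without downloading it first
count = con.execute(
    "SELECT COUNT(*) FROM read_parquet('https://example.com/data/events.parquet')"
).fetchall()
print(count)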

If you’re feeling bold, try integrating DuckDB into a Streamlit dashboard. The combination of a fast OLAP engine and a reactive frontend is one of the fastest ways to build internal data tools today.