DuckDB: Embedded Analytics Database Tutorial
Welcome to the world of DuckDB, the in‑process analytics engine that feels like SQLite for data‑science workloads. It runs directly inside your Python process, eliminates the need for a separate server, and can query CSV, Parquet, and even Pandas DataFrames with blazing speed. In this tutorial we’ll walk through installation, core concepts, and three hands‑on examples that showcase DuckDB’s power in real‑world scenarios.
What Makes DuckDB Different?
DuckDB is designed for analytical workloads, not transactional ones. It stores data column‑wise, enabling vectorized execution that dramatically speeds up scans, aggregations, and joins. Because it’s embedded, you get the simplicity of a library call and the performance of a native database engine.
Another standout feature is its seamless integration with the Python data‑science stack. You can query a Pandas DataFrame as if it were a table, and the result comes back as a DataFrame without any manual conversion. This makes exploratory analysis feel natural and eliminates the “data‑movement” bottleneck that often plagues ETL pipelines.
Installing DuckDB
The installation process is as simple as any other Python package. Open a terminal and run:
pip install duckdb
DuckDB also offers pre‑compiled wheels for Linux, macOS, and Windows, so you won’t need a compiler. If you’re using Conda, the command is equally straightforward:
conda install -c conda-forge duckdb
Once installed, you can import the library and start a connection with a single line of code.
Basic Queries – Your First DuckDB Session
Let’s dive into a minimal example that creates an in‑memory database, loads a CSV, and runs a simple aggregation. This pattern is the backbone of many data‑science notebooks.
import duckdb
import pandas as pd
# Create an in‑memory connection
con = duckdb.connect(database=':memory:')
# Load a CSV directly into DuckDB (no need to pre‑load into Pandas)
con.execute("""
CREATE TABLE sales AS
SELECT *
FROM read_csv_auto('data/sales_2023.csv')
""")
# Run an aggregation query
result = con.execute("""
SELECT region, SUM(revenue) AS total_revenue
FROM sales
GROUP BY region
ORDER BY total_revenue DESC
""").fetchdf()
print(result)
Notice how the read_csv_auto function automatically infers column types, so you can start querying immediately. The fetchdf() call returns a Pandas DataFrame, letting you blend SQL with Python’s rich visualization ecosystem.
Why This Matters
In many projects, the first step is to explore a raw CSV file. Traditionally you’d load it into Pandas, then write Python loops for aggregates—slow and memory‑hungry. DuckDB’s approach lets you offload the heavy lifting to its columnar engine, keeping your notebook responsive even with multi‑gigabyte files.
Working Directly with Pandas DataFrames
Often you already have a DataFrame in memory and want to run complex SQL without converting it to a temporary table. DuckDB makes this effortless using the register method.
import duckdb
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
'user_id': np.arange(1, 10001),
'signup_date': pd.date_range('2023-01-01', periods=10000, freq='h'),
'plan': np.random.choice(['free', 'basic', 'premium'], size=10000),
'revenue': np.random.exponential(scale=20, size=10000).round(2)
})
con = duckdb.connect()
con.register('users', df) # Register the DataFrame as a virtual table
# Complex query with window functions
query = """
SELECT
user_id,
plan,
revenue,
AVG(revenue) OVER (PARTITION BY plan) AS avg_plan_revenue,
SUM(revenue) OVER (ORDER BY signup_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7h_rev
FROM users
WHERE revenue > 0
"""
result_df = con.execute(query).fetchdf()
print(result_df.head())
The register call creates a virtual table that points directly to the DataFrame’s memory. When you run the query, DuckDB reads the data column‑wise, applies the window functions, and returns a new DataFrame—all without copying the original data.
Pro tip: Registering a DataFrame is cheap, but re‑registering the same object repeatedly can add overhead. Keep a persistent connection alive for the duration of your notebook session to maximize performance.
Analyzing Parquet Files at Scale
Parquet is the de facto storage format for analytics because of its columnar layout and built‑in compression. DuckDB can query a directory of Parquet files as if they were a single table, enabling “SQL on files” without any ETL.
import duckdb
# Assume a folder with many daily Parquet files: data/2023/*.parquet
con = duckdb.connect()
# Directly query the folder; DuckDB will parallelize the scan
query = """
SELECT
date_trunc('day', event_timestamp) AS day,
COUNT(*) AS events,
AVG(event_value) AS avg_value
FROM read_parquet('data/2023/*.parquet')
WHERE event_type = 'click'
GROUP BY day
ORDER BY day
"""
daily_metrics = con.execute(query).fetchdf()
print(daily_metrics)
Behind the scenes, DuckDB opens each Parquet file, reads only the columns required for the query, and pushes down predicates (the WHERE clause) to the file level. The result is a fast, low‑memory aggregation even when the total dataset spans dozens of gigabytes.
Real‑World Use Case: Clickstream Analytics
Imagine an e‑commerce platform that logs every user interaction to Parquet files stored on S3. A data engineer can spin up a Jupyter notebook, point DuckDB at the S3 bucket (using an s3:// URI, with the httpfs extension loaded), and instantly compute conversion funnels, bounce rates, or A/B test results without provisioning a separate Redshift or BigQuery cluster.
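Reading straight from S3 takes a little setup: the httpfs extension must be installed and loaded, and credentials configured. A configuration sketch, with my-bucket as a hypothetical bucket name:

```sql
INSTALL httpfs;   -- one-time download of the extension
LOAD httpfs;      -- load it into the current session
SET s3_region = 'us-east-1';
-- Credentials can also come from environment variables or explicit
-- s3_access_key_id / s3_secret_access_key settings

SELECT event_type, COUNT(*) AS events
FROM read_parquet('s3://my-bucket/clickstream/2023/*.parquet')
GROUP BY event_type;
```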
Performance Tuning – Getting the Most Out of DuckDB
DuckDB is fast out of the box, but a few knobs can push performance even further. The most common adjustments involve memory allocation and parallelism.
- PRAGMA threads: Controls the number of CPU threads used. By default DuckDB uses all available cores, but you can limit it in shared environments.
- PRAGMA memory_limit: Caps the memory usage to avoid OOM crashes when working with massive datasets.
- PRAGMA enable_progress_bar: Turns on a progress bar for long‑running queries, useful during interactive exploration.
Example of setting these pragmas before a heavy query:
con.execute("PRAGMA threads=4")
con.execute("PRAGMA memory_limit='8GB'")
con.execute("PRAGMA enable_progress_bar=true")
These settings are session‑level, so they won’t affect other parts of your application. Remember to benchmark with and without changes; sometimes the default “use all cores” is already optimal.
Pro tip: When querying compressed Parquet files, the CPU often becomes the bottleneck, not I/O. Increasing threads yields the biggest gains, but watch your system’s overall load to avoid contention with other services.
Embedding DuckDB in Production Pipelines
Beyond notebooks, DuckDB can be a lightweight analytics engine inside ETL jobs, micro‑services, or even desktop applications. Its zero‑configuration nature means you can bundle it with a Docker image or a PyInstaller executable without worrying about external dependencies.
Consider a data‑quality micro‑service that receives JSON payloads, writes them to a temporary Parquet file, runs a series of validation queries, and returns a JSON report. The entire workflow fits in a few hundred lines of Python and runs in under a second for typical payload sizes.
Below is a skeleton of such a service using FastAPI:
from fastapi import FastAPI, UploadFile
import duckdb
import tempfile
import pandas as pd
app = FastAPI()
con = duckdb.connect(database=':memory:')
@app.post("/validate")
async def validate(file: UploadFile):
    # Save the uploaded file to a temporary location
    with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    # Run the validation query against the temporary Parquet file
    # (cleanup of the temp file is omitted here for brevity)
    query = f"""
        SELECT
            COUNT(*) AS total_rows,
            SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
        FROM read_parquet('{tmp_path}')
    """
    result = con.execute(query).fetchdf().to_dict(orient='records')[0]
    return {"validation": result}
This pattern demonstrates DuckDB’s ability to act as a “SQL engine on the fly,” handling ad‑hoc validation without a persistent database.
Advanced Feature Spotlight – User‑Defined Functions (UDFs)
DuckDB supports Python UDFs that let you embed custom logic directly in SQL. This is handy for tasks like text processing, custom date parsing, or machine‑learning inference.
import duckdb
import re
# Define a simple UDF to extract domain from a URL
def extract_domain(url: str) -> str:
match = re.search(r'https?://([^/]+)/?', url)
return match.group(1) if match else None
con = duckdb.connect()
con.create_function('extract_domain', extract_domain, [str], str)
# Use the UDF in a query
df = con.execute("""
SELECT
url,
extract_domain(url) AS domain,
COUNT(*) AS visits
FROM read_csv_auto('data/web_logs.csv')
GROUP BY url, domain
ORDER BY visits DESC
LIMIT 10
""").fetchdf()
print(df)
The UDF runs in the same process, so there’s no network latency. For heavier workloads you can also write UDFs in C++ and compile them as extensions, but the Python approach is more than enough for most analytics tasks.
Pro tip: Keep UDFs pure and side‑effect‑free. DuckDB may call them multiple times for the same row during query optimization, so non‑deterministic functions can produce surprising results.
Real‑World Success Stories
- FinTech fraud detection: A startup uses DuckDB to scan daily CSV exports of transaction logs, compute risk scores with a Python UDF, and feed the results into a real‑time alerting system—all without spinning up a separate analytics warehouse.
- Scientific research: Researchers analyzing petabytes of climate data store intermediate results as Parquet files on a shared filesystem. DuckDB lets them run ad‑hoc statistical queries from a Jupyter notebook, dramatically reducing the time from data acquisition to insight.
- Marketing attribution: An agency aggregates clickstream data from multiple ad platforms stored as CSVs. By loading them into DuckDB, they can join on user identifiers, calculate multi‑touch attribution models, and export the final report to Excel in minutes.
Best Practices Checklist
- Prefer columnar file formats (Parquet, Arrow) over row‑based CSV for large datasets.
- Use read_csv_auto or read_parquet directly in SQL to avoid unnecessary data copies.
- Register DataFrames only once per session; reuse the same connection for multiple queries.
- Set PRAGMA threads based on your environment's CPU budget.
- Leverage UDFs for domain‑specific logic, but keep them deterministic.
- Profile long queries with EXPLAIN ANALYZE to spot bottlenecks.
Conclusion
DuckDB bridges the gap between the simplicity of a local library and the performance of a full‑blown analytics warehouse. Whether you’re exploring a single CSV, joining millions of rows across Parquet files, or embedding SQL into a production micro‑service, DuckDB offers a fast, zero‑config solution that integrates naturally with the Python ecosystem.
By mastering the basics—installation, in‑memory queries, DataFrame registration, and file‑based scans—you’ll unlock a workflow that keeps data close to the code, reduces latency, and scales gracefully as your datasets grow. The pro tips and real‑world examples in this tutorial should give you a solid foundation to start replacing heavyweight ETL pipelines with lightweight, DuckDB‑powered analytics.