GlareDB: Query Multiple Data Sources with SQL
GlareDB is a modern, in‑process analytics engine that lets you run ANSI‑SQL queries across a variety of data sources—CSV files, Parquet, relational databases, even cloud storage—without moving the data. Think of it as a lightweight data fabric that lives inside your Python process, giving you the power of a data warehouse while keeping latency low. In this post we’ll walk through installing GlareDB, wiring up multiple sources, and writing queries that span them, all with practical code you can copy‑paste today.
What is GlareDB?
At its core, GlareDB embeds an analytics engine written in Rust that speaks a PostgreSQL-compatible SQL dialect, extended with table‑valued functions (TVFs) and foreign data wrappers. This means you can treat a remote PostgreSQL table, a local CSV, or a Parquet file as if they were ordinary SQL tables, and join them together in a single SELECT statement.
Because the engine runs in‑process, there’s no network hop between your Python code and the query planner. The result is sub‑second query latency for exploratory analytics, while still supporting the full power of joins, window functions, and aggregates.
Installing and Starting GlareDB
The first step is to install the Python client. GlareDB ships as a pip package that bundles the native binary for your platform.
# Install the client
pip install glaredb

# Verify installation (from Python)
import glaredb
print(glaredb.__version__)  # e.g., 0.7.3
Once installed, you can spin up an in‑memory instance or persist data to a local directory. For most development work, an in‑memory DB is sufficient.
import glaredb

# Create an in‑memory GlareDB connection (the default)
conn = glaredb.connect()

# Or persist to a file on disk
# conn = glaredb.connect("my_glaredb.db")
Pro tip: When you expect to reuse the same schema across sessions, point GlareDB at a persistent file. This avoids re‑registering data sources each time you start your notebook.
Registering Multiple Data Sources
GlareDB uses catalogs to group data sources. You can register a CSV, a Parquet folder, or a remote relational database with a single line of Python. Below we register a local CSV and a PostgreSQL database.
Registering a CSV File
# Assume we have a CSV of sales data
csv_path = "data/sales_2023.csv"
conn.execute(f"""
    CREATE EXTERNAL TABLE sales (
        order_id   BIGINT,
        product_id BIGINT,
        quantity   INT,
        price      DOUBLE,
        order_date DATE
    )
    USING csv
    OPTIONS (
        path '{csv_path}',
        header 'true',
        delimiter ','
    )
""")
The CREATE EXTERNAL TABLE statement tells GlareDB to treat the CSV as a virtual table. No data is copied; the engine reads rows on demand.
Registering a PostgreSQL Database
# PostgreSQL connection details
pg_conn_str = "postgresql://user:password@localhost:5432/warehouse"
conn.execute(f"""
    CREATE EXTERNAL TABLE customers (
        customer_id BIGINT,
        name        TEXT,
        email       TEXT,
        signup_date DATE
    )
    USING postgresql
    OPTIONS (
        url '{pg_conn_str}',
        schema 'public',
        table 'customers'
    )
""")
Behind the scenes GlareDB opens a connection pool to the remote PostgreSQL instance. The table definition must match the remote schema, but you can also let GlareDB infer columns by omitting the column list.
Querying Across Sources
Now that we have sales (CSV) and customers (PostgreSQL) registered, we can write a single SQL query that joins them. GlareDB will push down predicates to each source where possible, minimizing data movement.
query = """
    SELECT
        c.customer_id,
        c.name,
        SUM(s.quantity * s.price) AS total_spent,
        COUNT(*) AS orders_count
    FROM customers c
    JOIN sales s
        ON c.customer_id = s.customer_id
    WHERE s.order_date >= DATE '2023-01-01'
    GROUP BY c.customer_id, c.name
    ORDER BY total_spent DESC
    LIMIT 10
"""

for row in conn.execute(query):
    print(row)
The result is a list of the top ten customers by revenue, calculated on‑the‑fly from two distinct backends. Notice how the WHERE clause filters the CSV rows before the join, reducing the amount of data that needs to be streamed from disk.
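To see exactly what the query computes, here is the same aggregation written out in plain Python over a few hypothetical rows (toy data for illustration, not the GlareDB API):

```python
from collections import defaultdict

# Toy stand-ins for the two sources (hypothetical sample rows)
customers = {1: "Ada", 2: "Grace"}
sales = [
    # (customer_id, quantity, price, order_date)
    (1, 2, 10.0, "2023-03-01"),
    (1, 1, 50.0, "2023-04-02"),
    (2, 5, 8.0, "2022-12-31"),  # filtered out by the date predicate
    (2, 3, 20.0, "2023-06-15"),
]

totals = defaultdict(float)
counts = defaultdict(int)
for cust_id, qty, price, order_date in sales:
    if order_date >= "2023-01-01":       # WHERE s.order_date >= DATE '2023-01-01'
        totals[cust_id] += qty * price   # SUM(s.quantity * s.price)
        counts[cust_id] += 1             # COUNT(*)

# ORDER BY total_spent DESC LIMIT 10
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
for cust_id, total in top:
    print(customers[cust_id], total, counts[cust_id])
```

The filter drops Grace's 2022 order before it ever contributes to her total, which is the same effect the engine achieves by pushing the predicate down to the CSV scan.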
Fetching Results into a Pandas DataFrame
Most data‑science workflows prefer a DataFrame. GlareDB’s Python client integrates seamlessly with Pandas.
import pandas as pd

# Convert the federated result straight into a DataFrame
df = conn.sql(query).to_pandas()
print(df.head())

Under the hood, GlareDB produces results as Arrow record batches, so the conversion to Pandas is cheap and memory usage stays low even for large result sets.
Real‑World Use Cases
1. Marketing Attribution – A marketing analyst often needs to combine click‑stream logs stored in Parquet on S3 with CRM data in Snowflake. By registering the Parquet folder and the Snowflake connector, a single query can attribute revenue to marketing campaigns without ETL.
2. IoT Sensor Fusion – Imagine sensor readings in CSV files on an edge device and a central PostgreSQL table that stores device metadata. GlareDB lets you join live sensor data with device attributes to compute anomaly scores in real time.
3. Financial Reconciliation – Auditors frequently compare transaction logs in CSV (exported from a legacy system) against a modern ledger in MySQL. A federated query can highlight mismatches, saving hours of manual spreadsheet work.
Performance Tips
While GlareDB is fast out of the box, you can squeeze extra performance by aligning data formats and leveraging push‑down capabilities.
- Columnar formats win. Parquet and Arrow files enable vectorized reads, drastically reducing CPU overhead compared to row‑oriented CSV.
- Predicate push‑down. Write filters that reference only columns present in the external source; GlareDB will translate them into native filters (e.g., PostgreSQL WHERE clauses).
- Batch size tuning. For large remote tables, adjust the fetch_size option in the connector to balance latency and memory usage.
Pro tip: When joining a massive remote table with a small local CSV, place the CSV on the left side of the join (i.e., FROM local_csv lc JOIN remote_table rt) so GlareDB can broadcast the small side and avoid costly shuffles.
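The payoff of predicate push‑down is easy to see in miniature: when the source applies the filter itself, far fewer rows ever cross the wire. A schematic sketch in plain Python (a conceptual model, not GlareDB internals):

```python
def remote_scan(rows, predicate=None):
    """Simulate a remote source that can optionally apply a filter itself."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row

# One hypothetical row per month of 2023
rows = [{"order_date": f"2023-{m:02d}-01"} for m in range(1, 13)]
pred = lambda r: r["order_date"] >= "2023-07-01"

# Without push-down: all 12 rows cross the "wire", then get filtered locally
shipped_all = list(remote_scan(rows))
local_filtered = [r for r in shipped_all if pred(r)]

# With push-down: the source applies the filter, so only 6 rows are shipped
shipped_pushed = list(remote_scan(rows, predicate=pred))

print(len(shipped_all), len(local_filtered), len(shipped_pushed))  # 12 6 6
```

Both strategies return the same six rows; the difference is how much data had to move to get there.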
Advanced Features
GlareDB isn’t just a query engine; it also supports user‑defined functions (UDFs) written in Python, allowing you to embed custom logic directly in SQL.
Creating a Python UDF
# Define a simple UDF that classifies order size
@conn.udf
def order_category(quantity: int) -> str:
    if quantity >= 100:
        return "bulk"
    elif quantity >= 10:
        return "regular"
    else:
        return "small"
# Use the UDF in a query
category_query = """
    SELECT
        order_id,
        order_category(quantity) AS size_category,
        price
    FROM sales
    WHERE order_date = DATE '2023-07-15'
"""

df_category = conn.sql(category_query).to_pandas()
print(df_category.head())
The UDF runs in the same Python process, so there’s no network hop. For heavy computations you can offload work to a compiled extension or a separate process pool.
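Since the UDF body is ordinary Python, you can unit‑test the classification logic on its own before registering it with the engine; the function below mirrors the UDF above, minus the decorator:

```python
def order_category(quantity: int) -> str:
    # Same thresholds as the registered UDF
    if quantity >= 100:
        return "bulk"
    elif quantity >= 10:
        return "regular"
    else:
        return "small"

# Spot-check the boundary values
cases = [(150, "bulk"), (100, "bulk"), (99, "regular"),
         (10, "regular"), (9, "small"), (0, "small")]
for qty, expected in cases:
    assert order_category(qty) == expected
print("all boundary checks passed")
```

Boundary values (here 10 and 100) are where off‑by‑one mistakes hide, so they are worth checking explicitly before the function runs inside a query.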
Materialized Views for Repeated Workloads
If a particular federated join is executed frequently, you can materialize it as a regular table. This caches the result locally, turning a multi‑source query into a fast, single‑source lookup.
# Create a materialized view (actually a table) from the join
conn.execute("""
    CREATE TABLE top_customers AS
    SELECT
        c.customer_id,
        c.name,
        SUM(s.quantity * s.price) AS total_spent
    FROM customers c
    JOIN sales s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.name
""")

# Subsequent queries hit the local table only
df_cached = conn.sql(
    "SELECT * FROM top_customers ORDER BY total_spent DESC LIMIT 5"
).to_pandas()
print(df_cached)
Remember to refresh the table whenever the underlying sources change—GlareDB does not automatically keep materialized tables in sync.
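A simple refresh pattern is to drop and rebuild the cached table whenever the sources change. The sketch below uses Python's built‑in sqlite3 as a stand‑in so it runs anywhere; with GlareDB you would issue the same DROP/CREATE ... AS statements through conn.execute:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (customer_id INT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [(1, 10.0), (1, 5.0), (2, 7.0)])

def refresh_top_customers(db):
    # Rebuild the cached table from the live source
    db.execute("DROP TABLE IF EXISTS top_customers")
    db.execute("""
        CREATE TABLE top_customers AS
        SELECT customer_id, SUM(amount) AS total_spent
        FROM sales
        GROUP BY customer_id
    """)

refresh_top_customers(db)
before = db.execute(
    "SELECT * FROM top_customers ORDER BY total_spent DESC").fetchall()

# New data arrives; the cache is stale until refreshed
db.execute("INSERT INTO sales VALUES (2, 100.0)")
refresh_top_customers(db)
after = db.execute(
    "SELECT * FROM top_customers ORDER BY total_spent DESC").fetchall()

print(before)  # [(1, 15.0), (2, 7.0)]
print(after)   # [(2, 107.0), (1, 15.0)]
```

Wrapping the refresh in a function keeps the invalidation logic in one place, whether you trigger it on a schedule or after known writes to the sources.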
Testing and Debugging Queries
When dealing with multiple backends, it’s helpful to inspect the execution plan. GlareDB exposes EXPLAIN, just as PostgreSQL does.
plan = conn.execute("EXPLAIN " + query).fetchall()
for line in plan:
    print(line[0])
The plan shows which parts of the query were pushed down to PostgreSQL versus executed locally. If you see a full table scan on a remote source, consider adding an index or tightening the filter.
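Spotting a full scan in a plan and fixing it with an index works the same way in most engines. Here is the idea using sqlite3's EXPLAIN QUERY PLAN as a stand‑in (GlareDB's plan text looks different, but the reading is the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (device_id INT, ts TEXT)")

# Without an index, the filter forces a full table scan
plan_before = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE device_id = 7"
).fetchall()
print(plan_before[0][3])  # e.g., "SCAN events"

# With an index, the planner switches to an index search
db.execute("CREATE INDEX idx_events_device ON events (device_id)")
plan_after = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE device_id = 7"
).fetchall()
print(plan_after[0][3])  # now mentions the index
```

The same before/after comparison applies to a remote PostgreSQL source: if the pushed-down filter still triggers a sequential scan, add the index on the remote side.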
Security and Access Control
Because GlareDB connects directly to external systems, you should manage credentials securely. The recommended approach is to use environment variables or a secrets manager rather than hard‑coding connection strings.
import os

pg_url = os.getenv("PG_URL")  # e.g., "postgresql://user:pwd@host/db"

conn.execute(f"""
    CREATE EXTERNAL TABLE customers (...)
    USING postgresql
    OPTIONS (url '{pg_url}', schema 'public', table 'customers')
""")
For production workloads, you can also enable TLS on the PostgreSQL connector and restrict GlareDB’s file system access using OS‑level permissions.
Deploying GlareDB in Production
While GlareDB shines in notebooks, you can embed it in a Flask or FastAPI service to expose an SQL‑over‑REST endpoint. The engine’s low memory footprint (often < 200 MB) makes it suitable for containerized deployments.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import glaredb

app = FastAPI()
db = glaredb.connect()

class QueryRequest(BaseModel):
    sql: str

@app.post("/query")
def run_query(req: QueryRequest):
    try:
        rows = db.execute(req.sql).fetchall()
        return {"rows": rows}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
Containerize the service with a lightweight base image (e.g., python:3.12-slim) and expose port 8000. This pattern lets downstream applications submit ad‑hoc SQL without needing a full‑blown data warehouse.
Best Practices Checklist
- ✅ Keep source schemas in sync with GlareDB definitions.
- ✅ Prefer columnar formats (Parquet, Arrow) for large files.
- ✅ Use EXPLAIN to verify predicate push‑down.
- ✅ Cache expensive joins with materialized tables.
- ✅ Store credentials outside code (env vars, secret manager).
- ✅ Monitor connection pools for remote databases.
Conclusion
GlareDB bridges the gap between scattered data silos and the simplicity of a single SQL interface. By registering CSVs, Parquet, and relational databases as external tables, you can compose federated queries that run fast, stay low‑cost, and avoid cumbersome ETL pipelines. The Python client makes integration painless, while features like UDFs, materialized views, and execution‑plan introspection give you the control needed for production workloads. Whether you’re a data analyst stitching together marketing reports or a developer building an analytics microservice, GlareDB provides a flexible, in‑process engine that scales with your needs.