RisingWave: Stream Processing Database Explained
RisingWave is a cloud‑native, distributed stream processing database that blurs the line between a stream processor and a relational database. It lets you write SQL queries that continuously ingest, transform, and materialize streaming data, all while providing the familiar relational model you already know. In this article we’ll unpack the architecture, walk through a couple of end‑to‑end examples, and explore where RisingWave shines in production workloads.
Core Concepts and Architecture
At its heart, RisingWave follows the compute‑storage separation pattern. The compute layer runs streaming operators whose durable state lives in RisingWave’s cloud‑native, LSM‑tree‑based storage engine (Hummock), backed by shared object storage such as S3, so state survives node failures; cluster metadata is tracked by a separate meta service. This design enables horizontal scaling: you can add more compute nodes to boost throughput without touching the underlying storage.
RisingWave also adopts a SQL‑first philosophy. All streaming logic—windowing, joins, aggregations—is expressed as standard SQL statements. Under the hood, the planner rewrites these statements into a dataflow graph, which the runtime executes continuously as new records arrive.
Materialized Views as Continuous Queries
In RisingWave, a materialized view is the primary way to define a continuous query. Once you create a view, the system automatically keeps it up‑to‑date by processing incoming events in real time. This eliminates the need for separate ETL pipelines or batch jobs.
You can evolve views without disrupting downstream consumers: create a new materialized view alongside the old one, let it backfill from upstream, and switch consumers over once it has caught up. The old version continues to serve existing queries in the meantime.
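In practice this side‑by‑side evolution pattern looks roughly like the following (view and column names are illustrative, not from a real deployment):

```sql
-- Create the replacement view alongside the old one; it backfills from upstream.
CREATE MATERIALIZED VIEW orders_summary_v2 AS
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;

-- Once all consumers have switched over, retire the old version.
DROP MATERIALIZED VIEW orders_summary_v1;
```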
Built‑in Fault Tolerance
- Exactly‑once processing guarantees via checkpointing.
- State snapshots stored in the distributed storage layer.
- Automatic failover of compute actors when a node crashes.
These features let you focus on business logic rather than worrying about data loss or duplicate results.
Getting Started: A Minimal Setup
RisingWave can be launched locally with Docker Compose, making it easy to experiment before moving to a managed cloud deployment. The docker-compose.yml snippet below sketches the core services; exact image names and flags vary by release (recent versions also ship an all‑in‑one image with a playground mode), so treat the official compose file in the RisingWave repository as the source of truth.
version: '3.8'
services:
  meta:
    image: risingwave/meta:latest
    ports: ["5690:5690"]
    command: ["meta-node", "--listen-addr", "0.0.0.0:5690"]
  compute:
    image: risingwave/compute-node:latest
    depends_on: ["meta"]
    command: ["compute-node", "--meta-address", "meta:5690"]
  frontend:
    image: risingwave/frontend:latest
    ports: ["4566:4566"]
    depends_on: ["meta", "compute"]
    command: ["frontend", "--meta-address", "meta:5690"]
After running docker compose up -d, you can connect to the SQL interface on port 4566. Any PostgreSQL‑compatible client works out of the box, for example: psql -h localhost -p 4566 -d dev -U root.
First Example: Real‑Time Clickstream Analytics
Imagine you run an e‑commerce site and want to monitor page‑view events in real time. Each event contains a user ID, page URL, and a timestamp. We’ll ingest these events, compute a rolling count of page views per URL, and expose the result via a materialized view.
Step 1: Create a Source
RisingWave supports a variety of connectors. For this demo we’ll use the built‑in KAFKA source. Replace the placeholder values with your Kafka broker details.
CREATE SOURCE clickstream (
    user_id VARCHAR,
    url VARCHAR,
    event_ts TIMESTAMP,
    WATERMARK FOR event_ts AS event_ts - INTERVAL '30 SECONDS'
) WITH (
    connector = 'kafka',
    topic = 'page_views',
    properties.bootstrap.server = 'localhost:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
Step 2: Define a Continuous Aggregation
We’ll create a tumbling window that counts page views every minute. The watermark tells RisingWave how late data can arrive before it’s considered too old.
CREATE MATERIALIZED VIEW page_view_counts AS
SELECT
    url,
    window_start,
    COUNT(*) AS views
FROM TUMBLE(clickstream, event_ts, INTERVAL '1 MINUTE')
WHERE event_ts > NOW() - INTERVAL '1 HOUR'  -- temporal filter: keep only recent data
GROUP BY url, window_start;
The view now continuously updates as new Kafka messages flow in. You can query it just like a regular table:
SELECT url, window_start, views
FROM page_view_counts
ORDER BY window_start DESC
LIMIT 10;
Step 3: Visualize the Results
Because RisingWave speaks PostgreSQL, you can connect a BI tool (e.g., Metabase or Superset) directly to the frontend endpoint. Set the refresh interval to a few seconds, and you’ll see a live dashboard of the hottest pages.
Pro tip: Set scan.startup.mode in the source’s WITH clause to control where ingestion begins: 'latest' avoids replaying historical Kafka offsets into a live dashboard, while 'earliest' backfills the full topic.
Second Example: Fraud Detection on Financial Transactions
Financial institutions need sub‑second detection of anomalous transactions. RisingWave’s support for event‑time windows, joins, and user‑defined functions (UDFs) makes it a solid fit for such use cases.
Schema Overview
- transactions: stream of incoming payments (account_id, amount, timestamp).
- account_profiles: static table with risk scores per account.
We’ll flag any transaction that exceeds the account’s rolling average over the past hour by more than three standard deviations.
Step 1: Ingest Transaction Stream
CREATE SOURCE transactions (
    account_id VARCHAR,
    amount DOUBLE PRECISION,
    tx_ts TIMESTAMP,
    WATERMARK FOR tx_ts AS tx_ts - INTERVAL '1 MINUTE'
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
Step 2: Load Account Profiles
Static data fits best in a plain RisingWave table, populated with INSERT statements or kept in sync from an operational database via one of RisingWave’s CDC connectors:
CREATE TABLE account_profiles (
    account_id VARCHAR PRIMARY KEY,
    risk_score INT
);
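For a quick demo, the table can be seeded with a few hand‑written rows (the account IDs and scores below are illustrative sample values); for real datasets, a CDC connector is the more common route:

```sql
INSERT INTO account_profiles (account_id, risk_score) VALUES
    ('acct_1001', 3),
    ('acct_1002', 8);
```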
Step 3: Compute Rolling Statistics
We’ll maintain hopping windows that cover the last hour for each account, advancing every five minutes, calculating mean and standard deviation on the fly.
CREATE MATERIALIZED VIEW account_stats AS
SELECT
    account_id,
    AVG(amount) AS avg_amount,
    STDDEV_POP(amount) AS stddev_amount,
    window_start
FROM HOP(transactions, tx_ts, INTERVAL '5 MINUTES', INTERVAL '1 HOUR')
GROUP BY account_id, window_start;
Step 4: Detect Anomalies
Now join the live transaction stream with the rolling stats and the static risk profile. Any transaction that exceeds avg + 3 * stddev and belongs to a high‑risk account is emitted to an alert sink.
CREATE MATERIALIZED VIEW fraud_alerts AS
SELECT
    t.account_id,
    t.amount,
    t.tx_ts,
    p.risk_score,
    s.avg_amount,
    s.stddev_amount
FROM transactions t
JOIN account_stats s
    ON t.account_id = s.account_id
    AND t.tx_ts >= s.window_start
    AND t.tx_ts < s.window_start + INTERVAL '1 HOUR'
JOIN account_profiles p
    ON t.account_id = p.account_id
WHERE
    t.amount > s.avg_amount + 3 * s.stddev_amount
    AND p.risk_score >= 7;
Finally, push the alerts to a Kafka topic for downstream incident response systems:
CREATE SINK fraud_alerts_sink FROM fraud_alerts
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'fraud_alerts'
) FORMAT PLAIN ENCODE JSON (
    force_append_only = 'true'  -- emit alerts as an append-only event stream
);
Pro tip: RisingWave automatically hash‑partitions operator state by key across compute nodes. When dealing with high‑cardinality keys (e.g., millions of accounts), make sure the job’s parallelism is high enough to spread that state evenly and avoid hot spots.
Real‑World Use Cases Beyond the Examples
IoT telemetry: Devices generate millions of sensor readings per second. RisingWave can ingest these streams, perform per‑device windowed aggregations (e.g., average temperature over 10‑second windows), and feed the results into a time‑series database for long‑term storage.
Ad tech bidding: Real‑time bidding platforms need sub‑millisecond latency to evaluate ad impressions. By materializing a view that joins the incoming impression stream with a static campaign table, you can compute bid prices on the fly without a separate microservice.
Log analytics: Centralized logging pipelines often rely on Elasticsearch for search, but aggregations (error rates, latency percentiles) are better handled in a streaming DB. RisingWave can continuously compute these metrics and expose them via SQL, simplifying alerting.
Performance Tuning Guidelines
- Choose the right window type. Tumbling windows are cheap; hopping or sliding windows keep overlapping state. Use them only when necessary.
- Control watermark lag. A watermark set too far behind event time causes state bloat. A 30‑second lag works for most near‑real‑time workloads.
- Leverage parallelism. Tune the streaming parallelism to match the number of CPU cores per node; over‑parallelizing can increase network chatter.
- Compress state. Enabling state compression reduces the storage footprint, especially for large windowed aggregates.
- Monitor backpressure. RisingWave emits metrics via Prometheus; watch the backpressure metrics to spot bottlenecks early.
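Parallelism is typically set per session before a streaming job is created. In recent RisingWave releases the relevant session variable is streaming_parallelism (verify the name against your version’s documentation); a sketch:

```sql
-- Streaming jobs created after this statement use 8-way parallelism.
SET streaming_parallelism = 8;

CREATE MATERIALIZED VIEW busy_view AS
SELECT url, COUNT(*) AS views
FROM clickstream
GROUP BY url;
```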
Integrations and Ecosystem
RisingWave’s connector framework follows the source‑sink model familiar from systems like Flink and Kafka Connect. Out of the box you get:
- Kafka, Pulsar, and Redpanda for streaming ingestion.
- PostgreSQL and MySQL CDC for syncing tables from operational databases.
- Delta Lake, S3, GCS, and Snowflake for sink storage.
Beyond the built‑in connectors, you can write custom Rust UDFs or Python UDTFs (user‑defined table functions) to extend the SQL engine. This flexibility lets you embed ML inference directly in the dataflow, reducing the need for separate inference services.
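As an illustration, recent releases can run Python UDFs embedded in the database; the exact CREATE FUNCTION syntax varies by version, and the function below is a hypothetical example, so treat this as a sketch rather than copy‑paste code:

```sql
CREATE FUNCTION risk_bucket(score INT) RETURNS VARCHAR
LANGUAGE python AS $$
def risk_bucket(score):
    # Map a numeric risk score to a coarse label.
    if score >= 7:
        return 'high'
    if score >= 4:
        return 'medium'
    return 'low'
$$;

-- The UDF can then be used like any built-in function.
SELECT account_id, risk_bucket(risk_score) FROM account_profiles;
```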
Deploying at Scale
When moving to production, most teams opt for a managed RisingWave offering (e.g., on AWS or GCP) that handles node provisioning, autoscaling, and backup. If you prefer self‑hosting, the recommended topology includes:
- Meta service (clustered for high availability).
- Compute nodes (stateless, auto‑scaled based on query load).
- Frontend service (SQL gateway with TLS termination).
- Object storage for state snapshots (S3, GCS, or Azure Blob).
Use Kubernetes StatefulSet for the meta service and Deployment for compute nodes. Helm charts are provided by the RisingWave community to simplify this process.
Best Practices for Schema Evolution
Streaming schemas evolve as business requirements change. RisingWave supports ALTER SOURCE ... ADD COLUMN without downtime, and views can be replaced by standing up a new version alongside the old one. The key is to maintain backward compatibility for at least one version of the dataflow while you roll out the new logic.
Typical workflow:
- Introduce a new column with a default value using ALTER SOURCE ... ADD COLUMN.
- Update downstream views to reference the new column, but keep the old column in the SELECT list.
- After verifying that all consumers handle the new schema, drop the old column.
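The workflow above can be sketched as follows (the referrer column and the v2 view name are illustrative):

```sql
-- 1. Add the new column; rows without it decode as NULL until producers emit it.
ALTER SOURCE clickstream ADD COLUMN referrer VARCHAR;

-- 2. Roll out a new view that reads the new column alongside the old ones.
CREATE MATERIALIZED VIEW page_view_counts_v2 AS
SELECT url, referrer, COUNT(*) AS views
FROM clickstream
GROUP BY url, referrer;

-- 3. After all consumers have migrated, drop the old view.
DROP MATERIALIZED VIEW page_view_counts;
```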
Pro tip: For Kafka sources encoded with Avro or Protobuf, point RisingWave at a schema registry (e.g., Confluent Schema Registry). It will fetch the latest schema automatically, reducing manual schema management.
Monitoring and Observability
RisingWave ships with built‑in Prometheus metrics covering ingestion lag, state size, CPU usage, and query latency. A typical Grafana dashboard includes panels for:
- Ingestion rate (records/sec) per source.
- Watermark lag (seconds) per materialized view.
- State size per compute node.
- Query response time distribution.
Setting alerts on watermark lag > 60 seconds or state size growth > 20 % per hour helps catch anomalies before they impact SLAs.
Security Considerations
RisingWave supports TLS for both client‑frontend and inter‑node communication. Role‑based access control (RBAC) can be enforced via PostgreSQL‑compatible GRANT statements, limiting who can create sources, views, or sinks.
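For example, a read‑only analyst role might be provisioned like this (the user and view names are illustrative; check which GRANT forms your version supports):

```sql
CREATE USER analyst WITH PASSWORD 'change-me';
GRANT SELECT ON MATERIALIZED VIEW page_view_counts TO analyst;
```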
When connecting to external systems (Kafka, S3), use IAM roles or SASL/SCRAM authentication instead of plain text passwords. Secrets can be injected via Kubernetes Secrets and referenced in the connector configuration.
Future Roadmap (as of 2026)
The RisingWave community is actively working on:
- Native support for Delta Lake as both source and sink, enabling seamless lakehouse integration.
- GPU‑accelerated compute nodes for heavy ML inference workloads.
- Improved multi‑tenant isolation with per‑tenant resource quotas.
- SQL extensions for graph queries, expanding use cases to recommendation engines.
Keeping an eye on these upcoming features can help you plan migrations and take advantage of the latest performance boosts.
Conclusion
RisingWave offers a compelling blend of SQL simplicity, real‑time guarantees, and cloud‑native scalability. By treating streaming pipelines as materialized views, it eliminates the operational overhead of traditional ETL frameworks while delivering exactly‑once processing and fault tolerance out of the box. Whether you’re building clickstream dashboards, fraud detection engines, or IoT analytics, RisingWave lets you stay within the familiar relational paradigm and focus on delivering value. With robust connectors, a PostgreSQL‑compatible interface, and an active community, it has earned a place on any real‑time data stack shortlist.