Vercel AI SDK 4.0: Streaming AI Responses
Vercel’s AI SDK has become the go‑to toolkit for developers who want to plug large language models (LLMs) into their web apps with minimal friction. With the release of version 4.0, the SDK now supports true streaming of AI responses, letting you deliver token‑by‑token output straight to the UI. This not only makes the experience feel snappier, it also opens the door to new interaction patterns like progressive summarisation, real‑time suggestions, and dynamic UI updates.
Why Streaming Matters
Traditional API calls return the entire model output in one payload, which means users stare at a loading spinner until the model finishes generating text. In contrast, streaming sends each token (or small batch of tokens) as soon as it’s produced. The UI can render partial results instantly, mimicking the way ChatGPT feels on the web. This reduces perceived latency, keeps users engaged, and can even cut down on bandwidth by allowing early termination of the request.
Beyond UX, streaming aligns better with serverless pricing models. Vercel’s edge functions are billed per execution time, so the quicker you can finish a request, the less you pay. By processing tokens as they arrive, you can decide to stop generation early based on business logic, saving compute cycles.
Getting Started: Project Setup
First, create a fresh Next.js app (or any framework that runs on Vercel) and install the AI SDK. The SDK bundles both the client‑side helpers and the server‑side runtime, so you only need one dependency.
# In your project root
npm install @vercel/ai@latest
# or with Yarn
yarn add @vercel/ai@latest
Next, add your API key for the LLM provider (OpenAI, Anthropic, etc.) to the Vercel environment variables. For OpenAI it looks like this:
# .env.local
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Next.js loads .env.local automatically during local development; for deployed functions, add the same variable under your project’s Environment Variables in the Vercel dashboard. With that in place, you’re ready to start coding.
Basic Streaming Example
Let’s build the classic “Ask the AI” endpoint. The new streamText helper abstracts away the low‑level streaming details and returns a ReadableStream that you can pipe directly to the client.
// pages/api/chat.ts
import { streamText } from '@vercel/ai';
import { OpenAI } from '@vercel/ai/providers/openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
export default async function handler(req, res) {
const { prompt } = req.body; // Next.js has already parsed the JSON body for pages/api routes
const stream = await streamText({
model: openai.chat('gpt-4o-mini'), // new model name in 4.0
messages: [{ role: 'user', content: prompt }],
// Optional: stop token, temperature, etc.
});
// Set correct headers for SSE (Server‑Sent Events)
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
});
// Pipe the stream directly to the response
stream.pipeTo(new WritableStream({
write(chunk) {
// Wrap each text chunk in a small JSON envelope with a 'text' field
const data = JSON.stringify({ text: chunk });
res.write(`data: ${data}\n\n`);
},
close() {
res.end();
},
}));
}
On the client side, POST the prompt with fetch and read the SSE‑formatted response body with a stream reader (EventSource only issues GET requests, so it can’t carry the prompt in a request body). Each incoming token is appended to the chat bubble as soon as it arrives.
// components/ChatBox.jsx
import { useState } from 'react';

export default function ChatBox() {
  const [messages, setMessages] = useState([]);

  const sendPrompt = async (prompt) => {
    // Start an empty assistant message that we fill token by token
    setMessages((prev) => [...prev, { role: 'assistant', content: '' }]);

    // POST the prompt and read the SSE-formatted body as a stream
    const res = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
    });

    const reader = res.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      // Each SSE frame looks like: data: {"text":"..."}\n\n
      // (a production version would buffer frames that are split across reads)
      for (const line of decoder.decode(value, { stream: true }).split('\n')) {
        if (!line.startsWith('data: ')) continue;
        const { text } = JSON.parse(line.slice(6));
        setMessages((prev) => {
          const last = prev[prev.length - 1];
          // Append the token without mutating the previous state object
          return [...prev.slice(0, -1), { ...last, content: last.content + text }];
        });
      }
    }
  };

  // UI rendering omitted for brevity
}
The result is a chat interface that feels instantly responsive, even for long answers.
Handling Partial Responses & Errors
Streaming introduces new edge cases: network hiccups, partial JSON payloads, or early termination by the model. The SDK emits a special finish_reason field when the model decides to stop. You can listen for it and perform clean‑up tasks.
// Enhanced server handler with error handling
export default async function handler(req, res) {
try {
const { prompt } = req.body;
const stream = await streamText({
model: openai.chat('gpt-4o-mini'),
messages: [{ role: 'user', content: prompt }],
});
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
});
const writer = new WritableStream({
write(chunk) {
const data = JSON.stringify({ text: chunk });
res.write(`data: ${data}\n\n`);
},
close() {
res.end();
},
});
// Forward any errors from the SDK to the client
stream.pipeTo(writer).catch((err) => {
console.error('Streaming error:', err);
res.write(`event: error\ndata: ${JSON.stringify({ message: err.message })}\n\n`);
res.end();
});
} catch (e) {
res.status(500).json({ error: e.message });
}
}
On the front end, you can watch for these error frames inside the same reader loop (with a sawError flag declared before the loop) and show a friendly fallback message.
// client-side snippet, inside the reader loop from ChatBox
if (line.startsWith('event: error')) {
  sawError = true; // the next data: line carries the error payload
} else if (sawError && line.startsWith('data: ')) {
  const { message } = JSON.parse(line.slice(6));
  alert(`Oops! Something went wrong: ${message}`);
}
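The same loop is a natural place to react to the finish_reason mentioned above. The exact frame shape depends on the provider, so treat the following as a sketch that assumes the final data: frame carries a finish_reason field alongside the text (setTruncated and setStreaming are illustrative state setters, not SDK APIs):
// Sketch only: assumes the last frame includes finish_reason (verify against your provider)
const payload = JSON.parse(line.slice(6));
if (payload.finish_reason === 'length') {
  setTruncated(true);   // the model hit its token limit – offer a "continue" action
} else if (payload.finish_reason === 'stop') {
  setStreaming(false);  // normal completion – run clean-up (persist the message, re-enable input)
}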
Real‑World Use Case #1: Live Code Completion
Imagine an IDE extension that offers AI‑driven code suggestions as you type. With streaming, each token can be displayed in the editor immediately, giving developers the impression that the AI is “thinking” alongside them.
Backend logic is almost identical to the chat example, but you tailor the prompt with the current file context and ask the model to return a snippet. The client then injects the tokens into the editor’s suggestion overlay.
// pages/api/complete.ts
import { streamText } from '@vercel/ai';
import { OpenAI } from '@vercel/ai/providers/openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export default async function handler(req, res) {
const { fileContent, cursorPosition } = req.body;
const prompt = `
You are a helpful coding assistant.
File content up to cursor:
"""${fileContent.slice(0, cursorPosition)}"""
Provide the next few lines of JavaScript code that continue logically.
`;
const stream = await streamText({
model: openai.chat('gpt-4o-mini'),
messages: [{ role: 'user', content: prompt }],
maxTokens: 120,
});
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
});
stream.pipeTo(new WritableStream({
write(chunk) {
res.write(`data: ${JSON.stringify({ token: chunk })}\n\n`);
},
close() {
res.end();
},
}));
}
In the editor extension, each incoming token is appended to a ghost text overlay, letting the user accept or reject the suggestion on the fly.
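How those tokens reach the editor depends entirely on the host. The following sketch assumes a hypothetical editor object with showGhostText, insertText, and clearGhostText methods, plus a cursorPosition provided by the IDE (illustrative names, not a real extension API):
// Sketch of the extension side; `editor` and `cursorPosition` come from the host IDE (assumed)
let suggestion = '';

function onToken(token) {
  suggestion += token;
  editor.showGhostText(cursorPosition, suggestion); // redraw the overlay with the longer suggestion
}

function onAccept() {
  editor.insertText(cursorPosition, suggestion);    // commit the suggestion into the document
  editor.clearGhostText();
  suggestion = '';
}

function onReject() {
  editor.clearGhostText();                          // discard the suggestion entirely
  suggestion = '';
}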
Real‑World Use Case #2: Incremental Summarisation
Content platforms often need to generate summaries for long articles. Streaming lets you show the summary grow sentence by sentence, which is perfect for a “progress bar” UI that keeps readers informed.
Here’s a concise implementation that splits the source article into roughly 500‑character chunks, feeds them to the model in sequence, and streams the cumulative summary back to the client.
// pages/api/summarise.ts
import { streamText } from '@vercel/ai';
import { OpenAI } from '@vercel/ai/providers/openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export default async function handler(req, res) {
const { article } = req.body;
// Split into manageable chunks (e.g., 500 characters); [\s\S] keeps newlines, unlike "."
const chunks = article.match(/[\s\S]{1,500}/g) || [];
// Prepare a system prompt that asks the model to keep a running summary
const systemPrompt = `
You are summarising a long article. After each paragraph you receive, append a concise sentence to the overall summary.
Only output the updated summary after processing each paragraph.
`;
// Build the full message list up front: the system prompt, then one user message per chunk
const messages = [
{ role: 'system', content: systemPrompt },
...chunks.map((chunk) => ({ role: 'user', content: chunk })),
];
const stream = await streamText({
model: openai.chat('gpt-4o-mini'),
messages,
});
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
});
stream.pipeTo(new WritableStream({
write(chunk) {
// Each chunk is the latest version of the summary
res.write(`data: ${JSON.stringify({ summary: chunk })}\n\n`);
},
close() {
res.end();
},
}));
}
On the front end, you render the summary field as it updates, giving readers a live preview of the final abstract.
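A minimal front‑end sketch for this (the endpoint and payload shape follow the handler above; everything else is ordinary React state):
// components/SummaryPreview.jsx – sketch that consumes the /api/summarise handler above
import { useState } from 'react';

export default function SummaryPreview({ article }) {
  const [summary, setSummary] = useState('');

  const summarise = async () => {
    const res = await fetch('/api/summarise', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ article }),
    });
    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      for (const line of decoder.decode(value, { stream: true }).split('\n')) {
        if (!line.startsWith('data: ')) continue;
        setSummary(JSON.parse(line.slice(6)).summary); // each frame replaces the running summary
      }
    }
  };

  return (
    <div>
      <button onClick={summarise}>Summarise</button>
      <p>{summary}</p>
    </div>
  );
}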
Performance & Cost Optimisation
Streaming reduces latency, but it can also increase the number of HTTP frames sent. To keep costs low, consider the following tactics:
- Batch tokens: The SDK lets you configure tokenChunkSize (default 1). Larger chunks mean fewer SSE events; you can also batch manually in the handler, as sketched below.
- Early stop: Use the stop parameter or inspect finish_reason to halt generation once you have enough content.
- Cache static parts: If a prompt’s prefix never changes (e.g., system instructions), cache the model’s response for that prefix and only stream the variable portion.
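The manual version is a small change to the WritableStream used in the chat handler: buffer incoming chunks and flush one SSE frame every few tokens (FLUSH_EVERY is an arbitrary illustrative value):
// Drop-in replacement for the WritableStream in the chat handler – one SSE frame per ~5 tokens
let buffer = [];
const FLUSH_EVERY = 5;

const writer = new WritableStream({
  write(chunk) {
    buffer.push(chunk);
    if (buffer.length >= FLUSH_EVERY) {
      res.write(`data: ${JSON.stringify({ text: buffer.join('') })}\n\n`);
      buffer = [];
    }
  },
  close() {
    if (buffer.length > 0) {
      res.write(`data: ${JSON.stringify({ text: buffer.join('') })}\n\n`); // flush the remainder
    }
    res.end();
  },
});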
Additionally, Vercel’s edge runtime automatically scales horizontally, but keep an eye on cold‑start times for heavy models. Deploying the function to a dedicated region close to your user base can shave off a few milliseconds—enough to feel instantaneous.
Pro Tips
Tip 1 – Use Content-Type: text/event-stream wisely: Browsers treat SSE as a long‑lived connection. Always set Cache-Control: no-cache and Connection: keep-alive to avoid proxy buffering.
Tip 2 – Graceful fallback: Not all clients support SSE (e.g., older mobile browsers). Detect support and fall back to a traditional fetch that waits for the full payload.
Tip 3 – Throttle UI updates: Rendering every single token can be janky. Batch UI updates with requestAnimationFrame or a debounce of 50 ms for smoother animations (see the sketch after this list).
Tip 4 – Secure your API key: Even though the SDK runs on the edge, never expose the LLM API key to the client. Keep all model calls inside serverless functions.
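For Tip 3, here is a rough sketch of frame‑level throttling: tokens accumulate in a buffer and React state is updated at most once per animation frame (appendToLastAssistantMessage is an illustrative helper, not part of the SDK):
// Throttled rendering: collect tokens, push them into state at most once per animation frame
let pendingText = '';
let frameScheduled = false;

function onToken(token) {
  pendingText += token;
  if (frameScheduled) return;
  frameScheduled = true;
  requestAnimationFrame(() => {
    setMessages((prev) => appendToLastAssistantMessage(prev, pendingText)); // hypothetical helper
    pendingText = '';
    frameScheduled = false;
  });
}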
Testing & Debugging Streaming Endpoints
When developing streaming routes, the Chrome DevTools Network tab shows an EventStream sub‑tab for SSE requests that lists each event as it arrives; the Response tab shows the raw payload. For unit testing, Vercel’s @vercel/edge-runtime library provides a mock ReadableStream you can feed into your handler.
// test/stream.test.ts
import { createRequest, createResponse } from '@vercel/edge-runtime';
import handler from '../pages/api/chat';
test('streams tokens correctly', async () => {
const req = createRequest({
method: 'POST',
body: { prompt: 'Hello world' }, // already-parsed body, matching req.body in the handler
});
const res = createResponse();
await handler(req, res);
// The mock response collects chunks in an array
const chunks = res._getChunks(); // pseudo‑method for illustration
expect(chunks[0]).toContain('Hello');
});
Mocking the LLM itself is also straightforward: replace the OpenAI instance with a stub that yields predetermined tokens. This isolates your streaming logic from external network variability.
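One way to do that, assuming Jest (which the test file above already implies): mock the SDK module so streamText resolves to a ReadableStream of canned tokens.
// Mock the SDK so the handler receives a canned stream instead of calling OpenAI
jest.mock('@vercel/ai', () => ({
  streamText: jest.fn(async () =>
    new ReadableStream({
      start(controller) {
        ['Hello', ' world'].forEach((token) => controller.enqueue(token)); // canned tokens
        controller.close();
      },
    })
  ),
}));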
Advanced Patterns: Multi‑Modal Streaming
Version 4.0 isn’t limited to pure text. The SDK now supports streaming of images, audio, or even function calls. For example, you can ask the model to generate a base64‑encoded PNG line‑by‑line, streaming each chunk to a canvas element in the browser.
// pages/api/draw.ts
import { streamText } from '@vercel/ai';
import { OpenAI } from '@vercel/ai/providers/openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export default async function handler(req, res) {
const { description } = req.body;
const stream = await streamText({
model: openai.vision('gpt-4o-vision'), // hypothetical vision model
messages: [{ role: 'user', content: description }],
// Request image output as base64 chunks
outputFormat: 'image/base64',
});
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
});
stream.pipeTo(new WritableStream({
write(chunk) {
// Each chunk is a base64 fragment
res.write(`data: ${JSON.stringify({ fragment: chunk })}\n\n`);
},
close() {
res.end();
},
}));
}
On the client, you concatenate the fragments into a full data URI and set it as the src of an <img> element, creating a progressive drawing effect.
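A sketch of the client side, assuming the /api/draw handler above streams base64 fragments of a PNG:
// Progressive image rendering: rebuild the data URI as base64 fragments arrive
const img = document.querySelector('#preview');
let base64 = '';

function onFragment(fragment) {
  base64 += fragment;
  // Browsers may refuse to render a truncated PNG, so the preview may only appear
  // once enough of the image has arrived; swapping in the final URI on close is safest.
  img.src = `data:image/png;base64,${base64}`;
}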
Security & Rate Limiting
Streaming can be abused to flood your edge functions with endless token requests. Implement a per‑user rate limiter (e.g., token bucket) inside the handler before invoking streamText. A low‑latency key‑value store such as Vercel KV works well for lightweight counters (Edge Config is read‑optimised and not designed for frequent writes).
import { kv } from '@vercel/kv';

const LIMIT = 30; // requests per minute

async function enforceRateLimit(userId) {
  const key = `rate:${userId}`;
  const record = (await kv.get(key)) || { count: 0, ts: Date.now() };
  const now = Date.now();
  // Reset the window every minute
  if (now - record.ts > 60_000) {
    record.count = 0;
    record.ts = now;
  }
  if (record.count >= LIMIT) {
    throw new Error('Rate limit exceeded');
  }
  record.count += 1;
  await kv.set(key, record);
}
Call enforceRateLimit at the top of each streaming handler, before any headers are written, and translate the thrown error into an HTTP 429 response.
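A sketch of wiring it into the chat handler from earlier (the userId lookup is illustrative; use whatever auth your app already has):
// pages/api/chat.ts – rate-limited variant (sketch)
export default async function handler(req, res) {
  const { prompt, userId } = req.body; // userId comes from your auth layer – illustrative only
  try {
    await enforceRateLimit(userId);
  } catch {
    res.status(429).json({ error: 'Rate limit exceeded. Try again in a minute.' });
    return;
  }
  // ...then stream the response exactly as in the basic example above
}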