Building Voice Assistants with Whisper and GPT-4
Imagine speaking to your laptop or phone and getting instant, intelligent replies—just like talking to a human assistant. With OpenAI’s Whisper for robust speech‑to‑text conversion and GPT‑4 for nuanced language understanding, you can craft a voice assistant that feels natural and powerful. In this guide we’ll walk through the core concepts, set up a development environment, and build a working prototype step by step.
Why Whisper and GPT‑4 Make a Perfect Pair
Whisper excels at transcribing spoken language across dozens of languages, handling background noise, and even recognizing speaker accents. It returns clean text that can be fed directly into GPT‑4, which then interprets intent, generates responses, or calls external APIs. The separation of concerns—audio processing versus language reasoning—keeps each component focused and optimizable.
Beyond raw transcription, Whisper provides segment‑level timestamps and confidence signals (average log‑probability and no‑speech probability), which you can leverage for real‑time feedback. GPT‑4, on the other hand, offers few‑shot learning, function calling, and a deep knowledge base that can answer factual queries, schedule appointments, or control IoT devices. Combining them lets you build assistants that not only hear you but also act intelligently.
Setting Up the Development Environment
First, ensure you have Python 3.10+ installed. We’ll use pip to install the required libraries: openai for API access, torch and openai-whisper for transcription, and pyaudio for microphone capture. Whisper also relies on ffmpeg to decode audio, so install it with your system package manager. If you’re on Windows, you may need a prebuilt PyAudio wheel.
# Install core dependencies
pip install openai torch torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
pip install -U openai-whisper
pip install pyaudio
Next, export your OpenAI API key as an environment variable. This keeps the key out of your source code and works across shells.
# Bash (Linux/macOS)
export OPENAI_API_KEY="sk-..."
# PowerShell (Windows)
$env:OPENAI_API_KEY="sk-..."
With the environment ready, you can import the libraries and perform a quick test to verify Whisper can transcribe a short audio file.
Basic Whisper Transcription
Whisper offers several model sizes—from tiny (fast, low‑accuracy) to large (slow, high‑accuracy). For a voice assistant, base or small strikes a good balance between latency and quality.
import whisper

# Load the model once at startup
model = whisper.load_model("small")

def transcribe(audio_path: str) -> str:
    result = model.transcribe(audio_path)
    return result["text"]

# Quick sanity check
print(transcribe("sample.wav"))
The function returns a plain string, which we’ll later feed into GPT‑4. Notice how Whisper also returns segments with timestamps—useful for highlighting spoken words in a UI.
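If you want those timestamps, the result dictionary from the openai-whisper package also carries a segments list. A quick way to inspect it, reusing the same sample.wav as above:

# Each segment includes start/end times in seconds plus the recognized text.
segments = model.transcribe("sample.wav")["segments"]
for seg in segments:
    print(f'[{seg["start"]:6.2f}s - {seg["end"]:6.2f}s] {seg["text"].strip()}')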
Connecting Whisper to GPT‑4
Now we need a thin wrapper that sends the transcribed text to the Chat Completions endpoint using the official openai Python SDK (v1 client). We’ll use the gpt-4o-mini model for fast responses; swap in gpt-4o for richer output if latency permits.
import os

from openai import OpenAI

# The client picks up OPENAI_API_KEY from the environment.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def chat_with_gpt4(prompt: str, system_msg: str = "You are a helpful voice assistant.") -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
With these two building blocks—transcribe and chat_with_gpt4—we can create a simple “listen‑and‑reply” loop that processes microphone input in real time.
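As a quick end‑to‑end smoke test on a pre‑recorded clip (reusing the sample.wav from earlier):

# Transcribe a saved clip, then hand the text to the chat model.
print(chat_with_gpt4(transcribe("sample.wav")))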
Real‑Time Voice Assistant Loop
Capturing audio from the microphone continuously can be done with pyaudio. We’ll record short chunks (e.g., 3 seconds), save them to a temporary WAV file, transcribe, and then generate a response. The loop runs until the user says a stop phrase like “goodbye”.
import os
import tempfile
import uuid
import wave

import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 3
STOP_PHRASES = {"goodbye", "exit", "stop listening"}
def record_chunk(filename: str):
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)
    frames = []
    for _ in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        frames.append(stream.read(CHUNK))
    stream.stop_stream()
    stream.close()
    sample_width = p.get_sample_size(FORMAT)  # query before terminating PyAudio
    p.terminate()
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
def voice_assistant():
    print("🗣️ Voice assistant ready. Speak now...")
    while True:
        temp_file = os.path.join(tempfile.gettempdir(), f"{uuid.uuid4()}.wav")
        record_chunk(temp_file)
        user_text = transcribe(temp_file).strip()
        os.remove(temp_file)
        if not user_text:
            continue
        print(f"You said: {user_text}")
        # Check for stop condition
        if any(phrase in user_text.lower() for phrase in STOP_PHRASES):
            print("👋 Goodbye!")
            break
        assistant_reply = chat_with_gpt4(user_text)
        print(f"Assistant: {assistant_reply}")
        # Optional: use a TTS engine (e.g., pyttsx3) to speak the reply
        # speak(assistant_reply)

if __name__ == "__main__":
    voice_assistant()
This script demonstrates a minimal end‑to‑end pipeline: capture → transcribe → reason → respond. You can replace the print statements with a Text‑to‑Speech (TTS) library like pyttsx3 or a cloud service for a fully vocal experience.
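For example, a minimal offline speak() helper built on pyttsx3 might look like the sketch below; the default voice and speaking rate are platform‑dependent, so treat the settings as a starting point.

import pyttsx3

# Initialize the TTS engine once; re-creating it on every reply adds latency.
_tts_engine = pyttsx3.init()

def speak(text: str) -> None:
    _tts_engine.say(text)
    _tts_engine.runAndWait()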
Adding Contextual Memory
Real‑world assistants need to remember the conversation history. The chat API accepts the full list of messages on every call, so we maintain a rolling buffer ourselves: a deque of the last N exchanges, pruning older entries to keep the prompt small.
from collections import deque

MAX_TURNS = 6  # total messages (user + assistant) to keep
history = deque(maxlen=MAX_TURNS)

def chat_with_memory(user_input: str) -> str:
    # Append the new user message
    history.append({"role": "user", "content": user_input})
    # Build the message list for the API
    messages = [{"role": "system", "content": "You are a concise, helpful voice assistant."}]
    messages.extend(history)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2,
    )
    assistant_msg = response.choices[0].message.content.strip()
    history.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg
Now the assistant can refer back to earlier questions, such as “What time is my meeting tomorrow?” after you’ve already asked about the agenda. Note that the deque caps the number of messages, not tokens, so unusually long utterances can still push the prompt toward the model’s context limit.
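If you need a tighter bound, you can prune by an estimated token count instead of a fixed message count. The sketch below uses a rough four‑characters‑per‑token heuristic as an assumption; a tokenizer such as tiktoken would give exact counts.

MAX_PROMPT_TOKENS = 3000  # illustrative budget, not an official limit

def estimated_tokens(messages) -> int:
    # Crude estimate: roughly 4 characters per token for English text.
    return sum(len(m["content"]) // 4 for m in messages if m["content"])

def prune_history() -> None:
    # Drop the oldest messages until the estimate fits the budget.
    while len(history) > 2 and estimated_tokens(list(history)) > MAX_PROMPT_TOKENS:
        history.popleft()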
Function Calling for Structured Actions
One of GPT‑4’s most powerful features is function calling. You define a JSON schema for an action (e.g., turning on a smart light), and GPT‑4 will return a structured request instead of free‑form text. This lets the assistant safely interact with external services.
Define the function schema
def turn_on_light(room: str) -> str:
    # Placeholder for actual IoT integration
    return f"✅ Light in {room} turned on."

function_schema = [
    {
        "name": "turn_on_light",
        "description": "Turn on a smart light in a specific room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string", "description": "Name of the room"}
            },
            "required": ["room"]
        }
    }
]
Invoke GPT‑4 with function support
import json

def chat_with_functions(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        functions=function_schema,
        function_call="auto",  # let the model decide when to call
    )
    message = response.choices[0].message
    if message.function_call:
        fn_name = message.function_call.name
        args = json.loads(message.function_call.arguments)  # arguments arrive as a JSON string
        if fn_name == "turn_on_light":
            result = turn_on_light(**args)
            # Send the result back to the model for a natural-language reply
            follow_up = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "user", "content": user_input},
                    {"role": "assistant", "content": None,
                     "function_call": {"name": fn_name, "arguments": message.function_call.arguments}},
                    {"role": "function", "name": fn_name, "content": result},
                ],
            )
            return follow_up.choices[0].message.content.strip()
        return f"Unknown function requested: {fn_name}"
    return message.content.strip()
Now you can say “Turn on the kitchen lights,” and the assistant will call turn_on_light behind the scenes, then respond with a friendly confirmation.
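For example:

# The model decides to call turn_on_light; we return its follow-up reply.
print(chat_with_functions("Turn on the kitchen lights"))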
Real‑World Use Cases
- Smart Home Hub: Combine Whisper, GPT‑4, and function calls to control lights, thermostats, and media players via voice.
- Customer Support Bot: Let agents dictate notes while GPT‑4 suggests resolutions, logs tickets, and escalates when needed.
- Accessibility Tool: Provide a hands‑free interface for users with motor impairments, translating speech into commands for desktop applications.
Each scenario benefits from Whisper’s low‑latency transcription and GPT‑4’s ability to understand intent, ask clarifying questions, and safely execute actions.
Performance and Cost Considerations
Whisper runs locally, so transcription cost is essentially zero after hardware acquisition. However, GPU acceleration dramatically reduces latency; a small model on a modern laptop GPU can transcribe a 3‑second clip in under 200 ms. GPT‑4 calls are billed per token, so keep prompts concise and reuse system messages when possible.
Batch multiple user utterances when appropriate, but avoid excessive batching for real‑time interaction—users expect sub‑second responses. Monitoring token usage in logs helps you stay within budget while maintaining quality.
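One lightweight approach is to log the usage block that comes back with every chat completion; call a helper like the sketch below on the raw response object inside chat_with_gpt4 (or chat_with_memory) before extracting the text.

def log_usage(response) -> None:
    # The usage object reports prompt, completion, and total token counts.
    u = response.usage
    print(f"tokens: prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")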
Pro Tip: Load the Whisper model once at startup and treat it as a process‑wide singleton, and reuse a single OpenAI client instance so its HTTP connection pool is shared across requests. This cuts startup and network overhead and improves overall responsiveness.
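A simple way to do both is to hide the expensive objects behind lru_cache so each is created once per process; call get_whisper_model() and get_openai_client() everywhere instead of constructing new instances.

from functools import lru_cache

import whisper
from openai import OpenAI

@lru_cache(maxsize=None)
def get_whisper_model(name: str = "small"):
    # Loaded on first use, then reused for every transcription.
    return whisper.load_model(name)

@lru_cache(maxsize=None)
def get_openai_client() -> OpenAI:
    # One client instance keeps a pooled HTTP connection to the API.
    return OpenAI()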
Testing and Debugging Strategies
When building voice assistants, two failure modes dominate: inaccurate transcription and misunderstood intent. To diagnose transcription errors, log Whisper’s per‑segment confidence signals (average log‑probability and no‑speech probability) alongside the original audio. For intent issues, print the full messages payload sent to GPT‑4 and the raw JSON response, especially when function calling is involved.
Automated tests can simulate audio by feeding pre‑recorded WAV files into the transcribe function and asserting expected text. For the language layer, call the chat completion endpoint with temperature=0 so outputs are as close to deterministic as possible for unit tests.
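A couple of illustrative pytest‑style tests; the fixture path and expected phrases are assumptions you would replace with your own recordings.

# test_assistant.py
def test_transcribe_known_clip():
    # Hypothetical fixture: a short recording of someone saying "hello world".
    text = transcribe("tests/fixtures/hello_world.wav")
    assert "hello" in text.lower()

def test_chat_answers_simple_math():
    reply = chat_with_gpt4("What is 2 + 2? Answer with just the number.")
    assert "4" in reply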
Deploying to the Cloud
If you want your assistant to be reachable from mobile devices, wrap the pipeline in a FastAPI endpoint. The client records audio, uploads it, receives the assistant’s reply, and optionally streams TTS back.
import os
import tempfile
import uuid

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse

app = FastAPI()

@app.post("/voice")
async def handle_voice(file: UploadFile = File(...)):
    # Save the uploaded file temporarily
    temp_path = os.path.join(tempfile.gettempdir(), f"{uuid.uuid4()}.wav")
    with open(temp_path, "wb") as f:
        f.write(await file.read())
    user_text = transcribe(temp_path)
    os.remove(temp_path)
    if any(p in user_text.lower() for p in STOP_PHRASES):
        reply = "Goodbye! 👋"
    else:
        reply = chat_with_gpt4(user_text)
    return JSONResponse(content={"transcript": user_text, "reply": reply})
Deploy the FastAPI app to a platform like Render, Fly.io, or Azure App Service. Make sure the hosting environment has sufficient CPU (or GPU) for Whisper; otherwise, consider using the hosted Whisper API if latency permits.
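If the host has no GPU, a drop‑in replacement for transcribe that calls the hosted transcription endpoint (model name whisper-1) could look like this:

from openai import OpenAI

client = OpenAI()

def transcribe_hosted(audio_path: str) -> str:
    # Send the audio to OpenAI's hosted transcription endpoint
    # instead of running Whisper locally.
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text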
Security and Privacy Best Practices
Audio data often contains sensitive information. Store recordings only transiently, delete them immediately after transcription, and encrypt any logs that contain user text. Use OpenAI’s data controls to opt out of data logging for your API key.
When exposing function calls, whitelist only safe actions. Never let the model execute arbitrary shell commands; always validate arguments against a schema and use sandboxed wrappers for hardware control.
Pro Tip: Implement a “safe mode” flag that disables function calls for untrusted users. This prevents accidental activation of critical IoT devices during testing.
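One way to wire that in, assuming an ASSISTANT_SAFE_MODE environment variable of your own choosing:

import os

# Safe mode defaults to ON; flip the variable only for trusted sessions.
SAFE_MODE = os.getenv("ASSISTANT_SAFE_MODE", "1") == "1"

def guarded_turn_on_light(room: str) -> str:
    if SAFE_MODE:
        return f"(safe mode) Skipped turning on the light in {room}."
    return turn_on_light(room)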
Extending the Assistant with Multilingual Support
Whisper’s multilingual capability means you can accept input in many languages without extra models. Pass the detected language to GPT‑4 by adding a system message like “You are a multilingual assistant; respond in the same language as the user.” GPT‑4 will automatically continue in that language, enabling global deployments.
Here’s a quick snippet to detect language using Whisper’s metadata:
def transcribe_with_lang(audio_path: str):
    result = model.transcribe(audio_path, language=None)  # let Whisper auto-detect
    text = result["text"]
    lang = result["language"]
    return text, lang
Combine the returned lang with a dynamic system prompt to keep the conversation consistent.
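For example, a small wrapper that threads the detected language code into the system prompt:

def chat_in_user_language(audio_path: str) -> str:
    text, lang = transcribe_with_lang(audio_path)
    # Whisper returns an ISO language code such as "en" or "es".
    system_msg = (
        "You are a helpful multilingual voice assistant. "
        f"The user is speaking the language with code '{lang}'; reply in that language."
    )
    return chat_with_gpt4(text, system_msg=system_msg)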
Future Directions
As OpenAI releases newer models (e.g., GPT‑4 Turbo) and Whisper updates, you can swap them in with minimal code changes. Adding visual context—like a camera feed—opens the door to multimodal assistants that can see and hear, enabling scenarios such as “Read the label on that bottle.”
Experiment with on‑device inference using TensorRT or ONNX for Whisper to eliminate any cloud dependency, especially for privacy‑critical applications.
Conclusion
Building a voice assistant with Whisper and GPT‑4 blends cutting‑edge speech recognition with powerful language reasoning. By structuring the pipeline into clear modules—audio capture, transcription, contextual memory, and response generation—you can swap in newer models, add function calls, and extend the assistant as your needs grow.