Weights & Biases Weave: LLM Observability Guide
Observability has become a core requirement for anyone building or scaling large language model (LLM) applications. With the rise of prompt engineering, chain orchestration, and fine‑tuning, developers need a way to peek inside the black box, understand latency bottlenecks, and debug hallucinations. Weights & Biases (W&B) Weave offers a turnkey observability layer that integrates with popular LLM frameworks like LangChain, LlamaIndex, and OpenAI’s SDK. In this guide, we’ll walk through setting up Weave, instrumenting your code, and extracting actionable insights—all with concrete, runnable examples.
Why Observability Matters for LLMs
Traditional software observability focuses on request latency, error rates, and resource consumption. LLMs add new dimensions: token‑level latency, prompt composition, model version drift, and downstream evaluation metrics such as relevance or toxicity. Without visibility, a single mis‑crafted prompt can silently degrade user experience or inflate costs.
Weave captures these signals automatically, enriching them with metadata like user IDs, experiment tags, and custom metrics. The result is a unified dashboard where you can trace a single inference from API call to token generation, compare runs across model versions, and set alerts on anomalous behavior.
Getting Started: Install and Initialize Weave
First, add the Weave SDK to your environment. Weave ships as its own package alongside the core wandb client; it works with both pip and conda, but pip is the most common route.
pip install weave wandb # installs the Weave SDK and the core wandb client
Next, create a W&B project (or reuse an existing one) and log in via the CLI. This step only needs to be done once per machine.
wandb login # paste your API key when prompted
Now you can initialize Weave in your Python script. The weave.init call activates LLM tracing for a project, while an optional wandb.init run gives you a place to log custom metrics alongside the traces.
import wandb
import weave

# Activate Weave tracing – subsequent LLM calls are captured under this project
weave.init("llm-observability")

# Optionally start a W&B run for custom metrics – name it meaningfully for later lookup
run = wandb.init(
    project="llm-observability",
    name="weather-chatbot-v1",
    config={"model": "gpt-4o-mini", "temperature": 0.7}
)
Instrumenting LangChain with Weave
LangChain is a popular orchestration library that chains together prompts, LLM calls, and post‑processing. Weave automatically hooks into LangChain’s LLMChain and ChatOpenAI classes, logging each prompt, response, and token latency.
Below is a minimal weather‑assistant built with LangChain. The code is ready to run; just replace YOUR_OPENAI_API_KEY with a valid key.
import os
import weave
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
weave.init("llm-observability")  # activate tracing before any LLM calls

# Define a simple prompt template
template = """
You are a helpful weather assistant. Answer the user's question in one concise sentence.
User: {question}
Assistant:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the LLM instance – gpt-4o-mini is a chat model, so use ChatOpenAI
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.7)

# Build the chain – Weave traces each run automatically
chain = LLMChain(llm=llm, prompt=prompt)

def get_weather_answer(question: str) -> str:
    # This call is logged by Weave: prompt, response, latency, token count
    return chain.run(question)

# Example usage
print(get_weather_answer("Will it rain in Seattle tomorrow?"))
After executing the script, head over to your W&B dashboard. You’ll see a new run under “LLM Observability” with a timeline view that breaks down prompt creation, API request, and token generation. Click any segment to view the exact prompt text and the model’s raw response.
Pro tip: Tag runs with run.tags = ["weather", "production"] to filter them later. You can also add custom metrics like run.log({"cost_usd": token_cost}) for financial monitoring.
Adding Custom Metadata and Metrics
While Weave captures a rich default set, you’ll often want to surface business‑specific context: user segment, request ID, or A/B test bucket. You can inject arbitrary key‑value pairs into the run context using wandb.log before or after each LLM call.
import time

def get_weather_answer(question: str, user_id: str, experiment_group: str) -> str:
    # Log request-level metadata
    wandb.log({
        "user_id": user_id,
        "experiment_group": experiment_group,
        "question": question,
        "timestamp": time.time()
    })
    answer = chain.run(question)
    # Compute a simple relevance metric (placeholder)
    relevance = 1.0 if "rain" in answer.lower() else 0.8
    # Log the response and custom metric
    wandb.log({
        "answer": answer,
        "relevance_score": relevance
    })
    return answer
When you view the run in the dashboard, these fields appear as columns in the “Table” view, letting you slice and dice results by user segment or experiment group.
Tracking Token‑Level Costs
OpenAI charges per 1,000 tokens. Knowing the exact token count per request helps you predict monthly spend and spot outliers.
from langchain.callbacks import get_openai_callback

def get_weather_answer(question: str, user_id: str) -> str:
    # Pre-log request details
    wandb.log({"user_id": user_id, "question": question})
    # Capture token usage with LangChain's OpenAI callback
    with get_openai_callback() as cb:
        answer = chain.run(question)
    wandb.log({
        "prompt_tokens": cb.prompt_tokens,
        "completion_tokens": cb.completion_tokens,
        "total_tokens": cb.total_tokens,
        "cost_usd": cb.total_cost  # computed from LangChain's built-in price table
    })
    return answer
Now the “Metrics” tab in Weave displays a live cost curve, and you can set alerts for spikes that exceed a threshold.
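To sanity-check the dashboard numbers, remember that cost per request is just a linear function of token counts. A stdlib sketch – the per-million-token rates below are illustrative placeholders, not official pricing:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      prompt_rate: float = 0.15, completion_rate: float = 0.60) -> float:
    # Rates are USD per 1M tokens (placeholder values – check your provider's price sheet)
    return (prompt_tokens * prompt_rate + completion_tokens * completion_rate) / 1_000_000

# A request with 1,200 prompt tokens and 300 completion tokens
print(f"${estimate_cost_usd(1200, 300):.6f}")
```

Keeping this calculation in your own code lets you cross-check the logged cost_usd values against your provider's invoice.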
Observing Retrieval‑Augmented Generation (RAG)
Many production LLM apps combine a vector store with a generator – a pattern known as Retrieval‑Augmented Generation. Observability becomes trickier because you need to trace both the retrieval step and the generation step.
Because Weave traces LangChain chains end to end, the retrieval step inside a RetrievalQA chain shows up as its own span – this works with popular vector stores like FAISS, Pinecone, and Chroma. Below is a simple RAG pipeline using langchain and Chroma that logs each stage.
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document

# 1️⃣ Load documents (could be from a DB, S3, etc.)
documents = [
    Document(page_content="Seattle has a maritime climate with mild, wet winters.", metadata={"source": "wiki"}),
    Document(page_content="Seattle's average annual rainfall is about 38 inches.", metadata={"source": "climate-data"})
]

# 2️⃣ Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# 3️⃣ Build a RetrievalQA chain – Weave will log the retrieval query and results
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True  # required to inspect the retrieved chunks
)

def answer_question(question: str, user_id: str):
    # Log the incoming question
    wandb.log({"user_id": user_id, "question": question})
    # Run the RAG pipeline
    result = qa({"query": question})
    # Log the retrieved chunks (as plain text) and the final answer
    wandb.log({
        "retrieved_chunks": [doc.page_content for doc in result["source_documents"]],
        "answer": result["result"]
    })
    return result["result"]
print(answer_question("How much rain does Seattle get each year?", "user_42"))
In the Weave UI, you’ll see a nested view: the top‑level LLM call, and underneath it a “retrieval” sub‑trace showing the vector store query, distance scores, and which documents were selected. This hierarchical view is invaluable when you need to pinpoint whether a poor answer stemmed from irrelevant retrieval or from the generator itself.
Pro tip: For a handful of queries, log the raw embedding vectors yourself – e.g., wandb.log({"query_embedding": embeddings.embed_query(question)}) – to verify that your vector store is indexing text as expected.
Monitoring Latency and Scaling
Latency is a first‑order metric for user satisfaction. Weave automatically records the time spent in each stage (prompt preparation, API round‑trip, token decoding). To visualize trends, open the “Latency” chart in the dashboard and apply a moving average.
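The same smoothing can be reproduced client-side once you export raw latencies – a minimal trailing moving average in plain Python (the sample values are invented):

```python
from collections import deque

def moving_average(values, window=3):
    # Trailing moving average: each point averages up to `window` recent values
    out, buf = [], deque(maxlen=window)
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

latencies_ms = [120, 95, 300, 110, 105, 98]  # invented samples
print(moving_average(latencies_ms, window=3))
```

Smoothing in your own analysis code is useful when you want the exact same window applied across runs exported from different projects.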
If you’re deploying on a Kubernetes cluster, you can run wandb in offline mode and upload run data later with the wandb sync command, or pull metrics through the W&B public API into systems like Prometheus or Grafana. This enables alerts that trigger when 95th‑percentile latency exceeds your SLA.
# Flush and finish the run so all logs are persisted
run = wandb.run
run.finish()
# Later, e.g. in a CI pipeline, upload any offline runs
wandb sync --project llm-observability --entity your_org ./wandb/offline-run-*
When you combine these exports with your infrastructure metrics (CPU, GPU utilization), you can answer questions like “Did a spike in GPU memory cause higher token latency?” or “Is scaling the vector store reducing retrieval time?”
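Answering such questions quantitatively usually comes down to correlating two exported metric series. A stdlib Pearson correlation sketch – the series values here are invented for illustration:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length series
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

gpu_mem_gb = [10.1, 10.3, 11.8, 14.2, 13.9, 10.4]  # invented samples
latency_ms = [110, 112, 150, 240, 230, 118]
print(round(pearson(gpu_mem_gb, latency_ms), 3))
```

A coefficient near 1.0 suggests the two series rise together; correlation is not causation, but it tells you which hypotheses are worth a deeper look.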
Evaluating Model Quality Over Time
Observability isn’t just about performance; it’s also about output quality. Weave lets you log evaluation scores alongside each inference. For LLMs, common metrics include BLEU, ROUGE, and custom relevance or safety scores.
Below is a tiny evaluation loop that compares model answers against a ground‑truth dataset. The loop logs both the raw answer and the computed F1 score.
from sklearn.metrics import f1_score

# Mock dataset
eval_set = [
    {"question": "Is it sunny in Los Angeles today?", "ground_truth": "Yes, it is sunny."},
    {"question": "Will it snow in Miami this winter?", "ground_truth": "No, it does not snow in Miami."}
]

def evaluate_model():
    scores = []
    for item in eval_set:
        answer = get_weather_answer(item["question"], user_id="eval_bot")
        # Very naive binary relevance: check whether the answer says "yes"
        pred = 1 if "yes" in answer.lower() else 0
        true = 1 if "yes" in item["ground_truth"].lower() else 0
        scores.append((pred, true))
        # Log per-example metrics
        wandb.log({
            "question": item["question"],
            "ground_truth": item["ground_truth"],
            "model_answer": answer,
            "binary_pred": pred,
            "binary_true": true
        })
    # Compute overall F1
    preds, trues = zip(*scores)
    overall_f1 = f1_score(trues, preds)
    wandb.log({"overall_f1": overall_f1})
    print(f"Overall F1: {overall_f1:.2f}")
evaluate_model()
After the run finishes, the Weave dashboard will display a “Table” view with each evaluation example, and a “Metrics” view where you can track overall_f1 across model versions. This makes it trivial to spot regression after a new fine‑tune or after switching providers.
Pro tip: Store the evaluation dataset in a W&B Artifact. This way, every run is reproducibly tied to the exact version of the test set you used, eliminating accidental drift.
Debugging Hallucinations with Prompt Tracing
Hallucinations—responses that are plausible but factually incorrect—are a notorious LLM pain point. Weave’s prompt tracing lets you replay any inference with the exact same context, making it easier to experiment with prompt tweaks.
Because Weave records the full prompt, parameters, and model settings for every traced call, no extra flags are needed: open any trace in the UI and use the replay option to generate a new answer with a modified temperature or a different system prompt.
# Any traced call can be replayed from the UI – normal usage stays the same
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt)

answer = chain.run("What is the capital of France?")
When you open the run in Weave, you’ll see a “Replay” button next to the prompt. Clicking it opens a modal where you can adjust temperature, max tokens, or even inject a new system message. The UI then shows side‑by‑side comparisons of the original and replayed outputs, highlighting any changes in token usage or latency.
Security and Privacy Considerations
Observability pipelines often transmit sensitive user data to external services. Weave respects W&B’s security model: logs are encrypted in transit and at rest, and you can configure a private on‑premise W&B server if you need full data residency.
To redact personally identifiable information (PII) before logging, run payloads through a sanitizing step of your own. A small regex-based helper covers common patterns like email addresses and phone numbers:
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def safe_log(payload: dict):
    # Mask emails and phone numbers in string values before logging
    sanitized = {
        k: PHONE_RE.sub("[PHONE]", EMAIL_RE.sub("[EMAIL]", v)) if isinstance(v, str) else v
        for k, v in payload.items()
    }
    wandb.log(sanitized)

# Example usage
safe_log({"user_message": "My email is alice@example.com and my phone is 555-123-4567"})
Additionally, depending on your Weave version you may be able to configure the tracer to skip logging raw model inputs and outputs entirely – worth checking for high‑risk domains like healthcare.
Advanced: Custom Callbacks
For full control, you can attach your own callback that fires on every LLM request. This is handy when you want to push metrics to an external monitoring system or enrich logs with domain‑specific context; LangChain's BaseCallbackHandler is the stable hook point for this.
from langchain.callbacks.base import BaseCallbackHandler

class MyMetricsCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # serialized describes the model; prompts is the list of prompt strings
        print(f"LLM start – prompt length {len(prompts[0])}")

    def on_llm_end(self, response, **kwargs):
        # For OpenAI models, response.llm_output carries token usage
        usage = (response.llm_output or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        # Push to an external system (pseudo-code – external_monitor is hypothetical)
        # external_monitor.record("llm_tokens", tokens)
        print(f"LLM end – total tokens {tokens}")

# Attach the handler when building the chain
chain = LLMChain(llm=llm, prompt=prompt, callbacks=[MyMetricsCallback()])
Once attached, the callback runs alongside the built‑in observer, giving you a dual view: W&B’s UI plus your custom dashboards.
Best Practices Checklist
- Initialize Weave as early as possible in your entrypoint.
- Tag runs with meaningful identifiers (e.g., ["beta", "v2.1"]).
- Log custom business metrics (e.g., cost, relevance) alongside each inference.