LiteLLM: Unified Python SDK for 100+ LLM Providers
Imagine a world where you can switch between OpenAI, Anthropic, Cohere, or any of the 100+ LLM providers with a single line of Python. That’s the promise of LiteLLM – a thin, unified SDK that abstracts away provider‑specific quirks while keeping you in full control of prompts, tokens, and costs.
In this deep dive we’ll explore how LiteLLM works under the hood, walk through three practical code snippets, and uncover real‑world scenarios where the SDK shines. Whether you’re building a multi‑tenant SaaS, a research prototype, or a cost‑aware chatbot, LiteLLM can become the glue that ties your LLM workflow together.
Getting Started: Installation & First Call
The SDK is available on PyPI and can be installed with a single command. It pulls in optional dependencies only when you need them, keeping the base install lightweight.
pip install litellm
After installation, you need an API key for at least one provider. LiteLLM reads keys from environment variables, a .env file, or a dictionary you pass directly.
import os
from litellm import completion
# Set your OpenAI key – you can also use ANTHROPIC_API_KEY, COHERE_API_KEY, etc.
os.environ["OPENAI_API_KEY"] = "sk-..."
# A minimal completion request
response = completion(
    model="gpt-4o-mini",  # Provider-agnostic model name
    messages=[{"role": "user", "content": "Explain quantum tunneling in one sentence."}],
    temperature=0.7,
)
print(response.choices[0].message["content"])
Notice how the model argument uses a generic name. LiteLLM maps gpt-4o-mini to the correct endpoint for OpenAI, but you could just as easily pass claude-3-sonnet-20240229 and the SDK would route the request to Anthropic.
Core Concepts: Provider Abstraction, Token Tracking, and Cost Management
LiteLLM’s power comes from three pillars: unified provider abstraction, automatic token usage reporting, and built‑in cost tracking.
Unified Provider Abstraction
Each LLM provider has its own request schema. LiteLLM normalizes these into a common completion() signature. Under the hood, the SDK selects the correct HTTP client, payload format, and authentication method based on the model prefix (e.g., openai/, anthropic/, cohere/).
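The prefix-routing idea can be sketched in a few lines of plain Python. This is an illustrative toy, not LiteLLM's actual internals; the real SDK also handles per-provider authentication and payload mapping:

```python
# Toy router: pick a provider from the model-string prefix.
# Illustrative only – LiteLLM's real routing covers 100+ providers.
def route(model: str) -> str:
    provider, sep, _name = model.partition("/")
    # Bare names like "gpt-4o-mini" fall back to a default provider.
    return provider if sep else "openai"

print(route("anthropic/claude-3-sonnet-20240229"))  # anthropic
print(route("gpt-4o-mini"))                         # openai
```

Keeping routing keyed on a plain string is what lets a single config value swap the entire backend.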
Automatic Token Usage Reporting
Every response includes a usage field that mirrors the provider's native token counts. LiteLLM also attaches call metadata (such as the computed response cost) to a _hidden_params attribute on the response, which is invaluable for debugging.
print(response.usage)  # prompt_tokens=12, completion_tokens=23, total_tokens=35
print(response._hidden_params)  # provider, cost, and other call metadata
Built‑in Cost Tracking
Because token pricing varies dramatically across models, LiteLLM maintains a regularly updated pricing table and can compute the USD cost of each call on the fly via completion_cost().
from litellm import completion_cost
print(f"Cost: ${completion_cost(completion_response=response):.5f}")  # e.g., Cost: $0.00021
For enterprises that need to enforce per-user or per-project budgets, you can register a success callback (litellm.success_callback) to log costs, or wrap calls to reject requests that would exceed a limit.
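The cost arithmetic itself is simple. Here is a hedged sketch with made-up per-token prices (LiteLLM ships and maintains its own pricing table; do not treat these numbers as real):

```python
# Per-token USD prices – made-up numbers for illustration only.
PRICES = {
    "gpt-4o-mini": {"prompt": 0.15 / 1_000_000, "completion": 0.60 / 1_000_000},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = prompt tokens * prompt rate + completion tokens * completion rate.
    p = PRICES[model]
    return prompt_tokens * p["prompt"] + completion_tokens * p["completion"]

print(f"${call_cost('gpt-4o-mini', 12, 23):.8f}")  # $0.00001560
```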
Advanced Patterns: Fallbacks, Streaming, and Callbacks
Beyond a single request, LiteLLM shines when you need resilience, real‑time output, or custom observability. The following sections demonstrate three advanced patterns that are common in production.
Provider Fallbacks
Imagine a SaaS that guarantees 99.9% uptime for LLM responses. With LiteLLM you can define a priority list of providers; if the primary model fails (rate limit, outage, etc.) the SDK automatically retries with the next option.
from litellm import completion

response = completion(
    model="openai/gpt-4o-mini",  # Primary
    messages=[{"role": "user", "content": "Summarize the latest AI news in 3 bullet points."}],
    temperature=0.5,
    max_tokens=150,
    fallbacks=[
        "anthropic/claude-3-sonnet-20240229",  # Secondary
        "cohere/command-r-plus",               # Tertiary
    ],
)
print(response.choices[0].message["content"])
The fallbacks argument takes a list of backup models. If the primary call fails, LiteLLM retries down the list until a request succeeds, reusing the original messages and parameters for each attempt.
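Conceptually, the fallback logic looks like the loop below. This is an illustrative sketch, not the SDK's implementation (the real code distinguishes retryable errors such as rate limits from hard failures):

```python
# Try each model in order; re-raise the last error if every attempt fails.
def with_fallbacks(models, call):
    last_exc = None
    for model in models:
        try:
            return call(model)
        except Exception as exc:  # real code would catch provider-specific errors
            last_exc = exc
    raise last_exc

# Usage with a stand-in "call" whose primary model always fails:
def fake_call(model):
    if model == "primary":
        raise RuntimeError("rate limited")
    return f"answer from {model}"

print(with_fallbacks(["primary", "secondary"], fake_call))  # answer from secondary
```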
Streaming Responses
For interactive applications—like a coding assistant or a live chat UI—streaming tokens as they arrive improves perceived latency. LiteLLM exposes a generator that yields partial messages.
from litellm import completion

stream = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial delta; content can be None on the final chunk
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
When stream=True, the SDK returns an iterator that yields the same JSON shape as the non‑streaming response, but each delta only contains the newly generated token(s).
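A common pattern is to accumulate the deltas into the full message as they arrive. Sketched here with hard-coded stand-in chunks in place of a live stream:

```python
# Stand-in for streamed delta contents (the final chunk is often None).
deltas = ["def merge(", "a, b):", None]

# Treat None as empty so the final chunk doesn't break the join.
full_text = "".join(d or "" for d in deltas)
print(full_text)  # def merge(a, b):
```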
Custom Callbacks for Observability
LiteLLM lets you register callback functions that run after a successful call or on failure (litellm.success_callback and litellm.failure_callback). This hook system is perfect for logging, tracing, and cost accounting.
import litellm
from litellm import completion, completion_cost

def log_success(kwargs, completion_response, start_time, end_time):
    print("[LiteLLM] Model:", kwargs.get("model"))
    print("[LiteLLM] Received", completion_response.usage.total_tokens, "tokens")
    print("[LiteLLM] Cost:", f"${completion_cost(completion_response=completion_response):.5f}")
    # You could push this to a monitoring system here

# Register the callback
litellm.success_callback = [log_success]

# Normal call – the callback runs automatically after a successful response
completion(
    model="anthropic/claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Give me a haiku about sunrise."}],
    temperature=0.9,
)
Because callbacks are just Python callables, you can integrate with OpenTelemetry, Sentry, or any proprietary observability stack without touching the core SDK.
Pro Tip: When using fallbacks, wrap the call in a try/except block to capture the final exception. LiteLLM maps provider errors to OpenAI-compatible exception classes in litellm.exceptions (RateLimitError, APIConnectionError, and so on), so you can handle failures uniformly across providers.
Real‑World Use Cases
Below are three scenarios where LiteLLM can dramatically reduce engineering effort while adding robustness.
1. Multi‑Tenant SaaS with Per‑User Cost Caps
Suppose you run a content‑generation platform where each user has a monthly $10 LLM budget. By centralizing all calls through LiteLLM, you can track cumulative cost per user and reject requests that would exceed the limit.
from collections import defaultdict
from litellm import completion, completion_cost

# In-memory usage store (replace with Redis/DB in production)
user_spend = defaultdict(float)
MONTHLY_BUDGET_USD = 10.0

def budgeted_completion(user_id, **kwargs):
    if user_spend[user_id] >= MONTHLY_BUDGET_USD:
        raise RuntimeError(f"User {user_id} would exceed the monthly budget.")
    response = completion(**kwargs)
    # Commit the actual cost so subsequent calls see the updated total.
    # Under concurrent traffic, reserve an estimated cost up front instead.
    user_spend[user_id] += completion_cost(completion_response=response)
    return response

# Example request
budgeted_completion(
    "user_42",
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Draft a 200-word blog intro about renewable energy."}],
)
You can also pass a metadata dictionary to completion(); LiteLLM forwards it to your logging callbacks, making it easy to tie cost records to business identifiers such as user IDs.
2. Dynamic Prompt Routing for A/B Testing
Product teams often want to compare the output quality of two models on the same prompt. LiteLLM’s ability to accept a list of models enables a concise A/B test harness.
from litellm import completion

def ab_test(prompt):
    variants = ["openai/gpt-4o-mini", "anthropic/claude-3-sonnet-20240229"]
    results = {}
    for model in variants:
        resp = completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
        )
        results[model] = resp.choices[0].message["content"]
    return results

outcomes = ab_test("Explain the benefits of serverless architecture.")
for mdl, txt in outcomes.items():
    print(f"--- {mdl} ---")
    print(txt, "\n")
Collect the outputs, run them through a human rating UI, and feed the scores back into your model selection logic. All of this fits in under 20 lines of code.
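Once ratings come back, picking a winner can be as simple as a vote count. This is a toy aggregation over hypothetical ratings; real evaluations usually use pairwise preferences or rubric scores:

```python
from collections import Counter

# Hypothetical human ratings: which model's answer was preferred per prompt.
votes = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-sonnet-20240229",
    "openai/gpt-4o-mini",
]
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)  # openai/gpt-4o-mini 2
```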
3. Cost‑Optimized Batch Processing
Data pipelines that annotate millions of records can quickly become expensive and slow. LiteLLM's batch_completion() helper accepts one message list per prompt and fans the calls out concurrently, dramatically reducing wall-clock time.
from litellm import batch_completion, completion_cost

# One message list per prompt; batch_completion runs the calls concurrently
batches = [
    [{"role": "user", "content": f"Classify sentiment for: '{txt}'"}]
    for txt in large_text_list
]
responses = batch_completion(
    model="cohere/command-r-plus",
    messages=batches,
    max_tokens=30,
)
sentiments = [r.choices[0].message["content"] for r in responses]
total_cost = sum(completion_cost(completion_response=r) for r in responses)
print(f"Processed {len(sentiments)} records with total cost: ${total_cost:.4f}")
Batching cuts wall-clock time through concurrency and lets you compute a single aggregate cost, simplifying budgeting for large-scale jobs.
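The concurrency that batching provides can be pictured with a plain thread pool. This is an illustrative sketch with a stand-in classifier in place of a network call; the helper's real internals may differ:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a single LLM call (a real call would be I/O-bound,
# which is exactly where a thread pool helps).
def classify(text: str) -> str:
    return "positive" if "great" in text else "negative"

records = ["great product", "terrible support", "great docs"]
with ThreadPoolExecutor(max_workers=8) as pool:
    labels = list(pool.map(classify, records))  # preserves input order
print(labels)  # ['positive', 'negative', 'positive']
```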
Pro Tip: When batching, make sure each prompt fits within the model's context window. litellm.utils.trim_messages(messages, model) can trim oversized message lists down to fit.
Integrations with Popular Frameworks
LiteLLM is designed to be a drop‑in replacement for the OpenAI SDK, which means most libraries that expect an openai.ChatCompletion interface work out of the box.
LangChain
LangChain’s ChatOpenAI wrapper can be swapped for ChatLiteLLM, giving you access to every provider while keeping the same chain syntax.
from langchain_community.chat_models import ChatLiteLLM
from langchain_core.messages import HumanMessage
llm = ChatLiteLLM(model="anthropic/claude-3-opus-20240229", temperature=0.6)
msg = HumanMessage(content="Write a short story about a time‑traveling cat.")
response = llm.invoke([msg])
print(response.content)
Because LangChain pulls model metadata from the wrapper, you can still use LangChain’s LLMChain, AgentExecutor, and PromptTemplate utilities without any code changes.
FastAPI & Webhooks
Expose a single endpoint that forwards incoming chat requests to LiteLLM. The endpoint can enforce authentication, inject user metadata, and log costs centrally.
from fastapi import FastAPI, Request, HTTPException
from litellm import acompletion, completion_cost

app = FastAPI()

@app.post("/chat")
async def chat_endpoint(req: Request):
    payload = await req.json()
    user_id = req.headers.get("X-User-ID")
    if not user_id:
        raise HTTPException(status_code=401, detail="Missing user ID")
    # Attach metadata for downstream callbacks
    payload.setdefault("metadata", {})["user_id"] = user_id
    resp = await acompletion(**payload)  # async variant keeps the event loop free
    return {
        "answer": resp.choices[0].message["content"],
        "cost_usd": completion_cost(completion_response=resp),
    }
This pattern centralizes all LLM traffic, making it trivial to audit usage across teams or switch providers without touching client code.
Testing & Mocking LiteLLM Calls
Unit tests that hit real LLM APIs are slow and flaky. LiteLLM's completion() accepts a mock_response argument that skips the network entirely and returns a canned reply in the standard response shape.
from litellm import completion

# In your test
resp = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello world!"}],
    mock_response="Echo: Hello world!",  # no API key or network needed
)
assert resp.choices[0].message["content"] == "Echo: Hello world!"
When mock_response is set, LiteLLM bypasses network calls entirely, making CI pipelines fast and deterministic.
Security & Compliance Considerations
Handling sensitive data with LLMs demands careful attention to encryption, data residency, and audit trails. LiteLLM helps you stay compliant in three ways:
- Custom Headers: Pass extra_headers to inject organization-specific tokens or compliance flags.
- Redaction Hooks: Scrub PII from every prompt before it leaves your environment.
- Audit Logging: Persist the raw request/response payloads from your logging callbacks to immutable storage (e.g., AWS S3 with Object Lock).
import re

def redact_pii(messages):
    # Simple regex example – replace with a robust PII library in production
    for msg in messages:
        msg["content"] = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED-SSN]", msg["content"])
    return messages

# Apply at every call site: completion(model=..., messages=redact_pii(messages))
By centralizing redaction, you guarantee that no downstream provider ever receives unredacted PII.