BentoML: Package and Deploy ML Models as REST APIs
BentoML has quickly become the go‑to framework for turning a trained machine‑learning model into a production‑ready REST API. It abstracts away the boilerplate of Flask, FastAPI, or Docker, letting you focus on the model itself while still giving you full control over deployment. In this guide we’ll walk through installing BentoML, packaging a simple scikit‑learn model, and deploying it to a containerized environment – all with practical code you can copy‑paste.
What is BentoML?
BentoML is an open‑source library that streamlines the entire lifecycle of an ML model: from saving the artifact, to defining a service interface, to bundling everything into a reproducible container. Think of it as a “model‑as‑code” platform that captures both the model file and the code required to serve it.
Core concepts
- Model Store – a versioned repository on your local disk or cloud where BentoML keeps serialized models.
- Service – a Python class that declares input‑validation, inference logic, and optional batch processing.
- Bento – a self‑contained bundle (Docker image or zip) that includes the model, its dependencies, and the service code.
- Runner – an abstraction for scaling inference, supporting multi‑process, GPU, or asynchronous execution.
These pieces work together to give you reproducibility (exact same model version), portability (run anywhere Docker works), and observability (built‑in logging and metrics).
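To make the Model Store idea concrete, here is a toy sketch in plain Python (this is not BentoML's implementation, just an illustration of tag-addressed, versioned storage and "latest" resolution):

```python
# Conceptual sketch only: a tag-addressed model store illustrating
# immutable "name:version" tags and "latest" resolution.
import pickle


class ToyModelStore:
    def __init__(self):
        self._blobs = {}      # (name, version) -> serialized model bytes
        self._versions = {}   # name -> list of versions, oldest first

    def save(self, name, model):
        """Serialize the model under a new immutable version; return its tag."""
        version = str(len(self._versions.get(name, [])) + 1)
        self._blobs[(name, version)] = pickle.dumps(model)
        self._versions.setdefault(name, []).append(version)
        return f"{name}:{version}"

    def get(self, tag):
        """Resolve a 'name:version' tag; 'latest' (or no version) means newest."""
        name, _, version = tag.partition(":")
        if version in ("", "latest"):
            version = self._versions[name][-1]
        return pickle.loads(self._blobs[(name, version)])


store = ToyModelStore()
tag_v1 = store.save("iris_classifier", {"n_estimators": 100})
store.save("iris_classifier", {"n_estimators": 200})
print(store.get("iris_classifier:latest"))  # {'n_estimators': 200}
print(store.get(tag_v1))                    # {'n_estimators': 100}
```

Because every save produces a new immutable version, "rolling back" is just resolving an older tag, which is the property the real Model Store gives you.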
Installing BentoML
Installation is straightforward via pip. BentoML supports Python 3.8+ and runs on Linux, macOS, and Windows.
pip install bentoml
The core package is framework-agnostic, so install the ML libraries your model needs (scikit-learn, pandas, torch, etc.) alongside it. BentoML ships its own HTTP server for serving REST endpoints, so no separate web framework is required.
Building your first model service
We’ll start with a classic Iris classification example using scikit‑learn. The steps are: train → save → define service → serve locally.
Step 1: Train a simple model
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Quick sanity check
preds = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, preds))
At this point you have a fully trained RandomForestClassifier. The next step is to hand it over to BentoML.
Step 2: Save the model with BentoML
import bentoml
# Save the model; BentoML automatically versions it
bento_model = bentoml.sklearn.save_model(
"iris_classifier",
model,
signatures={"predict": {"batchable": True}}
)
print("Model saved to:", bento_model.path)
The signatures argument tells BentoML that the predict method can handle batched inputs, enabling automatic request parallelism later on.
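To make "batchable" concrete, here is a plain-Python illustration (no BentoML involved): a batchable predict maps a batch of rows to one output per row, which is what allows the server to merge concurrent single-row requests into one model call.

```python
# Plain-Python illustration of the batchable contract, with a toy
# stand-in for model.predict on an (N, 4) array.
def predict(batch):
    """Return one output per input row, like a batchable model.predict."""
    return [0 if row[2] < 2.5 else 1 for row in batch]  # toy rule on petal length


# Three concurrent single-row requests...
incoming = [[[5.1, 3.5, 1.4, 0.2]], [[6.7, 3.0, 5.2, 2.3]], [[5.9, 3.0, 4.2, 1.5]]]
# ...can be merged and answered with a single inference call:
merged = [row for request in incoming for row in request]
results = predict(merged)
print(results)  # [0, 1, 1]
```

Each caller then receives its own slice of the batched result; the batching itself is handled for you once the signature is marked batchable.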
Step 3: Define a Service
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Load the saved model from the model store
model_ref = bentoml.sklearn.get("iris_classifier:latest")

# Wrap it in a runner (recommended: runners handle scaling and batching)
runner = model_ref.to_runner()

# Declare the service and register its runner
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(
    input=NumpyNdarray(dtype="float32", shape=(-1, 4)),
    output=NumpyNdarray(),
)
def predict(input_array: np.ndarray) -> np.ndarray:
    # Delegate inference to the runner for batching and async support
    return runner.predict.run(input_array)
The @svc.api decorator declares the input and output schemas. BentoML automatically generates OpenAPI documentation from these definitions.
Step 4: Run locally
# From the terminal, in the directory containing your service file
# (assumed here to be named service.py, with the Service object named svc)
bentoml serve service:svc --reload
By default the service starts on http://127.0.0.1:3000. You can test it with curl or any HTTP client.
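For example, assuming the service is running on the default port, a request can be sent with curl (with a NumpyNdarray input, the body is the raw JSON array, and the endpoint path matches the API function name):

```shell
curl -X POST \
  -H "Content-Type: application/json" \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \
  http://127.0.0.1:3000/predict
```

A successful response is a JSON array with one predicted class label per input row.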
Packaging and versioning
Once you’re happy with the local behavior, it’s time to create a reproducible Bento bundle.
- Run bentoml build in the folder that contains your service file and a bentofile.yaml describing it. This command packages the model, the service code, and the exact Python environment into a Bento in your local Bento store.
- Inspect the generated bundle with bentoml get iris_classifier:latest to verify its dependencies.
- Optionally push the bundle to a remote Bento store using bentoml push.
Every time you call save_model or run bentoml build, BentoML creates a new immutable version. This makes rollback as easy as specifying an older tag.
Deploying to production
Production deployment usually means containerizing the Bento bundle and running it on a cloud platform. Each built Bento includes a generated Dockerfile, and the bentoml containerize command builds an image from it; you can also customise the Dockerfile if needed.
Docker container
# Build a Docker image from the Bento (uses the Dockerfile generated at build time)
bentoml containerize iris_classifier:latest
# Run the container (BentoML serves HTTP on port 3000 by default)
docker run -p 3000:3000 iris_classifier:latest
The generated Dockerfile pins the Python version and every pip package recorded during the build step, so the container behaves the same as your local test environment.
Deploy to Kubernetes
- Create a Kubernetes Deployment manifest that references the Docker image.
- Expose the service via a LoadBalancer or Ingress.
- Scale the replica count based on traffic; BentoML’s runner will automatically distribute requests across pods.
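As a sketch of the first two steps, a minimal manifest might look like the following (the image name, labels, and replica count are illustrative placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-service
  template:
    metadata:
      labels:
        app: iris-service
    spec:
      containers:
        - name: iris-service
          image: iris_classifier:latest   # image built by `bentoml containerize`
          ports:
            - containerPort: 3000         # BentoML's default HTTP port
---
apiVersion: v1
kind: Service
metadata:
  name: iris-service
spec:
  type: LoadBalancer
  selector:
    app: iris-service
  ports:
    - port: 80
      targetPort: 3000
```

Apply it with kubectl apply -f and the LoadBalancer service will route external traffic to the pods on BentoML's default port.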
Because BentoML's HTTP layer is ASGI-based, it plays nicely with the uvicorn server, which supports graceful shutdown and hot reload in development.
Serverless options
If you prefer a fully managed platform, you can run the containerized Bento on services such as Google Cloud Run, or use the companion bentoctl tool to deploy to targets like AWS Lambda. The workflow is the same up front: build the Bento, containerize it, then hand the image to the platform. This gives you pay-as-you-go billing without managing servers.
Real‑world use cases
- Fraud detection API – a model that scores transaction risk in milliseconds, deployed behind a REST endpoint for real‑time decision making.
- Image classification microservice – a PyTorch model packaged with BentoML, served on GPU‑enabled instances for low‑latency inference.
- Recommendation engine – a collaborative‑filtering model that receives user IDs via POST and returns top‑N items, scaled horizontally with Kubernetes.
All of these scenarios benefit from BentoML’s versioned model store, automatic OpenAPI generation, and out‑of‑the‑box observability.
Performance tuning & monitoring
Even a well‑packaged service can become a bottleneck under load. BentoML provides several knobs to turn.
Batch vs. streaming
If your endpoint receives many small requests, enable the batchable=True flag in the model signature (as we did earlier). BentoML will aggregate incoming payloads into batches of configurable size, dramatically improving GPU utilization.
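The aggregation logic can be sketched in plain Python (a conceptual illustration, not BentoML's actual implementation): block for the first request, then collect more until the batch is full or a timeout expires, and run one model call for the whole batch.

```python
# Conceptual sketch of adaptive batching with threads and queues.
import queue
import threading
import time


def batching_worker(request_q, predict_fn, max_batch_size=32, batch_timeout_s=0.010):
    """Collect requests into batches, run one model call, fan results back out.

    Each request is a (payload, reply_queue) pair. The worker blocks for the
    first request, then gathers more until the batch is full or the timeout
    expires -- trading a little latency for much better throughput.
    """
    while True:
        batch = [request_q.get()]  # wait for the first request
        deadline = time.monotonic() + batch_timeout_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = predict_fn([payload for payload, _ in batch])  # one call per batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)


# Toy usage: a "model" that doubles its input.
request_q = queue.Queue()
threading.Thread(
    target=batching_worker,
    args=(request_q, lambda xs: [x * 2 for x in xs]),
    daemon=True,
).start()
reply_q = queue.Queue()
request_q.put((21, reply_q))
print(reply_q.get(timeout=2))  # 42
```

The two knobs in the sketch mirror the trade-off you tune in production: a larger batch size improves throughput, a shorter timeout bounds added latency.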
Logging & metrics
- Metrics – expose /metrics for Prometheus; BentoML ships with Prometheus instrumentation out of the box.
- Logging – emit structured JSON logs that can be shipped to ELK or CloudWatch.
- Tracing – enable OpenTelemetry for end-to-end latency analysis.
These tools let you set alerts on latency percentiles, error rates, and CPU/GPU usage.
Pro tip: When you enable batching, experiment with the runner's max_batch_size and max_latency_ms options. A typical sweet spot for CPU-only services is 32-64 samples per batch with a latency budget of around 10 ms; on GPUs you can often push the batch size to 256 or higher without sacrificing latency.
Testing your API
Automated testing ensures that changes to the model or service code don’t break the contract. Below is a minimal pytest example that hits the live endpoint.
import requests

def test_predict():
    # With a NumpyNdarray input, the request body is the raw JSON array
    url = "http://localhost:3000/predict"
    payload = [[5.1, 3.5, 1.4, 0.2]]
    resp = requests.post(url, json=payload)
    assert resp.status_code == 200
    result = resp.json()
    # Expect a single class label (0, 1, or 2)
    assert result[0] in (0, 1, 2)
Run the test with pytest test_api.py. For CI pipelines, spin up the Docker container in a temporary network and execute the test suite before promoting the image.
Security considerations
Exposing a model as a public API introduces attack surfaces. Follow these best practices to keep your service secure.
- Authentication – add an API key or OAuth2 layer in front of the service.
- Input validation – rely on BentoML's NumpyNdarray or PandasDataFrame schemas to reject malformed payloads early.
- Rate limiting – use a gateway (e.g., Kong, Envoy) to throttle abusive clients.
- Dependency scanning – run pip-audit on the generated Docker image to catch vulnerable packages.
- Data privacy – avoid logging raw input data; instead, log hashes or feature statistics.
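To make the input-validation point concrete, here is a minimal stdlib sketch of the kind of check an IO schema performs for the Iris payload (illustrative only; in this guide's service, BentoML's NumpyNdarray descriptor does this for you):

```python
# Illustrative payload check -- BentoML's IO descriptors perform
# equivalent validation automatically.
def validate_iris_payload(payload):
    """Return the payload unchanged if it is a batch of 4-feature numeric rows.

    Raises ValueError for anything else, so malformed input is rejected
    before it ever reaches the model.
    """
    if not isinstance(payload, list) or not payload:
        raise ValueError("payload must be a non-empty list of rows")
    for row in payload:
        if not isinstance(row, list) or len(row) != 4:
            raise ValueError("each row must contain exactly 4 features")
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in row):
            raise ValueError("features must be numeric")
    return payload


validate_iris_payload([[5.1, 3.5, 1.4, 0.2]])  # passes silently
```

Failing fast at the schema boundary keeps malformed or adversarial input out of your inference code and your logs.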
Conclusion
BentoML bridges the gap between data science notebooks and production‑grade services with minimal friction. By handling model versioning, dependency capture, and container generation out of the box, it lets you focus on model quality and business logic. Whether you’re deploying a tiny scikit‑learn classifier or a massive transformer on GPU, the same workflow—train, save, serve, and scale—applies. Armed with the examples and tips above, you’re ready to turn any ML model into a robust REST API that can grow with your traffic demands.