BentoML: Package and Deploy ML Models as REST APIs
BentoML has quickly become the go‑to framework for turning a trained machine‑learning model into a production‑ready REST API. It abstracts away the boilerplate of Flask, FastAPI, or Docker, letting you focus on the model itself while still giving you full control over deployment. In this guide we’ll walk through installing BentoML, packaging a simple scikit‑learn model, and deploying it to a containerized environment – all with practical code you can copy‑paste.
What is BentoML?
BentoML is an open‑source library that streamlines the entire lifecycle of an ML model: from saving the artifact, to defining a service interface, to bundling everything into a reproducible container. Think of it as a “model‑as‑code” platform that captures both the model file and the code required to serve it.
Core concepts
- Model Store – a versioned repository on your local disk or cloud where BentoML keeps serialized models.
- Service – a Python class that declares input‑validation, inference logic, and optional batch processing.
- Bento – a self‑contained bundle (Docker image or zip) that includes the model, its dependencies, and the service code.
- Runner – an abstraction for scaling inference, supporting multi‑process, GPU, or asynchronous execution.
These pieces work together to give you reproducibility (exact same model version), portability (run anywhere Docker works), and observability (built‑in logging and metrics).
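To make the Model Store idea concrete, here is a toy sketch in plain Python (this is not BentoML's implementation, just an illustration of tag-addressed, versioned storage and "latest" resolution):

```python
# Conceptual sketch only: a tag-addressed model store illustrating
# immutable "name:version" tags and "latest" resolution.
import pickle


class ToyModelStore:
    def __init__(self):
        self._blobs = {}      # (name, version) -> serialized model bytes
        self._versions = {}   # name -> list of versions, oldest first

    def save(self, name, model):
        """Serialize the model under a new immutable version; return its tag."""
        version = str(len(self._versions.get(name, [])) + 1)
        self._blobs[(name, version)] = pickle.dumps(model)
        self._versions.setdefault(name, []).append(version)
        return f"{name}:{version}"

    def get(self, tag):
        """Resolve a 'name:version' tag; 'latest' (or no version) means newest."""
        name, _, version = tag.partition(":")
        if version in ("", "latest"):
            version = self._versions[name][-1]
        return pickle.loads(self._blobs[(name, version)])


store = ToyModelStore()
tag_v1 = store.save("iris_classifier", {"n_estimators": 100})
store.save("iris_classifier", {"n_estimators": 200})
print(store.get("iris_classifier:latest"))  # {'n_estimators': 200}
print(store.get(tag_v1))                    # {'n_estimators': 100}
```

Because every save produces a new immutable version, "rolling back" is just resolving an older tag, which is the property the real Model Store gives you.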
Installing BentoML
Installation is straightforward via pip. BentoML supports Python 3.8+ and runs on Linux, macOS, and Windows.
pip install bentoml
The core package is framework-agnostic, so install the ML libraries your model needs (scikit-learn, pandas, torch, etc.) alongside it. BentoML ships its own HTTP server for serving REST endpoints, so no separate web framework is required.
Building your first model service
We’ll start with a classic Iris classification example using scikit‑learn. The steps are: train → save → define service → serve locally.
Step 1: Train a simple model
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Quick sanity check
preds = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, preds))
At this point you have a fully trained RandomForestClassifier. The next step is to hand it over to BentoML.
Step 2: Save the model with BentoML
import bentoml
# Save the model; BentoML automatically versions it
bento_model = bentoml.sklearn.save_model(
"iris_classifier",
model,
signatures={"predict": {"batchable": True}}
)
print("Model saved to:", bento_model.path)
The signatures argument tells BentoML that the predict method can handle batched inputs, enabling automatic request parallelism later on.
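To make "batchable" concrete, here is a plain-Python illustration (no BentoML involved): a batchable predict maps a batch of rows to one output per row, which is what allows the server to merge concurrent single-row requests into one model call.

```python
# Plain-Python illustration of the batchable contract, with a toy
# stand-in for model.predict on an (N, 4) array.
def predict(batch):
    """Return one output per input row, like a batchable model.predict."""
    return [0 if row[2] < 2.5 else 1 for row in batch]  # toy rule on petal length


# Three concurrent single-row requests...
incoming = [[[5.1, 3.5, 1.4, 0.2]], [[6.7, 3.0, 5.2, 2.3]], [[5.9, 3.0, 4.2, 1.5]]]
# ...can be merged and answered with a single inference call:
merged = [row for request in incoming for row in request]
results = predict(merged)
print(results)  # [0, 1, 1]
```

Each caller then receives its own slice of the batched result; the batching itself is handled for you once the signature is marked batchable.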
Step 3: Define a Service
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Load the saved model from the model store
model_ref = bentoml.sklearn.get("iris_classifier:latest")

# Wrap it in a runner (recommended: runners handle scaling and batching)
runner = model_ref.to_runner()

# Declare the service and register its runner
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(
    input=NumpyNdarray(dtype="float32", shape=(-1, 4)),
    output=NumpyNdarray(),
)
def predict(input_array: np.ndarray) -> np.ndarray:
    # Delegate inference to the runner for batching and async support
    return runner.predict.run(input_array)
The @svc.api decorator declares the input and output schemas. BentoML automatically generates OpenAPI documentation from these definitions.
Step 4: Run locally
# From the terminal, in the directory containing your service file
# (assumed here to be named service.py, with the Service object named svc)
bentoml serve service:svc --reload
By default the service starts on http://127.0.0.1:3000. You can test it with curl or any HTTP client.
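For example, assuming the service is running on the default port, a request can be sent with curl (with a NumpyNdarray input, the body is the raw JSON array, and the endpoint path matches the API function name):

```shell
curl -X POST \
  -H "Content-Type: application/json" \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \
  http://127.0.0.1:3000/predict
```

A successful response is a JSON array with one predicted class label per input row.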
Packaging and versioning
Once you’re happy with the local behavior, it’s time to create a reproducible Bento bundle.
- Run bentoml build in the folder that contains your service file and a bentofile.yaml describing it. This command packages the model, the service code, and the exact Python environment into a Bento in your local Bento store.
- Inspect the generated bundle with bentoml get iris_classifier:latest to verify its dependencies.
- Optionally push the bundle to a remote Bento store using bentoml push.
Every time you call save_model or run bentoml build, BentoML creates a new immutable version. This makes rollback as easy as specifying an older tag.
Deploying to production
Production deployment usually means containerizing the Bento bundle and running it on a cloud platform. Each built Bento includes a generated Dockerfile, and the bentoml containerize command builds an image from it; you can also customise the Dockerfile if needed.
Docker container
# Build a Docker image from the Bento (uses the Dockerfile generated at build time)
bentoml containerize iris_classifier:latest
# Run the container (BentoML serves HTTP on port 3000 by default)
docker run -p 3000:3000 iris_classifier:latest
The generated Dockerfile pins the Python version and every pip package recorded during the build step, so the container behaves the same as your local test environment.
Deploy to Kubernetes
- Create a Kubernetes Deployment manifest that references the Docker image.
- Expose the service via a LoadBalancer or Ingress.
- Scale the replica count based on traffic; BentoML’s runner will automatically distribute requests across pods.
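As a sketch of the first two steps, a minimal manifest might look like the following (the image name, labels, and replica count are illustrative placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-service
  template:
    metadata:
      labels:
        app: iris-service
    spec:
      containers:
        - name: iris-service
          image: iris_classifier:latest   # image built by `bentoml containerize`
          ports:
            - containerPort: 3000         # BentoML's default HTTP port
---
apiVersion: v1
kind: Service
metadata:
  name: iris-service
spec:
  type: LoadBalancer
  selector:
    app: iris-service
  ports:
    - port: 80
      targetPort: 3000
```

Apply it with kubectl apply -f and the LoadBalancer service will route external traffic to the pods on BentoML's default port.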
Because BentoML's HTTP layer is ASGI-based, it plays nicely with the uvicorn server, which supports graceful shutdown and hot reload in development.
Serverless options
If you prefer a fully managed platform, you can run the containerized Bento on services such as Google Cloud Run, or use the companion bentoctl tool to deploy to targets like AWS Lambda. The workflow is the same up front: build the Bento, containerize it, then hand the image to the platform. This gives you pay-as-you-go billing without managing servers.
Real‑world use cases
- Fraud detection API – a model that scores transaction risk in milliseconds, deployed behind a REST endpoint for real‑time decision making.
- Image classification microservice – a PyTorch model packaged with BentoML, served on GPU‑enabled instances for low‑latency inference.
- Recommendation engine – a collaborative‑filtering model that receives user IDs via POST and returns top‑N items, scaled horizontally with Kubernetes.
All of these scenarios benefit from BentoML’s versioned model store, automatic OpenAPI generation, and out‑of‑the‑box observability.
Performance tuning & monitoring
Even a well‑packaged service can become a bottleneck under load. BentoML provides several knobs to turn.
Batch vs. streaming
If your endpoint receives many small requests, enable the batchable=True flag in the model signature (as we did earlier). BentoML will aggregate incoming payloads into batches of configurable size, dramatically improving GPU utilization.
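The aggregation logic can be sketched in plain Python (a conceptual illustration, not BentoML's actual implementation): block for the first request, then collect more until the batch is full or a timeout expires, and run one model call for the whole batch.

```python
# Conceptual sketch of adaptive batching with threads and queues.
import queue
import threading
import time


def batching_worker(request_q, predict_fn, max_batch_size=32, batch_timeout_s=0.010):
    """Collect requests into batches, run one model call, fan results back out.

    Each request is a (payload, reply_queue) pair. The worker blocks for the
    first request, then gathers more until the batch is full or the timeout
    expires -- trading a little latency for much better throughput.
    """
    while True:
        batch = [request_q.get()]  # wait for the first request
        deadline = time.monotonic() + batch_timeout_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = predict_fn([payload for payload, _ in batch])  # one call per batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)


# Toy usage: a "model" that doubles its input.
request_q = queue.Queue()
threading.Thread(
    target=batching_worker,
    args=(request_q, lambda xs: [x * 2 for x in xs]),
    daemon=True,
).start()
reply_q = queue.Queue()
request_q.put((21, reply_q))
print(reply_q.get(timeout=2))  # 42
```

The two knobs in the sketch mirror the trade-off you tune in production: a larger batch size improves throughput, a shorter timeout bounds added latency.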
Logging & metrics
- Metrics – expose /metrics for Prometheus; BentoML ships with Prometheus instrumentation out of the box.
- Logging – emit structured JSON logs that can be shipped to ELK or CloudWatch.
- Tracing – enable OpenTelemetry for end-to-end latency analysis.
These tools let you set alerts on latency percentiles, error rates, and CPU/GPU usage.
Pro tip: When you enable batching, experiment with the runner's max_batch_size and max_latency_ms options. A typical sweet spot for CPU-only services is 32-64 samples per batch with a latency budget of around 10 ms; on GPUs you can often push the batch size to 256 or higher without sacrificing latency.
Testing your API
Automated testing ensures that changes to the model or service code don’t break the contract. Below is a minimal pytest example that hits the live endpoint.
import requests

def test_predict():
    # With a NumpyNdarray input, the request body is the raw JSON array
    url = "http://localhost:3000/predict"
    payload = [[5.1, 3.5, 1.4, 0.2]]
    resp = requests.post(url, json=payload)
    assert resp.status_code == 200
    result = resp.json()
    # Expect a single class label (0, 1, or 2)
    assert result[0] in (0, 1, 2)
Run the test with pytest test_api.py. For CI pipelines, spin up the Docker container in a temporary network and execute the test suite before promoting the image.
Security considerations
Exposing a model as a public API introduces attack surfaces. Follow these best practices to keep your service secure.
- Authentication – add an API key or OAuth2 layer in front of the service.
- Input validation – rely on BentoML's NumpyNdarray or PandasDataFrame schemas to reject malformed payloads early.
- Rate limiting – use a gateway (e.g., Kong, Envoy) to throttle abusive clients.
- Dependency scanning – run pip-audit on the generated Docker image to catch vulnerable packages.
- Data privacy – avoid logging raw input data; instead, log hashes or feature statistics.
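To make the input-validation point concrete, here is a minimal stdlib sketch of the kind of check an IO schema performs for the Iris payload (illustrative only; in this guide's service, BentoML's NumpyNdarray descriptor does this for you):

```python
# Illustrative payload check -- BentoML's IO descriptors perform
# equivalent validation automatically.
def validate_iris_payload(payload):
    """Return the payload unchanged if it is a batch of 4-feature numeric rows.

    Raises ValueError for anything else, so malformed input is rejected
    before it ever reaches the model.
    """
    if not isinstance(payload, list) or not payload:
        raise ValueError("payload must be a non-empty list of rows")
    for row in payload:
        if not isinstance(row, list) or len(row) != 4:
            raise ValueError("each row must contain exactly 4 features")
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in row):
            raise ValueError("features must be numeric")
    return payload


validate_iris_payload([[5.1, 3.5, 1.4, 0.2]])  # passes silently
```

Failing fast at the schema boundary keeps malformed or adversarial input out of your inference code and your logs.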
Conclusion
BentoML bridges the gap between data science notebooks and production‑grade services with minimal friction. By handling model versioning, dependency capture, and container generation out of the box, it lets you focus on model quality and business logic. Whether you’re deploying a tiny scikit‑learn classifier or a massive transformer on GPU, the same workflow—train, save, serve, and scale—applies. Armed with the examples and tips above, you’re ready to turn any ML model into a robust REST API that can grow with your traffic demands.