Ray Serve: Scale Python ML Models to Millions of Users
PROGRAMMING LANGUAGES April 12, 2026, 11:30 a.m.

Imagine you have a machine‑learning model that predicts user churn, recommends movies, or detects fraudulent transactions. It works great in a notebook, but as soon as you expose it to real traffic, latency spikes and the service crashes. Ray Serve is built to solve exactly this problem: it lets you turn any Python model into a production‑grade, horizontally scalable API with just a few lines of code.

What Is Ray Serve?

Ray Serve is a high‑level model serving library that sits on top of the Ray distributed execution engine. It abstracts away the complexities of scaling, load balancing, and fault tolerance, letting you focus on model logic. Because it runs on Ray, you can leverage the same cluster for data preprocessing, training, and inference, reducing operational overhead.

Serve treats each model as a deployment—a unit that can be replicated, versioned, and routed independently. Deployments are defined in plain Python, and Ray automatically manages the underlying actors, scaling them up or down based on traffic.

Core Concepts You Need to Know

Deployment

  • A Python class or function that handles a request.
  • Can be scaled horizontally by setting num_replicas.
  • Supports async and batch processing out of the box.

Endpoint

  • A named HTTP route that forwards requests to one or more deployments.
  • Enables traffic splitting for A/B tests or gradual rollouts.

Autoscaling

  • Serve tracks the number of ongoing requests per replica (rather than raw CPU or GPU utilization) and adds or removes replicas automatically.
  • Configured via autoscaling_config on the deployment.

Getting Started: Install and Run a Simple Model

The first step is installing Ray with the Serve extras. This pulls in FastAPI and Uvicorn for the HTTP layer.

pip install "ray[serve]"

Next, spin up a Ray cluster locally and define a trivial deployment that returns the square of a number.

import ray
from ray import serve

ray.init()

@serve.deployment(name="square", num_replicas=2)
class SquareModel:
    async def __call__(self, request):
        # Serve passes a Starlette Request; .json() is a coroutine and must be awaited.
        data = await request.json()
        number = data["num"]
        return {"result": number ** 2}

# Deploy the application and mount it at /square.
serve.run(SquareModel.bind(), route_prefix="/square")

Now you can hit the endpoint with curl or any HTTP client.

import requests
resp = requests.post("http://localhost:8000/square", json={"num": 7})
print(resp.json())  # {"result": 49}

Even though the example is tiny, Serve has already launched two replicas, load‑balanced the request, and provided a stable HTTP API.

Serving a Scikit‑Learn Model at Scale

Scikit‑Learn models are often used for tabular predictions. Let’s see how to wrap a trained model in a Serve deployment and enable batch inference for higher throughput.

import joblib
import numpy as np
from ray import serve

@serve.deployment(
    name="churn_predictor",
    num_replicas=4,
    ray_actor_options={"num_cpus": 1},
)
class ChurnPredictor:
    def __init__(self):
        # Load the trained model (saved as model.joblib) inside the replica,
        # so each worker process gets its own copy.
        self.model = joblib.load("model.joblib")

    # Group up to 32 concurrent requests into a single model call.
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict(self, inputs):
        features = np.array(inputs)
        preds = self.model.predict_proba(features)[:, 1]
        # Return a list of dicts matching the order of requests.
        return [{"churn_prob": float(p)} for p in preds]

    async def __call__(self, request):
        payload = await request.json()
        return await self.predict(payload["features"])

serve.run(ChurnPredictor.bind(), route_prefix="/churn")

With max_batch_size=32, Serve accumulates concurrent requests until 32 have arrived or batch_wait_timeout_s (here 50 ms) elapses, then invokes the model once, dramatically reducing per‑request overhead on CPU‑bound models.
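To see why batching helps, it is worth doing the arithmetic. Each model invocation carries a fixed overhead (deserialization, Python dispatch, BLAS setup) plus a per‑row cost, and the fixed part is amortized across the batch. A back‑of‑envelope sketch, where the 2 ms and 0.1 ms figures are illustrative assumptions rather than measurements:

```python
def per_request_latency_ms(fixed_overhead_ms, per_row_ms, batch_size):
    """Amortized compute cost per request when batch_size rows share one call."""
    return fixed_overhead_ms / batch_size + per_row_ms

# Illustrative numbers: 2 ms fixed overhead per call, 0.1 ms per row.
unbatched = per_request_latency_ms(2.0, 0.1, 1)   # 2.1 ms per request
batched = per_request_latency_ms(2.0, 0.1, 32)    # 0.1625 ms per request
print(f"unbatched: {unbatched:.4f} ms, batched: {batched:.4f} ms")
```

The flip side is queueing delay: a request may wait up to batch_wait_timeout_s for the batch to fill, so the two knobs trade tail latency for throughput.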

Deploying a PyTorch Model with GPU Acceleration

Deep learning models often require GPUs for low latency. Ray Serve can reserve a GPU for each replica, and its request batching maps naturally onto batched GPU inference.

import torch
import torch.nn as nn
from ray import serve

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Expects 3x32x32 inputs: (32 - 3) // 2 + 1 = 15, hence 16 * 15 * 15.
        self.conv = nn.Conv2d(3, 16, 3, stride=2)
        self.fc = nn.Linear(16 * 15 * 15, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(x.view(x.size(0), -1))

@serve.deployment(
    name="image_classifier",
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},
)
class ImageClassifier:
    def __init__(self):
        # Load pretrained weights inside the replica, which Ray schedules
        # on a node that actually has the reserved GPU.
        self.model = SimpleCNN()
        self.model.load_state_dict(torch.load("cnn_weights.pth"))
        self.model.eval()
        self.model.cuda()

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def classify(self, images):
        # Stack the individual images into a single tensor batch for the GPU.
        imgs = torch.stack([torch.tensor(img).float() for img in images]).cuda()
        with torch.no_grad():
            logits = self.model(imgs)
            probs = torch.softmax(logits, dim=1).cpu().numpy()
        return [{"probs": p.tolist()} for p in probs]

    async def __call__(self, request):
        payload = await request.json()
        return await self.classify(payload["image"])

serve.run(ImageClassifier.bind(), route_prefix="/classify")

Notice the use of max_batch_size=8. Even on a single GPU, processing eight images at once can increase throughput by 3‑5× compared to single‑request inference.

Pro tip: On multi‑node clusters, ray_actor_options={"num_gpus": 1} reserves a full GPU for each replica, so Ray never co‑schedules two replicas on the same device. This avoids accidental oversubscription and keeps latency predictable.

Advanced Features: Autoscaling, Traffic Splitting, and Model Versioning

Autoscaling in Action

Instead of manually picking num_replicas, you can let Ray adjust the replica count based on real‑time load. Below is a minimal autoscaling config that scales between 2 and 10 replicas, targeting roughly five in‑flight requests per replica.

import time

from ray import serve

autoscale_cfg = {
    "min_replicas": 2,
    "max_replicas": 10,
    "target_num_ongoing_requests_per_replica": 5,
    "metrics_interval_s": 5,
}

@serve.deployment(
    name="autoscaled_predictor",
    autoscaling_config=autoscale_cfg,
    ray_actor_options={"num_cpus": 0.5},
)
class AutoscaledPredictor:
    def __call__(self, request):
        time.sleep(0.02)  # Simulate 20 ms of model work.
        return {"msg": "handled"}

serve.run(AutoscaledPredictor.bind(), route_prefix="/autoscaled")

Ray monitors the queue length and automatically adds or removes replicas to keep the request backlog near the target.
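The control loop behind this is easy to model: the desired replica count is roughly the total in‑flight requests divided by the per‑replica target, clamped to the configured bounds. A pure‑Python sketch of that decision, a simplification of Serve's actual algorithm, which also smooths the signal over a lookback window:

```python
import math

def desired_replicas(total_ongoing, target_per_replica, min_replicas, max_replicas):
    """Replica count that keeps per-replica load near the target."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(3, 5, 2, 10))    # light load stays at the floor: 2
print(desired_replicas(37, 5, 2, 10))   # ceil(37 / 5) -> 8 replicas
print(desired_replicas(200, 5, 2, 10))  # burst capped at the ceiling: 10
```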

Traffic Splitting for A/B Testing

When you roll out a new model version, you often want to route a fraction of traffic to it while keeping the old version live. Serve does not expose a one‑line traffic‑weight API; the idiomatic pattern is model composition: a small router deployment holds handles to both versions and picks one per request. The sketch below assumes both model deployments expose a predict method.

import random

from ray import serve

@serve.deployment
class ABRouter:
    def __init__(self, old_model, new_model):
        # old_model / new_model are handles to the two deployments.
        self.old = old_model
        self.new = new_model

    async def __call__(self, request):
        payload = await request.json()
        # Route roughly 20% of requests to the new version.
        handle = self.new if random.random() < 0.2 else self.old
        return await handle.predict.remote(payload)

# Compose the graph and serve it at /predict with an 80/20 split.
serve.run(ABRouter.bind(OldModel.bind(), NewModel.bind()), route_prefix="/predict")

All incoming requests to /predict will now be forwarded according to the defined weights, and you can monitor key metrics to decide when to promote the new version.

Versioned Deployments

Serve namespaces each deployment by its name, so you can embed a version tag in the name to keep history clean, then promote a new version by redeploying the same route with the new class.

@serve.deployment(name="recommender_v1")
class RecommenderV1: ...

@serve.deployment(name="recommender_v2")
class RecommenderV2: ...

# Switch the /recommend route to v2 after validation.
serve.run(RecommenderV2.bind(), name="recommend", route_prefix="/recommend")

This pattern works well with CI/CD pipelines that automatically promote the “green” version once tests pass.

Real‑World Use Cases

Personalized Recommendation Engine

Streaming platforms need to serve millions of recommendations per second. By deploying a matrix factorization model with Ray Serve, you can batch user‑item lookups, cache embeddings in shared memory, and autoscale during peak hours (e.g., evenings). The same Ray cluster can also run nightly retraining jobs, eliminating data movement between environments.

Fraud Detection in Financial Services

Financial institutions process tens of thousands of transactions per second. A LightGBM model, wrapped in a Serve deployment with max_batch_size=256, can score batches of transaction records at sub‑millisecond per‑transaction latency. Autoscaling ensures that sudden spikes, like a flash sale, don't overwhelm the service.

Real‑Time Computer Vision for Autonomous Drones

Drones stream video frames to a central control tower. A PyTorch object‑detection model deployed on GPU‑enabled Serve replicas can process dozens of frames in parallel, returning bounding boxes in under 30 ms. The fleet management system can spin up additional replicas as more drones join the mission.

Pro tip: Pair Serve with Ray’s ray.util.queue.Queue to decouple ingestion from inference. This allows you to buffer bursts of traffic without dropping frames.
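The buffering pattern looks like the sketch below. It uses the stdlib queue.Queue as a single‑process stand‑in; on a cluster you would swap in ray.util.queue.Queue, which offers the same put/get calls, and the consumer would forward frames to a Serve handle instead of the hypothetical local detect function used here.

```python
import queue
import threading

frame_buffer = queue.Queue(maxsize=100)  # Bounded: bursts fill it, never OOM.

def detect(frame):
    # Stand-in for a call to the object-detection deployment.
    return {"frame": frame, "boxes": []}

results = []

def consumer():
    while True:
        frame = frame_buffer.get()
        if frame is None:          # Sentinel: shut down cleanly.
            break
        results.append(detect(frame))

worker = threading.Thread(target=consumer)
worker.start()

# Producer side: a burst of frames is absorbed by the buffer.
for frame_id in range(10):
    frame_buffer.put(frame_id)
frame_buffer.put(None)
worker.join()
print(len(results))  # 10
```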

Monitoring, Observability, and Logging

Production systems need visibility. Serve exports metrics in Prometheus format, including request latency, queue length, and replica health, and you can also record custom metrics inside your deployment.

import time

from ray import serve
from ray.serve.metrics import Counter, Histogram

@serve.deployment
class MonitoredModel:
    def __init__(self):
        # Create metrics inside the replica so Serve tags them per deployment.
        self.request_counter = Counter(
            "my_model_requests_total", description="Total requests"
        )
        self.latency_hist = Histogram(
            "my_model_latency_seconds",
            description="Request latency",
            boundaries=[0.01, 0.05, 0.1, 0.5, 1.0],
        )

    def __call__(self, request):
        self.request_counter.inc()
        start = time.time()
        # Model inference logic here
        result = {"ok": True}
        self.latency_hist.observe(time.time() - start)
        return result

Dashboards built on Grafana can visualize these metrics, while logs can be streamed to Elasticsearch or CloudWatch for root‑cause analysis.

Testing Locally Before Going Live

Ray Serve's serve.run deploys an application on the local Ray instance and returns a handle, so unit tests can invoke deployments directly without going through HTTP.

from ray import serve

@serve.deployment
class Echo:
    def __call__(self, payload):
        # Handle calls receive Python objects directly, no HTTP parsing.
        return payload

handle = serve.run(Echo.bind())

# Direct call without HTTP
result = handle.remote({"msg": "hello"}).result()
print(result)  # {'msg': 'hello'}

This approach speeds up CI pipelines, letting you verify model behavior and scaling logic before pushing to a shared cluster.

Production Best Practices

  • Cold‑start mitigation: Warm up replicas by sending a few dummy requests after each deployment.
  • Graceful shutdown: Implement __del__ or a cleanup method to close DB connections and release GPU memory.
  • Schema validation: Use Pydantic models with FastAPI to enforce request shapes, reducing runtime errors.
  • Security: Place Serve behind an API gateway, enable TLS, and enforce authentication tokens.
  • Resource isolation: Reserve separate Ray namespaces for dev, staging, and prod to avoid accidental cross‑traffic.
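The schema‑validation bullet can be as small as a single Pydantic model; the ChurnRequest name and fields below are illustrative, not part of any real API.

```python
from pydantic import BaseModel, ValidationError

class ChurnRequest(BaseModel):
    user_id: int
    features: list[float]

# A well-formed payload parses cleanly...
req = ChurnRequest(user_id=42, features=[0.1, 0.9, 0.3])
print(req.user_id)  # 42

# ...while a malformed one fails fast with a structured error.
try:
    ChurnRequest(user_id="not-a-number", features=[0.1])
except ValidationError as e:
    print("rejected:", e.errors()[0]["loc"])
```

With Serve's FastAPI integration (@serve.ingress), such models are enforced automatically at the HTTP boundary.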

Cost Optimization Strategies

Running a large Serve cluster can be expensive if not tuned. Here are three practical ways to keep costs under control.

  1. Dynamic autoscaling: let request‑based autoscaling shrink the fleet during quiet hours, and reserve GPU replicas for the heavy inference paths.
  2. Batching thresholds: Tune max_batch_size and batch_wait_timeout_s per model; larger batches improve GPU utilization but increase latency.
  3. Spot instances: For non‑critical workloads, schedule Ray nodes on cloud spot instances and let Ray handle preemptions gracefully.
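A quick sanity check on what autoscaling can save, where the hourly rate and traffic profile are illustrative assumptions:

```python
def monthly_cost(replica_hours_per_day, hourly_rate, days=30):
    """Total monthly spend for a given daily replica-hour budget."""
    return replica_hours_per_day * hourly_rate * days

# Static fleet: 10 replicas around the clock at a hypothetical $0.50/hour.
static = monthly_cost(10 * 24, 0.50)
# Autoscaled: 10 replicas for 6 peak hours, 2 replicas the remaining 18.
autoscaled = monthly_cost(10 * 6 + 2 * 18, 0.50)
print(static, autoscaled)  # 3600.0 1440.0

savings = 1 - autoscaled / static
print(f"{savings:.0%}")  # 60%
```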

Future Directions and Roadmap

Ray Serve is evolving rapidly. Upcoming features include built‑in model‑level caching, native support for TensorRT inference, and tighter integration with Ray Data for streaming pipelines. Keeping an eye on the Ray release notes helps you adopt performance improvements as soon as they land.

Conclusion

Ray Serve turns the daunting task of scaling Python ML models into a series of declarative steps. By leveraging Ray’s distributed runtime, you get autoscaling, batching, versioning, and observability with minimal boilerplate. Whether you’re serving a lightweight scikit‑learn predictor or a heavyweight GPU‑accelerated deep network, Serve provides the tools to reach millions of users without sacrificing latency or reliability. Start with the simple examples above, iterate on autoscaling policies, and soon your models will be ready for production‑grade traffic.
