Modal: Run Serverless GPU Workloads from Python
HOW TO GUIDES April 5, 2026, 11:30 a.m.

When you think about running GPU‑intensive code, the first thing that usually comes to mind is provisioning a pricey EC2 instance, juggling drivers, and worrying about idle time. Modal flips that script: it lets you spin up GPU‑backed containers from plain Python, pay only for the seconds you actually use, and never touch a VM again. In this post we’ll walk through the core concepts, fire up a couple of real‑world examples, and sprinkle in some pro tips to keep your workloads snappy and cost‑effective.

Getting Started with Modal

Modal’s Python SDK is deliberately lightweight. A single pip install modal brings in everything you need, from the client that talks to the cloud to the decorators that turn a regular function into a serverless GPU job. The first step is to create an account on modal.com and retrieve your API token—think of it as a secret key that lets your local notebook talk to Modal’s control plane.

Once you have an account, authenticate with the modal token new CLI command, which opens a browser window and writes your token ID and secret to ~/.modal.toml; the SDK reads that file automatically. For a CI/CD-friendly approach, you can instead supply the credentials via the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables.

Installation and Basic Configuration

# Install the SDK
!pip install -q modal

# Verify the installation
import modal
print("Modal version:", modal.__version__)

With the SDK ready, you can start defining functions that run on Modal's GPU workers. The magic lives in the @app.function decorator (attached to a modal.App), where you specify the container image, GPU type, and any required packages.

Your First GPU Function: Image Classification

Let’s train a tiny ResNet on the CIFAR‑10 dataset using PyTorch. The entire training loop will execute on a remote GPU, while you keep the notebook responsive. Modal provisions a GPU‑backed container from the image we define, installs PyTorch into it, and tears the instance down as soon as the job finishes.

import modal

# Define the container image with PyTorch and CUDA support
image = modal.Image.debian_slim().pip_install(
    "torch==2.2.0",
    "torchvision==0.17.0",
    "tqdm"
)

app = modal.App("resnet-cifar10")

@app.function(image=image, gpu="any")
def train_resnet(epochs: int = 5) -> str:
    import torch
    import torchvision
    import torchvision.transforms as T
    from torch import nn, optim
    from tqdm import tqdm

    # Data loading
    transform = T.Compose([T.ToTensor(), T.Normalize((0.5,), (0.5,))])
    train_set = torchvision.datasets.CIFAR10(root="/tmp", train=True,
                                             download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                               shuffle=True, num_workers=2)

    # Model, loss, optimizer
    model = torchvision.models.resnet18(num_classes=10)
    model = model.cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for images, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch+1} finished – avg loss: {epoch_loss/len(train_loader):.4f}")

    # Save a tiny checkpoint to a Modal volume (see next section)
    torch.save(model.state_dict(), "/tmp/resnet_cifar10.pth")
    return "/tmp/resnet_cifar10.pth"

To invoke the function, simply call train_resnet.remote(). The call runs on Modal and blocks until the result comes back; if you would rather fire and forget, train_resnet.spawn() returns a modal.FunctionCall whose result you can fetch later with .get(). Because the container is spun up on demand, you only pay for the seconds the GPU is actually crunching numbers.

# Kick off the training job (blocks until done)
checkpoint_path = train_resnet.remote(epochs=3)
print("Model checkpoint stored at:", checkpoint_path)

Pro tip: For longer experiments, consider using modal.Volume to persist checkpoints across multiple function invocations. This avoids re‑downloading data or re‑training from scratch.

Persisting Data with Modal Volumes

Modal volumes are essentially network‑attached storage that can be mounted into any function. They’re perfect for sharing datasets, model checkpoints, or even intermediate results between separate jobs. Creating a volume is as easy as calling modal.Volume.from_name("my-data", create_if_missing=True), which ensures the data lives beyond the lifetime of a single container.

# Define a persistent volume for CIFAR‑10
cifar_volume = modal.Volume.from_name("cifar10-data", create_if_missing=True)

@app.function(image=image, volumes={"/cifar": cifar_volume})  # no GPU needed just to download
def download_cifar():
    import torchvision
    torchvision.datasets.CIFAR10(root="/cifar", train=True, download=True)
    cifar_volume.commit()  # flush writes so other functions can see them
    return "Dataset ready"

Now the download_cifar function writes the dataset into /cifar, which is backed by the persistent volume. Any subsequent function that mounts cifar_volume can read the data instantly, cutting down on redundant network traffic.

Running Inference at Scale

Once you have a trained model, serving predictions is a natural next step. Modal’s @modal.web_endpoint decorator turns a Python function into an HTTP endpoint that automatically scales based on request volume. Combine it with a GPU‑enabled container, and you’ve got a low‑latency inference API without managing a fleet of servers.

# Extend the training image with the web-serving dependencies
serve_image = image.pip_install("fastapi[standard]", "pillow")

with serve_image.imports():
    from fastapi import Request

@app.cls(image=serve_image, gpu="any", volumes={"/cifar": cifar_volume})
class Classifier:
    @modal.enter()
    def load_model(self):
        # Runs once per container, so the checkpoint loads only on cold start
        import torch
        import torchvision
        self.model = torchvision.models.resnet18(num_classes=10)
        self.model.load_state_dict(torch.load("/cifar/resnet_cifar10.pth"))
        self.model.eval().cuda()

    @modal.web_endpoint(method="POST")
    async def predict(self, request: Request):
        import io
        import torch
        import torchvision.transforms as T
        from PIL import Image

        # Decode the raw request body as an image
        img = Image.open(io.BytesIO(await request.body())).convert("RGB")
        transform = T.Compose([T.Resize((32, 32)), T.ToTensor(),
                               T.Normalize((0.5,), (0.5,))])
        tensor = transform(img).unsqueeze(0).cuda()

        # Run inference
        with torch.no_grad():
            logits = self.model(tensor)
        return {"prediction": int(logits.argmax(dim=1).item())}

Deploy the app with modal deploy my_app.py, and Modal will give you a public URL. Containers scale up with request volume, each serving many requests, and scale back to zero when traffic stops, which is ideal for bursty patterns where a traditional GPU server would sit idle most of the day.

Note: For ultra‑low latency (sub‑100 ms), you can keep a warm pool of containers by setting keep_warm=1 (or higher) in the decorator. This trades a bit of idle cost for the elimination of cold starts.

Real‑World Use Cases

1. Video Frame Upscaling – Content creators often need to upscale 4K footage to 8K for modern displays. Using Modal, you can spin up a GPU container with a super‑resolution library, process each frame in parallel, and store the results back to an S3 bucket, all without ever provisioning a dedicated GPU workstation.

video_image = modal.Image.debian_slim().pip_install(
    "opencv-python-headless", "boto3"
    # plus your super‑resolution library of choice
)

@app.function(image=video_image, gpu="any")
def upscale_frame(frame_bytes: bytes) -> bytes:
    import cv2
    import numpy as np
    import realsr  # placeholder: swap in your actual upscaling library

    # Decode frame
    nparr = np.frombuffer(frame_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    # Upscale (the API shown here is illustrative)
    upscaled = realsr.upscale(img, scale=2)
    # Encode back to JPEG
    _, buf = cv2.imencode(".jpg", upscaled)
    return buf.tobytes()

Feed frames to this function with upscale_frame.map(frames) and Modal fans the work out across containers in parallel, turning a costly on‑prem GPU farm into a pay‑as‑you‑go pipeline.
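When individual frames are too small to justify one remote call each, a tiny batching helper keeps every container busy. This is plain Python, independent of Modal; batched is an illustrative helper, not a Modal API:

```python
from itertools import islice

def batched(frames, batch_size):
    # Yield successive lists of up to batch_size items, e.g. for .map() fan-out
    it = iter(frames)
    while batch := list(islice(it, batch_size)):
        yield batch

chunks = list(batched(range(7), 3))
print(chunks)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Each chunk can then be sent as one input to upscale_frame.map(...), amortizing per-call overhead over several frames.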

2. Scientific Simulations – Researchers running Monte Carlo simulations or finite‑element analyses can leverage Modal to parallelize thousands of tiny GPU jobs. By defining a function that accepts a random seed and returns a result, Modal’s scheduler distributes the work across a fleet of GPUs, aggregates the outcomes, and delivers a final report—all within minutes.

sim_image = modal.Image.debian_slim().pip_install(
    "numpy", "cupy-cuda12x"  # pick the CuPy wheel matching the image's CUDA version
)

@app.function(image=sim_image, gpu="any")
def monte_carlo(seed: int, n: int = 10_000) -> float:
    import cupy as cp
    cp.random.seed(seed)
    draws = cp.random.normal(size=n)
    return float(cp.mean(draws ** 2))

# Run 100 independent simulations in parallel
results = list(monte_carlo.map(range(100)))
print("Average of simulations:", sum(results) / len(results))

Because each simulation runs in isolation, you get deterministic reproducibility while still exploiting massive parallelism.
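Before paying for GPU seconds, it is worth sanity-checking the estimator on the CPU. Each draw is standard normal, so E[X²] = Var(X) = 1, and a NumPy mirror of the function should converge to 1.0:

```python
import numpy as np

def monte_carlo_local(seed: int, n: int = 10_000) -> float:
    # CPU mirror of the GPU function: mean of squared standard-normal draws
    rng = np.random.default_rng(seed)
    return float(np.mean(rng.standard_normal(n) ** 2))

# Average 100 independent runs; the estimate should sit near 1.0
estimate = sum(monte_carlo_local(i) for i in range(100)) / 100
print(f"estimate ≈ {estimate:.3f}")
```

Once the local estimate looks right, swapping np for cp moves the same arithmetic onto the GPU.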

Advanced Modal Features

Beyond the basics, Modal offers several knobs to fine‑tune performance and cost. Here are a few that often get overlooked but can make a big difference in production workloads.

Secrets Management

Hard‑coding API keys or database passwords is a security risk. Modal’s modal.Secret object lets you inject secrets into containers at runtime. Create a secret via the CLI and reference it in your function without ever exposing the raw value in source code.

# Create a secret (run once, substituting your real connection string)
# modal secret create my-db-secret DATABASE_URL="postgres://..."

db_image = modal.Image.debian_slim().pip_install("sqlalchemy", "psycopg2-binary")
db_secret = modal.Secret.from_name("my-db-secret")

@app.function(image=db_image, secrets=[db_secret])
def query_db():
    import os
    import sqlalchemy as sa
    engine = sa.create_engine(os.environ["DATABASE_URL"])
    with engine.connect() as conn:
        return conn.execute(sa.text("SELECT COUNT(*) FROM users")).scalar()

Custom Docker Images

Sometimes you need system‑level libraries (e.g., ffmpeg or libgl1) that aren’t available via pip. Modal lets you build a custom Dockerfile and use it as the base image for your functions. This keeps your environment reproducible and isolates complex dependencies.

custom_image = modal.Image.from_dockerfile("Dockerfile")  # Dockerfile installs ffmpeg, libgl1, etc.

@app.function(image=custom_image, gpu="any")
def process_video(path: str) -> str:
    import subprocess
    subprocess.run(
        ["ffmpeg", "-i", path, "-vf", "scale=1920:1080", "out.mp4"],
        check=True,  # raise instead of silently ignoring ffmpeg failures
    )
    return "out.mp4"
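The Dockerfile referenced above might look like the following sketch (illustrative; pin versions to match your reproducibility needs):

```dockerfile
FROM python:3.11-slim

# System-level libraries that pip cannot provide
RUN apt-get update \
    && apt-get install -y --no-install-recommends ffmpeg libgl1 \
    && rm -rf /var/lib/apt/lists/*
```

Cleaning the apt lists in the same RUN layer keeps the image small, which shortens Modal's cold-start pull time.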

Batching and Streaming

If your workload processes a stream of data (e.g., sensor readings), you can write a Modal generator function (any remote function that yields) and consume its output incrementally with .remote_gen(), piping data between functions without materializing intermediate files. This reduces I/O overhead and keeps the pipeline fully serverless.

@app.function()
def sensor_ingest(readings: list[dict]):
    # Generator function: filtered readings stream back to the caller as produced
    for reading in readings:
        if reading["temperature"] > 30:
            yield reading

@app.function()
def alert_handler(readings: list[dict]):
    for r in readings:
        print(f"⚠️ High temp: {r['temperature']} at {r['timestamp']}")

Connect the two with alert_handler.remote(list(sensor_ingest.remote_gen(batch))), where batch is a list of raw readings, to create an end‑to‑end pipeline that runs entirely on Modal’s infrastructure.

Pro tip: For high‑volume streams, fan readings out with .map() instead. Modal queues the inputs and feeds them to containers as capacity frees up, which gives you back‑pressure handling for free.
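The filtering logic itself is plain Python, so you can exercise it locally with synthetic readings before wiring it into Modal (the readings below are made-up sample data):

```python
readings = [
    {"temperature": 22.5, "timestamp": "10:00"},
    {"temperature": 31.2, "timestamp": "10:01"},
    {"temperature": 35.0, "timestamp": "10:02"},
]

def hot_readings(stream):
    # Same threshold filter the remote generator applies
    for r in stream:
        if r["temperature"] > 30:
            yield r

alerts = list(hot_readings(readings))
print(len(alerts))  # → 2
```

Keeping the filter as an ordinary generator makes it trivially unit-testable without any cloud round trips.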

Cost Management Strategies

Serverless doesn’t mean free. Modal charges per GPU second, plus a modest fee for storage and data transfer. Here are three practical ways to keep the bill under control:

  • Warm Pools: Set keep_warm=1 for functions you hit frequently. This maintains a single idle container, eliminating cold‑start latency at the price of paying for that container’s idle time.
  • Right‑Sized GPUs: Request the cheapest GPU that fits the job (e.g., gpu="T4" for light inference rather than an A100). gpu="any" is convenient, but pinning a smaller type can cut compute cost substantially for non‑time‑critical work.
  • Data Locality: Store large datasets in Modal volumes close to the compute nodes. Pulling data from remote S3 on every invocation adds both latency and egress charges.
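Back-of-the-envelope math helps when weighing these knobs, and per-second billing makes the arithmetic simple. A minimal sketch (the $1.10/hour rate is a hypothetical figure, not Modal's pricing):

```python
def gpu_job_cost(seconds: float, rate_per_hour: float) -> float:
    # Per-second billing: job duration times the hourly GPU rate, prorated
    return seconds * rate_per_hour / 3600

# A 90-second inference burst at a hypothetical $1.10/hour GPU rate
print(f"${gpu_job_cost(90, 1.10):.4f}")
```

Running the same numbers for a kept-warm container (86,400 seconds a day) quickly shows when a warm pool is worth its idle cost.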

By combining these tactics with the usage dashboard on modal.com, you can spot inefficiencies, set alerts, and iterate on your architecture without guessing.

Testing and Debugging Locally

Modal’s modal run command executes your app ephemerally: it builds the image, runs the function in the cloud with live logs streaming to your terminal, and tears everything down when the entrypoint returns. For logic‑only iteration, a function’s .local() method runs its body in your own process, which is invaluable for rapid debugging before you worry about GPU drivers or complex native dependencies.

# Run the training function ephemerally with live logs
modal run my_module.py::train_resnet --epochs 1

An ephemeral run respects the same image and volumes definitions as a deployment, so you can be confident that code that works there will also work once deployed. For step‑by‑step debugging, attach a remote debugger (e.g., debugpy) inside the container and connect from your IDE.

Best Practices Checklist

  1. Always pin library versions in Image.pip_install to guarantee reproducibility.
  2. Use modal.Volume for any data that needs to survive beyond a single function call.
  3. Leverage modal.Secret for credentials—never embed them in source code.
  4. Set appropriate concurrency and keep_warm values based on traffic patterns.
  5. Monitor cost dashboards weekly; adjust spot usage and warm pools as needed.
  6. Write unit tests that call .local() on your functions to validate logic without incurring cloud charges.

Conclusion

Modal transforms the way Python developers think about GPU workloads. By abstracting away infrastructure, handling secrets, and offering seamless scaling, it lets you focus on the core logic—whether that’s training a deep network, upscaling video frames, or running massive Monte Carlo simulations. With the patterns and pro tips covered in this guide, you’re ready to launch production‑grade, cost‑effective GPU jobs from a single Python file. Happy serverless computing!
