ModelWrapper Guide

ModelWrapper enables zero-copy model sharing for PyTorch inference in Ray. It supports two execution modes: task-based for ad-hoc inference and actor-based for batch processing with Ray Data or Ray Actors.

What is Zero-Copy?

Zero-copy means sharing model weights across Ray workers without duplicating them in memory:

  1. Model weights are stored once in Ray’s object store

  2. Multiple workers reference the same memory location via object references

  3. No duplication = significant memory savings

Example: 4 actors with a 5GB model use ~5GB total (not 20GB) because they share the same model weights from the object store.

Quick Start

Installation

pip install ray-zerocopy

Requirements:

  • Python 3.11+

  • PyTorch 2.0+

  • Ray 2.43+

Task Mode

Use task mode for ad-hoc inference calls. Each call spawns a Ray task with zero-copy model loading.

Example: Task-Based Inference

import torch
import torch.nn as nn
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
wrapped = ModelWrapper.for_tasks(model)

# Use immediately - each call spawns a Ray task
result = wrapped(torch.randn(1, 128))

Actor Mode

Use actor mode for batch processing with Ray Data or long-running Ray Actors. The same pattern works for both.

Example: Actor-Based Inference

import ray
import torch
import torch.nn as nn
from ray.data import ActorPoolStrategy
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
model_wrapper = ModelWrapper.from_model(model, mode="actor")

# Define inference actor
class InferenceActor:
    def __init__(self, model_wrapper):
        # Load model once per actor (zero-copy, on CPU)
        self.model = model_wrapper.load()
        self.model.eval()

    def __call__(self, batch):
        inputs = torch.tensor(batch["data"], dtype=torch.float32)
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"predictions": outputs.numpy()}

# Use with Ray Data
ds = ray.data.from_items([{"data": [0.1] * 128} for _ in range(100)])
results = ds.map_batches(
    InferenceActor,
    fn_constructor_kwargs={"model_wrapper": model_wrapper},
    batch_size=32,
    compute=ActorPoolStrategy(size=4),  # 4 actors share the model
)

# Or use with Ray Actors
@ray.remote
class RayInferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()
        self.model.eval()

    def predict(self, data):
        with torch.no_grad():
            return self.model(torch.tensor(data, dtype=torch.float32))

actors = [RayInferenceActor.remote(model_wrapper) for _ in range(4)]
results = ray.get([actor.predict.remote([0.1] * 128) for actor in actors])

Note: The same actor pattern works for both Ray Data map_batches and Ray Actors. The only difference is whether you use @ray.remote decorator.

Pipelines with Multiple Models

ModelWrapper automatically detects and shares all nn.Module attributes in your pipeline:

class MyPipeline:
    def __init__(self):
        self.encoder = EncoderModel()  # nn.Module - shared
        self.decoder = DecoderModel()  # nn.Module - shared
        self.config = {"temp": 1.0}    # Regular attribute - copied

    def __call__(self, x):
        return self.decoder(self.encoder(x))

pipeline = MyPipeline()
model_wrapper = ModelWrapper.from_model(pipeline, mode="actor")

# In actor: both encoder and decoder are zero-copy shared
class InferenceActor:
    def __init__(self, model_wrapper):
        self.pipeline = model_wrapper.load()  # Loads both models

Device Placement

Models are loaded on CPU by default. Move them to GPU after loading if needed:

class InferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()  # Loaded on CPU
        # Move to GPU if available
        if torch.cuda.is_available():
            self.model = self.model.cuda()

API Reference

ModelWrapper.from_model()

ModelWrapper.from_model(
    model_or_pipeline,
    mode="actor",  # "task" or "actor"
    model_attr_names=None,  # Optional: specify model attributes
    method_names=None,  # Optional: for task mode
)

ModelWrapper.for_tasks()

Convenience shortcut for task mode:

wrapped = ModelWrapper.for_tasks(model)
# Equivalent to: ModelWrapper.from_model(model, mode="task")

model_wrapper.load()

Load the model in an actor (actor mode only):

model = model_wrapper.load()  # Returns model on CPU

When to Use What

Scenario

Use This

Ad-hoc inference calls

ModelWrapper.for_tasks()

Ray Data batch inference

ModelWrapper.from_model(..., mode="actor")

Ray Actors (long-running)

ModelWrapper.from_model(..., mode="actor")

Batch processing workloads

ModelWrapper.from_model(..., mode="actor")

Memory Savings

Without zero-copy:

  • Each actor loads its own copy: 4 actors × 5GB = 20GB

With zero-copy:

  • Model stored once in object store: 5GB

  • All actors reference the same memory: ~5GB total

Next Steps