ModelWrapper Guide
ModelWrapper enables zero-copy model sharing for PyTorch inference in Ray. It supports two execution modes: task-based for ad-hoc inference and actor-based for batch processing with Ray Data or Ray Actors.
What is Zero-Copy?
Zero-copy means sharing model weights across Ray workers without duplicating them in memory:
Model weights are stored once in Ray’s object store
Multiple workers reference the same memory location via object references
Avoiding duplication saves significant memory
Example: 4 actors on the same node with a 5GB model use ~5GB total (not 20GB) because they share the same model weights from the object store.
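The idea can be illustrated without Ray. The sketch below uses Python's standard-library `multiprocessing.shared_memory` as an analogy: two handles attach to one named buffer, and a write through one is visible through the other because nothing is copied. This is not how ray-zerocopy is implemented; the function name `shared_buffer_roundtrip` and the payload are made up for illustration.

```python
from multiprocessing import shared_memory


def shared_buffer_roundtrip(payload: bytes) -> bytes:
    """Write through one handle, read through a second handle to the same buffer."""
    # The "owner" allocates the buffer once, like storing weights once.
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # A second handle attaches by name; it maps the same memory, no copy.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            owner.buf[: len(payload)] = payload
            # bytes(...) copies out only so the result outlives the buffer.
            return bytes(reader.buf[: len(payload)])
        finally:
            reader.close()
    finally:
        owner.close()
        owner.unlink()


roundtrip = shared_buffer_roundtrip(b"weights")
```

The reader never received its own copy of the data; it saw the owner's write because both handles map the same memory.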
Quick Start
Installation
pip install ray-zerocopy
Requirements:
Python 3.11+
PyTorch 2.0+
Ray 2.43+
Task Mode
Use task mode for ad-hoc inference calls. Each call spawns a Ray task with zero-copy model loading.
Example: Task-Based Inference
import torch
import torch.nn as nn
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
wrapped = ModelWrapper.for_tasks(model)

# Use immediately - each call spawns a Ray task
result = wrapped(torch.randn(1, 128))
Actor Mode
Use actor mode for batch processing with Ray Data or long-running Ray Actors. The same pattern works for both.
Example: Actor-Based Inference
import ray
import torch
import torch.nn as nn
from ray.data import ActorPoolStrategy
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
model_wrapper = ModelWrapper.from_model(model, mode="actor")

# Define inference actor
class InferenceActor:
    def __init__(self, model_wrapper):
        # Load model once per actor (zero-copy, on CPU)
        self.model = model_wrapper.load()
        self.model.eval()

    def __call__(self, batch):
        inputs = torch.tensor(batch["data"], dtype=torch.float32)
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"predictions": outputs.numpy()}

# Use with Ray Data
ds = ray.data.from_items([{"data": [0.1] * 128} for _ in range(100)])
results = ds.map_batches(
    InferenceActor,
    fn_constructor_kwargs={"model_wrapper": model_wrapper},
    batch_size=32,
    compute=ActorPoolStrategy(size=4),  # 4 actors share the model
)

# Or use with Ray Actors
@ray.remote
class RayInferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()
        self.model.eval()

    def predict(self, data):
        with torch.no_grad():
            return self.model(torch.tensor(data, dtype=torch.float32))

actors = [RayInferenceActor.remote(model_wrapper) for _ in range(4)]
results = ray.get([actor.predict.remote([0.1] * 128) for actor in actors])
Note: The same actor pattern works for both Ray Data map_batches and Ray Actors. The only difference is whether you apply the @ray.remote decorator.
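The practical difference between the two modes is how often the model is loaded: per the descriptions above, task mode loads on every call, while an actor loads once in its constructor. The plain-Python sketch below counts loads to make that concrete. `FakeWrapper`, `run_as_tasks`, and `ActorLike` are hypothetical stand-ins, not part of ray_zerocopy, and no Ray processes are started.

```python
class FakeWrapper:
    """Hypothetical stand-in for a model wrapper that counts load() calls."""

    def __init__(self):
        self.load_count = 0

    def load(self):
        self.load_count += 1
        # Stand-in "model": doubles its input.
        return lambda x: 2 * x


def run_as_tasks(wrapper, inputs):
    """Task-style: every call loads the model before running it."""
    return [wrapper.load()(x) for x in inputs]


class ActorLike:
    """Actor-style: load once in the constructor, reuse for every call."""

    def __init__(self, wrapper):
        self.model = wrapper.load()

    def predict(self, x):
        return self.model(x)


task_wrapper = FakeWrapper()
task_outputs = run_as_tasks(task_wrapper, [1, 2, 3])

actor_wrapper = FakeWrapper()
actor = ActorLike(actor_wrapper)
actor_outputs = [actor.predict(x) for x in [1, 2, 3]]
```

Both styles produce the same outputs; the task-style wrapper records three loads and the actor-style wrapper records one. Since each load is zero-copy, repeated loads do not duplicate the weights; an actor simply avoids repeating the load step.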
Pipelines with Multiple Models
ModelWrapper automatically detects and shares all nn.Module attributes in your pipeline:
class MyPipeline:
    def __init__(self):
        self.encoder = EncoderModel()  # nn.Module - shared
        self.decoder = DecoderModel()  # nn.Module - shared
        self.config = {"temp": 1.0}    # Regular attribute - copied

    def __call__(self, x):
        return self.decoder(self.encoder(x))

pipeline = MyPipeline()
model_wrapper = ModelWrapper.from_model(pipeline, mode="actor")

# In actor: both encoder and decoder are zero-copy shared
class InferenceActor:
    def __init__(self, model_wrapper):
        self.pipeline = model_wrapper.load()  # Loads both models
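One plausible way to find the shareable attributes is to scan the pipeline's instance attributes for `nn.Module` values. The sketch below is an assumption about the approach, not ray_zerocopy's actual code. `find_model_attrs` is a hypothetical helper, and `Module` is a stand-in base class so the example runs without PyTorch (with PyTorch you would pass `torch.nn.Module` as `base_cls`).

```python
class Module:
    """Stand-in for torch.nn.Module so this sketch runs without PyTorch."""


class Encoder(Module):
    pass


class Decoder(Module):
    pass


class DemoPipeline:
    def __init__(self):
        self.encoder = Encoder()     # module: would be shared
        self.decoder = Decoder()     # module: would be shared
        self.config = {"temp": 1.0}  # plain attribute: would be copied


def find_model_attrs(pipeline, base_cls=Module):
    """Return sorted names of instance attributes that are base_cls instances."""
    return sorted(
        name
        for name, value in vars(pipeline).items()
        if isinstance(value, base_cls)
    )


model_attrs = find_model_attrs(DemoPipeline())
```

Here `model_attrs` contains `encoder` and `decoder` but not `config`. If automatic detection picks up the wrong attributes, the `model_attr_names` argument listed in the API reference appears to let you name them explicitly.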
Device Placement
Models are loaded on CPU by default. Move them to GPU after loading if needed:
import torch

class InferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()  # Loaded on CPU
        # Move to GPU if available
        if torch.cuda.is_available():
            self.model = self.model.cuda()
API Reference
ModelWrapper.from_model()
ModelWrapper.from_model(
    model_or_pipeline,
    mode="actor",           # "task" or "actor"
    model_attr_names=None,  # Optional: specify model attributes
    method_names=None,      # Optional: for task mode
)
ModelWrapper.for_tasks()
Convenience shortcut for task mode:
wrapped = ModelWrapper.for_tasks(model)
# Equivalent to: ModelWrapper.from_model(model, mode="task")
model_wrapper.load()
Load the model in an actor (actor mode only):
model = model_wrapper.load() # Returns model on CPU
When to Use What
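| Scenario | Use This |
|---|---|
| Ad-hoc inference calls | Task mode: ModelWrapper.for_tasks(model) |
| Ray Data batch inference | Actor mode: ModelWrapper.from_model(model, mode="actor") |
| Ray Actors (long-running) | Actor mode: ModelWrapper.from_model(model, mode="actor") |
| Batch processing workloads | Actor mode: ModelWrapper.from_model(model, mode="actor") |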
Memory Savings
Without zero-copy:
Each actor loads its own copy: 4 actors × 5GB = 20GB
With zero-copy:
Model stored once in object store: 5GB
All actors reference the same memory: ~5GB total
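The arithmetic above can be written as a small helper. `estimate_memory_gb` is a hypothetical function for back-of-the-envelope planning; it assumes all actors run on one node and ignores per-actor overhead such as activations and the Python runtime.

```python
def estimate_memory_gb(num_actors: int, model_gb: float, zero_copy: bool) -> float:
    """Estimate memory used by model weights across actors on one node."""
    if num_actors < 1:
        raise ValueError("num_actors must be at least 1")
    # With zero-copy, the weights are stored once and shared by every actor.
    copies = 1 if zero_copy else num_actors
    return copies * model_gb


without_sharing = estimate_memory_gb(4, 5.0, zero_copy=False)
with_sharing = estimate_memory_gb(4, 5.0, zero_copy=True)
savings = without_sharing - with_sharing
```

For the 4-actor, 5GB case this reproduces the figures above: 20GB without sharing and 5GB with it.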
Next Steps
See JIT Wrappers for TorchScript support (under development)
Check API Reference for detailed API docs