# ModelWrapper Guide

ModelWrapper enables zero-copy model sharing for PyTorch inference in Ray. It supports two execution modes: **task-based** for ad-hoc inference and **actor-based** for batch processing with Ray Data or Ray Actors.

## What is Zero-Copy?

Zero-copy means sharing model weights across Ray workers without duplicating them in memory:

1. Model weights are stored **once** in Ray's object store
2. Multiple workers **reference** the same memory location via object references
3. No duplication = significant memory savings

**Example:** 4 actors with a 5GB model use ~5GB total (not 20GB) because they share the same model weights from the object store.

## Quick Start

### Installation

```bash
pip install ray-zerocopy
```

**Requirements:**
- Python 3.11+
- PyTorch 2.0+
- Ray 2.43+

## Task Mode

Use task mode for ad-hoc inference calls. Each call spawns a Ray task with zero-copy model loading.

### Example: Task-Based Inference

```python
import torch
import torch.nn as nn
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
wrapped = ModelWrapper.for_tasks(model)

# Use immediately - each call spawns a Ray task
result = wrapped(torch.randn(1, 128))
```

## Actor Mode

Use actor mode for batch processing with Ray Data or long-running Ray Actors. The same pattern works for both.

### Example: Actor-Based Inference

```python
import ray
import torch
import torch.nn as nn
from ray.data import ActorPoolStrategy
from ray_zerocopy import ModelWrapper

# Define your model
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.network(x)

# Create and wrap model
model = SimpleClassifier()
model.eval()
model_wrapper = ModelWrapper.from_model(model, mode="actor")

# Define inference actor
class InferenceActor:
    def __init__(self, model_wrapper):
        # Load model once per actor (zero-copy, on CPU)
        self.model = model_wrapper.load()
        self.model.eval()

    def __call__(self, batch):
        inputs = torch.tensor(batch["data"], dtype=torch.float32)
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"predictions": outputs.numpy()}

# Use with Ray Data
ds = ray.data.from_items([{"data": [0.1] * 128} for _ in range(100)])
results = ds.map_batches(
    InferenceActor,
    fn_constructor_kwargs={"model_wrapper": model_wrapper},
    batch_size=32,
    compute=ActorPoolStrategy(size=4),  # 4 actors share the model
)

# Or use with Ray Actors
@ray.remote
class RayInferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()
        self.model.eval()

    def predict(self, data):
        with torch.no_grad():
            return self.model(torch.tensor(data, dtype=torch.float32))

actors = [RayInferenceActor.remote(model_wrapper) for _ in range(4)]
results = ray.get([actor.predict.remote([0.1] * 128) for actor in actors])
```

**Note:** The same actor pattern works for both Ray Data `map_batches` and Ray Actors. The only difference is whether you use `@ray.remote` decorator.

## Pipelines with Multiple Models

ModelWrapper automatically detects and shares all `nn.Module` attributes in your pipeline:

```python
class MyPipeline:
    def __init__(self):
        self.encoder = EncoderModel()  # nn.Module - shared
        self.decoder = DecoderModel()  # nn.Module - shared
        self.config = {"temp": 1.0}    # Regular attribute - copied

    def __call__(self, x):
        return self.decoder(self.encoder(x))

pipeline = MyPipeline()
model_wrapper = ModelWrapper.from_model(pipeline, mode="actor")

# In actor: both encoder and decoder are zero-copy shared
class InferenceActor:
    def __init__(self, model_wrapper):
        self.pipeline = model_wrapper.load()  # Loads both models
```

## Device Placement

Models are loaded on CPU by default. Move them to GPU after loading if needed:

```python
class InferenceActor:
    def __init__(self, model_wrapper):
        self.model = model_wrapper.load()  # Loaded on CPU
        # Move to GPU if available
        if torch.cuda.is_available():
            self.model = self.model.cuda()
```

## API Reference

### ModelWrapper.from_model()

```python
ModelWrapper.from_model(
    model_or_pipeline,
    mode="actor",  # "task" or "actor"
    model_attr_names=None,  # Optional: specify model attributes
    method_names=None,  # Optional: for task mode
)
```

### ModelWrapper.for_tasks()

Convenience shortcut for task mode:

```python
wrapped = ModelWrapper.for_tasks(model)
# Equivalent to: ModelWrapper.from_model(model, mode="task")
```

### model_wrapper.load()

Load the model in an actor (actor mode only):

```python
model = model_wrapper.load()  # Returns model on CPU
```

## When to Use What

| Scenario | Use This |
|----------|----------|
| Ad-hoc inference calls | `ModelWrapper.for_tasks()` |
| Ray Data batch inference | `ModelWrapper.from_model(..., mode="actor")` |
| Ray Actors (long-running) | `ModelWrapper.from_model(..., mode="actor")` |
| Batch processing workloads | `ModelWrapper.from_model(..., mode="actor")` |

## Memory Savings

**Without zero-copy:**
- Each actor loads its own copy: 4 actors × 5GB = 20GB

**With zero-copy:**
- Model stored once in object store: 5GB
- All actors reference the same memory: ~5GB total

## Next Steps

- See [JIT Wrappers](jit_wrappers.md) for TorchScript support (under development)
- Check [API Reference](../api_reference/model_wrappers.md) for detailed API docs