Ray Data Transformations
Complete guide to data transformations in Ray Data.
Core operations
Map batches (vectorized)
# Recommended for performance
def process_batch(batch):
    # batch is a dict of NumPy arrays by default (a pandas DataFrame if batch_format="pandas")
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)
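Performance: often 10-100× faster than row-by-row processing, because each call handles a whole batch with vectorized operations. The actual gain depends on the workload.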
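To show where the speedup comes from, the sketch below applies the same doubling once as a single vectorized operation on a NumPy batch and once per row, then checks that the results agree. It runs locally without a cluster. The names `double_batch`, `double_row`, and `apply_with_ray` are illustrative, and the Ray call is wrapped in a helper that is not invoked here; it assumes `ray.data.from_items` and the default NumPy batch format.

```python
import numpy as np


def double_batch(batch):
    """Batch UDF: one vectorized multiply over the whole column."""
    # Assumes the default batch format: a dict mapping column name -> np.ndarray.
    batch["doubled"] = batch["value"] * 2
    return batch


def double_row(row):
    """Row UDF with the same result, called once per row."""
    row["doubled"] = row["value"] * 2
    return row


def apply_with_ray(values, batch_size=1000):
    """Hypothetical helper: run double_batch through Ray Data (needs ray installed)."""
    import ray  # imported lazily so the rest of the sketch runs without Ray

    ds = ray.data.from_items([{"value": v} for v in values])
    return ds.map_batches(double_batch, batch_size=batch_size).take_all()


# Local check: both styles agree on a small batch.
batch = {"value": np.arange(5)}
vectorized = double_batch(dict(batch))["doubled"].tolist()
row_by_row = [double_row({"value": int(v)})["doubled"] for v in batch["value"]]
```

The batch version makes one Python call per batch, while the row version makes one per row, which is the overhead that map_batches avoids.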
Map (row-by-row)
# Use only when vectorization is not possible
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)
Filter
# Keep only rows for which the predicate returns True
ds = ds.filter(lambda row: row["score"] > 0.5)
Flat map
# One row → multiple rows
def expand_row(row):
    return [{"value": row["value"] + i} for i in range(3)]

ds = ds.flat_map(expand_row)
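As a quick check of what `expand_row` produces, the sketch below applies it to plain Python dicts without Ray: each input row yields three output rows, and the results are flattened into one sequence. `flat_map_local` is a hypothetical stand-in for the dataset method, used only for illustration.

```python
from itertools import chain


def expand_row(row):
    # One input row becomes three rows: value, value + 1, value + 2.
    return [{"value": row["value"] + i} for i in range(3)]


def flat_map_local(rows, fn):
    """Hypothetical local stand-in for ds.flat_map: apply fn, then flatten."""
    return list(chain.from_iterable(fn(row) for row in rows))


rows = [{"value": 10}, {"value": 20}]
expanded = flat_map_local(rows, expand_row)
values = [r["value"] for r in expanded]
print(values)  # [10, 11, 12, 20, 21, 22]
```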
GPU-accelerated transforms
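def gpu_transform(batch):
    import torch  # imported inside the function so it loads on the worker

    # Move the batch column to the GPU
    data = torch.tensor(batch["data"]).cuda()
    # GPU processing
    result = data * 2
    # Copy the result back to host memory as a NumPy array
    return {"processed": result.cpu().numpy()}

ds = ds.map_batches(gpu_transform, num_gpus=1, batch_size=64)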
Groupby operations
# Group by column
grouped = ds.groupby("category")
# Aggregate
result = grouped.count()
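# Custom aggregation (values are wrapped in one-element arrays so each group yields a one-row batch)
import numpy as np

result = grouped.map_groups(lambda group: {
    "sum": np.array([group["value"].sum()]),
    "mean": np.array([group["value"].mean()]),
})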
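The per-group function can be exercised locally before handing it to map_groups. The sketch below is a minimal version under two assumptions: each group arrives as a dict of NumPy arrays, and the function returns one-element arrays so the result forms a valid one-row batch. The name `summarize_group` and the sample data are illustrative.

```python
import numpy as np


def summarize_group(group):
    """Per-group UDF: collapse one group's rows into a single summary row."""
    # Assumes the group arrives as a dict of NumPy arrays.
    return {
        "category": group["category"][:1],        # keep the group key
        "sum": np.array([group["value"].sum()]),   # one-element arrays form a one-row batch
        "mean": np.array([group["value"].mean()]),
    }


# Local check on a hand-built group; with Ray this would be
# ds.groupby("category").map_groups(summarize_group).
group = {
    "category": np.array(["a", "a", "a"]),
    "value": np.array([1.0, 2.0, 6.0]),
}
summary = summarize_group(group)
```

Carrying the `category` column through keeps each summary row identifiable after the groups are recombined.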
Best practices
- Use map_batches over map: vectorized batches are often 10-100× faster
- Tune batch_size: larger batches reduce per-call overhead but use more memory per task
- Use GPUs for heavy compute, such as image or audio preprocessing
- Stream large datasets: use iter_batches for data larger than memory
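The streaming advice can be sketched as a running aggregate over `iter_batches`, which keeps only one batch in memory at a time. The function below assumes the dataset's `iter_batches(batch_size=..., batch_format="numpy")` yields dicts of NumPy arrays. `streaming_mean` and `FakeDataset` are hypothetical names, and `FakeDataset` only mimics that one method so the example runs without Ray.

```python
import numpy as np


def streaming_mean(ds, column, batch_size=1024):
    """Compute a column mean one batch at a time, never holding the full dataset."""
    total, count = 0.0, 0
    # iter_batches yields batches lazily; batch_format="numpy" gives dicts of arrays.
    for batch in ds.iter_batches(batch_size=batch_size, batch_format="numpy"):
        total += float(batch[column].sum())
        count += len(batch[column])
    return total / count if count else float("nan")


class FakeDataset:
    """Hypothetical stand-in for a Ray Dataset so the sketch runs without Ray."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def iter_batches(self, batch_size=256, batch_format="numpy"):
        # Yield consecutive slices of at most batch_size rows.
        for start in range(0, len(self._values), batch_size):
            yield {"value": self._values[start:start + batch_size]}


mean = streaming_mean(FakeDataset(range(10)), "value", batch_size=4)
```

With a real dataset, the same `streaming_mean(ds, "value")` call would apply unchanged, since it relies only on `iter_batches`.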