Ray Data Transformations

A guide to the core data transformations in Ray Data.

Core operations

Map batches (vectorized)

# Recommended for performance
def process_batch(batch):
    # By default, batch is a dict mapping column names to NumPy arrays
    # (with batch_format="pandas" it is a pandas DataFrame instead)
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)

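Performance: often 10-100× faster than row-by-row processing, because the function is called once per batch on whole arrays rather than once per row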
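
Because a batch function is plain Python over a dict of arrays, it can be checked on a hand-built batch before it is handed to Ray. The sketch below assumes the NumPy batch format. The run_pipeline helper and its ten-row dataset are illustrative and not part of the original guide; the helper is defined but not called, so the local check runs without a cluster.

```python
import numpy as np


def process_batch(batch: dict) -> dict:
    # With batch_format="numpy", each value is a NumPy array holding one column.
    batch["doubled"] = batch["value"] * 2
    return batch


def run_pipeline():
    # Hypothetical helper: ray is imported here so that process_batch
    # can be tested on machines where Ray is not installed.
    import ray

    ds = ray.data.from_items([{"value": i} for i in range(10)])
    ds = ds.map_batches(process_batch, batch_size=4, batch_format="numpy")
    return ds.take_all()


# Local check on a hand-built batch, no cluster needed.
sample = process_batch({"value": np.array([1, 2, 3])})
print(sample["doubled"])
```

Calling run_pipeline() on a machine with Ray installed should apply the same function to each batch of up to four rows.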

Map (row-by-row)

# Use only when vectorization is not possible
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)

Filter

# Keep only rows where the predicate is true; all other rows are dropped
ds = ds.filter(lambda row: row["score"] > 0.5)

Flat map

# One row → multiple rows
def expand_row(row):
    return [{"value": row["value"] + i} for i in range(3)]

ds = ds.flat_map(expand_row)
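
The row-level operations above can be reasoned about without a cluster. The following is a plain-Python model of what map, filter, and flat_map do to a list of row dicts. It illustrates the semantics only and is not how Ray executes them; the rows list is made-up sample data.

```python
def process_row(row):
    row = dict(row)  # copy so the input row is not mutated
    row["squared"] = row["value"] ** 2
    return row


def expand_row(row):
    return [{"value": row["value"] + i} for i in range(3)]


# Made-up sample rows for illustration.
rows = [{"value": 1, "score": 0.9}, {"value": 10, "score": 0.2}]

# map: exactly one output row per input row
mapped = [process_row(r) for r in rows]

# filter: keep only rows where the predicate holds
kept = [r for r in rows if r["score"] > 0.5]

# flat_map: each input row contributes zero or more output rows
expanded = [out for r in rows for out in expand_row(r)]

print(len(mapped), len(kept), len(expanded))  # prints: 2 1 6
```

The row counts show the difference: map preserves the count, filter can only shrink it, and flat_map can grow it.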

GPU-accelerated transforms

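def gpu_transform(batch):
    import torch  # imported inside the function so it loads on the worker
    # Move the batch column to the GPU assigned to this task
    data = torch.tensor(batch["data"]).cuda()
    # GPU processing
    result = data * 2
    # Copy the result back to host memory as a NumPy array
    return {"processed": result.cpu().numpy()}

# num_gpus=1 reserves one GPU for each worker running the transform
ds = ds.map_batches(gpu_transform, num_gpus=1, batch_size=64)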

Groupby operations

# Group by column
grouped = ds.groupby("category")

# Aggregate
result = grouped.count()

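# Custom aggregation: the function receives all rows of one group as a batch
# and must return a batch, so each scalar is wrapped in a one-element list
result = grouped.map_groups(lambda group: {
    "sum": [group["value"].sum()],
    "mean": [group["value"].mean()]
})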
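
A named function is easier to test than a lambda and can carry the group key into the output. This sketch assumes groups arrive as dicts of NumPy arrays (batch_format="numpy"). The names summarize_group and summarize_dataset, and the sample values, are illustrative rather than taken from the original guide.

```python
import numpy as np


def summarize_group(group: dict) -> dict:
    # Every row in a group shares the same key, so keep the first one.
    values = group["value"]
    return {
        "category": group["category"][:1],
        "sum": np.array([values.sum()]),
        "mean": np.array([values.mean()]),
    }


def summarize_dataset(ds):
    # Hypothetical helper: ds is assumed to be a ray.data.Dataset
    # with "category" and "value" columns.
    return ds.groupby("category").map_groups(summarize_group, batch_format="numpy")


# Local check on a hand-built group, no cluster needed.
group = {
    "category": np.array(["a", "a", "a"]),
    "value": np.array([1.0, 2.0, 6.0]),
}
summary = summarize_group(group)
print(summary)
```

Each group produces one output row, so the result has as many rows as there are distinct categories.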

Best practices

  1. Prefer map_batches over map - often 10-100× faster for vectorizable work
  2. Tune batch_size - larger batches reduce per-call overhead but use more memory per task
  3. Use GPUs for heavy compute - e.g. image/audio preprocessing
  4. Stream large datasets - use iter_batches for data larger than memory
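
Streaming consumption (item 4) can be sketched as an accumulator that sees one batch at a time, so memory use depends on the batch size rather than the dataset size. The helpers streaming_mean and mean_over_dataset are hypothetical; the generator at the end stands in for ds.iter_batches(...) so the logic can be checked without Ray.

```python
import numpy as np


def streaming_mean(batches, column="value"):
    """Compute a mean over an iterable of batches without holding them all."""
    total, count = 0.0, 0
    for batch in batches:
        values = np.asarray(batch[column])
        total += float(values.sum())
        count += values.size
    return total / count if count else float("nan")


def mean_over_dataset(ds, batch_size=4096):
    # Hypothetical helper: ds is assumed to be a ray.data.Dataset.
    # iter_batches yields one batch at a time instead of materializing the data.
    batches = ds.iter_batches(batch_size=batch_size, batch_format="numpy")
    return streaming_mean(batches)


# Local check: a generator of four 5-row batches covering the values 0..19.
fake_batches = ({"value": np.arange(i, i + 5)} for i in range(0, 20, 5))
result = streaming_mean(fake_batches)
print(result)  # prints: 9.5
```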