streaming

Modules

composer.datasets.streaming.dataset

The StreamingDataset class, used for building streaming iterable datasets.

composer.datasets.streaming.download

Download handling for StreamingDataset.

composer.datasets.streaming.format

The StreamingDatasetIndex format that defines shard/sample metadata for StreamingDataset.

composer.datasets.streaming.world

The World class, used by StreamingDataset to query distributed training info.

composer.datasets.streaming.writer

StreamingDatasetWriter is used to convert a list of samples into binary .mds files that can be read as a StreamingDataset.

MosaicML Streaming Datasets for cloud-native model training.

This is a new dataset class, StreamingDataset(torch.utils.data.IterableDataset), and an associated dataset format, shard-[00x].mds, that offer much better performance, shuffling, and usability than existing solutions.

A brief list of improvements:

  • No requirement of n_samples % n_shards == 0: Sharded datasets are complete with no dropped samples.

  • No requirement of n_shards % n_cpu_workers == 0: Supports reading from any number of devices, with any number of CPU workers.

  • Dataset is downloaded only ~once, regardless of the number of nodes, devices, and CPU workers, which avoids duplicate downloads and egress fees.

  • Dataset is cached on local storage after epoch 1.

  • When used with a torch.utils.data.DataLoader, the epoch boundaries are consistent (# samples, # batches) regardless of num_workers, producing (nearly) the same behavior as a map-style torch.utils.data.Dataset.

  • When data is read from a single device with num_workers <= 1, samples are read in-order (useful for local dataset inspection).

  • (TODO) Supports lazy random-access retrieval of samples (useful for local dataset inspection).

  • Shuffling is best-effort in epoch 1, and samples are made available for random access as they are being downloaded.

  • (TODO) Shuffling is perfect, i.e. random access (per-worker), in all subsequent epochs.
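The first two improvements above (no divisibility requirements, no dropped samples) can be illustrated with a minimal sketch. This is not StreamingDataset's actual partitioning logic; split_evenly is a hypothetical helper that only shows how a sample range can be divided across any number of devices and workers, with part sizes differing by at most one.

```python
from typing import List


def split_evenly(n_samples: int, n_parts: int) -> List[range]:
    """Split range(n_samples) into n_parts contiguous ranges.

    Hypothetical helper for illustration only. Part sizes differ by at
    most one, so no sample is dropped even when n_samples % n_parts != 0.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(n_samples, n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts each take one additional sample.
        size = base + (1 if i < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


# 10 samples over 2 devices x 2 CPU workers: 10 % 4 != 0, yet every sample is assigned.
parts = split_evenly(10, 2 * 2)
print([list(p) for p in parts])  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Because the total number of assigned samples always equals n_samples, the epoch length does not depend on how many workers read the data.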

Classes

StreamingDataset

A sharded, streaming, iterable dataset.

StreamingDatasetWriter

Used for writing a StreamingDataset from a list of samples.
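As a rough illustration of how a list of samples might be grouped into shards, here is a minimal sketch. It does not use the StreamingDatasetWriter API or the .mds binary layout; group_into_shards and shard_size_limit are hypothetical names, and the sketch only shows size-limited grouping in which the final shard may be smaller than the others, so no sample is dropped.

```python
from typing import Iterable, List


def group_into_shards(samples: Iterable[bytes], shard_size_limit: int) -> List[List[bytes]]:
    """Group encoded samples into shards of at most shard_size_limit bytes.

    Hypothetical helper for illustration only. A sample larger than the
    limit is placed in a shard of its own rather than being dropped.
    """
    shards: List[List[bytes]] = []
    current: List[bytes] = []
    current_size = 0
    for sample in samples:
        # Start a new shard when adding this sample would exceed the limit.
        if current and current_size + len(sample) > shard_size_limit:
            shards.append(current)
            current, current_size = [], 0
        current.append(sample)
        current_size += len(sample)
    if current:
        # The last shard may be partially filled.
        shards.append(current)
    return shards


samples = [b"a" * 4, b"b" * 4, b"c" * 4, b"d" * 4, b"e" * 4]
shards = group_into_shards(samples, shard_size_limit=10)
print([len(s) for s in shards])  # [2, 2, 1]
```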