composer.datasets.webdataset#

Functions

WebDataset

Return a pipeline for WebDataset-style data files.

create_webdataset

Write an entire WebDataset to a local directory, given an iterable of samples.

create_webdatasets_from_image_folder

Given a directory tree of classified images, create a WebDataset per dataset split.

load_webdataset

Load WebDataset from remote, optionally caching, with the given preprocessing and batching.

pipes

Capture C-level stdout/stderr in a context manager.

Classes

ShardWriter

Like TarWriter but splits into multiple shards.

tqdm

Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.

Attributes

  • Any

  • Dict

  • Iterable

  • List

  • Optional

  • TYPE_CHECKING

  • Tuple

  • Union

  • log

  • webdataset_installed

composer.datasets.webdataset.create_webdataset(samples, dataset_dir, split, n_samples, n_shards, use_tqdm=True)[source]#

Write an entire WebDataset to a local directory, given an iterable of samples.

Parameters
  • samples (iterable of dict) – Each dataset sample.

  • dataset_dir (str) – Output dataset directory.

  • split (str) – Dataset split.

  • n_samples (int) – Number of samples in the dataset.

  • n_shards (int) – Number of full shards to write (a smaller leftovers shard may also be written).

  • use_tqdm (bool) – Whether to show progress with tqdm.
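To make the shard-count behavior concrete, here is a hedged sketch (not Composer's actual implementation) of how n_samples could be divided into n_shards full shards plus one trailing leftovers shard:

```python
# Hypothetical helper illustrating the "may write a leftovers shard"
# behavior described above; the function name is made up for this sketch.
def plan_shards(n_samples: int, n_shards: int) -> list:
    """Return the number of samples written to each output shard."""
    samples_per_shard = n_samples // n_shards       # size of each full shard
    leftover = n_samples - samples_per_shard * n_shards
    sizes = [samples_per_shard] * n_shards
    if leftover:
        sizes.append(leftover)                      # trailing partial shard
    return sizes

print(plan_shards(10, 3))   # three full shards of 3, plus a leftovers shard of 1
```

Under this scheme every shard except possibly the last has identical size, which keeps shard files uniformly sized for sharded shuffling.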

composer.datasets.webdataset.create_webdatasets_from_image_folder(in_root, out_root, n_shards, use_tqdm=True)[source]#

Given a directory tree of classified images, create a WebDataset per dataset split.

Directory tree format: (path to dataset)/(split name)/(class name)/(image file).

Parameters
  • in_root (str) – Input dataset root.

  • out_root (str) – Output WebDataset root.

  • n_shards (int) – Number of full shards to write (a smaller leftovers shard may also be written).

  • use_tqdm (bool) – Whether to show progress with tqdm.
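The expected directory layout, (in_root)/(split)/(class)/(image file), can be illustrated with a throwaway tree. The split, class, and file names below are invented for the example:

```python
import os
import tempfile

# Build a tiny tree in the (in_root)/(split)/(class)/(image) layout.
in_root = tempfile.mkdtemp()
for split in ("train", "val"):
    for cls in ("cat", "dog"):
        class_dir = os.path.join(in_root, split, cls)
        os.makedirs(class_dir)
        with open(os.path.join(class_dir, "0.jpg"), "wb") as f:
            f.write(b"\xff\xd8")  # placeholder bytes, not a real JPEG

# Enumerate (split, class, filename) triples the way a converter would.
found = sorted(
    (split, cls, name)
    for split in os.listdir(in_root)
    for cls in os.listdir(os.path.join(in_root, split))
    for name in os.listdir(os.path.join(in_root, split, cls))
)
print(found)
```

Each top-level directory under in_root becomes one dataset split, and each subdirectory name supplies the class label for the images inside it.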

composer.datasets.webdataset.load_webdataset(remote, name, split, cache_dir, cache_verbose, shuffle, shuffle_buffer, preprocess, n_devices, workers_per_device, batch_size, drop_last)[source]#

Load WebDataset from remote, optionally caching, with the given preprocessing and batching.

Parameters
  • remote (str) – Remote path (either an s3:// URL or a directory on the local filesystem).

  • name (str) – Name of this dataset, used to locate the dataset in the local cache.

  • split (str) – Which dataset split to load.

  • cache_dir (str, optional) – Root directory of the local filesystem cache.

  • cache_verbose (bool) – WebDataset caching verbosity.

  • shuffle (bool) – Whether to shuffle samples.

  • shuffle_buffer (int) – How many samples to buffer when shuffling.

  • preprocess (Callable) – What transformations to apply to the samples, as WebDataset iterator(s).

  • n_devices (int) – Number of devices.

  • workers_per_device (int) – Number of workers per device.

  • batch_size (int) – Batch size.

  • drop_last (bool) – Whether to drop partial last batches.
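The interaction of batch_size, n_devices, and drop_last can be sketched with simple arithmetic. This is a hedged illustration of the general sharded-loading pattern, not a formula taken from Composer's source:

```python
import math

# Hypothetical helper: how many batches one device yields when the global
# sample count is split evenly across n_devices.
def batches_per_device(n_samples: int, n_devices: int,
                       batch_size: int, drop_last: bool) -> int:
    per_device = n_samples // n_devices          # samples routed to one device
    if drop_last:
        return per_device // batch_size          # partial final batch discarded
    return math.ceil(per_device / batch_size)    # partial final batch kept

print(batches_per_device(10_000, 8, 32, drop_last=True))
print(batches_per_device(10_000, 8, 32, drop_last=False))
```

With drop_last=True every device yields the same number of identically sized batches, which is the usual choice for multi-device training so that all devices stay in lockstep.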