composer.datasets.webdataset_utils
Functions
Return a pipeline for WebDataset-style data files.
create_webdataset | Write an entire WebDataset to a local directory, given an iterable of samples.
create_webdatasets_from_image_folder | Given a directory tree of classified images, create a WebDataset per dataset split.
init_webdataset_meta | Read a WebDataset meta file.
load_webdataset | Load WebDataset from remote, optionally caching, with the given preprocessing and batching.
Capture C-level stdout/stderr in a context manager.
Classes
Like TarWriter but splits into multiple shards.
Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.
Attributes
Any
Dict
Iterable
List
Optional
TYPE_CHECKING
Tuple
Union
annotations
log
webdataset_installed
- composer.datasets.webdataset_utils.create_webdataset(samples, dataset_dir, split, n_samples, n_shards, use_tqdm=True)
Write an entire WebDataset to a local directory, given an iterable of samples.
- Parameters
samples (iterable of dict) – Each dataset sample.
dataset_dir (str) – Output dataset directory.
split (str) – Dataset split.
n_samples (int) – Number of samples in dataset.
n_shards (int) – Number of full shards to write (may write a leftovers shard).
use_tqdm (bool) – Whether to show progress with tqdm.
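The relationship between n_samples, n_shards, and the optional leftovers shard can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: it assumes each full shard holds n_samples // n_shards samples and that any remainder goes into one extra shard. The helper names (shard_sizes, write_shards), the "__key__" field, and the shard file naming are hypothetical choices for the example.

```python
import io
import os
import tarfile
import tempfile
from typing import Dict, Iterable, List


def shard_sizes(n_samples: int, n_shards: int) -> List[int]:
    """Sizes of the full shards, plus one leftovers shard if needed (assumed layout)."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    sizes = [n_samples // n_shards] * n_shards
    leftover = n_samples % n_shards
    if leftover:
        sizes.append(leftover)
    return sizes


def write_shards(samples: Iterable[Dict[str, bytes]], dataset_dir: str, split: str,
                 n_samples: int, n_shards: int) -> List[str]:
    """Write samples into tar shards under dataset_dir/split and return the shard paths."""
    split_dir = os.path.join(dataset_dir, split)
    os.makedirs(split_dir, exist_ok=True)
    sample_iter = iter(samples)
    paths = []
    for shard_idx, size in enumerate(shard_sizes(n_samples, n_shards)):
        path = os.path.join(split_dir, f"{shard_idx:05d}.tar")  # hypothetical naming
        with tarfile.open(path, "w") as tar:
            for _ in range(size):
                sample = next(sample_iter)
                key = sample["__key__"]
                # All files of one sample share a base name; the field name is the extension.
                for ext, payload in sample.items():
                    if ext == "__key__":
                        continue
                    info = tarfile.TarInfo(name=f"{key}.{ext}")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
        paths.append(path)
    return paths


# Small demonstration: 10 samples over 3 full shards leaves 1 sample for a leftovers shard.
with tempfile.TemporaryDirectory() as tmp:
    demo_samples = ({"__key__": f"{i:05d}", "txt": str(i).encode()} for i in range(10))
    demo_paths = write_shards(demo_samples, tmp, "train", n_samples=10, n_shards=3)
    demo_names = [os.path.basename(p) for p in demo_paths]
    with tarfile.open(demo_paths[-1]) as tar:
        demo_last_members = tar.getnames()
```

Under these assumptions, a dataset whose size divides evenly by n_shards produces exactly n_shards files, and any other size produces one more.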
- composer.datasets.webdataset_utils.create_webdatasets_from_image_folder(in_root, out_root, n_shards, use_tqdm=True)
Given a directory tree of classified images, create a WebDataset per dataset split.
Directory tree format: (path to dataset)/(split name)/(class name)/(image file).
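To make the expected directory layout concrete, the sketch below walks a tree of the form (split name)/(class name)/(image file) and collects image paths with integer class labels per split. It is only an illustration of the layout: the function name scan_image_folder is hypothetical, and assigning class indices by sorted class name is an assumption, not something the documentation above states.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def scan_image_folder(in_root: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each split name to a list of (image path, class index) pairs."""
    splits: Dict[str, List[Tuple[str, int]]] = {}
    for split_dir in sorted(p for p in Path(in_root).iterdir() if p.is_dir()):
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        items: List[Tuple[str, int]] = []
        # Assumption: class indices follow the sorted order of class directory names.
        for class_idx, class_dir in enumerate(class_dirs):
            for image in sorted(p for p in class_dir.iterdir() if p.is_file()):
                items.append((str(image), class_idx))
        splits[split_dir.name] = items
    return splits


# Small demonstration on a throwaway tree with two splits and two classes.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("train/cat/a.jpg", "train/dog/b.jpg", "val/cat/c.jpg"):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")  # empty placeholder instead of real image data
    demo_index = scan_image_folder(tmp)
    demo_labels = {
        split: [(Path(p).name, cls) for p, cls in items]
        for split, items in demo_index.items()
    }
```

Each per-split list produced this way could then serve as the sample source for one WebDataset per split.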
- composer.datasets.webdataset_utils.init_webdataset_meta(remote, split=None)
Read a WebDataset meta file.
- composer.datasets.webdataset_utils.load_webdataset(remote, name, split, cache_dir, cache_verbose, shuffle, shuffle_buffer, preprocess, n_devices, workers_per_device, batch_size, drop_last)
Load WebDataset from remote, optionally caching, with the given preprocessing and batching.
- Parameters
remote (str) – Remote path (either an s3:// url or a directory on local filesystem).
name (str) – Name of this dataset, used to locate dataset in local cache.
cache_dir (str, optional) – Root directory of local filesystem cache.
cache_verbose (bool) – WebDataset caching verbosity.
shuffle (bool) – Whether to shuffle samples.
shuffle_buffer (int) – How many samples to buffer when shuffling.
preprocess (Callable) – What transformations to apply to the samples, as WebDataset iterator(s).
n_devices (int) – Number of devices.
workers_per_device (int) – Number of workers per device.
batch_size (int) – Batch size.
drop_last (bool) – Whether to drop partial last batches.
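The shuffle_buffer, batch_size, and drop_last parameters describe common streaming-dataset behavior, which can be sketched in plain Python. This is a generic illustration of buffer-based shuffling and batching over an iterator, not the code that load_webdataset runs; the function names buffered_shuffle and batched are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(samples: Iterable[T], buffer_size: int, rng: random.Random) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size samples in memory."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buffer: List[T] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)
            continue
        # Emit a random buffered sample and put the new one in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # The stream is exhausted: flush whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


def batched(samples: Iterable[T], batch_size: int, drop_last: bool) -> Iterator[List[T]]:
    """Group samples into lists of batch_size, optionally dropping a partial final batch."""
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


# Small demonstration: 10 samples, batch size 4.
demo_kept = list(batched(range(10), batch_size=4, drop_last=False))
demo_dropped = list(batched(range(10), batch_size=4, drop_last=True))
demo_shuffled = list(buffered_shuffle(range(100), buffer_size=8, rng=random.Random(0)))
```

A larger buffer gives a shuffle closer to uniform at the cost of memory, and drop_last=True keeps every batch the same size.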