composer.datasets.cifar

CIFAR image classification dataset.

The CIFAR datasets are a collection of labeled 32x32 colour images. Please refer to the CIFAR dataset for more details.

Hparams

These classes are used with yahp for YAML-based configuration.

CIFAR100WebDatasetHparams

Defines an instance of the CIFAR-100 WebDataset for image classification.

CIFAR10DatasetHparams

Defines an instance of the CIFAR-10 dataset for image classification from a local disk.

CIFAR10WebDatasetHparams

Defines an instance of the CIFAR-10 WebDataset for image classification.

CIFAR20WebDatasetHparams

Defines an instance of the CIFAR-20 WebDataset for image classification.

CIFARWebDatasetHparams

Common functionality for CIFAR WebDatasets.

class composer.datasets.cifar.CIFAR100WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar100', name='cifar100', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=100, channel_means=(0.5071, 0.4867, 0.4408), channel_stds=(0.2675, 0.2565, 0.2761))

Bases: composer.datasets.cifar.CIFARWebDatasetHparams

Defines an instance of the CIFAR-100 WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity.

  • remote (str) – S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar100'.

  • name (str) – Key used to determine where dataset is cached on local filesystem. Default: 'cifar100'.

  • n_train_samples (int) – Number of training samples. Default: 50000.

  • n_val_samples (int) – Number of validation samples. Default: 10000.

  • height (int) – Sample image height in pixels. Default: 32.

  • width (int) – Sample image width in pixels. Default: 32.

  • n_classes (int) – Number of output classes. Default: 100.

  • channel_means (list of float) – Channel means for normalization. Default: (0.5071, 0.4867, 0.4408).

  • channel_stds (list of float) – Channel stds for normalization. Default: (0.2675, 0.2565, 0.2761).
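The channel_means and channel_stds parameters describe per-channel statistics used to standardize images. The following is a minimal sketch of that arithmetic in plain Python, not the library's own transform code. It assumes pixel values are already scaled to the range [0, 1]; the helper names normalize_pixel and denormalize_pixel are hypothetical and exist only for illustration.

```python
# Channel statistics copied from the CIFAR-100 defaults documented above.
CHANNEL_MEANS = (0.5071, 0.4867, 0.4408)
CHANNEL_STDS = (0.2675, 0.2565, 0.2761)


def normalize_pixel(rgb, means=CHANNEL_MEANS, stds=CHANNEL_STDS):
    """Standardize one RGB pixel: subtract the channel mean, divide by the channel std.

    Assumes each value in `rgb` is already scaled to [0, 1].
    """
    return tuple((value - mean) / std for value, mean, std in zip(rgb, means, stds))


def denormalize_pixel(rgb, means=CHANNEL_MEANS, stds=CHANNEL_STDS):
    """Invert normalize_pixel: multiply by the channel std, add the channel mean."""
    return tuple(value * std + mean for value, mean, std in zip(rgb, means, stds))


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(CHANNEL_MEANS)
```

Because the division is per channel, a standard deviation of zero would raise a ZeroDivisionError, so real statistics must be supplied before normalizing.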

class composer.datasets.cifar.CIFAR10DatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, download=True, use_ffcv=False, ffcv_dir='/tmp', ffcv_dest='cifar_train.ffcv', ffcv_write_dataset=False)

Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Defines an instance of the CIFAR-10 dataset for image classification from a local disk.

Parameters
  • use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.

  • synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.

  • synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and eliminate PCI-e bandwidth use by the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.

  • synthetic_memory_format – The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.

  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • download (bool) – Whether to download the dataset, if needed. Default: True.

  • use_ffcv (bool) – Whether to use FFCV dataloaders. Default: False.

  • ffcv_dir (str) – A directory containing train/val <file>.ffcv files. If these files don't exist and ffcv_write_dataset is True, train/val <file>.ffcv files will be created in this directory. Default: "/tmp".

  • ffcv_dest (str) – <file>.ffcv file that has dataset samples. Default: "cifar_train.ffcv".

  • ffcv_write_dataset (bool) – Whether to create the dataset in FFCV format (<file>.ffcv) if it doesn't exist. Default: False.
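The effect of drop_last on the number of batches per epoch can be sketched with simple arithmetic. This is an illustrative helper, not part of Composer; the function name num_batches is hypothetical, and the sample counts are the CIFAR train and validation sizes listed elsewhere on this page.

```python
import math


def num_batches(num_samples, batch_size, drop_last=True):
    """Return how many batches one pass over the data yields.

    With drop_last=True the final partial batch is discarded; otherwise it is
    kept (padded, per the parameter description), so the count rounds up.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# 50000 training samples with a batch size of 128 leave a remainder of 80.
train_dropped = num_batches(50000, 128, drop_last=True)
train_padded = num_batches(50000, 128, drop_last=False)
```

When the sample count is an exact multiple of the batch size, both settings give the same number of batches.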

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
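Since batch_size here is per device, a caller that starts from a global batch size has to divide by the world size first. The sketch below shows one way to do that under the assumption that the global batch is split evenly across devices; per_device_batch_size is a hypothetical helper, not a Composer function.

```python
def per_device_batch_size(global_batch_size, world_size):
    """Split a global batch size evenly across `world_size` devices.

    Raises ValueError instead of silently truncating when the split is uneven.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


# A global batch of 1024 samples on 8 devices gives 128 samples per device.
device_batch = per_device_batch_size(1024, 8)
```

The resulting value is what would be passed as batch_size to initialize_object on each device.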

class composer.datasets.cifar.CIFAR10WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar10', name='cifar10', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=10, channel_means=(0.4914, 0.4822, 0.4465), channel_stds=(0.247, 0.243, 0.261))

Bases: composer.datasets.cifar.CIFARWebDatasetHparams

Defines an instance of the CIFAR-10 WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity.

  • remote (str) – S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar10'.

  • name (str) – Key used to determine where dataset is cached on local filesystem. Default: 'cifar10'.

  • n_train_samples (int) – Number of training samples. Default: 50000.

  • n_val_samples (int) – Number of validation samples. Default: 10000.

  • height (int) – Sample image height in pixels. Default: 32.

  • width (int) – Sample image width in pixels. Default: 32.

  • n_classes (int) – Number of output classes. Default: 10.

  • channel_means (list of float) – Channel means for normalization. Default: (0.4914, 0.4822, 0.4465).

  • channel_stds (list of float) – Channel stds for normalization. Default: (0.247, 0.243, 0.261).

class composer.datasets.cifar.CIFAR20WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar20', name='cifar20', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=20, channel_means=(0.5071, 0.4867, 0.4408), channel_stds=(0.2675, 0.2565, 0.2761))

Bases: composer.datasets.cifar.CIFARWebDatasetHparams

Defines an instance of the CIFAR-20 WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity.

  • remote (str) – S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar20'.

  • name (str) – Key used to determine where dataset is cached on local filesystem. Default: 'cifar20'.

  • n_train_samples (int) – Number of training samples. Default: 50000.

  • n_val_samples (int) – Number of validation samples. Default: 10000.

  • height (int) – Sample image height in pixels. Default: 32.

  • width (int) – Sample image width in pixels. Default: 32.

  • n_classes (int) – Number of output classes. Default: 20.

  • channel_means (list of float) – Channel means for normalization. Default: (0.5071, 0.4867, 0.4408).

  • channel_stds (list of float) – Channel stds for normalization. Default: (0.2675, 0.2565, 0.2761).

class composer.datasets.cifar.CIFARWebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='', name='', n_train_samples=0, n_val_samples=0, height=32, width=32, n_classes=0, channel_means=(0, 0, 0), channel_stds=(0, 0, 0))

Bases: composer.datasets.hparams.WebDatasetHparams

Common functionality for CIFAR WebDatasets.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity.

  • remote (str) – S3 bucket or root directory where dataset is stored.

  • name (str) – Key used to determine where dataset is cached on local filesystem.

  • n_train_samples (int) – Number of training samples.

  • n_val_samples (int) – Number of validation samples.

  • height (int) – Sample image height in pixels. Default: 32.

  • width (int) – Sample image width in pixels. Default: 32.

  • n_classes (int) – Number of output classes.

  • channel_means (list of float) – Channel means for normalization.

  • channel_stds (list of float) – Channel stds for normalization.
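The base class signature carries placeholder defaults (empty strings and zeros), while each concrete CIFAR WebDataset class supplies real values. The pattern can be sketched with standard-library dataclasses as below. These classes are hypothetical stand-ins written for illustration; they are not the Composer hparams classes and do not use yahp. The is_usable check is likewise an assumption about what a sensible configuration needs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BaseCIFARConfig:
    """Hypothetical stand-in for the shared base: placeholder defaults only."""
    remote: str = ''
    name: str = ''
    n_train_samples: int = 0
    n_val_samples: int = 0
    height: int = 32
    width: int = 32
    n_classes: int = 0
    channel_means: Tuple[float, float, float] = (0, 0, 0)
    channel_stds: Tuple[float, float, float] = (0, 0, 0)

    def is_usable(self) -> bool:
        # Placeholder values (no remote, zero classes, zero stds) are not usable.
        return (bool(self.remote) and self.n_classes > 0
                and all(std > 0 for std in self.channel_stds))


@dataclass
class CIFAR10Config(BaseCIFARConfig):
    """Overrides the placeholders with the CIFAR-10 defaults from the signature above."""
    remote: str = 's3://mosaicml-internal-dataset-cifar10'
    name: str = 'cifar10'
    n_train_samples: int = 50000
    n_val_samples: int = 10000
    n_classes: int = 10
    channel_means: Tuple[float, float, float] = (0.4914, 0.4822, 0.4465)
    channel_stds: Tuple[float, float, float] = (0.247, 0.243, 0.261)


base = BaseCIFARConfig()
cifar10 = CIFAR10Config()
```

Fields that are not overridden, such as height and width, keep the value inherited from the base.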

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.