composer.datasets.imagenet

ImageNet classification dataset.

The most widely used dataset for image classification algorithms. Please refer to the ImageNet 2012 Classification Dataset for more details. This module also includes streaming versions of the dataset based on WebDataset.

Hparams

These classes are used with yahp for YAML-based configuration.

Imagenet1kWebDatasetHparams

Defines an instance of the ImageNet-1k WebDataset for image classification.

ImagenetDatasetHparams

Defines an instance of the ImageNet dataset for image classification.

TinyImagenet200WebDatasetHparams

Defines an instance of the TinyImagenet-200 WebDataset for image classification.

class composer.datasets.imagenet.Imagenet1kWebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-imagenet1k', name='imagenet1k', resize_size=-1, crop_size=224)

Bases: composer.datasets.hparams.WebDatasetHparams

Defines an instance of the ImageNet-1k WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory. Default: '/tmp/webdataset_cache/'.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity. Default: False.

  • shuffle_buffer (int) – Size of the WebDataset shuffle buffer. Default: 256.

  • remote (str) – S3 bucket or root directory where the dataset is stored. Default: 's3://mosaicml-internal-dataset-imagenet1k'.

  • name (str) – Key used to determine where the dataset is cached on the local filesystem. Default: 'imagenet1k'.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.
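
As a rough illustration of what drop_last changes, the sketch below counts the batches per epoch for both settings. It is plain Python and does not call Composer; the helper name num_batches is hypothetical, and the sample count is the size of the ImageNet-1k training set.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Hypothetical helper: batches per epoch for a given drop_last setting."""
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // batch_size
    # The trailing partial batch is kept (and padded up to batch_size).
    return math.ceil(num_samples / batch_size)


IMAGENET1K_TRAIN_SAMPLES = 1_281_167

print(num_batches(IMAGENET1K_TRAIN_SAMPLES, 256, drop_last=True))   # 5004
print(num_batches(IMAGENET1K_TRAIN_SAMPLES, 256, drop_last=False))  # 5005
```

With a batch size of 256, the final 143 samples are either dropped or padded into one extra batch.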

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
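
Because batch_size is already device-specific, a caller that starts from a global batch size has to divide by the world size first. A minimal sketch of that arithmetic follows; the helper per_device_batch_size is hypothetical and not part of Composer, and rejecting uneven splits is an assumption made for the example.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Hypothetical helper: split a global batch evenly across devices."""
    if global_batch_size % world_size != 0:
        # Assumption for this sketch: uneven splits are rejected.
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


# A global batch of 2048 samples on 8 devices gives 256 samples per device;
# 256 is the value that would be passed as batch_size.
print(per_device_batch_size(2048, 8))  # 256
```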

class composer.datasets.imagenet.ImagenetDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, resize_size=-1, crop_size=224, use_ffcv=False, ffcv_dir='/tmp', ffcv_dest='imagenet_train.ffcv', ffcv_write_dataset=False)

Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Defines an instance of the ImageNet dataset for image classification.

Parameters
  • use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.

  • synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.

  • synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU, which avoids PCI-e transfers in the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.

  • synthetic_memory_format (MemoryFormat, optional) – The MemoryFormat to use. Ignored if use_synthetic is False. Default: MemoryFormat.CONTIGUOUS_FORMAT.

  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.

  • use_ffcv (bool) – Whether to use FFCV dataloaders. Default: False.

  • ffcv_dir (str) – A directory containing train/val <file>.ffcv files. If these files don't exist and ffcv_write_dataset is True, train/val <file>.ffcv files will be created in this directory. Default: '/tmp'.

  • ffcv_dest (str) – The <file>.ffcv file that holds the dataset samples. Default: 'imagenet_train.ffcv'.

  • ffcv_write_dataset (bool) – Whether to create the dataset in FFCV format (<file>.ffcv) if it doesn't exist. Default: False.
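
The interaction between ffcv_dir, ffcv_dest, and ffcv_write_dataset can be pictured with the sketch below. It is not Composer's implementation: the helper resolve_ffcv_file is hypothetical, joining the directory and file name is an assumption, and raising an error when the file is missing and writing is disabled is also an assumption made for the example.

```python
import os
from typing import Tuple


def resolve_ffcv_file(ffcv_dir: str, ffcv_dest: str,
                      ffcv_write_dataset: bool) -> Tuple[str, bool]:
    """Hypothetical helper: return (path, needs_write) for an FFCV file."""
    # Assumption: the file lives directly inside ffcv_dir.
    path = os.path.join(ffcv_dir, ffcv_dest)
    if os.path.exists(path):
        # The file is already there, so nothing has to be written.
        return path, False
    if ffcv_write_dataset:
        # The file is missing and the caller allowed it to be created.
        return path, True
    # Assumption for this sketch: a missing file without write permission
    # is treated as an error.
    raise FileNotFoundError(f"{path} not found and ffcv_write_dataset is False")


# With the documented defaults for ffcv_dir and ffcv_dest, and writing enabled:
path, needs_write = resolve_ffcv_file("/tmp", "imagenet_train.ffcv", True)
print(path, needs_write)
```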

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.

class composer.datasets.imagenet.TinyImagenet200WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-tinyimagenet200', name='tinyimagenet200', n_train_samples=100000, n_val_samples=10000, height=64, width=64, n_classes=200, channel_means=(0.485, 0.456, 0.406), channel_stds=(0.229, 0.224, 0.225))

Bases: composer.datasets.hparams.WebDatasetHparams

Defines an instance of the TinyImagenet-200 WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory. Default: '/tmp/webdataset_cache/'.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity. Default: False.

  • shuffle_buffer (int) – Size of the WebDataset shuffle buffer. Default: 256.

  • remote (str) – S3 bucket or root directory where the dataset is stored. Default: 's3://mosaicml-internal-dataset-tinyimagenet200'.

  • name (str) – Key used to determine where the dataset is cached on the local filesystem. Default: 'tinyimagenet200'.

  • n_train_samples (int) – Number of training samples. Default: 100000.

  • n_val_samples (int) – Number of validation samples. Default: 10000.

  • height (int) – Sample image height in pixels. Default: 64.

  • width (int) – Sample image width in pixels. Default: 64.

  • n_classes (int) – Number of output classes. Default: 200.

  • channel_means (list of float) – Channel means for normalization. Default: (0.485, 0.456, 0.406).

  • channel_stds (list of float) – Channel standard deviations for normalization. Default: (0.229, 0.224, 0.225).
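
To make the role of channel_means and channel_stds concrete, the sketch below applies the usual per-channel normalization, (value - mean) / std, to a single RGB pixel. It is plain Python rather than Composer code; the helper normalize_pixel is hypothetical, and the sketch assumes pixel values have already been scaled to the range [0, 1].

```python
from typing import Sequence, Tuple

CHANNEL_MEANS = (0.485, 0.456, 0.406)
CHANNEL_STDS = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Sequence[float],
                    means: Sequence[float] = CHANNEL_MEANS,
                    stds: Sequence[float] = CHANNEL_STDS) -> Tuple[float, ...]:
    """Hypothetical helper: normalize one RGB pixel channel by channel."""
    # Assumption: rgb values are floats in [0, 1], matching the scale of
    # the default means and stds.
    return tuple((v - m) / s for v, m, s in zip(rgb, means, stds))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(CHANNEL_MEANS))  # (0.0, 0.0, 0.0)
```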

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.