composer.datasets.imagenet

ImageNet classification dataset.

ImageNet is the most widely used dataset for image classification algorithms. Refer to the ImageNet 2012 Classification Dataset for more details. This module also includes streaming versions of the dataset based on WebDataset.

Classes

StreamingImageNet1k

Implementation of the ImageNet1k dataset using StreamingDataset.

Hparams

These classes are used with yahp for YAML-based configuration.

Imagenet1kWebDatasetHparams

Defines an instance of the ImageNet-1k WebDataset for image classification.

ImagenetDatasetHparams

Defines an instance of the ImageNet dataset for image classification.

StreamingImageNet1kHparams

DatasetHparams for creating an instance of StreamingImageNet1k.

TinyImagenet200WebDatasetHparams

Defines an instance of the TinyImagenet-200 WebDataset for image classification.

class composer.datasets.imagenet.Imagenet1kWebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-imagenet1k', name='imagenet1k', resize_size=-1, crop_size=224)

Bases: composer.datasets.hparams.WebDatasetHparams

Defines an instance of the ImageNet-1k WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory. Default: '/tmp/webdataset_cache/'.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity. Default: False.

  • remote (str) – S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-imagenet1k'.

  • name (str) – Key used to determine where dataset is cached on local filesystem. Default: 'imagenet1k'.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.
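
The drop_last flag determines how many batches one pass over the data yields. The following is a minimal sketch of that arithmetic, assuming a plain fixed-size batching scheme. The helper name num_batches is hypothetical and not part of Composer.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches produced by one pass over `num_samples` samples.

    Hypothetical helper for illustration; not part of Composer.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // batch_size
    # The trailing partial batch is kept (the parameter docs say it is padded).
    return math.ceil(num_samples / batch_size)


print(num_batches(1000, 256, drop_last=True))   # 3
print(num_batches(1000, 256, drop_last=False))  # 4
```

With drop_last=True every batch has exactly batch_size samples, at the cost of skipping up to batch_size - 1 samples per pass.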

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
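
Because batch_size here is per device, a caller that starts from a global batch size has to divide by the world size first. The sketch below shows that conversion, assuming the global batch must split evenly across devices. The helper per_device_batch_size is hypothetical and not a Composer function.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across `world_size` devices.

    Hypothetical helper; assumes an uneven split should be an error.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"world size {world_size}"
        )
    return global_batch_size // world_size


# 2048 samples per step across 8 devices: each dataloader yields 256.
print(per_device_batch_size(2048, 8))  # 256
```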

class composer.datasets.imagenet.ImagenetDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, resize_size=-1, crop_size=224, use_ffcv=False, ffcv_dir='/tmp', ffcv_dest='imagenet_train.ffcv', ffcv_write_dataset=False)

Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Defines an instance of the ImageNet dataset for image classification.

Parameters
  • use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.

  • synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.

  • synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and avoid PCI-e transfers in the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.

  • synthetic_memory_format (MemoryFormat, optional) – The MemoryFormat to use. Ignored if use_synthetic is False. Default: MemoryFormat.CONTIGUOUS_FORMAT.

  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.

  • use_ffcv (bool) – Whether to use FFCV dataloaders. Default: False.

  • ffcv_dir (str) – A directory containing train/val <file>.ffcv files. If these files don't exist and ffcv_write_dataset is True, train/val <file>.ffcv files will be created in this directory. Default: "/tmp".

  • ffcv_dest (str) – The <file>.ffcv file that holds the dataset samples. Default: "imagenet_train.ffcv".

  • ffcv_write_dataset (bool) – Whether to create the dataset in FFCV format (<file>.ffcv) if it doesn't exist. Default: False.
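
The synthetic options describe a small pool of pre-allocated samples that is reused to fill every batch. The following is a rough sketch of that idea in plain Python, assuming samples are served by cycling through the pool. The class SyntheticPool and its fields are hypothetical and do not mirror Composer's implementation.

```python
import random
from typing import List, Tuple


class SyntheticPool:
    """Serve `total_samples` items from `num_unique_samples` pre-generated ones.

    Hypothetical stand-in for illustration only.
    """

    def __init__(self, total_samples: int, num_unique_samples: int = 100,
                 num_classes: int = 1000, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.total_samples = total_samples
        # Allocate the pool once; each entry is (fake pixel values, label).
        self.pool: List[Tuple[List[float], int]] = [
            ([rng.random() for _ in range(3)], rng.randrange(num_classes))
            for _ in range(num_unique_samples)
        ]

    def __len__(self) -> int:
        return self.total_samples

    def __getitem__(self, index: int) -> Tuple[List[float], int]:
        if not 0 <= index < self.total_samples:
            raise IndexError(index)
        # Indices past the pool size wrap around and reuse earlier samples.
        return self.pool[index % len(self.pool)]


data = SyntheticPool(total_samples=1000, num_unique_samples=100)
print(data[5] is data[105])  # True: both indices map to the same pool entry
```

Memory use depends on the pool size rather than the nominal dataset length, which is why a small synthetic_num_unique_samples is enough for throughput benchmarking.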

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.

class composer.datasets.imagenet.StreamingImageNet1k(remote, local, split, shuffle, resize_size=-1, crop_size=224, batch_size=None)

Bases: composer.datasets.streaming.dataset.StreamingImageClassDataset

Implementation of the ImageNet1k dataset using StreamingDataset.

Parameters
  • remote (str) – Remote directory (S3 or local filesystem) where dataset is stored.

  • local (str) – Local filesystem directory where dataset is cached during operation.

  • split (str) – The dataset split to use, either 'train' or 'val'.

  • shuffle (bool) – Whether to shuffle the samples in this dataset.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.

  • batch_size (Optional[int]) – A hint of the batch size that will be used on each device's DataLoader. Default: None.
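
The remote and local arguments describe a download-on-first-use cache. The sketch below shows that pattern using only local directories, which the description of remote permits. The function fetch_to_cache, the one-subdirectory-per-split layout, and the file name shard.bin are all assumptions made for illustration, and S3 access is deliberately left out.

```python
import shutil
import tempfile
from pathlib import Path


def fetch_to_cache(remote: str, local: str, split: str, filename: str) -> Path:
    """Return the cached path of `filename`, copying it from `remote` if needed.

    Hypothetical helper: `remote` is treated as a directory on the local
    filesystem with one subdirectory per split (an assumed layout).
    """
    cached = Path(local) / split / filename
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(Path(remote) / split / filename, cached)
    return cached


# Demo with temporary directories standing in for the remote store and cache.
with tempfile.TemporaryDirectory() as remote_dir, \
        tempfile.TemporaryDirectory() as local_dir:
    source = Path(remote_dir) / "train" / "shard.bin"
    source.parent.mkdir(parents=True)
    source.write_bytes(b"example")
    first = fetch_to_cache(remote_dir, local_dir, "train", "shard.bin")
    source.unlink()  # the second call must now be served from the cache
    second = fetch_to_cache(remote_dir, local_dir, "train", "shard.bin")
    print(first == second, second.read_bytes())
```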

class composer.datasets.imagenet.StreamingImageNet1kHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, remote='s3://mosaicml-internal-dataset-imagenet1k/mds/1/', local='/tmp/mds-cache/mds-imagenet1k/', split='train', resize_size=-1, crop_size=224)

Bases: composer.datasets.hparams.DatasetHparams

DatasetHparams for creating an instance of StreamingImageNet1k.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • remote (str) – Remote directory (S3 or local filesystem) where dataset is stored. Default: 's3://mosaicml-internal-dataset-imagenet1k/mds/1/'.

  • local (str) – Local filesystem directory where dataset is cached during operation. Default: '/tmp/mds-cache/mds-imagenet1k/'.

  • split (str) – The dataset split to use, either 'train' or 'val'. Default: 'train'.

  • resize_size (int, optional) – The resize size to use. Use -1 to not resize. Default: -1.

  • crop_size (int) – The crop size to use. Default: 224.
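
The resize and crop settings can be read as an optional resize followed by a crop. The sketch below computes the resulting geometry under the common recipe of scaling the shorter side and then center-cropping a square. That recipe is an assumption; this page does not state which transforms Composer applies, and both helper functions are hypothetical.

```python
from typing import Tuple


def resized_shape(width: int, height: int, resize_size: int) -> Tuple[int, int]:
    """Scale so the shorter side equals `resize_size`; -1 leaves the shape as is."""
    if resize_size == -1:
        return width, height
    if resize_size <= 0:
        raise ValueError("resize_size must be positive or -1")
    scale = resize_size / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width: int, height: int,
                    crop_size: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered square crop."""
    if crop_size > min(width, height):
        raise ValueError("crop_size exceeds the image")
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


w, h = resized_shape(500, 375, resize_size=256)
print((w, h), center_crop_box(w, h, crop_size=224))
```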

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.

class composer.datasets.imagenet.TinyImagenet200WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-tinyimagenet200', name='tinyimagenet200', n_train_samples=100000, n_val_samples=10000, height=64, width=64, n_classes=200, channel_means=(0.485, 0.456, 0.406), channel_stds=(0.229, 0.224, 0.225))

Bases: composer.datasets.hparams.WebDatasetHparams

Defines an instance of the TinyImagenet-200 WebDataset for image classification.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • webdataset_cache_dir (str) – WebDataset cache directory. Default: '/tmp/webdataset_cache/'.

  • webdataset_cache_verbose (bool) – WebDataset cache verbosity. Default: False.

  • remote (str) – S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-tinyimagenet200'.

  • name (str) – Key used to determine where dataset is cached on local filesystem. Default: 'tinyimagenet200'.

  • n_train_samples (int) – Number of training samples. Default: 100000.

  • n_val_samples (int) – Number of validation samples. Default: 10000.

  • height (int) – Sample image height in pixels. Default: 64.

  • width (int) – Sample image width in pixels. Default: 64.

  • n_classes (int) – Number of output classes. Default: 200.

  • channel_means (list of float) – Channel means for normalization. Default: (0.485, 0.456, 0.406).

  • channel_stds (list of float) – Channel stds for normalization. Default: (0.229, 0.224, 0.225).
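
The channel_means and channel_stds values are used for per-channel normalization, conventionally (value - mean) / std on each channel. The sketch below applies that formula to one RGB pixel, assuming pixel values have already been scaled to [0, 1]. The function normalize_pixel is hypothetical.

```python
from typing import Sequence, Tuple

# Defaults copied from the class signature above.
CHANNEL_MEANS = (0.485, 0.456, 0.406)
CHANNEL_STDS = (0.229, 0.224, 0.225)


def normalize_pixel(pixel: Sequence[float],
                    means: Sequence[float] = CHANNEL_MEANS,
                    stds: Sequence[float] = CHANNEL_STDS) -> Tuple[float, ...]:
    """Apply (value - mean) / std independently to each channel."""
    if not (len(pixel) == len(means) == len(stds)):
        raise ValueError("pixel, means, and stds must have the same length")
    return tuple((v - m) / s for v, m, s in zip(pixel, means, stds))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(CHANNEL_MEANS))  # (0.0, 0.0, 0.0)
```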

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.