composer.datasets.cifar
CIFAR image classification dataset.
The CIFAR datasets are a collection of labeled 32x32 colour images. Please refer to the CIFAR dataset for more details.
Hparams
These classes are used with yahp for YAML-based configuration.
CIFAR100WebDatasetHparams | Defines an instance of the CIFAR-100 WebDataset for image classification.
CIFAR10DatasetHparams | Defines an instance of the CIFAR-10 dataset for image classification from a local disk.
CIFAR10WebDatasetHparams | Defines an instance of the CIFAR-10 WebDataset for image classification.
CIFAR20WebDatasetHparams | Defines an instance of the CIFAR-20 WebDataset for image classification.
CIFARWebDatasetHparams | Common functionality for CIFAR WebDatasets.
- class composer.datasets.cifar.CIFAR100WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar100', name='cifar100', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=100, channel_means=(0.5071, 0.4867, 0.4408), channel_stds=(0.2675, 0.2565, 0.2761))
Bases: composer.datasets.cifar.CIFARWebDatasetHparams
Defines an instance of the CIFAR-100 WebDataset for image classification.
- Parameters
datadir (str) - The path to the data directory.
is_train (bool) - Whether to load the training data or validation data. Default: True.
drop_last (bool) - If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) - Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str) - WebDataset cache directory.
webdataset_cache_verbose (bool) - WebDataset cache verbosity.
remote (str) - S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar100'.
name (str) - Key used to determine where dataset is cached on local filesystem. Default: 'cifar100'.
n_train_samples (int) - Number of training samples. Default: 50000.
n_val_samples (int) - Number of validation samples. Default: 10000.
height (int) - Sample image height in pixels. Default: 32.
width (int) - Sample image width in pixels. Default: 32.
n_classes (int) - Number of output classes. Default: 100.
channel_means (list of float) - Channel means for normalization. Default: (0.5071, 0.4867, 0.4408).
channel_stds (list of float) - Channel stds for normalization. Default: (0.2675, 0.2565, 0.2761).
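To make the role of channel_means and channel_stds concrete, here is a minimal sketch of per-channel normalization. It assumes the common convention (value - mean) / std applied to pixel values already scaled to [0, 1]. The helper normalize_pixel is hypothetical and is not part of Composer.

```python
from typing import Sequence, Tuple

# Defaults listed for CIFAR100WebDatasetHparams on this page.
CHANNEL_MEANS = (0.5071, 0.4867, 0.4408)
CHANNEL_STDS = (0.2675, 0.2565, 0.2761)


def normalize_pixel(rgb: Sequence[float],
                    means: Sequence[float] = CHANNEL_MEANS,
                    stds: Sequence[float] = CHANNEL_STDS) -> Tuple[float, ...]:
    """Normalize one RGB pixel (hypothetical helper, not a Composer API).

    Assumes each channel value is already scaled to the range [0, 1].
    """
    return tuple((value - mean) / std
                 for value, mean, std in zip(rgb, means, stds))


# A pixel equal to the channel means maps to zero in every channel.
mean_pixel = normalize_pixel(CHANNEL_MEANS)
# A pure white pixel ends up a few standard deviations above zero.
white_pixel = normalize_pixel((1.0, 1.0, 1.0))
```

With this convention, the normalized training set has roughly zero mean and unit variance per channel, which is why the defaults differ between CIFAR-10 and CIFAR-100.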
- class composer.datasets.cifar.CIFAR10DatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, download=True, use_ffcv=False, ffcv_dir='/tmp', ffcv_dest='cifar_train.ffcv', ffcv_write_dataset=False)
Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin
Defines an instance of the CIFAR-10 dataset for image classification from a local disk.
- Parameters
use_synthetic (bool, optional) - Whether to use synthetic data. Default: False.
synthetic_num_unique_samples (int, optional) - The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
synthetic_device (str, optional) - The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and avoid PCI-e transfers in the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
synthetic_memory_format - The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.
datadir (str) - The path to the data directory.
is_train (bool) - Whether to load the training data or validation data. Default: True.
drop_last (bool) - If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) - Whether to shuffle the dataset. Default: True.
download (bool) - Whether to download the dataset, if needed. Default: True.
use_ffcv (bool) - Whether to use FFCV dataloaders. Default: False.
ffcv_dir (str) - A directory containing train/val <file>.ffcv files. If these files don't exist and ffcv_write_dataset is True, train/val <file>.ffcv files will be created in this directory. Default: "/tmp".
ffcv_dest (str) - <file>.ffcv file that has dataset samples. Default: "cifar_train.ffcv".
ffcv_write_dataset (bool) - Whether to create the dataset in FFCV format (<file>.ffcv) if it doesn't exist. Default: False.
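The interaction between ffcv_dir, ffcv_dest, and ffcv_write_dataset described above can be sketched as a small decision function. This is an illustration under stated assumptions, not Composer's implementation: the helper plan_ffcv_file and its three status strings are hypothetical, and what the library does when the file is missing and ffcv_write_dataset is False is not stated on this page, so the sketch only reports that case.

```python
import os
from typing import Tuple


def plan_ffcv_file(ffcv_dir: str = "/tmp",
                   ffcv_dest: str = "cifar_train.ffcv",
                   ffcv_write_dataset: bool = False) -> Tuple[str, str]:
    """Resolve the .ffcv path and report what would need to happen.

    Hypothetical helper, not a Composer API. Returns (path, status),
    where status is one of "use_existing", "write", or "missing".
    """
    path = os.path.join(ffcv_dir, ffcv_dest)
    if os.path.exists(path):
        # The file is already there, so nothing needs to be written.
        return path, "use_existing"
    if ffcv_write_dataset:
        # Per the parameter description, the file would be created here.
        return path, "write"
    # Assumption: the page does not say what happens in this case.
    return path, "missing"
```

With the documented defaults, the resolved path is /tmp/cifar_train.ffcv on a POSIX system.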
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int) - The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams) - The dataset-independent hparams for the dataloader.
- Returns
Iterable | DataSpec - An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
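The note that batch_size is device-specific can be illustrated with a little arithmetic. The sketch below assumes the global batch size is split evenly across all processes; the helper per_device_batch_size is hypothetical, and how Composer derives the value it passes in is not shown on this page.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch evenly across devices (hypothetical helper)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


# A global batch of 1024 samples on 8 devices gives 128 samples per device.
device_batch = per_device_batch_size(1024, 8)
```

Under this assumption, the value passed as batch_size would be the per-device figure (128 here), not the global one.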
- class composer.datasets.cifar.CIFAR10WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar10', name='cifar10', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=10, channel_means=(0.4914, 0.4822, 0.4465), channel_stds=(0.247, 0.243, 0.261))
Bases: composer.datasets.cifar.CIFARWebDatasetHparams
Defines an instance of the CIFAR-10 WebDataset for image classification.
- Parameters
datadir (str) - The path to the data directory.
is_train (bool) - Whether to load the training data or validation data. Default: True.
drop_last (bool) - If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) - Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str) - WebDataset cache directory.
webdataset_cache_verbose (bool) - WebDataset cache verbosity.
remote (str) - S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar10'.
name (str) - Key used to determine where dataset is cached on local filesystem. Default: 'cifar10'.
n_train_samples (int) - Number of training samples. Default: 50000.
n_val_samples (int) - Number of validation samples. Default: 10000.
height (int) - Sample image height in pixels. Default: 32.
width (int) - Sample image width in pixels. Default: 32.
n_classes (int) - Number of output classes. Default: 10.
channel_means (list of float) - Channel means for normalization. Default: (0.4914, 0.4822, 0.4465).
channel_stds (list of float) - Channel stds for normalization. Default: (0.247, 0.243, 0.261).
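The effect of drop_last on an epoch can be worked out from n_train_samples and the batch size. The sketch below is plain arithmetic under the description given above (drop the incomplete final batch, or keep it and pad it). The function batches_per_epoch and the batch size of 128 are illustrative choices, not values taken from Composer.

```python
import math


def batches_per_epoch(n_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches one pass over the data yields (hypothetical helper)."""
    if drop_last:
        # The incomplete final batch is discarded.
        return n_samples // batch_size
    # The incomplete final batch is kept (and padded, per the description).
    return math.ceil(n_samples / batch_size)


n_train_samples = 50000  # documented default for the CIFAR-10 WebDataset
batch_size = 128         # illustrative value

with_drop = batches_per_epoch(n_train_samples, batch_size, drop_last=True)
without_drop = batches_per_epoch(n_train_samples, batch_size, drop_last=False)
leftover = n_train_samples % batch_size

print(with_drop, without_drop, leftover)  # 390 391 80
```

With drop_last=True, the 80 leftover samples are skipped in that epoch. With drop_last=False, they form a 391st batch that is padded up to 128 entries.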
- class composer.datasets.cifar.CIFAR20WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-cifar20', name='cifar20', n_train_samples=50000, n_val_samples=10000, height=32, width=32, n_classes=20, channel_means=(0.5071, 0.4867, 0.4408), channel_stds=(0.2675, 0.2565, 0.2761))
Bases: composer.datasets.cifar.CIFARWebDatasetHparams
Defines an instance of the CIFAR-20 WebDataset for image classification.
- Parameters
datadir (str) - The path to the data directory.
is_train (bool) - Whether to load the training data or validation data. Default: True.
drop_last (bool) - If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) - Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str) - WebDataset cache directory.
webdataset_cache_verbose (bool) - WebDataset cache verbosity.
remote (str) - S3 bucket or root directory where dataset is stored. Default: 's3://mosaicml-internal-dataset-cifar20'.
name (str) - Key used to determine where dataset is cached on local filesystem. Default: 'cifar20'.
n_train_samples (int) - Number of training samples. Default: 50000.
n_val_samples (int) - Number of validation samples. Default: 10000.
height (int) - Sample image height in pixels. Default: 32.
width (int) - Sample image width in pixels. Default: 32.
n_classes (int) - Number of output classes. Default: 20.
channel_means (list of float) - Channel means for normalization. Default: (0.5071, 0.4867, 0.4408).
channel_stds (list of float) - Channel stds for normalization. Default: (0.2675, 0.2565, 0.2761).
- class composer.datasets.cifar.CIFARWebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='', name='', n_train_samples=0, n_val_samples=0, height=32, width=32, n_classes=0, channel_means=(0, 0, 0), channel_stds=(0, 0, 0))
Bases: composer.datasets.hparams.WebDatasetHparams
Common functionality for CIFAR WebDatasets.
- Parameters
datadir (str) - The path to the data directory.
is_train (bool) - Whether to load the training data or validation data. Default: True.
drop_last (bool) - If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) - Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str) - WebDataset cache directory.
webdataset_cache_verbose (bool) - WebDataset cache verbosity.
remote (str) - S3 bucket or root directory where dataset is stored.
name (str) - Key used to determine where dataset is cached on local filesystem.
n_train_samples (int) - Number of training samples.
n_val_samples (int) - Number of validation samples.
height (int) - Sample image height in pixels. Default: 32.
width (int) - Sample image width in pixels. Default: 32.
n_classes (int) - Number of output classes.
channel_means (list of float) - Channel means for normalization.
channel_stds (list of float) - Channel stds for normalization.
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int) - The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams) - The dataset-independent hparams for the dataloader.
- Returns
Iterable | DataSpec - An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
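Judging by the signatures on this page, the dataset-specific WebDataset classes appear to differ from CIFARWebDatasetHparams only in their default values. The sketch below mirrors that pattern with plain dataclasses. The class names WebDatasetDefaultsSketch and CIFAR10DefaultsSketch and the helper overridden_fields are hypothetical stand-ins, and only a subset of the documented fields is shown. It does not reproduce how Composer or yahp define these classes.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple, Type


@dataclass
class WebDatasetDefaultsSketch:
    """Stand-in for the shared base: placeholder defaults only."""
    remote: str = ''
    name: str = ''
    n_train_samples: int = 0
    n_val_samples: int = 0
    height: int = 32
    width: int = 32
    n_classes: int = 0
    channel_means: Tuple[float, float, float] = (0, 0, 0)
    channel_stds: Tuple[float, float, float] = (0, 0, 0)


@dataclass
class CIFAR10DefaultsSketch(WebDatasetDefaultsSketch):
    """Stand-in for a concrete dataset: overrides defaults, adds no logic."""
    remote: str = 's3://mosaicml-internal-dataset-cifar10'
    name: str = 'cifar10'
    n_train_samples: int = 50000
    n_val_samples: int = 10000
    n_classes: int = 10
    channel_means: Tuple[float, float, float] = (0.4914, 0.4822, 0.4465)
    channel_stds: Tuple[float, float, float] = (0.247, 0.243, 0.261)


def overridden_fields(sub: Type, base: Type) -> List[str]:
    """List the field names whose defaults differ between sub and base."""
    base_defaults = {f.name: f.default for f in fields(base)}
    return [f.name for f in fields(sub)
            if f.default != base_defaults[f.name]]


changed = overridden_fields(CIFAR10DefaultsSketch, WebDatasetDefaultsSketch)
```

In this sketch, height and width are inherited unchanged, while every other field shown is overridden. That matches the documented signatures, where all CIFAR variants use 32x32 images.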