composer.datasets.coco#

COCO (Common Objects in Context) dataset.

COCO is a large-scale object detection, segmentation, and captioning dataset. Please see the COCO dataset website for more details.

Classes

COCODetection

PyTorch Dataset for the COCO dataset.

StreamingCOCO

Implementation of the COCO dataset using StreamingDataset.

Hparams

These classes are used with yahp for YAML-based configuration.

COCODatasetHparams

Defines an instance of the COCO Dataset.

StreamingCOCOHparams

DatasetHparams for creating an instance of StreamingCOCO.

class composer.datasets.coco.COCODatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None)[source]#

Bases: composer.datasets.hparams.DatasetHparams

Defines an instance of the COCO Dataset.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

initialize_object(batch_size, dataloader_hparams)[source]#

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
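Because the batch_size passed to initialize_object is device-specific, a global batch size must be divided by the world size before it reaches this method. A minimal sketch of that arithmetic (the helper name is illustrative, not part of Composer's API; Composer performs this division internally):

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across devices (illustrative helper)."""
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"world_size={world_size}")
    return global_batch_size // world_size


# e.g. a global batch of 1024 images across 8 devices -> 128 images per device
print(per_device_batch_size(1024, 8))  # 128
```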

class composer.datasets.coco.COCODetection(img_folder, annotate_file, transform=None)[source]#

Bases: torch.utils.data.dataset.Dataset

PyTorch Dataset for the COCO dataset.

Parameters
  • img_folder (str) – The path to the COCO image folder.

  • annotate_file (str) – The path to the annotation file, which contains image IDs and annotations (e.g., bounding boxes and object classes).

  • transform (Module) – The transformations to apply to the images.
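The annotation file follows the standard COCO JSON layout: top-level images, annotations, and categories lists, where each annotation carries an image_id, a category_id, and a bbox in [x, y, width, height] form. A minimal sketch of consuming such a file with only the standard library (the sample data is made up; a real dataset would use json.load on annotate_file):

```python
import json

# A tiny, made-up annotation file in the standard COCO layout.
coco_json = {
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 18,
                     "bbox": [73.0, 41.0, 210.0, 300.0]}],  # [x, y, w, h]
    "categories": [{"id": 18, "name": "dog"}],
}

# Stands in for: data = json.load(open(annotate_file))
data = json.loads(json.dumps(coco_json))

# Group boxes and labels by image id, as a detection dataset would.
names = {c["id"]: c["name"] for c in data["categories"]}
by_image = {}
for ann in data["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(
        (ann["bbox"], names[ann["category_id"]]))

print(by_image[1])  # [([73.0, 41.0, 210.0, 300.0], 'dog')]
```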

class composer.datasets.coco.StreamingCOCO(remote, local, split, shuffle, batch_size=None)[source]#

Bases: composer.datasets.streaming.dataset.StreamingDataset

Implementation of the COCO dataset using StreamingDataset.

Parameters
  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored.

  • local (str) – Local filesystem directory where the dataset is cached during operation.

  • split (str) – The dataset split to use, either 'train' or 'val'.

  • shuffle (bool) – Whether to shuffle the samples in this dataset.

  • batch_size (Optional[int]) – A hint of the batch size that will be used on each device's DataLoader. Default: None.
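The remote/local pair implements a download-and-cache pattern: data is fetched from remote on first access and reused from local thereafter. A minimal sketch of the idea (the flat `<root>/<split>/<shard>` layout and the copy step are assumptions for illustration; the real shard handling lives inside StreamingDataset, and an S3 remote would use a download client rather than a file copy):

```python
import os
import shutil


def cached_shard_path(remote: str, local: str, split: str, shard: str) -> str:
    """Return the local path for a shard, fetching it from ``remote`` on a miss.

    Illustrative only: assumes a local-filesystem ``remote`` and a flat
    ``<root>/<split>/<shard>`` layout.
    """
    local_path = os.path.join(local, split, shard)
    if not os.path.exists(local_path):  # cache miss: fetch and store locally
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        shutil.copyfile(os.path.join(remote, split, shard), local_path)
    return local_path
```

Subsequent calls for the same shard hit the local cache and skip the fetch entirely.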

class composer.datasets.coco.StreamingCOCOHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, remote='s3://mosaicml-internal-dataset-coco/mds/1/', local='/tmp/mds-cache/mds-coco/', split='train')[source]#

Bases: composer.datasets.hparams.DatasetHparams

DatasetHparams for creating an instance of StreamingCOCO.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored. Default: 's3://mosaicml-internal-dataset-coco/mds/1/'.

  • local (str) – Local filesystem directory where the dataset is cached during operation. Default: '/tmp/mds-cache/mds-coco/'.

  • split (str) – The dataset split to use, either 'train' or 'val'. Default: 'train'.

initialize_object(batch_size, dataloader_hparams)[source]#

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.