composer.datasets.coco#

COCO (Common Objects in Context) dataset.

COCO is a large-scale object detection, segmentation, and captioning dataset. Please see the COCO dataset website for more details.

Classes

COCODetection

PyTorch Dataset for the COCO dataset.

StreamingCOCO

Implementation of the COCO dataset using StreamingDataset.

Hparams

These classes are used with yahp for YAML-based configuration.

COCODatasetHparams

Defines an instance of the COCO Dataset.

StreamingCOCOHparams

DatasetHparams for creating an instance of StreamingCOCO.

class composer.datasets.coco.COCODatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None)[source]#

Bases: composer.datasets.hparams.DatasetHparams

Defines an instance of the COCO Dataset.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

initialize_object(batch_size, dataloader_hparams)[source]#

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
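Because the batch_size passed to initialize_object is device-specific, a global batch size must be divided by the world size before it reaches this method. A minimal sketch of that arithmetic (the helper name is illustrative, not part of Composer's API; Composer performs this division internally):

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across devices (illustrative helper)."""
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"world_size={world_size}")
    return global_batch_size // world_size


# e.g. a global batch of 1024 images across 8 devices -> 128 images per device
print(per_device_batch_size(1024, 8))  # 128
```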

class composer.datasets.coco.COCODetection(img_folder, annotate_file, transform=None)[source]#

Bases: torch.utils.data.dataset.Dataset

PyTorch Dataset for the COCO dataset.

Parameters
  • img_folder (str) – The path to the COCO image folder.

  • annotate_file (str) – The path to the annotation file, which contains image IDs and annotations (e.g., bounding boxes and object classes).

  • transform (Module) – The transformations to apply to the images.
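The annotation file follows the standard COCO JSON layout: top-level images, annotations, and categories lists, where each annotation carries an image_id, a category_id, and a bbox in [x, y, width, height] form. A minimal sketch of consuming such a file with only the standard library (the sample data is made up; a real dataset would use json.load on annotate_file):

```python
import json

# A tiny, made-up annotation file in the standard COCO layout.
coco_json = {
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 18,
                     "bbox": [73.0, 41.0, 210.0, 300.0]}],  # [x, y, w, h]
    "categories": [{"id": 18, "name": "dog"}],
}

# Stands in for: data = json.load(open(annotate_file))
data = json.loads(json.dumps(coco_json))

# Group boxes and labels by image id, as a detection dataset would.
names = {c["id"]: c["name"] for c in data["categories"]}
by_image = {}
for ann in data["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(
        (ann["bbox"], names[ann["category_id"]]))

print(by_image[1])  # [([73.0, 41.0, 210.0, 300.0], 'dog')]
```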

class composer.datasets.coco.StreamingCOCO(remote, local, split, shuffle, batch_size=None)[source]#

Bases: composer.datasets.streaming.dataset.StreamingDataset

Implementation of the COCO dataset using StreamingDataset.

Parameters
  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored.

  • local (str) – Local filesystem directory where the dataset is cached during operation.

  • split (str) – The dataset split to use, either 'train' or 'val'.

  • shuffle (bool) – Whether to shuffle the samples in this dataset.

  • batch_size (Optional[int]) – A hint of the batch size that will be used on each device's DataLoader. Default: None.
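The remote/local pair implements a download-and-cache pattern: data is fetched from remote on first access and reused from local thereafter. A minimal sketch of the idea (the flat `<root>/<split>/<shard>` layout and the copy step are assumptions for illustration; the real shard handling lives inside StreamingDataset, and an S3 remote would use a download client rather than a file copy):

```python
import os
import shutil


def cached_shard_path(remote: str, local: str, split: str, shard: str) -> str:
    """Return the local path for a shard, fetching it from ``remote`` on a miss.

    Illustrative only: assumes a local-filesystem ``remote`` and a flat
    ``<root>/<split>/<shard>`` layout.
    """
    local_path = os.path.join(local, split, shard)
    if not os.path.exists(local_path):  # cache miss: fetch and store locally
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        shutil.copyfile(os.path.join(remote, split, shard), local_path)
    return local_path
```

Subsequent calls for the same shard hit the local cache and skip the fetch entirely.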

class composer.datasets.coco.StreamingCOCOHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, remote='s3://mosaicml-internal-dataset-coco/mds/1/', local='/tmp/mds-cache/mds-coco/', split='train')[source]#

Bases: composer.datasets.hparams.DatasetHparams

DatasetHparams for creating an instance of StreamingCOCO.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored. Default: 's3://mosaicml-internal-dataset-coco/mds/1/'.

  • local (str) – Local filesystem directory where the dataset is cached during operation. Default: '/tmp/mds-cache/mds-coco/'.

  • split (str) – The dataset split to use, either 'train' or 'val'. Default: 'train'.

initialize_object(batch_size, dataloader_hparams)[source]#

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.