composer.datasets.c4

C4 (Colossal Cleaned CommonCrawl) dataset.

This dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset.

Classes

C4Dataset

Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned CommonCrawl) dataset.

Hparams

These classes are used with yahp for YAML-based configuration.

C4DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned CommonCrawl) dataset.

class composer.datasets.c4.C4Dataset(split, num_samples, tokenizer_name, max_seq_len, group_method, shuffle=False, shuffle_buffer_size=10000, seed=5)[source]

Bases: torch.utils.data.dataset.IterableDataset

Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned CommonCrawl) dataset. Used for pretraining autoregressive or masked language models. Text samples are streamed directly from the cloud using HuggingFace's C4 Dataset with streaming backend (see https://huggingface.co/datasets/c4 for more details). The text samples are then shuffled, tokenized, and grouped on-the-fly.

Parameters
  • split (str) – What split of the dataset to use. Either 'train' or 'validation'.

  • num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with.

  • max_seq_len (int) – The max sequence length of each token sample.

  • group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'.

  • shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed with deterministic per-device shard order, but shuffling affects the order of samples via (per-device) shuffle buffers. Default: False.

  • shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per-device) and randomly sampled from there to produce shuffled samples. Default: 10000.

  • seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
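The shuffle, shuffle_buffer_size, and seed parameters describe buffer-based shuffling. As an illustrative sketch (not Composer's implementation), a per-device shuffle buffer can be modeled as:

```python
import random

def buffered_shuffle(samples, buffer_size, seed):
    """Emit samples in a pseudo-random order using a fixed-size buffer.

    Each incoming sample enters the buffer; once the buffer is full, a
    random element is evicted and emitted. The remainder is drained in
    random order at the end.
    """
    rng = random.Random(seed)
    buf, out = [], []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= buffer_size:
            out.append(buf.pop(rng.randrange(len(buf))))
    rng.shuffle(buf)  # drain whatever is left
    out.extend(buf)
    return out

shuffled = buffered_shuffle(list(range(10)), buffer_size=4, seed=5)
print(shuffled)
```

Only the order of samples changes; a larger buffer approximates a full shuffle more closely, at the cost of memory.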

Returns

IterableDataset – A torch.utils.data.IterableDataset object.
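The group_method parameter controls how tokenized texts become fixed-length token samples. A minimal sketch of the two strategies (assumed behavior, not Composer's actual code; the token ids and max_seq_len are made up for illustration):

```python
def group_truncate(token_lists, max_seq_len):
    # One token sample per text sample, truncated to max_seq_len.
    return [tokens[:max_seq_len] for tokens in token_lists]

def group_concat(token_lists, max_seq_len):
    # Concatenate all tokens into one stream, then slice it into
    # full-length samples; a trailing partial sample is dropped.
    stream = [t for tokens in token_lists for t in tokens]
    return [stream[i:i + max_seq_len]
            for i in range(0, len(stream) - max_seq_len + 1, max_seq_len)]

texts = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
print(group_truncate(texts, 4))  # [[1, 2, 3, 4], [6, 7], [8, 9, 10]]
print(group_concat(texts, 4))    # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Under these assumptions, 'concat' wastes no tokens inside full windows but can mix documents within one sample, while 'truncate' preserves document boundaries but discards tokens beyond max_seq_len.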

class composer.datasets.c4.C4DatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, split=None, num_samples=None, tokenizer_name=None, max_seq_len=None, group_method=None, mlm=False, mlm_probability=0.15, shuffle_buffer_size=10000, seed=5)[source]

Bases: composer.datasets.hparams.DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned CommonCrawl) dataset.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed with deterministic per-device shard order, but shuffling affects the order of samples via (per-device) shuffle buffers. Default: True.

  • split (str) – What split of the dataset to use. Either 'train' or 'validation'. Default: None.

  • num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset. Default: None.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. Default: None.

  • max_seq_len (int) – The max sequence length of each token sample. Default: None.

  • group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'. Default: None.

  • mlm (bool) – Whether or not to use masked language modeling. Default: False.

  • mlm_probability (float) – If mlm=True, the probability that tokens are masked. Default: 0.15.

  • shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per-device) and randomly sampled from there to produce shuffled samples. Default: 10000.

  • seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
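The mlm and mlm_probability fields select masked language modeling. A toy sketch of the token-selection step (real collators, such as HuggingFace's DataCollatorForLanguageModeling, additionally apply 80/10/10 mask/random/keep replacement; MASK_ID here is a made-up value):

```python
import random

MASK_ID = 103  # hypothetical mask token id

def mask_tokens(token_ids, mlm_probability=0.15, seed=5):
    # Independently replace each token with MASK_ID with
    # probability mlm_probability.
    rng = random.Random(seed)
    return [MASK_ID if rng.random() < mlm_probability else t
            for t in token_ids]

masked = mask_tokens([1] * 20, mlm_probability=0.15, seed=5)
print(masked)
```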

Returns

DataSpec – A DataSpec object.
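Since C4DatasetHparams is registered with yahp, the dataset can be configured from YAML. A sketch of what such a configuration might look like (the exact top-level key depends on the surrounding trainer config; the values are illustrative, not recommendations):

```yaml
train_dataset:
  c4:
    split: train
    num_samples: 1000000
    tokenizer_name: gpt2
    max_seq_len: 512
    group_method: concat
    mlm: false
    shuffle: true
    shuffle_buffer_size: 10000
    seed: 5
```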

initialize_object(batch_size, dataloader_hparams)[source]

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.

validate()[source]

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises

TypeError – If any fields are of an incorrect type.
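As an illustration of the kind of check validate() performs (a toy sketch, not Composer's implementation), field values can be compared against their type annotations:

```python
import dataclasses

@dataclasses.dataclass
class ToyHparams:
    # Stand-in for a DatasetHparams subclass with typed fields.
    split: str
    max_seq_len: int

    def validate(self):
        # Raise TypeError if any field value does not match its annotation.
        for field in dataclasses.fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f'{field.name} must be {field.type.__name__}, '
                    f'got {type(value).__name__}')

ToyHparams(split='train', max_seq_len=512).validate()  # passes silently
```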