composer.datasets.c4

C4 (Colossal Cleaned CommonCrawl) dataset.

This dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset.

Classes

C4Dataset

Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned CommonCrawl) dataset.

Hparams

These classes are used with yahp for YAML-based configuration.

C4DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned CommonCrawl) dataset.

class composer.datasets.c4.C4Dataset(split, num_samples, tokenizer_name, max_seq_len, group_method, shuffle=False, shuffle_buffer_size=10000, seed=5)[source]

Bases: torch.utils.data.dataset.IterableDataset

Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned CommonCrawl) dataset. Used for pretraining autoregressive or masked language models. Text samples are streamed directly from the cloud using HuggingFace's C4 Dataset with streaming backend (see https://huggingface.co/datasets/c4 for more details). The text samples are then shuffled, tokenized, and grouped on-the-fly.

Parameters
  • split (str) – What split of the dataset to use. Either 'train' or 'validation'.

  • num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with.

  • max_seq_len (int) – The max sequence length of each token sample.

  • group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'.

  • shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed with deterministic per-device shard order, but shuffling affects the order of samples via (per-device) shuffle buffers. Default: False.

  • shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per-device) and randomly sampled from there to produce shuffled samples. Default: 10000.

  • seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
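The shuffle, shuffle_buffer_size, and seed parameters describe buffer-based shuffling. As an illustrative sketch (not Composer's implementation), a per-device shuffle buffer can be modeled as:

```python
import random

def buffered_shuffle(samples, buffer_size, seed):
    """Emit samples in a pseudo-random order using a fixed-size buffer.

    Each incoming sample enters the buffer; once the buffer is full, a
    random element is evicted and emitted. The remainder is drained in
    random order at the end.
    """
    rng = random.Random(seed)
    buf, out = [], []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= buffer_size:
            out.append(buf.pop(rng.randrange(len(buf))))
    rng.shuffle(buf)  # drain whatever is left
    out.extend(buf)
    return out

shuffled = buffered_shuffle(list(range(10)), buffer_size=4, seed=5)
print(shuffled)
```

Only the order of samples changes; a larger buffer approximates a full shuffle more closely, at the cost of memory.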

Returns

IterableDataset – A torch.utils.data.IterableDataset object.
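The group_method parameter controls how tokenized texts become fixed-length token samples. A minimal sketch of the two strategies (assumed behavior, not Composer's actual code; the token ids and max_seq_len are made up for illustration):

```python
def group_truncate(token_lists, max_seq_len):
    # One token sample per text sample, truncated to max_seq_len.
    return [tokens[:max_seq_len] for tokens in token_lists]

def group_concat(token_lists, max_seq_len):
    # Concatenate all tokens into one stream, then slice it into
    # full-length samples; a trailing partial sample is dropped.
    stream = [t for tokens in token_lists for t in tokens]
    return [stream[i:i + max_seq_len]
            for i in range(0, len(stream) - max_seq_len + 1, max_seq_len)]

texts = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
print(group_truncate(texts, 4))  # [[1, 2, 3, 4], [6, 7], [8, 9, 10]]
print(group_concat(texts, 4))    # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Under these assumptions, 'concat' wastes no tokens inside full windows but can mix documents within one sample, while 'truncate' preserves document boundaries but discards tokens beyond max_seq_len.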

class composer.datasets.c4.C4DatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, split=None, num_samples=None, tokenizer_name=None, max_seq_len=None, group_method=None, mlm=False, mlm_probability=0.15, shuffle_buffer_size=10000, seed=5)[source]

Bases: composer.datasets.hparams.DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned CommonCrawl) dataset.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed with deterministic per-device shard order, but shuffling affects the order of samples via (per-device) shuffle buffers. Default: True.

  • split (str) – What split of the dataset to use. Either 'train' or 'validation'. Default: None.

  • num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset. Default: None.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. Default: None.

  • max_seq_len (int) – The max sequence length of each token sample. Default: None.

  • group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'. Default: None.

  • mlm (bool) – Whether or not to use masked language modeling. Default: False.

  • mlm_probability (float) – If mlm=True, the probability that tokens are masked. Default: 0.15.

  • shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per-device) and randomly sampled from there to produce shuffled samples. Default: 10000.

  • seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
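The mlm and mlm_probability fields select masked language modeling. A toy sketch of the token-selection step (real collators, such as HuggingFace's DataCollatorForLanguageModeling, additionally apply 80/10/10 mask/random/keep replacement; MASK_ID here is a made-up value):

```python
import random

MASK_ID = 103  # hypothetical mask token id

def mask_tokens(token_ids, mlm_probability=0.15, seed=5):
    # Independently replace each token with MASK_ID with
    # probability mlm_probability.
    rng = random.Random(seed)
    return [MASK_ID if rng.random() < mlm_probability else t
            for t in token_ids]

masked = mask_tokens([1] * 20, mlm_probability=0.15, seed=5)
print(masked)
```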

Returns

DataSpec – A DataSpec object.
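Since C4DatasetHparams is registered with yahp, the dataset can be configured from YAML. A sketch of what such a configuration might look like (the exact top-level key depends on the surrounding trainer config; the values are illustrative, not recommendations):

```yaml
train_dataset:
  c4:
    split: train
    num_samples: 1000000
    tokenizer_name: gpt2
    max_seq_len: 512
    group_method: concat
    mlm: false
    shuffle: true
    shuffle_buffer_size: 10000
    seed: 5
```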

initialize_object(batch_size, dataloader_hparams)[source]

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.

validate()[source]

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises

TypeError – If any fields are of an incorrect type.
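As an illustration of the kind of check validate() performs (a toy sketch, not Composer's implementation), field values can be compared against their type annotations:

```python
import dataclasses

@dataclasses.dataclass
class ToyHparams:
    # Stand-in for a DatasetHparams subclass with typed fields.
    split: str
    max_seq_len: int

    def validate(self):
        # Raise TypeError if any field value does not match its annotation.
        for field in dataclasses.fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f'{field.name} must be {field.type.__name__}, '
                    f'got {type(value).__name__}')

ToyHparams(split='train', max_seq_len=512).validate()  # passes silently
```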