c4
C4 (Colossal Cleaned Common Crawl) dataset.
This dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset.
Classes

- C4Dataset – Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned Common Crawl) dataset.
- StreamingC4 – Implementation of the C4 (Colossal Cleaned Common Crawl) dataset using StreamingDataset.
- class composer.datasets.c4.C4Dataset(split, num_samples, tokenizer_name, max_seq_len, group_method, shuffle=False, shuffle_buffer_size=10000, seed=5)[source]
Bases: torch.utils.data.dataset.IterableDataset

Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Cleaned Common Crawl) dataset. Used for pretraining autoregressive or masked language models. Text samples are streamed directly from the cloud using HuggingFace's C4 Dataset with streaming backend (see https://huggingface.co/datasets/c4 for more details). The text samples are then shuffled, tokenized, and grouped on-the-fly.

- Parameters
  - split (str) – Which split of the dataset to use. Either 'train' or 'validation'.
  - num_samples (int) – The number of post-processed token samples; used to set the epoch size of the torch.utils.data.IterableDataset.
  - tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with.
  - max_seq_len (int) – The max sequence length of each token sample.
  - group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'.
  - shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed in a deterministic per-device order, but shuffling changes the order of samples via (per-device) shuffle buffers. Default: False.
  - shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per device) and randomly sampled from there to produce shuffled samples. Default: 10000.
  - seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
- Returns
  IterableDataset – A torch.utils.data.IterableDataset object.
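For orientation, below is a minimal sketch of how C4Dataset might be constructed and wrapped in a DataLoader. The tokenizer name, sample count, and batch size are illustrative placeholders, not values prescribed by this API.

```python
from torch.utils.data import DataLoader

from composer.datasets.c4 import C4Dataset

# Illustrative values only: tokenizer, sizes, and batch size are placeholders.
train_dataset = C4Dataset(
    split='train',               # or 'validation'
    num_samples=512,             # post-processed token samples per epoch
    tokenizer_name='gpt2',       # any HuggingFace tokenizer name
    max_seq_len=1024,            # tokens per sample
    group_method='concat',       # or 'truncate'
    shuffle=True,                # shuffle via the per-device buffer
    shuffle_buffer_size=10000,
    seed=5,
)

# IterableDatasets manage their own iteration order, so no sampler or
# shuffle flag is passed to the DataLoader.
train_loader = DataLoader(train_dataset, batch_size=8)
```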
- class composer.datasets.c4.StreamingC4(remote, local, split, shuffle, tokenizer_name, max_seq_len, group_method='truncate', max_retries=2, timeout=120, batch_size=None)[source]
Bases: composer.datasets.streaming.dataset.StreamingDataset

Implementation of the C4 (Colossal Cleaned Common Crawl) dataset using StreamingDataset.
- Parameters
  - remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored.
  - local (str) – Local filesystem directory where the dataset is cached during operation.
  - split (str) – The dataset split to use, either 'train' or 'val'.
  - shuffle (bool) – Whether to shuffle the samples in this dataset.
  - tokenizer_name (str) – The name of the HuggingFace tokenizer used to tokenize samples.
  - max_seq_len (int) – The max sequence length of each token sample.
  - group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.
  - max_retries (int) – Number of download re-attempts before giving up. Default: 2.
  - timeout (float) – How long to wait for a shard to download before raising an exception. Default: 120 seconds.
  - batch_size (Optional[int]) – Hint of the batch size that will be used on each device's DataLoader. Default: None.
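As a rough sketch, StreamingC4 could be instantiated along these lines. The remote bucket path, local cache directory, and tokenizer name below are hypothetical placeholders, not locations or defaults documented here.

```python
from torch.utils.data import DataLoader

from composer.datasets.c4 import StreamingC4

# Hypothetical remote/local paths and tokenizer; adjust for your own setup.
train_dataset = StreamingC4(
    remote='s3://my-bucket/c4/train',    # placeholder S3 shard directory
    local='/tmp/c4-cache/train',         # placeholder local cache directory
    split='train',                       # or 'val'
    shuffle=True,
    tokenizer_name='bert-base-uncased',  # any HuggingFace tokenizer name
    max_seq_len=512,
    group_method='truncate',             # only 'truncate' is currently supported
    batch_size=8,                        # hint matching the DataLoader batch size
)

train_loader = DataLoader(train_dataset, batch_size=8)
```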