StreamingC4
- class composer.datasets.StreamingC4(remote, local, split, shuffle, tokenizer_name, max_seq_len, group_method='truncate', max_retries=2, timeout=120, batch_size=None)
Implementation of the C4 (Colossal Clean Crawled Corpus) dataset using StreamingDataset.
- Parameters
remote (str) – Remote directory (S3 or local filesystem) where dataset is stored.
local (str) – Local filesystem directory where dataset is cached during operation.
split (str) – The dataset split to use, either 'train' or 'val'.
shuffle (bool) – Whether to shuffle the samples in this dataset.
tokenizer_name (str) – The name of the HuggingFace tokenizer to use to tokenize samples.
max_seq_len (int) – The max sequence length of each token sample.
group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.
max_retries (int) – Number of download re-attempts before giving up. Default: 2.
timeout (float) – How long to wait for a shard to download before raising an exception. Default: 120 seconds.
batch_size (Optional[int]) – Hint of the batch_size that will be used on each device's DataLoader. Default: None.
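A minimal sketch of how the parameters above might be assembled, assuming a Composer release that still ships this class. The bucket path, cache directory, tokenizer choice, and sequence length are placeholders, and the helper names `c4_kwargs` and `make_dataset` are hypothetical, not part of Composer. The import is deferred so the argument-building helper can be checked without Composer installed.

```python
from typing import Any, Dict, Optional


def c4_kwargs(split: str, batch_size: Optional[int] = None) -> Dict[str, Any]:
    """Collect constructor arguments for StreamingC4 (hypothetical helper)."""
    # The documentation lists only these two splits.
    if split not in ("train", "val"):
        raise ValueError(f"split must be 'train' or 'val', got {split!r}")
    return {
        "remote": "s3://my-bucket/c4",   # placeholder remote location
        "local": "/tmp/c4-cache",        # placeholder local cache directory
        "split": split,
        "shuffle": split == "train",     # assumption: shuffle only for training
        "tokenizer_name": "gpt2",        # assumption: any HuggingFace tokenizer name
        "max_seq_len": 512,              # placeholder sequence length
        "group_method": "truncate",      # the only documented option
        "max_retries": 2,                # documented default
        "timeout": 120,                  # documented default, in seconds
        "batch_size": batch_size,        # hint; should match the DataLoader
    }


def make_dataset(split: str, batch_size: Optional[int] = None):
    """Instantiate StreamingC4; requires Composer and access to the remote."""
    # Imported lazily so the rest of this sketch runs without Composer.
    from composer.datasets import StreamingC4

    return StreamingC4(**c4_kwargs(split, batch_size))
```

Since batch_size is described as a hint for each device's DataLoader, passing the same value to both the dataset and the DataLoader is the likely intent.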