StreamingC4

class composer.datasets.StreamingC4(remote, local, split, shuffle, tokenizer_name, max_seq_len, group_method='truncate', max_retries=2, timeout=120, batch_size=None)

Implementation of the C4 (Colossal Clean Crawled Corpus) dataset using StreamingDataset.

Parameters
  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored.

  • local (str) – Local filesystem directory where the dataset is cached during operation.

  • split (str) – The dataset split to use, either 'train' or 'val'.

  • shuffle (bool) – Whether to shuffle the samples in this dataset.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer used to tokenize samples.

  • max_seq_len (int) – The maximum sequence length, in tokens, of each token sample.

  • group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported. Default: 'truncate'.

  • max_retries (int) – Number of download re-attempts before giving up. Default: 2.

  • timeout (float) – How long to wait, in seconds, for a shard to download before raising an exception. Default: 120.

  • batch_size (Optional[int]) – Hint of the batch size that will be used on each device's DataLoader. Default: None.
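
Example

The parameters above can be gathered into a single keyword dictionary before constructing the dataset. The following is a minimal sketch, not an official recipe. The helper names streaming_c4_kwargs and make_dataset, the bucket path s3://my-bucket/c4, and the cache directory /tmp/c4-cache are hypothetical, and the validation checks only restate the constraints documented above. It assumes an installed Composer version that exposes StreamingC4 with the signature shown on this page.

```python
from typing import Any, Dict, Optional


def streaming_c4_kwargs(
    remote: str,
    local: str,
    split: str,
    tokenizer_name: str,
    max_seq_len: int,
    shuffle: bool = True,
    batch_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect and sanity-check StreamingC4 arguments."""
    # The documentation lists only these two splits.
    if split not in ("train", "val"):
        raise ValueError(f"split must be 'train' or 'val', got {split!r}")
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be a positive integer")
    return {
        "remote": remote,
        "local": local,
        "split": split,
        "shuffle": shuffle,
        "tokenizer_name": tokenizer_name,
        "max_seq_len": max_seq_len,
        "group_method": "truncate",  # the only documented option
        "max_retries": 2,            # documented default
        "timeout": 120,              # documented default, in seconds
        "batch_size": batch_size,
    }


def make_dataset(kwargs: Dict[str, Any]):
    """Hypothetical helper: build the dataset from the collected arguments."""
    # Imported lazily so the rest of this sketch runs without Composer installed.
    from composer.datasets import StreamingC4
    return StreamingC4(**kwargs)


kwargs = streaming_c4_kwargs(
    remote="s3://my-bucket/c4",   # hypothetical remote location
    local="/tmp/c4-cache",        # hypothetical local cache directory
    split="train",
    tokenizer_name="bert-base-uncased",
    max_seq_len=512,
    batch_size=8,
)

# Constructing the dataset may contact the remote store, so it is left commented out:
# dataset = make_dataset(kwargs)
```

Passing the same batch_size here and to the per-device DataLoader keeps the hint consistent with the loader that actually consumes the dataset.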