StreamingC4

class composer.datasets.StreamingC4(remote, local, split, shuffle, tokenizer_name, max_seq_len, group_method='truncate', max_retries=2, timeout=120, batch_size=None)

Implementation of the C4 (Colossal Cleaned Common Crawl) dataset using StreamingDataset V1.

Parameters:

remote (str): Remote directory (S3 or local filesystem) where dataset is stored.

local (str): Local filesystem directory where dataset is cached during operation.

split (str): The dataset split to use, either 'train' or 'val'.

shuffle (bool): Whether to shuffle the samples in this dataset.

tokenizer_name (str): The name of the HuggingFace tokenizer to use to tokenize samples.

max_seq_len (int): The max sequence length of each token sample.

group_method (str): How to group text samples into token samples. Supports 'truncate' or 'concat'.

max_retries (int): Number of download re-attempts before giving up. Default: 2.

timeout (float): How long to wait for a shard to download before raising an exception. Default: 120 sec.

batch_size (Optional[int]): Hint batch_size that will be used on each device's DataLoader. Default: None.
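The difference between the two group_method values can be illustrated with a minimal sketch. This is not the library's implementation: the helper names group_truncate and group_concat and the toy token IDs are hypothetical, and the sketch assumes that 'truncate' cuts each text sample to at most max_seq_len tokens while 'concat' joins samples into one token stream and emits chunks of exactly max_seq_len tokens. The real class may also add padding or special tokens, which this sketch leaves out.

```python
from typing import Iterable, Iterator, List


def group_truncate(samples: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Assumed 'truncate' behavior: one output per input, cut to max_seq_len tokens."""
    for tokens in samples:
        yield tokens[:max_seq_len]


def group_concat(samples: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Assumed 'concat' behavior: join all samples and emit full-length chunks."""
    buffer: List[int] = []
    for tokens in samples:
        buffer.extend(tokens)
        # Emit as many complete chunks as the buffer currently holds.
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
    # Leftover tokens shorter than max_seq_len are dropped in this sketch.


# Toy token IDs standing in for tokenizer output (hypothetical values).
toy_samples = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]

truncated = list(group_truncate(toy_samples, max_seq_len=4))
concatenated = list(group_concat(toy_samples, max_seq_len=4))

print(truncated)      # one (possibly shorter) token sample per text sample
print(concatenated)   # fixed-length token samples that may span text samples
```

Under these assumptions, 'truncate' keeps a one-to-one mapping between text samples and token samples but discards tokens past max_seq_len, whereas 'concat' keeps those tokens at the cost of token samples that can span document boundaries.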