build_synthetic_lm_dataloader#
- composer.datasets.build_synthetic_lm_dataloader(synthetic_num_unique_samples, tokenizer_name, global_batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, max_seq_length=1024, **dataloader_kwargs)[source]#
Builds a synthetic dataloader for a generic language modeling dataset.
- Parameters
synthetic_num_unique_samples (int) – Number of unique synthetic samples to generate.
tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See the HuggingFace documentation.
global_batch_size (int) – Global batch size.
split (str) – The dataset split to use; one of 'train', 'val', or 'test'. Default: 'train'.
shuffle (bool) – Whether to shuffle the dataset. Default: True.
drop_last (bool) – Whether to drop the last incomplete batch. Default: True.
use_masked_lm (bool) – Whether the dataset should be encoded with masked language modeling. Default: False.
num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.
mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.
subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.
max_seq_length (int, optional) – Maximum sequence length for the dataset. Default: 1024.
**dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
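A minimal usage sketch based on the signature above. The argument values are illustrative, and the assumption that the returned object iterates like a standard PyTorch dataloader (yielding tokenized batches) is not confirmed by this page.

```python
from composer.datasets import build_synthetic_lm_dataloader

# Build a small synthetic masked-LM dataloader; every keyword below is
# documented in the signature above, but the specific values are arbitrary.
dataloader = build_synthetic_lm_dataloader(
    synthetic_num_unique_samples=128,    # generate 128 unique synthetic samples
    tokenizer_name="bert-base-uncased",  # any HuggingFace tokenizer name
    global_batch_size=32,
    split="train",
    use_masked_lm=True,                  # encode batches for masked language modeling
    mlm_probability=0.15,                # mask 15% of tokens
    max_seq_length=128,
    num_workers=2,                       # forwarded via **dataloader_kwargs
)

# Hypothetical inspection step: assumes the result is iterable like a
# torch.utils.data.DataLoader.
batch = next(iter(dataloader))
```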