build_synthetic_lm_dataloader#
- composer.datasets.build_synthetic_lm_dataloader(synthetic_num_unique_samples, tokenizer_name, global_batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, max_seq_length=1024, **dataloader_kwargs)[source]#
Builds a synthetic dataloader for a generic language modeling dataset.
- Parameters
synthetic_num_unique_samples (int) – Number of unique synthetic samples to generate.
tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See the HuggingFace documentation.
global_batch_size (int) – Global batch size.
split (str) – The dataset split to use; one of 'train', 'val', or 'test'. Default: 'train'.
shuffle (bool) – Whether to shuffle the dataset. Default: True.
drop_last (bool) – Whether to drop the last incomplete batch. Default: True.
use_masked_lm (bool) – Whether the dataset should be encoded with masked language modeling. Default: False.
num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.
mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.
subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.
max_seq_length (int, optional) – Maximum sequence length for the dataset. Default: 1024.
**dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
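A minimal usage sketch based on the signature above. The argument values are illustrative, and the assumption that the returned object iterates like a standard PyTorch dataloader (yielding tokenized batches) is not confirmed by this page.

```python
from composer.datasets import build_synthetic_lm_dataloader

# Build a small synthetic masked-LM dataloader; every keyword below is
# documented in the signature above, but the specific values are arbitrary.
dataloader = build_synthetic_lm_dataloader(
    synthetic_num_unique_samples=128,    # generate 128 unique synthetic samples
    tokenizer_name="bert-base-uncased",  # any HuggingFace tokenizer name
    global_batch_size=32,
    split="train",
    use_masked_lm=True,                  # encode batches for masked language modeling
    mlm_probability=0.15,                # mask 15% of tokens
    max_seq_length=128,
    num_workers=2,                       # forwarded via **dataloader_kwargs
)

# Hypothetical inspection step: assumes the result is iterable like a
# torch.utils.data.DataLoader.
batch = next(iter(dataloader))
```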