build_lm_dataloader#

composer.datasets.build_lm_dataloader(datadir, tokenizer_name, batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, **dataloader_kwargs)[source]#

Builds a dataloader for a generic language modeling dataset.

Parameters
  • datadir (list) – List containing the path (as a string) to the HuggingFace Datasets directory.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.

  • batch_size (int) – Batch size per device.

  • split (str) – The dataset split to use; one of 'train', 'val', or 'test'. Default: 'train'.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • drop_last (bool) – Whether to drop the last incomplete batch. Default: True.

  • use_masked_lm (bool) – Whether to encode the dataset with masked language modeling. Default: False.

  • num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.

  • mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.

  • subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.

  • **dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
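
Example

A minimal usage sketch. The dataset directory path, the 'gpt2' tokenizer name, and the num_workers value below are illustrative assumptions, not values prescribed by this API reference:

    from composer.datasets import build_lm_dataloader

    # Build a causal language modeling dataloader over a HuggingFace dataset.
    train_dataloader = build_lm_dataloader(
        datadir=["/path/to/hf_dataset"],  # assumed path to the HuggingFace Datasets directory
        tokenizer_name="gpt2",            # assumed tokenizer; any HuggingFace tokenizer name works
        batch_size=32,                    # batch size per device
        split="train",
        shuffle=True,
        drop_last=True,
        use_masked_lm=False,              # set True (with mlm_probability) for BERT-style masking
        num_tokens=0,                     # 0 trains on all tokens in the dataset
        subsample_ratio=1.0,              # use the full dataset
        num_workers=8,                    # forwarded to the dataloader via **dataloader_kwargs
    )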