build_lm_dataloader#

composer.datasets.build_lm_dataloader(datadir, tokenizer_name, batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, **dataloader_kwargs)[source]#

Builds a dataloader for a generic language modeling dataset.

Parameters
  • datadir (list) – List containing the path (as a string) to the HuggingFace Datasets directory.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.

  • batch_size (int) – Batch size per device.

  • split (str) – The dataset split to use; one of 'train', 'val', or 'test'. Default: 'train'.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • drop_last (bool) – Whether to drop the last incomplete batch. Default: True.

  • use_masked_lm (bool) – Whether to encode the dataset with masked language modeling. Default: False.

  • num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.

  • mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.

  • subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.

  • **dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
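
Example

A minimal usage sketch. The dataset directory path, the 'gpt2' tokenizer name, and the num_workers value below are illustrative assumptions, not values prescribed by this API reference:

    from composer.datasets import build_lm_dataloader

    # Build a causal language modeling dataloader over a HuggingFace dataset.
    train_dataloader = build_lm_dataloader(
        datadir=["/path/to/hf_dataset"],  # assumed path to the HuggingFace Datasets directory
        tokenizer_name="gpt2",            # assumed tokenizer; any HuggingFace tokenizer name works
        batch_size=32,                    # batch size per device
        split="train",
        shuffle=True,
        drop_last=True,
        use_masked_lm=False,              # set True (with mlm_probability) for BERT-style masking
        num_tokens=0,                     # 0 trains on all tokens in the dataset
        subsample_ratio=1.0,              # use the full dataset
        num_workers=8,                    # forwarded to the dataloader via **dataloader_kwargs
    )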