lm_dataset_hparams

Generic hyperparameters for self-supervised training of autoregressive and masked language models.

Hparams

These classes are used with yahp for YAML-based configuration.

LMDatasetHparams

Defines a generic dataset class for self-supervised training of autoregressive and masked language models.

class composer.datasets.lm_dataset_hparams.LMDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, drop_last=True, shuffle=True, datadir=<factory>, split=None, tokenizer_name=None, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, seed=5, subsample_ratio=1.0, max_seq_length=1024)

Bases: composer.datasets.dataset_hparams.DatasetHparams, composer.datasets.synthetic_hparams.SyntheticHparamsMixin

Defines a generic dataset class for self-supervised training of autoregressive and masked language models.

Parameters
  • datadir (list) – List containing the path, as a string, to the HuggingFace Datasets directory.

  • split (str) – Which split to use: 'train', 'test', or 'validation'.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.

  • use_masked_lm (bool) – Whether the dataset should be encoded with masked language modeling or not.

  • num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.

  • mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.

  • seed (int, optional) – Random seed for generating train and validation splits. Default: 5.

  • subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.

  • max_seq_length (int, optional) – Maximum sequence length for the dataset, whether it is used for training or validation. Default: 1024.
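The interaction between use_masked_lm, mlm_probability, and max_seq_length can be illustrated with a small, self-contained sketch. This is not Composer's implementation: LMDatasetConfigSketch is a hypothetical stand-in that mirrors only three of the documented fields, and MASK_TOKEN_ID, IGNORE_INDEX, the rng_seed argument, and the checks in validate() are all assumptions made for illustration. Real masked-language-modeling pipelines often also replace some selected tokens with random tokens or leave them unchanged. This sketch always substitutes the mask token.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

IGNORE_INDEX = -100   # assumed label value that the loss skips
MASK_TOKEN_ID = 103   # hypothetical [MASK] id; the real value depends on the tokenizer


@dataclass
class LMDatasetConfigSketch:
    """Hypothetical stand-in mirroring three documented fields; NOT the Composer class."""
    use_masked_lm: bool = False
    mlm_probability: float = 0.15
    max_seq_length: int = 1024

    def validate(self) -> None:
        # Assumed sanity checks; the library's own validation may differ.
        if self.use_masked_lm and not 0.0 < self.mlm_probability < 1.0:
            raise ValueError("mlm_probability must be in (0, 1) when use_masked_lm=True")
        if self.max_seq_length <= 0:
            raise ValueError("max_seq_length must be positive")


def mask_tokens(token_ids: List[int], cfg: LMDatasetConfigSketch,
                rng_seed: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate to max_seq_length, then build (inputs, labels) for one sequence."""
    cfg.validate()
    rng = random.Random(rng_seed)  # rng_seed is a hypothetical argument for reproducibility
    tokens = token_ids[: cfg.max_seq_length]
    if not cfg.use_masked_lm:
        # Autoregressive case: labels are a copy of the inputs
        # (any shifting is assumed to happen in the model).
        return list(tokens), list(tokens)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < cfg.mlm_probability:
            inputs.append(MASK_TOKEN_ID)   # hide the token from the model
            labels.append(tok)             # and ask the model to predict it
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)    # unmasked positions do not contribute to the loss
    return inputs, labels


cfg = LMDatasetConfigSketch(use_masked_lm=True, mlm_probability=0.15, max_seq_length=8)
inputs, labels = mask_tokens(list(range(1000, 1020)), cfg, rng_seed=0)
```

With use_masked_lm=False the same call returns the truncated sequence as both inputs and labels, which is the usual setup for an autoregressive objective.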