composer.datasets.lm_datasets

Functions

dataclass

Returns the same class as was passed in, with dunder methods added based on the fields defined in the class.

join

Join two or more pathname components, inserting '/' as needed.

Classes

DataSpec

Specifications for operating and training on data.

Hparams

These classes are used with yahp for YAML-based configuration.

DataloaderHparams

Hyperparameters to initialize a Dataloader.

DatasetHparams

Abstract base class for hyperparameters to initialize a dataset.

LMDatasetHparams

Defines a generic dataset class for autoregressive and masked language models trained with self-supervised learning.

Attributes

  • Batch

  • List

  • Optional

  • log

class composer.datasets.lm_datasets.LMDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=<factory>, split=None, tokenizer_name=None, use_masked_lm=None, num_tokens=0, mlm_probability=0.15, seed=5, subsample_ratio=1.0, train_sequence_length=1024, val_sequence_length=1024)

Bases: composer.datasets.hparams.DatasetHparams

Defines a generic dataset class for autoregressive and masked language models trained with self-supervised learning.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data (the default) or validation data.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch (the default) or pad the last batch with zeros.

  • shuffle (bool) – Whether to shuffle the dataset. Defaults to True.
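
To illustrate how drop_last changes the number of batches per epoch, here is a minimal sketch. The helper name num_batches is hypothetical and is not part of Composer; it only mirrors the behavior described for the parameter above.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Return how many batches a dataset of ``num_samples`` samples yields."""
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // batch_size
    # The trailing partial batch is kept as one extra batch
    # (described above as padded with zeros).
    return math.ceil(num_samples / batch_size)


# 10 samples with a batch size of 4: two full batches plus 2 leftover samples.
print(num_batches(10, 4, drop_last=True))
print(num_batches(10, 4, drop_last=False))
```

When the number of samples is an exact multiple of the batch size, both settings yield the same number of batches.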

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataloaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Dataloader or DataSpec – The dataloader, or if the dataloader yields batches of custom types, a DataSpec.
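
Because batch_size is described as device-specific, a caller that starts from a global batch size has to divide it across devices first. The following is a rough sketch of that arithmetic, assuming an even split; the function name per_device_batch_size is hypothetical, and Composer's own handling may differ.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across ``world_size`` devices."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        # Assumption for this sketch: uneven splits are rejected.
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


# A global batch of 256 samples on 8 devices gives 32 samples per device.
print(per_device_batch_size(256, 8))
```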

validate()

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises

TypeError – If any field has an incorrect type.
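
The recursive type check described above can be sketched with plain dataclasses. This is an illustrative approximation, not the yahp implementation: the classes ToyLoaderHparams and ToyDatasetHparams and the standalone validate function are hypothetical, and only fields annotated with plain classes are checked.

```python
from dataclasses import dataclass, field, fields, is_dataclass


def validate(obj) -> None:
    """Raise TypeError if a field has the wrong type; recurse into nested dataclasses."""
    for f in fields(obj):
        value = getattr(obj, f.name)
        # Only plain classes (int, str, ...) are checked in this sketch;
        # typing constructs such as Optional[...] are skipped.
        if isinstance(f.type, type) and not isinstance(value, f.type):
            raise TypeError(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
        # Recurse through sub-hparams that are dataclass instances.
        if is_dataclass(value) and not isinstance(value, type):
            validate(value)


@dataclass
class ToyLoaderHparams:  # hypothetical stand-in, not Composer's DataloaderHparams
    num_workers: int = 0


@dataclass
class ToyDatasetHparams:  # hypothetical stand-in, not Composer's LMDatasetHparams
    datadir: str = "data"
    shuffle: bool = True
    loader: ToyLoaderHparams = field(default_factory=ToyLoaderHparams)


validate(ToyDatasetHparams())  # passes silently

try:
    validate(ToyDatasetHparams(loader=ToyLoaderHparams(num_workers="4")))
except TypeError as err:
    print(f"validation failed: {err}")
```

The second call fails inside the nested object, which shows the check reaching sub-hparams rather than stopping at the top level.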