composer.datasets.glue

Functions

cast

Cast a value to a type.

dataclass

Returns the same class as was passed in, with dunder methods added based on the fields defined in the class.

Classes

DataSpec

Specifications for operating and training on data.

Hparams

These classes are used with yahp for YAML-based configuration.

DataloaderHparams

Hyperparameters to initialize a Dataloader.

DatasetHparams

Abstract base class for hyperparameters to initialize a dataset.

GLUEHparams

Sets up a generic GLUE dataset loader.

Attributes

  • Dataset

  • log

class composer.datasets.glue.GLUEHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, task=None, tokenizer_name=None, split=None, max_seq_length=256, num_workers=64, max_network_retries=10)

Bases: composer.datasets.hparams.DatasetHparams

Sets up a generic GLUE dataset loader.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data (the default) or the validation data.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch (the default) or pad the last batch with zeros.

  • shuffle (bool) – Whether to shuffle the dataset. Defaults to True.

  • task (str) – The GLUE task to train on. Choose one of: CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, or STS-B.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with.

  • split (str) – Which split to use: 'train', 'validation', or 'test'.

  • max_seq_length (int) – Optional custom sequence length for the training dataset. Default: 256

  • num_workers (int) – Optional number of CPU workers to use to preprocess the text. Default: 64

  • max_network_retries (int) – Optional number of times to retry HTTP requests if they fail. Default: 10

Returns

A composer.core.DataSpec object.
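The parameters above can be pictured with a small stand-in. This is a minimal sketch, not composer's implementation: `GlueConfigSketch`, `GLUE_TASKS`, and `check_task` are hypothetical names, the defaults are copied from the documented signature, and the case-insensitive task matching and the example tokenizer name are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Task names as listed in the parameter description above.
GLUE_TASKS = ("CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B")


@dataclass
class GlueConfigSketch:
    """Hypothetical stand-in; defaults copied from the documented signature."""
    is_train: bool = True
    drop_last: bool = True
    shuffle: bool = True
    datadir: Optional[str] = None
    task: Optional[str] = None
    tokenizer_name: Optional[str] = None
    split: Optional[str] = None
    max_seq_length: int = 256
    num_workers: int = 64
    max_network_retries: int = 10


def check_task(config: GlueConfigSketch) -> str:
    """Return the canonical task name if it is one of the documented tasks.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    if config.task is None:
        raise ValueError("task must be set")
    for name in GLUE_TASKS:
        if name.lower() == config.task.lower():
            return name
    raise ValueError(f"unknown GLUE task: {config.task!r}")


# Example values are illustrative, not taken from the documentation.
config = GlueConfigSketch(
    is_train=False,
    task="rte",
    tokenizer_name="bert-base-uncased",
    split="validation",
)
print(check_task(config), config.max_seq_length)
```

Fields left unset fall back to the documented defaults, which is why only the task, tokenizer, and split need to be spelled out in this example.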

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataloaderHparams) – The dataset-independent hparams for the dataloader.

Returns
  • Dataloader or DataSpec – The dataloader, or, if the dataloader yields batches of custom types, a DataSpec.
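The batch-size arithmetic implied by batch_size and drop_last can be sketched as follows. This is an illustration under stated assumptions, not composer's code: `per_device_batch_size` and `batches_per_epoch` are hypothetical helpers, and reading "already incorporates the world size" as "global batch size divided evenly across devices" is an assumption.

```python
import math


def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch evenly across devices (assumed interpretation)."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches one device yields from num_samples samples.

    With drop_last=True the final, incomplete batch is discarded;
    otherwise it is kept, so the count rounds up.
    """
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


device_batch = per_device_batch_size(global_batch_size=512, world_size=8)
print(device_batch)                                       # 64
print(batches_per_epoch(1000, device_batch, drop_last=True))   # 15
print(batches_per_epoch(1000, device_batch, drop_last=False))  # 16
```

In this example, 1000 samples at 64 per batch leave a remainder of 40 samples, which forms the extra batch only when drop_last is False.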

validate()

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises

TypeError – If any field has an incorrect type.
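The kind of recursive type check described for validate() can be sketched with plain dataclasses. This is a minimal illustration, not how composer or yahp implement it: `RetrySketch`, `GlueSketch`, and `validate_sketch` are hypothetical names, and the sketch only handles fields annotated with plain classes.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class RetrySketch:
    """Hypothetical sub-hparams, used to show the recursion."""
    max_network_retries: int = 10


@dataclass
class GlueSketch:
    """Hypothetical top-level hparams with one nested sub-hparams field."""
    task: str = "rte"
    max_seq_length: int = 256
    retry: RetrySketch = field(default_factory=RetrySketch)


def validate_sketch(obj) -> None:
    """Raise TypeError if any field, at any depth, has the wrong type."""
    for f in fields(obj):
        value = getattr(obj, f.name)
        # f.type is the annotation; here it is always a plain class.
        if not isinstance(value, f.type):
            raise TypeError(
                f"{type(obj).__name__}.{f.name}: expected "
                f"{f.type.__name__}, got {type(value).__name__}"
            )
        # Recurse into nested dataclass instances (the "sub-hparams").
        if is_dataclass(value):
            validate_sketch(value)


validate_sketch(GlueSketch())  # passes silently

try:
    validate_sketch(GlueSketch(retry=RetrySketch(max_network_retries="ten")))
except TypeError as err:
    print(err)
```

The error from the nested field surfaces from the top-level call, which is the behavior "recurses through sub-hparams" describes.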