composer.datasets.glue#

GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2019).

The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.

Please refer to the GLUE benchmark for more details.

Hparams

These classes are used with yahp for YAML-based configuration.

GLUEHparams

Sets up a generic GLUE dataset loader.

class composer.datasets.glue.GLUEHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, task=None, tokenizer_name=None, split=None, max_seq_length=256, num_workers=64, max_network_retries=10)[source]#

Bases: composer.datasets.hparams.DatasetHparams

Sets up a generic GLUE dataset loader.

Parameters
  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or the validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad it with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • task (str) – The GLUE task to train on; choose one of 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', or 'STS-B'.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See the HuggingFace documentation.

  • split (str) – The dataset split to use: 'train', 'validation', or 'test'.

  • max_seq_length (int, optional) – The maximum sequence length for the training dataset. Default: 256.

  • num_workers (int, optional) – The number of CPU workers to use to preprocess the text. Default: 64.

  • max_network_retries (int, optional) – The number of times to retry HTTP requests if they fail. Default: 10.

Returns

DataSpec – A DataSpec object.
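To illustrate the configuration pattern described above, here is a minimal, self-contained sketch of an hparams class with the documented field names and defaults. This is not composer's actual implementation — the class name `GLUEHparamsSketch` and the validation checks are hypothetical stand-ins that mirror only the fields and task choices listed in this page:

```python
from dataclasses import dataclass
from typing import Optional

# Task names as documented for the `task` field above.
_GLUE_TASKS = {'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', 'STS-B'}


@dataclass
class GLUEHparamsSketch:
    """Illustrative stand-in mirroring the documented GLUEHparams fields."""
    task: Optional[str] = None
    tokenizer_name: Optional[str] = None
    split: Optional[str] = None
    max_seq_length: int = 256
    num_workers: int = 64

    def validate(self) -> None:
        # The real validate() recurses through sub-hparams; this sketch
        # only checks the GLUE-specific fields shown in this page.
        if self.task not in _GLUE_TASKS:
            raise ValueError(f'task must be one of {sorted(_GLUE_TASKS)}, got {self.task!r}')
        if self.split not in ('train', 'validation', 'test'):
            raise ValueError(f"split must be 'train', 'validation', or 'test', got {self.split!r}")


# Construct and validate a configuration for the CoLA task.
hparams = GLUEHparamsSketch(task='CoLA', tokenizer_name='bert-base-uncased', split='train')
hparams.validate()  # passes silently when the fields are consistent
```

With yahp, the same fields would typically be supplied from a YAML file rather than constructed directly in Python.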

initialize_object(batch_size, dataloader_hparams)[source]#

Creates a DataLoader or DataSpec for this dataset.

Parameters
  • batch_size (int) โ€“ The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) โ€“ The dataset-independent hparams for the dataloader.

Returns

DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.

validate()[source]#

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises

TypeError – If any fields are of an incorrect type.
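The kind of type check that raises this TypeError can be sketched as follows. This is an illustrative re-implementation, not composer's actual code; the helper name `check_field_types` is hypothetical, and the real validate() also recurses through sub-hparams:

```python
import dataclasses


def check_field_types(hparams) -> None:
    """Raise TypeError if any dataclass field holds a value of the wrong type.

    Sketch only: it checks plain (non-generic) annotations and skips None,
    since several documented fields are optional.
    """
    for field in dataclasses.fields(hparams):
        value = getattr(hparams, field.name)
        if isinstance(field.type, type) and value is not None:
            if not isinstance(value, field.type):
                raise TypeError(
                    f'{field.name} must be {field.type.__name__}, '
                    f'got {type(value).__name__}')


@dataclasses.dataclass
class Example:
    max_seq_length: int = 256


check_field_types(Example())  # correct types: no exception
try:
    check_field_types(Example(max_seq_length='256'))  # str instead of int
except TypeError as e:
    print(e)  # prints: max_seq_length must be int, got str
```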