composer.datasets.glue

GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2019).

The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.

Please refer to the GLUE benchmark for more details.

Hparams

These classes are used with yahp for YAML-based configuration.

GLUEHparams

Sets up a generic GLUE dataset loader.

class composer.datasets.glue.GLUEHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, task=None, tokenizer_name=None, split=None, max_seq_length=256, max_network_retries=10)

Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Parameters

use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.
synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and avoid PCI-e transfers in the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
synthetic_memory_format (MemoryFormat, optional) – The MemoryFormat to use. Ignored if use_synthetic is False. Default: MemoryFormat.CONTIGUOUS_FORMAT.
datadir (str) – The path to the data directory.
is_train (bool) – Whether to load the training data or validation data. Default: True.
drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) – Whether to shuffle the dataset. Default: True.
task (str) – The GLUE task to train on. Choose one of: 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', or 'STS-B'.
tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.
split (str) – Which split to use: 'train', 'validation', or 'test'.
max_seq_length (int, optional) – A custom maximum sequence length for the training dataset. Default: 256.
max_network_retries (int, optional) – Number of times to retry HTTP requests if they fail. Default: 10.
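The effect of drop_last on the number of batches per epoch can be worked out with simple arithmetic. The sketch below is only an illustration: `num_batches` is a hypothetical helper written for this example, not a composer function, and the sample count is arbitrary.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches produced in one pass over the dataset."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept, so round up.
    return math.ceil(num_samples / batch_size)


# 1000 samples with a batch size of 32 leave a remainder of 8 samples.
print(num_batches(1000, 32, drop_last=True))   # 31
print(num_batches(1000, 32, drop_last=False))  # 32
```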

Returns

DataLoader – A PyTorch DataLoader object.
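As a rough illustration of how the fields above fit together, the following sketch mirrors a subset of the documented parameters in a plain dataclass and checks task against the eight names listed. `GLUEConfigSketch` and `check_task` are hypothetical stand-ins written for this example and are not part of composer; the real class is GLUEHparams. The tokenizer name is only an example value.

```python
from dataclasses import dataclass
from typing import Optional

# Task names copied from the `task` parameter description above.
GLUE_TASKS = ("CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B")


@dataclass
class GLUEConfigSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""
    task: Optional[str] = None
    tokenizer_name: Optional[str] = None
    split: Optional[str] = None
    max_seq_length: int = 256
    max_network_retries: int = 10
    shuffle: bool = True
    drop_last: bool = True


def check_task(cfg: GLUEConfigSketch) -> None:
    """Raise ValueError if `task` is missing or not one of the listed names."""
    if cfg.task not in GLUE_TASKS:
        raise ValueError(f"task must be one of {GLUE_TASKS}, got {cfg.task!r}")


# Example values only; defaults follow the documented signature.
cfg = GLUEConfigSketch(task="SST-2", tokenizer_name="bert-base-uncased", split="train")
check_task(cfg)  # passes silently for a listed task
```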

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

Parameters

batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.
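The note that batch_size is device-specific can be made concrete with a little arithmetic. The sketch below assumes a global batch size that is split evenly across devices; `per_device_batch_size` is a hypothetical helper for illustration, not a composer function.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across `world_size` devices."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


# Under the even-split assumption, a global batch of 256 samples on
# 8 devices corresponds to a per-device batch_size of 32.
print(per_device_batch_size(256, 8))  # 32
```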

Returns

Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.

validate()

Validate that the hparams are of the correct types. Recurses through sub-hparams.

Raises: TypeError – If any field has an incorrect type.
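The idea of a recursive type check can be sketched with plain dataclasses. This is not the yahp or composer implementation: `validate_types`, `InnerSketch`, and `OuterSketch` are hypothetical names, and the sketch assumes every field annotation is a concrete class so that isinstance can be used directly.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class InnerSketch:
    """Hypothetical nested config standing in for sub-hparams."""
    retries: int = 10


@dataclass
class OuterSketch:
    """Hypothetical top-level config holding a nested config."""
    name: str = "glue"
    inner: InnerSketch = field(default_factory=InnerSketch)


def validate_types(obj) -> None:
    """Raise TypeError if any field value does not match its annotation."""
    for f in fields(obj):
        value = getattr(obj, f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{type(obj).__name__}.{f.name} expected {f.type.__name__}, "
                f"got {type(value).__name__}"
            )
        if is_dataclass(value):
            # Recurse into nested configs.
            validate_types(value)


validate_types(OuterSketch())  # passes silently when all types match
```

A value of the wrong type anywhere in the tree, for example `OuterSketch(inner=InnerSketch(retries="ten"))`, makes `validate_types` raise a TypeError naming the offending field.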