composer.datasets.glue_hparams
GLUE (General Language Understanding Evaluation) dataset hyperparameters (Wang et al., 2019).
The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.
Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.
Please refer to the GLUE benchmark for more details.
Hparams

These classes are used with yahp for YAML-based configuration.

- GLUEHparams: Sets up a generic GLUE dataset loader.
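As a rough illustration only, a YAML fragment for configuring this class might look like the following. The keys are assumed to mirror the constructor arguments documented below; the exact schema is defined by yahp and composer, so consult their documentation for the authoritative format.

```yaml
# Hypothetical yahp-style configuration for GLUEHparams;
# key names are assumed to mirror the constructor arguments.
task: SST-2
tokenizer_name: bert-base-uncased
split: train
max_seq_length: 256
max_network_retries: 10
```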
- class composer.datasets.glue_hparams.GLUEHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, drop_last=True, shuffle=True, task=None, tokenizer_name=None, split=None, max_seq_length=256, max_network_retries=10)
Bases: composer.datasets.dataset_hparams.DatasetHparams, composer.datasets.synthetic_hparams.SyntheticHparamsMixin
Sets up a generic GLUE dataset loader.
- Parameters
  - task (str) – The GLUE task to train on; choose one of 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', or 'STS-B'.
  - tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See the HuggingFace documentation.
  - split (str) – Whether to use the 'train', 'validation', or 'test' split.
  - max_seq_length (int, optional) – A custom maximum sequence length for the training dataset. Default: 256.
  - max_network_retries (int, optional) – Number of times to retry HTTP requests if they fail. Default: 10.
- Returns
  - DataLoader – A PyTorch DataLoader object.
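To make the hyperparameter pattern concrete, the sketch below uses a plain dataclass standing in for GLUEHparams. The class name `GLUEHparamsSketch`, its `validate` method, and the stand-alone validation logic are all hypothetical illustrations of the documented fields, not composer's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GLUEHparamsSketch:
    """Hypothetical stand-in mirroring the documented fields of GLUEHparams."""
    task: Optional[str] = None            # one of the GLUE task names below
    tokenizer_name: Optional[str] = None  # HuggingFace tokenizer name, e.g. 'bert-base-uncased'
    split: Optional[str] = None           # 'train', 'validation', or 'test'
    max_seq_length: int = 256
    max_network_retries: int = 10

    # Class-level constant (not a dataclass field): the valid GLUE tasks.
    _VALID_TASKS = {'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', 'STS-B'}

    def validate(self) -> None:
        # Basic sanity checks one might run before building a dataloader.
        if self.task not in self._VALID_TASKS:
            raise ValueError(f'task must be one of {sorted(self._VALID_TASKS)}, got {self.task!r}')
        if self.split not in {'train', 'validation', 'test'}:
            raise ValueError(f"split must be 'train', 'validation', or 'test', got {self.split!r}")

hp = GLUEHparamsSketch(task='SST-2', tokenizer_name='bert-base-uncased', split='train')
hp.validate()
print(hp.max_seq_length)  # prints 256
```

In the real API, an instance configured this way would produce the PyTorch DataLoader described in the Returns section above.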