composer.datasets.glue
GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2019).
The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.
Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.
Please refer to the GLUE benchmark for more details.
Hparams
These classes are used with yahp for YAML-based configuration.
GLUEHparams: Sets up a generic GLUE dataset loader.
- class composer.datasets.glue.GLUEHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, task=None, tokenizer_name=None, split=None, max_seq_length=256, max_network_retries=10)
Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin
Sets up a generic GLUE dataset loader.
- Parameters
use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.
synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and avoid PCI-e transfers in the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
synthetic_memory_format – The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.
datadir (str) – The path to the data directory.
is_train (bool) – Whether to load the training data or validation data. Default: True.
drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool) – Whether to shuffle the dataset. Default: True.
task (str) – The GLUE task to train on. Choose one of 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', and 'STS-B'.
tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.
split (str) – Which split to use: 'train', 'validation', or 'test'.
max_seq_length (int, optional) – A custom sequence length for the training dataset. Default: 256.
max_network_retries (int, optional) – Number of times to retry HTTP requests if they fail. Default: 10.
- Returns
DataLoader – A PyTorch DataLoader object.
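As a minimal sketch of how the parameters above fit together, the following builds a keyword-argument dictionary that mirrors the documented signature and checks `task` and `split` against the values listed above. The helper `make_glue_kwargs` is hypothetical and not part of Composer, the tokenizer name `'bert-base-uncased'` is only an illustrative choice, and tying `is_train`, `shuffle`, and `drop_last` to the split is an assumption made for the example.

```python
# Hypothetical helper, not part of Composer: assembles keyword arguments that
# mirror the documented GLUEHparams signature and validates two of them.

GLUE_TASKS = ("CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B")
GLUE_SPLITS = ("train", "validation", "test")


def make_glue_kwargs(task, tokenizer_name, split, datadir=None, max_seq_length=256):
    """Return kwargs for GLUEHparams, rejecting undocumented task/split values."""
    if task not in GLUE_TASKS:
        raise ValueError(f"task must be one of {GLUE_TASKS}, got {task!r}")
    if split not in GLUE_SPLITS:
        raise ValueError(f"split must be one of {GLUE_SPLITS}, got {split!r}")
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    is_train = split == "train"
    return {
        "task": task,
        "tokenizer_name": tokenizer_name,
        "split": split,
        "datadir": datadir,
        "max_seq_length": max_seq_length,
        # Assumption for this sketch: training-style flags follow the split.
        "is_train": is_train,
        "shuffle": is_train,
        "drop_last": is_train,
    }


kwargs = make_glue_kwargs("SST-2", "bert-base-uncased", "train")
# With Composer installed, these could be passed as GLUEHparams(**kwargs),
# assuming the signature documented above.
```

Validating `task` and `split` up front surfaces a misspelled name before any download or tokenization is attempted; whether the library itself accepts other spellings is not stated here.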
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.
- Returns
Iterable | DataSpec – An iterable that yields batches, or if the dataset yields batches that need custom processing, a DataSpec.
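Because `batch_size` here is the per-device value, it can help to see the arithmetic. The sketch below assumes a global batch size that is split evenly across devices and then counts how many batches a device would yield with and without `drop_last`. Both helper names are hypothetical, and the sample count is made up for illustration.

```python
# Hypothetical helpers illustrating the batch-size arithmetic; not Composer APIs.

def per_device_batch_size(global_batch_size, world_size):
    """Split a global batch size evenly across `world_size` devices."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


def num_batches(num_samples, batch_size, drop_last):
    """Count batches yielded for `num_samples` samples on one device."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The trailing partial batch is discarded.
        return num_samples // batch_size
    # The trailing partial batch is kept, so round up.
    return (num_samples + batch_size - 1) // batch_size


device_batch = per_device_batch_size(global_batch_size=256, world_size=8)
dropped = num_batches(1000, device_batch, drop_last=True)
kept = num_batches(1000, device_batch, drop_last=False)
print(device_batch, dropped, kept)  # prints: 32 31 32
```

In this sketch, `device_batch` is the value that would be passed as `batch_size` to `initialize_object`.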