Tip

This tutorial is available as a Jupyter notebook.

Open in Colab

🚪 GLUE Entry Point#

This notebook demonstrates how to use the GLUE (General Language Understanding Evaluation) entry point to pre-train and fine-tune NLP models across the 8 GLUE tasks.

This will cover:

  • The basics of the entry point and what it enables

  • How to construct your YAML for training

  • Executing an example fine-tuning job

Setup#

Let's get started and configure our environment.

Install Composer#

First, install Composer if you haven't already:

[ ]:
%pip install 'mosaicml[nlp]'

Next, pull and cd into the Composer GitHub repository:

[ ]:
!git clone https://github.com/mosaicml/composer

import os
os.chdir('composer/')

Basics of the Entry Point#

This entry point lets you specify whether you want to pre-train an NLP model, fine-tune a model on the downstream tasks, or run the entire pipeline. If pre-training, the entry point handles distributed training across all available GPUs. If fine-tuning, it fine-tunes every given checkpoint on all 8 GLUE tasks in parallel using multiprocessing pools. The entry point is designed to make this process more efficient and remove the tedium of individually spawning jobs and manually loading every model checkpoint.
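The mode is selected with the --training_scheme flag when launching the entry point. The cell below is a rough sketch of the three invocations; only the finetune value is demonstrated later in this notebook, so treat the pretrain and all values, as well as the YAML path, as assumptions you can confirm with --help:

[ ]:
# Pre-train only (flag value assumed; confirm with --help)
!python examples/glue/run_glue_trainer.py -f examples/glue/glue_example.yaml --training_scheme pretrain

# Fine-tune saved checkpoints on all 8 GLUE tasks
!python examples/glue/run_glue_trainer.py -f examples/glue/glue_example.yaml --training_scheme finetune

# Run the full pre-train + fine-tune pipeline (flag value assumed; confirm with --help)
!python examples/glue/run_glue_trainer.py -f examples/glue/glue_example.yaml --training_scheme all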

Constructing your YAML for training#

A full out-of-the-box YAML example for this entry point can be found in ./glue_example.yaml. If you're already familiar with YAMLs, you can skip to the next part! If not, we'll break down how this is structured.

Pre-training#

If you are only pre-training an NLP model from scratch, you just need to specify the pretrain_hparams section of the YAML. In this section, you will find the standard hyperparameters for pre-training a model: the model configuration, dataset and dataloader specifications, batch size, etc. For the default configuration, we use identical parameters to composer/yamls/models/bert-base.yaml to pre-train a BERT model. See the TrainerHparams documentation for more information about what these parameters include.

pretrain_hparams:
  # Use a bert-base model, initialized from scratch
  model:
    bert:
      use_pretrained: false
      tokenizer_name: bert-base-uncased
      pretrained_model_name: bert-base-uncased

  # Train the model on the English C4 corpus
  train_dataset:
    streaming_c4:
      remote: s3://allenai-c4/mds/1/
      local: /tmp/mds-cache/mds-c4/
      split: train
      shuffle: true
      tokenizer_name: bert-base-uncased
      max_seq_len: 128
      group_method: truncate
      mlm: true
      mlm_probability: 0.15

  dataloader:
    pin_memory: true
    timeout: 0
    prefetch_factor: 2
    persistent_workers: true
    num_workers: 8

  # Periodically evaluate the LanguageCrossEntropy and Masked Accuracy
  # on the validation split of the dataset.
  evaluators:
    evaluator:
        label: bert_pre_training
        eval_dataset:
          streaming_c4:
            remote: s3://allenai-c4/mds/1/
            local: /tmp/mds-cache/mds-c4/
            split: val
            shuffle: false
            tokenizer_name: bert-base-uncased
            max_seq_len: 128
            group_method: truncate
            mlm: true
            mlm_probability: 0.15
        metric_names:
          - LanguageCrossEntropy
          - MaskedAccuracy

  # Run evaluation after every 1000 training steps
  eval_interval: 1000ba

  # Use the decoupled AdamW optimizer with learning rate warmup
  optimizers:
    decoupled_adamw:
      lr: 5.0e-4                     # Peak learning rate
      betas:
        - 0.9
        - 0.98
      eps: 1.0e-06
      weight_decay: 1.0e-5           # Amount of weight decay regularization
  schedulers:
    linear_decay_with_warmup:
      t_warmup: 0.06dur              # Point when peak learning rate is reached
      alpha_f: 0.02

  max_duration: 275184000sp          # Subsample the training data for 275M samples
  train_batch_size: 4000             # Number of training examples to use per update
  eval_batch_size: 2000

  precision: amp                     # Use mixed-precision training
  grad_clip_norm: -1.0               # Turn off gradient clipping
  grad_accum: 'auto'                 # Use automatic gradient accumulation to avoid OOMs

  save_folder: checkpoints           # The directory to save checkpoints to
  save_interval: 3500ba              # Save checkpoints every 3500 batches
  save_artifact_name: '{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}'
  save_num_checkpoints_to_keep: 0
  save_overwrite: True

  loggers:
    object_store:
      object_store_hparams:         # The bucket to save checkpoints to
        s3:
          bucket: your-bucket-here

Fine-tuning#

If you are only fine-tuning checkpoints on the GLUE tasks, you are expected to specify the checkpoints to load from via a finetune_ckpts list in the finetune_hparams section of your YAML. Upon running the entry point with this list, it will automatically pull all the checkpoints and fine-tune each of them. Note that if the finetune_ckpts list contains paths in object store, the entry point expects a load_object_store instance and its corresponding credentials to be specified; otherwise, it will try to load from local disk. See our checkpointing guide if you're not familiar with our checkpoint saving and loading schema.

In all logging destinations, such as Weights & Biases, and in the results table printed at the end of training, the fine-tuning runs are grouped by pre-training checkpoint name for easier organization and run tracking.

Below is an example finetune_hparams that loads checkpoints from an Amazon S3 bucket:

finetune_hparams:
  ...
  finetune_ckpts:
    - path/to/checkpoint1
    - path/to/checkpoint2

  # if paths are in ObjectStore, the following is expected to be defined
  load_object_store:
    s3:
      bucket: your-bucket-here

โ— Note: The load paths provided in finetune_ckpts have to be relative paths within an object store bucket/local directory as Composer does not currently allow checkpoints to be loaded via remote URIs. Alternatively, you can provide a full https URL to a remote checkpoint as your full path, such as https://storage.googleapis.com/path/to/checkpoint.pt.

Pre-training and fine-tuning#

To run the entire end-to-end pipeline, you provide the entry point with your pre-training configuration as explained above, as well as any overrides to apply to the fine-tuning jobs. In this case, the entry point runs in two distinct stages, distributed pre-training followed by multiprocessed fine-tuning, but all information transferred between the stages is handled automatically. Checkpoints are saved to your specified save_folder and loaded from wherever pre-training saved them, so the finetune_ckpts section of finetune_hparams is ignored if specified.

โ— Note: The entry point runs all 8 GLUE fine-tuning tasks on every saved pre-training checkpoint, so set your save_interval within your pretrain_hparams appropriately to avoid unnecessarily long evaluation times.

Executing your job#

Let's now put together everything we know about the entry point and launch a job that will fine-tune a pre-trained BERT model on the 8 GLUE tasks! Because we are only fine-tuning, with no special configurations, we only need to specify our bucket information and the finetune_ckpts to load from. The following configuration will load a pre-trained model from our AWS S3 bucket and save any fine-tuning checkpoints under a local checkpoints folder:

[ ]:
data = {
  'finetune_hparams': {
    'load_object_store': {'s3': {'bucket': 'mosaicml-internal-checkpoints-bert'}},
    'save_folder': 'checkpoints',
    'finetune_ckpts': ['bert-baseline-tokenizer-2uoe/checkpoints/ep0-ba68796-rank0']
  }
}

Let's now dump our constructed hparams to a YAML file to be loaded by the entry point:

[ ]:
import yaml
import tempfile

# Write the hparams dict to a named temporary file as YAML
tmp_file = tempfile.NamedTemporaryFile()
with open(tmp_file.name, 'w+') as f:
    yaml.dump(data, f)
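To sanity-check what the entry point will receive, you can print the dumped YAML back out:

[ ]:
# Inspect the YAML config that the entry point will load
print(open(tmp_file.name).read())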

Let's launch it! At the end of training, we will see a table containing the GLUE per-task, GLUE-Large, and GLUE-All scores!

[ ]:
!python examples/glue/run_glue_trainer.py -f {tmp_file.name} --training_scheme finetune

💡 Pro-tip: Try python examples/glue/run_glue_trainer.py --help to get more information about the entry point, and python examples/glue/run_glue_trainer.py {pretrain_hparams, finetune_hparams} --help to get a detailed breakdown of your hparams options!

Next steps#

Try pre-training and fine-tuning your own models with this framework! Also, feel free to check out the rest of our Composer docs to try using Composer speedups in this entry point!