Tip

This tutorial is available as a Jupyter notebook.

🚪 GLUE Script#

Want to pre-train your NLP model and benchmark it by fine-tuning on GLUE? We’ve got you covered.

Some training workloads are more complex than others—for example, let’s say you want to pre-train a language model and evaluate each of your pre-training checkpoints by fine-tuning it on all 8 GLUE tasks. Yeah, that sounds fairly complex, but many NLP researchers will recognize it as a standard chore, which is why we made a script to handle exactly this workload in Composer.

Tutorial Goals and Covered Concepts#

The goal of this tutorial is to demonstrate how to use the GLUE (General Language Understanding Evaluation) script for pre-training and fine-tuning NLP models across the 8 GLUE tasks.

This will cover:

  • The basics of the script and what it enables

  • How to construct your YAML for training

  • Executing an example fine-tuning job

Let’s get started!

Setup#

First, let’s configure our environment.

Install Composer#

First, install Composer if you haven’t already:

[ ]:
%pip install 'mosaicml[nlp]'
# To install from source instead of the last release, comment the command above and uncomment the following one.
# %pip install 'mosaicml[nlp] @ git+https://github.com/mosaicml/composer.git'

Next, clone the Composer GitHub repository and cd into it:

[ ]:
!git clone https://github.com/mosaicml/composer

import os
os.chdir('composer/')

Basics of the Script#

This script allows you to specify whether you want to pre-train an NLP model, fine-tune a model on the downstream tasks, or run the entire pipeline end-to-end.

If pre-training, the script will handle distributed training across all available GPUs. If fine-tuning, the script will fine-tune all given checkpoints on all 8 GLUE tasks in parallel using multiprocessing pools.

The script is designed to make this process more efficient and to remove the tedium of individually spawning jobs and manually loading each model checkpoint.
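
Concretely, the mode is selected with the script’s --training_scheme flag (your_config.yaml below stands in for your own YAML). Only the finetune value appears verbatim later in this tutorial; the other two values are assumptions, so run python examples/glue/run_glue_trainer.py --help to confirm them:

python examples/glue/run_glue_trainer.py -f your_config.yaml --training_scheme pretrain   # pre-train only (assumed value)
python examples/glue/run_glue_trainer.py -f your_config.yaml --training_scheme finetune   # fine-tune checkpoints on GLUE
python examples/glue/run_glue_trainer.py -f your_config.yaml --training_scheme all        # run the full pipeline (assumed value)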

Constructing your YAML for training#

A full out-of-the-box YAML example for this script can be found in ./glue_example.yaml. If you’re already familiar with using YAMLs with Composer, you can skip to the next part! If not, we’ll break down how this is structured.

Pre-training#

If you are only pre-training an NLP model from scratch, you just need to specify the pretrain_hparams section of the YAML. In this section, you will find your standard hyperparameters for pre-training a model—the model configuration, dataset and dataloader specifications, batch size, etc.

For the default configuration, we use identical parameters to composer/yamls/models/bert-base.yaml to pre-train a BERT model. See TrainerHparams documentation for more information about what is included in these parameters.

pretrain_hparams:
  # Use a bert-base model, initialized from scratch
  model:
    bert:
      use_pretrained: false
      tokenizer_name: bert-base-uncased
      pretrained_model_name: bert-base-uncased

  # Train the model on the English C4 corpus
  train_dataset:
    streaming_c4:
      remote: s3://allenai-c4/mds/1/
      local: /tmp/mds-cache/mds-c4/
      split: train
      shuffle: true
      tokenizer_name: bert-base-uncased
      max_seq_len: 128
      group_method: truncate
      mlm: true
      mlm_probability: 0.15

  dataloader:
    pin_memory: true
    timeout: 0
    prefetch_factor: 2
    persistent_workers: true
    num_workers: 8

  # Periodically evaluate the LanguageCrossEntropy and Masked Accuracy
  # on the validation split of the dataset.
  evaluators:
    evaluator:
        label: bert_pre_training
        eval_dataset:
          streaming_c4:
            remote: s3://allenai-c4/mds/1/
            local: /tmp/mds-cache/mds-c4/
            split: val
            shuffle: false
            tokenizer_name: bert-base-uncased
            max_seq_len: 128
            group_method: truncate
            mlm: true
            mlm_probability: 0.15
        metric_names:
          - LanguageCrossEntropy
          - MaskedAccuracy

  # Run evaluation after every 1000 training steps
  eval_interval: 1000ba

  # Use the decoupled AdamW optimizer with learning rate warmup
  optimizers:
    decoupled_adamw:
      lr: 5.0e-4                     # Peak learning rate
      betas:
        - 0.9
        - 0.98
      eps: 1.0e-06
      weight_decay: 1.0e-5           # Amount of weight decay regularization
  schedulers:
    linear_decay_with_warmup:
      t_warmup: 0.06dur              # Point when peak learning rate is reached
      alpha_f: 0.02

  max_duration: 275184000sp          # Subsample the training data for 275M samples
  train_batch_size: 4000             # Number of training examples to use per update
  eval_batch_size: 2000

  precision: amp                     # Use mixed-precision training
  grad_clip_norm: -1.0               # Turn off gradient clipping
  grad_accum: 'auto'                 # Use automatic gradient accumulation to avoid OOMs

  save_folder: checkpoints           # The directory to save checkpoints to
  save_interval: 3500ba              # Save checkpoints every 3500 batches
  save_artifact_name: '{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}'
  save_num_checkpoints_to_keep: 0
  save_overwrite: True

  loggers:
    object_store:
      object_store_hparams:         # The bucket to save checkpoints to
        s3:
          bucket: your-bucket-here

Fine-tuning#

If you are only fine-tuning checkpoints on the GLUE tasks, you need to specify the checkpoints to load by providing a finetune_ckpts list in the finetune_hparams section of your YAML, as shown below. Upon running the script with this list, it will automatically pull all of the checkpoints and fine-tune on each of them.

Note that if the finetune_ckpts list contains paths in an object store, the script expects a load_object_store instance, along with its corresponding credentials, to be specified; otherwise, it will try to load from local disk. See our checkpointing guide if you’re not familiar with our checkpoint saving and loading schema. You may also find our tutorial on training without local storage helpful.

In all logging destinations, such as Weights & Biases, and in the results table printed at the end of training, the fine-tune runs are grouped by their pre-train checkpoint name for easier organization and run tracking.

Below is an example finetune_hparams that loads checkpoints from an Amazon S3 bucket:

finetune_hparams:
  ...
  finetune_ckpts:
    - path/to/checkpoint1
    - path/to/checkpoint2

  # if paths are in ObjectStore, the following is expected to be defined
  load_object_store:
    s3:
      bucket: your-bucket-here

Note: The load paths provided in finetune_ckpts have to be relative paths within an object store bucket or local directory, as Composer does not currently allow checkpoints to be loaded via remote URIs. Alternatively, you can provide a full https URL to a remote checkpoint, such as https://storage.googleapis.com/path/to/checkpoint.pt.
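
For example, a finetune_ckpts list can mix both styles. Here is a minimal sketch (the bucket name and the first checkpoint path are placeholders, not real artifacts):

finetune_hparams:
  load_object_store:                  # needed for the in-bucket relative path below
    s3:
      bucket: your-bucket-here
  finetune_ckpts:
    - path/to/checkpoint-inside-bucket                        # relative path within the bucket or a local directory
    - https://storage.googleapis.com/path/to/checkpoint.pt    # full https URL to a remote checkpoint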

Pre-training and fine-tuning#

To run the entire end-to-end pipeline, you are expected to provide the script with your pre-train configuration as explained above, as well as any overrides to apply to the fine-tuning jobs.

In this case, the script runs in two distinct stages for distributed pre-training and multiprocessed fine-tuning; however, all information transferred between the stages is handled automatically by the script. Checkpoints are automatically saved to your specified save_folder and loaded from wherever pre-training saved them, so the finetune_ckpts section of finetune_hparams is ignored if specified.

Note: The script runs all 8 GLUE fine-tuning tasks on every saved pre-training checkpoint, so set your save_interval within your pretrain_hparams appropriately to avoid unnecessarily long evaluation times.
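
As a rough sketch, an end-to-end YAML combines the two sections along these lines (the pretrain_hparams body is abbreviated, and the finetune_hparams override shown is purely illustrative):

pretrain_hparams:
  # ... the full pre-training configuration shown above,
  # including save_folder and save_interval ...

finetune_hparams:
  # No finetune_ckpts needed: checkpoints produced during pre-training
  # are picked up automatically.
  save_folder: finetune-checkpoints   # illustrative override for where fine-tuned checkpoints are saved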

Executing your job#

Let’s now put together all our knowledge about the script and launch a job that will fine-tune a pre-trained BERT model on the 8 GLUE tasks! Because we are only fine-tuning with no special configurations, we only need to specify our bucket information and the finetune_ckpts to load from. The following configuration will load a pre-trained model from our AWS S3 bucket and save any fine-tune checkpoints under a local checkpoints folder:

[ ]:
data = {
  'finetune_hparams': {
    'load_object_store': {'s3': {'bucket': 'mosaicml-internal-checkpoints-bert'}},
    'save_folder': 'checkpoints',
    'finetune_ckpts': ['bert-baseline-tokenizer-2uoe/checkpoints/ep0-ba68796-rank0']
  }
}

Let’s now dump our constructed hparams to a YAML file to be loaded by the script:

[ ]:
import yaml
import tempfile

# Write the constructed hparams to a temporary YAML file for the script to read
tmp_file = tempfile.NamedTemporaryFile()
with open(tmp_file.name, 'w+') as f:
    yaml.dump(data, f)
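
If you like, print the temporary file to sanity-check the generated YAML before launching:

[ ]:
# Optional: inspect the generated YAML
print(open(tmp_file.name).read())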

Let’s launch it! At the end of training, we will see a table containing the GLUE per-task, GLUE-Large, and GLUE-All scores!

[ ]:
!python examples/glue/run_glue_trainer.py -f {tmp_file.name} --training_scheme finetune

💡 Pro-tip: Try python examples/glue/run_glue_trainer.py --help to get more information about the script, and python examples/glue/run_glue_trainer.py {pretrain_hparams, finetune_hparams} --help to get a detailed breakdown of your hparams options!

Next steps#

Now you’ve seen how to use Composer’s script for pre-training and fine-tuning on GLUE. Congratulations, this is about as complex as our tutorials get, so if you made it this far it’s time to get out there and start using Composer yourself!

To get going, try pre-training and fine-tuning your own models with this script. Also, feel free to check out the rest of the Composer docs.

Happy training!

Come get involved with MosaicML!#

We’d love for you to get involved with the MosaicML community in any of these ways:

Star Composer on GitHub#

Help make others aware of our work by starring Composer on GitHub.

Join the MosaicML Slack#

Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!

Contribute to Composer#

Is there a bug you noticed or a feature you’d like? File an issue or make a pull request!