Tip

This tutorial is available as a Jupyter notebook.


โ™ป๏ธ Auto Grad Accum#

This notebook demonstrates how to use automatic gradient accumulation to avoid CUDA out-of-memory (OOM) errors, regardless of your batch size, GPU type, or number of devices. Experiment with different combinations and see how it works!

For details of the implementation, see our Auto Grad Accum documentation.

We'll start by installing Composer:

[ ]:
%pip install mosaicml
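
Before we set up the workspace, here's a quick refresher on what gradient accumulation actually does: instead of one forward/backward pass over the full batch, the batch is split into smaller microbatches whose gradients are summed before a single optimizer step. Below is a minimal, self-contained sketch of the manual version in plain PyTorch (the toy model and data are purely illustrative); grad_accum='auto' picks and adjusts the number of microbatches for you.

[ ]:
import torch

# A toy model and batch, purely to illustrate the mechanism.
toy_model = torch.nn.Linear(10, 2)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(64, 10)          # one "full" batch of 64 samples
targets = torch.randint(0, 2, (64,))
num_accum_steps = 4                   # split into 4 microbatches of 16

toy_optimizer.zero_grad()
for inp, tgt in zip(inputs.chunk(num_accum_steps), targets.chunk(num_accum_steps)):
    loss = loss_fn(toy_model(inp), tgt)
    (loss / num_accum_steps).backward()  # scale so summed grads match the full-batch gradient
toy_optimizer.step()                     # one optimizer step per full batch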

Set Up Our Workspace#

We are going to use the CIFAR-10 dataset with a ResNet-56 model and some standard optimization settings. For the purposes of this notebook, we'll choose a very large batch size and increase the image size to 96x96, so that you would typically hit CUDA out-of-memory errors on most GPUs.

[ ]:
import torch
from torchvision import datasets, transforms

import composer


torch.manual_seed(42) # For replicability

data_directory = "./data"

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)

# choose a very large batch size
batch_size = 2048

cifar10_transforms = transforms.Compose([
  transforms.ToTensor(),
  transforms.Normalize(mean, std),
  transforms.Resize(size=[96, 96])  # choose a large image size
])

train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)
test_dataset = datasets.CIFAR10(data_directory, train=False, download=True, transform=cifar10_transforms)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
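
To see why this configuration is OOM-prone, you can optionally peek at a single batch: at a batch size of 2048 and 96x96 images, the float32 input tensor alone is roughly 216 MB, before the model allocates any activations. This check is just an aside; the names all come from the cells above.

[ ]:
# Optional: inspect one batch to see its size before any activations exist.
images, labels = next(iter(train_dataloader))
print(images.shape)  # torch.Size([2048, 3, 96, 96])
print(f"{images.numel() * images.element_size() / 1024**2:.0f} MB of inputs alone")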
[ ]:
from composer import models
model = models.ComposerResNetCIFAR(model_name='resnet_56', num_classes=10)

Train a Baseline Model#

Now we run our trainer code with the grad_accum='auto' setting. Note that this demo requires a GPU to demonstrate automatic gradient accumulation.

[ ]:
assert torch.cuda.is_available(), "Demonstrating automatic gradient accumulation requires a GPU."

optimizer = composer.optim.DecoupledSGDW(
    model.parameters(), # Model parameters to update
    lr=0.05,
    momentum=0.9,
)


trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    grad_accum='auto',
    device='gpu'
)


trainer.fit()

Depending on your GPU type, you should see logs prior to the start of training that show the gradient accumulation increasing dynamically until the model fits into memory, e.g. something like:

INFO:composer.trainer.trainer:CUDA out of memory detected.
Gradient Accumulation increased from 1 -> 2, and the batch
will be retrained.
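
Under the hood, the trainer catches these OOM errors and retries the same batch with a larger accumulation value (i.e., smaller microbatches). The cell below is a simplified, self-contained sketch of that idea, using a throwaway torchvision ResNet-18 rather than Composer's internals, which handle many more edge cases.

[ ]:
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Simplified sketch of the retry loop behind grad_accum='auto'.
# Illustrative only -- the real logic lives inside Composer's Trainer.
sketch_model = resnet18(num_classes=10).cuda()
sketch_opt = torch.optim.SGD(sketch_model.parameters(), lr=0.05)

inputs = torch.randn(2048, 3, 96, 96)
targets = torch.randint(0, 10, (2048,))

grad_accum = 1
while True:
    try:
        sketch_opt.zero_grad()
        for inp, tgt in zip(inputs.chunk(grad_accum), targets.chunk(grad_accum)):
            loss = F.cross_entropy(sketch_model(inp.cuda()), tgt.cuda())
            (loss / grad_accum).backward()
        sketch_opt.step()
        break                            # the batch fit; we're done
    except RuntimeError as e:
        if "out of memory" in str(e) and grad_accum < len(inputs):
            grad_accum *= 2              # halve the microbatch size and retry
            torch.cuda.empty_cache()
        else:
            raise

print(f"Batch fit with grad_accum={grad_accum}")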

Experiment with different batch sizes and image sizes, and notice that the trainer never hits out-of-memory errors, and that you never have to manually tune the gradient accumulation to get the model to fit!

For more details, see our Auto Grad Accum documentation.