Tip

This tutorial is available as a Jupyter notebook.


♻️ Auto Microbatching#

Have you ever wanted to choose your batch size without having to stress about CUDA Out-of-Memory (OOM) errors? We sure have. That’s why we built Composer’s automatic microbatching feature.

This tutorial will demonstrate how to use automatic microbatching to avoid CUDA OOMs, regardless of your batch size choice, GPU type, and number of devices.

Note that this demo requires a GPU, since automatic microbatching works by catching and responding to CUDA out-of-memory errors.

Tutorial Goals and Concepts Covered#

The goal of this tutorial is to show you how to turn on automatic microbatching and to provide a sandbox to play around with it a bit. Please feel free to experiment with different batch sizes and other configuration choices to see how it works!

For details of the implementation, see our Automatic Microbatching documentation.

Let’s get started!

Set Up Our Workspace#

We’ll start by installing Composer:

[ ]:
%pip install mosaicml
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# %pip install git+https://github.com/mosaicml/composer.git

We are going to use the CIFAR-10 dataset with a ResNet-56 model and some standard optimization settings. For the purposes of this tutorial, we’ll choose a very large batch size and increase the image size to 96x96. These settings will cause CUDA Out-of-Memory errors on most GPUs.

[ ]:
import torch

import composer
from torchvision import datasets, transforms


torch.manual_seed(42) # For replicability

data_directory = "./data"

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)

# choose a very large batch size
batch_size = 2048

cifar10_transforms = transforms.Compose([
  transforms.ToTensor(),
  transforms.Normalize(mean, std),
  transforms.Resize(size=[96, 96])  # choose a large image size
])

train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)
test_dataset = datasets.CIFAR10(data_directory, train=False, download=True, transform=cifar10_transforms)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
[ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from composer.models import ComposerClassifier

class Block(nn.Module):
    """A ResNet block."""

    def __init__(self, f_in: int, f_out: int, downsample: bool = False):
        super(Block, self).__init__()

        stride = 2 if downsample else 1
        self.conv1 = nn.Conv2d(f_in, f_out, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(f_out)
        self.conv2 = nn.Conv2d(f_out, f_out, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(f_out)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut connection: a 1x1 conv (with BatchNorm) when downsampling or changing width; otherwise an identity.
        if downsample or f_in != f_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(f_in, f_out, kernel_size=1, stride=2, bias=False),
                nn.BatchNorm2d(f_out),
            )
        else:
            self.shortcut = nn.Sequential()

    def forward(self, x: torch.Tensor):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        return self.relu(out)

class ResNetCIFAR(nn.Module):
    """A residual neural network as originally designed for CIFAR-10."""

    def __init__(self, outputs: int = 10):
        super(ResNetCIFAR, self).__init__()

        depth = 56
        width = 16
        num_blocks = (depth - 2) // 6

        plan = [(width, num_blocks), (2 * width, num_blocks), (4 * width, num_blocks)]

        self.num_classes = outputs

        # Initial convolution.
        current_filters = plan[0][0]
        self.conv = nn.Conv2d(3, current_filters, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(current_filters)
        self.relu = nn.ReLU(inplace=True)

        # The subsequent blocks of the ResNet.
        blocks = []
        for segment_index, (filters, num_blocks) in enumerate(plan):
            for block_index in range(num_blocks):
                downsample = segment_index > 0 and block_index == 0
                blocks.append(Block(current_filters, filters, downsample))
                current_filters = filters

        self.blocks = nn.Sequential(*blocks)

        # Final fc layer. Size = number of filters in last segment.
        self.fc = nn.Linear(plan[-1][0], outputs)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x: torch.Tensor):
        out = self.relu(self.bn(self.conv(x)))
        out = self.blocks(out)
        out = F.avg_pool2d(out, out.size()[3])
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

model = ComposerClassifier(module=ResNetCIFAR(), num_classes=10)

optimizer = composer.optim.DecoupledSGDW(
    model.parameters(), # Model parameters to update
    lr=0.05,
    momentum=0.9,
)

Train with Automatic Microbatching#

Now we run our trainer code with the device_train_microbatch_size='auto' setting.

[ ]:
assert torch.cuda.is_available(), "Demonstrating automatic microbatching requires a GPU."

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    device_train_microbatch_size='auto',  # <--- Activate Composer magic!
    device='gpu'
)

# Train
trainer.fit()

Depending on your GPU type, you should see some logs, prior to the start of training, showing Composer dynamically decreasing the microbatch size (and thus increasing the gradient accumulation rate) until the batch fits into memory. For example, something like:

INFO:composer.trainer.trainer:CUDA out of memory detected.
Train microbatch size decreased from 2048 -> 1024, and the batch
will be retrained.

Worry not! This just means everything is working as expected. With automatic microbatching enabled, Composer responds to OOM errors during training by halving the microbatch size. Under the hood, each minibatch is split into n “microbatches”, where n is the accumulation rate, and gradients are accumulated across microbatches before stepping the optimizer. So, you should expect to see the microbatch size decrease and accumulation rate increase until the resulting microbatch size fits on the device. This lets you focus on getting the best minibatch size without having to stress about what your hardware can handle.
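To make the mechanism concrete, here is a minimal sketch, in plain PyTorch, of what gradient accumulation over microbatches looks like for a single optimization step. This is not Composer's internal implementation; plain_model, plain_optimizer, and microbatch_size are illustrative names chosen for this sketch, and with device_train_microbatch_size='auto' Composer handles all of this (including choosing the microbatch size) for you:

# A minimal sketch (not Composer's actual implementation) of gradient
# accumulation over microbatches for one optimization step.
plain_model = ResNetCIFAR().cuda()                # illustrative stand-in for the model above
plain_optimizer = torch.optim.SGD(plain_model.parameters(), lr=0.05, momentum=0.9)
microbatch_size = 1024                            # illustrative; 'auto' finds a workable value for you

inputs, targets = next(iter(train_dataloader))    # one minibatch (batch_size=2048 samples)
inputs, targets = inputs.cuda(), targets.cuda()

num_microbatches = max(1, inputs.shape[0] // microbatch_size)
plain_optimizer.zero_grad()
for micro_inputs, micro_targets in zip(inputs.chunk(num_microbatches),
                                       targets.chunk(num_microbatches)):
    loss = F.cross_entropy(plain_model(micro_inputs), micro_targets)
    (loss / num_microbatches).backward()          # gradients accumulate across microbatches
plain_optimizer.step()                            # a single optimizer step for the full minibatch

Note also that device_train_microbatch_size accepts a fixed integer (e.g., device_train_microbatch_size=1024) if you would rather pin the microbatch size yourself instead of using 'auto'.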

What next?#

You’ve now seen how to turn on automatic microbatching using the Composer trainer.

To dig deeper, see our Automatic Microbatching documentation.

In addition, please continue to explore our tutorials!

Come get involved with MosaicML!#

We’d love for you to get involved with the MosaicML community in any of these ways:

Star Composer on GitHub#

Help make others aware of our work by starring Composer on GitHub.

Join the MosaicML Slack#

Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!

Contribute to Composer#

Is there a bug you noticed or a feature you’d like? File an issue or make a pull request!