⏯️ Autoresume Training
When the Trainer is configured with autoresume=True, it will automatically look for existing checkpoints and resume training. If no checkpoints exist, it'll start a new training run. This allows you to automatically resume from any faults, with no code changes.
To see this example in action, run this notebook twice.
The first time the notebook is run, the trainer will train from scratch, saving a checkpoint to the save_folder at the end of every epoch. Any subsequent time the notebook is run, the trainer will resume from the latest checkpoint. If the latest checkpoint was saved at max_duration, meaning all training is finished, the Trainer will exit immediately with an error that no training would occur.
To simulate a flaky spot instance, try interrupting the notebook (e.g. Ctrl-C) midway through the first training run (say, after epoch 0 is finished). Notice that on the next run the progress bars resume at the next epoch and do not repeat any training already completed.
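The resume-or-start behavior described above can be illustrated with a small toy loop. This is only a sketch using the standard library, not Composer's implementation: the function toy_fit, the SimulatedPreemption exception, and the latest.json file name are all hypothetical names chosen for the illustration.

```python
import json
import tempfile
from pathlib import Path


class SimulatedPreemption(Exception):
    """Hypothetical stand-in for Ctrl-C or a spot-instance preemption."""


def toy_fit(save_folder, max_epochs, interrupt_after=None):
    """Toy resumable training loop; returns the epochs executed in this call."""
    folder = Path(save_folder)
    folder.mkdir(parents=True, exist_ok=True)
    ckpt = folder / 'latest.json'  # hypothetical checkpoint file name

    # Resume from the saved epoch count if a checkpoint exists, else start at 0.
    start = json.loads(ckpt.read_text())['epochs_done'] if ckpt.exists() else 0
    if start >= max_epochs:
        # Mirrors the behavior described above: nothing is left to train.
        raise ValueError('checkpoint is already at max_epochs; no training would occur')

    executed = []
    for epoch in range(start, max_epochs):
        executed.append(epoch)  # "train" one epoch
        ckpt.write_text(json.dumps({'epochs_done': epoch + 1}))  # save after every epoch
        if interrupt_after is not None and epoch == interrupt_after:
            raise SimulatedPreemption(f'interrupted after epoch {epoch}')
    return executed


with tempfile.TemporaryDirectory() as folder:
    try:
        toy_fit(folder, max_epochs=5, interrupt_after=0)  # first run, interrupted
    except SimulatedPreemption:
        pass
    print(toy_fit(folder, max_epochs=5))  # second run resumes: [1, 2, 3, 4]
```

Calling toy_fit a third time on the same folder would raise the ValueError, which corresponds to the case where the latest checkpoint is already at max_duration.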
This feature requires no code or configuration changes to distinguish between starting a new training run and automatically resuming from an existing one, making it easy to use Composer on preemptible cloud instances. Simply configure the instance to start Composer with the same command every time until training has finished!
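One way to "start Composer with the same command every time" is a small supervisor that reruns the command until it exits successfully. The sketch below is an assumption about how such a wrapper might look, not part of Composer; run_until_success is a hypothetical helper, and the demonstration uses a trivial command in place of a real training script.

```python
import subprocess
import sys
import time


def run_until_success(command, max_attempts=10, delay_seconds=0.0):
    """Rerun `command` until it exits with code 0; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        time.sleep(delay_seconds)  # optional pause before the next attempt
    raise RuntimeError(f'command did not succeed within {max_attempts} attempts')


# Demonstration with a command that succeeds immediately. In practice the command
# would launch the training script, e.g. [sys.executable, 'train.py'], where
# train.py is a hypothetical file containing the training code from this page.
attempts = run_until_success([sys.executable, '-c', 'pass'])
print(attempts)  # 1
```

Because the loop stops at the first successful exit, it never relaunches a run whose latest checkpoint is already at max_duration, which would otherwise exit with an error.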
Install Dependencies
Install Composer, if it isn't already installed.
%pip install mosaicml
Training Script
import torch.utils.data
from torch.optim import SGD
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from composer import Trainer
from composer.models.classify_mnist import mnist_model
# Configure the trainer -- here, we train a simple MNIST classifier
model = mnist_model(num_classes=10)
optimizer = SGD(model.parameters(), lr=0.01)
train_dataloader = torch.utils.data.DataLoader(
    dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),
    batch_size=2048,
)
# Evaluate on the held-out test split (train=False) rather than the training data
eval_dataloader = torch.utils.data.DataLoader(
    dataset=MNIST('~/datasets', train=False, download=True, transform=ToTensor()),
    batch_size=2048,
)
# When using `autoresume`, a `run_name` is required so that
# Composer knows which training run to resume
run_name = 'my_autoresume_training_run'
trainer = Trainer(
    model=model,
    max_duration='5ep',
    optimizers=optimizer,

    # Train Data Configuration
    train_dataloader=train_dataloader,
    train_subset_num_batches=5,  # For this example, limit each epoch to 5 batches

    # Evaluation Configuration
    eval_dataloader=eval_dataloader,
    eval_subset_num_batches=5,  # For this example, limit evaluation to 5 batches

    # Checkpoint Configuration
    run_name=run_name,
    save_folder='./my_autoresume_training_run',
    save_interval='1ep',

    # Configure autoresume!
    autoresume=True,
)
print('Training!')
# Train!
trainer.fit()
# Print the number of trained epochs (should always be the `max_duration`, which is 5ep)
print(f'\nNumber of epochs trained: {trainer.state.timestamp.epoch}')
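To confirm that checkpoints were written between runs, it can help to list the contents of the save_folder. The snippet below is a minimal sketch using only the standard library; list_checkpoints is a hypothetical helper, and the file names it prints depend on how Composer is configured to name checkpoints, so no particular names are assumed here.

```python
from pathlib import Path


def list_checkpoints(save_folder):
    """Return the sorted names of the files in `save_folder` (empty if it does not exist)."""
    folder = Path(save_folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


# Same folder as the `save_folder` passed to the Trainer above.
print(list_checkpoints('./my_autoresume_training_run'))
```

An empty list before the first run and a non-empty list afterwards is what lets a later run resume instead of starting over.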