composer.callbacks.checkpoint_saver
Callback to save checkpoints during training.
Functions
checkpoint_periodically: Helper function to create a checkpoint scheduler according to a specified interval.
Classes
CheckpointSaver: Callback to save checkpoints.
- class composer.callbacks.checkpoint_saver.CheckpointSaver(save_folder='checkpoints', name_format='ep{epoch}-ba{batch}/rank_{rank}', save_latest_format='latest/rank_{rank}', overwrite=False, save_interval='1ep', weights_only=False)
Bases: composer.core.callback.Callback
Callback to save checkpoints.
Note
If the save_folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), one or more instances of the CheckpointSaver callback can be specified in the callbacks argument of the Trainer, as shown in the example below.
Example
>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         save_folder='checkpoints',
...         name_format="ep{epoch}-ba{batch}/rank_{rank}",
...         save_latest_format="latest/rank_{rank}",
...         save_interval="1ep",
...         weights_only=False,
...     )
... ])
- Parameters
save_folder (str) –
Folder where checkpoints are saved.
If an absolute path is specified, then that path will be used. Otherwise, the save_folder will be relative to the folder returned by get_run_directory(). If the save_folder does not exist, it will be created.
name_format (str, optional) –
A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}/rank_{rank}')
Checkpoints will be saved approximately to {save_folder}/{name_format.format(...)}.
See format_name() for the available format variables.
Note
By default, only the rank zero process will save a checkpoint file.
When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the name_format string. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.
To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).
Warning
Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.
Consider the following scenario, where:
- The default save_folder='checkpoints' is used.
- The default name_format='ep{epoch}-ba{batch}/rank_{rank}' is used.
- The current epoch count is 1.
- The current batch count is 42.
When DeepSpeed is not being used, the rank zero process will save the checkpoint to "checkpoints/ep1-ba42/rank_0".
When DeepSpeed is being used, each rank (process) will save checkpoints to:
checkpoints/ep1-ba42/rank_0.tar
checkpoints/ep1-ba42/rank_1.tar
checkpoints/ep1-ba42/rank_2.tar
...
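The scenario above can be reproduced with plain string formatting. This is a minimal sketch, not Composer's implementation: checkpoint_path is a hypothetical helper, and the '.tar' fallback only mirrors the behavior described in the note on DeepSpeed.

```python
import posixpath

# Tarball extensions named in the documentation above.
TAR_EXTENSIONS = ('.tar', '.tar.gz', '.tgz', '.tar.bzip', '.tar.lzma')


def checkpoint_path(save_folder, name_format, *, epoch, batch, rank, deepspeed=False):
    """Hypothetical helper: expand name_format and join it onto save_folder."""
    name = name_format.format(epoch=epoch, batch=batch, rank=rank)
    # Assumption taken from the note above: with DeepSpeed, '.tar' is used
    # when the name carries no tarball extension.
    if deepspeed and not name.endswith(TAR_EXTENSIONS):
        name += '.tar'
    return posixpath.join(save_folder, name)


default_format = 'ep{epoch}-ba{batch}/rank_{rank}'

# Without DeepSpeed, only rank zero writes a file.
single = checkpoint_path('checkpoints', default_format, epoch=1, batch=42, rank=0)
print(single)  # checkpoints/ep1-ba42/rank_0

# With DeepSpeed, every rank writes its own tarball.
sharded = [
    checkpoint_path('checkpoints', default_format, epoch=1, batch=42, rank=r, deepspeed=True)
    for r in range(3)
]
print(sharded)
```

Because '{rank}' appears in the format string, each rank resolves to a distinct file, which is what prevents ranks from overwriting one another.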
save_latest_format (str, optional) –
A format string for a symlink which points to the last saved checkpoint. (default: 'latest/rank_{rank}')
Symlinks will be created approximately at {save_folder}/{save_latest_format.format(...)}.
See format_name() for the available format variables.
To disable symlinks, set this parameter to None.
Consider the following scenario, where:
- The default save_folder='checkpoints' is used.
- The default name_format='ep{epoch}-ba{batch}/rank_{rank}' is used.
- The default save_latest_format='latest/rank_{rank}' is used.
- The current epoch count is 1.
- The current batch count is 42.
When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'checkpoints/ep1-ba42/rank_0', and a symlink will be created at 'checkpoints/latest/rank_0' -> 'checkpoints/ep1-ba42/rank_0'.
When DeepSpeed is being used, each rank (process) will save checkpoints to:
checkpoints/ep1-ba42/rank_0.tar
checkpoints/ep1-ba42/rank_1.tar
checkpoints/ep1-ba42/rank_2.tar
...
Corresponding symlinks will be created at:
checkpoints/latest/rank_0.tar -> checkpoints/ep1-ba42/rank_0.tar
checkpoints/latest/rank_1.tar -> checkpoints/ep1-ba42/rank_1.tar
checkpoints/latest/rank_2.tar -> checkpoints/ep1-ba42/rank_2.tar
...
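The "latest" symlink layout can be sketched with the standard library. update_latest_symlink is a hypothetical helper, and the use of a relative link target is an assumption made here for portability; the documentation above does not say whether Composer writes relative or absolute links.

```python
import os
import tempfile


def update_latest_symlink(checkpoint_file, symlink_file):
    """Hypothetical helper: point symlink_file at checkpoint_file, replacing any old link."""
    link_dir = os.path.dirname(symlink_file)
    os.makedirs(link_dir, exist_ok=True)
    # Remove a stale link left over from the previous checkpoint, if any.
    if os.path.lexists(symlink_file):
        os.remove(symlink_file)
    # Assumption: a relative target, so the folder can be moved as a unit.
    os.symlink(os.path.relpath(checkpoint_file, link_dir), symlink_file)


with tempfile.TemporaryDirectory() as root:
    ckpt = os.path.join(root, 'checkpoints', 'ep1-ba42', 'rank_0')
    latest = os.path.join(root, 'checkpoints', 'latest', 'rank_0')

    # Write a placeholder file standing in for a real checkpoint.
    os.makedirs(os.path.dirname(ckpt))
    with open(ckpt, 'w') as f:
        f.write('placeholder checkpoint contents')

    update_latest_symlink(ckpt, latest)
    link_target = os.readlink(latest)
    resolved_ok = os.path.samefile(latest, ckpt)

print(link_target, resolved_ok)
```

Calling the helper again after a later checkpoint replaces the link, so 'latest/rank_0' always resolves to the most recent file.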
overwrite (bool, optional) – Whether existing checkpoints should be overwritten. If False (the default), then the checkpoint_folder must either not exist or be empty. (default: False)
save_interval (Time | str | int | (State, Event) -> bool) –
A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved.
If an integer, checkpoints will be saved every n epochs. If Time or a time-string, checkpoints will be saved according to this interval.
See also checkpoint_periodically().
If a function, then this function should take two arguments (State, Event). The first argument will be the current state of the trainer, and the second argument will be Event.BATCH_CHECKPOINT or Event.EPOCH_CHECKPOINT (depending on the current training progress). It should return True if a checkpoint should be saved given the current state and event.
weights_only (bool) – If True, save only the model weights instead of the entire training state. This parameter must be False when using DeepSpeed. (default: False)
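A function passed as save_interval might look like the sketch below. The State and Event classes here are simplified stand-ins so the example runs without Composer installed; the epoch and batch fields are hypothetical and do not reflect the real State attributes.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    """Stand-in for Composer's Event; only the two checkpoint events are modeled."""
    BATCH_CHECKPOINT = 'batch_checkpoint'
    EPOCH_CHECKPOINT = 'epoch_checkpoint'


@dataclass
class State:
    """Stand-in for Composer's State; these field names are hypothetical."""
    epoch: int
    batch: int


def save_on_milestones(state, event):
    """Return True at the end of every 5th epoch, and once at batch 1000."""
    if event == Event.EPOCH_CHECKPOINT:
        return state.epoch % 5 == 0
    if event == Event.BATCH_CHECKPOINT:
        return state.batch == 1000
    return False


print(save_on_milestones(State(epoch=5, batch=2500), Event.EPOCH_CHECKPOINT))
print(save_on_milestones(State(epoch=2, batch=1000), Event.BATCH_CHECKPOINT))
print(save_on_milestones(State(epoch=2, batch=1001), Event.BATCH_CHECKPOINT))
```

With the real library, a function of this shape would be passed as save_interval=save_on_milestones and would read progress from the actual State object.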
- checkpoint_folder
The folder in which checkpoints are stored. If an absolute path was specified for save_folder upon instantiation, then that path will be used. Otherwise, this folder is relative to the run directory of the training run (e.g. {run_directory}/{save_folder}). If no run directory is provided, then by default, it is of the form runs/<timestamp>/rank_<GLOBAL_RANK>/<save_folder>, where timestamp is the start time of the run in ISO format, GLOBAL_RANK is the global rank of the process, and save_folder is the save_folder argument provided upon construction.
See also run_directory for details on the format of the run directory and how to customize it.
- Type
- saved_checkpoints
A dictionary mapping a save timestamp to a list of filepaths corresponding to the checkpoints saved at that time.
Note
When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. These filepaths are valid only on the global rank's node. Otherwise, when not using DeepSpeed, this list will contain only one filepath since only rank zero saves checkpoints.
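The shape of this mapping can be illustrated with a plain dictionary. This is only a sketch: the (epoch, batch) tuple keys are a hypothetical stand-in for the save timestamps, and path_for_rank is not part of the library.

```python
# Stand-in for saved_checkpoints under DeepSpeed with two ranks.
# Assumption: (epoch, batch) tuples replace the real timestamp keys.
saved_checkpoints = {
    (1, 42): [
        'checkpoints/ep1-ba42/rank_0.tar',
        'checkpoints/ep1-ba42/rank_1.tar',
    ],
    (2, 84): [
        'checkpoints/ep2-ba84/rank_0.tar',
        'checkpoints/ep2-ba84/rank_1.tar',
    ],
}


def path_for_rank(saved, timestamp, global_rank):
    """Hypothetical helper: the list index is the global rank of the writer."""
    return saved[timestamp][global_rank]


# Tuples compare element by element, so max() picks the most recent save.
latest_timestamp = max(saved_checkpoints)
print(path_for_rank(saved_checkpoints, latest_timestamp, 1))
```

Without DeepSpeed each list would hold a single entry, so only index 0 would be valid.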
- composer.callbacks.checkpoint_saver.checkpoint_periodically(interval)
Helper function to create a checkpoint scheduler according to a specified interval.
- Parameters
interval (Union[str, int, Time]) –
The interval describing how often checkpoints should be saved. If an integer, it will be assumed to be in EPOCHs. Otherwise, the unit must be either TimeUnit.EPOCH or TimeUnit.BATCH.
Checkpoints will be saved every n batches or epochs (depending on the unit), and at the end of training.
- Returns
Callable[[State, Event], bool] – A function that can be passed as the save_interval argument into the CheckpointSaver.
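The scheduling rule described above (every n batches or epochs, plus the end of training) can be sketched as a closure. This is an illustrative re-implementation under stated assumptions, not the library's code: State and Event are stand-ins, the finished flag is hypothetical, and periodic_scheduler takes a plain unit string ('ep' or 'ba') instead of a Time object.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    """Stand-in for Composer's Event; only the two checkpoint events are modeled."""
    BATCH_CHECKPOINT = 'batch_checkpoint'
    EPOCH_CHECKPOINT = 'epoch_checkpoint'


@dataclass
class State:
    """Stand-in for Composer's State; all fields here are hypothetical."""
    epoch: int
    batch: int
    finished: bool = False  # assumed flag marking the end of training


def periodic_scheduler(n, unit):
    """Hypothetical analogue of checkpoint_periodically for 'ep' or 'ba' units."""
    if unit == 'ep':
        watched_event, counter = Event.EPOCH_CHECKPOINT, 'epoch'
    elif unit == 'ba':
        watched_event, counter = Event.BATCH_CHECKPOINT, 'batch'
    else:
        raise ValueError("unit must be 'ep' or 'ba'")

    def should_save(state, event):
        # Ignore checkpoint events of the other granularity.
        if event != watched_event:
            return False
        # Save on every n-th count, and always when training has ended.
        return state.finished or getattr(state, counter) % n == 0

    return should_save


every_two_epochs = periodic_scheduler(2, 'ep')
print(every_two_epochs(State(epoch=2, batch=10), Event.EPOCH_CHECKPOINT))
print(every_two_epochs(State(epoch=3, batch=15), Event.EPOCH_CHECKPOINT))
print(every_two_epochs(State(epoch=3, batch=15, finished=True), Event.EPOCH_CHECKPOINT))
```

Returning a closure keeps the interval fixed at construction time, so the resulting callable has the (State, Event) -> bool shape that save_interval expects.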