composer.utils.checkpoint#

Utilities for working with training checkpoints.

Functions

load_checkpoint    Load a checkpoint from a local file, URI, or cloud object store into state.
save_checkpoint    Checkpoint the training state.

composer.utils.checkpoint.format_name(name_format, state)[source]#

Format a checkpoint filename according to the name_format and the training State.

The following format variables are available:

Variable              Description
--------------------  -------------------------------------------------------------------------
{rank}                The global rank, as returned by get_global_rank().
{local_rank}          The local rank of the process, as returned by get_local_rank().
{world_size}          The world size, as returned by get_world_size().
{local_world_size}    The local world size, as returned by get_local_world_size().
{node_rank}           The node rank, as returned by get_node_rank().
{epoch}               The total epoch count, as returned by epoch().
{batch}               The total batch count, as returned by batch().
{batch_in_epoch}      The batch count in the current epoch, as returned by batch_in_epoch().
{sample}              The total sample count, as returned by sample().
{sample_in_epoch}     The sample count in the current epoch, as returned by sample_in_epoch().
{token}               The total token count, as returned by token().
{token_in_epoch}      The token count in the current epoch, as returned by token_in_epoch().

Note

If using DeepSpeed, and name_format does not end with a tarfile archive extension ('.tar', '.tgz', '.tar.gz', '.tar.bz2', or '.tar.lzma'), then '.tar' will be appended. DeepSpeed uses a tarball format because it saves model and optimizer states in separate files within the tarball.

Consider the following scenario, where the current epoch count is 1 and the current batch count is 42:

  • When not using DeepSpeed, then the rank zero process will call this function:

    >>> format_name("ep{epoch}-ba{batch}", state)
    'ep1-ba42'
    
  • When using DeepSpeed, each rank (process) will call this function. '{rank}' should appear within name_format, so each rank (process) will write to its own file. For example, on the rank zero process:

    >>> format_name("ep{epoch}-ba{batch}-rank{rank}", state)
    'ep1-ba42-rank0.tar'
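
As a further illustration, here is a hedged sketch combining several of the format variables from the table above. The sample count shown in the comments is hypothetical, and state is assumed to be an existing State whose epoch and batch counts match the scenario above.

    from composer.utils.checkpoint import format_name

    # `state` is assumed to be an existing State with epoch=1 and batch=42,
    # as in the scenario above; the sample count below is hypothetical.
    name = format_name("ep{epoch}-ba{batch}-sa{sample}-rank{rank}", state)
    # On the rank zero process this might produce e.g. 'ep1-ba42-sa5376-rank0'
    # ('.tar' would be appended automatically when DeepSpeed is enabled).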
    
composer.utils.checkpoint.load_checkpoint(path_format, state, object_store=None, load_weights_only=False, strict_model_weights=False, chunk_size=1048576, progress_bar=True)[source]#

Load a checkpoint from a local file, URI, or cloud object store into state.

Parameters
  • path_format (str) –

    The path format string to an existing checkpoint file.

    It can be a path to a file on the local disk, a URL, or if object_store is set, the object name for a checkpoint in a cloud bucket.

    When using DeepSpeed ZeRO, checkpoints are sharded by rank. Instead of hard-coding the rank in the path_format, use the following format variables:

    Variable        Description
    --------------  ----------------------------------------------------------------
    {rank}          The global rank, as returned by get_global_rank().
    {local_rank}    The local rank of the process, as returned by get_local_rank().
    {node_rank}     The node rank, as returned by get_node_rank().

    For example, suppose that checkpoints are stored in the following structure:

    my_model/ep1-rank0.tar
    my_model/ep1-rank1.tar
    my_model/ep1-rank2.tar
    ...
    

    Then, path_format should be set to my_model/ep1-rank{rank}.tar, and all ranks will load the correct state.

  • state (State) – The State to load the checkpoint into.

  • object_store (ObjectStoreProvider, optional) – If the path_format is in an object store (e.g., AWS S3 or Google Cloud Storage), an instance of ObjectStoreProvider which will be used to retrieve the checkpoint. Otherwise, if the checkpoint is a local filepath, set to None. (default: None)

  • load_weights_only (bool, optional) – Whether or not to only restore the model weights from the checkpoint without restoring the associated state. (default: False)

  • strict_model_weights (bool, optional) – Whether or not to force that the checkpointed weights must exactly match the model weights. (default: False)

  • chunk_size (int, optional) – Chunk size (in bytes) to use when downloading checkpoints. Ignored if the checkpoint is a local file path. (default: 1_048_576 bytes (1 MB))

  • progress_bar (bool, optional) – Whether or not to show a progress bar when downloading checkpoints. Ignored if the checkpoint is a local file path. (default: True)

Returns

Optional[List[types.StateDict]] – The RNG state dicts, indexed by global rank, if load_weights_only is False. Otherwise, None.
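
The following is a minimal usage sketch, not a definitive recipe: the file and object names are hypothetical, and state and object_store are assumed to have been created elsewhere (a State from a Trainer and a configured ObjectStoreProvider, respectively).

    from composer.utils.checkpoint import load_checkpoint

    # `state` is assumed to be an existing State (e.g. the state attached to a
    # Trainer), and `object_store` an already-configured ObjectStoreProvider;
    # both are created elsewhere and shown here only as placeholders.

    # Checkpoint on the local disk. With DeepSpeed ZeRO, keep the {rank}
    # format variable so each rank loads its own shard.
    rng_state_dicts = load_checkpoint("my_model/ep1-rank{rank}.tar", state)

    # Checkpoint stored in a cloud bucket, downloaded via the object store.
    # The object name below is hypothetical.
    load_checkpoint(
        "checkpoints/ep1-rank{rank}.tar",
        state,
        object_store=object_store,
        load_weights_only=True,     # restore only the model weights
        strict_model_weights=True,  # weights must exactly match the model
        progress_bar=False,         # disable the download progress bar
    )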

composer.utils.checkpoint.save_checkpoint(state, name_format='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False)[source]#

Checkpoint the training state.

Parameters
  • state (State) – The current State of the trainer.

  • name_format (str) –

    A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')

    See format_name() for the available format variables.

    Note

    • By default, only the rank zero process will save a checkpoint file.

    • When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within name_format. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.

    • To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bz2', or '.tar.lzma' (depending on the desired compression algorithm).

    Warning

    Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.

    Consider the following scenario, where:

    • The default name_format='ep{epoch}-ba{batch}-rank{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'ep1-ba42-rank0'. When DeepSpeed is being used, each rank (process) will save checkpoints to:

    ep1-ba42-rank0.tar
    ep1-ba42-rank1.tar
    ep1-ba42-rank2.tar
    ...
    

  • weights_only (bool, optional) –

    If True, save only the model weights instead of the entire training state. (default: False)

    Note

    When using DeepSpeed, this parameter must be False. Weights-only checkpointing is not currently compatible with DeepSpeed.

Returns

List[pathlib.Path] – The list of checkpoint files saved, indexed by the rank of the process.

Note

When using DeepSpeed, each process (rank) saves its own checkpoint file. When doing multi-node training, the filepaths are valid only on each process's node; Composer does not move checkpoint files between nodes.

Otherwise, when not using DeepSpeed, this list will contain only one filepath, since only the rank zero process saves checkpoints.
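
A minimal usage sketch, under the assumption that state is an existing State (e.g. taken from a Trainer) and that the counts match the scenario above; the compressed file name is purely illustrative.

    from composer.utils.checkpoint import save_checkpoint

    # `state` is assumed to be an existing State (e.g. taken from a Trainer).

    # Default behavior: the full training state is saved, and without DeepSpeed
    # only the rank zero process writes a file such as 'ep1-ba42-rank0'.
    paths = save_checkpoint(state)

    # Weights-only checkpoint with gzip compression selected via the file
    # extension. Note that weights_only=True is not supported with DeepSpeed.
    paths = save_checkpoint(
        state,
        name_format="ep{epoch}-ba{batch}-rank{rank}.tar.gz",
        weights_only=True,
    )
    # `paths` lists the checkpoint file(s) saved (see Returns above).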