composer.trainer.trainer
Train models!
The trainer supports models that are ComposerModel instances.
The Trainer is highly customizable and can support a wide variety of workloads.
Example
Train a model and save a checkpoint:
import os
from composer import Trainer
# Create a trainer
trainer = Trainer(
model=model,
train_dataloader=train_dataloader,
max_duration="1ep",
eval_dataloader=eval_dataloader,
optimizers=optimizer,
schedulers=scheduler,
device="cpu",
validate_every_n_epochs=1,
save_folder="checkpoints",
save_filename="ep{epoch}.pt",
save_interval="1ep",
save_overwrite=True,
)
# Fit and run evaluation for 1 epoch.
# Save a checkpoint after 1 epoch as specified during trainer creation.
trainer.fit()
Load the checkpoint and resume training:
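# Get the saved checkpoint filepath.
# Each entry of saved_checkpoints is a (timestamp, filepaths) tuple, so take
# the first filepath of the latest entry.
checkpoint_path = trainer.saved_checkpoints.pop()[1][0]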
# Create a new trainer with the `load_path` argument set to the checkpoint path.
trainer = Trainer(
model=model,
train_dataloader=train_dataloader,
max_duration="2ep",
eval_dataloader=eval_dataloader,
optimizers=optimizer,
schedulers=scheduler,
device="cpu",
validate_every_n_epochs=1,
load_path=checkpoint_path,
)
# Continue training and running evaluation where the previous trainer left off
# until the new max_duration is reached.
# In this case it will be one additional epoch to reach 2 epochs total.
trainer.fit()
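The save_filename value used above, "ep{epoch}.pt", is a format string, as is the load_path pattern with {rank} described in the parameter list below. As a rough illustration of how such templates expand, the sketch below uses plain Python str.format. The helper name expand is hypothetical, and this is not Composer's internal formatting code.

```python
# Illustration only: plain str.format, not Composer's internal formatting logic.
save_filename = "ep{epoch}.pt"
load_path = "my_model/ep1-rank{rank}.tar"


def expand(template: str, **values: int) -> str:
    """Fill the named placeholders of a format string (hypothetical helper)."""
    return template.format(**values)


# Name of the checkpoint written after the first epoch.
checkpoint_name = expand(save_filename, epoch=1)

# One sharded checkpoint path per rank, for three ranks.
per_rank_paths = [expand(load_path, rank=rank) for rank in range(3)]

print(checkpoint_name)
print(per_rank_paths)
```

Keeping the rank as a placeholder rather than a hard-coded number lets every process derive its own path from one shared setting.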
Classes
Trainer: Trainer for training models with Composer algorithms.
- class composer.trainer.trainer.Trainer(*, model, train_dataloader, max_duration, eval_dataloader=None, algorithms=None, optimizers=None, schedulers=None, device=None, grad_accum=1, grad_clip_norm=None, validate_every_n_batches=-1, validate_every_n_epochs=1, compute_training_metrics=False, precision=Precision.FP32, scale_schedule_ratio=1.0, step_schedulers_every_batch=None, dist_timeout=300.0, ddp_sync_strategy=None, seed=None, deterministic_mode=False, run_name=None, loggers=None, callbacks=(), progress_bar=True, log_to_console=None, console_log_level=LogLevel.EPOCH, console_stream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, load_path=None, load_object_store=None, load_weights_only=False, load_strict=False, load_chunk_size=1048576, load_progress_bar=True, save_folder=None, save_filename='ep{epoch}-ba{batch}-rank{rank}', save_artifact_name='{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}', save_latest_filename='latest-rank{rank}', save_latest_artifact_name='{run_name}/checkpoints/latest-rank{rank}', save_overwrite=False, save_interval='1ep', save_weights_only=False, save_num_checkpoints_to_keep=-1, train_subset_num_batches=None, eval_subset_num_batches=None, deepspeed_config=False, prof_trace_handlers=None, prof_schedule=None, sys_prof_cpu=True, sys_prof_memory=False, sys_prof_disk=False, sys_prof_net=False, sys_prof_stats_thread_interval_seconds=0.5, torch_prof_folder='{run_name}/torch_traces', torch_prof_filename='rank{rank}.{batch}.pt.trace.json', torch_prof_artifact_name='{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json', torch_prof_overwrite=False, torch_prof_use_gzip=False, torch_prof_record_shapes=False, torch_prof_profile_memory=True, torch_prof_with_stack=False, torch_prof_with_flops=True, torch_prof_num_traces_to_keep=-1)
Trainer for training models with Composer algorithms. See the Trainer guide for more information.
- Parameters
model (ComposerModel) – The model to train. Can be user-defined or one of the models included with Composer.
See also: composer.models for models built into Composer.
train_dataloader (DataLoader, DataSpec, or dict) –
The DataLoader, DataSpec, or dict of DataSpec kwargs for the training data. In order to specify custom preprocessing steps on each data batch, specify a DataSpec instead of a DataLoader.
Note: The train_dataloader should yield per-rank batches. Each per-rank batch will then be further divided based on the grad_accum parameter. For example, if the desired optimization batch size is 2048 and training is happening across 8 GPUs, then each train_dataloader should yield a batch of size 2048 / 8 = 256. If grad_accum = 2, then the per-rank batch will be divided into microbatches of size 256 / 2 = 128.
max_duration (int, str, or Time) – The maximum duration to train. Can be an integer, which will be interpreted to be epochs, a str (e.g. 1ep or 10ba), or a Time object.
eval_dataloader (DataLoader | DataSpec | Evaluator | Sequence[Evaluator], optional) –
The DataLoader, DataSpec, Evaluator, or sequence of evaluators for the evaluation data.
To evaluate one or more specific metrics across one or more datasets, pass in an Evaluator. If a DataSpec or DataLoader is passed in, then all metrics returned by model.metrics() will be used during evaluation. None results in no evaluation. (default: None)
algorithms (Algorithm | Sequence[Algorithm], optional) –
The algorithms to use during training. If None, then no algorithms will be used. (default: None)
See also: composer.algorithms for the different algorithms built into Composer.
optimizers (Optimizer, optional) –
The optimizer. If None, will be set to DecoupledSGDW(model.parameters(), lr=0.1). (default: None)
See also: composer.optim for the different optimizers built into Composer.
schedulers (PyTorchScheduler | ComposerScheduler | Sequence[PyTorchScheduler | ComposerScheduler], optional) –
The learning rate schedulers. If [] or None, the learning rate will be constant. (default: None)
See also: composer.optim.scheduler for the different schedulers built into Composer.
device (str or Device, optional) – The device to use for training. Either cpu or gpu. (default: cpu)
grad_accum (Union[int, str], optional) –
The number of microbatches to split a per-device batch into. Gradients are summed over the microbatches per device. If set to auto, grad_accum is increased dynamically when a microbatch is too large for the GPU. (default: 1)
Note: This is implemented by taking the batch yielded by the train_dataloader and splitting it into grad_accum sections. Each section is of size batch_size // grad_accum. If the batch size of the dataloader is not divisible by grad_accum, then the last section will be of size batch_size % grad_accum.
grad_clip_norm (float, optional) – The norm to clip gradient magnitudes to. Set to None for no gradient clipping. (default: None)
validate_every_n_batches (int, optional) – Compute metrics on evaluation data every N batches. Set to -1 to never validate on a batchwise frequency. (default: -1)
validate_every_n_epochs (int, optional) – Compute metrics on evaluation data every N epochs. Set to -1 to never validate on an epochwise frequency. (default: 1)
compute_training_metrics (bool, optional) –
True to compute metrics on training data and False to not. (default: False)
precision (str or Precision, optional) –
Numerical precision to use for training. One of fp32, fp16 or amp (recommended). (default: Precision.FP32)
Note: fp16 only works if deepspeed_config is also provided.
scale_schedule_ratio (float, optional) –
Ratio by which to scale the training duration and learning rate schedules. E.g., 0.5 makes the schedule take half as many epochs and 2.0 makes it take twice as many epochs. 1.0 means no change. (default: 1.0)
Note: Training for less time, while rescaling the learning rate schedule, is a strong baseline approach to speeding up training. E.g., training for half duration often yields minor accuracy degradation, provided that the learning rate schedule is also rescaled to take half as long.
To see the difference, consider training for half as long using a cosine annealing learning rate schedule. If the schedule is not rescaled, training ends while the learning rate is still ~0.5 of the initial LR. If the schedule is rescaled with scale_schedule_ratio, the LR schedule would finish the entire cosine curve, ending with a learning rate near zero.
step_schedulers_every_batch (bool, optional) – By default, native PyTorch schedulers are updated every epoch, while Composer Schedulers are updated every step. Setting this to True will force schedulers to be stepped every batch, while False means schedulers are stepped every epoch. None indicates the default behavior. (default: None)
dist_timeout (float, optional) – Timeout, in seconds, for initializing the distributed process group. (default: 300.0)
ddp_sync_strategy (str or DDPSyncStrategy, optional) – The strategy to use for synchronizing gradients. Leave unset to let the trainer auto-configure this. See DDPSyncStrategy for more details.
seed (int, optional) –
The seed used in randomization. If None, then a random seed will be created. (default: None)
Note: In order to get reproducible results, call the seed_all() function at the start of your script with the seed passed to the trainer. This will ensure any initialization done before the trainer init (e.g. model weight initialization) also uses the provided seed.
See also: composer.utils.reproducibility for more details on reproducibility.
deterministic_mode (bool, optional) –
Run the model deterministically. (default: False)
Note: This is an experimental feature. Performance degradations are expected. Certain Torch modules may not have deterministic implementations, which will result in a crash.
Note: In order to get reproducible results, call the configure_deterministic_mode() function at the start of your script. This will ensure any initialization done before the trainer init also runs deterministically.
See also: composer.utils.reproducibility for more details on reproducibility.
run_name (str, optional) –
A name for this training run. If not specified, one will be generated automatically.
loggers (LoggerDestination | Sequence[LoggerDestination], optional) –
The destinations to log training information to. If None, will be set to [ProgressBarLogger()]. (default: None)
See also: composer.loggers for the different loggers built into Composer.
progress_bar (bool, optional) – Whether to show a progress bar. (default: True)
log_to_console (bool, optional) –
Whether to print logging statements to the console. (default: None)
The default behavior (when set to None) only prints logging statements when progress_bar is False.
console_log_level (LogLevel | str | (State, LogLevel) -> bool, optional) –
The maximum log level which should be printed to the console. (default: LogLevel.EPOCH)
It can either be LogLevel, a string corresponding to a LogLevel, or a callable that takes the training State and the LogLevel and returns a boolean of whether this statement should be printed.
This parameter has no effect if log_to_console is False, or is unspecified and progress_bar is True.
console_stream (TextIO | str, optional) – The stream to write to. If a string, it can either be 'stdout' or 'stderr'. (default: sys.stderr)
callbacks (Callback | Sequence[Callback], optional) –
The callbacks to run during training. If None, then no callbacks will be run. (default: None)
See also: composer.callbacks for the different callbacks built into Composer.
load_path (str, optional) –
The path format string to an existing checkpoint file.
It can be a path to a file on the local disk, a URL, or if load_object_store is set, the object name for a checkpoint in a cloud bucket.
When using DeepSpeed ZeRO, checkpoints are sharded by rank. Instead of hard-coding the rank in the path, use the following format variables:
Variable | Description
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{node_rank} | The node rank, as returned by get_node_rank().
For example, suppose that checkpoints are stored in the following structure:
my_model/ep1-rank0.tar
my_model/ep1-rank1.tar
my_model/ep1-rank2.tar
...
Then, load_path should be set to my_model/ep1-rank{rank}.tar, and all ranks will load the correct state.
If None then no checkpoint will be loaded. (default: None)
load_object_store (ObjectStore, optional) –
If the load_path is in an object store (e.g. AWS S3 or Google Cloud Storage), an instance of ObjectStore which will be used to retrieve the checkpoint. Otherwise, if the checkpoint is a local filepath, set to None. Ignored if load_path is None. (default: None)
Example:
from composer import Trainer
from composer.utils import ObjectStore
# Create the object store provider with the specified credentials
creds = {"key": "object_store_key", "secret": "object_store_secret"}
store = ObjectStore(provider="s3", container="my_container", provider_kwargs=creds)
checkpoint_path = "./path_to_the_checkpoint_in_object_store"
# Create a trainer which will load a checkpoint from the specified object store
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="10ep",
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    schedulers=scheduler,
    device="cpu",
    validate_every_n_epochs=1,
    load_path=checkpoint_path,
    load_object_store=store,
)
load_weights_only (bool, optional) – Whether or not to only restore the weights from the checkpoint without restoring the associated state. Ignored if load_path is None. (default: False)
load_strict (bool, optional) – Ensure that the set of weights in the checkpoint and the model exactly match. Ignored if load_path is None. (default: False)
load_chunk_size (int, optional) – Chunk size (in bytes) to use when downloading checkpoints. Ignored if load_path is either None or a local file path. (default: 1,048,576)
load_progress_bar (bool, optional) – Display the progress bar for downloading the checkpoint. Ignored if load_path is either None or a local file path. (default: True)
save_folder (str, optional) –
Format string for the folder where checkpoints are saved. If None, checkpoints will not be saved. (default: None)
Note: For fine-grained control on checkpoint saving (e.g. to save different types of checkpoints at different intervals), leave this parameter as None, and instead pass instance(s) of CheckpointSaver directly as callbacks.
save_filename (str, optional) –
A format string describing how to name checkpoints. This parameter has no effect if save_folder is None. (default: "ep{epoch}-ba{batch}-rank{rank}")
save_artifact_name (str, optional) –
A format string describing how to name checkpoints in loggers. This parameter has no effect if save_folder is None. (default: "{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}")
save_latest_filename (str, optional) –
A format string for the name of a symlink (relative to save_folder) that points to the last saved checkpoint. This parameter has no effect if save_folder is None. To disable symlinking, set this to None. (default: "latest-rank{rank}")
save_latest_artifact_name (str, optional) –
A format string describing how to name symlinks in loggers. This parameter has no effect if save_folder, save_latest_filename, or save_artifact_name is None. To disable symlinking in loggers, set this or save_latest_filename to None. (default: "{run_name}/checkpoints/latest-rank{rank}")
save_overwrite (bool, optional) –
Whether existing checkpoints should be overwritten. This parameter has no effect if save_folder is None. (default: False)
save_interval (Time | str | int | (State, Event) -> bool) –
A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved. This parameter has no effect if save_folder is None. (default: '1ep')
save_weights_only (bool, optional) –
Whether to save only the model weights instead of the entire training state. This parameter has no effect if save_folder is None. (default: False)
save_num_checkpoints_to_keep (int, optional) –
The number of checkpoints to keep locally. The oldest checkpoints are removed first. Set to -1 to keep all checkpoints locally. (default: -1)
Checkpoints will be removed after they have been logged as a file artifact. For example, when checkpoint saving is used in conjunction with the ObjectStoreLogger, set this parameter to 0 to immediately delete checkpoints from the local disk after they have been uploaded to the object store.
This parameter only controls how many checkpoints are kept locally; checkpoints are not deleted from artifact stores.
train_subset_num_batches (int, optional) – If specified, finish every epoch early after training on this many batches. This parameter has no effect if it is greater than len(train_dataloader). If None, then the entire dataloader will be iterated over. (default: None)
eval_subset_num_batches (int, optional) – If specified, evaluate on this many batches. This parameter has no effect if it is greater than len(eval_dataloader). If None, then the entire dataloader will be iterated over. (default: None)
deepspeed_config (bool or Dict[str, Any], optional) – Configuration for DeepSpeed, formatted as a JSON according to DeepSpeed's documentation. If True is provided, the trainer will initialize the DeepSpeed engine with an empty config {}. If False is provided, DeepSpeed will not be used. (default: False)
prof_schedule ((State) -> ProfilerAction, optional) –
The profiler scheduler.
Must be specified in conjunction with prof_trace_handlers to use the profiler.
Example:
from composer.trainer import Trainer
from composer.profiler import JSONTraceHandler, cyclic_schedule
trainer = Trainer(
    ...,
    prof_trace_handlers=JSONTraceHandler(
        folder='traces',
    ),
    prof_schedule=cyclic_schedule(
        skip_first=1,
        wait=0,
        warmup=1,
        active=4,
        repeat=1,
    ),
)
See also: composer.profiler for more details on profiling with the trainer.
prof_trace_handlers (TraceHandler | Sequence[TraceHandler], optional) –
Profiler trace handlers.
Must be specified in conjunction with prof_schedule to use the profiler.
See also: composer.profiler for more details on profiling with the trainer.
sys_prof_cpu (bool, optional) – Whether to record CPU statistics. Ignored if prof_schedule and prof_trace_handlers are not specified. (default: True)
sys_prof_memory (bool, optional) – Whether to record memory statistics. Ignored if prof_schedule and prof_trace_handlers are not specified. (default: False)
sys_prof_disk (bool, optional) – Whether to record disk statistics. Ignored if prof_schedule and prof_trace_handlers are not specified. (default: False)
sys_prof_net (bool, optional) – Whether to record network statistics. Ignored if prof_schedule and prof_trace_handlers are not specified. (default: False)
sys_prof_stats_thread_interval_seconds (float, optional) – Interval to record stats, in seconds. Ignored if prof_schedule and prof_trace_handlers are not specified. (default: 0.5)
torch_prof_folder (str, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_filename (str, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_artifact_name (str, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_overwrite (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_use_gzip (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_record_shapes (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_profile_memory (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_with_stack (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_with_flops (bool, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
torch_prof_num_traces_to_keep (int, optional) – See TorchProfiler. Ignored if prof_schedule and prof_trace_handlers are not specified.
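The batch-size arithmetic in the train_dataloader note (2048 / 8 = 256 per rank, then 256 / 2 = 128 per microbatch) can be sketched in plain Python. The function names below are hypothetical and this is not Composer code. It covers only the evenly divisible case; the uneven split described in the grad_accum note is not modelled here.

```python
def per_rank_batch_size(global_batch_size: int, world_size: int) -> int:
    """Batch size that each rank's dataloader should yield (hypothetical helper)."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by the number of ranks")
    return global_batch_size // world_size


def microbatch_size(per_rank_batch: int, grad_accum: int) -> int:
    """Size of each microbatch when a per-rank batch is split evenly (hypothetical helper)."""
    if per_rank_batch % grad_accum != 0:
        raise ValueError("this sketch only handles evenly divisible batches")
    return per_rank_batch // grad_accum


# The example from the note: optimization batch size 2048 across 8 GPUs, grad_accum = 2.
per_rank = per_rank_batch_size(2048, 8)
micro = microbatch_size(per_rank, 2)
print(per_rank, micro)
```

In this picture, the dataloader's batch_size would be set to the per-rank value, while grad_accum only changes how that batch is split before gradients are summed.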
- close()
Shut down the trainer.
See also: Engine.close() for additional information.
- property deepspeed_enabled
True if DeepSpeed is being used for training and False otherwise.
- eval(log_level=LogLevel.FIT)
Evaluate the model on the provided evaluation data and log appropriate metrics.
- Parameters
log_level (LogLevel, optional) – The log level to use for metric logging during evaluation. Defaults to FIT.
- save_checkpoint(name='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False)
Checkpoint the training State.
- Parameters
name (str, optional) – See save_checkpoint().
weights_only (bool, optional) – See save_checkpoint().
- Returns
List[pathlib.Path] – See save_checkpoint().
- property saved_checkpoints
The checkpoint timestamps and filepaths.
This list contains tuples of the save timestamp and the checkpoint filepaths. This list will have at most save_num_checkpoints_to_keep entries. The latest checkpoint will be at the end.
Note: When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process's (rank's) node.
Otherwise, when not using DeepSpeed, each sub-list will contain only one filepath since only rank zero saves checkpoints.
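As a minimal sketch of navigating the structure described above, the snippet below builds stand-in data shaped like saved_checkpoints and picks the filepath of the latest checkpoint for a given rank. The data and the helper latest_checkpoint_path are hypothetical; plain strings stand in for the timestamp objects, and no Composer API is called.

```python
from pathlib import Path
from typing import List, Tuple

# Hypothetical stand-in for trainer.saved_checkpoints: a list of
# (timestamp, filepaths) tuples, oldest first. Strings replace timestamp objects.
saved_checkpoints: List[Tuple[str, List[Path]]] = [
    ("ep1", [Path("checkpoints/ep1.pt")]),
    ("ep2", [Path("checkpoints/ep2.pt")]),
]


def latest_checkpoint_path(entries: List[Tuple[str, List[Path]]], rank: int = 0) -> Path:
    """Return the filepath written by `rank` for the most recent checkpoint."""
    if not entries:
        raise ValueError("no checkpoints have been saved")
    _timestamp, filepaths = entries[-1]  # the latest checkpoint is at the end
    return filepaths[rank]


print(latest_checkpoint_path(saved_checkpoints))
```

Without DeepSpeed each filepath list holds a single entry, so rank 0 is the only valid index; with DeepSpeed the index would be the global rank of the process that wrote the file.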