composer.profiler#
Modules
- Profiler to measure the time it takes the data loader to return a batch.
- Outputs profiling data in JSON trace format.
- Merge trace files together.
- Example usage and definition of hparams.
- Profiler to record system level metrics.
- Profiler to collect torch.profiler events.
Performance profiling tools.
The profiler gathers performance metrics during a training run that can be used to diagnose bottlenecks and facilitate model development.
The metrics gathered include:
- Duration of each Event during training
- Time taken by the data loader to return a batch
- Host metrics such as CPU, system memory, disk and network utilization over time
- Execution order, latency and attributes of PyTorch operators and GPU kernels (see torch.profiler)
The following example demonstrates how to set up and perform profiling on a simple training run.
# Copyright 2021 MosaicML. All Rights Reserved.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.models import MNIST_Classifier

# Specify Dataset and Instantiate DataLoader
batch_size = 2048
data_directory = "../data"

mnist_transforms = transforms.Compose([transforms.ToTensor()])

train_dataset = datasets.MNIST(data_directory, train=True, download=True, transform=mnist_transforms)
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=False,
                              drop_last=True,
                              pin_memory=True,
                              persistent_workers=True,
                              num_workers=8)

# Instantiate Model
model = MNIST_Classifier(num_classes=10)

# Instantiate the trainer
profiler_trace_file = "profiler_traces.json"
torch_trace_dir = "torch_profiler"

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=train_dataloader,
                  max_duration=2,
                  device="gpu",
                  validate_every_n_batches=-1,
                  validate_every_n_epochs=-1,
                  precision="amp",
                  train_subset_num_batches=16,
                  profiler_trace_file=profiler_trace_file,
                  prof_skip_first=0,
                  prof_wait=0,
                  prof_warmup=1,
                  prof_active=4,
                  prof_repeat=1,
                  torch_profiler_trace_dir=torch_trace_dir)

# Run training
trainer.fit()
To enable profiling, an output profiler_trace_file must be specified during Trainer initialization. The profiler_trace_file will contain the profiling trace data once the profiling run completes. By default, the Profiler, DataLoaderProfiler and SystemProfiler will be active. The TorchProfiler is disabled by default.

To activate the TorchProfiler, the torch_profiler_trace_dir must be specified in addition to the profiler_trace_file argument. The torch_profiler_trace_dir will contain the Torch Profiler traces once the profiling run completes. The Profiler will automatically merge the Torch traces in the torch_profiler_trace_dir into the profiler_trace_file, allowing users to view a unified trace.

The complete traces can be viewed in a Google Chrome browser by navigating to chrome://tracing and loading the profiler_trace_file.
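Because the merged trace is plain JSON in the Chrome trace event format, it can also be inspected programmatically. The following sketch is illustrative only; it reuses the file name from the example above, and the handling of both a top-level list and a "traceEvents" key is an assumption about the on-disk layout:

# Illustrative sketch: inspect the merged trace with the standard library.
# Chrome-format traces store events either as a top-level list or under a
# "traceEvents" key; both cases are handled here as an assumption.
import json

with open("profiler_traces.json") as f:
    trace = json.load(f)

events = trace["traceEvents"] if isinstance(trace, dict) else trace
print(f"Loaded {len(events)} trace events")
for event in events[:5]:
    # Each Chrome trace event typically has a name, a phase ("ph"), and a timestamp ("ts").
    print(event.get("name"), event.get("ph"), event.get("ts"))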
Additional details can be found in the Profiler Guide.
Classes
- Marker: Record when something happens or how long something takes.
- ProfilerAction: Action states for the Profiler that define whether or not events are being recorded to the trace file.
- ProfilerEventHandler: Base class for profiler event handlers.
- class composer.profiler.Marker(profiler, name, actions, record_instant_on_start, record_instant_on_finish, categories)#
Record when something happens or how long something takes.
Used by the Engine to measure the duration of an Event during training.

Note

Marker should not be instantiated directly; instead use Profiler.marker().

Markers can record the following types of events:

- Duration: Records the start and stop time of an event of interest (Marker.start(), Marker.finish()).
- Instant: Records the time a particular event occurs, but not the full duration (Marker.instant()).
- Counter: Records the value of a variable at a given time (Marker.counter()).

A Marker can also be used as a context manager or decorator to record a duration.

Use a Marker with a context manager:

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("foo")
>>> with marker:
...     something_to_measure()
something_to_measure

Use a Marker as a decorator:

>>> marker = profiler.marker("foo")
>>> @marker
... def something_to_measure():
...     print("something_to_measure")
>>> something_to_measure()
something_to_measure
- counter(values)[source]#
Record a counter event.
To record a counter event:
>>> marker = profiler.marker("foo")
>>> counter_event = 5
>>> marker.counter({"counter_event": counter_event})
>>> counter_event = 10
>>> marker.counter({"counter_event": counter_event})
- finish()[source]#
Record the end of a duration event.
See Marker.start() for a usage example.
- instant()[source]#
Record an instant event.
To record an instant event:
>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("instant")
>>> marker.instant()
>>> something_to_measure()
something_to_measure
- start()[source]#
Record the start of a duration event.
To record the duration of an event, invoke Marker.start() followed by Marker.finish():

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("foo")
>>> marker.start()
>>> something_to_measure()
something_to_measure
>>> marker.finish()
- class composer.profiler.Profiler(state, event_handlers=(), skip_first=0, wait=0, warmup=1, active=4, repeat=1, merged_trace_file='merged_profiler_trace.json')#
Records the duration of Trainer Events using the Marker API.

Specifically, it records:

- The duration of each section of the training loop, such as the time it takes to perform a forward pass, backward pass, batch, epoch, etc.
- The latency each algorithm and callback adds when executing on each event.

The event_handlers then record and save this data to a trace file. If no event_handlers are specified, the JSONTraceHandler is used by default.

Note

The Composer Trainer creates an instance of Profiler when merged_trace_file is provided. The user should not create and directly register an instance of Profiler when using the Composer Trainer.

- Parameters
state (State) – The state.
event_handlers (Sequence[ProfilerEventHandler]) – Event handlers which record and save profiling data to traces.
skip_first (int, optional) – Number of batches to skip profiling at epoch start. Defaults to 0.
wait (int, optional) – For each profiling cycle, number of batches to skip at the beginning of the cycle. Defaults to 0.
warmup (int, optional) – For each profiling cycle, number of batches to be in the warmup state after skipping wait batches. Defaults to 1.
active (int, optional) – For each profiling cycle, number of batches to record after warming up. Defaults to 4.
repeat (int, optional) – Number of profiling cycles to perform per epoch. Set to 0 to record the entire epoch. Defaults to 1.
merged_trace_file (str, optional) – Name of the trace file, relative to the run directory. Defaults to merged_profiler_trace.json.
- property event_handlers#
Profiler event handlers.
- get_action(batch_idx)[source]#
Get the current ProfilerAction for the profiler, based upon the parameters skip_first, wait, warmup, active, and repeat.

The profiler skips the first skip_first batches in every epoch. Then, it performs a cycle of skipping wait batches, warming up for warmup batches, and recording active batches. It repeats this cycle up to repeat times per epoch (or for the entire epoch, if repeat is 0). This logic repeats every epoch.

- Parameters
batch_idx (int) – The index of the current batch.
- Returns
ProfilerAction – The current action.
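To make the cycle concrete, here is a small standalone sketch (not the library's implementation) of how these parameters map a batch index to an action. With the default values, batch 0 is a warmup batch, batches 1 through 4 are recorded, and the rest of the epoch is skipped:

# Standalone sketch of the skip_first / wait / warmup / active / repeat cycle.
# This mirrors the documented behavior; it is not the library's implementation.
def action_for_batch(batch_idx, skip_first=0, wait=0, warmup=1, active=4, repeat=1):
    if batch_idx < skip_first:
        return "skip"
    cycle_len = wait + warmup + active
    pos = batch_idx - skip_first
    if repeat > 0 and pos >= cycle_len * repeat:
        return "skip"  # all requested cycles for this epoch are complete
    pos_in_cycle = pos % cycle_len
    if pos_in_cycle < wait:
        return "skip"
    if pos_in_cycle < wait + warmup:
        return "warmup"
    return "active"

print([action_for_batch(i) for i in range(8)])
# ['warmup', 'active', 'active', 'active', 'active', 'skip', 'skip', 'skip']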
- marker(name, actions=(<ProfilerAction.WARMUP: 'warmup'>, <ProfilerAction.ACTIVE: 'active'>), record_instant_on_start=False, record_instant_on_finish=False, categories=())[source]#
Create and get an instance of a Marker.

If a Marker with the specified name does not already exist, it will be created. Otherwise, the existing instance will be returned.

For example:
>>> marker = profiler.marker("foo")
>>> marker
<composer.profiler.Marker object at ...>
Note

Profiler.marker() should be used to construct markers. Marker should not be instantiated directly by the user.

Please see Marker.start() and Marker.finish() for usage on creating markers to measure duration events, Marker.instant() for usage on creating markers to mark instant events, and Marker.counter() for usage on creating markers for counting.

- Parameters
name (str) – The name of the Marker.
actions (Sequence[ProfilerAction], optional) – ProfilerAction states to record on. Defaults to (ProfilerAction.WARMUP, ProfilerAction.ACTIVE).
record_instant_on_start (bool, optional) – Whether to record an instant event whenever the marker is started. Defaults to False.
record_instant_on_finish (bool, optional) – Whether to record an instant event whenever the marker is finished. Defaults to False.
categories (Union[List[str], Tuple[str, ...]], optional) – Categories for this marker. Defaults to ().
- Returns
Marker – Instance of Marker.
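As an illustration (the category below is hypothetical), a marker can be created that records a duration and also emits instant events at its start and finish:

# Illustrative example: a marker that records a duration and also emits
# instant events when it starts and finishes, tagged with a category.
marker = profiler.marker(
    "foo",
    record_instant_on_start=True,
    record_instant_on_finish=True,
    categories=["example"],
)
with marker:
    pass  # code to measure goes here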
- class composer.profiler.ProfilerAction(value)#
Bases: composer.utils.string_enum.StringEnum

Action states for the Profiler that define whether or not events are being recorded to the trace file.

- SKIP#
Do not record new events to the trace. Any events started during ACTIVE or WARMUP will be recorded upon finish.
- WARMUP#
Record all events to the trace except those requiring a warmup period to initialize data structures (e.g., torch.profiler).
- ACTIVE#
Record all events to the trace.
- class composer.profiler.ProfilerEventHandler#
Bases: composer.core.callback.Callback, abc.ABC

Base class for profiler event handlers.

Event handlers are responsible for logging trace Markers and saving them to a file in a given trace format for viewing.

Subclasses should implement process_duration_event(), process_instant_event(), and process_counter_event(). These methods are invoked by the Profiler whenever there is an event to record.

Since ProfilerEventHandler subclasses Callback, event handlers can run on Events (such as on Event.INIT to open files or on Event.BATCH_END to periodically dump data to files) and use Callback.close() to perform any cleanup.
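For example, a minimal custom handler might simply print each event as it arrives. The sketch below only assumes the three process_* signatures documented here and is not a replacement for the built-in JSONTraceHandler:

# Minimal sketch of a custom event handler that prints every event.
# A real handler would buffer events and write them to a trace file.
from composer.profiler import ProfilerEventHandler

class PrintingEventHandler(ProfilerEventHandler):

    def process_duration_event(self, name, categories, is_start, timestamp,
                               wall_clock_time_ns, global_rank, pid):
        phase = "start" if is_start else "finish"
        print(f"[duration/{phase}] {name} at {wall_clock_time_ns} ns on rank {global_rank}")

    def process_instant_event(self, name, categories, timestamp,
                              wall_clock_time_ns, global_rank, pid):
        print(f"[instant] {name} at {wall_clock_time_ns} ns on rank {global_rank}")

    def process_counter_event(self, name, categories, wall_clock_time_ns,
                              global_rank, pid, values):
        print(f"[counter] {name}: {values}")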
- process_counter_event(name, categories, wall_clock_time_ns, global_rank, pid, values)[source]#
Called by the Profiler whenever there is a counter event to record.
- Parameters
name (str) – The name of the event.
categories (List[str] | Tuple[str, ...]) – The categories for the event.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.
values (Dict[str, int | float]) – The values corresponding to this counter event.
- process_duration_event(name, categories, is_start, timestamp, wall_clock_time_ns, global_rank, pid)[source]#
Called by the Profiler whenever there is a duration event to record.

This method is called twice for each duration event: once with is_start = True, and then again with is_start = False. Interleaving events are not permitted. Specifically, for each event (identified by the name), a call with is_start = True will be followed by a call with is_start = False before another call with is_start = True.

- Parameters
name (str) – The name of the event.
categories (Union[List[str], Tuple[str, ...]]) – The categories for the event.
is_start (bool) – Whether the event is a start event or end event.
timestamp (Timestamp) – Snapshot of the training time.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.
- process_instant_event(name, categories, timestamp, wall_clock_time_ns, global_rank, pid)[source]#
Called by the Profiler whenever there is an instant event to record.
- Parameters
name (str) – The name of the event.
categories (List[str] | Tuple[str, ...]) – The categories for the event.
timestamp (Timestamp) – Snapshot of current training time.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.