composer.profiler

Modules

`composer.profiler.dataloader_profiler`	Profiler to measure the time it takes the data loader to return a batch.
`composer.profiler.json_trace`	Outputs profiling data in JSON trace format.
`composer.profiler.json_trace_merger`	Merge trace files together.
`composer.profiler.profiler_hparams`	Example usage and definition of hparams.
`composer.profiler.system_profiler`	Profiler to record system level metrics.
`composer.profiler.torch_profiler`	Profiler to collect `torch` performance metrics during training.

Performance profiling tools.

The profiler gathers performance metrics during a training run that can be used to diagnose bottlenecks and facilitate model development.

The metrics gathered include:

Duration of each Event during training
Time taken by the data loader to return a batch
Host metrics such as CPU, system memory, disk and network utilization over time
Execution order, latency and attributes of PyTorch operators and GPU kernels (see torch.profiler)

The following example demonstrates how to set up and run profiling on a simple training run.

# Copyright 2021 MosaicML. All Rights Reserved.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.models import MNIST_Classifier

# Specify Dataset and Instantiate DataLoader
batch_size = 2048
data_directory = "../data"

mnist_transforms = transforms.Compose([transforms.ToTensor()])

train_dataset = datasets.MNIST(data_directory, train=True, download=True, transform=mnist_transforms)
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=False,
                              drop_last=True,
                              pin_memory=True,
                              persistent_workers=True,
                              num_workers=8)

# Instantiate Model
model = MNIST_Classifier(num_classes=10)

# Instantiate the trainer
profiler_trace_file = "profiler_traces.json"
torch_trace_dir = "torch_profiler"

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=train_dataloader,
                  max_duration=2,
                  device="gpu",
                  validate_every_n_batches=-1,
                  validate_every_n_epochs=-1,
                  precision="amp",
                  train_subset_num_batches=16,
                  profiler_trace_file=profiler_trace_file,
                  prof_skip_first=0,
                  prof_wait=0,
                  prof_warmup=1,
                  prof_active=4,
                  prof_repeat=1,
                  torch_profiler_trace_dir=torch_trace_dir)

# Run training
trainer.fit()

To enable profiling, an output profiler_trace_file must be specified during Trainer initialization. Once the profiling run completes, the profiler_trace_file contains the profiling trace data. By default, the Profiler, DataLoaderProfiler and SystemProfiler are active; the TorchProfiler is disabled.

To activate the TorchProfiler, the torch_profiler_trace_dir must be specified in addition to the profiler_trace_file argument. The torch_profiler_trace_dir will contain the Torch Profiler traces once the profiling run completes. The Profiler will automatically merge the Torch traces in the torch_profiler_trace_dir into the profiler_trace_file, allowing users to view a unified trace.

The complete traces can be viewed in a Google Chrome browser by navigating to chrome://tracing and loading the profiler_trace_file. Here is an example trace file:

Additional details can be found in the Profiler Guide.
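
As a rough illustration of what a chrome://tracing viewer consumes, the sketch below writes a tiny trace in the Chrome Trace Event Format using only the standard library. The helper write_example_trace, the event names and the file name are hypothetical, and the exact fields Composer writes to the profiler_trace_file may differ.

```python
import json
import os
import tempfile


def write_example_trace(path):
    """Write a minimal Chrome Trace Event Format file (hypothetical events)."""
    events = [
        # "B"/"E" mark the begin and end of a duration event; "ts" is in microseconds.
        {"name": "forward", "cat": "example", "ph": "B", "ts": 0, "pid": 0, "tid": 0},
        {"name": "forward", "cat": "example", "ph": "E", "ts": 1500, "pid": 0, "tid": 0},
        # "i" is an instant event; "s" sets its scope (here: the thread).
        {"name": "checkpoint", "cat": "example", "ph": "i", "ts": 1600, "pid": 0, "tid": 0, "s": "t"},
        # "C" is a counter event; "args" holds the values of each series.
        {"name": "memory", "cat": "example", "ph": "C", "ts": 1700, "pid": 0, "tid": 0,
         "args": {"bytes": 1024}},
    ]
    with open(path, "w") as f:
        json.dump(events, f)
    return events


# Write the trace to a temporary directory and read it back.
trace_path = os.path.join(tempfile.mkdtemp(), "example_trace.json")
written = write_example_trace(trace_path)

with open(trace_path) as f:
    loaded = json.load(f)
```

The three event kinds in this file (duration, instant and counter) correspond to the event types that a Marker can record.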

Classes

`Marker`	Record when something happens or how long something takes.
`Profiler`	Records the duration of Trainer `Event` using the `Marker` API.
`ProfilerAction`	Action states for the `Profiler` that define whether or not events are being recorded to the trace file.
`ProfilerEventHandler`	Base class for profiler event handlers.

class composer.profiler.Marker(profiler, name, actions, record_instant_on_start, record_instant_on_finish, categories)

Record when something happens or how long something takes.

Used by the Engine to measure the duration of Event during training.

Note

Marker should not be instantiated directly; instead use Profiler.marker().

Markers can record the following types of events:

Duration: Records the start and stop time of an event of interest (Marker.start(), Marker.finish()).
Instant: Records the time at which a particular event occurs, but not its full duration (Marker.instant()).
Counter: Records the value of a variable at a given time (Marker.counter()).

A Marker can also be used as a context manager or decorator to record a duration:

Use a Marker with a context manager:

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("foo")
>>> with marker:
...     something_to_measure()
something_to_measure

Use a Marker as a decorator:

>>> marker = profiler.marker("foo")
>>> @marker
... def something_to_measure():
...     print("something_to_measure")
>>> something_to_measure()
something_to_measure
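
The dual context-manager and decorator behavior described above can be sketched in plain Python. The class TimedSection below is a hypothetical stand-in written for illustration; it is not Composer's Marker and makes no claim about how Marker is implemented.

```python
import functools
import time


class TimedSection:
    """Hypothetical stand-in for a marker; not Composer's Marker class."""

    def __init__(self, name):
        self.name = name
        self.durations_ns = []  # one entry per completed start/finish pair
        self._started_at = None

    def start(self):
        self._started_at = time.perf_counter_ns()

    def finish(self):
        self.durations_ns.append(time.perf_counter_ns() - self._started_at)
        self._started_at = None

    # Context-manager protocol: time the body of the `with` block.
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.finish()
        return False  # do not swallow exceptions raised in the block

    # Decorator protocol: wrap the function so that every call is timed.
    def __call__(self, func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            with self:
                return func(*args, **kwargs)

        return wrapped


section = TimedSection("foo")

with section:
    sum(range(1000))


@section
def something_to_measure():
    return "something_to_measure"


result = something_to_measure()
```

Routing the decorator through the context manager keeps a single code path for starting and finishing, so both usages record durations the same way.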

counter(values)

Record a counter event.

To record a counter event:

>>> marker = profiler.marker("foo")
>>> counter_event = 5
>>> marker.counter({"counter_event": counter_event})
>>> counter_event = 10
>>> marker.counter({"counter_event": counter_event})

finish()

Record the end of a duration event.

See Marker.start() for a usage example.

instant()

Record an instant event.

To record an instant event:

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("instant")
>>> marker.instant()
>>> something_to_measure()
something_to_measure

start()

Record the start of a duration event.

To record the duration of an event, invoke Marker.start() followed by Marker.finish():

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("foo")
>>> marker.start()
>>> something_to_measure()
something_to_measure
>>> marker.finish()

class composer.profiler.Profiler(state, event_handlers=(), skip_first=0, wait=0, warmup=1, active=4, repeat=1, merged_trace_file='merged_profiler_trace.json')

Records the duration of Trainer Event using the Marker API.

Specifically, it records:

The duration of each section of the training loop, such as the time it takes to perform a forward pass, backward pass, batch, epoch, etc.
The latency that each algorithm and callback adds when executing on each event.

The event_handlers then record and save this data to a trace file. If no event_handlers are specified, the JSONTraceHandler is used by default.

Note

The Composer Trainer creates an instance of Profiler when merged_trace_file is provided. The user should not create and directly register an instance of Profiler when using the Composer Trainer.

Parameters

state (State) – The state.
event_handlers (Sequence[ProfilerEventHandler]) – Event handlers which record and save profiling data to traces.
skip_first (int, optional) – Number of batches to skip profiling at epoch start. Defaults to 0.
wait (int, optional) – For each profiling cycle, number of batches to skip at the beginning of the cycle. Defaults to 0.
warmup (int, optional) – For each profiling cycle, number of batches to be in the warmup state after skipping wait batches. Defaults to 1.
active (int, optional) – For each profiling cycle, number of batches to record after warming up. Defaults to 4.
repeat (int, optional) – Number of profiling cycles to perform per epoch. Set to 0 to record the entire epoch. Defaults to 1.
merged_trace_file (str, optional) – Name of the trace file, relative to the run directory. Defaults to merged_profiler_trace.json.

property event_handlers: Profiler event handlers.

get_action(batch_idx)

Get the current ProfilerAction for the profiler, based upon the parameters skip_first, wait, warmup, active, and repeat.

The profiler skips the first skip_first batches in every epoch. Then, it performs a cycle of skipping wait batches, warming up for warmup batches, and recording active batches. It repeats this cycle up to repeat times per epoch (or for the entire epoch, if repeat is 0). This logic repeats every epoch.

Parameters: batch_idx (int) – The index of the current batch.
Returns: ProfilerAction – The current action.
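
The schedule described above can be sketched as a pure function. This is a minimal illustration rather than Composer's implementation: Action is a hypothetical stand-in for ProfilerAction, sketch_get_action is a hypothetical name, and batch_idx is assumed to be the index of the batch within the current epoch.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical stand-in for ProfilerAction."""

    SKIP = "skip"
    WARMUP = "warmup"
    ACTIVE = "active"


def sketch_get_action(batch_idx, skip_first=0, wait=0, warmup=1, active=4, repeat=1):
    """Map a batch index (assumed to be per-epoch) to an action."""
    # The first skip_first batches of the epoch are never profiled.
    if batch_idx < skip_first:
        return Action.SKIP

    position = batch_idx - skip_first
    cycle_len = wait + warmup + active
    if cycle_len == 0:
        # Degenerate schedule; returning SKIP here is an assumption of this sketch.
        return Action.SKIP

    # repeat == 0 means the cycle keeps repeating for the entire epoch.
    if repeat != 0 and position >= cycle_len * repeat:
        return Action.SKIP

    # Within one cycle: wait batches, then warmup batches, then active batches.
    offset = position % cycle_len
    if offset < wait:
        return Action.SKIP
    if offset < wait + warmup:
        return Action.WARMUP
    return Action.ACTIVE


default_schedule = [sketch_get_action(i) for i in range(7)]
custom_schedule = [
    sketch_get_action(i, skip_first=1, wait=1, warmup=1, active=2, repeat=2)
    for i in range(10)
]
```

With the default parameters, this sketch warms up on the first batch of the epoch, records the next four, and skips the rest.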

marker(name, actions=(<ProfilerAction.WARMUP: 'warmup'>, <ProfilerAction.ACTIVE: 'active'>), record_instant_on_start=False, record_instant_on_finish=False, categories=())

Create and get an instance of a Marker.

If a Marker with the specified name does not already exist, it will be created. Otherwise, the existing instance will be returned.

For example:

>>> marker = profiler.marker("foo")
>>> marker
<composer.profiler.Marker object at ...>

Note

Profiler.marker() should be used to construct markers. Marker should not be instantiated directly by the user.

Please see Marker.start() and Marker.finish() for usage on creating markers to measure duration events, Marker.instant() for usage on creating markers to mark instant events and Marker.counter() for usage on creating markers for counting.

Parameters

name (str) – The name for the Marker.
actions (Sequence[ProfilerAction], optional) – ProfilerAction states to record on. Defaults to (ProfilerAction.WARMUP, ProfilerAction.ACTIVE).
record_instant_on_start (bool, optional) – Whether to record an instant event whenever the marker is started. Defaults to False.
record_instant_on_finish (bool, optional) – Whether to record an instant event whenever the marker is finished. Defaults to False.
categories (Union[List[str], Tuple[str, ...]], optional) – Categories for this marker. Defaults to ().

Returns

Marker – Instance of Marker.
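
The get-or-create behavior of Profiler.marker() (one instance per name) can be sketched with a dictionary keyed by name. SketchMarker and SketchRegistry are hypothetical names used only for this illustration; how Profiler stores its markers internally is not specified here.

```python
class SketchMarker:
    """Hypothetical placeholder for a marker object."""

    def __init__(self, name):
        self.name = name


class SketchRegistry:
    """Hypothetical registry that returns one marker instance per name."""

    def __init__(self):
        self._markers = {}

    def marker(self, name):
        # Create the marker on first request; afterwards return the same instance.
        if name not in self._markers:
            self._markers[name] = SketchMarker(name)
        return self._markers[name]


registry = SketchRegistry()
first = registry.marker("foo")
second = registry.marker("foo")
other = registry.marker("bar")
```

Reusing the instance means that repeated requests for the same name record to the same marker rather than to a fresh copy.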

class composer.profiler.ProfilerAction(value)

Bases: composer.utils.string_enum.StringEnum

Action states for the Profiler that define whether or not events are being recorded to the trace file.

SKIP: Do not record new events to the trace. Any events started during ACTIVE or WARMUP will be recorded upon finish.

WARMUP: Record all events to the trace except those requiring a warmup period to initialize data structures (e.g., torch.profiler).

ACTIVE: Record all events to the trace.

class composer.profiler.ProfilerEventHandler

Bases: composer.core.callback.Callback, abc.ABC

Base class for profiler event handlers.

Event handlers are responsible for logging trace Markers and saving them to a file in a given trace format for viewing.

Subclasses should implement process_duration_event(), process_instant_event() and process_counter_event(). These methods are invoked by the Profiler whenever there is an event to record.

Since ProfilerEventHandler subclasses Callback, event handlers can run on Events (such as on Event.INIT to open files or on Event.BATCH_END to periodically dump data to files) and use Callback.close() to perform any cleanup.

process_counter_event(name, categories, wall_clock_time_ns, global_rank, pid, values)

Called by the Profiler whenever there is a counter event to record.

Parameters

name (str) – The name of the event.
categories (List[str] | Tuple[str, ...]) – The categories for the event.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.
values (Dict[str, int | float]) – The values corresponding to this counter event.

process_duration_event(name, categories, is_start, timestamp, wall_clock_time_ns, global_rank, pid)

Called by the Profiler whenever there is a duration event to record.

This method is called twice for each duration event – once with is_start = True, and then again with is_start = False. Interleaving events are not permitted. Specifically, for each event (identified by the name), a call with is_start = True will be followed by a call with is_start = False before another call with is_start = True.

Parameters

name (str) – The name of the event.
categories (Union[List[str], Tuple[str, ...]]) – The categories for the event.
is_start (bool) – Whether the event is a start event or end event.
timestamp (Timestamp) – Snapshot of the training time.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.
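
The start/finish pairing rule described above can be sketched as follows. DurationCollector is a hypothetical, standalone class that mirrors the process_duration_event() signature without subclassing ProfilerEventHandler, so it runs without Composer installed. Pairing events by name alone is a simplification; a handler that receives events from several ranks may need a richer key.

```python
class DurationCollector:
    """Hypothetical sketch; does not subclass ProfilerEventHandler."""

    def __init__(self):
        self._open = {}  # event name -> wall clock start time in ns
        self.durations_ns = {}  # event name -> list of measured durations in ns

    def process_duration_event(self, name, categories, is_start, timestamp,
                               wall_clock_time_ns, global_rank, pid):
        if is_start:
            # A second start before the matching finish would be an interleaved event.
            if name in self._open:
                raise RuntimeError(f"event {name!r} started twice without finishing")
            self._open[name] = wall_clock_time_ns
        else:
            if name not in self._open:
                raise RuntimeError(f"event {name!r} finished without starting")
            started_at = self._open.pop(name)
            self.durations_ns.setdefault(name, []).append(wall_clock_time_ns - started_at)


collector = DurationCollector()
# timestamp is passed as None because this sketch does not use the training time.
collector.process_duration_event("forward", ["example"], True, None, 1_000, 0, 1234)
collector.process_duration_event("forward", ["example"], False, None, 4_500, 0, 1234)
```

Because each start is matched to the next finish with the same name, the handler can compute durations with a single dictionary lookup per event.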

process_instant_event(name, categories, timestamp, wall_clock_time_ns, global_rank, pid)

Called by the Profiler whenever there is an instant event to record.

Parameters

name (str) – The name of the event.
categories (List[str] | Tuple[str, ...]) – The categories for the event.
timestamp (Timestamp) – Snapshot of current training time.
wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.
global_rank (int) – The global_rank corresponding to the event.
pid (int) – The pid corresponding to the event.