📊 Evaluation#

To track training progress, validation datasets can be provided to the Composer Trainer through the eval_dataloader parameter. The trainer will compute evaluation metrics on the evaluation dataset at the frequency specified by the Trainer parameter eval_interval.

from composer import Trainer

trainer = Trainer(
    ...,
    eval_dataloader=my_eval_dataloader,
    eval_interval="1ep",  # Default is every epoch
)
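
eval_interval also accepts other units from Composer's time-string format. As a rough sketch (reusing the hypothetical my_eval_dataloader from above), evaluation can be run every 500 training batches instead of every epoch:

trainer = Trainer(
    ...,
    eval_dataloader=my_eval_dataloader,
    eval_interval="500ba",  # "ba" denotes batches, "ep" denotes epochs
)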

By default, the metrics are provided by ComposerModel.metrics().
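
For reference, the sketch below shows roughly how a model might expose its metrics. MyClassifier is a hypothetical subclass, and the exact signature of metrics() may vary across Composer versions, so treat this as an illustration rather than a complete implementation:

import torchmetrics
from composer.models import ComposerModel

class MyClassifier(ComposerModel):
    # forward() and loss() omitted for brevity.

    def metrics(self, train: bool = False):
        # The metrics returned here (with train=False) are computed on
        # the eval_dataloader at each evaluation interval.
        return torchmetrics.Accuracy()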

Multiple Datasets#

If there are multiple validation datasets that may have different metrics, use Evaluator to specify each pair of dataloader and metrics. This class is just a container for a few attributes: a label, a dataloader, and the metrics to compute on that dataloader.

For example, the GLUE tasks for language models can be specified as follows:

from composer.core import Evaluator
from torchmetrics import Accuracy, MetricCollection
from composer.models.nlp_metrics import BinaryF1Score

glue_mrpc_task = Evaluator(
    label='glue_mrpc',  # used to namespace this evaluator's logged metrics
    dataloader=mrpc_dataloader,
    metrics=MetricCollection([BinaryF1Score(), Accuracy()])
)

glue_mnli_task = Evaluator(
    label='glue_mnli',
    dataloader=mnli_dataloader,
    metrics=Accuracy()
)

trainer = Trainer(
    ...,
    eval_dataloader=[glue_mrpc_task, glue_mnli_task],
    ...
)

In this case, the metrics from ComposerModel.metrics() are ignored, since each Evaluator explicitly specifies its own metrics.