# MultiLabelMetric¶

class mmcls.evaluation.MultiLabelMetric(thr=None, topk=None, items=('precision', 'recall', 'f1-score'), average='macro', collect_device='cpu', prefix=None)[source]

A collection of precision, recall, f1-score and support for multi-label tasks.

The collection of metrics is for multi-label, multi-class classification. All of these metrics are based on the per-category confusion matrix, whose entries are the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) counted for that category.

All metrics can be formulated using these variables:

Precision is the fraction of correct predictions in all predictions:

$\text{Precision} = \frac{TP}{TP+FP}$

Recall is the fraction of correct predictions in all targets:

$\text{Recall} = \frac{TP}{TP+FN}$

F1-score is the harmonic mean of the precision and recall:

$\text{F1-score} = \frac{2\times\text{Recall}\times\text{Precision}}{\text{Recall}+\text{Precision}}$

Support is the number of samples:

$\text{Support} = TP + TN + FN + FP$
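The four formulas above can be sketched in plain Python for a single category, given its confusion-matrix counts (this is an illustration only, not the mmcls implementation):

```python
# Compute precision, recall, f1-score and support for one category
# from its confusion-matrix counts, following the formulas above.
def category_metrics(tp, fp, tn, fn):
    """Return (precision, recall, f1, support) as plain fractions."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    support = tp + tn + fp + fn  # total number of samples
    return precision, recall, f1, support

# e.g. 3 true positives, 1 false positive, 5 true negatives, 1 false negative
print(category_metrics(3, 1, 5, 1))  # (0.75, 0.75, 0.75, 10)
```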
Parameters
• thr (float, optional) – Predictions with scores under the threshold are considered negative. If None, the topk predictions are considered positive. If topk is also None, use thr=0.5 as default. Defaults to None.

• topk (int, optional) – Predictions with the top-k highest scores are considered positive. If None, use thr to determine positive predictions. If both thr and topk are not None, thr takes precedence. Defaults to None.

• items (Sequence[str]) – The detailed metric items to evaluate, select from “precision”, “recall”, “f1-score” and “support”. Defaults to ('precision', 'recall', 'f1-score').

• average (str | None) –

How to calculate the final metrics from the confusion matrix of every category. It supports three modes:

• ”macro”: Calculate metrics for each category, and calculate the mean value over all categories.

• ”micro”: Average the confusion matrix over all categories and calculate metrics on the mean confusion matrix.

• None: Calculate metrics of every category and output directly.

Defaults to “macro”.

• collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

• prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
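The difference between the "macro" and "micro" averaging modes described above can be sketched with plain per-category counts (an illustration only, not the mmcls implementation):

```python
# "macro" vs "micro" averaging, shown on per-category precision.
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

# Per-category (TP, FP) counts for two categories of very different sizes.
counts = [(90, 10), (1, 9)]

# "macro": compute the metric per category, then take the unweighted mean.
macro = sum(precision(tp, fp) for tp, fp in counts) / len(counts)

# "micro": sum the confusion-matrix entries first, then compute once.
tp_sum = sum(tp for tp, _ in counts)
fp_sum = sum(fp for _, fp in counts)
micro = precision(tp_sum, fp_sum)

print(macro, micro)  # macro = 0.5 (mean of 0.9 and 0.1); micro = 91/110
```

Note how "macro" weights every category equally, while "micro" is dominated by the larger category.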

Examples

>>> import torch
>>> from mmcls.evaluation import MultiLabelMetric
>>> # ------ The Basic Usage for category indices labels -------
>>> y_pred = [[0], [1], [0, 1], [3]]
>>> y_true = [[0, 3], [0, 2], [1], [3]]
>>> # Output precision, recall, f1-score and support
>>> MultiLabelMetric.calculate(
...     y_pred, y_true, pred_indices=True, target_indices=True, num_classes=4)
(tensor(50.), tensor(50.), tensor(45.8333), tensor(6))
>>> # ----------- The Basic Usage for one-hot labels -----------
>>> y_pred = torch.tensor([[1, 1, 0, 0],
...                        [1, 1, 0, 0],
...                        [0, 0, 1, 0],
...                        [0, 1, 0, 0],
...                        [0, 1, 0, 0]])
>>> y_true = torch.Tensor([[1, 1, 0, 0],
...                        [0, 0, 1, 0],
...                        [1, 1, 1, 0],
...                        [1, 0, 0, 0],
...                        [1, 0, 0, 0]])
>>> MultiLabelMetric.calculate(y_pred, y_true)
(tensor(43.7500), tensor(31.2500), tensor(33.3333), tensor(8))
>>> # --------- The Basic Usage for one-hot pred scores ---------
>>> y_pred = torch.rand(y_true.size())
>>> y_pred
tensor([[0.4575, 0.7335, 0.3934, 0.2572],
        [0.1318, 0.1004, 0.8248, 0.6448],
        [0.8349, 0.6294, 0.7896, 0.2061],
        [0.4037, 0.7308, 0.6713, 0.8374],
        [0.3779, 0.4836, 0.0313, 0.0067]])
>>> # Calculate with different threshold.
>>> MultiLabelMetric.calculate(y_pred, y_true, thr=0.1)
(tensor(42.5000), tensor(75.), tensor(53.1746), tensor(8))
>>> # Calculate with topk.
>>> MultiLabelMetric.calculate(y_pred, y_true, topk=1)
(tensor(62.5000), tensor(31.2500), tensor(39.1667), tensor(8))
>>>
>>> # ------------------- Use with Evaluator -------------------
>>> from mmcls.structures import ClsDataSample
>>> from mmengine.evaluator import Evaluator
>>> data_samples = [
...     ClsDataSample().set_pred_score(pred).set_gt_score(gt)
...     for pred, gt in zip(torch.rand(1000, 5), torch.randint(0, 2, (1000, 5)))]
>>> evaluator = Evaluator(metrics=MultiLabelMetric(thr=0.5))
>>> evaluator.process(data_samples)
>>> evaluator.evaluate(1000)
{
'multi-label/precision': 50.72898037055408,
'multi-label/recall': 50.06836461357571,
'multi-label/f1-score': 50.384466955258475
}
>>> # Evaluate on each class by using topk strategy
>>> evaluator = Evaluator(metrics=MultiLabelMetric(topk=1, average=None))
>>> evaluator.process(data_samples)
>>> evaluator.evaluate(1000)
{
'multi-label/precision_top1_classwise': [48.22, 50.54, 50.99, 44.18, 52.5],
'multi-label/recall_top1_classwise': [18.92, 19.22, 19.92, 20.0, 20.27],
'multi-label/f1-score_top1_classwise': [27.18, 27.85, 28.65, 27.54, 29.25]
}

static calculate(pred, target, pred_indices=False, target_indices=False, average='macro', thr=None, topk=None, num_classes=None)[source]

Calculate the precision, recall, f1-score and support.

Parameters
• pred (torch.Tensor | np.ndarray | Sequence) – The prediction results. A torch.Tensor or np.ndarray with shape (N, num_classes) or a sequence of index/onehot format labels.

• target (torch.Tensor | np.ndarray | Sequence) – The target labels. A torch.Tensor or np.ndarray with shape (N, num_classes) or a sequence of index/onehot format labels.

• pred_indices (bool) – Whether the pred is a sequence of category index labels. If True, num_classes must be set. Defaults to False.

• target_indices (bool) – Whether the target is a sequence of category index labels. If True, num_classes must be set. Defaults to False.

• average (str | None) –

How to calculate the final metrics from the confusion matrix of every category. It supports three modes:

• ”macro”: Calculate metrics for each category, and calculate the mean value over all categories.

• ”micro”: Average the confusion matrix over all categories and calculate metrics on the mean confusion matrix.

• None: Calculate metrics of every category and output directly.

Defaults to “macro”.

• thr (float, optional) – Predictions with scores under the threshold are considered negative. Defaults to None.

• topk (int, optional) – Predictions with the top-k highest scores are considered positive. Defaults to None.

• num_classes (int, optional) – The number of classes. Required if pred or target is given as index labels instead of one-hot. Defaults to None.
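Why num_classes is required for index-format labels can be sketched with a small conversion helper (a hypothetical illustration in plain Python, not part of the mmcls API): each per-sample index list must be expanded to a fixed-width one-hot row before the confusion matrix can be built, and the row width cannot be inferred from the indices alone.

```python
# Hypothetical helper: expand index-format labels to one-hot rows.
# The row width is num_classes, which the indices themselves don't reveal.
def indices_to_onehot(indices, num_classes):
    onehot = [[0] * num_classes for _ in indices]
    for row, idx in enumerate(indices):
        for i in idx:
            onehot[row][i] = 1
    return onehot

# Same index labels as the first example in the class docstring.
y_pred = [[0], [1], [0, 1], [3]]
print(indices_to_onehot(y_pred, num_classes=4))
# [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
```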

Returns

The tuple contains precision, recall, f1-score and support. The type of each item is:

• torch.Tensor: A tensor for each metric. The shape is (1, ) if average is not None, and (C, ) if average is None.

Return type

Tuple

Notes

If both thr and topk are set, use thr to determine positive predictions. If neither is set, use thr=0.5 as default.
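The precedence rule in the note above can be sketched as follows (plain Python lists stand in for score tensors; assuming scores at or above thr count as positive, this is an illustration and not the actual mmcls implementation):

```python
# Decide which predictions count as positive: thr wins over topk,
# and thr=0.5 is the fallback when neither is given.
def positives(scores, thr=None, topk=None):
    if thr is None and topk is None:
        thr = 0.5                      # documented default
    if thr is not None:                # thr takes precedence over topk
        return [[s >= thr for s in row] for row in scores]
    # topk: mark the k highest-scoring categories of each row as positive.
    ranked = [sorted(range(len(row)), key=row.__getitem__, reverse=True)
              for row in scores]
    return [[i in r[:topk] for i in range(len(row))]
            for row, r in zip(scores, ranked)]

scores = [[0.9, 0.4, 0.6]]
print(positives(scores, thr=0.5))          # [[True, False, True]]
print(positives(scores, topk=1))           # [[True, False, False]]
print(positives(scores, thr=0.5, topk=1))  # thr wins: [[True, False, True]]
```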

compute_metrics(results)[source]

Compute the metrics from processed results.

Parameters

results (list) – The processed results of each batch.

Returns

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

Dict

process(data_batch, data_samples)[source]

Process one batch of data samples.

The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
• data_batch – A batch of data from the dataloader.

• data_samples (Sequence[dict]) – A batch of outputs from the model.