A key aspect of judging whether a classifier is fit for purpose is measuring its predictive performance.

Any commercial project that involves machine learning is well advised to establish the minimum predictive performance that a classifier has to achieve in order to be viable. In a similar vein, it is useful to establish a baseline performance, i.e. the performance achieved by a simple, unsophisticated, straightforward approach. A more complex solution can only be justified if it outperforms the baseline. Performance comparisons can also be made between different candidate approaches, so as to find an adequate trade-off between performance and other aspects such as runtime complexity for training and classification, maintainability, interpretability, required integration effort, and monetary cost. Finally, performance metrics enable the systematic exploration of modifications to a particular approach. Tuning parameters and adding new features ought to result in performance gains, and, perhaps, simplifications can be introduced without adversely affecting performance. Thus, performance metrics can be used to establish a feedback loop for experimentation.

There are a number of commonly used performance metrics, all of which are derived from a confusion matrix. A confusion matrix is constructed by feeding a data set with known actual classifications (i.e. a test/validation set) through a classifier and recording the number of times each *{actual class, predicted class}*-pair occurred.
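The counting step can be sketched in a few lines (Python here purely for illustration, not the Elixir module discussed later; the label values are made up):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual class, predicted class) pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical ground truth and classifier output:
actual    = ["low", "low", "high", "low", "high"]
predicted = ["low", "high", "high", "low", "low"]
cm = confusion_matrix(actual, predicted)
# cm[("high", "high")] -> 1 (the one correctly identified high-risk item)
```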

In the remainder of this post, I will use two example scenarios (one binary, one multi-class) to examine accuracy, precision and recall.

## Binary case

In this hypothetical example scenario, we assume that we have trained a document classifier for categorizing contract documents into high-risk vs low-risk. We intend to apply the classifier to a large collection of contract documents in order to flag up high-risk contracts for expert review. Our test set comprises 100 documents, 95 of which are low-risk and 5 of which are high-risk. Feeding the test set through the classifier yields the following confusion matrix:

|                     | actual high-risk | actual low-risk |
|---------------------|------------------|-----------------|
| predicted high-risk | 1                | 1               |
| predicted low-risk  | 4                | 94              |

The cells in a 2×2 confusion matrix are referred to by the following names:

|                 | actual true          | actual false         |
|-----------------|----------------------|----------------------|
| predicted true  | true positives (tp)  | false positives (fp) |
| predicted false | false negatives (fn) | true negatives (tn)  |

One somewhat obvious performance metric is to calculate the fraction of correct pairs (i.e. where actual = predicted) out of all pairs. This metric is known as accuracy and is calculated as follows:

accuracy = (tp + tn) / (tp + fp + fn + tn)
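Plugging the cells of the example confusion matrix into this formula (a quick Python check; the variable names are mine):

```python
tp, fp, fn, tn = 1, 1, 4, 94  # cells of the example confusion matrix

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.95
```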

The accuracy derived from the example confusion matrix is 0.95. This is high, but also misleading as there is an imbalance between the high-risk and low-risk classes. The classifier dealt well with the low-risk documents, and as that class dominates the data set, it hides the poor performance on the high-risk documents. How can we claim that the classifier performs poorly with respect to high-risk documents?


Firstly, the classifier predicted high-risk for 2 documents, only 1 of which was actually high-risk. This metric is known as precision.

precision = tp / (tp + fp)

In the example, the precision is 0.5.

Secondly, there are 5 documents which are actually high-risk, but the classifier only identified 1 of them. This metric is known as recall.

recall = tp / (tp + fn)

In the example, the recall is 0.2.
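Both figures can be checked against the example matrix in the same way (again illustrative Python):

```python
tp, fp, fn = 1, 1, 4  # high-risk cells of the example confusion matrix

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(precision, recall)  # 0.5 0.2
```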

In the example scenario, it is trivial to achieve a perfect recall by classifying all documents as high-risk, but the price of this is an extremely low precision. Conversely, a high precision could be obtained by making a single correct high-risk classification, but the price of this would be an extremely low recall. However, it is not generally trivial to achieve high precision and high recall at the same time, and for this reason, precision and recall should both be included in reports.

Depending on the task, precision and recall are not necessarily of equal importance. In the example scenario, it could be the case that the expert review is costly and that therefore as little time as possible should be spent on dealing with false alarms (precision more important than recall). The opposite extreme is conceivable, too: The cost of missing a high-risk document could be so great that dealing with a large number of false alarms is acceptable (recall more important than precision).

## Multi-class

In order to extend accuracy, precision and recall to the multi-class case, let’s imagine another document classification scenario: Incoming news items are to be assigned one of the following categories: Business (B), Politics (P), Sports (S), Technology (T), Health (H). Feeding a test set through the classifier has produced the following confusion matrix:

|     | a_B | a_P | a_S | a_T | a_H |
|-----|-----|-----|-----|-----|-----|
| p_B | 17  | 3   | 2   | 3   | 2   |
| p_P | 1   | 12  | 0   | 1   | 2   |
| p_S | 0   | 1   | 16  | 0   | 1   |
| p_T | 3   | 3   | 1   | 13  | 1   |
| p_H | 0   | 0   | 0   | 0   | 16  |

By convention, the classes are listed in the same order in the rows as in the columns, and therefore, the correct classifications (actual class = predicted class) are located on the diagonal from top left to bottom right. Accuracy is the sum of the diagonal divided by the total. (This definition also works for the binary case.) Accuracy is a classifier-level metric and, as in the binary case, is sensitive to class imbalances.
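The diagonal-over-total computation for the news-item matrix looks like this (illustrative Python; rows are predicted classes and columns actual classes, as above):

```python
# Rows: p_B, p_P, p_S, p_T, p_H; columns: a_B, a_P, a_S, a_T, a_H
matrix = [
    [17,  3,  2,  3,  2],
    [ 1, 12,  0,  1,  2],
    [ 0,  1, 16,  0,  1],
    [ 3,  3,  1, 13,  1],
    [ 0,  0,  0,  0, 16],
]

diagonal = sum(matrix[i][i] for i in range(len(matrix)))  # 74 correct classifications
total = sum(sum(row) for row in matrix)                   # 98 instances in the test set
accuracy = diagonal / total                               # roughly 0.76
```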

Precision and recall are reported on a per-class basis. This also works analogously to the binary case, except that

- false positives have to be added up along the respective class’s row, and
- false negatives have to be added up along the respective class’s column.

Below is an illustration for the Technology class. (The blank cells are true negatives for the Technology class.)

|     | a_B | a_P | a_S | a_T | a_H |
|-----|-----|-----|-----|-----|-----|
| p_B |     |     |     | fn  |     |
| p_P |     |     |     | fn  |     |
| p_S |     |     |     | fn  |     |
| p_T | fp  | fp  | fp  | tp  | fp  |
| p_H |     |     |     | fn  |     |

The precision for the Technology class is thus 13 / (13 + 3 + 3 + 1 + 1) = 0.62.

The recall for the Technology class is 13 / (13 + 3 + 1) = 0.76.
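Both per-class figures come down to one row sum and one column sum (illustrative Python, reusing the matrix layout from above):

```python
# Rows: predicted classes, columns: actual classes (news-item example)
matrix = [
    [17,  3,  2,  3,  2],
    [ 1, 12,  0,  1,  2],
    [ 0,  1, 16,  0,  1],
    [ 3,  3,  1, 13,  1],
    [ 0,  0,  0,  0, 16],
]
t = 3  # index of the Technology class (B=0, P=1, S=2, T=3, H=4)

tp = matrix[t][t]                               # 13
precision = tp / sum(matrix[t])                 # 13 / 21, about 0.62
recall    = tp / sum(row[t] for row in matrix)  # 13 / 17, about 0.76
```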

Should you want to characterize the classifier’s overall performance, there are several choices besides the accuracy metric described above:

- Macro-average of the per-class metrics: sum of the per-class precision (recall) divided by the number of classes; gives equal weight to all classes.
- Weighted-average of the per-class metrics: sum of (per-class metric * relative class frequency); sensitive to imbalances.
- Micro-average precision, recall: sum up each class’s relevant contributions (tp, fp, fn) and put the results into the respective formula (see binary case definitions); somewhat surprisingly, it turns out that micro-averaged precision is always equal to micro-averaged recall. The reason for this is that a false positive for one class is always also a false negative for another class. Finally, both metrics are equal to accuracy, as they are summing up the diagonal and dividing it by the total.
- Average precision, recall, accuracy: for each class, create a this-class-versus-the-others 2×2 confusion matrix, derive the desired metric, sum up, divide by the number of classes.
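The macro- and micro-averaging variants can be sketched as follows (illustrative Python; the matrix is the news-item example from above):

```python
# Rows: predicted classes, columns: actual classes (news-item example)
matrix = [
    [17,  3,  2,  3,  2],
    [ 1, 12,  0,  1,  2],
    [ 0,  1, 16,  0,  1],
    [ 3,  3,  1, 13,  1],
    [ 0,  0,  0,  0, 16],
]
n = len(matrix)

tp = [matrix[i][i] for i in range(n)]
fp = [sum(matrix[i]) - tp[i] for i in range(n)]                 # rest of each row
fn = [sum(row[i] for row in matrix) - tp[i] for i in range(n)]  # rest of each column

per_class_precision = [tp[i] / (tp[i] + fp[i]) for i in range(n)]
macro_precision = sum(per_class_precision) / n

# Micro-averaging pools the counts before applying the formula;
# since every fp is some other class's fn, the two micro metrics coincide.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall    = sum(tp) / (sum(tp) + sum(fn))
```

Both micro figures come out to 74/98 here, i.e. exactly the accuracy of the example matrix, confirming the observation in the bullet point above.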

## Elixir implementation

As far as I am aware, there is currently no widely available confusion matrix module for Elixir. I am providing one in the hope that it might be useful to anybody experimenting with classifiers in Elixir. The implementation was quite straightforward; the most fun (and most challenging) part turned out to be `to_string/1`.

## Conclusion

In this blog, I have examined why performance metrics are useful and have illustrated accuracy, precision and recall for binary and multi-class classification tasks. I have also made available a simple confusion matrix implementation in Elixir.

Where might one go from here?

The material presented in this blog is a good foundation for exploring related topics such as:

- other metrics, for example F1, the harmonic mean of precision and recall
- other types of tasks, which require slightly different performance metrics:
  - hierarchical classification tasks, where each instance belongs to exactly one class and the classes form a hierarchy
  - multi-label classification tasks, where each instance belongs to some subset of the classes, i.e. may be tagged with zero, one or multiple labels
  - regression tasks, where the predictions are continuous rather than discrete
- n-fold cross validation