Confusion Matrix
In the previous chapters of our Machine Learning tutorial (Neural Networks with Python and Numpy and Neural Networks from Scratch) we implemented various algorithms, but we didn't properly measure the quality of the output. The main reason was that we used very simple and small datasets for learning and testing. In the chapter Neural Network: Testing with MNIST, we will work with large datasets and ten classes, so we need proper evaluation tools. In this chapter we introduce the concept of the confusion matrix:
A confusion matrix is a matrix (table) that can be used to measure the performance of a machine learning algorithm, usually a supervised learning one. Each row of the confusion matrix represents the instances of an actual class and each column represents the instances of a predicted class. This is the convention we use in this chapter of our tutorial, but it can be the other way around as well, i.e. rows for predicted classes and columns for actual classes. The name confusion matrix reflects the fact that it makes it easy for us to see what kinds of confusion occur in our classification algorithms. For example, the algorithm should have predicted a sample as $c_i$ because the actual class is $c_i$, but it came out with $c_j$. In this case of mislabelling, the element $cm[i, j]$ is incremented by one when the confusion matrix is constructed. A correct prediction increments the diagonal element $cm[i, i]$ in the same way.
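The construction described above can be sketched in a few lines. This is only a minimal illustration: the function name `build_confusion_matrix` and the six sample labels are made up for this example, and classes are assumed to be encoded as integers starting at 0.

```python
import numpy as np

def build_confusion_matrix(actual, predicted, n_classes):
    """Rows index the actual class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1   # one more sample of actual class a predicted as p
    return cm

# hypothetical labels for six samples and three classes
actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

cm = build_confusion_matrix(actual, predicted, 3)
print(cm)
```

Every sample lands in exactly one cell, so the sum of all entries equals the number of samples.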
Later in this chapter we will define functions that calculate precision and recall from a confusion matrix.
2-class Case
In a 2-class case, i.e. "negative" and "positive", the confusion matrix may look like this:
                    predicted
actual        negative    positive
negative         11           0
positive          1          12

The fields of the matrix mean the following:
                    predicted
actual        negative               positive
negative      TN (true negative)     FP (false positive)
positive      FN (false negative)    TP (true positive)

We can now define some important performance measures used in machine learning:
Accuracy:
$$AC = \frac {TN + TP}{TN + FP + FN + TP}$$
The accuracy is not always an adequate performance measure. Let us assume we have 1000 samples, 995 of which are negative and 5 of which are positive. Let us further assume we have a classifier which classifies everything it is presented with as negative. The accuracy will be a surprising 99.5%, even though the classifier could not recognize a single positive sample.
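The arithmetic behind this example is easy to check. The following sketch simply plugs the four counts of the always-negative classifier into the accuracy formula; the variable name `positives_found` is chosen for this illustration only.

```python
# 1000 samples: 995 negative, 5 positive.
# The classifier labels every sample as negative.
TN, FP, FN, TP = 995, 0, 5, 0

accuracy = (TN + TP) / (TN + FP + FN + TP)

# fraction of the actual positives that the classifier detected
positives_found = TP / (FN + TP)

print(accuracy)
print(positives_found)
```

The accuracy is high only because the negative class dominates the data; not one of the five positive samples is detected.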
Recall, also known as True Positive Rate:
$$recall = \frac {TP}{FN + TP}$$
True Negative Rate:
$$TNR = \frac {TN}{TN + FP}$$
Precision:
$$precision = \frac {TP}{FP + TP}$$
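As a quick check, we can apply these measures to the 2-class confusion matrix from the beginning of this section. The sketch below assumes the orientation used in this chapter (rows are actual classes, columns are predicted classes, "negative" before "positive"); the variable names are chosen for this example.

```python
import numpy as np

# rows: actual (negative, positive); columns: predicted (negative, positive)
cm2 = np.array([[11, 0],
                [1, 12]])

# flatten row by row: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = cm2.ravel()

accuracy = (TN + TP) / (TN + FP + FN + TP)
recall = TP / (FN + TP)        # true positive rate
tnr = TN / (TN + FP)           # true negative rate
precision = TP / (FP + TP)

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
print(f"TNR:       {tnr:.3f}")
print(f"precision: {precision:.3f}")
```

The single false negative lowers the recall to 12/13, while precision and true negative rate stay at 1 because there are no false positives.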
Multiclass Case
The 2-class confusion matrix is not sufficient to measure the results of machine learning algorithms that distinguish more than two classes. We need a generalization for the multiclass case.
Let us assume that we have a sample of 25 animals, e.g. 7 cats, 8 dogs, and 10 snakes, most probably Python snakes. The confusion matrix of our recognition algorithm may look like the following table:
                predicted
actual      dog    cat    snake
dog          6      2       0
cat          1      6       0
snake        1      1       8

In this confusion matrix, the system correctly predicted six of the eight actual dogs, but in two cases it took a dog for a cat. The seven actual cats were correctly recognized in six cases, but in one case a cat was taken to be a dog. Usually, it is hard to take a snake for a dog or a cat, but this is what happened to our classifier in two cases. Yet, eight out of ten snakes were correctly recognized. (Most probably this machine learning algorithm was not written in a Python program, because Python should properly recognize its own species :) )
You can see that all correct predictions are located on the diagonal of the table, so prediction errors can be easily found in the table, as they are represented by the values outside the diagonal.
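To illustrate this, the following sketch counts the correct predictions on the diagonal of the animal matrix and lists every non-zero entry outside of it. The names `labels` and `confusions` are chosen for this example only.

```python
import numpy as np

labels = ["dog", "cat", "snake"]
# rows: actual class, columns: predicted class
cm = np.array([[6, 2, 0],
               [1, 6, 0],
               [1, 1, 8]])

correct = np.trace(cm)          # sum of the diagonal
errors = cm.sum() - correct     # everything outside the diagonal

# collect the non-zero off-diagonal cells as (actual, predicted, count)
confusions = [(labels[i], labels[j], int(cm[i, j]))
              for i, j in zip(*np.nonzero(cm)) if i != j]

print("correct:", correct, "errors:", errors)
for actual, predicted, count in confusions:
    print(f"{count} x actual {actual} predicted as {predicted}")
```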
We can generalize precision and recall to the multiclass case. To do this we sum over the rows and columns of the confusion matrix $M$. Given that the matrix is oriented as above, i.e. that a given row of the matrix corresponds to a specific value of the "truth", we have for each class $i$:
$$Precision_i = \frac{M_{ii}}{\sum_j M_{ji}}$$
$$Recall_i = \frac{M_{ii}}{\sum_j M_{ij}}$$
This means that precision is the fraction of cases where the algorithm correctly predicted class i out of all instances where the algorithm predicted i (correctly and incorrectly). Recall, on the other hand, is the fraction of cases where the algorithm correctly predicted i out of all the cases which are actually labelled as i.
Let us apply this to our example:
The precision for our animals can be calculated as
$$precision_{dogs} = 6 / (6 + 1 + 1) = 3/4 = 0.75$$
$$precision_{cats} = 6 / (2 + 6 + 1) = 6/9 \approx 0.67$$
$$precision_{snakes} = 8 / (0 + 0 + 8) = 1$$
The recall is calculated like this:
$$recall_{dogs} = 6 / (6 + 2 + 0) = 3/4 = 0.75$$
$$recall_{cats} = 6 / (1 + 6 + 0) = 6/7 \approx 0.86$$
$$recall_{snakes} = 8 / (1 + 1 + 8) = 4/5 = 0.8$$
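The same numbers can be obtained for all classes at once with Numpy. The following is a minimal sketch, assuming the orientation used above (rows are actual classes, columns are predicted classes): the column sums are the denominators of the precision and the row sums are the denominators of the recall.

```python
import numpy as np

# rows: actual class, columns: predicted class (order: dog, cat, snake)
cm = np.array([[6, 2, 0],
               [1, 6, 0],
               [1, 1, 8]])

diagonal = np.diag(cm)                   # correct predictions per class
precision = diagonal / cm.sum(axis=0)    # column sums: all predictions of class i
recall = diagonal / cm.sum(axis=1)       # row sums: all actual members of class i

for name, p, r in zip(["dogs", "cats", "snakes"], precision, recall):
    print(f"{name:7s} precision: {p:.2f}  recall: {r:.2f}")
```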
Example
We are now ready to code this in Python. The following code shows a confusion matrix for a multiclass machine learning problem with ten labels, for example an algorithm for recognizing the ten digits from handwritten characters.
If you are not familiar with Numpy and Numpy arrays, we recommend our tutorial on Numpy.
import numpy as np

cm = np.array(
    [[5825,    1,   49,   23,    7,   46,   30,   12,   21,   26],
     [   1, 6654,   48,   25,   10,   32,   19,   62,  111,   10],
     [   2,   20, 5561,   69,   13,   10,    2,   45,   18,    2],
     [   6,   26,   99, 5786,    5,  111,    1,   41,  110,   79],
     [   4,   10,   43,    6, 5533,   32,   11,   53,   34,   79],
     [   3,    1,    2,   56,    0, 4954,   23,    0,   12,    5],
     [  31,    4,   42,   22,   45,  103, 5806,    3,   34,    3],
     [   0,    4,   30,   29,    5,    6,    0, 5817,    2,   28],
     [  35,    6,   63,   58,    8,   59,   26,   13, 5394,   24],
     [  16,   16,   21,   57,  216,   68,    0,  219,  115, 5693]])
The functions 'precision' and 'recall' calculate the values for a single label, whereas the functions 'precision_macro_average' and 'recall_macro_average' calculate the precision and the recall for the whole classification problem by averaging over all labels.
def precision(label, confusion_matrix):
    col = confusion_matrix[:, label]
    return confusion_matrix[label, label] / col.sum()

def recall(label, confusion_matrix):
    row = confusion_matrix[label, :]
    return confusion_matrix[label, label] / row.sum()

def precision_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_precisions = 0
    for label in range(rows):
        sum_of_precisions += precision(label, confusion_matrix)
    return sum_of_precisions / rows

def recall_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_recalls = 0
    for label in range(columns):
        sum_of_recalls += recall(label, confusion_matrix)
    return sum_of_recalls / columns
print("label precision recall")
for label in range(10):
    print(f"{label:5d} {precision(label, cm):9.3f} {recall(label, cm):6.3f}")

label precision recall
    0     0.983  0.964
    1     0.987  0.954
    2     0.933  0.968
    3     0.944  0.924
    4     0.947  0.953
    5     0.914  0.980
    6     0.981  0.953
    7     0.928  0.982
    8     0.922  0.949
    9     0.957  0.887

print("precision total:", precision_macro_average(cm))
print("recall total:", recall_macro_average(cm))

precision total: 0.949688556405
recall total: 0.951453154788
def accuracy(confusion_matrix):
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements

accuracy(cm)

After having executed the Python code above we received the following:

0.95038333333333336