For the following, assume a $K=2$ binary classifier.
The training set contains a set of positive cases, $P$, and a set of negative cases, $N$, so that the total number of training examples is $M = |P| + |N|$.
$|P|$ and $|N|$ are condition positive and condition negative, respectively.
Prevalence: Condition positive divided by total population, $|P|/M$.
The classifier is attempting to identify the positive cases and its operation is denoted by the $\hat{\cdot}$ symbol.
After applying the classifier to the training set, we have the following four sets:
True positives, $\hat{P}$.
False positives, $\hat{N}$.
True negatives, $\bar{N} = N - \hat{N}$.
False negatives, $\bar{P} = P - \hat{P}$.
It helps to think of the hat indicating elements the test thinks are positive and the bar indicating elements that the test thinks are negative.
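These set definitions can be sketched directly with Python sets. The example indices and the `predicted_positive` set below are made-up toy values for illustration, not anything from the text:

```python
# Toy binary problem: sets hold example indices.
P = {0, 1, 2, 3}                   # condition-positive cases
N = {4, 5, 6, 7, 8, 9}             # condition-negative cases
predicted_positive = {0, 1, 2, 4}  # cases the (hypothetical) test thinks are positive

P_hat = P & predicted_positive     # true positives: hat = test thinks positive
N_hat = N & predicted_positive     # false positives
P_bar = P - P_hat                  # false negatives: bar = test thinks negative
N_bar = N - N_hat                  # true negatives

M = len(P) + len(N)
prevalence = len(P) / M
print(P_hat, N_hat, P_bar, N_bar, prevalence)
```

Note how $\hat{P} \cup \bar{P} = P$ and $\hat{N} \cup \bar{N} = N$: every case lands in exactly one of the four sets.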
Confusion Matrix
If we have $K$ classes, the confusion matrix is $K \times K$. The row and column indices correspond to the class indices, and element $(i, j)$ is the number of times a datapoint from class $i$ is identified by the algorithm as class $j$. A perfect predictor has a diagonal confusion matrix.
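A minimal confusion-matrix builder, assuming integer class labels $0 \ldots K-1$ (the label lists below are invented for illustration):

```python
def confusion_matrix(y_true, y_pred, K):
    """Count how often true class i is predicted as class j."""
    C = [[0] * K for _ in range(K)]
    for i, j in zip(y_true, y_pred):
        C[i][j] += 1   # row = true class, column = predicted class
    return C

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_pred, 3))
# A perfect predictor would leave only the diagonal entries non-zero.
```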
For this discussion, a conservative test is one that classifies as positive only when it is very sure (very few false positives). A sloppy test is one that is very quick to classify positive cases (lots of false positives).
Metrics that quantify how positive cases are classified (sum to 1):
True Positive Rate (Sensitivity/Recall/Hit Rate): $|\hat{P}|/|P|$; indicates the proportion of positive samples that are classified as positive.
False Negative Rate (Miss Rate/Type II Error): $|\bar{P}|/|P|$; indicates the proportion of positive samples that are classified as negative.
A conservative test would have a high false negative rate and a sloppy test would have a low one.
These two metrics tend to reward the same kind of test (conservative does poorly and sloppy does well). This is likely why most people only talk about sensitivity.
Metrics that quantify how negative cases are classified (sum to 1):
False Positive Rate (False Alarm Rate/Fall-Out/Type I Error): $|\hat{N}|/|N|$; indicates the proportion of negative samples that are classified as positive.
Conservative tests do well here because they only classify a case as positive when they are very sure. A sloppy test will score poorly since it lumps a bunch of negative cases in with its positive classifications.
True Negative Rate (Specificity): $|\bar{N}|/|N|$; indicates the proportion of negative samples that are classified as negative.
Conservative tests do well here since they mistake very few negative cases for positive; sloppy tests do poorly.
Both of these metrics reward a conservative test.
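The four rates can be computed from the four counts; the numbers below are invented to match a 10-positive/90-negative split and are only illustrative:

```python
# Illustrative counts: |P| = TP + FN = 10, |N| = FP + TN = 90.
TP, FN = 8, 2
FP, TN = 9, 81

tpr = TP / (TP + FN)   # sensitivity
fnr = FN / (TP + FN)   # miss rate
fpr = FP / (FP + TN)   # false alarm rate
tnr = TN / (FP + TN)   # specificity

# Each pair partitions its class, so each pair sums to 1.
assert abs(tpr + fnr - 1) < 1e-12
assert abs(fpr + tnr - 1) < 1e-12
print(tpr, fnr, fpr, tnr)   # 0.8 0.2 0.1 0.9
```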
ROC curves visualize the classic tradeoff between true positive rate (sensitivity) and false positive rate (false alarm rate), letting us strike the right balance between conservative and sloppy testing.
There are a variety of other metrics that attempt to quantify overall test performance:
Accuracy is a poor metric here since only 3% (clustering) or 4.8% (DI definition) of DI clients are chronic; a test that simply labeled everyone as not chronic would be over 95% accurate while catching no one.
Confidence is important: if a test indicates a client is chronic, we want to be sure that they actually are (this is what precision, the positive predictive value, measures).
Sensitivity is also important since we want to ensure we’re catching everyone who needs help.
We can use false discovery rate to estimate how many folks are getting help who may not need it.
Receiver Operating Characteristic (ROC) Curves
An ROC curve plots true positive rate (sensitivity) versus false positive rate (false alarm rate).
True positive rate rewards a sloppy test and false positive rate rewards a conservative test.
Any binary classifier can be represented as a single point in ROC space, but classifiers that use a threshold are often represented as curves generated by sweeping the threshold over all possible values.
For a very low threshold, all values are classified as positive so true positive rate is 1 (all positives are classified as positive) and false alarm rate is 1 (all negatives are classified as positive).
For a very high threshold, all values are classified as negative so the true positive rate is 0 (no positives are classified as positive) and false alarm rate is 0 (no negatives are classified as positive).
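The sweep can be sketched as follows; the scores and labels are made up for illustration (higher score means the classifier leans positive), and the two extreme thresholds reproduce the endpoints described above:

```python
# Invented classifier scores and ground-truth labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,    1,   0,   0,   0,   0]

def roc_point(threshold):
    """Return (false alarm rate, sensitivity) at a given threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

print(roc_point(-1.0))  # very low threshold:  everything positive -> (1.0, 1.0)
print(roc_point(2.0))   # very high threshold: everything negative -> (0.0, 0.0)
curve = [roc_point(t / 10) for t in range(11)]  # sweep to trace the curve
```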
The diagonal line on the ROC plot, where sensitivity equals false alarm rate, corresponds to chance performance; a classifier that falls below it is doing worse than random guessing, though inverting its predictions would place it back above the line.
Example:
Consider a dataset of 100 cases where $|P| = 10$ and $|N| = 90$. A random test that classifies each case as positive with probability 0.5 would, on average, flag 5 of the 10 positives (sensitivity 0.5) and 45 of the 90 negatives (false alarm rate 0.5), placing it on the slope-1 line.
The other points on the slope-1 line would be achieved by tests that randomly assign positive labels with a different probability (i.e., a random test that classifies 20% of cases as positive would have a sensitivity and false alarm rate of 0.2).
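A quick simulation illustrates this: a test that ignores the data and labels each case positive with probability $p$ lands, in expectation, at $(p, p)$ on the diagonal. The $|P| = 10$, $|N| = 90$ split follows the example above; the trial count and seed are arbitrary choices:

```python
import random

random.seed(0)
p = 0.2          # probability the random test labels any case positive
trials = 20000
tpr_sum = fpr_sum = 0.0
for _ in range(trials):
    tp = sum(random.random() < p for _ in range(10))   # positives flagged
    fp = sum(random.random() < p for _ in range(90))   # negatives flagged
    tpr_sum += tp / 10
    fpr_sum += fp / 90

# Both averages converge toward p = 0.2, i.e. the point (0.2, 0.2).
print(tpr_sum / trials, fpr_sum / trials)
```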