Algorithm: Classification Metrics from Scratch
Implement the core classification evaluation metrics: precision, recall, F1 score, confusion matrix, and ROC-AUC. Understanding these metrics deeply is critical for every ML engineer -- choosing the wrong metric can make or break a model deployment.
Problem Statement
Implement:
- confusion_matrix(y_true, y_pred) -- count TP, FP, FN, TN
- precision(y_true, y_pred) -- TP / (TP + FP)
- recall(y_true, y_pred) -- TP / (TP + FN)
- f1_score(y_true, y_pred) -- harmonic mean of precision and recall
- roc_auc(y_true, y_scores) -- area under the ROC curve from continuous scores
Handle edge cases: empty predictions, all-positive/all-negative, ties in scores.
Inputs: Ground truth labels y_true (0 or 1), predicted labels y_pred (0 or 1), and predicted scores y_scores (continuous values) for AUC.
Outputs: Scalar metric values.
Precision answers "of all the items I predicted positive, how many are truly positive?" Recall answers "of all the truly positive items, how many did I find?" These are often in tension: a model that predicts everything as positive has perfect recall but low precision. The F1 score is their harmonic mean, penalizing models that sacrifice one for the other. AUC measures the probability that a random positive is ranked higher than a random negative, evaluating the model across all possible thresholds.
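This tension is easy to see on a small worked example. The toy labels below are made up for illustration: four actual positives, of which the model finds two, plus one false alarm.

```python
# Toy labels for illustration: 4 actual positives, model finds 2 of them
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2 hits
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1 false alarm
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 2 misses

precision = tp / (tp + fp)  # 2/3: of 3 positive predictions, 2 were right
recall = tp / (tp + fn)     # 1/2: of 4 actual positives, 2 were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.571
print(precision, recall, f1)
```

Note that the harmonic mean (about 0.571) sits below the arithmetic mean (about 0.583) of precision and recall; the gap widens as the two diverge.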
┌──────────────────────────────────────────────────────────────────┐
│ Confusion Matrix and Metric Relationships │
│ │
│ Predicted │
│ Pos Neg │
│ ┌─────────┬─────────┐ │
│ Actual │ TP │ FN │ Recall = TP / (TP + FN) │
│ Pos │ (hit) │ (miss) │ "How many positives did I find?"│
│ ├─────────┼─────────┤ │
│ Actual │ FP │ TN │ │
│ Neg │ (false │(correct │ │
│ │ alarm) │ reject) │ │
│ └─────────┴─────────┘ │
│ │ │
│ Precision = TP / (TP + FP) │
│ "Of my positive predictions, how many are right?" │
│ │
│ F1 Score = 2 * P * R / (P + R) — harmonic mean │
│ (penalizes models that sacrifice one metric for the other) │
│ │
│ ROC Curve: │
│ TPR │ ╱────── │
│ 1.0 │ ╱ │
│ │ ╱ AUC = area │
│ │ ╱ under curve │
│ 0.5 │ ╱ ╱ random (AUC=0.5) │
│ │ ╱ ╱ │
│ 0.0 │╱──────────── FPR │
│ 0.0 1.0 │
└──────────────────────────────────────────────────────────────────┘
When computing precision and recall, always handle the zero-denominator edge case. If no positive predictions were made (TP + FP = 0), precision is undefined -- return 0.0 by convention. Similarly, if there are no actual positives (TP + FN = 0), recall is undefined. Forgetting this check causes division-by-zero errors in production systems, especially during early training when a model might predict all negatives.
A common interview question is "when would you optimize for precision vs. recall?" Use concrete examples: in spam filtering, high precision avoids annoying users with false positives (legitimate emails marked as spam). In cancer screening, high recall avoids missing actual cancer cases (false negatives are life-threatening). The F-beta score generalizes F1 with a parameter beta: it weights recall beta times as heavily as precision, so beta > 1 favors recall and beta < 1 favors precision. The F2 score is common in medical applications.
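The F-beta generalization can be sketched in a few lines. This is a minimal illustration of the standard formula F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), taking precision and recall as already-computed inputs:

```python
def fbeta_score(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    beta = 1 recovers the ordinary F1 score.
    """
    denom = beta**2 * precision + recall
    if denom == 0:  # undefined when both P and R are zero
        return 0.0
    return (1 + beta**2) * precision * recall / denom

# A model with P = 0.5, R = 1.0: F2 rewards the high recall more than F1 does
print(fbeta_score(0.5, 1.0, 1.0))  # F1 ~ 0.667
print(fbeta_score(0.5, 1.0, 2.0))  # F2 ~ 0.833
```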
Hints
- Build the confusion matrix first -- all other metrics derive from TP, FP, FN, TN.
- Precision = TP / (TP + FP). If TP + FP = 0 (no positive predictions), return 0.
- Recall = TP / (TP + FN). If TP + FN = 0 (no positive samples), return 0.
- F1 = 2 * precision * recall / (precision + recall). Handle the zero denominator case.
- For ROC-AUC: sort samples by score descending, sweep the threshold from high to low, and track the true positive rate and false positive rate at each step. Compute the area using the trapezoidal rule.
Solution
import numpy as np
from typing import Tuple, Dict
def confusion_matrix(
y_true: np.ndarray, y_pred: np.ndarray
) -> Dict[str, int]:
"""Compute binary confusion matrix components."""
y_true = np.asarray(y_true, dtype=int)
y_pred = np.asarray(y_pred, dtype=int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
def precision_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Precision = TP / (TP + FP)."""
cm = confusion_matrix(y_true, y_pred)
denominator = cm["tp"] + cm["fp"]
if denominator == 0:
return 0.0
return cm["tp"] / denominator
def recall_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Recall = TP / (TP + FN)."""
cm = confusion_matrix(y_true, y_pred)
denominator = cm["tp"] + cm["fn"]
if denominator == 0:
return 0.0
return cm["tp"] / denominator
def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""F1 = 2 * precision * recall / (precision + recall)."""
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
if p + r == 0:
return 0.0
return 2.0 * p * r / (p + r)
def roc_curve(
y_true: np.ndarray, y_scores: np.ndarray
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Compute the ROC curve (FPR, TPR) at various thresholds.
Returns:
fpr: False positive rates at each threshold.
tpr: True positive rates at each threshold.
thresholds: The score thresholds.
"""
    y_true = np.asarray(y_true, dtype=int)
    y_scores = np.asarray(y_scores, dtype=float)
    if len(y_scores) == 0:  # edge case: no samples at all
        return np.array([0.0]), np.array([0.0]), np.array([np.inf])
    # Sort by score descending
    sorted_indices = np.argsort(-y_scores)
    y_sorted = y_true[sorted_indices]
    scores_sorted = y_scores[sorted_indices]
total_positives = np.sum(y_true == 1)
total_negatives = np.sum(y_true == 0)
tpr_list = [0.0]
fpr_list = [0.0]
threshold_list = [scores_sorted[0] + 1.0] # threshold above all scores
tp = 0
fp = 0
for i in range(len(y_sorted)):
if y_sorted[i] == 1:
tp += 1
else:
fp += 1
# Only record a point when the score changes (or at the last element)
if i == len(y_sorted) - 1 or scores_sorted[i] != scores_sorted[i + 1]:
tpr = tp / total_positives if total_positives > 0 else 0.0
fpr = fp / total_negatives if total_negatives > 0 else 0.0
tpr_list.append(tpr)
fpr_list.append(fpr)
threshold_list.append(scores_sorted[i])
return np.array(fpr_list), np.array(tpr_list), np.array(threshold_list)
def roc_auc_score(y_true: np.ndarray, y_scores: np.ndarray) -> float:
"""
Compute Area Under the ROC Curve using the trapezoidal rule.
"""
fpr, tpr, _ = roc_curve(y_true, y_scores)
# Trapezoidal rule: sum of trapezoids between consecutive points
auc = 0.0
for i in range(1, len(fpr)):
auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
return auc
def average_precision_score(y_true: np.ndarray, y_scores: np.ndarray) -> float:
"""
Compute Average Precision (area under the precision-recall curve).
"""
y_true = np.asarray(y_true, dtype=int)
y_scores = np.asarray(y_scores, dtype=float)
sorted_indices = np.argsort(-y_scores)
y_sorted = y_true[sorted_indices]
tp = 0
ap = 0.0
for i in range(len(y_sorted)):
if y_sorted[i] == 1:
tp += 1
precision_at_i = tp / (i + 1)
ap += precision_at_i
total_positives = np.sum(y_true == 1)
return ap / total_positives if total_positives > 0 else 0.0
# ---------- demo ----------
if __name__ == "__main__":
np.random.seed(42)
# Generate predictions
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
cm = confusion_matrix(y_true, y_pred)
print("=== Confusion Matrix ===")
print(f" TP={cm['tp']}, FP={cm['fp']}")
print(f" FN={cm['fn']}, TN={cm['tn']}")
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"\nPrecision: {p:.4f}")
print(f"Recall: {r:.4f}")
print(f"F1 Score: {f1:.4f}")
# ROC-AUC with continuous scores
y_scores = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.7, 0.6, 0.1, 0.5, 0.15])
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)
print(f"\nROC-AUC: {auc:.4f}")
print(f"Avg Precision: {ap:.4f}")
# ROC curve points
print("\nROC Curve:")
for f, t, th in zip(fpr, tpr, thresholds):
print(f" FPR={f:.3f}, TPR={t:.3f}, threshold={th:.3f}")
# Edge cases
print("\n=== Edge Cases ===")
# All positive predictions
all_pos = np.ones(5, dtype=int)
print(f"All positive: P={precision_score(y_true[:5], all_pos):.4f}, "
f"R={recall_score(y_true[:5], all_pos):.4f}")
# No positive predictions
all_neg = np.zeros(5, dtype=int)
print(f"All negative: P={precision_score(y_true[:5], all_neg):.4f}, "
f"R={recall_score(y_true[:5], all_neg):.4f}")
# Perfect classifier
print(f"Perfect: AUC={roc_auc_score(y_true, y_true.astype(float)):.4f}")
# Random classifier
random_scores = np.random.rand(1000)
random_labels = np.random.randint(0, 2, 1000)
print(f"Random: AUC={roc_auc_score(random_labels, random_scores):.4f} (should be ~0.5)")
Walkthrough
- Confusion matrix -- The foundation: count how many predictions fall into each of the four categories (TP, FP, FN, TN). All threshold-dependent metrics derive from these counts.
- Precision -- Of all samples predicted positive (TP + FP), what fraction is truly positive? Important when false positives are costly (e.g., spam filter: marking legitimate email as spam).
- Recall -- Of all truly positive samples (TP + FN), what fraction did we catch? Important when false negatives are costly (e.g., disease screening: missing a sick patient).
- F1 score -- The harmonic mean of precision and recall. It is zero if either is zero, and it equals 1 only when both are perfect. The harmonic mean penalizes extreme imbalance more than the arithmetic mean.
- ROC-AUC -- Sweep the classification threshold from high to low. At each threshold, compute (FPR, TPR) and plot the point. The area under this curve measures ranking quality. AUC = 0.5 for random, 1.0 for perfect. Key insight: AUC equals the probability that a random positive is scored higher than a random negative.
- Average Precision -- The area under the precision-recall curve. More informative than AUC for imbalanced datasets where negatives greatly outnumber positives.
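The probabilistic interpretation of AUC can be verified directly against the trapezoidal computation: compare every positive's score against every negative's score and count wins (ties count as half). This sketch reuses the demo's labels and scores:

```python
import numpy as np

def auc_by_ranking(y_true, y_scores):
    """AUC as P(score of random positive > score of random negative).

    Ties contribute 0.5 per pair, matching the trapezoidal result.
    """
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0  # undefined without both classes; 0.0 by convention here
    # Broadcast to compare every positive against every negative
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.7, 0.6, 0.1, 0.5, 0.15])
print(auc_by_ranking(y_true, y_scores))  # 0.92
```

This O(n_pos * n_neg) pairwise form is quadratic, so the sort-and-sweep version is preferable at scale, but the two agree and the pairwise form makes the "ranking probability" reading concrete.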
Complexity Analysis
- Confusion matrix: O(n) for a single pass over predictions.
- Precision/Recall/F1: O(n) each (one pass for confusion matrix).
- ROC-AUC: O(n log n) for sorting + O(n) for the sweep.
- Average Precision: O(n log n) for sorting + O(n) for the sweep.
All metrics are trivially parallelizable and take negligible time compared to model inference.
Interview Tips
Critical knowledge: (1) When to use which metric -- AUC for balanced datasets and ranking tasks, F1 for imbalanced tasks, precision for cost-sensitive false positives, recall for cost-sensitive false negatives. (2) The precision-recall tradeoff: you can always increase recall by lowering the threshold, but precision will drop. (3) Why accuracy is misleading for imbalanced data: a 99% accuracy model on a 99/1 class split might predict everything as negative. (4) Micro vs. macro averaging for multi-class: micro aggregates all TP/FP/FN globally, macro averages per-class metrics. (5) Calibration: AUC does not measure calibration -- a well-calibrated model has predicted probabilities that match empirical frequencies.
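The micro vs. macro distinction in point (4) is worth being able to demonstrate. The three-class labels below are hypothetical, chosen so the two averages disagree:

```python
import numpy as np

# Hypothetical 3-class example where micro and macro precision differ
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

per_class = []
tp_total = fp_total = 0
for c in (0, 1, 2):
    tp = np.sum((y_true == c) & (y_pred == c))
    fp = np.sum((y_true != c) & (y_pred == c))
    per_class.append(tp / (tp + fp) if tp + fp else 0.0)
    tp_total += tp
    fp_total += fp

macro = np.mean(per_class)                # each class weighted equally
micro = tp_total / (tp_total + fp_total)  # counts pooled across classes
print(macro, micro)
```

Macro averaging gives the small classes 1 and 2 the same weight as the large class 0, pulling the average down; micro averaging pools the counts, so it is dominated by the majority class (and for single-label multi-class it coincides with accuracy).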
Quiz
1. For a highly imbalanced dataset (1% positive, 99% negative), why is ROC-AUC potentially misleading?
2. What does AUC = 0.5 signify about a classifier's performance?
3. What is the difference between micro-averaging and macro-averaging for multi-class metrics?