Algorithm: Classification Metrics from Scratch
Implement the core classification evaluation metrics: precision, recall, F1 score, confusion matrix, and ROC-AUC. Understanding these metrics deeply is critical for every ML engineer -- choosing the wrong metric can make or break a model deployment.
Problem Statement
Implement:
- confusion_matrix(y_true, y_pred) -- count TP, FP, FN, TN
- precision(y_true, y_pred) -- TP / (TP + FP)
- recall(y_true, y_pred) -- TP / (TP + FN)
- f1_score(y_true, y_pred) -- harmonic mean of precision and recall
- roc_auc(y_true, y_scores) -- area under the ROC curve from continuous scores
Handle edge cases: empty predictions, all-positive/all-negative, ties in scores.
Inputs: Ground truth labels y_true (0 or 1), predicted labels y_pred (0 or 1), and predicted scores y_scores (continuous values) for AUC.
Outputs: Scalar metric values.
Precision answers "of all the items I predicted positive, how many are truly positive?" Recall answers "of all the truly positive items, how many did I find?" These are often in tension: a model that predicts everything as positive has perfect recall but low precision. The F1 score is their harmonic mean, penalizing models that sacrifice one for the other. AUC measures the probability that a random positive is ranked higher than a random negative, evaluating the model across all possible thresholds.
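This tension is easy to see on a small worked example. The toy labels below are made up for illustration: four actual positives, of which the model finds two, plus one false alarm.

```python
# Toy labels for illustration: 4 actual positives, model finds 2 of them
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2 hits
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1 false alarm
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 2 misses

precision = tp / (tp + fp)  # 2/3: of 3 positive predictions, 2 were right
recall = tp / (tp + fn)     # 1/2: of 4 actual positives, 2 were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.571
print(precision, recall, f1)
```

Note that the harmonic mean (about 0.571) sits below the arithmetic mean (about 0.583) of precision and recall; the gap widens as the two diverge.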
┌──────────────────────────────────────────────────────────────────┐
│ Confusion Matrix and Metric Relationships │
│ │
│ Predicted │
│ Pos Neg │
│ ┌─────────┬─────────┐ │
│ Actual │ TP │ FN │ Recall = TP / (TP + FN) │
│ Pos │ (hit) │ (miss) │ "How many positives did I find?"│
│ ├─────────┼─────────┤ │
│ Actual │ FP │ TN │ │
│ Neg │ (false │(correct │ │
│ │ alarm) │ reject) │ │
│ └─────────┴─────────┘ │
│ │ │
│ Precision = TP / (TP + FP) │
│ "Of my positive predictions, how many are right?" │
│ │
│ F1 Score = 2 * P * R / (P + R) — harmonic mean │
│ (penalizes models that sacrifice one metric for the other) │
│ │
│ ROC Curve: │
│ TPR │ ╱────── │
│ 1.0 │ ╱ │
│ │ ╱ AUC = area │
│ │ ╱ under curve │
│ 0.5 │ ╱ ╱ random (AUC=0.5) │
│ │ ╱ ╱ │
│ 0.0 │╱──────────── FPR │
│ 0.0 1.0 │
└──────────────────────────────────────────────────────────────────┘
When computing precision and recall, always handle the zero-denominator edge case. If no positive predictions were made (TP + FP = 0), precision is undefined -- return 0.0 by convention. Similarly, if there are no actual positives (TP + FN = 0), recall is undefined. Forgetting this check causes division-by-zero errors in production systems, especially during early training when a model might predict all negatives.
A common interview question is "when would you optimize for precision vs. recall?" Use concrete examples: in spam filtering, high precision avoids annoying users with false positives (legitimate emails marked as spam). In cancer screening, high recall avoids missing actual cancer cases (false negatives are life-threatening). The F-beta score generalizes F1 with a parameter beta: it weights recall beta times as heavily as precision, so beta > 1 favors recall and beta < 1 favors precision. The F2 score is common in medical applications.
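The F-beta generalization can be sketched in a few lines. This is a minimal illustration of the standard formula F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), taking precision and recall as already-computed inputs:

```python
def fbeta_score(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    beta = 1 recovers the ordinary F1 score.
    """
    denom = beta**2 * precision + recall
    if denom == 0:  # undefined when both P and R are zero
        return 0.0
    return (1 + beta**2) * precision * recall / denom

# A model with P = 0.5, R = 1.0: F2 rewards the high recall more than F1 does
print(fbeta_score(0.5, 1.0, 1.0))  # F1 ~ 0.667
print(fbeta_score(0.5, 1.0, 2.0))  # F2 ~ 0.833
```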
Hints
- Build the confusion matrix first -- all other metrics derive from TP, FP, FN, TN.
- Precision = TP / (TP + FP). If TP + FP = 0 (no positive predictions), return 0.
- Recall = TP / (TP + FN). If TP + FN = 0 (no positive samples), return 0.
- F1 = 2 * precision * recall / (precision + recall). Handle the zero denominator case.
- For ROC-AUC: sort samples by score descending, sweep the threshold from high to low, and track the true positive rate and false positive rate at each step. Compute the area using the trapezoidal rule.
Solution
import numpy as np
from typing import Tuple, Dict
def confusion_matrix(
y_true: np.ndarray, y_pred: np.ndarray
) -> Dict[str, int]:
"""Compute binary confusion matrix components."""
y_true = np.asarray(y_true, dtype=int)
y_pred = np.asarray(y_pred, dtype=int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
def precision_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Precision = TP / (TP + FP)."""
cm = confusion_matrix(y_true, y_pred)
denominator = cm["tp"] + cm["fp"]
if denominator == 0:
return 0.0
return cm["tp"] / denominator
def recall_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Recall = TP / (TP + FN)."""
cm = confusion_matrix(y_true, y_pred)
denominator = cm["tp"] + cm["fn"]
if denominator == 0:
return 0.0
return cm["tp"] / denominator
def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""F1 = 2 * precision * recall / (precision + recall)."""
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
if p + r == 0:
return 0.0
return 2.0 * p * r / (p + r)
def roc_curve(
y_true: np.ndarray, y_scores: np.ndarray
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Compute the ROC curve (FPR, TPR) at various thresholds.
Returns:
fpr: False positive rates at each threshold.
tpr: True positive rates at each threshold.
thresholds: The score thresholds.
"""
    y_true = np.asarray(y_true, dtype=int)
    y_scores = np.asarray(y_scores, dtype=float)
    if len(y_scores) == 0:  # edge case: no samples at all
        return np.array([0.0]), np.array([0.0]), np.array([np.inf])
    # Sort by score descending
    sorted_indices = np.argsort(-y_scores)
    y_sorted = y_true[sorted_indices]
    scores_sorted = y_scores[sorted_indices]
total_positives = np.sum(y_true == 1)
total_negatives = np.sum(y_true == 0)
tpr_list = [0.0]
fpr_list = [0.0]
threshold_list = [scores_sorted[0] + 1.0] # threshold above all scores
tp = 0
fp = 0
for i in range(len(y_sorted)):
if y_sorted[i] == 1:
tp += 1
else:
fp += 1
# Only record a point when the score changes (or at the last element)
if i == len(y_sorted) - 1 or scores_sorted[i] != scores_sorted[i + 1]:
tpr = tp / total_positives if total_positives > 0 else 0.0
fpr = fp / total_negatives if total_negatives > 0 else 0.0
tpr_list.append(tpr)
fpr_list.append(fpr)
threshold_list.append(scores_sorted[i])
return np.array(fpr_list), np.array(tpr_list), np.array(threshold_list)
def roc_auc_score(y_true: np.ndarray, y_scores: np.ndarray) -> float:
"""
Compute Area Under the ROC Curve using the trapezoidal rule.
"""
fpr, tpr, _ = roc_curve(y_true, y_scores)
# Trapezoidal rule: sum of trapezoids between consecutive points
auc = 0.0
for i in range(1, len(fpr)):
auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
return auc
def average_precision_score(y_true: np.ndarray, y_scores: np.ndarray) -> float:
"""
Compute Average Precision (area under the precision-recall curve).
"""
y_true = np.asarray(y_true, dtype=int)
y_scores = np.asarray(y_scores, dtype=float)
sorted_indices = np.argsort(-y_scores)
y_sorted = y_true[sorted_indices]
tp = 0
ap = 0.0
for i in range(len(y_sorted)):
if y_sorted[i] == 1:
tp += 1
precision_at_i = tp / (i + 1)
ap += precision_at_i
total_positives = np.sum(y_true == 1)
return ap / total_positives if total_positives > 0 else 0.0
# ---------- demo ----------
if __name__ == "__main__":
np.random.seed(42)
# Generate predictions
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
cm = confusion_matrix(y_true, y_pred)
print("=== Confusion Matrix ===")
print(f" TP={cm['tp']}, FP={cm['fp']}")
print(f" FN={cm['fn']}, TN={cm['tn']}")
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"\nPrecision: {p:.4f}")
print(f"Recall: {r:.4f}")
print(f"F1 Score: {f1:.4f}")
# ROC-AUC with continuous scores
y_scores = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.7, 0.6, 0.1, 0.5, 0.15])
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)
print(f"\nROC-AUC: {auc:.4f}")
print(f"Avg Precision: {ap:.4f}")
# ROC curve points
print("\nROC Curve:")
for f, t, th in zip(fpr, tpr, thresholds):
print(f" FPR={f:.3f}, TPR={t:.3f}, threshold={th:.3f}")
# Edge cases
print("\n=== Edge Cases ===")
# All positive predictions
all_pos = np.ones(5, dtype=int)
print(f"All positive: P={precision_score(y_true[:5], all_pos):.4f}, "
f"R={recall_score(y_true[:5], all_pos):.4f}")
# No positive predictions
all_neg = np.zeros(5, dtype=int)
print(f"All negative: P={precision_score(y_true[:5], all_neg):.4f}, "
f"R={recall_score(y_true[:5], all_neg):.4f}")
# Perfect classifier
print(f"Perfect: AUC={roc_auc_score(y_true, y_true.astype(float)):.4f}")
# Random classifier
random_scores = np.random.rand(1000)
random_labels = np.random.randint(0, 2, 1000)
print(f"Random: AUC={roc_auc_score(random_labels, random_scores):.4f} (should be ~0.5)")
Walkthrough
- Confusion matrix -- The foundation: count how many predictions fall into each of the four categories (TP, FP, FN, TN). All threshold-dependent metrics derive from these counts.
- Precision -- Of all samples predicted positive (TP + FP), what fraction is truly positive? Important when false positives are costly (e.g., spam filter: marking legitimate email as spam).
- Recall -- Of all truly positive samples (TP + FN), what fraction did we catch? Important when false negatives are costly (e.g., disease screening: missing a sick patient).
- F1 score -- The harmonic mean of precision and recall. It is zero if either is zero, and it equals 1 only when both are perfect. The harmonic mean penalizes extreme imbalance more than the arithmetic mean.
- ROC-AUC -- Sweep the classification threshold from high to low. At each threshold, compute (FPR, TPR) and plot the point. The area under this curve measures ranking quality. AUC = 0.5 for random, 1.0 for perfect. Key insight: AUC equals the probability that a random positive is scored higher than a random negative.
- Average Precision -- The area under the precision-recall curve. More informative than AUC for imbalanced datasets where negatives greatly outnumber positives.
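The probabilistic interpretation of AUC can be verified directly against the trapezoidal computation: compare every positive's score against every negative's score and count wins (ties count as half). This sketch reuses the demo's labels and scores:

```python
import numpy as np

def auc_by_ranking(y_true, y_scores):
    """AUC as P(score of random positive > score of random negative).

    Ties contribute 0.5 per pair, matching the trapezoidal result.
    """
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0  # undefined without both classes; 0.0 by convention here
    # Broadcast to compare every positive against every negative
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.7, 0.6, 0.1, 0.5, 0.15])
print(auc_by_ranking(y_true, y_scores))  # 0.92
```

This O(n_pos * n_neg) pairwise form is quadratic, so the sort-and-sweep version is preferable at scale, but the two agree and the pairwise form makes the "ranking probability" reading concrete.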
Complexity Analysis
- Confusion matrix: O(n) for a single pass over predictions.
- Precision/Recall/F1: O(n) each (one pass for confusion matrix).
- ROC-AUC: O(n log n) for sorting + O(n) for the sweep.
- Average Precision: O(n log n) for sorting + O(n) for the sweep.
All metrics are trivially parallelizable and take negligible time compared to model inference.
Interview Tips
Critical knowledge: (1) When to use which metric -- AUC for balanced datasets and ranking tasks, F1 for imbalanced tasks, precision for cost-sensitive false positives, recall for cost-sensitive false negatives. (2) The precision-recall tradeoff: you can always increase recall by lowering the threshold, but precision will drop. (3) Why accuracy is misleading for imbalanced data: a 99% accuracy model on a 99/1 class split might predict everything as negative. (4) Micro vs. macro averaging for multi-class: micro aggregates all TP/FP/FN globally, macro averages per-class metrics. (5) Calibration: AUC does not measure calibration -- a well-calibrated model has predicted probabilities that match empirical frequencies.
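The micro vs. macro distinction in point (4) is worth being able to demonstrate. The three-class labels below are hypothetical, chosen so the two averages disagree:

```python
import numpy as np

# Hypothetical 3-class example where micro and macro precision differ
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

per_class = []
tp_total = fp_total = 0
for c in (0, 1, 2):
    tp = np.sum((y_true == c) & (y_pred == c))
    fp = np.sum((y_true != c) & (y_pred == c))
    per_class.append(tp / (tp + fp) if tp + fp else 0.0)
    tp_total += tp
    fp_total += fp

macro = np.mean(per_class)                # each class weighted equally
micro = tp_total / (tp_total + fp_total)  # counts pooled across classes
print(macro, micro)
```

Macro averaging gives the small classes 1 and 2 the same weight as the large class 0, pulling the average down; micro averaging pools the counts, so it is dominated by the majority class (and for single-label multi-class it coincides with accuracy).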
Quiz
1. For a highly imbalanced dataset (1% positive, 99% negative), why is ROC-AUC potentially misleading?
2. What does AUC = 0.5 signify about a classifier's performance?
3. What is the difference between micro-averaging and macro-averaging for multi-class metrics?