Akshay’s Gradient
ML Coding · Beginner · 45 min

Linear Regression + Gradient Descent

Implement Linear Regression from Scratch

Build a complete linear regression model using only NumPy, training it with gradient descent. This is the foundational optimization loop behind all of deep learning.

Problem Statement

Implement a LinearRegression class that:

  • Initializes weights and bias to zeros (or small random values)
  • Implements a predict(X) method computing X @ w + b
  • Implements a fit(X, y, lr, epochs) method using batch gradient descent to minimize MSE loss
  • Returns the training loss history

Inputs: Feature matrix X of shape (n_samples, n_features), target vector y of shape (n_samples,).

Outputs: Learned weight vector w of shape (n_features,), scalar bias b, and a list of per-epoch MSE losses.

Key Concept

Linear regression minimizes Mean Squared Error: L = (1/n) * ||Xw + b - y||^2. The gradient with respect to w is (2/n) * X^T (Xw + b - y) and with respect to b is (2/n) * sum(Xw + b - y). Gradient descent iteratively updates the parameters in the direction opposite the gradient, which locally decreases the loss.
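A quick way to sanity-check these formulas is a finite-difference comparison. The sketch below (variable names are illustrative, not part of the exercise API) checks the analytic gradients against central differences:

```python
import numpy as np

# Verify the analytic MSE gradients against central finite differences.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
w = rng.standard_normal(3)
b = 0.5
n = len(y)

def loss(w, b):
    r = X @ w + b - y
    return float(np.mean(r ** 2))

error = X @ w + b - y
dw = (2.0 / n) * (X.T @ error)    # analytic gradient w.r.t. w
db = (2.0 / n) * np.sum(error)    # analytic gradient w.r.t. b

eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    dw_num[j] = (loss(w + e, b) - loss(w - e, b)) / (2 * eps)
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(np.max(np.abs(dw - dw_num)))  # tiny, on the order of 1e-9
```

If the analytic and numeric gradients disagree by more than roundoff, the derivation (or the code) has a bug.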

Interactive · Gradient Descent Step for Linear Regression
┌──────────────────────────────────────────────────────────────────┐
│            Gradient Descent for Linear Regression                │
│                                                                  │
│   Epoch 0           Epoch k              Epoch final             │
│                                                                  │
│   w = [0,0,0]       w ≈ [1.5,-0.7,0.3]   w ≈ [2.0,-1.0,0.5]      │
│   b = 0.0           b ≈ 2.1              b ≈ 3.0                 │
│                                                                  │
│   For each epoch:                                                │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐      │
│   │  y_hat =    │    │ error =     │    │ dw = (2/n) *    │      │
│   │  X @ w + b  │──▶ │ y_hat - y   │──▶ │   X^T @ error   │      │
│   │  (forward)  │    │ (residual)  │    │ db = (2/n) *    │      │
│   └─────────────┘    └─────────────┘    │   sum(error)    │      │
│                                         └────────┬────────┘      │
│                                                  │               │
│                                                  ▼               │
│                                         ┌─────────────────┐      │
│                                         │ w -= lr * dw    │      │
│                                         │ b -= lr * db    │      │
│                                         │ (param update)  │      │
│                                         └─────────────────┘      │
│                                                                  │
│   Loss: ████████████  ──▶  █████  ──▶  █  (decreasing)           │
└──────────────────────────────────────────────────────────────────┘

Hints

Info
  1. Start by computing predictions: y_hat = X @ w + b.
  2. Compute the residual: error = y_hat - y.
  3. The MSE loss is mean(error ** 2).
  4. Gradient for w: (2/n) * X.T @ error. Gradient for b: (2/n) * sum(error).
  5. Update: w -= lr * dw, b -= lr * db.
  6. Repeat for the specified number of epochs and record the loss each time.
Warning

A common mistake is forgetting to divide the gradient by n (the number of samples). Without this normalization, the gradient magnitude scales with the dataset size, making the learning rate dataset-dependent. Also watch out for using X @ error instead of X.T @ error -- the transpose is critical for getting shapes right.
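A minimal repro of the transpose mistake: with a non-square X, the un-transposed product does not even have compatible shapes, so NumPy raises immediately. (With a square X, the bug would instead fail silently and produce wrong numbers.)

```python
import numpy as np

# Why X.T @ error (not X @ error) is required.
X = np.zeros((200, 3))      # (n_samples, n_features)
error = np.zeros(200)       # (n_samples,)

print((X.T @ error).shape)  # (3,) -- one gradient entry per feature

try:
    X @ error               # (200, 3) @ (200,): inner dims 3 vs 200 mismatch
except ValueError as e:
    print("shape error:", e)
```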

Solution

import numpy as np
from typing import Tuple, List


class LinearRegression:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, n_features: int) -> None:
        self.w = np.zeros(n_features)  # weight vector
        self.b = 0.0                   # bias scalar

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Compute predictions: y_hat = X @ w + b."""
        return X @ self.w + self.b

    def compute_loss(self, y_hat: np.ndarray, y: np.ndarray) -> float:
        """Mean Squared Error."""
        return float(np.mean((y_hat - y) ** 2))

    def compute_gradients(
        self, X: np.ndarray, y: np.ndarray, y_hat: np.ndarray
    ) -> Tuple[np.ndarray, float]:
        """Compute gradients of MSE w.r.t. w and b."""
        n = len(y)
        error = y_hat - y                # shape (n,)
        dw = (2.0 / n) * (X.T @ error)   # shape (n_features,)
        db = (2.0 / n) * np.sum(error)   # scalar
        return dw, db

    def fit(
        self,
        X: np.ndarray,
        y: np.ndarray,
        lr: float = 0.01,
        epochs: int = 1000,
    ) -> List[float]:
        """Train with batch gradient descent. Returns loss history."""
        losses: List[float] = []
        for _ in range(epochs):
            y_hat = self.predict(X)
            loss = self.compute_loss(y_hat, y)
            losses.append(loss)
            dw, db = self.compute_gradients(X, y, y_hat)
            self.w -= lr * dw
            self.b -= lr * db
        return losses


# ---------- demo ----------
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(200, 3)
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 3.0 + np.random.randn(200) * 0.1

    model = LinearRegression(n_features=3)
    losses = model.fit(X, y, lr=0.01, epochs=500)

    print(f"Learned w: {model.w}")   # should be close to [2, -1, 0.5]
    print(f"Learned b: {model.b:.4f}")  # should be close to 3.0
    print(f"Final loss: {losses[-1]:.6f}")

Walkthrough

  1. Initialization -- Weights are set to zeros. This is fine for linear regression because the loss surface is convex -- gradient descent is guaranteed to find the global minimum regardless of initialization.

  2. Forward pass -- predict computes the linear combination X @ w + b. This is a single matrix-vector multiply plus a broadcast addition.

  3. Loss computation -- MSE averages the squared residuals. Dividing by n makes the gradient magnitude independent of dataset size.

  4. Gradient computation -- We derive the gradients analytically. X.T @ error performs the chain rule in a single vectorized operation: each feature's gradient is the dot product of that feature column with the error vector.

  5. Parameter update -- The standard gradient descent step subtracts lr * gradient from each parameter. The learning rate controls step size.

  6. Training loop -- We repeat for a fixed number of epochs and collect losses to monitor convergence.
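The claim in step 4 can be checked directly: the j-th entry of X.T @ error is the dot product of feature column j with the residual vector. A small sketch (names here are illustrative):

```python
import numpy as np

# Per-feature loop vs. the vectorized X.T @ error gradient.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
error = rng.standard_normal(100)
n = len(error)

dw_vec = (2.0 / n) * (X.T @ error)                 # vectorized, one matmul
dw_loop = np.array([(2.0 / n) * (X[:, j] @ error)  # column-j dot residual
                    for j in range(X.shape[1])])

print(np.allclose(dw_vec, dw_loop))  # True
```

The vectorized form is what you should write in an interview; the loop form is only a mental model.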

Complexity Analysis

  • Time per epoch: O(n * d) for the matrix-vector product, where n = samples and d = features. The gradient computation is also O(n * d) due to X.T @ error.
  • Total training time: O(epochs * n * d).
  • Space: O(n * d) for storing the data matrix, O(d) for the weight vector.

The closed-form solution (normal equation) runs in O(n * d^2 + d^3) time but avoids iteration. Gradient descent is preferred when d is very large, since it sidesteps the O(d^3) solve, or when n is so large that even forming X^T X (an O(n * d^2) step) is expensive.
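For comparison, a sketch of the closed-form fit via np.linalg.lstsq, appending a ones column so the bias is learned jointly. The data-generating numbers mirror the demo above:

```python
import numpy as np

# Closed-form least-squares fit on the same synthetic data as the demo.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.standard_normal(200)

X_aug = np.hstack([X, np.ones((len(X), 1))])       # (n, d+1): bias column
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # min ||X_aug @ theta - y||^2
w, b = theta[:3], theta[3]

print(w, b)  # close to [2, -1, 0.5] and 3.0, in one shot with no epochs
```

In practice, prefer lstsq (QR/SVD-based) over explicitly inverting X^T X, which is numerically fragile.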

Interview Tips

Interview Tip

Interviewers look for: (1) correct gradient derivation -- know the matrix calculus, (2) proper vectorization -- no Python for-loops over samples, (3) awareness of the normal equation alternative and when gradient descent is preferred, (4) understanding of learning rate sensitivity and convergence behavior. Bonus: mention feature scaling (standardization) as a practical prerequisite for gradient descent to converge well.
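To make the feature-scaling point concrete, here is a sketch (the helper function is illustrative, not part of the exercise) where one raw feature is 1000x larger than the other: the same learning rate that diverges on raw features converges after z-scoring.

```python
import numpy as np

# Badly scaled features break gradient descent at a fixed learning rate.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) * np.array([1.0, 1000.0])  # mismatched scales
y = X @ np.array([3.0, 0.002]) + 1.0

def gd_final_loss(X, y, lr=0.01, epochs=200):
    """Run plain batch gradient descent, return the final MSE."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = X @ w + b - y
        w -= lr * (2.0 / n) * (X.T @ error)
        b -= lr * (2.0 / n) * error.sum()
    return float(np.mean((X @ w + b - y) ** 2))

raw = gd_final_loss(X, y)                     # diverges: loss blows up to inf/nan
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
scaled = gd_final_loss(X_std, y)              # converges to a small loss

print(raw, scaled)
```

Intuitively, the largest feature variance sets the curvature of the loss, so an unscaled feature forces a tiny learning rate on every other direction; standardization equalizes the curvature.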

Quiz

Quiz — 3 Questions

Why do we divide the gradient by n (the number of samples)?

Why is zero initialization acceptable for linear regression weights but not for neural network weights?

When would you prefer the normal equation (closed-form solution) over gradient descent for linear regression?

Mark as Complete

Finished reviewing this topic? Mark it complete to track your progress.