Implement Linear Regression from Scratch
Build a complete linear regression model using only NumPy, training it with gradient descent. This is the foundational optimization loop behind all of deep learning.
Problem Statement
Implement a LinearRegression class that:
- Initializes weights and bias to zeros (or small random values)
- Implements a `predict(X)` method computing `X @ w + b`
- Implements a `fit(X, y, lr, epochs)` method using batch gradient descent to minimize MSE loss
- Returns the training loss history
Inputs: Feature matrix `X` of shape `(n_samples, n_features)`, target vector `y` of shape `(n_samples,)`.
Outputs: Learned weight vector `w` of shape `(n_features,)`, scalar bias `b`, and a list of per-epoch MSE losses.
Linear regression minimizes Mean Squared Error: L = (1/n) * ||Xw + b - y||^2. The gradient with respect to w is (2/n) * X^T (Xw + b - y) and with respect to b is (2/n) * sum(Xw + b - y). Gradient descent iteratively updates parameters in the direction that reduces the loss.
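One way to sanity-check these gradient formulas is to compare them against finite differences on a tiny random problem. A minimal sketch (all variable names here are illustrative, not part of the exercise):

```python
import numpy as np

# Tiny illustrative problem: 5 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w = rng.normal(size=2)
b = 0.3
n = len(y)

# Analytic gradients from the formulas above.
error = X @ w + b - y
dw = (2.0 / n) * (X.T @ error)   # gradient w.r.t. w
db = (2.0 / n) * np.sum(error)   # gradient w.r.t. b

def loss(w, b):
    return np.mean((X @ w + b - y) ** 2)

# Central finite differences for comparison.
eps = 1e-6
dw_num = np.array([
    (loss(w + eps * np.eye(2)[i], b) - loss(w - eps * np.eye(2)[i], b)) / (2 * eps)
    for i in range(2)
])
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-5))   # True
print(np.isclose(db, db_num, atol=1e-5))    # True
```

Because the MSE loss is quadratic, the central differences agree with the analytic gradients to near machine precision.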
┌──────────────────────────────────────────────────────────────────┐
│ Gradient Descent for Linear Regression │
│ │
│ Epoch 0 Epoch k Epoch final │
│ │
│ w = [0,0,0] w ≈ [1.5,-0.7,0.3] w ≈ [2.0,-1.0,0.5] │
│ b = 0.0 b ≈ 2.1 b ≈ 3.0 │
│ │
│ For each epoch: │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ y_hat = │ │ error = │ │ dw = (2/n) * │ │
│ │ X @ w + b │──▶ │ y_hat - y │──▶ │ X^T @ error │ │
│ │ (forward) │ │ (residual) │ │ db = (2/n) * │ │
│ └─────────────┘ └─────────────┘ │ sum(error) │ │
│ └───────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ w -= lr * dw │ │
│ │ b -= lr * db │ │
│ │ (param update) │ │
│ └─────────────────┘ │
│ │
│ Loss: ████████████ ──▶ █████ ──▶ █ (decreasing) │
└──────────────────────────────────────────────────────────────────┘
Hints
- Start by computing predictions: `y_hat = X @ w + b`.
- Compute the residual: `error = y_hat - y`.
- The MSE loss is `mean(error ** 2)`.
- Gradient for `w`: `(2/n) * X.T @ error`. Gradient for `b`: `(2/n) * sum(error)`.
- Update: `w -= lr * dw`, `b -= lr * db`.
- Repeat for the specified number of epochs and record the loss each time.
A common mistake is forgetting to divide the gradient by `n` (the number of samples). Without this normalization, the gradient magnitude scales with the dataset size, making the learning rate dataset-dependent. Also watch out for using `X @ error` instead of `X.T @ error` -- the transpose is critical for getting the shapes right.
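The transpose pitfall is easy to see from array shapes alone; a quick sketch (shapes chosen to match the demo below, otherwise illustrative):

```python
import numpy as np

# n = 200 samples, d = 3 features (illustrative values).
X = np.zeros((200, 3))
error = np.zeros(200)    # one residual per sample

dw = X.T @ error         # (3, 200) @ (200,) -> (3,): one entry per feature
print(dw.shape)          # (3,)

# X @ error attempts (200, 3) @ (200,): the inner dimensions (3 vs 200)
# do not align, so NumPy raises a ValueError.
```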
Solution
```python
import numpy as np
from typing import List, Tuple


class LinearRegression:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, n_features: int) -> None:
        self.w = np.zeros(n_features)  # weight vector
        self.b = 0.0                   # bias scalar

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Compute predictions: y_hat = X @ w + b."""
        return X @ self.w + self.b

    def compute_loss(self, y_hat: np.ndarray, y: np.ndarray) -> float:
        """Mean Squared Error."""
        return float(np.mean((y_hat - y) ** 2))

    def compute_gradients(
        self, X: np.ndarray, y: np.ndarray, y_hat: np.ndarray
    ) -> Tuple[np.ndarray, float]:
        """Compute gradients of MSE w.r.t. w and b."""
        n = len(y)
        error = y_hat - y               # (n,)
        dw = (2.0 / n) * (X.T @ error)  # (n_features,)
        db = (2.0 / n) * np.sum(error)  # scalar
        return dw, db

    def fit(
        self,
        X: np.ndarray,
        y: np.ndarray,
        lr: float = 0.01,
        epochs: int = 1000,
    ) -> List[float]:
        """Train with batch gradient descent. Returns loss history."""
        losses: List[float] = []
        for _ in range(epochs):
            y_hat = self.predict(X)
            loss = self.compute_loss(y_hat, y)
            losses.append(loss)
            dw, db = self.compute_gradients(X, y, y_hat)
            self.w -= lr * dw
            self.b -= lr * db
        return losses


# ---------- demo ----------
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(200, 3)
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 3.0 + np.random.randn(200) * 0.1

    model = LinearRegression(n_features=3)
    losses = model.fit(X, y, lr=0.01, epochs=500)

    print(f"Learned w: {model.w}")      # should be close to [2, -1, 0.5]
    print(f"Learned b: {model.b:.4f}")  # should be close to 3.0
    print(f"Final loss: {losses[-1]:.6f}")
```
Walkthrough
- Initialization -- Weights are set to zeros. This is fine for linear regression because the MSE loss surface is convex, so gradient descent (with a suitably small learning rate) reaches the global minimum regardless of initialization.
- Forward pass -- `predict` computes the linear combination `X @ w + b`: a single matrix-vector multiply plus a broadcast addition.
- Loss computation -- MSE averages the squared residuals. Dividing by `n` makes the gradient magnitude independent of dataset size.
- Gradient computation -- The gradients are derived analytically. `X.T @ error` applies the chain rule in a single vectorized operation: each feature's gradient is the dot product of that feature's column with the error vector.
- Parameter update -- The standard gradient descent step subtracts `lr * gradient` from each parameter. The learning rate controls the step size.
- Training loop -- We repeat for a fixed number of epochs and record the loss each epoch to monitor convergence.
Complexity Analysis
- Time per epoch: O(n * d) for the matrix-vector product, where n = samples and d = features. The gradient computation is also O(n * d) due to `X.T @ error`.
- Total training time: O(epochs * n * d).
- Space: O(n * d) for storing the data matrix, O(d) for the weight vector.
The closed-form solution (normal equation) runs in O(n * d^2 + d^3) time but avoids iteration. Gradient descent is preferred when d or n is very large because it avoids the O(d^3) matrix inversion.
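For contrast, the closed-form route can be sketched in a few lines, assuming we fold the bias into the weights by appending a column of ones (variable names are illustrative):

```python
import numpy as np

# Same synthetic setup as the demo: y = X @ [2, -1, 0.5] + 3 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(size=200) * 0.1

# Augment X with a ones column so the last weight plays the role of the bias.
X_aug = np.hstack([X, np.ones((200, 1))])

# Normal equation: (X^T X) theta = X^T y, solved without iteration.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:3], theta[3]
print(w)  # close to [2, -1, 0.5]
print(b)  # close to 3.0
```

In practice `np.linalg.lstsq` is numerically safer than forming `X^T X` explicitly, but the direct solve makes the O(d^3) cost visible.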
Interview Tips
Interviewers look for: (1) correct gradient derivation -- know the matrix calculus, (2) proper vectorization -- no Python for-loops over samples, (3) awareness of the normal equation alternative and when gradient descent is preferred, (4) understanding of learning rate sensitivity and convergence behavior. Bonus: mention feature scaling (standardization) as a practical prerequisite for gradient descent to converge well.
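The standardization mentioned above takes only a few lines; a sketch with an illustrative matrix whose features differ in scale by 1000x:

```python
import numpy as np

# Features on wildly different scales make the loss surface ill-conditioned,
# forcing a tiny learning rate. Standardize to zero mean, unit variance.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature standard deviation
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

At prediction time the same `mu` and `sigma` (computed on the training set) must be reused for new inputs.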
Quiz
1. Why do we divide the gradient by n (the number of samples)?
2. Why is zero initialization acceptable for linear regression weights but not for neural network weights?
3. When would you prefer the normal equation (closed-form solution) over gradient descent for linear regression?