Akshay’s Gradient
ML Coding · Intermediate · 30 min

Feature Hashing

Algorithm: Feature Hashing (The Hashing Trick)

Implement feature hashing -- a memory-efficient technique for converting arbitrary features into a fixed-size vector without maintaining a vocabulary. Used in large-scale ML systems at companies like Google, Facebook, and Microsoft.

Problem Statement

Implement:

  1. feature_hash(features, n_buckets) -- hash a list of feature strings into a fixed-size vector
  2. signed_feature_hash(features, n_buckets) -- hash with a sign function to reduce collisions
  3. FeatureHasher class -- a scikit-learn-compatible transformer with fit_transform

Inputs: A list of feature strings (e.g., ["word=hello", "bigram=hello_world", "user_id=12345"]), number of hash buckets.

Outputs: A feature vector of shape (n_buckets,) -- conceptually sparse, though the reference solution below returns a dense NumPy array for simplicity.

Key Concept

Feature hashing maps arbitrary features to a fixed-size vector using a hash function, without maintaining a feature-to-index dictionary. The key trick: use a second hash to assign each feature a sign (+1 or -1). This ensures that collisions (multiple features mapping to the same bucket) tend to cancel out rather than accumulate, preserving inner products in expectation.
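In code, the core of the trick is just two hash computations per feature. A minimal sketch (using different bits of one MD5 digest for the index and the sign here; the full solution below uses two separate hash functions instead):

```python
import hashlib

def hash_feature(feature: str, n_buckets: int):
    """Map a feature string to a (bucket index, sign) pair."""
    digest = int(hashlib.md5(feature.encode()).hexdigest(), 16)
    index = digest % n_buckets                    # low bits pick the bucket
    sign = 1 if (digest >> 64) & 1 == 0 else -1   # a high bit picks the sign
    return index, sign

vector = [0.0] * 8
for f in ["word=hello", "bigram=hello_world", "user_id=12345"]:
    i, s = hash_feature(f, 8)
    vector[i] += s  # colliding features add with random signs and tend to cancel
print(vector)
```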

Example · Feature Hashing with the Sign Trick

  Features: ["color=red", "size=L", "brand=nike", "price=99"]
  n_buckets = 8

  Feature        hash1 % 8   hash2 → sign   Bucket update
  ─────────────  ─────────   ────────────   ─────────────────
  "color=red"        3           +1         bucket[3] += 1
  "size=L"           6           -1         bucket[6] -= 1
  "brand=nike"       3           -1         bucket[3] -= 1   ← collision
  "price=99"         1           +1         bucket[1] += 1

  Without sign trick: bucket[3] = 2              (collision accumulates -- biased)
  With sign trick:    bucket[3] = +1 + (-1) = 0  (collision cancels -- unbiased)

  Result: [0, 1, 0, 0, 0, 0, -1, 0]

  Property: E[h(x) · h(y)] = x · y; variance decreases as n_buckets increases.
Warning

Do not use Python's built-in hash() for feature hashing in production -- its output varies between Python processes (due to hash randomization since Python 3.3). Use a deterministic hash function like MurmurHash, xxHash, or hashlib's MD5/SHA. Non-deterministic hashing means the same feature maps to different buckets across runs, making model serialization and reproducibility impossible.
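The randomization is easy to observe. A quick sketch that spawns two fresh interpreter processes (forcing PYTHONHASHSEED to "random" so salting is on even if the parent disabled it):

```python
import hashlib
import os
import subprocess
import sys

feature = "user_id=12345"

# Built-in hash() is salted per process, so two fresh interpreters disagree:
env = dict(os.environ, PYTHONHASHSEED="random")
cmd = [sys.executable, "-c", f"print(hash({feature!r}))"]
h1 = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip()
h2 = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip()
print("hash() agrees across processes:", h1 == h2)  # False (almost surely)

# hashlib is fully deterministic -- safe for feature hashing:
bucket = int(hashlib.md5(feature.encode()).hexdigest(), 16) % 256
print("md5 bucket:", bucket)  # identical in every process, on every machine
```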

Hints

Info
  1. Use a stable hash such as hashlib (see the warning above) to map feature strings to bucket indices: index = stable_hash(feature) % n_buckets.
  2. For the sign trick: use a second hash (or a different bit of the same hash) to determine +1 or -1.
  3. For weighted features, multiply the value by the sign before adding to the bucket.
  4. Collisions are expected! The sign trick ensures they cancel in expectation (unbiased).
  5. Use scipy sparse matrices for memory efficiency when n_buckets is large.

Solution

import numpy as np
import hashlib
from typing import List, Dict, Optional


def _hash_to_index(feature: str, n_buckets: int) -> int:
    """Hash a feature string to a bucket index."""
    # Use a stable hash (Python's hash() varies between runs)
    h = int(hashlib.md5(feature.encode()).hexdigest(), 16)
    return h % n_buckets


def _hash_to_sign(feature: str) -> int:
    """Hash a feature string to a sign (+1 or -1)."""
    h = int(hashlib.sha1(feature.encode()).hexdigest(), 16)
    return 1 if h % 2 == 0 else -1


def feature_hash(
    features: List[str], n_buckets: int
) -> np.ndarray:
    """
    Basic feature hashing without sign trick.

    Args:
        features: List of feature strings.
        n_buckets: Size of the output vector.

    Returns:
        vector: (n_buckets,) count-based feature vector.
    """
    vector = np.zeros(n_buckets)
    for feature in features:
        idx = _hash_to_index(feature, n_buckets)
        vector[idx] += 1.0
    return vector


def signed_feature_hash(
    features: List[str],
    n_buckets: int,
    feature_values: Optional[List[float]] = None,
) -> np.ndarray:
    """
    Feature hashing with sign trick (reduces collision bias).

    Args:
        features: List of feature strings.
        n_buckets: Size of the output vector.
        feature_values: Optional values for each feature (default 1.0).

    Returns:
        vector: (n_buckets,) hashed feature vector.
    """
    vector = np.zeros(n_buckets)
    values = feature_values if feature_values is not None else [1.0] * len(features)

    for feature, value in zip(features, values):
        idx = _hash_to_index(feature, n_buckets)
        sign = _hash_to_sign(feature)
        vector[idx] += sign * value

    return vector


class FeatureHasher:
    """Feature hashing transformer (scikit-learn style)."""

    def __init__(
        self,
        n_buckets: int = 2**18,
        use_sign: bool = True,
    ) -> None:
        self.n_buckets = n_buckets
        self.use_sign = use_sign

    def transform_single(
        self, features: Dict[str, float]
    ) -> np.ndarray:
        """Hash a single sample's features."""
        keys = list(features.keys())
        values = list(features.values())
        if self.use_sign:
            return signed_feature_hash(keys, self.n_buckets, values)
        return feature_hash(keys, self.n_buckets)

    def transform(
        self, samples: List[Dict[str, float]]
    ) -> np.ndarray:
        """Hash multiple samples into a feature matrix."""
        matrix = np.zeros((len(samples), self.n_buckets))
        for i, sample in enumerate(samples):
            matrix[i] = self.transform_single(sample)
        return matrix

    def fit(self, samples: List[Dict[str, float]]) -> "FeatureHasher":
        """No-op: hashing is stateless (kept for scikit-learn compatibility)."""
        return self

    def fit_transform(self, samples: List[Dict[str, float]]) -> np.ndarray:
        """Stateless, so fit_transform is just transform."""
        return self.transform(samples)


def demonstrate_collision_behavior() -> None:
    """Show why the sign trick helps with collisions."""
    n_buckets = 10  # intentionally small to force collisions
    n_features = 100

    features = [f"feature_{i}" for i in range(n_features)]

    # Without sign trick
    vec_unsigned = feature_hash(features, n_buckets)

    # With sign trick
    vec_signed = signed_feature_hash(features, n_buckets)

    print("Without sign trick (all positive, collisions accumulate):")
    print(f"  Vector: {vec_unsigned}")
    print(f"  Sum: {vec_unsigned.sum():.0f} (should be {n_features})")
    print(f"  Max bucket: {vec_unsigned.max():.0f}")

    print("\nWith sign trick (collisions tend to cancel):")
    print(f"  Vector: {vec_signed}")
    print(f"  Sum: {vec_signed.sum():.1f} (closer to 0 due to cancellation)")
    print(f"  L2 norm: {np.linalg.norm(vec_signed):.1f}")


# ---------- demo ----------
if __name__ == "__main__":
    # Basic usage
    print("=== Basic Feature Hashing ===")
    features = ["color=red", "size=large", "brand=nike", "category=shoes"]
    n_buckets = 16

    vec = signed_feature_hash(features, n_buckets)
    print(f"Features: {features}")
    print(f"Hashed vector ({n_buckets} buckets): {vec}")
    print(f"Non-zero entries: {np.count_nonzero(vec)}")

    # FeatureHasher with weighted features
    print("\n=== FeatureHasher ===")
    samples = [
        {"color=red": 1, "size=large": 1, "price": 99.99},
        {"color=blue": 1, "size=small": 1, "price": 49.99},
        {"color=red": 1, "size=medium": 1, "price": 79.99},
    ]

    hasher = FeatureHasher(n_buckets=32, use_sign=True)
    X = hasher.transform(samples)
    print(f"Feature matrix shape: {X.shape}")
    print(f"Non-zero ratio: {(X != 0).mean():.2%}")

    # Similarity should be higher for similar samples
    from numpy.linalg import norm
    sim_01 = np.dot(X[0], X[1]) / (norm(X[0]) * norm(X[1]) + 1e-8)
    sim_02 = np.dot(X[0], X[2]) / (norm(X[0]) * norm(X[2]) + 1e-8)
    print(f"Similarity(sample 0, sample 1): {sim_01:.4f}")
    print(f"Similarity(sample 0, sample 2): {sim_02:.4f}")

    # Collision demonstration
    print("\n=== Collision Behavior ===")
    demonstrate_collision_behavior()

    # Show that inner products are approximately preserved
    print("\n=== Inner Product Preservation ===")
    n_buckets_list = [32, 128, 512, 2048]
    features_a = [f"feat_{i}" for i in range(50)]
    features_b = [f"feat_{i}" for i in range(25, 75)]  # 50% overlap

    for nb in n_buckets_list:
        va = signed_feature_hash(features_a, nb)
        vb = signed_feature_hash(features_b, nb)
        approx_overlap = np.dot(va, vb)
        print(f"  n_buckets={nb:4d}: dot_product={approx_overlap:.1f} (true overlap=25)")

Walkthrough

  1. Hash function -- We use MD5 for bucket assignment and SHA1 for sign assignment. Using two different hash functions ensures the sign is independent of the bucket, which is necessary for the unbiasedness proof. In production, MurmurHash is preferred for speed.

  2. Basic hashing -- Each feature is mapped to a bucket index. If two features collide (same bucket), their counts add up. This is biased: collisions always increase the bucket value.

  3. Sign trick -- Each feature gets a random sign (+1 or -1) based on a second hash. When two features collide with opposite signs, they cancel. In expectation, the dot product between two hashed vectors equals the dot product between the original sparse vectors (unbiased estimator).

  4. FeatureHasher class -- Supports weighted features (like TF-IDF values or prices) by multiplying the value by the sign before adding to the bucket.

  5. Inner product preservation -- The sign trick guarantees E[<h(x), h(y)>] = <x, y>. The variance decreases as n_buckets increases. With n_buckets >= 2^18, the approximation error is typically negligible.
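The unbiasedness in step 5 can also be checked empirically by salting the hash, which simulates drawing many independent hash functions (a sketch; the salting scheme is illustrative, not part of the solution above):

```python
import hashlib

import numpy as np

def hashed_vec(features, n_buckets, salt):
    """Signed feature hashing; the salt simulates a freshly drawn hash function."""
    vec = np.zeros(n_buckets)
    for f in features:
        h = int(hashlib.md5((salt + f).encode()).hexdigest(), 16)
        idx = h % n_buckets                    # low bits -> bucket
        sign = 1 if (h >> 64) & 1 == 0 else -1  # high bit -> sign
        vec[idx] += sign
    return vec

a = [f"feat_{i}" for i in range(20)]
b = [f"feat_{i}" for i in range(10, 30)]  # true overlap <a, b> = 10

dots = []
for trial in range(300):
    salt = f"salt{trial}:"
    dots.append(hashed_vec(a, 64, salt) @ hashed_vec(b, 64, salt))

print(f"mean dot product over 300 hash functions: {np.mean(dots):.2f}")  # ≈ 10
```

Individual trials scatter around 10 (variance shrinks as n_buckets grows), but the average converges to the true overlap, which is exactly the unbiasedness property.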

Complexity Analysis

  • Time: O(n_features) per sample -- one hash computation per feature. No sorting or dictionary lookup needed.
  • Space: O(n_buckets) per sample. The key advantage: n_buckets is fixed regardless of the feature vocabulary size. A vocabulary of 10 billion features can be hashed into 2^18 = 262K buckets.
  • Comparison with vocabulary: A traditional approach requires O(|vocabulary|) memory and an explicit mapping. Feature hashing requires O(n_buckets) with no preprocessing.

Interview Tips

Interview Tip

Key points to explain:

  1. Why feature hashing over vocabularies -- handles new features without retraining, constant memory, no OOV problem.
  2. The sign trick's unbiasedness: E[sign(a) * sign(b)] = 1 if a = b, 0 otherwise.
  3. When NOT to use it: when features are dense (not sparse), or when you need to interpret which features are important.
  4. Real-world uses: Vowpal Wabbit (online learning), click-through rate prediction (billions of user/item features), spam filters.
  5. Collision rate: with n features and b buckets, the expected number of colliding pairs follows the birthday problem, ~n^2/(2b); use b >> n^2 to keep collisions rare.
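The birthday-problem estimate above is quick to compute. A sketch (helper names are illustrative):

```python
import math

def expected_collisions(n_features: int, n_buckets: int) -> float:
    """Expected number of colliding pairs, birthday-problem approximation n^2/(2b)."""
    return n_features ** 2 / (2 * n_buckets)

def collision_probability(n_features: int, n_buckets: int) -> float:
    """P(at least one collision) ≈ 1 - exp(-n(n-1) / 2b)."""
    return 1 - math.exp(-n_features * (n_features - 1) / (2 * n_buckets))

n = 1000
for b in (2**10, 2**14, 2**18, 2**22):
    print(f"b=2^{b.bit_length() - 1}: "
          f"~{expected_collisions(n, b):.1f} colliding pairs, "
          f"P(any collision)={collision_probability(n, b):.3f}")
```

Even at b = 2^18 some collisions are likely for 1000 features; the point of the sign trick is that the model tolerates them.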

Quiz


What is the purpose of the sign trick in feature hashing?

What is the main disadvantage of feature hashing compared to a traditional vocabulary-based approach?

Following the birthday problem, approximately how many hash buckets are needed to keep the collision probability below 50% for 1000 features?
