Akshay’s Gradient
ML Coding · Beginner · 25 min

Cosine Similarity Search

Algorithm: Efficient Batch Cosine Similarity

Implement efficient cosine similarity search -- the similarity metric used in embedding spaces, semantic search, and recommendation systems. Optimize for batch queries against large databases.

Problem Statement

Implement:

  1. cosine_similarity(a, b) -- cosine similarity between two vectors
  2. batch_cosine_similarity(queries, database) -- all-pairs cosine similarity matrix
  3. top_k_similar(queries, database, k) -- find the k most similar items for each query
  4. Handle edge cases: zero vectors, numerical precision

Inputs: Query vectors (n_queries, d), database vectors (n_database, d).

Outputs: Similarity matrix (n_queries, n_database) or top-k indices and scores.

Key Concept

Cosine similarity measures the angle between two vectors: cos(a, b) = (a . b) / (||a|| * ||b||). Values range from -1 (opposite) to 1 (identical direction). It is invariant to vector magnitude, making it ideal for comparing embeddings where direction encodes meaning and magnitude may vary due to model quirks. The key to efficient computation: pre-normalize all vectors, then cosine similarity reduces to a simple dot product.
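A quick numeric check of both routes described above (the vectors are chosen arbitrarily for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

# Direct formula: cos(a, b) = (a . b) / (||a|| * ||b||)
cos_direct = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-normalized route: unit vectors first, then a plain dot product
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
cos_via_norm = np.dot(a_hat, b_hat)

print(cos_direct, cos_via_norm)  # both ~1.0: magnitude is ignored
```

Because `b` is just `2 * a`, both computations return 1.0, confirming that cosine similarity sees only direction.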

Batch Cosine Similarity via Pre-Normalization
┌──────────────────────────────────────────────────────────────────┐
│   Efficient Batch Cosine Similarity                              │
│                                                                  │
│   Step 1: Pre-normalize (done once)                              │
│   ┌─────────────────────┐    ┌─────────────────────┐             │
│   │ Queries (n_q, d)    │    │ Database (n_db, d)  │             │
│   │ q_i = q_i / ||q_i|| │    │ d_j = d_j / ||d_j|| │             │
│   └──────────┬──────────┘    └──────────┬──────────┘             │
│              │                          │                        │
│              ▼                          ▼                        │
│   Step 2: Single matrix multiply                                 │
│   ┌──────────────────────────────────────────┐                   │
│   │  Similarity = Q_norm  @  D_norm^T        │                   │
│   │  (n_q, d)       (d, n_db)                │                   │
│   │                                          │                   │
│   │  Result: (n_q, n_db) similarity matrix   │                   │
│   │  Each entry = cos(q_i, d_j) = dot product│                   │
│   └──────────────────┬───────────────────────┘                   │
│                      │                                           │
│   Step 3: Top-k      ▼                                           │
│   ┌──────────────────────────────────────────┐                   │
│   │  For each query row:                     │                   │
│   │  argpartition → O(n_db) top-k selection  │                   │
│   │  (faster than O(n_db log n_db) full sort)│                   │
│   └──────────────────────────────────────────┘                   │
│                                                                  │
│   Key insight: normalize once, then dot product = cosine sim     │
│   Speedup: single BLAS matmul vs. n_q * n_db individual cosines  │
└──────────────────────────────────────────────────────────────────┘
Warning

For very large databases (millions of vectors), computing the full similarity matrix can cause out-of-memory errors. Against a 1M-vector database, each query produces a row of 1M float32 scores (~4 MB), so a batch of 10,000 queries needs ~40 GB for the matrix alone. Tile the computation by processing chunks of the database at a time, keeping only the running top-k results per query.
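The tiling described in this warning can be sketched as follows. `top_k_tiled` and `chunk_size` are hypothetical names, and the sketch assumes both inputs are already L2-normalized and the database holds at least k vectors:

```python
import numpy as np

def top_k_tiled(queries, database, k, chunk_size=100_000):
    """Top-k cosine search without materializing the full similarity matrix.

    Processes the database in chunks, keeping only a running top-k per query.
    Assumes rows of `queries` and `database` are already L2-normalized.
    """
    n_q = queries.shape[0]
    best_scores = np.full((n_q, k), -np.inf, dtype=queries.dtype)
    best_idx = np.zeros((n_q, k), dtype=np.int64)

    for start in range(0, database.shape[0], chunk_size):
        chunk = database[start:start + chunk_size]
        sims = queries @ chunk.T  # only (n_q, chunk_size) in memory at once

        # Merge the chunk's scores with the running top-k, then re-select k
        cand_scores = np.concatenate([best_scores, sims], axis=1)
        chunk_ids = np.arange(start, start + chunk.shape[0])[None, :]
        cand_idx = np.concatenate(
            [best_idx, np.repeat(chunk_ids, n_q, axis=0)], axis=1
        )
        top = np.argpartition(-cand_scores, k - 1, axis=1)[:, :k]
        best_scores = np.take_along_axis(cand_scores, top, axis=1)
        best_idx = np.take_along_axis(cand_idx, top, axis=1)

    # Final descending order within the k survivors
    order = np.argsort(-best_scores, axis=1)
    return (np.take_along_axis(best_idx, order, axis=1),
            np.take_along_axis(best_scores, order, axis=1))
```

Peak memory is O(n_q * chunk_size) plus the O(n_q * k) running results, independent of the total database size.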

Hints

  1. Pre-normalize: divide each vector by its L2 norm. Then cosine similarity = dot product.
  2. Batch computation: similarities = normalized_queries @ normalized_database.T.
  3. This is a single matrix multiply, fully utilizing GPU parallelism.
  4. For top-k: use torch.topk on the similarity matrix (faster than sorting).
  5. Handle zero vectors by clamping norms to a small epsilon: norm = max(||x||, 1e-8).

Solution

import numpy as np
import torch
from typing import Tuple


def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors using NumPy."""
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < 1e-8 or norm_b < 1e-8:
        return 0.0
    return float(dot / (norm_a * norm_b))


def batch_cosine_similarity_np(
    queries: np.ndarray, database: np.ndarray
) -> np.ndarray:
    """
    Compute all-pairs cosine similarity.

    Args:
        queries:  (n_queries, d)
        database: (n_database, d)

    Returns:
        similarities: (n_queries, n_database) cosine similarity matrix
    """
    # Normalize each vector to unit length
    q_norms = np.linalg.norm(queries, axis=1, keepdims=True)
    d_norms = np.linalg.norm(database, axis=1, keepdims=True)

    # Avoid division by zero
    q_norms = np.maximum(q_norms, 1e-8)
    d_norms = np.maximum(d_norms, 1e-8)

    q_normalized = queries / q_norms
    d_normalized = database / d_norms

    # Cosine similarity = dot product of normalized vectors
    return q_normalized @ d_normalized.T


def top_k_similar_np(
    queries: np.ndarray, database: np.ndarray, k: int
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Find top-k most similar items for each query.

    Returns:
        indices: (n_queries, k) indices of most similar items
        scores:  (n_queries, k) similarity scores
    """
    similarities = batch_cosine_similarity_np(queries, database)

    # argpartition is O(n) per row vs O(n log n) for argsort
    n_db = similarities.shape[1]
    if k >= n_db:
        sorted_idx = np.argsort(-similarities, axis=1)
        return sorted_idx, np.take_along_axis(similarities, sorted_idx, axis=1)

    # Partial sort: get top-k indices (unordered)
    top_k_unsorted = np.argpartition(-similarities, k, axis=1)[:, :k]

    # Get the scores for top-k indices
    top_k_scores = np.take_along_axis(similarities, top_k_unsorted, axis=1)

    # Sort the top-k by score (descending)
    sorted_within = np.argsort(-top_k_scores, axis=1)
    top_k_sorted = np.take_along_axis(top_k_unsorted, sorted_within, axis=1)
    scores_sorted = np.take_along_axis(top_k_scores, sorted_within, axis=1)

    return top_k_sorted, scores_sorted


# PyTorch version (GPU-accelerated)
def batch_cosine_similarity_torch(
    queries: torch.Tensor, database: torch.Tensor
) -> torch.Tensor:
    """GPU-accelerated batch cosine similarity."""
    queries_norm = torch.nn.functional.normalize(queries, p=2, dim=1)
    database_norm = torch.nn.functional.normalize(database, p=2, dim=1)
    return queries_norm @ database_norm.T


def top_k_similar_torch(
    queries: torch.Tensor, database: torch.Tensor, k: int
) -> Tuple[torch.Tensor, torch.Tensor]:
    """GPU-accelerated top-k similar search."""
    similarities = batch_cosine_similarity_torch(queries, database)
    scores, indices = torch.topk(similarities, k, dim=1)
    return indices, scores


# ---------- demo ----------
if __name__ == "__main__":
    np.random.seed(42)
    n_queries, n_database, d = 5, 10000, 128

    # Generate random embeddings
    queries = np.random.randn(n_queries, d).astype(np.float32)
    database = np.random.randn(n_database, d).astype(np.float32)

    # Test basic cosine similarity
    sim = cosine_similarity_np(queries[0], database[0])
    print(f"Single pair similarity: {sim:.4f}")

    # Test batch similarity
    sim_matrix = batch_cosine_similarity_np(queries, database)
    print(f"Similarity matrix shape: {sim_matrix.shape}")
    print(f"Value range: [{sim_matrix.min():.4f}, {sim_matrix.max():.4f}]")

    # Test top-k
    k = 5
    indices, scores = top_k_similar_np(queries, database, k)
    print(f"\nTop-{k} results:")
    for q in range(n_queries):
        print(f"  Query {q}: indices={indices[q]}, scores={scores[q].round(4)}")

    # Verify: top-k should match brute-force sort
    for q in range(n_queries):
        sorted_all = np.argsort(-sim_matrix[q])[:k]
        assert set(indices[q]) == set(sorted_all), f"Mismatch for query {q}"
    print("Top-k verification passed.")

    # Test edge case: zero vector
    zero_vec = np.zeros(d)
    sim = cosine_similarity_np(zero_vec, database[0])
    print(f"\nZero vector similarity: {sim}")  # Should be 0.0

    # Test PyTorch version
    q_torch = torch.from_numpy(queries)
    db_torch = torch.from_numpy(database)
    sim_torch = batch_cosine_similarity_torch(q_torch, db_torch).numpy()
    max_diff = np.abs(sim_matrix - sim_torch).max()
    print(f"NumPy vs PyTorch max difference: {max_diff:.2e}")

Walkthrough

  1. Pre-normalization -- The key optimization. By dividing each vector by its L2 norm once, cosine similarity becomes a simple dot product. This avoids computing norms for every pair.

  2. Matrix multiplication -- normalized_queries @ normalized_database.T computes all n_q * n_db similarities in a single BLAS call. On a GPU, this is orders of magnitude faster than a Python loop.

  3. Top-k selection -- np.argpartition does a partial sort in O(n) expected time per row, finding the top-k elements without sorting the entire array. Then we sort only the k elements for final ordering.

  4. Zero vector handling -- A zero vector has no direction, so cosine similarity is undefined. We clamp norms to a small epsilon and return 0.0 similarity, which is the standard convention.

  5. PyTorch version -- F.normalize handles the normalization, and torch.topk efficiently selects top-k on GPU. This is the production pattern used in vector search systems.
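A tiny demonstration of the argpartition semantics from step 3 (values chosen arbitrarily):

```python
import numpy as np

x = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
k = 2

# argpartition guarantees the first k positions hold the indices of the
# k largest values of x (we negate for descending), in no particular order.
unordered = np.argpartition(-x, k)[:k]
print(sorted(unordered.tolist()))  # indices of 0.9 and 0.7 -> [1, 3]

# A full sort of just those k entries gives the final ranking.
ranked = unordered[np.argsort(-x[unordered])]
print(ranked.tolist())  # [1, 3]: 0.9 first, then 0.7
```

This is why the solution only pays O(n) per row for selection and O(k log k) for the final ordering.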

Complexity Analysis

  • Batch similarity: O(n_q * n_db * d) for the matrix multiplication. This is the bottleneck.
  • Pre-normalization: O((n_q + n_db) * d) -- done once.
  • Top-k selection: O(n_q * n_db) for argpartition, plus O(n_q * k * log k) for sorting the top-k.
  • Memory: O(n_q * n_db) for the full similarity matrix. For very large databases, this must be tiled.

For a database of 1M vectors with d=768, the vectors themselves occupy 1M * 768 * 4 bytes = ~3GB, and the similarity matrix adds another 1M * 4 bytes (~4MB) per query -- ~40GB for a 10,000-query batch. Tiling (computing in chunks) is necessary to avoid OOM.

Interview Tips

Key topics: (1) Cosine vs. L2 distance vs. dot product -- for normalized vectors, ranking by cosine = ranking by dot product = ranking by L2 distance. They are equivalent after normalization. (2) Why normalization matters: unnormalized embeddings can have arbitrary magnitudes, making dot product unreliable. (3) Scaling: for billions of vectors, you cannot compute all pairwise similarities. Use ANN indices (FAISS, HNSW) after cosine normalization. (4) MIPS (Maximum Inner Product Search) reduces to nearest neighbor search after normalization. (5) Quantization: reduce float32 vectors to int8 for 4x memory savings with minimal recall loss.
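Point (1) can be checked numerically: for unit vectors, ||a - b||^2 = 2 - 2 cos(a, b), so L2 distance is a monotonically decreasing function of cosine similarity and both produce the same ranking. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
db = rng.standard_normal((100, 16))

# Normalize everything to unit length
q /= np.linalg.norm(q)
db /= np.linalg.norm(db, axis=1, keepdims=True)

cos = db @ q                            # cosine = dot product for unit vectors
l2_sq = np.sum((db - q) ** 2, axis=1)   # squared Euclidean distance

# Identity: ||a - b||^2 = 2 - 2 cos(a, b) for unit vectors
assert np.allclose(l2_sq, 2 - 2 * cos)

# Identical rankings: most-similar by cosine == nearest by L2
assert np.array_equal(np.argsort(-cos), np.argsort(l2_sq))
```

This identity is also why MIPS over normalized vectors reduces to ordinary nearest neighbor search.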

Quiz


After normalizing all vectors to unit length, cosine similarity is equivalent to which operation?

Why is cosine similarity preferred over Euclidean distance for comparing text embeddings?

Why does np.argpartition provide a speedup over np.argsort for top-k selection?
