Algorithm: Efficient Batch Cosine Similarity
Implement efficient cosine similarity search -- the similarity measure used in embedding spaces, semantic search, and recommendation systems. Optimize for batch queries against large databases.
Problem Statement
Implement:
- cosine_similarity(a, b) -- cosine similarity between two vectors
- batch_cosine_similarity(queries, database) -- all-pairs cosine similarity matrix
- top_k_similar(queries, database, k) -- find the k most similar items for each query
- Handle edge cases: zero vectors, numerical precision
Inputs: Query vectors (n_queries, d), database vectors (n_database, d).
Outputs: Similarity matrix (n_queries, n_database) or top-k indices and scores.
Cosine similarity measures the angle between two vectors: cos(a, b) = (a . b) / (||a|| * ||b||). Values range from -1 (opposite) to 1 (identical direction). It is invariant to vector magnitude, making it ideal for comparing embeddings where direction encodes meaning and magnitude may vary due to model quirks. The key to efficient computation: pre-normalize all vectors, then cosine similarity reduces to a simple dot product.
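As a quick sanity check of that identity (a minimal sketch, separate from the reference solution below):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Direct formula: cos(a, b) = (a . b) / (||a|| * ||b||)
cos_direct = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-normalize once, then a plain dot product gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
cos_dot = np.dot(a_hat, b_hat)

assert np.isclose(cos_direct, cos_dot)
```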
┌──────────────────────────────────────────────────────────────────┐
│ Efficient Batch Cosine Similarity │
│ │
│ Step 1: Pre-normalize (done once) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Queries (n_q, d) │ │ Database (n_db, d) │ │
│ │ q_i = q_i / ||q_i|| │ │ d_j = d_j / ||d_j|| │ │
│ └──────────┬──────────┘ └──────────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ Step 2: Single matrix multiply │
│ ┌──────────────────────────────────────────┐ │
│ │ Similarity = Q_norm @ D_norm^T │ │
│ │ (n_q, d) (d, n_db) │ │
│ │ │ │
│ │ Result: (n_q, n_db) similarity matrix │ │
│ │ Each entry = cos(q_i, d_j) = dot product│ │
│ └──────────────────┬───────────────────────┘ │
│ │ │
│ Step 3: Top-k ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ For each query row: │ │
│ │ argpartition → O(n_db) top-k selection │ │
│ │ (faster than O(n_db log n_db) full sort)│ │
│ └──────────────────────────────────────────┘ │
│ │
│ Key insight: normalize once, then dot product = cosine sim │
│ Speedup: single BLAS matmul vs. n_q * n_db individual cosines │
└──────────────────────────────────────────────────────────────────┘
For very large databases (millions of vectors), computing the full similarity matrix will cause out-of-memory errors. The similarity matrix costs 4 bytes per query-database pair: against a 1M-vector database, a batch of 1,000 queries already needs 1,000 * 1M * 4 bytes = 4 GB (and at d=768 the float32 database itself occupies another ~3 GB). Tile the computation by processing chunks of the database at a time, keeping only the running top-k results per query.
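The tiling idea can be sketched as follows. This is a minimal illustration, not the reference solution -- the `top_k_tiled` name and the `chunk_size` default are assumptions:

```python
import numpy as np

def top_k_tiled(queries: np.ndarray, database: np.ndarray,
                k: int, chunk_size: int = 100_000):
    """Tiled top-k cosine search: scan the database in chunks, keeping a
    running top-k per query, so memory stays O(n_q * (k + chunk_size))."""
    eps = 1e-8
    q = queries / np.maximum(np.linalg.norm(queries, axis=1, keepdims=True), eps)
    n_q = q.shape[0]
    best_scores = np.empty((n_q, 0))               # running candidate scores
    best_idx = np.empty((n_q, 0), dtype=np.int64)  # and their database indices
    for start in range(0, database.shape[0], chunk_size):
        chunk = database[start:start + chunk_size]
        c = chunk / np.maximum(np.linalg.norm(chunk, axis=1, keepdims=True), eps)
        sims = q @ c.T                     # (n_q, len(chunk)) block of the matrix
        idx = np.arange(start, start + chunk.shape[0])
        # Merge this chunk with the running candidates and re-trim to k
        merged_scores = np.concatenate([best_scores, sims], axis=1)
        merged_idx = np.concatenate(
            [best_idx, np.broadcast_to(idx, (n_q, idx.size))], axis=1)
        order = np.argsort(-merged_scores, axis=1)[:, :k]
        best_scores = np.take_along_axis(merged_scores, order, axis=1)
        best_idx = np.take_along_axis(merged_idx, order, axis=1)
    return best_idx, best_scores
```

Only one `(n_q, chunk_size)` block of the similarity matrix is ever materialized; the merge step keeps at most `k + chunk_size` candidates per query between chunks.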
Hints
- Pre-normalize: divide each vector by its L2 norm. Then cosine similarity = dot product.
- Batch computation: similarities = normalized_queries @ normalized_database.T. This is a single matrix multiply, fully utilizing GPU parallelism.
- For top-k: use torch.topk on the similarity matrix (faster than sorting).
- Handle zero vectors by clamping norms to a small epsilon: norm = max(||x||, 1e-8).
Solution
import numpy as np
import torch
from typing import Tuple
def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors using NumPy."""
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < 1e-8 or norm_b < 1e-8:
        return 0.0
    return float(dot / (norm_a * norm_b))
def batch_cosine_similarity_np(
    queries: np.ndarray, database: np.ndarray
) -> np.ndarray:
    """
    Compute all-pairs cosine similarity.

    Args:
        queries: (n_queries, d)
        database: (n_database, d)

    Returns:
        similarities: (n_queries, n_database) cosine similarity matrix
    """
    # Normalize each vector to unit length
    q_norms = np.linalg.norm(queries, axis=1, keepdims=True)
    d_norms = np.linalg.norm(database, axis=1, keepdims=True)
    # Avoid division by zero
    q_norms = np.maximum(q_norms, 1e-8)
    d_norms = np.maximum(d_norms, 1e-8)
    q_normalized = queries / q_norms
    d_normalized = database / d_norms
    # Cosine similarity = dot product of normalized vectors
    return q_normalized @ d_normalized.T
def top_k_similar_np(
    queries: np.ndarray, database: np.ndarray, k: int
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Find top-k most similar items for each query.

    Returns:
        indices: (n_queries, k) indices of most similar items
        scores: (n_queries, k) similarity scores
    """
    similarities = batch_cosine_similarity_np(queries, database)
    # argpartition is O(n) per row vs O(n log n) for argsort
    n_db = similarities.shape[1]
    if k >= n_db:
        sorted_idx = np.argsort(-similarities, axis=1)
        return sorted_idx, np.take_along_axis(similarities, sorted_idx, axis=1)
    # Partial sort: get top-k indices (unordered)
    top_k_unsorted = np.argpartition(-similarities, k, axis=1)[:, :k]
    # Get the scores for top-k indices
    top_k_scores = np.take_along_axis(similarities, top_k_unsorted, axis=1)
    # Sort the top-k by score (descending)
    sorted_within = np.argsort(-top_k_scores, axis=1)
    top_k_sorted = np.take_along_axis(top_k_unsorted, sorted_within, axis=1)
    scores_sorted = np.take_along_axis(top_k_scores, sorted_within, axis=1)
    return top_k_sorted, scores_sorted
# PyTorch version (GPU-accelerated)
def batch_cosine_similarity_torch(
    queries: torch.Tensor, database: torch.Tensor
) -> torch.Tensor:
    """GPU-accelerated batch cosine similarity."""
    queries_norm = torch.nn.functional.normalize(queries, p=2, dim=1)
    database_norm = torch.nn.functional.normalize(database, p=2, dim=1)
    return queries_norm @ database_norm.T

def top_k_similar_torch(
    queries: torch.Tensor, database: torch.Tensor, k: int
) -> Tuple[torch.Tensor, torch.Tensor]:
    """GPU-accelerated top-k similar search."""
    similarities = batch_cosine_similarity_torch(queries, database)
    scores, indices = torch.topk(similarities, k, dim=1)
    return indices, scores
# ---------- demo ----------
if __name__ == "__main__":
    np.random.seed(42)
    n_queries, n_database, d = 5, 10000, 128
    # Generate random embeddings
    queries = np.random.randn(n_queries, d).astype(np.float32)
    database = np.random.randn(n_database, d).astype(np.float32)
    # Test basic cosine similarity
    sim = cosine_similarity_np(queries[0], database[0])
    print(f"Single pair similarity: {sim:.4f}")
    # Test batch similarity
    sim_matrix = batch_cosine_similarity_np(queries, database)
    print(f"Similarity matrix shape: {sim_matrix.shape}")
    print(f"Value range: [{sim_matrix.min():.4f}, {sim_matrix.max():.4f}]")
    # Test top-k
    k = 5
    indices, scores = top_k_similar_np(queries, database, k)
    print(f"\nTop-{k} results:")
    for q in range(n_queries):
        print(f" Query {q}: indices={indices[q]}, scores={scores[q].round(4)}")
    # Verify: top-k should match brute-force sort
    for q in range(n_queries):
        sorted_all = np.argsort(-sim_matrix[q])[:k]
        assert set(indices[q]) == set(sorted_all), f"Mismatch for query {q}"
    print("Top-k verification passed.")
    # Test edge case: zero vector
    zero_vec = np.zeros(d)
    sim = cosine_similarity_np(zero_vec, database[0])
    print(f"\nZero vector similarity: {sim}")  # Should be 0.0
    # Test PyTorch version
    q_torch = torch.from_numpy(queries)
    db_torch = torch.from_numpy(database)
    sim_torch = batch_cosine_similarity_torch(q_torch, db_torch).numpy()
    max_diff = np.abs(sim_matrix - sim_torch).max()
    print(f"NumPy vs PyTorch max difference: {max_diff:.2e}")
Walkthrough
- Pre-normalization -- The key optimization. By dividing each vector by its L2 norm once, cosine similarity becomes a simple dot product. This avoids computing norms for every pair.
- Matrix multiplication -- normalized_queries @ normalized_database.T computes all n_q * n_db similarities in a single BLAS call. On a GPU, this is orders of magnitude faster than a Python loop.
- Top-k selection -- np.argpartition does a partial sort in O(n) expected time per row, finding the top-k elements without sorting the entire array. Then we sort only the k elements for final ordering.
- Zero vector handling -- A zero vector has no direction, so cosine similarity is undefined. We clamp norms to a small epsilon and return 0.0 similarity, which is the standard convention.
- PyTorch version -- F.normalize handles the normalization, and torch.topk efficiently selects top-k on GPU. This is the production pattern used in vector search systems.
Complexity Analysis
- Batch similarity: O(n_q * n_db * d) for the matrix multiplication. This is the bottleneck.
- Pre-normalization: O((n_q + n_db) * d) -- done once.
- Top-k selection: O(n_q * n_db) for argpartition, plus O(n_q * k * log k) for sorting the top-k.
- Memory: O(n_q * n_db) for the full similarity matrix. For very large databases, this must be tiled.
For a database of 1M float32 vectors with d=768, the vectors alone occupy 1M * 768 * 4 bytes = ~3GB, and the similarity matrix adds n_q * 1M * 4 bytes on top (4GB for just 1,000 queries). Tiling (computing in chunks) is necessary to avoid OOM.
Interview Tips
Key topics: (1) Cosine vs. L2 distance vs. dot product -- for normalized vectors, ranking by cosine = ranking by dot product = ranking by L2 distance. They are equivalent after normalization. (2) Why normalization matters: unnormalized embeddings can have arbitrary magnitudes, making dot product unreliable. (3) Scaling: for billions of vectors, you cannot compute all pairwise similarities. Use ANN indices (FAISS, HNSW) after cosine normalization. (4) MIPS (Maximum Inner Product Search) reduces to nearest neighbor search after normalization. (5) Quantization: reduce float32 vectors to int8 for 4x memory savings with minimal recall loss.
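Points (1) and (5) above are easy to check numerically. For unit vectors, ||u - v||^2 = 2 - 2 (u . v), so L2 distance is a decreasing function of the dot product and the rankings agree; and rounding unit vectors to int8 loses very little similarity precision. A hedged sketch (the `quantize_int8` helper is illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))
q = rng.normal(size=64)

# Normalize everything once
qn = q / np.linalg.norm(q)
dbn = db / np.linalg.norm(db, axis=1, keepdims=True)

# (1) For unit vectors, ||u - v||^2 = 2 - 2*(u . v): rankings coincide
by_dot = np.argsort(-(dbn @ qn))
by_l2 = np.argsort(np.linalg.norm(dbn - qn, axis=1))
assert np.array_equal(by_dot, by_l2)

# (5) int8 quantization: components of unit vectors lie in [-1, 1],
# so scale by 127, round, and accumulate dot products in int32
def quantize_int8(vn: np.ndarray) -> np.ndarray:
    """Symmetric int8 quantization of unit-normalized vectors (illustrative)."""
    return np.clip(np.round(vn * 127), -127, 127).astype(np.int8)

db_q = quantize_int8(dbn)
q_q = quantize_int8(qn[None, :])[0]
approx = (db_q.astype(np.int32) @ q_q.astype(np.int32)) / 127.0**2
exact = dbn @ qn
print("max abs quantization error:", np.abs(approx - exact).max())
```

The int8 version stores each vector in a quarter of the float32 memory while keeping similarity errors small; production systems (e.g. FAISS) use this and more aggressive schemes.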
Quiz
After normalizing all vectors to unit length, cosine similarity is equivalent to which operation?
Why is cosine similarity preferred over Euclidean distance for comparing text embeddings?
Why does np.argpartition provide a speedup over np.argsort for top-k selection?