It uses clustering, Matryoshka embeddings, binary quantization, and SIMD operations. Written in Rust, of course 🦀
At Exa, we’re redesigning web search from scratch to handle complex queries that Google can’t. To do that, we needed to build our own web-scale vector database.
In this blog post, we’ll describe in detail how that vector database works—from high-level algorithms, to low-level optimizations that we haven’t seen in existing literature.
The end result is a database that can search billions of vectors in under 100ms with high recall, all while enabling keyword + metadata filters and using less memory than a gaming PC.
Exa needs to handle complex semantic queries like “AI startups in the Bay Area building hardware” and return a list of companies that exactly match those criteria.
To capture the semantic meaning of a document, we convert the document into a vector (aka an embedding). Documents with similar meanings become embeddings with similar numerical values.
To search through the documents, we convert the user’s query into an embedding, compare it against the embeddings of all the web documents we’ve indexed, and return the closest matches.
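To make that concrete, here is a minimal brute-force sketch of that comparison step (the types, dimensions, and function names are ours for illustration, not Exa's actual code): score every document against the query and keep the best matches. Everything that follows is about making this loop dramatically cheaper.

```rust
/// Dot-product similarity between two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Brute-force search: score every document embedding against the query
/// and return the top-k (doc_id, score) pairs.
fn brute_force_search(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(id, doc)| (id, dot(query, doc)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest similarity first
    scored.truncate(k);
    scored
}
```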
So, we need to store all our document embeddings in a way that’s memory-efficient and enables fast search, without losing precision.
We combined five optimizations to significantly improve memory usage and speed.
Turns out, you can take any of these 4096-dimensional embeddings, chop off everything except the first 256 elements, and retain much of its meaning.
How? We trained our embeddings model—i.e., the algorithm that transforms raw web pages into their vector representations—using Matryoshka.[1]
Matryoshka is a technique that forces prefixes of an embedding to be good approximations of the whole embedding. Meaning: the first 2048 dimensions are a good embedding, the first 1024 dimensions are also a good embedding, and so on, down to, say, the first 32 dimensions.
We cut our vectors to 256 dimensions, which reduced memory usage by 16x.
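As a sketch, truncation is just slicing off a prefix. The dimensions come from the post; the re-normalization step is our assumption about how truncated Matryoshka embeddings are typically used, not something stated here.

```rust
/// Keep only the first 256 of 4096 dimensions.
/// Re-normalizing afterwards is an assumption (common practice so that
/// dot products still behave like cosine similarity).
fn truncate_embedding(full: &[f32; 4096]) -> [f32; 256] {
    let mut short = [0.0f32; 256];
    short.copy_from_slice(&full[..256]);
    let norm = short.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in short.iter_mut() {
            *x /= norm;
        }
    }
    short
}
```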
Now each embedding comprises just 256 floating-point numbers, but exhaustively searching billions of them is still too slow.
How can we speed it up?
Well, each of these floats occupies 16 bits of space. We used binary quantization to replace the 16-bit floats with 1-bit values. For each float in the vector: if it’s greater than 0, set it to 1; else, set it to -1.
This cut our memory usage another 16 times!
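Here is one way the quantization and packing could look (the bit layout is illustrative): each 256-dimension vector collapses to 256 sign bits, stored in four u64 words.

```rust
/// Binary quantization sketch: a set bit stands for a positive element (+1),
/// a cleared bit for a non-positive one (-1).
fn binarize(embedding: &[f32; 256]) -> [u64; 4] {
    let mut packed = [0u64; 4];
    for (i, &value) in embedding.iter().enumerate() {
        if value > 0.0 {
            packed[i / 64] |= 1u64 << (i % 64);
        }
    }
    packed
}
```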
The standard search method for binary-quantized embeddings is to compute the Hamming distance (i.e. number of elements that are different) from the binary-quantized query embedding to each binary document vector.
But this was lossy and inefficient. So we developed a novel hybrid approach: keep using binary document embeddings—but for the query embedding, use the uncompressed floating-point values.
Then use dot product as a similarity metric between the query vector and each of the binary-quantized document vectors. (Imagine two vectors in geometric space—the dot product measures the angle between them. A large dot product means the vectors are pointing in the same direction and thus very similar.)
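A straightforward, unoptimized version of that hybrid scoring might look like this (names and bit layout are illustrative): since every stored document element is +1 or -1, the dot product is just the query's elements added or subtracted according to the document's bits.

```rust
/// Asymmetric scoring: full-precision query vs. binary (packed sign bits) document.
fn asymmetric_dot(query: &[f32; 256], doc_bits: &[u64; 4]) -> f32 {
    let mut score = 0.0f32;
    for i in 0..256 {
        let bit = (doc_bits[i / 64] >> (i % 64)) & 1;
        if bit == 1 {
            score += query[i]; // document element is +1
        } else {
            score -= query[i]; // document element is -1
        }
    }
    score
}
```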
We hyper-optimized our dot product calculations with a strange but effective method: since every document value is just +1 or -1, the dot product reduces to adding and subtracting query elements, so for each group of four document bits we precompute all 16 possible partial sums in a small lookup table, built once per query.
Scoring a document then takes one table lookup per group instead of four multiply-adds, cutting the number of computations by 4x.[2]
We further optimized lookup times by loading the entire lookup table into CPU registers. This enables much faster lookups than keeping it in RAM.[3]
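Here is a hedged sketch of how such a lookup-table kernel can work. The grouping into 4-bit nibbles and the 16-entry tables are our reading of the description above, not Exa's published code, and the register-resident SIMD version is left as a comment.

```rust
const DIMS: usize = 256;
const GROUPS: usize = DIMS / 4; // 64 groups of 4 dimensions each

/// Once per query: for every group of four query elements, precompute the
/// partial dot product against each of the 16 possible 4-bit sign patterns.
fn build_lookup_table(query: &[f32; DIMS]) -> [[f32; 16]; GROUPS] {
    let mut table = [[0.0f32; 16]; GROUPS];
    for g in 0..GROUPS {
        for pattern in 0..16u32 {
            let mut partial = 0.0f32;
            for j in 0..4 {
                let q = query[g * 4 + j];
                // bit j set => document element +1, cleared => -1
                partial += if (pattern >> j) & 1 == 1 { q } else { -q };
            }
            table[g][pattern as usize] = partial;
        }
    }
    table
}

/// Per document: one table lookup per 4-bit nibble of the packed sign bits.
/// (The real speedup comes from keeping these tables in registers, e.g. via
/// SIMD shuffle-based lookups; this scalar loop only shows the logic.)
fn score_with_table(table: &[[f32; 16]; GROUPS], doc_bits: &[u64; 4]) -> f32 {
    let mut score = 0.0f32;
    for g in 0..GROUPS {
        // Each u64 word holds 16 nibbles; group g lives in word g / 16.
        let nibble = (doc_bits[g / 16] >> ((g % 16) * 4)) & 0xF;
        score += table[g][nibble as usize];
    }
    score
}
```

With the per-query tables built, scoring each document is just shifts, masks, and adds, which is what makes pinning the tables in registers (as described above) pay off; the exact instructions used for that are an implementation detail we are not reproducing here.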
Thus far, we’ve still been looking through every document embedding on every search. Clustering fixes that.
We divided our documents into 100,000 clusters, each grouped by similarity. When we search, we figure out which cluster our query embedding belongs in, and only search through that cluster and nearby ones.
With clustering, we can get a roughly 1000x improvement in throughput without much accuracy degradation 🤯.
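The routing step can be sketched like this (an IVF-style layout with hypothetical names): compare the query against the cluster centroids, keep the handful of closest clusters, and only score the documents they contain.

```rust
/// A cluster holds a representative centroid and the documents assigned to it.
struct Cluster {
    centroid: [f32; 256],
    doc_ids: Vec<u32>, // only these documents get scored if the cluster is probed
}

fn dot256(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Indices of the `n_probe` clusters whose centroids best match the query.
fn nearest_clusters(query: &[f32; 256], clusters: &[Cluster], n_probe: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = clusters
        .iter()
        .enumerate()
        .map(|(i, c)| (i, dot256(query, &c.centroid)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(n_probe).map(|(i, _)| i).collect()
}
```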
You might wonder—haven’t we just cut out a lot of information by truncating and squashing the vectors so aggressively?
Yes, these optimizations are lossy. But nothing some reranking can’t fix. After the initial search, we fetch extra results and rerank them with the uncompressed data.
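A minimal rerank sketch, assuming we over-fetch candidate IDs from the fast binary index and can load their uncompressed embeddings (the names and the `full_embedding` accessor are hypothetical):

```rust
/// Rescore a small candidate set with full-precision embeddings and keep the top k.
fn rerank(
    query_full: &[f32],
    candidates: &[u32],                        // over-fetched ids from the lossy search
    full_embedding: impl Fn(u32) -> Vec<f32>,  // fetches the uncompressed vector
    k: usize,
) -> Vec<u32> {
    let mut scored: Vec<(u32, f32)> = candidates
        .iter()
        .map(|&id| {
            let doc = full_embedding(id);
            let score: f32 = query_full.iter().zip(doc.iter()).map(|(q, d)| q * d).sum();
            (id, score)
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```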
We deliberately designed the vector database to allow for complex filters, like filtering by date range, domain, or keyword. Basically we want our vector database to act like a database!
Unlike graph-based methods like HNSW, our clustered, compressed approach fits nicely with filters. For each filterable field (category, date, etc.), we build an inverted index: a hashmap from each filterable value to the list of document IDs that match it. If the user specifies a filter, we only run our search algorithm over the documents that match the filter in the inverted index.
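A simplified sketch of that inverted-index filtering, with hypothetical field names: each filterable value maps to the document IDs that carry it, and candidates from the vector search are intersected with that allow-list.

```rust
use std::collections::{HashMap, HashSet};

/// Inverted indexes from filterable values to matching document IDs.
struct InvertedIndexes {
    by_domain: HashMap<String, Vec<u32>>,
    by_category: HashMap<String, Vec<u32>>,
}

impl InvertedIndexes {
    /// Document IDs that pass a `domain` filter; empty if nothing matches.
    fn allowed_by_domain(&self, domain: &str) -> &[u32] {
        self.by_domain.get(domain).map(Vec::as_slice).unwrap_or(&[])
    }
}

/// Keep only the search candidates that also appear in the filter's allow-list.
fn apply_filter(candidates: &[u32], allowed: &[u32]) -> Vec<u32> {
    let allowed: HashSet<u32> = allowed.iter().copied().collect();
    candidates.iter().copied().filter(|id| allowed.contains(id)).collect()
}
```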
This is only the beginning. It turns out a standard vector database is not enough for perfect web search. More on that coming soon :)
P.S. If this post left you hungry for more, come help us fix web search.