NNDB: how many vector queries per second can one CPU box really do?
This is the short version. The full writeup with interactive charts is at nndb.fuxing.dev, the trail of all 66 experiments is at nndb.fuxing.dev/notes, and the code is at github.com/fuxingloh/nndb.
A while back (a few weeks from writing this article) I ran into a vector-search problem space, and my mind naturally gravitated to it due to my background in the spatial-search space. The shape of it felt familiar, index and partition the data, route the query to indexes, rerank the candidates after fusion. But there’s one difference that changes everything. In spatial, the partition is exact: a geohash prefix is the answer to “what’s near here”, so the candidate set is bounded and the query cost is known before you run it. In high-dimensional embedding space there is no prefix. Every cheap way to partition the space is lossy, the true neighbor can always be sitting in a cell you didn’t visit, so correctness stops being a guarantee and becomes a dial called recall, and the known-O(complexity) lookup I was used to becomes a scan you have to make fast.
How everyone else does it
Since nobody has an exact router, everyone approximates, and part of the study was reading how the market players cope. Once you line them up, the whole landscape is two decisions. Decision one: skip structure. How do you avoid touching most of the corpus? Either you cluster it into IVF cells and route each query to a few, or you build a proximity graph and greedy-walk toward the neighborhood. This choice is really about where the vectors live: a graph walk is a chain of dependent random hops, so it wants everything in RAM; IVF cells are contiguous blocks you stream, so they survive disk and object storage. Decision two: scan representation. Once you’re down to candidates, what do you actually compare? Full-precision floats (exact but fat), or compressed codes with a full-precision rerank to buy the recall back.
| Engine | Skip Structure | Priorities |
|---|---|---|
| pgvector | IVFFlat or HNSW | simplicity inside Postgres, scale is not the goal |
| hnswlib | HNSW graph in RAM | the reference graph implementation, latency above all |
| Qdrant | filterable HNSW graph in RAM | low latency with filters, quantized codes + oversampling rerank |
| Weaviate | HNSW, or no structure at all (flat BQ) | below a certain scale, a fast binary scan beats building a graph |
| Faiss | IVF | billion-scale batch throughput, 4-bit SIMD fast scan |
| ScaNN | learned partitions | recall-per-bit, codes trained to preserve ranking not vectors |
| DiskANN | Vamana graph on SSD | billion scale on one machine, graph laid out for disk |
| LanceDB | IVF over a columnar disk format | storage cost, quantized codes streamed from disk |
| turbopuffer | IVF over object storage | cheapest possible storage, RAM holds only the hot cells |
| Milvus | wraps Faiss and HNSW, pick per collection | operability, segments and updates over raw speed |
| Vespa | HNSW beside inverted indexes | vectors as one signal in a bigger ranking pipeline |
Read the choices column by column and a pattern falls out. Nobody past toy scale scans floats, everyone compresses and reranks, the argument is only over which codes. The graph engines are betting on RAM and latency; the IVF engines are betting on cheap storage and throughput; Weaviate’s flat+BQ mode is the honest admission that below a certain corpus size, the fanciest structure loses to a brutally fast scan. And whichever skip structure they pick, every one of them ends up needing the same inner layer, scan compressed codes fast, rerank a shortlist exactly. That layer is what NNDB is, the scan inside one IVF cell pushed as far as the hardware allows, with the router above it deliberately out of scope. The detail I only appreciated later: Weaviate’s flat+BQ and Qdrant’s oversampling+rerank are the exact binary funnel this study converged on independently, it’s already the production answer.
Why I dug in
Today every LLM application sits on top of this, retrieval is embeddings, and embeddings mean top-k vector search at serving latency. So I had two reasons to dig in. The first is past experiences and curiosity. The second is that this problem is hardware-adjacent to LLM inference itself: a vector scan streams a big block of memory and does a little math per byte, which is the same shape as attention streaming a KV cache, and both slam into the same wall, memory bandwidth, not compute. Studying the small version of the problem, where I can measure everything on one box, felt like the honest way to understand the limitations the big one is running into. I could have just read the papers and studied how everyone else does it, but applied research is how it actually sticks for me, build it, measure it, be wrong about it, then understand it.
So I built NNDB, an in-memory vector-search engine from scratch in Rust. Not to ship a product, but as a
curiosity run (using /goal and /loop to help me achieve it, run on Opus 4.8, if you’re curious). The thing I
really wanted to understand is the boundary between memory-bandwidth-bound and CPU-bound, because that boundary
is where all the interesting tradeoffs live. Every change became a numbered experiment with the measurements that
justified or killed it, and the repo turned into a lab notebook. Sixty-six experiments later the study is done,
and the full writeup with interactive charts is at nndb.fuxing.dev.
Here’s the TLDR up front: you can always get vector search under 10ms if you really want to. The question is what it costs you. Latency is purchasable with compute dollars, so the real study is the tradeoff, recall against QPS against bytes against silicon, and what the best implementation looks like once you understand which wall you’re actually hitting.
The wall is memory, not compute
Brute-force f32 search over 1M Cohere embeddings does about 9 QPS. My first instinct was to speed up the math, make the distance computation itself faster with SIMD and FMA, and it worked, briefly, then stopped. The bottleneck was never compute. Every query streams 4 GB from RAM, and you can’t out-compute the memory bus. Once I saw that, everything collapsed into two levers: send fewer bytes, or spend cheaper instructions on each byte. The other sixty-odd experiments are just those two levers.
Where it ended up
Where it ended up is a 1-bit binary funnel. Keep one sign bit per dimension, which is 32× smaller, scan
everything with popcount to get a shortlist, then re-rank only the shortlist against the real vectors. Comparing
two binary codes is XOR then count-the-differing-bits, and counting bits is a single instruction in the CPU’s
SIMD set, VPOPCNTDQ chews through 512 bits at a time. That’s why the scan is so damn fast: the entire
per-vector comparison is one instruction, no multiplies, no floats.
The same box that did 9 QPS does ~960 QPS at 0.995 recall, and the whole index is 128 MB. An AMD Zen5 box at the
same price does ~2,310. A full 192-core socket gets to ~19,900 and then the memory bus saturates. On 256-bit
Matryoshka codes the same funnel does 0.991 recall at ~5,000 QPS with a 3 ms p50.
What I’d keep
If I had to keep one lesson, it’s that on a CPU the representation that wins is the one whose per-vector operation is a single SIMD instruction. I tried a scheme with better recall-per-bit and it lost to plain popcount anyway, because its inner loop was a gather the CPU can’t vectorize. That observation decided almost every call after it. The other thing I learned the hard way is that shrinking the data re-prices every trick you’ve already tuned, because it moves you across the bandwidth-bound/compute-bound boundary. Query tiling bought me +73% while the scan was bandwidth-bound, then cost 24% once the codes fit in cache. A tuning verdict doesn’t survive a regime change, and it flipped on me twice.
The dead ends are all still in the log, scalar product quantization, HNSW inside a cell, register tiling, prefetch. Knowing what doesn’t work, and why, was half the point of doing this.
Why it stops here
The study stops here, and the cache flip above is actually why. Once my test corpus got small enough to sit entirely in L3, the lab stopped resembling the real problem. The real problem is billions of vectors that will never all be hot at once, and at that scale the game shifts from making one scan fast to orchestration: noticing which IVF cells are hot, keeping those resident in cache while the cold ones live in RAM or on disk, deciding what to evict and when. I can’t test any of that honestly on a benchmark box, because the answers depend on a real workload’s access pattern, how the corpus grows, and what recall the product actually needs. I have the single-box story characterized, throughput predictable from cores and corpus size, the bandwidth wall mapped, a rule for which instance to buy. The rest needs real usage, not more lab time.
If you want the full detail, the writeup is at nndb.fuxing.dev and every experiment is logged at nndb.fuxing.dev/notes.