NNDB: how many vector queries per second can one CPU box really do?

Jul 1, 2026(2 days ago)-

Fuxing Loh

This is the short version. The full writeup with interactive charts is at nndb.fuxing.dev, the trail of all 66 experiments is at nndb.fuxing.dev/notes, and the code is at github.com/fuxingloh/nndb.

A while back (a few weeks from writing this article) I ran into a vector-search problem space, and my mind naturally gravitated to it due to my background in the spatial-search space. The shape of it felt familiar, index and partition the data, route the query to indexes, rerank the candidates after fusion. But there’s one difference that changes everything. In spatial, the partition is exact: a geohash prefix is the answer to “what’s near here”, so the candidate set is bounded and the query cost is known before you run it. In high-dimensional embedding space there is no prefix. Every cheap way to partition the space is lossy, the true neighbor can always be sitting in a cell you didn’t visit, so correctness stops being a guarantee and becomes a dial called recall, and the known-O(complexity) lookup I was used to becomes a scan you have to make fast.

How everyone else does it

Since nobody has an exact router, everyone approximates, and part of the study was reading how the market players cope. Once you line them up, the whole landscape is two decisions. Decision one: skip structure. How do you avoid touching most of the corpus? Either you cluster it into IVF cells and route each query to a few, or you build a proximity graph and greedy-walk toward the neighborhood. This choice is really about where the vectors live: a graph walk is a chain of dependent random hops, so it wants everything in RAM; IVF cells are contiguous blocks you stream, so they survive disk and object storage. Decision two: scan representation. Once you’re down to candidates, what do you actually compare? Full-precision floats (exact but fat), or compressed codes with a full-precision rerank to buy the recall back.

Engine	Skip Structure	Priorities
pgvector	IVFFlat or HNSW	simplicity inside Postgres, scale is not the goal
hnswlib	HNSW graph in RAM	the reference graph implementation, latency above all
Qdrant	filterable HNSW graph in RAM	low latency with filters, quantized codes + oversampling rerank
Weaviate	HNSW, or no structure at all (flat BQ)	below a certain scale, a fast binary scan beats building a graph
Faiss	IVF	billion-scale batch throughput, 4-bit SIMD fast scan
ScaNN	learned partitions	recall-per-bit, codes trained to preserve ranking not vectors
DiskANN	Vamana graph on SSD	billion scale on one machine, graph laid out for disk
LanceDB	IVF over a columnar disk format	storage cost, quantized codes streamed from disk
turbopuffer	IVF over object storage	cheapest possible storage, RAM holds only the hot cells
Milvus	wraps Faiss and HNSW, pick per collection	operability, segments and updates over raw speed
Vespa	HNSW beside inverted indexes	vectors as one signal in a bigger ranking pipeline

Read the choices column by column and a pattern falls out. Nobody past toy scale scans floats, everyone compresses and reranks, the argument is only over which codes. The graph engines are betting on RAM and latency; the IVF engines are betting on cheap storage and throughput; Weaviate’s flat+BQ mode is the honest admission that below a certain corpus size, the fanciest structure loses to a brutally fast scan. And whichever skip structure they pick, every one of them ends up needing the same inner layer, scan compressed codes fast, rerank a shortlist exactly. That layer is what NNDB is, the scan inside one IVF cell pushed as far as the hardware allows, with the router above it deliberately out of scope. The detail I only appreciated later: Weaviate’s flat+BQ and Qdrant’s oversampling+rerank are the exact binary funnel this study converged on independently, it’s already the production answer.

Why I dug in

Today every LLM application sits on top of this, retrieval is embeddings, and embeddings mean top-k vector search at serving latency. So I had two reasons to dig in. The first is past experiences and curiosity. The second is that this problem is hardware-adjacent to LLM inference itself: a vector scan streams a big block of memory and does a little math per byte, which is the same shape as attention streaming a KV cache, and both slam into the same wall, memory bandwidth, not compute. Studying the small version of the problem, where I can measure everything on one box, felt like the honest way to understand the limitations the big one is running into. I could have just read the papers and studied how everyone else does it, but applied research is how it actually sticks for me, build it, measure it, be wrong about it, then understand it.

So I built NNDB, an in-memory vector-search engine from scratch in Rust. Not to ship a product, but as a curiosity run (using /goal and /loop to help me achieve it, run on Opus 4.8, if you’re curious). The thing I really wanted to understand is the boundary between memory-bandwidth-bound and CPU-bound, because that boundary is where all the interesting tradeoffs live. Every change became a numbered experiment with the measurements that justified or killed it, and the repo turned into a lab notebook. Sixty-six experiments later the study is done, and the full writeup with interactive charts is at nndb.fuxing.dev.

Here’s the TLDR up front: you can always get vector search under 10ms if you really want to. The question is what it costs you. Latency is purchasable with compute dollars, so the real study is the tradeoff, recall against QPS against bytes against silicon, and what the best implementation looks like once you understand which wall you’re actually hitting.

The wall is memory, not compute

Brute-force f32 search over 1M Cohere embeddings does about 9 QPS. My first instinct was to speed up the math, make the distance computation itself faster with SIMD and FMA, and it worked, briefly, then stopped. The bottleneck was never compute. Every query streams 4 GB from RAM, and you can’t out-compute the memory bus. Once I saw that, everything collapsed into two levers: send fewer bytes, or spend cheaper instructions on each byte. The other sixty-odd experiments are just those two levers.

Where it ended up

Where it ended up is a 1-bit binary funnel. Keep one sign bit per dimension, which is 32× smaller, scan everything with popcount to get a shortlist, then re-rank only the shortlist against the real vectors. Comparing two binary codes is XOR then count-the-differing-bits, and counting bits is a single instruction in the CPU’s SIMD set, VPOPCNTDQ chews through 512 bits at a time. That’s why the scan is so damn fast: the entire per-vector comparison is one instruction, no multiplies, no floats. The same box that did 9 QPS does ~960 QPS at 0.995 recall, and the whole index is 128 MB. An AMD Zen5 box at the same price does ~2,310. A full 192-core socket gets to ~19,900 and then the memory bus saturates. On 256-bit Matryoshka codes the same funnel does 0.991 recall at ~5,000 QPS with a 3 ms p50.

What I’d keep

If I had to keep one lesson, it’s that on a CPU the representation that wins is the one whose per-vector operation is a single SIMD instruction. I tried a scheme with better recall-per-bit and it lost to plain popcount anyway, because its inner loop was a gather the CPU can’t vectorize. That observation decided almost every call after it. The other thing I learned the hard way is that shrinking the data re-prices every trick you’ve already tuned, because it moves you across the bandwidth-bound/compute-bound boundary. Query tiling bought me +73% while the scan was bandwidth-bound, then cost 24% once the codes fit in cache. A tuning verdict doesn’t survive a regime change, and it flipped on me twice.

The dead ends are all still in the log, scalar product quantization, HNSW inside a cell, register tiling, prefetch. Knowing what doesn’t work, and why, was half the point of doing this.

Pushing the frontier

The cache flip also meant the lab had stopped resembling the real problem, so I pushed the corpus to 10M vectors, 320 MB of codes, far past any cache, and expected to be back at the DRAM wall. Measuring properly said no: the scan was compute-bound, which means the kernel itself was the only thing left to attack. So the last arc (067–074) is tail-end optimization, reading the actual disassembly and doing autopsies on the emitted kernel. It turned out roughly half the hot loop’s uops were safety scaffolding and answer-format plumbing the source never wrote, bounds checks, per-lane extractions, re-fetched loop invariants, and the binding resource on Zen5 wasn’t even the ALUs, it was dispatch width, 8 uops per cycle. Once the safe-Rust tricks ran out, the checks came out the honest way, unsafe width casts and raw intrinsics, so the vector units are actually saturated doing the math instead of proving to the compiler that the math is allowed. Kernel-only changes, recall bit-identical at every step: 650 → 1,471 QPS, 2.26×. The loop is now ~20 uops per document and essentially irreducible at this ISA. And a fourth law fell out, found three separate times on the way: the arithmetic was never the cost, converting the answer into the consumer’s format was.

Even at “irreducible” there’s more hyper-optimization on the table, because the two walls take different levers and you always end up leaning on one of them. When the memory bus is the wall, the lever is bytes streamed: better embedding models are effectively a compression scheme, Matryoshka took the codes from 1024 bits to 256 without losing the neighbors, and a model trained to survive 128-bit binary quantization would halve the stream again. When compute is the wall, the lever is tiling: fit as many queries into a tile as the registers and cache allow, so every code loaded is scored against a full tile and the vector units never wait on the next byte. The game doesn’t end, it just alternates which wall you’re pushing.

Why it stops here

What ends the study isn’t running out of frontier, it’s that the next questions aren’t answerable from a benchmark box. The real problem is billions of vectors that will never all be hot at once, and at that scale the game shifts from making one scan fast to orchestration: noticing which IVF cells are hot, keeping those resident in cache while the cold ones live in RAM or on disk, deciding what to evict and when. I can’t test any of that honestly in a lab, because the answers depend on a real workload’s access pattern, how the corpus grows, and what recall the product actually needs. I have the single-box story characterized, throughput predictable from cores and corpus size, the bandwidth wall mapped, a rule for which instance to buy, and a kernel with nothing left to give. The rest needs real usage, not more lab time.

If you want the full detail, the writeup is at nndb.fuxing.dev and every experiment is logged at nndb.fuxing.dev/notes.