Flash-KMeans: Exact K-Means 200x Faster on GPUs
Researchers open-sourced Flash-KMeans, an exact GPU k-means that runs over 200x faster than FAISS by moving data smarter, not by approximating.
Evgenii Arsentev · PhDResearchers from UC Berkeley and UT Austin have released Flash-KMeans, an open-source GPU implementation of standard Lloyd's k-means clustering that reports running over 200x faster than FAISS. The notable part: it 'does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU.' It ships under Apache 2.0 and installs with a single 'pip install flash-kmeans.'
The benchmarks, measured on an NVIDIA H200 in FP16 at 128 dimensions, are steep: 33x faster than NVIDIA's own cuML, up to 17.9x end-to-end versus the best baseline, and up to 10.5x faster than fastkmeans on an out-of-core run over 1 billion points. The two custom kernels post their own gains — up to 21.2x and 6.3x — that combine into those end-to-end numbers.
How it gets the speedup
Two memory tricks do the work. FlashAssign fuses the distance computation with the online argmin, so the full N×K distance matrix is never written out to memory — dropping IO complexity from O(NK) to O(Nd+Kd). Sort-Inverse Update then sorts points by cluster ID, converting scattered atomic writes into contiguous segment reductions and cutting atomic contention. Neither step touches the result: you get identical clusters, just far less data shuffled across the GPU's memory hierarchy.
Why it matters
K-means is quiet plumbing underneath a lot of modern AI: vector-search indexing, sparse-attention routing, KV-cache compression, low-bit KV quantization, and diffusion transformers. These pipelines call clustering over and over inside training and inference loops, where the cost that actually bites is latency per call, not textbook big-O. A drop-in exact speedup means the same answers for less time and money — and because it's exact, there's no accuracy trade-off to argue about before adopting it.
If you run anything with a vector database or your own embeddings pipeline, this is the kind of release worth a quick benchmark on your real data before you accept it. Exact-but-faster is the rare free lunch: same output, lower bill. The honest test is your own workload on your own hardware — a 200x headline on an H200 won't be your number, but even a fraction of it is worth ten minutes to check.

Author
Evgenii Arsentev
PhD · Chief Product Officer at a healthtech company
Want to actually build this?
Guides explain. The free course transforms — personalized, gamified, and built to get you shipping fast.
◉ Start the free courseSource: marktechpost.com