Flash-KMeans: Exact K-Means 200x Faster on GPUs

Researchers open-sourced Flash-KMeans, an exact GPU k-means that runs over 200x faster than FAISS by moving data smarter, not by approximating.

↻ Published 2026-06-15◷ 4 min readEA

Evgenii Arsentev · PhD

Researchers from UC Berkeley and UT Austin have released Flash-KMeans, an open-source GPU implementation of standard Lloyd's k-means clustering that reports running over 200x faster than FAISS. The notable part: it 'does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU.' It ships under Apache 2.0 and installs with a single 'pip install flash-kmeans.'

The benchmarks, measured on an NVIDIA H200 in FP16 at 128 dimensions, are steep: 33x faster than NVIDIA's own cuML, up to 17.9x end-to-end versus the best baseline, and up to 10.5x faster than fastkmeans on an out-of-core run over 1 billion points. The two custom kernels post their own gains — up to 21.2x and 6.3x — that combine into those end-to-end numbers.

How it gets the speedup

Two memory tricks do the work. FlashAssign fuses the distance computation with the online argmin, so the full N×K distance matrix is never written out to memory — dropping IO complexity from O(NK) to O(Nd+Kd). Sort-Inverse Update then sorts points by cluster ID, converting scattered atomic writes into contiguous segment reductions and cutting atomic contention. Neither step touches the result: you get identical clusters, just far less data shuffled across the GPU's memory hierarchy.

Why it matters

K-means is quiet plumbing underneath a lot of modern AI: vector-search indexing, sparse-attention routing, KV-cache compression, low-bit KV quantization, and diffusion transformers. These pipelines call clustering over and over inside training and inference loops, where the cost that actually bites is latency per call, not textbook big-O. A drop-in exact speedup means the same answers for less time and money — and because it's exact, there's no accuracy trade-off to argue about before adopting it.

ℹWhat I'd actually do

If you run anything with a vector database or your own embeddings pipeline, this is the kind of release worth a quick benchmark on your real data before you accept it. Exact-but-faster is the rare free lunch: same output, lower bill. The honest test is your own workload on your own hardware — a 200x headline on an H200 won't be your number, but even a fraction of it is worth ten minutes to check.

#ai#open-source#gpu#research

Author

Evgenii Arsentev

PhD · Chief Product Officer at a healthtech company

About the author →

Want to actually build this?

Guides explain. The free course transforms — personalized, gamified, and built to get you shipping fast.

◉ Start the free course

← All news

Source: marktechpost.com