AI Responds 15x Faster — Without Any Quality Loss

UC San Diego's DFlash drafts entire token blocks at once instead of word by word, reaching up to 15x throughput on NVIDIA Blackwell. Open source, works with vLLM out of the box.

4 min read · Evgenii Arsentev · PhD

Up to fifteen times faster. That's the peak throughput gain NVIDIA measured when they ran DFlash, a new open-source technique from UC San Diego's z-lab, on their Blackwell GPUs using a 120B-parameter model. The benchmark is not a lab toy: Blackwell is NVIDIA's latest production hardware, already running in major cloud data centers.

One block at a time instead of one word at a time

Standard AI generation works like someone typing a message letter by letter: the model picks one token, outputs it, then picks the next. Multiply that by hundreds of tokens per response and you can feel the latency pile up. DFlash flips the logic. A small, lightweight helper model drafts an entire block of tokens in one shot; think of it like stamping a whole word at once instead of writing each letter separately. The main model then checks the whole block in a single parallel pass, keeps the run of tokens it agrees with, and substitutes its own token at the first point of disagreement. Because the main model has the final say on every token, the output matches what it would have produced on its own.
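The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not DFlash's implementation: the "models" are plain functions, verification is assumed to be greedy (exact token match), and every name here (`draft_block`, `verify_block`, `generate`, `toy_target`, `toy_draft`) is hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
# A toy "model": given the context so far, return the next token (greedy).
NextToken = Callable[[List[Token]], Token]


def draft_block(draft: NextToken, context: List[Token], block_size: int) -> List[Token]:
    """Propose block_size tokens. A real block drafter would emit them in one
    pass; this stand-in loops only to keep the example small."""
    proposed: List[Token] = []
    for _ in range(block_size):
        proposed.append(draft(context + proposed))
    return proposed


def verify_block(target: NextToken, context: List[Token], proposed: List[Token]) -> List[Token]:
    """Keep the longest prefix the target agrees with, then add one token from
    the target itself, so each pass always yields at least one token.
    A real system scores all positions in a single batched forward pass."""
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction at the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # extra token when the whole block matched
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             max_new_tokens: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of target verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_block(draft, out, block_size)
        out.extend(verify_block(target, out, proposed))
        target_passes += 1
    return out[: len(prompt) + max_new_tokens], target_passes


def greedy(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Deliberately imperfect: guesses 0 whenever the context length is a multiple of 3.
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = generate(toy_target, toy_draft, [0], max_new_tokens=20)
    print(tokens == greedy(toy_target, [0], 20), passes)
```

The point of the sketch is the invariant: whatever the draft proposes, the final sequence equals plain greedy decoding from the target, while the number of target passes is smaller than the number of tokens produced.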

What makes DFlash faster than earlier attempts at the same idea is the way the helper model is informed. It gets a direct feed of the main model's internal signals — so its draft stays close to what the main model would have written anyway. More of the block gets accepted on the first try, fewer tokens need correction, and the whole thing resolves much faster.
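To see why a higher acceptance rate matters so much, here is a back-of-the-envelope model borrowed from the general speculative-decoding literature, not from the DFlash paper itself. It assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification; the function name and parameters are illustrative.

```python
def expected_tokens_per_pass(alpha: float, block_size: int) -> float:
    """Expected tokens produced per main-model verification pass.

    Assumes each drafted token is accepted independently with probability
    alpha, and that the main model always contributes one token of its own
    (the correction, or an extra token when the whole block is accepted).
    Closed form of the geometric sum: 1 + alpha + alpha**2 + ... + alpha**block_size.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if alpha == 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, expected_tokens_per_pass(alpha, block_size=8))
```

This counts tokens per verification pass only. It ignores the cost of running the helper model, so it is an upper bound on wall-clock speedup rather than a prediction of it.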

The numbers

On Qwen3-8B, the research paper reports an average 4.86x speedup, peaking at 6.08x, with lossless output. NVIDIA tested DFlash on a 120B model on Blackwell hardware and measured up to 15x throughput at the same interactivity level. DFlash also outpaced the previous best method in this category by roughly 2.5x. The gains hold across multiple model families: Qwen3 (8B and 27B), LLaMA 3.1, and Gemma 4 31B all showed significant improvements.

What this means if you're building with AI

If you run open AI models on your own servers, whether for a product, an internal tool, or a coding assistant, DFlash means the hardware you already have can serve substantially more users. If a 5x speedup holds at the batch sizes and prompt lengths you actually serve, it translates to roughly 5x more requests per server per minute. That changes the economics of running a product on open models versus paying for hosted APIs.
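The capacity arithmetic is simple enough to write down. The sketch below assumes the measured speedup applies uniformly to per-server request throughput, which is the optimistic case; the traffic figures in the usage lines are made-up placeholders, and `servers_needed` is a hypothetical helper.

```python
import math


def servers_needed(peak_requests_per_min: float,
                   baseline_requests_per_min_per_server: float,
                   speedup: float = 1.0) -> int:
    """Servers required to cover peak load, assuming the speedup scales
    per-server request throughput linearly (an optimistic assumption)."""
    if peak_requests_per_min < 0:
        raise ValueError("peak_requests_per_min must be non-negative")
    if baseline_requests_per_min_per_server <= 0 or speedup <= 0:
        raise ValueError("throughput and speedup must be positive")
    per_server = baseline_requests_per_min_per_server * speedup
    return math.ceil(peak_requests_per_min / per_server)


if __name__ == "__main__":
    # Placeholder workload: 1,000 requests/min at peak, 50 requests/min per server today.
    print(servers_needed(1000, 50))        # no acceleration
    print(servers_needed(1000, 50, 4.86))  # the paper's reported Qwen3-8B average
```

Rounding up matters: a fractional server is still a whole machine, so the saving in servers is usually a little smaller than the raw speedup suggests.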

DFlash is fully open source. The team published checkpoints on Hugging Face and built integrations into vLLM, the Transformers library, and TensorRT-LLM, three widely used frameworks for deploying open models. There are no proprietary dependencies and no license restrictions. This is not a benchmark announcement with a 12-month waitlist; it is shipping code you can run today.

What I'd actually do

If you're running Qwen3 or LLaMA 3.1 on your own infrastructure, pull the DFlash checkpoint and benchmark it against your typical workload before anything else. A genuine 5x speedup on a single model means five times as many users per server — that is a direct cost reduction worth measuring. If you're on hosted APIs like Claude or GPT, there's nothing to change right now, but techniques like DFlash are exactly what will make those APIs noticeably snappier over the next year or two as providers adopt them on their end.
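A benchmark like that does not need much code. Below is a minimal, framework-agnostic harness: you supply two callables that take a prompt and return the generated token ids, one for your current setup and one for the DFlash-enabled setup. How you construct those callables depends on your serving stack and is deliberately left out, since the exact configuration is not covered here. The names `measure_throughput` and `compare` are hypothetical.

```python
import time
from typing import Callable, Dict, List, Sequence

# prompt -> generated token ids; wrap your own serving stack in this shape.
GenerateFn = Callable[[str], List[int]]


def measure_throughput(generate: GenerateFn, prompts: Sequence[str],
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Generated tokens per second over the whole prompt set, run sequentially."""
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; use more or longer prompts")
    return total_tokens / elapsed


def compare(baseline: GenerateFn, candidate: GenerateFn, prompts: Sequence[str],
            clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Run both setups on the same prompts and report the throughput ratio."""
    base = measure_throughput(baseline, prompts, clock)
    cand = measure_throughput(candidate, prompts, clock)
    return {
        "baseline_tok_per_s": base,
        "candidate_tok_per_s": cand,
        "speedup": cand / base,
    }
```

Use prompts sampled from your real traffic rather than a public benchmark set, and run a warm-up pass first so that model loading and cache effects do not skew the comparison. This harness sends one request at a time; if your servers batch many concurrent requests, measure under that load as well, because the speedup may differ.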

#AI Infrastructure · #Open Source · #Speed


Author: Evgenii Arsentev, PhD · Chief Product Officer at a tech company


Source: marktechpost.com