Google's New AI Model Writes Text Many Times Faster

Google DeepMind released DiffusionGemma, an open Apache-2.0 model that writes text in blocks via diffusion, hitting 1,100+ tokens per second with 3.8B active params.

Published 2026-06-21 · 4 min read

Evgenii Arsentev · PhD

Google DeepMind has released DiffusionGemma-26B-A4B-it, an open multimodal model that abandons the standard token-by-token approach almost every chatbot uses today. Instead of predicting the next word, then the next, it generates whole blocks of text at once and refines them through diffusion-style denoising — the same broad idea behind image generators, applied to language. The payoff is raw speed: the model reaches over 1,100 tokens per second through parallel decoding, far beyond what a comparable autoregressive model manages.

Under the hood it's a mixture-of-experts design: 26 billion total parameters but only about 3.8 billion active at any step, with 8 of 128 experts firing per token. It takes text, images and video as input, handles a 256K context window, supports 35-plus languages, includes a reasoning mode, and ships under the permissive Apache 2.0 license. On benchmarks it posts solid numbers — 77.6% on MMLU Pro, 70.5% on MATH-Vision — while generally trailing the standard Gemma 4 line. The trade is deliberate: a little accuracy for a lot of throughput.
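To make the "8 of 128 experts" figure concrete, here is a minimal sketch of top-k expert routing in plain Python. It illustrates sparse mixture-of-experts gating in general and is not DiffusionGemma's actual router; the function name `route_top_k` and the random router scores are assumptions made up for the example.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer (figure from the article)
TOP_K = 8          # experts activated per token (figure from the article)


def route_top_k(scores, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    `scores` holds one router logit per expert. Returns a list of
    (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - max_score) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router logits for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_top_k(logits)

print(len(selected), "of", NUM_EXPERTS, "experts active")  # prints: 8 of 128 experts active
```

Only the selected experts run for that token, so the other 120 cost nothing in compute for that step. That is the mechanism that lets total and active parameter counts diverge.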

Why diffusion for text is a big deal

Autoregressive models are sequential by nature: each token waits on the one before it, which caps how fast a single response can be generated no matter how many GPUs you throw at the problem. Diffusion sidesteps that bottleneck by working on many positions in parallel and cleaning them up over a few passes. We've seen text-diffusion research for a while, but a first-party release from Google at this scale, with open weights and a real multimodal stack, is the strongest signal yet that the approach is leaving the lab.
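A rough way to see the difference is to count sequential forward passes. The sketch below is back-of-the-envelope arithmetic, not a description of DiffusionGemma's decoder: the block size and the number of denoising passes per block are hypothetical values I picked for illustration, and the function names are mine.

```python
import math


def autoregressive_passes(num_tokens):
    """One forward pass per generated token, each waiting on the previous one."""
    return num_tokens


def block_diffusion_passes(num_tokens, block_size, denoise_steps):
    """Blocks are produced one after another; each block takes a fixed number
    of denoising passes that update all of its positions at once."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


# Hypothetical settings, chosen only to make the arithmetic easy to follow.
tokens = 1024
ar = autoregressive_passes(tokens)
diff = block_diffusion_passes(tokens, block_size=64, denoise_steps=8)

print(ar, diff, ar / diff)  # prints: 1024 128 8.0
```

Each diffusion pass does more work than a single-token pass because it touches a whole block, so the pass count is not a direct speedup figure. It does show why the sequential chain gets shorter, which is what matters for latency.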

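The architecture also leans on sparsity: with only 3.8B parameters active per step, the compute per token is far lower than the 26B headline suggests, which is exactly what you want when speed is the whole point. The full 26B weights still have to be loaded, so the savings show up in compute rather than memory.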

Why it matters for you

If you build anything on top of language models, latency and cost are usually the two walls you hit first. A model that streams 1,100+ tokens per second changes what feels possible — live document analysis, OCR, interactive agents that don't make you wait. And because the weights are open under Apache 2.0, you can run it on your own hardware instead of metering every call against someone else's API.

My take: I wouldn't swap a top reasoning model for this on tasks where accuracy is everything — Gemma 4 still edges it. But for high-volume, latency-sensitive work where 'fast and good enough' beats 'slow and perfect,' DiffusionGemma is worth a serious look, if only to see how a non-autoregressive model behaves in your own pipeline.

What I'd actually do

Pull the weights from Hugging Face and benchmark it against your current model on your own workload — measure tokens per second and quality side by side. If your bottleneck is speed or per-call cost rather than peak accuracy, the diffusion approach may quietly pay for itself.
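For the speed half of that comparison, a tiny harness is enough to get started. This is a minimal sketch under my own assumptions: `measure_throughput` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that sleeps instead of calling a real model. Replace it with a wrapper around whatever inference call you actually use, as long as it returns the number of tokens produced.

```python
import time


def measure_throughput(generate, prompts):
    """Return (total_tokens, tokens_per_second) for `generate` over `prompts`.

    `generate` is any callable that takes a prompt string and returns the
    number of tokens it produced.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens, total_tokens / elapsed


def fake_generate(prompt):
    """Stand-in for a real model call: pretends to emit 50 tokens in about 10 ms."""
    time.sleep(0.01)
    return 50


tokens, tps = measure_throughput(fake_generate, ["first prompt", "second prompt"])
print(f"{tokens} tokens at {tps:.0f} tokens/s")
```

Run the same prompts through both models with the same harness, then score output quality separately on a sample of your own data. Throughput alone won't tell you whether the faster model is good enough.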

Tags: Google, open models, diffusion LLM, Gemma

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: huggingface.co