Skip to content
Carlos KiK
Go back

DiffusionGemma Is A Reminder That Token-By-Token Is Not Sacred

Most people experience language models as if text must be generated one token at a time.

That has become so normal that it feels like a law of nature.

It is not.

Google’s June 10 DiffusionGemma developer guide is useful because it points at a different shape of text generation. Instead of producing every token strictly left to right, DiffusionGemma starts with a block of placeholder tokens and refines the whole canvas through repeated denoising steps.

Built on the Gemma 4 backbone, the experimental model is a 26B mixture-of-experts system that activates 3.8B parameters during inference. Google says the approach can deliver up to 4x faster token generation on GPUs, with reported speeds above 700 tokens per second on an RTX 5090 and above 1000 tokens per second on a single H100.

The headline is speed.

The deeper idea is self-correction.

Autoregressive models get stuck

Traditional autoregressive models generate text from left to right.

That works well for many tasks, but it has a structural weakness: once a token is produced, the model is mostly committed to it. If a later constraint reveals that an earlier choice was wrong, the model has to work around the mistake instead of cleanly revising the whole answer.

DiffusionGemma changes that rhythm.

During denoising, every position in the current canvas can attend to the other positions. That gives the model a chance to resolve global constraints and correct earlier uncertainty before the block is finalized.

Google uses Sudoku as the demonstration because Sudoku is hostile to left-to-right guessing. A valid digit depends on constraints across rows, columns, and boxes. The model needs global pressure, not just fluent continuation.

That is why the architecture is interesting beyond the puzzle demo.

Local inference needs new shapes

The local AI story is often told as a model-size story: smaller models, quantization, better laptops, faster chips.

Those matter.

But inference architecture matters too.

Google says DiffusionGemma shifts the bottleneck from memory bandwidth toward compute by giving the GPU a larger parallel workload. The model also supports a block-autoregressive approach for longer sequences, where each denoised 256-token block is committed before the next block starts.

That is a practical compromise: parallel refinement where it helps, sequential stability where it is still needed.

The vLLM support matters because experimental architectures only become useful when developers can actually serve them. Google says DiffusionGemma can be deployed through vLLM’s OpenAI-compatible local server, which makes the experiment easier to test inside existing tools.

The model stack is still not settled

This is not a declaration that diffusion text models replace standard LLMs.

It is a reminder that the model stack is still in motion.

As agents and local workflows become more demanding, developers will care about latency, self-correction, controllability, serving cost, context behavior, and hardware fit. Token-by-token generation will remain important, but it is not the only possible path.

DiffusionGemma is interesting because it challenges the default assumption.

Maybe some tasks should be written like a sentence.

Maybe some should be solved like a canvas.

Source: Google Developers Blog


Share this post on:

Next Post
Claude Fable 5 Shows The Frontier Model Split Getting Real