GenAI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale.
Standard language models work like a typewriter: one token at a time, left to right, with no ability to revise a committed output. That pattern works in the cloud, where batch sizes keep GPUs saturated. For local inference or low-concurrency deployments, the GPU is idle most of the time.
Google’s DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source vLLM inference platform. It generates a 256-token block in parallel rather than sequentially, with every token position attending to every other. Google says DiffusionGemma generates text up to 4x faster than standard models on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. On H200, it hits 1,288 — roughly six times a standard autoregressive baseline, according to vLLM benchmark results published today.
Despite the speed gains, Google did not oversell the release. The company’s launch post acknowledged directly that DiffusionGemma’s overall output quality is lower than standard Gemma 4, adding “For applications that demand maximum quality, we recommend deploying standard Gemma 4.”
What DiffusionGemma does
DiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes over the entire block at once. On each pass, it evaluates every position and locks in the ones it is most confident about. Uncertain positions get randomized and reconsidered on the next pass, with the model using what it resolved in the previous round to inform the next attempt. The block converges progressively until enough positions stabilize to anchor the rest.
Two things follow from that architecture.
-
Self-correction. An autoregressive model that commits to a wrong token is stuck with it, because subsequent tokens are already conditioned on the mistake. DiffusionGemma can identify low-confidence positions and re-evaluate them on the next pass.
-
Bidirectional context. Every position attends to every other position in the block simultaneously, including tokens that appear later in the sequence. That makes the model structurally better suited to constrained generation tasks where left-to-right generation fails.
Google demonstrated both properties with a fine-tuned Sudoku solver. The base model solved zero puzzles. After fine-tuning on a Sudoku dataset, it reached an 80% success rate and converged in 12 denoising steps rather than 48. The efficiency gain came directly from the model’s ability to self-correct and stop early.
How it was built
DiffusionGemma runs as a 26B Mixture of Experts model that activates only 3.8B parameters during inference. Quantized, it fits within 18GB VRAM on consumer hardware including the Nvidia RTX 4090 and 5090. Google and NVIDIA also optimized for enterprise Hopper and Blackwell servers using NVFP4 kernels.
The vLLM integration required new work because DiffusionGemma does not fit the standard serving model. A typical vLLM batch applies the same attention type to every request. DiffusionGemma requests alternate between causal and bidirectional attention as they cycle through prompt reading, canvas refinement and block commit. The team built per-request attention switching into both the Triton and FlashAttention 4 backends and reused the existing speculative decoding path for the refinement loop.
The new ModelState interface the team built for this integration is designed to support additional diffusion models in vLLM as they emerge.
Where the speed wins and where it does not
DiffusionGemma’s speed advantage is real but conditional. Where it applies depends entirely on deployment context.
The numbers. At batch size 1 on a single H100, vLLM’s published benchmarks put the FP8 model at roughly five times a standard autoregressive baseline. On H200, roughly six times. Those peak figures reflect optimal conditions: single user, dedicated hardware, FP8 quantization.
Where it wins. Local inference, single-user applications and low-concurrency serving. In those conditions the GPU has spare compute and memory bandwidth is the bottleneck. DiffusionGemma’s parallel block generation fills that gap.
Where it does not. High-throughput cloud serving. When a server is batching hundreds of concurrent requests, autoregressive models already saturate available compute and DiffusionGemma’s parallel decoding provides diminishing returns.
The quality ceiling. Guilherme O’Tina, an AI researcher, put a finer point on it on X. “Local artifacts vs hallucinations are different problems and that decides where this actually wins,” O’Tina wrote.
How it compares
Diffusion language models are not new. Researchers have built them at smaller scales for several years, and Inception Labs’ Mercury Coder applied the approach commercially to coding tasks in 2025. What DiffusionGemma adds is scale — a 26B MoE backbone, native vLLM serving and a general-purpose instruction-tuned model rather than a domain-specific one.
The more useful comparison for engineers evaluating this against existing inference tooling is speculative decoding, and the distinction matters. Speculative decoding keeps a standard autoregressive target model and uses a smaller draft model to guess several tokens ahead. The target model verifies them in one pass. If sampling is correct, the output distribution stays identical to the target. The architecture is unchanged.
Andrew Kuncevich, an ML and AI researcher focused on production AI systems, put it directly on X. “DiffusionGemma is different. It does not just guess future tokens. It creates a noisy 256-token canvas and repeatedly denoises the whole block in parallel. So it’s not just a decoding trick — it’s a different generation paradigm,” Kuncevich wrote.
Compared to standard Gemma 4, the trade is speed for quality. Google’s benchmark data shows DiffusionGemma below standard Gemma 4 on general output quality metrics, with the gap varying by task.
On structured constrained tasks, including code infilling, template generation and problems requiring bidirectional constraint propagation, the architecture has a structural advantage that fine-tuning can surface, as the Sudoku result demonstrates. On open-ended generation, standard Gemma 4 remains the stronger option.
What this means for enterprises
DiffusionGemma serves via a standard vLLM OpenAI-compatible endpoint with no diffusion-specific pipeline changes required.
This is not a general-purpose model upgrade.
For teams running local or low-concurrency inference, the architecture choice just expanded. Until now, cutting generation latency on dedicated GPU hardware meant using a smaller model and accepting the quality trade-off. DiffusionGemma offers a third path at the same parameter footprint, on consumer hardware, with same-day vLLM support.
For constrained generation workloads, bidirectional attention is worth evaluating. Code infilling, structured data generation and tasks where correct output depends on context not yet generated are where this architecture has a structural edge.
The ModelState interface built for this integration is designed to generalize as additional diffusion models emerge.
The quality trade-off is real and Google acknowledges it. For teams running local inference on dedicated GPU hardware, this is worth testing.
