SoTA Feed — Every open-weights release from the labs that matter

diffusiongemma-26B-A4B-it

Jun 9, 2026 · Google · license: apache-2.0
52 GB · MoE: 26B total, 4B (≈7.9 GB) active

License: Apache 2.0 | Authors: Google DeepMind

DiffusionGemma is a generative model built by Google DeepMind. Based on the 26B A4B Mixture-of-Experts (MoE) Gemma 4 architecture, DiffusionGemma generates tokens using discrete diffusion. This open-weights model is multimodal, handling text, image, and video inputs to generate text output.

Built on a MoE foundation, DiffusionGemma is designed to improve generation speed (tokens per second) while remaining deployable across various hardware environments. DiffusionGemma builds upon the architectural and capability advancements of Gemma 4, introducing several core features:

Model Overview

DiffusionGemma is engineered to reduce the sequential bottlenecks of standard causal language models. It employs an encoder-decoder architecture specifically optimized for inference speed.

The encoder operates in a prefill capacity, processing the initial prompt and generating the KV cache. The decoder then utilizes bidirectional attention to process an input block (a 'canvas') of tokens, accessing the cached context via cross-attention.

During inference, DiffusionGemma leverages multi-canvas sampling. Rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler. Once a canvas is fully denoised, it is processed by the encoder and appended to the KV cache, after which the model generates the next canvas. This block-autoregressive approach facilitates text generation at higher speeds.
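The block-autoregressive loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not DiffusionGemma's sampler: `toy_denoise_step` is a hypothetical stand-in for the decoder that returns random tokens and confidences, `MASK` is an invented placeholder id, and the commit-most-confident rule is a simple heuristic rather than the Entropy Bound sampler mentioned later in this card.

```python
import random

MASK = -1  # hypothetical placeholder id for a position that is not yet denoised


def toy_denoise_step(context, canvas, rng):
    """Stand-in for the decoder: propose (token, confidence) for each masked slot.

    A real decoder would attend bidirectionally over the canvas and
    cross-attend to the cached context; here the values are random.
    """
    return {
        i: (rng.randrange(100), rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise_canvas(context, canvas_len, steps, rng):
    """Iteratively fill a fully masked canvas, committing a few slots per step."""
    canvas = [MASK] * canvas_len
    per_step = -(-canvas_len // steps)  # ceiling division: slots committed per step
    while MASK in canvas:
        proposals = toy_denoise_step(context, canvas, rng)
        # Commit the most confident proposals; the rest stay masked for now.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            canvas[i] = proposals[i][0]
    return canvas


def generate(prompt, max_new_tokens, canvas_len=256, steps=8, seed=0):
    """Generate canvas by canvas, appending each finished canvas to the context."""
    rng = random.Random(seed)
    context = list(prompt)  # plays the role of the encoder's KV cache
    n_canvases = -(-max_new_tokens // canvas_len)
    for _ in range(n_canvases):
        canvas = denoise_canvas(context, canvas_len, steps, rng)
        context.extend(canvas)  # the finished canvas becomes cached context
    return context[len(prompt):][:max_new_tokens]


tokens = generate(prompt=[1, 2, 3], max_new_tokens=512)
print(len(tokens))
```

The point of the sketch is the loop structure: the inner loop runs a fixed number of denoising steps over a whole canvas, so the number of sequential model calls scales with the number of canvases and steps rather than with the number of tokens.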

Property | DiffusionGemma
Total Parameters | 25.2B
Active Parameters | 3.8B
Layers | 30
Sliding Window | 1024 tokens
Context Length | Up to 256K tokens
Canvas Length | 256
Vocabulary Size | 262K
Expert Count | 8 active / 128 total and 1 shared
Supported Modalities | Text, Image
Vision Encoder Parameters | ~550M
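A few quantities follow directly from the figures in this table. The sketch below works them out; the 2 bytes per parameter is an assumption (16-bit weights), not something the card states, so the resulting sizes are estimates only.

```python
import math

# Figures copied from the table above
TOTAL_PARAMS = 25.2e9
ACTIVE_PARAMS = 3.8e9
CANVAS_LEN = 256
ACTIVE_EXPERTS, TOTAL_EXPERTS = 8, 128

BYTES_PER_PARAM = 2  # assumption: 16-bit weights

# Share of parameters used per forward pass
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Share of routed experts selected per token (the shared expert is always on)
routed_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS

# Rough weight footprints in GB under the 16-bit assumption
weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9


def canvases_needed(max_new_tokens, canvas_len=CANVAS_LEN):
    """Number of canvases the model must denoise to emit max_new_tokens."""
    return math.ceil(max_new_tokens / canvas_len)


print(f"active parameter fraction: {active_fraction:.3f}")
print(f"routed expert fraction:    {routed_fraction:.4f}")
print(f"estimated weights:         {weights_gb:.1f} GB total, {active_gb:.1f} GB active")
print(f"canvases for 512 tokens:   {canvases_needed(512)}")
```

Under that assumption the estimates come out close to the 52 GB and ≈7.9 GB figures in the listing header; the source does not explain the remaining difference.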

Benchmark Results

These models were evaluated on a broad set of datasets and metrics covering different aspects of text generation. Results in the table are for instruction-tuned models, obtained with the recommended Entropy Bound (EB) sampler (see Best Practices below).

Benchmark | DiffusionGemma 26B A4B | Gemma 4 26B A4B
MMLU Pro | 77.6% | 82.6%
AIME 2026 no tools | 69.1% | 88.3%
LiveCodeBench v6 | 69.1% | 77.1%
Codeforces ELO | 1429 | 1718
GPQA Diamond | 73.2% | 82.3%
Tau2 (average over 3) | 56.2% | 68.2%
HLE no tools | 11.0% | 8.7%
HLE with search | 11.9% | 17.2%
BigBench Extra Hard | 47.6% | 64.8%
MMMLU | 81.5% | 86.3%
Vision
MMMU Pro | 54.3% | 73.8%
OmniDocBench 1.5 (average edit distance, lower is better) | 0.319 | 0.149
MATH-Vision | 70.5% | 82.4%
MedXPertQA MM | 49.0% | 58.1%
Long Context
MRCR v2 8 needle 128k (average) | 32.0% | 44.1%
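To read the table as a speed-versus-quality trade-off, it can help to compute the per-benchmark gap between the two models. The sketch below does this for a handful of rows; the scores are copied from the table, and the selection of rows is arbitrary.

```python
# (DiffusionGemma 26B A4B, Gemma 4 26B A4B) scores in percent, from the table above
scores = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026 no tools": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
    "HLE no tools": (11.0, 8.7),
    "MMMU Pro": (54.3, 73.8),
}

# Positive gap: DiffusionGemma scores higher; negative: Gemma 4 scores higher
gaps = {
    name: round(diffusion - autoregressive, 1)
    for name, (diffusion, autoregressive) in scores.items()
}

# Print from the largest deficit to the largest lead
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<22} {gap:+.1f} points")
```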

Core Capabilities

DiffusionGemma handles a broad range of tasks across text and vision. Key capabilities include:

Getting Started

You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:

pip install -U transformers torch accelerate

Once you have everything installed, you can proceed to load the model with the code below:

from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

Once the model is loaded, you can start generating output:

# Prompt
message = [
    {"role": "user", "content": "Why is the sky blue?"}
]

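# Process input: the chat template returns a dict of tensors, not just input IDs
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate up to 512 new tokens
output = model.generate(**inputs, max_new_tokens=512)

# Parse output (special tokens are kept so control tokens remain visible)
text = processor.decode(output[0], skip_special_tokens=False)
print(text)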
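Since the model is aimed at generation speed, a quick way to check throughput on your own hardware is to time one generation call. The helper below is a minimal sketch, not part of Transformers: `tokens_per_second` and `fake_generate` are hypothetical names, and it assumes the returned sequence is the prompt followed by the new tokens, which you should verify for this model.

```python
import time


def tokens_per_second(generate_fn, prompt_len):
    """Time one generation call and return new tokens per second.

    generate_fn: zero-argument callable returning the full output sequence
                 (assumed to be the prompt tokens followed by the new tokens).
    prompt_len:  number of prompt tokens to exclude from the count.
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_len
    return new_tokens / max(elapsed, 1e-9)  # guard against a zero interval


# Stand-in for a real call; replace with a lambda that runs model.generate(...)
def fake_generate():
    time.sleep(0.01)  # pretend the model is working
    return list(range(10 + 512))  # 10 prompt tokens + 512 new tokens


rate = tokens_per_second(fake_generate, prompt_len=10)
print(f"{rate:.0f} tokens/s")
```

On a GPU, kernels run asynchronously, so synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, and discard the first run as warm-up.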

Best Practices

For the best performance, use these configurations and best practices:

1. Diffusion Sampling Settings

Use the following standardized sampling configuration across all use cases:

2. Thinking Mode Configuration

Similar to Gemma 4 models, we use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:

Note: Many libraries, such as Transformers, handle the complexities of the chat template for you.

3. Multi-Turn Conversations

4. Modality order

5. Variable Image Resolution

Aside from variable aspect ratios, DiffusionGemma supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

6. Video Length

The model supports image inputs and processes video as a sequence of frames. Video input is limited to 60 seconds, assuming frames are sampled at one frame per second (that is, at most 60 frames).
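The frame limit and the visual token budget together determine how many tokens a video adds to the prompt. The sketch below works through that arithmetic; `frame_timestamps` and `visual_tokens` are hypothetical helpers, and the per-frame budget passed in is a made-up example value, since this card does not list the supported budgets.

```python
MAX_VIDEO_SECONDS = 60  # limit stated above
FRAMES_PER_SECOND = 1   # sampling rate assumed by that limit


def frame_timestamps(duration_s, fps=FRAMES_PER_SECOND, max_seconds=MAX_VIDEO_SECONDS):
    """Timestamps (in seconds) of the frames to extract, truncated to the limit."""
    usable = min(duration_s, max_seconds)
    n_frames = int(usable * fps)
    return [i / fps for i in range(n_frames)]


def visual_tokens(n_frames, tokens_per_frame):
    """Total visual tokens if every frame uses the same token budget."""
    return n_frames * tokens_per_frame


stamps = frame_timestamps(duration_s=90)  # a 90 s clip is cut to the first 60 s
example_budget = 100                      # hypothetical per-frame budget
print(len(stamps), visual_tokens(len(stamps), example_budget))
```

Because the total grows linearly with the per-frame budget, a lower budget is the main lever for keeping long clips cheap.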

Model Data

Data used for model training and how the data was processed.

Training Dataset

Our pre-training dataset is a large-scale, diverse collection of data spanning a wide range of domains and modalities, including web documents, code, images, and audio, with a cutoff date of January 2025. Here are the key components:

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

Ethics and Safety

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, DiffusionGemma undergoes the same rigorous safety evaluations as our proprietary Gemini models.

Evaluation Approach

DiffusionGemma was developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google’s AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:

Evaluation Results

Across all areas of safety testing, we saw major improvements in every content safety category relative to previous generations of Gemma models. Overall, DiffusionGemma, like the Gemma 4 models, is significantly safer than the Gemma 3 and 3n models while keeping unjustified refusals low. All testing was intentionally conducted without safety filters to evaluate the model’s raw capabilities and baseline behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models.

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

Limitations

Ethical Considerations and Risks

In creating an open, vision-language model, we have carefully considered the following:

Risks identified and mitigations:

Benefits

At the time of release, this is a low-latency, high-performance open vision-language model that provides a compelling option for developers and those interested in researching diffusion language models. The model is designed from the ground up for responsible AI development compared to similarly sized models.
