SoTA Feed — Every open-weights release from the labs that matter

Ad: Read SoTA Feed without this slot — ad-free site plus a personal ad-free feed URL $3/month

MiMo-V2.5-Pro-FP4-DFlash

Jun 8, 2026 · Xiaomi MiMo · license: mit · view on Hugging Face ↗
570 GB · MoE: 1.02T total, 42B (≈23 GB) active · FP4



Xiaomi-MiMo

🤗 HuggingFace  |  📰 Blog
🎨 Xiaomi MiMo API Platform (Request Access)  |  🗨️ Xiaomi MiMo Studio (Free Trial)

Community
WeChat Group  |  Discord  |  Telegram  |  Reddit

MiMo-V2.5-Pro-FP4-DFlash

MiMo-V2.5-Pro-FP4-DFlash is the underlying model that powers MiMo-V2.5-Pro-UltraSpeed:

Together they cut both the per-parameter bit width and the number of backbone forward passes, the two dominant costs of trillion-parameter decoding.

1. Introduction

At the trillion-parameter (1T) scale, even 8-bit (FP8/INT8) inference carries severe memory-footprint and memory-bandwidth costs. Lowering the parameter bit width translates directly into faster decoding. We therefore adopt FP4 quantization and block-diffusion speculative decoding. Key features of this release:

2. FP4 Quantization

We quantize only the MoE experts to MXFP4 (block size 32) and keep attention projections and other modules at higher precision (the attention o_proj of every layer is excluded from FP4). With FP4 QAT, quality stays close to the FP8 baseline:

fp4 compare
BenchmarkMiMo-V2.5-Pro-FP8MiMo-V2.5-Pro-MXFP4Δ
General Agent
Claw-Eval (pass^3)63.867.8+6.27%
Humanity's Last Exam48.047.0-2.08%
Humanity's Last Exam (without tool)34.033.0-2.94%
Code Agent
SWE-Bench Pro57.258.8+2.80%
SWE-bench Verified78.977.4-1.90%

3. Block-Diffusion Speculative Decoding (DFlash)

Conventional speculative decoding relies on a small draft model to guess the next tokens, which the large model then verifies; the rejection-sampling verification keeps the output lossless. Its bottleneck is that draft quality bounds the acceptance rate, while a stronger draft costs more compute.

To break this trade-off we adopt the block-level masked parallel-prediction approach DFlash: the draft fills an entire block of masked positions in one forward pass. We landed this on MiMo-V2.5-Pro with custom optimizations for trillion-scale MoE and long-context serving, using the Muon second-order optimizer and model self-distillation so that even a small mask block keeps a strong acceptance rate while pushing the draft-stage cost close to its limit:

In practice, we further cap the mask block size at 8 to lower verification overhead and raise concurrency.

ScenarioAcceptance Length
WebDev6.30
Math5005.56
HumanEval4.54
MT-Bench3.18
SWE-Bench4.29

4. Model Summary

ComponentBackboneDFlash Drafter
ArchitectureMiMoV2ForCausalLMDFlashDraftModel
Total / Active Params1.02T / 42B5-layer draft
Hidden Size61446144
Num Layers705
Num Attention Heads128128
Num KV Heads8 (GQA)8 (GQA)
Head Dim (QK / V)192 / 128128 / 128
SWA Window Size1281024
Block Size8
Captured Backbone Layers[0, 15, 31, 47, 69]
Backbone RoPE Base5,000,0005,000,000
PrecisionMXFP4 (experts) MixedBF16
Max Context Length1M

5. Deployment

DFlash inference with the FP4 backbone is supported in SGLang. The drafter is launched alongside the backbone via the speculative-decoding flags and inherits the backbone's tensor/expert-parallel topology.

SGLang Deployment

The following is an example of running the model with SGLang. Point --model at this repository and --speculative-draft-model-path at its dflash/ subdirectory.

python3 -m sglang.launch_server \
    --model MiMo-V2.5-Pro-FP4-DFlash \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path MiMo-V2.5-Pro-FP4-DFlash/dflash \
    --speculative-num-draft-tokens 8 \
    --ep-size 16 \
    --tensor-parallel-size 16 \
    --data-parallel-size 2 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --quantization fp8 \
    --attention-backend fa3 \
    --moe-dense-tp-size 1 \
    --dtype bfloat16 \
    --mem-fraction-static 0.65 \
    --context-length 65536 \
    --page-size 1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --dist-init-addr ${MASTER_ADDR}:20000 \
    --nnodes ${WORLD_SIZE} \
    --node-rank ${RANK} \
    --host 0.0.0.0 \
    --port 29999

Citation

@misc{mimo2026v25pro_fp4dflash,
  title={MiMo-V2.5-Pro-FP4-DFlash},
  author={{Xiaomi MiMo Team}},
  year={2026},
  howpublished={\url{https://huggingface.co/collections/XiaomiMiMo/mimo-v25}},
}

Contact

For questions or feedback, reach us at mimo@xiaomi.com or join our community:

← all releases