SoTA Feed — Every open-weights release from the labs that matter


MiMo-V2-Flash

Dec 16, 2025 · Xiaomi MiMo · license: mit
313 GB · MoE: 310B total, 15B (≈15.2 GB) active




MiMo-V2-Flash

MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.


1. Introduction

MiMo-V2-Flash strikes a new balance between long-context modeling capability and inference efficiency.


2. Model Downloads

| Model | Total Params | Active Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| MiMo-V2-Flash-Base | 309B | 15B | 256k | 🤗 HuggingFace |
| MiMo-V2-Flash | 309B | 15B | 256k | 🤗 HuggingFace |

[!IMPORTANT] We also open-source the 3-layer MTP weights to foster community research.


3. Evaluation Results

Base Model Evaluation

MiMo-V2-Flash-Base demonstrates strong performance across standard benchmarks, surpassing models with significantly larger parameter counts.

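| Category | Benchmark | Setting/Length | MiMo-V2-Flash Base | Kimi-K2 Base | DeepSeek-V3.1 Base | DeepSeek-V3.2 Exp Base |
| --- | --- | --- | --- | --- | --- | --- |
| Params | #Activated / #Total | - | 15B / 309B | 32B / 1043B | 37B / 671B | 37B / 671B |
| General | BBH | 3-shot | 88.5 | 88.7 | 88.2 | 88.7 |
| | MMLU | 5-shot | 86.7 | 87.8 | 87.4 | 87.8 |
| | MMLU-Redux | 5-shot | 90.6 | 90.2 | 90.0 | 90.4 |
| | MMLU-Pro | 5-shot | 73.2 | 69.2 | 58.8 | 62.1 |
| | DROP | 3-shot | 84.7 | 83.6 | 86.3 | 86.6 |
| | ARC-Challenge | 25-shot | 95.9 | 96.2 | 95.6 | 95.5 |
| | HellaSwag | 10-shot | 88.5 | 94.6 | 89.2 | 89.4 |
| | WinoGrande | 5-shot | 83.8 | 85.3 | 85.9 | 85.6 |
| | TriviaQA | 5-shot | 80.3 | 85.1 | 83.5 | 83.9 |
| | GPQA-Diamond | 5-shot | 55.1 | 48.1 | 51.0 | 52.0 |
| | SuperGPQA | 5-shot | 41.1 | 44.7 | 42.3 | 43.6 |
| | SimpleQA | 5-shot | 20.6 | 35.3 | 26.3 | 27.0 |
| Math | GSM8K | 8-shot | 92.3 | 92.1 | 91.4 | 91.1 |
| | MATH | 4-shot | 71.0 | 70.2 | 62.6 | 62.5 |
| | AIME 24&25 | 2-shot | 35.3 | 31.6 | 21.6 | 24.8 |
| Code | HumanEval+ | 1-shot | 70.7 | 84.8 | 64.6 | 67.7 |
| | MBPP+ | 3-shot | 71.4 | 73.8 | 72.2 | 69.8 |
| | CRUXEval-I | 1-shot | 67.5 | 74.0 | 62.1 | 63.9 |
| | CRUXEval-O | 1-shot | 79.1 | 83.5 | 76.4 | 74.9 |
| | MultiPL-E HumanEval | 0-shot | 59.5 | 60.5 | 45.9 | 45.7 |
| | MultiPL-E MBPP | 0-shot | 56.7 | 58.8 | 52.5 | 50.6 |
| | BigCodeBench | 0-shot | 70.1 | 61.7 | 63.0 | 62.9 |
| | LiveCodeBench v6 | 1-shot | 30.8 | 26.3 | 24.8 | 24.9 |
| | SWE-Bench (AgentLess) | 3-shot | 30.8 | 28.2 | 24.8 | 9.4* |
| Chinese | C-Eval | 5-shot | 87.9 | 92.5 | 90.0 | 91.0 |
| | CMMLU | 5-shot | 87.4 | 90.9 | 88.8 | 88.9 |
| | C-SimpleQA | 5-shot | 61.5 | 77.6 | 70.9 | 68.0 |
| Multilingual | GlobalMMLU | 5-shot | 76.6 | 80.7 | 81.9 | 82.0 |
| | INCLUDE | 5-shot | 71.4 | 75.3 | 77.2 | 77.2 |
| Long Context | NIAH-Multi | 32K | 99.3 | 99.8 | 99.7 | 85.6* |
| | | 64K | 99.9 | 100.0 | 98.6 | 85.9* |
| | | 128K | 98.6 | 99.5 | 97.2 | 94.3* |
| | | 256K | 96.7 | - | - | - |
| | GSM-Infinite Hard | 16K | 37.7 | 34.6 | 41.5 | 50.4 |
| | | 32K | 33.7 | 26.1 | 38.8 | 45.2 |
| | | 64K | 31.5 | 16.0 | 34.7 | 32.6 |
| | | 128K | 29.0 | 8.8 | 28.7 | 25.7 |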

\* indicates the model may fail to follow the prompt or format.

Post-training Model Evaluation

Following our post-training paradigm with MOPD and agentic RL, the model reaches reasoning and agentic performance comparable to leading open-weights models such as Kimi-K2 Thinking and DeepSeek-V3.2 Thinking.

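| Benchmark | MiMo-V2 Flash | Kimi-K2 Thinking | DeepSeek-V3.2 Thinking | Gemini-3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
| --- | --- | --- | --- | --- | --- | --- |
| **Reasoning** | | | | | | |
| MMLU-Pro | 84.9 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 |
| GPQA-Diamond | 83.7 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 |
| HLE (no tools) | 22.1 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 |
| AIME 2025 | 94.1 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 |
| HMMT Feb. 2025 | 84.4 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 |
| LiveCodeBench-v6 | 80.6 | 83.1 | 83.3 | 90.7 | 64.0 | 84.5 |
| **General Writing** | | | | | | |
| Arena-Hard (Hard Prompt) | 54.1 | 71.9 | 53.4 | 72.6 | 63.3 | 71.9 |
| Arena-Hard (Creative Writing) | 86.2 | 80.1 | 88.8 | 93.6 | 76.7 | 92.2 |
| **Long Context** | | | | | | |
| LongBench V2 | 60.6 | 45.1 | 58.4 | 65.6 | 61.8 | - |
| MRCR | 45.7 | 44.2 | 55.5 | 89.7 | 55.4 | - |
| **Code Agent** | | | | | | |
| SWE-Bench Verified | 73.4 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 |
| SWE-Bench Multilingual | 71.7 | 61.1 | 70.2 | - | 68.0 | 55.3 |
| Terminal-Bench Hard | 30.5 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 |
| Terminal-Bench 2.0 | 38.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 |
| **General Agent** | | | | | | |
| BrowseComp | 45.4 | - | 51.4 | - | 24.1 | 54.9 |
| BrowseComp (w/ Context Manage) | 58.3 | 60.2 | 67.6 | 59.2 | - | - |
| $\tau^2$-Bench | 80.3 | 74.3 | 80.3 | 85.4 | 84.7 | 80.2 |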

4. Model Architecture

Hybrid Sliding Window Attention

MiMo-V2-Flash addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA).
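
To make the interleaving concrete, here is a minimal sketch of the two attention masks and of one possible layer schedule. It is an illustration, not the released implementation: the window size, the layer count, and the ratio of SWA to GA layers are hypothetical values, since this section does not state them.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True when query position i may attend to key position j.

    window=None gives full causal (global) attention. An integer gives a causal
    sliding window covering the current token and the window - 1 tokens before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_pattern(num_layers, swa_per_ga):
    """Interleave layers: swa_per_ga sliding-window layers, then one global layer."""
    return ["GA" if (k + 1) % (swa_per_ga + 1) == 0 else "SWA" for k in range(num_layers)]


def attended_pairs(mask):
    """Count the query-key pairs that a layer with this mask has to score."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 8  # toy sequence length (hypothetical)
    print(layer_pattern(num_layers=6, swa_per_ga=2))
    print("global pairs:", attended_pairs(causal_mask(n)))
    print("window pairs:", attended_pairs(causal_mask(n, window=3)))
```

With a fixed window, the number of scored pairs grows roughly linearly with sequence length, while a global layer grows quadratically. Keeping only some layers global is what limits the long-context cost.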

Lightweight Multi-Token Prediction (MTP)

Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.
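
At inference time, drafted tokens still have to be checked against the main model. The sketch below shows generic greedy verification as used in speculative decoding in general. It is an assumption about how drafted tokens could be accepted, not a description of the SGLang or MiMo code path, and the function name is hypothetical.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy verification of drafted tokens.

    draft_tokens: tokens proposed by the draft (MTP) heads for the next k positions.
    target_tokens: the main model's greedy choice at those k positions plus one
        more, obtained from a single verification pass over the draft.
    Returns the tokens emitted this step: the longest agreeing prefix of the
    draft, followed by one token from the main model.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one position beyond the draft")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # The main model's own token at the first unverified position is always kept,
    # so every step emits at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Two of three drafted tokens agree, so three tokens are emitted in one step.
    print(accept_draft([5, 7, 9], [5, 7, 2, 4]))
```

With three drafted tokens, a step emits between one and four tokens, which is where the decoding speed-up comes from.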


5. Post-Training Technical Highlights

MiMo-V2-Flash leverages a post-training pipeline designed to maximize reasoning and agentic capabilities through innovative distillation and reinforcement learning strategies.

5.1 Multi-Teacher On-Policy Distillation (MOPD)

We introduce Multi-Teacher On-Policy Distillation (MOPD), a new paradigm that formulates knowledge distillation as a reinforcement learning process.
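
One generic way to cast distillation as reinforcement learning is to sample responses from the student and reward each sampled token with the teacher-to-student log-probability ratio. Maximizing the expected sum of that reward minimizes the reverse KL divergence from student to teacher on the student's own samples. The sketch below illustrates this general idea only. It is not confirmed to be the objective used in the technical report, and the per-domain teacher routing and all names are hypothetical.

```python
def token_rewards(student_logprobs, teacher_logprobs):
    """Per-token reward for a student-sampled sequence: log p_teacher - log p_student."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def sequence_reward(student_logprobs, teacher_logprobs):
    """Sum of token rewards, a one-sample estimate of the negative reverse KL."""
    return sum(token_rewards(student_logprobs, teacher_logprobs))


def pick_teacher(domain, teachers, default="general"):
    """Route a prompt to the teacher registered for its domain (hypothetical scheme)."""
    return teachers.get(domain, teachers[default])


if __name__ == "__main__":
    teachers = {"math": "math-teacher", "general": "general-teacher"}
    print(pick_teacher("math", teachers))
    # The second token is one the teacher finds more likely than the student does,
    # so it receives a positive reward.
    print(token_rewards([-1.0, -2.0], [-1.0, -0.5]))
```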

5.2 Scaling Agentic RL

We significantly scale up the agentic training environments to improve intelligence and generalization.

5.3 Advanced RL Infrastructure

To support high-throughput RL training for large-scale MoE models, we implemented several infrastructure optimizations on top of SGLang and Megatron-LM.


6. Inference & Deployment

MiMo-V2-Flash supports FP8 mixed precision inference. We recommend using SGLang for optimal performance.

Quick Start with SGLang

```bash
pip install sglang

# Launch server
python3 -m sglang.launch_server \
        --model-path XiaomiMiMo/MiMo-V2-Flash \
        --served-model-name mimo-v2-flash \
        --pp-size 1 \
        --dp-size 2 \
        --enable-dp-attention \
        --tp-size 8 \
        --moe-a2a-backend deepep \
        --page-size 1 \
        --host 0.0.0.0 \
        --port 9001 \
        --trust-remote-code \
        --mem-fraction-static 0.75 \
        --max-running-requests 128 \
        --chunked-prefill-size 16384 \
        --reasoning-parser qwen3 \
        --tool-call-parser mimo \
        --context-length 262144 \
        --attention-backend fa3 \
        --speculative-algorithm EAGLE \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --enable-mtp

# Send request
curl -i http://localhost:9001/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d  '{
            "messages" : [{
                "role": "user",
                "content": "Nice to meet you MiMo"
            }],
            "model": "mimo-v2-flash",
            "max_tokens": 4096,
            "temperature": 0.8,
            "top_p": 0.95,
            "stream": true,
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }'
```
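
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl body. It assumes the server started above is reachable at localhost:9001, that the stream arrives as OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), and that reasoning text is delivered in a `reasoning_content` delta field. The helper names are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:9001/v1/chat/completions"  # host and port from the launch command


def build_payload(prompt, temperature=0.8, thinking=True):
    """Mirror the JSON body of the curl example."""
    return {
        "model": "mimo-v2-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,
        "temperature": temperature,
        "top_p": 0.95,
        "stream": True,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def collect_stream(lines):
    """Accumulate reasoning and answer text from an iterable of SSE lines (str)."""
    reasoning, answer = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning_content") or "")
            answer.append(delta.get("content") or "")
    return "".join(reasoning), "".join(answer)


def chat(prompt):
    """Send one streaming request and return (reasoning, answer)."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return collect_stream(raw.decode("utf-8") for raw in response)
```

Calling `chat("Nice to meet you MiMo")` requires the server to be running. `collect_stream` can be exercised offline with recorded lines.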

Inference with KTransformers (CPU Offloading)

KTransformers, built on top of SGLang, enables efficient MiMo-V2-Flash deployment on consumer-grade hardware by offloading MoE expert computations to the CPU. On 4× RTX 5090 GPUs with 2× AMD EPYC 9355 CPUs, it reaches a decode speed of up to 35.7 tokens/s.

For quick start and benchmarks, visit KTransformers.

Notifications

1. System prompt

[!IMPORTANT] The following system prompts are highly recommended. Choose either the English or the Chinese version.

English

You are MiMo, an AI assistant developed by Xiaomi.

Today's date: {date} {week}. Your knowledge cutoff date is December 2024.

Chinese

你是MiMo(中文名称也是MiMo),是小米公司研发的AI智能助手。

今天的日期:{date} {week},你的知识截止日期是2024年12月。
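
The `{date}` and `{week}` placeholders have to be filled before the prompt is sent. A minimal sketch for the English version follows. The exact formats are an assumption (ISO date and English weekday name), since the prompts above do not fix them, and the function name is hypothetical.

```python
from datetime import date

EN_TEMPLATE = (
    "You are MiMo, an AI assistant developed by Xiaomi.\n\n"
    "Today's date: {date} {week}. Your knowledge cutoff date is December 2024."
)

# Spelled out explicitly so the result does not depend on the process locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def system_prompt(today):
    """Fill the {date} and {week} placeholders of the English prompt."""
    # Assumed formats: ISO 8601 date, English weekday name.
    return EN_TEMPLATE.format(date=today.isoformat(), week=WEEKDAYS[today.weekday()])


if __name__ == "__main__":
    print(system_prompt(date.today()))
```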

2. Sampling parameters

[!IMPORTANT] Recommended sampling parameters:

top_p=0.95

temperature=0.8 for math, writing, web-dev

temperature=0.3 for agentic tasks (e.g., vibe-coding, tool-use)
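
These recommendations can be kept in one place on the client side. The sketch below is only a convenience wrapper: the task labels used as keys are hypothetical names for the categories listed above.

```python
# Temperatures from the recommendations above. The keys are hypothetical labels.
RECOMMENDED_TEMPERATURE = {
    "math": 0.8,
    "writing": 0.8,
    "web-dev": 0.8,
    "agentic": 0.3,  # vibe-coding, tool-use
}


def sampling_params(task):
    """Return the recommended top_p and temperature for a task label."""
    if task not in RECOMMENDED_TEMPERATURE:
        raise KeyError(f"no recommendation recorded for task {task!r}")
    return {"top_p": 0.95, "temperature": RECOMMENDED_TEMPERATURE[task]}


if __name__ == "__main__":
    print(sampling_params("agentic"))
```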

3. Tool-use practice

[!IMPORTANT] In thinking mode with multi-turn tool calls, the model returns a reasoning_content field alongside tool_calls. To continue the conversation, the client must include the reasoning_content from every earlier turn in the messages array of each subsequent request.
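
A minimal sketch of that bookkeeping, assuming OpenAI-style message dictionaries (`role`, `content`, `tool_calls`, `tool_call_id`). The function name and the shape of `tool_results` are hypothetical.

```python
def append_tool_turn(messages, assistant_message, tool_results):
    """Extend the history after a tool-calling turn, keeping reasoning_content.

    messages: the history sent with the previous request.
    assistant_message: the `message` object from the previous response, expected
        to carry `reasoning_content` and `tool_calls`.
    tool_results: mapping from tool call id to the tool's output string.
    Returns a new list and leaves `messages` unmodified.
    """
    kept = {
        "role": "assistant",
        "content": assistant_message.get("content") or "",
        # Dropping this field is the mistake the note above warns about.
        "reasoning_content": assistant_message.get("reasoning_content") or "",
        "tool_calls": assistant_message.get("tool_calls", []),
    }
    history = list(messages) + [kept]
    for call in kept["tool_calls"]:
        history.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": tool_results[call["id"]],
        })
    return history


if __name__ == "__main__":
    previous = [{"role": "user", "content": "Weather in Beijing?"}]
    reply = {
        "content": None,
        "reasoning_content": "I should call the weather tool.",
        "tool_calls": [{"id": "call_1", "type": "function",
                        "function": {"name": "get_weather", "arguments": "{}"}}],
    }
    for message in append_tool_turn(previous, reply, {"call_1": "Sunny, 3 C"}):
        print(message["role"])
```

The returned list is what goes into `messages` on the next request, so the reasoning from each earlier turn travels with the conversation.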


7. Citation

If you find our work helpful, please cite our technical report:

```bibtex
@misc{mimo2025flash,
  title={MiMo-V2-Flash Technical Report},
  author={LLM-Core Xiaomi},
  year={2025},
  url={https://github.com/XiaomiMiMo/MiMo-V2-Flash/paper.pdf}
}
```

8. Contact

If you have any questions, contact us at mimo@xiaomi.com, join our WeChat group, or open an issue.
