Dec 16, 2025 · Xiaomi MiMo · license: mit · view on Hugging Face ↗
313 GB · MoE: 310B total, 15B (≈15.2 GB) active

| 🤗 HuggingFace | 📔 Technical Report | 📰 Blog |

Play around! 🗨️ Xiaomi MiMo Studio 🎨 Xiaomi MiMo API Platform

MiMo-V2-Flash

MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.

1. Introduction

MiMo-V2-Flash creates a new balance between long-context modeling capability and inference efficiency. Key features include:

Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) with a 5:1 ratio and an aggressive 128-token window. This reduces KV-cache storage by nearly 6x while maintaining long-context performance via learnable attention sink bias.
Multi-Token Prediction (MTP): Equipped with a lightweight MTP module (0.33B params/block) using dense FFNs. This triples output speed during inference and will be good to accelerates rollout in RL training.
Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and native 32k seq length. The context window supports up to 256k length.
Agentic Capabilities: Post-training utilizes Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic RL, achieving superior performance on SWE-Bench and complex reasoning tasks.

2. Model Downloads

Model	Total Params	Active Params	Context Length	Download
MiMo-V2-Flash-Base	309B	15B	256k	🤗 HuggingFace
MiMo-V2-Flash	309B	15B	256k	🤗 HuggingFace

[!IMPORTANT] We also open-source the 3-layer MTP weights to foster community research.

3. Evaluation Results

Base Model Evaluation

MiMo-V2-Flash-Base demonstrates strong performance across standard benchmarks, surpassing models with significantly larger parameter counts.

Category	Benchmark	Setting/Length	MiMo-V2-Flash Base	Kimi-K2 Base	DeepSeek-V3.1 Base	DeepSeek-V3.2 Exp Base
Params	#Activated / #Total	-	15B / 309B	32B / 1043B	37B / 671B	37B / 671B
General	BBH	3-shot	88.5	88.7	88.2	88.7
	MMLU	5-shot	86.7	87.8	87.4	87.8
	MMLU-Redux	5-shot	90.6	90.2	90.0	90.4
	MMLU-Pro	5-shot	73.2	69.2	58.8	62.1
	DROP	3-shot	84.7	83.6	86.3	86.6
	ARC-Challenge	25-shot	95.9	96.2	95.6	95.5
	HellaSwag	10-shot	88.5	94.6	89.2	89.4
	WinoGrande	5-shot	83.8	85.3	85.9	85.6
	TriviaQA	5-shot	80.3	85.1	83.5	83.9
	GPQA-Diamond	5-shot	55.1	48.1	51.0	52.0
	SuperGPQA	5-shot	41.1	44.7	42.3	43.6
	SimpleQA	5-shot	20.6	35.3	26.3	27.0
Math	GSM8K	8-shot	92.3	92.1	91.4	91.1
	MATH	4-shot	71.0	70.2	62.6	62.5
	AIME 24&25	2-shot	35.3	31.6	21.6	24.8
Code	HumanEval+	1-shot	70.7	84.8	64.6	67.7
	MBPP+	3-shot	71.4	73.8	72.2	69.8
	CRUXEval-I	1-shot	67.5	74.0	62.1	63.9
	CRUXEval-O	1-shot	79.1	83.5	76.4	74.9
	MultiPL-E HumanEval	0-shot	59.5	60.5	45.9	45.7
	MultiPL-E MBPP	0-shot	56.7	58.8	52.5	50.6
	BigCodeBench	0-shot	70.1	61.7	63.0	62.9
	LiveCodeBench v6	1-shot	30.8	26.3	24.8	24.9
	SWE-Bench (AgentLess)	3-shot	30.8	28.2	24.8	9.4*
Chinese	C-Eval	5-shot	87.9	92.5	90.0	91.0
	CMMLU	5-shot	87.4	90.9	88.8	88.9
	C-SimpleQA	5-shot	61.5	77.6	70.9	68.0
Multilingual	GlobalMMLU	5-shot	76.6	80.7	81.9	82.0
	INCLUDE	5-shot	71.4	75.3	77.2	77.2
Long Context	NIAH-Multi	32K	99.3	99.8	99.7	85.6*
		64K	99.9	100.0	98.6	85.9*
		128K	98.6	99.5	97.2	94.3*
		256K	96.7	-	-	-
	GSM-Infinite Hard	16K	37.7	34.6	41.5	50.4
		32K	33.7	26.1	38.8	45.2
		64K	31.5	16.0	34.7	32.6
		128K	29.0	8.8	28.7	25.7

\* indicates the model may fail to follow the prompt or format.

Post-training Model Evaluation

Following our Post-Training Paradigm with MOPD and Agentic RL, the model achieves SOTA reasoning and agentic performance.

Benchmark	MiMo-V2 Flash	Kimi-K2 Thinking	DeepSeek-V3.2 Thinking	Gemini-3.0 Pro	Claude Sonnet 4.5	GPT-5 High
Reasoning
MMLU-Pro	84.9	84.6	85.0	90.1	88.2	87.5
GPQA-Diamond	83.7	84.5	82.4	91.9	83.4	85.7
HLE (no tools)	22.1	23.9	25.1	37.5	13.7	26.3
AIME 2025	94.1	94.5	93.1	95.0	87.0	94.6
HMMT Feb. 2025	84.4	89.4	92.5	97.5	79.2	88.3
LiveCodeBench-v6	80.6	83.1	83.3	90.7	64.0	84.5
General Writing
Arena-Hard (Hard Prompt)	54.1	71.9	53.4	72.6	63.3	71.9
Arena-Hard (Creative Writing)	86.2	80.1	88.8	93.6	76.7	92.2
Long Context
LongBench V2	60.6	45.1	58.4	65.6	61.8	-
MRCR	45.7	44.2	55.5	89.7	55.4	-
Code Agent
SWE-Bench Verified	73.4	71.3	73.1	76.2	77.2	74.9
SWE-Bench Multilingual	71.7	61.1	70.2	-	68.0	55.3
Terminal-Bench Hard	30.5	30.6	35.4	39.0	33.3	30.5
Terminal-Bench 2.0	38.5	35.7	46.4	54.2	42.8	35.2
General Agent
BrowseComp	45.4	-	51.4	-	24.1	54.9
BrowseComp (w/ Context Manage)	58.3	60.2	67.6	59.2	-	-
\\(\tau^2\\)-Bench	80.3	74.3	80.3	85.4	84.7	80.2

4. Model Architecture

Hybrid Sliding Window Attention

MiMo-V2-Flash addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA).

Configuration: Stacks of \\(M=8\\) hybrid blocks. Each block contains \\(N=5\\) SWA layers followed by 1 GA layer.
Efficiency: SWA layers use a window size of 128 tokens, reducing KV cache significantly.
Sink Bias: Learnable attention sink bias is applied to maintain performance despite the aggressive window size.

Lightweight Multi-Token Prediction (MTP)

Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.

Structure: Uses a dense FFN (instead of MoE) and SWA (instead of GA) to keep the parameter count low (0.33B per block).
Performance: Facilitates self-speculative decoding, tripling generation speed and mitigating GPU idleness during small-batch RL training.

5. Post-Training Technical Highlights

MiMo-V2-Flash leverages a post-training pipeline designed to maximize reasoning and agentic capabilities through innovative distillation and reinforcement learning strategies.

5.1 Multi-Teacher On-Policy Distillation (MOPD)

We introduce Multi-Teacher On-Policy Distillation (MOPD), a new paradigm that formulates knowledge distillation as a reinforcement learning process.

Dense Token-Level Guidance: Unlike methods relying on sparse sequence-level feedback, MOPD utilizes domain-specific expert models (teachers) to provide supervision at every token position.
On-Policy Optimization: The student model learns from its own generated responses rather than a fixed dataset. This eliminates exposure bias and ensures smaller, more stable gradient updates.
Inherent Reward Robustness: Rewards are derived from the distribution divergence between student and teacher, making the process naturally resistant to reward hacking.

5.2 Scaling Agentic RL

We significantly scale up the agentic training environments to improve intelligence and generalization.

Massive Code Agent Environments: We utilize real-world GitHub issues to create over 100,000 verifiable tasks. Our automated pipeline maintains a Kubernetes cluster capable of running over 10,000 concurrent pods with a 70% environment setup success rate.
Multimodal Verifier for WebDev: For web development tasks, we employ a vision-based verifier that evaluates code execution via recorded videos rather than static screenshots. This reduces visual hallucination and ensures functional correctness.
Cross-Domain Generalization: Our experiments show that large-scale RL training on code agents effectively generalizes to other domains, boosting performance in Math and General Agent tasks.

5.3 Advanced RL Infrastructure

To support high-throughput RL training for large-scale MoE models, we implemented several infrastructure optimizations on top of SGLang and Megatron-LM.

Rollout Routing Replay (R3): Addresses numerical precision inconsistencies in MoE routing between inference and training. R3 reuses the exact routed experts from rollout during the training pass, ensuring consistency with negligible overhead.
Request-Level Prefix Cache: In multi-turn agent training, this cache stores KV states and routed experts from prior turns. It avoids re-computation and ensures sampling consistency across turns.
Fine-Grained Data Scheduler: We extend the rollout engine to schedule fine-grained sequences instead of micro-batches. Combined with partial rollout, this significantly reduces GPU idleness caused by long-tail stragglers.
Toolbox & Tool Manager: A two-layer design using Ray actor pools to handle resource contention. It eliminates cold-start delays for tool execution and isolates task logic from system policies.

6. Inference & Deployment

MiMo-V2-Flash supports FP8 mixed precision inference. We recommend using SGLang for optimal performance.

Quick Start with SGLang

pip install sglang

# Launch server
python3 -m sglang.launch_server \
        --model-path XiaomiMiMo/MiMo-V2-Flash \
        --served-model-name mimo-v2-flash \
        --pp-size 1 \
        --dp-size 2 \
        --enable-dp-attention \
        --tp-size 8 \
        --moe-a2a-backend deepep \
        --page-size 1 \
        --host 0.0.0.0 \
        --port 9001 \
        --trust-remote-code \
        --mem-fraction-static 0.75 \
        --max-running-requests 128 \
        --chunked-prefill-size 16384 \
        --reasoning-parser qwen3 \
        --tool-call-parser mimo \
        --context-length 262144 \
        --attention-backend fa3 \
        --speculative-algorithm EAGLE \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --enable-mtp

# Send request
curl -i http://localhost:9001/v1/chat/completions \
    -H 'Content-Type:application/json' \
    -d  '{
            "messages" : [{
                "role": "user",
                "content": "Nice to meet you MiMo"
            }],
            "model": "mimo-v2-flash",
            "max_tokens": 4096,
            "temperature": 0.8,
            "top_p": 0.95,
            "stream": true,
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }'

Inference with KTransformers (CPU Offloading)

KTransformers enables efficient MiMo-V2-Flash deployment on consumer-grade hardware by offloading MoE expert computations to CPU, built on top of SGLang. With 4× RTX 5090 + 2× AMD EPYC 9355, it achieves up to 35.7 tokens/s decode speed.

For quick start and benchmarks, visit KTransformers.

Notifications

1. System prompt

[!IMPORTANT] The following system prompts are HIGHLY recommended, please choose from English and Chinese version.

English

You are MiMo, an AI assistant developed by Xiaomi.

Today's date: {date} {week}. Your knowledge cutoff date is December 2024.

Chinese

你是MiMo（中文名称也是MiMo），是小米公司研发的AI智能助手。

今天的日期：{date} {week}，你的知识截止日期是2024年12月。

2. Sampling parameters

[!IMPORTANT] Recommended sampling parameters:

top_p=0.95

temperature=0.8 for math, writing, web-dev

temperature=0.3 for agentic taks (e.g., vibe-coding, tool-use)

3. Tool-use practice

[!IMPORTANT] In the thinking mode with multi-turn tool calls, the model returns a reasoning_content field alongside tool_calls. To continue the conversation, the user must persist all history reasoning_content in the messages array of each subsequent request.

7. Citation

If you find our work helpful, please cite our technical report:

@misc{mimo2025flash,
  title={MiMo-V2-Flash Technical Report},
  author={LLM-Core Xiaomi},
  year={2025},
  url={https://github.com/XiaomiMiMo/MiMo-V2-Flash/paper.pdf}
}

8. Contact

Please contact us at mimo@xiaomi.com, join our WeChat group below or open an issue if you have any questions.