SoTA Feed — Every open-weights release from the labs that matter


MiMo-VL-7B-SFT-2508

Aug 7, 2025 · Xiaomi MiMo · license: mit · view on Hugging Face ↗
16.6 GB · 8.3B dense


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo-VL Technical Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


| 🤗 HuggingFace  | 🤖️ ModelScope  | 📔 Technical Report  | 📃 Paper  |

🔥🔥🔥MiMo-VL 2508 Updates

We're excited to announce improvements to our MiMo-VL models (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508), featuring enhanced performance across multiple benchmarks, improved thinking control, and a better user experience.

📈 Performance Improvements

MiMo-VL-7B-RL-2508 demonstrates consistent improvements across both image and video benchmarks, achieving notable milestones of 70.6 on MMMU and 70.8 on VideoMME.

[Figure: Benchmark Improvements]

Full evaluation results can be found below.

🤔 Thinking Control Feature

Users can now turn off the model's reasoning mode by appending the /no_think command to a query.

❤️ Enhanced User Experience

Our internal VLM Arena ratings show meaningful improvement in real-world performance.

These updates deliver a more capable, flexible, and reliable vision-language model for both academic evaluation and practical applications.

😉 Case Study: What are the appealing features of this car?

Xiaomi SU7

MiMo-VL-7B-RL-2508's Answer:

<think>...</think>
To determine the appealing features of the Xiaomi SU7, we analyze visual elements and context from the video:
1. Aesthetic Design
- The car boasts a sleek, aerodynamic exterior with smooth, modern lines.
- Distinctive lighting (sharp headlights and a continuous, glowing taillight strip) enhances its futuristic and sporty look.
- Vibrant color and polished finish make it visually striking.
2. Performance & Speed
- Multiple dynamic shots of the car driving at high speeds on diverse roads (desert, forest, city) emphasize agility and power.
- Close-ups of spinning wheels and the engine (with lightning-like effects) suggest advanced engineering and high-performance capabilities (likely electric or tech-driven).
3. Versatility
- The car is shown in varied environments (desert at dusk, forested roads, urban streets), highlighting its adaptability to different terrains and use cases.
4. Technological Innovation
- The engine close-up with electric-like lightning effects implies cutting-edge technology, positioning the car as a tech-forward vehicle (aligning with Xiaomi’s tech brand identity).
5. Brand Appeal
- Prominent “Xiaomi SU7” branding leverages Xiaomi’s reputation in consumer tech, adding credibility and allure for tech-savvy audiences.


In summary, the Xiaomi SU7 appeals through its striking design, high-performance capabilities, versatility, technological innovation, and strong brand association.

Full Evaluation Results

[Figure: Evaluation Results]

Model Recommendation

Both versions of the MiMo-VL-7B-2508 model (RL and SFT) are now open-sourced on Hugging Face.

Deployment Parameters

Thinking Control

Users can control the thinking mode by appending /no_think to queries:

❗️Important: /no_think must be the very last part of the user message. No user content, such as an image or video, may follow it.
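As a minimal sketch of the rule above, the helper below makes /no_think the final part of a user message's content list. The function name `disable_thinking` is hypothetical and not part of any MiMo-VL API; the content-item layout follows the message examples in this document.

```python
NO_THINK = "/no_think"


def disable_thinking(content):
    """Return a copy of a user `content` list that ends with /no_think.

    `content` is a list of dicts such as {"type": "text", "text": "..."}
    or {"type": "image", "image": path}. The input list is not modified.
    """
    items = [dict(item) for item in content]  # copy so the caller's data stays untouched
    if items and items[-1].get("type") == "text":
        text = items[-1]["text"].rstrip()
        if not text.endswith(NO_THINK):
            # Append the command to the trailing text item.
            items[-1]["text"] = f"{text} {NO_THINK}".strip()
    else:
        # The last item is media (or the list is empty), so add a trailing
        # text item; nothing may come after /no_think.
        items.append({"type": "text", "text": NO_THINK})
    return items
```

Calling the helper on a list that already ends in /no_think returns an equivalent list, so it is safe to apply more than once.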

Placing Visual Input

For prompts with a single image or video, always place the visual media before the text. For example:

✅ Good:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text",  "text": "Describe the image. /no_think"},
        ],
    }
]

❌ Bad:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text",  "text": "Describe the image. /no_think"},
            {"type": "image", "image": image_path},
        ],
    }
]
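The ordering rule can also be applied programmatically. The sketch below moves image and video items ahead of text items while preserving their relative order; `media_first` is a hypothetical helper name, and it is intended for the single-image or single-video case described above.

```python
MEDIA_TYPES = {"image", "video"}


def media_first(content):
    """Return a new content list with image/video items placed before the rest.

    Relative order within each group is preserved. The input is not modified.
    """
    media = [item for item in content if item.get("type") in MEDIA_TYPES]
    other = [item for item in content if item.get("type") not in MEDIA_TYPES]
    return media + other


# Turning the "bad" ordering into the "good" one:
bad_content = [
    {"type": "text", "text": "Describe the image. /no_think"},
    {"type": "image", "image": "example.png"},  # placeholder path
]
good_content = media_first(bad_content)
```

Because the text item ends up last, a trailing /no_think in that text remains the final part of the message.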

I. Introduction

In this report, we share our efforts to build a compact yet powerful VLM, MiMo-VL-7B. MiMo-VL-7B comprises (1) a native resolution ViT encoder that preserves fine-grained visual details, (2) an MLP projector for efficient cross-modal alignment, and (3) our MiMo-7B language model, specifically optimized for complex reasoning tasks.

The development of MiMo-VL-7B involves two sequential training processes: (1) A four-stage pre-training phase, which includes projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). This phase yields the MiMo-VL-7B-SFT model. (2) A subsequent post-training phase, where we introduce Mixed On-policy Reinforcement Learning (MORL), a novel framework that seamlessly integrates diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning capabilities, and human/AI preferences. This phase yields the MiMo-VL-7B-RL model.

We open-source the MiMo-VL-7B series, including checkpoints of the SFT and RL models. We believe this report, along with the models, will provide valuable insights for developing powerful reasoning VLMs that benefit the larger community.

🛤️ During this journey, we find

II. Model Details

Models are available in the MiMo-VL collections on Hugging Face and ModelScope.

| Model | Description | Download (HuggingFace) | Download (ModelScope) |
| --- | --- | --- | --- |
| MiMo-VL-7B-SFT | VLM with extraordinary reasoning potential after 4-stage pre-training | 🤗 XiaomiMiMo/MiMo-VL-7B-SFT | 🤖️ XiaomiMiMo/MiMo-VL-7B-SFT |
| MiMo-VL-7B-RL | RL model leapfrogging existing open-source models | 🤗 XiaomiMiMo/MiMo-VL-7B-RL | 🤖️ XiaomiMiMo/MiMo-VL-7B-RL |

III. Evaluation Results

General Capabilities

In general visual-language understanding, MiMo-VL-7B models achieve state-of-the-art open-source results.

Reasoning Tasks

In multi-modal reasoning, both the SFT and RL models significantly outperform all compared open-source baselines across these benchmarks.

Important: Results marked with * are obtained using our evaluation framework. Tasks marked with † are evaluated by GPT-4o.

GUI Tasks

MiMo-VL-7B-RL possesses exceptional GUI understanding and grounding capabilities. As a general-purpose VL model, MiMo-VL achieves comparable or even superior performance to GUI-specialized models.

Elo Rating

With our in-house evaluation dataset and GPT-4o judgments, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first across models spanning from 7B to 72B parameters.
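For readers unfamiliar with Elo ratings, the sketch below shows the textbook Elo update for a single pairwise comparison, such as one judged preference between two model answers. It is not necessarily the procedure used in the report: the K-factor of 32, the 400-point scale, and the function names are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    `score_a` is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two models that start at the same rating, one win moves the winner up and the loser down by the same amount, k/2.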

IV. Deployment

The MiMo-VL-7B series maintains full compatibility with the Qwen2_5_VLForConditionalGeneration architecture for deployment and inference.
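A minimal loading sketch based on that compatibility, assuming a Hugging Face `transformers` release that ships `Qwen2_5_VLForConditionalGeneration`, plus `torch` and `accelerate` for `device_map="auto"`. The model ID is the SFT entry from the Model Details table; check the Hugging Face collection for the exact names of the 2508 checkpoints. The function name `load_mimo_vl` is hypothetical.

```python
MODEL_ID = "XiaomiMiMo/MiMo-VL-7B-SFT"  # from the Model Details table


def load_mimo_vl(model_id: str = MODEL_ID):
    """Load a MiMo-VL checkpoint through the Qwen2.5-VL model class.

    Calling this downloads the weights on first use.
    """
    # Imported lazily so that defining the function needs no heavy dependencies.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # assumption: accelerate is installed
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

The returned processor handles both the chat template and image preprocessing, so the `messages` lists shown earlier can be passed through it before generation.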

V. Citation

@misc{coreteam2025mimovltechnicalreport,
      title={MiMo-VL Technical Report}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      eprint={2506.03569},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03569}, 
}

VI. Contact

Please contact us at mimo@xiaomi.com or open an issue if you have any questions.
