

Qwen-Image-Bench

May 21, 2026 · Alibaba Qwen · license: apache-2.0
55 GB · 27B dense

Q-Judger


A fine-tuned judge model for evaluating text-to-image (T2I) generation quality. Built on top of Qwen3.6-27B, it scores generated images across 5 hierarchical dimensions using structured checklists and outputs JSON-formatted evaluation results.

Links

| Resource | Link |
| --- | --- |
| 📑 Paper | http://arxiv.org/abs/2605.28091 |
| 📊 Benchmark Dataset (HuggingFace) | https://huggingface.co/datasets/Qwen/Qwen-Image-Bench |
| 📊 Benchmark Dataset (ModelScope) | https://www.modelscope.cn/datasets/Qwen/Qwen-Image-Bench |
| 💻 GitHub | https://github.com/QwenLM/Qwen-Image-Bench |
| 🧑‍⚖️ Q-Judger Model | https://huggingface.co/Qwen/Qwen-Image-Bench |
| 🧑‍⚖️ Q-Judger Model | https://modelscope.cn/models/Qwen/Qwen-Image-Bench |

Model Description

Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of images produced by text-to-image models. Given a text prompt and a generated image, it scores the image against fine-grained quality criteria organized in a 3-level hierarchy and outputs the scores as structured JSON.

Evaluation Dimensions

The model evaluates images across 5 top-level dimensions, each with multiple sub-dimensions:

Quality

Aesthetics

Alignment

Real-world Fidelity

Creative Generation

Scoring Methodology

Raw Score Mapping

| Raw Score | Meaning | Mapped Score |
| --- | --- | --- |
| 0 | Fail | 0 |
| 1 | Pass | 60 |
| 2 | Excel | 100 |
| N/A | Not applicable | Excluded |

Aggregation

  1. Level-3 → Level-2: Average all non-N/A Level-3 scores within a Level-2 category
  2. Level-2 → Level-1: Average all Level-2 scores within a Level-1 dimension
  3. Level-1 → Total: Average all Level-1 dimension scores
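The score mapping and the three aggregation steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, the nested dictionary shape is assumed to mirror the judge's JSON output, and skipping a Level-2 category whose items are all N/A is an assumption, since the rules above do not cover that case.

```python
from statistics import mean
from typing import Optional

# Raw-to-mapped scores from the table above; "N/A" items are excluded.
SCORE_MAP = {0: 0, 1: 60, 2: 100}


def level2_score(level3_items: dict) -> Optional[float]:
    """Average the mapped, non-N/A Level-3 scores of one Level-2 category."""
    mapped = [
        SCORE_MAP[item["score"]]
        for item in level3_items.values()
        if item["score"] != "N/A"
    ]
    # Assumption: a category with no applicable items yields no score.
    return mean(mapped) if mapped else None


def level1_score(level2_categories: dict) -> Optional[float]:
    """Average the Level-2 scores within one Level-1 dimension."""
    scores = [level2_score(items) for items in level2_categories.values()]
    scores = [s for s in scores if s is not None]
    return mean(scores) if scores else None


def total_score(level1_dimensions: dict) -> Optional[float]:
    """Average the Level-1 dimension scores into a single total."""
    scores = [level1_score(cats) for cats in level1_dimensions.values()]
    scores = [s for s in scores if s is not None]
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Hypothetical judgment for a single Level-1 dimension.
    quality = {
        "Realism": {"Physical Logic": {"score": 1}, "Material Texture": {"score": 2}},
        "Resolution": {"Resolution": {"score": "N/A"}},
    }
    print(level1_score(quality))
```

Averaging at each level, rather than pooling all Level-3 items, gives every Level-2 category equal weight regardless of how many checklist items it contains.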

Human Agreement

We validate the judge model against human experts by computing the Spearman rank correlation ($\rho$) between the model's rankings and the experts' rankings of $N = 18$ models, for each of the five Level-1 dimensions and for the overall score. All correlations are statistically significant ($p < 10^{-4}$).

| Dimension | Spearman $\rho$ |
| --- | --- |
| Quality | 0.89 |
| Aesthetics | 0.89 |
| Alignment | 0.89 |
| Real-world Fidelity | 0.92 |
| Creative Generation | 0.92 |
| Overall | 0.92 |
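For readers who want to run the same kind of agreement check on their own rankings, Spearman's $\rho$ can be computed as the Pearson correlation of the two rank vectors. The sketch below is a dependency-free illustration with hypothetical function names and made-up scores; it does not reproduce the authors' evaluation script or the numbers in the table, and it does not compute p-values.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up scores for five hypothetical T2I models (not data from the paper).
    judge_scores = [71.2, 64.5, 80.1, 58.3, 75.0]
    human_scores = [69.0, 66.0, 78.5, 55.0, 79.0]
    print(round(spearman_rho(judge_scores, human_scores), 3))
```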

Quick Start

Get the Inference Code

git clone https://github.com/QwenLM/Qwen-Image-Bench.git
cd Qwen-Image-Bench

Installation

1. Create and activate a virtual environment with uv:

uv venv myenv --python 3.11
source myenv/bin/activate

2. Install PyTorch (select the command matching your CUDA version):

See the official guide: https://pytorch.org/get-started/locally/

3. Install Python dependencies:

uv pip install -r requirements.txt

This installs all required dependencies, including ms-swift.

Run Inference

python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

Input Format

Prepare a CSV, JSON, or JSONL file with the following columns:

| Column | Type | Description |
| --- | --- | --- |
| ID | int | Prompt identifier (1-1000), must match benchmark metadata |
| prompt | str | The text prompt used to generate the image |
| image_path | str | Path to the generated image file |
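One way to produce such a file is to write one JSON object per line with the three columns as keys. The sketch below assumes that `judge.py` reads JSONL in exactly that shape, which the table implies but does not spell out; the helper names, prompts, and image paths are hypothetical.

```python
import json

REQUIRED_COLUMNS = ("ID", "prompt", "image_path")


def build_records(rows):
    """Validate rows against the column table and normalise their types."""
    records = []
    for row in rows:
        missing = [key for key in REQUIRED_COLUMNS if key not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        prompt_id = int(row["ID"])
        # The table above gives 1-1000 as the valid ID range.
        if not 1 <= prompt_id <= 1000:
            raise ValueError(f"ID {prompt_id} is outside 1-1000")
        records.append(
            {
                "ID": prompt_id,
                "prompt": str(row["prompt"]),
                "image_path": str(row["image_path"]),
            }
        )
    return records


def write_jsonl(records, out_path):
    """Write one JSON object per line (assumed JSONL layout for judge.py)."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical rows; real IDs and prompts come from the benchmark metadata.
    rows = [
        {"ID": 1, "prompt": "a red kite over a wheat field", "image_path": "outputs/0001.png"},
        {"ID": 2, "prompt": "a glass teapot on a marble table", "image_path": "outputs/0002.png"},
    ]
    write_jsonl(build_records(rows), "your_data.jsonl")
```

The resulting `your_data.jsonl` matches the file name used in the Run Inference command.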

Output Format

The model outputs a JSON object per dimension, structured as:

{
  "Level-2 Dimension": {
    "Level-3 Dimension": {"score": 0|1|2|"N/A"}
  }
}

Example (Quality dimension):

{
  "Realism": {
    "Physical Logic": {"score": 1},
    "Material Texture": {"score": 2}
  },
  "Detail": {
    "Noise": {"score": 1},
    "Edge Clarity": {"score": 1},
    "Naturalness": {"score": 1}
  },
  "Resolution": {
    "Resolution": {"score": 2}
  }
}
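When post-processing results, it can help to check a per-dimension object against this schema and flatten it into rows before any aggregation. The following sketch assumes the input string is already the bare JSON object (any reasoning text emitted when thinking is enabled is assumed to have been stripped beforehand); `flatten_scores` is a hypothetical helper, not part of the repository.

```python
import json


def flatten_scores(output_json: str):
    """Parse one per-dimension output and return (level2, level3, score) rows.

    Raises ValueError if a score is not 0, 1, 2, or "N/A".
    """
    parsed = json.loads(output_json)
    rows = []
    for level2_name, level3_items in parsed.items():
        for level3_name, item in level3_items.items():
            score = item["score"]
            # type(...) is int rejects booleans, which JSON also allows.
            is_valid = score == "N/A" or (type(score) is int and score in (0, 1, 2))
            if not is_valid:
                raise ValueError(
                    f"unexpected score {score!r} at {level2_name} / {level3_name}"
                )
            rows.append((level2_name, level3_name, score))
    return rows


if __name__ == "__main__":
    # Shortened version of the Quality example above.
    example = '{"Realism": {"Physical Logic": {"score": 1}, "Material Texture": {"score": 2}}}'
    for row in flatten_scores(example):
        print(row)
```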

CLI Options

| Argument | Default | Description |
| --- | --- | --- |
| --input | (required) | Input CSV/JSON/JSONL with ID, prompt, image_path |
| --model | (required) | HuggingFace model ID or local model path |
| --hf-bench-repo | - | HF dataset repo for bench metadata |
| --local-metadata | - | Local metadata file path (overrides default) |
| --max-batch-size | 24 | ms-swift max_batch_size |
| --max-new-tokens | 4096 | Max generation tokens |

Inference Parameters

The judge model uses fixed inference parameters for reproducibility:

| Parameter | Value |
| --- | --- |
| seed | 42 |
| temperature | 0 |
| top_k | 1 |
| top_p | 1.0 |
| repetition_penalty | 1.05 |
| max_new_tokens | 4096 |
| enable_thinking | True |
| max_batch_size | 24 |

Citation

If you find this model useful, please cite our paper:

@misc{li2026qwenimagebenchgenerationcreationtexttoimage,
      title={Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation}, 
      author={Niantong Li and Guangzheng Hu and Weixu Qiao and Ying Ba and Qichen Hong and Shijun Shen and Jinlin Wang and Fan Zhou and Jianye Kang and Xin Shang and Ziyi He and Wei Wang and Dalin Li and Jiahao Li and Jie Zhang and Kaiyuan Gao and Kun Yan and Lihan Jiang and Ningyuan Tang and Shengming Yin and Tianhe Wu and Xiao Xu and Xiaoyue Chen and Yuxiang Chen and Yan Shu and Yanran Zhang and Yilei Chen and Yixian Xu and Zekai Zhang and Zhendong Wang and Zihao Liu and Zikai Zhou and Hongzhu Shi and Yi Wang and Bing Zhao and Hu Wei and Lin Qu and Chenfei Wu},
      year={2026},
      eprint={2605.28091},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.28091}, 
}

License

This project is licensed under the Apache License 2.0.
