Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge

Motivation

Can Multimodal Judges Really Judge What They See?

Multimodal LLMs are replacing costly human annotations as automated evaluators.

However, judging vision-language responses requires more than fluent reasoning: a judge must verify whether each response is grounded in the image.

Key Question: Do MLLM judges actually use visual evidence when evaluating responses?

Problem

The Core Problem: Perceptual Judgment Bias

Definition: A multimodal judge fails to penalize a response whose visual claims contradict the image.

Key distinction from prior work: This is not simply poor visual perception or hallucination. It is a judgment-specific bias where the judge fails to connect perception with evaluation.

Two examples of perceptual judgment bias and how Perception-Judge corrects them. — **Perceptual judgment bias.** Existing MLLM judges can give high scores to visually inconsistent responses either because they misperceive the image or because they anchor on the response text despite perceiving the image correctly. Perception-Judge mitigates both failure modes by aligning judgments with verifiable visual evidence.

1. Insufficient Perceptual Capability

The judge itself misperceives the image, leading it to reward a visually incorrect answer simply because it sounds logical.

2. Response-Anchored Judgment Reasoning

The judge can perceive the image correctly, but still follows the response text during judgment, ignoring its own accurate perception.

Analysis 1: Bias Decomposition

The Main Error Is Not Just Poor Perception

Decomposing perceptual judgment errors reveals that anchoring on textual responses is a more significant issue than failing to see the image correctly.

Model	Acc. ↑	Error ↓
Model	Acc. ↑	Mode (a)Insufficient perception	Mode (b)Response anchoring	Overall
Qwen2.5-VL-7B	69.5%	14.0%	16.4%	30.5%
Flex-Judge-VL-7B	76.6%	9.4%	14.1%	23.5%
Perception-Judge-Flex-7B (Ours)	85.7%	6.7%	7.6%	14.3%

Main message: Response anchoring (14.1%) accounts for a larger portion of errors than pure perception failure (9.4%) in baseline models.

Analysis 2: Controlled Perturbation Study

Fluent Reasoning Can Hide Visual Errors

Comparing correct responses \(r_c\) with perturbed negative responses reveals a key vulnerability in current judges.

Controlled perturbation analysis for the project page.

Left: High Accuracy

The response contains both visual and reasoning errors, making it easier for judges to reject.

Right: Accuracy Drops

The response contains only a visual error, while the reasoning remains fluent and plausible.

Conclusion: Existing judges reject bad reasoning, but miss fluent visual errors.

Dataset

Dataset Generation: Perceptually Perturbed Judgment Dataset

For each input:

\(r_c\): perceptually and logically correct response

\(r_{rp}\): perceptually incorrect but reasoning-preserved response

\(r_{rp+r}\): both perceptually and logically incorrect response

Supervision: Explicit ordering: \(r_c \succ r_{rp} \succ r_{rp+r}\)

PPJD Statistics
Source	MMPR v1.2
Training set	3k high-quality pairs
Response set	\(r_c\), \(r_{rp}\), \(r_{rp+r}\)
Supervision type	Explicit triplet ordering

Method

Training Objective - Batch-Ranking Reward with GRPO

Batch-ranking reward objective for the project page.

Goal & Optimization

Train the judge to output the global target order. We use GRPO to optimize the judge with a verifiable reward.

Reward Design

Format reward: Validates the judgment output structure.

Batch-ranking reward: Measures closeness to the target order using Levenshtein distance.

Core Contribution: A perception-aware judge trained not with human pair labels, but with verifiable structured perturbations.

Results

Results on MLLM-as-a-Judge Benchmark

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark (ICML 2024 Oral)

Model	Single Score ↑	Pairwise (w/ Tie) ↑	Pairwise (w/o Tie) ↑	Batch Ranking ↓
Flex-Judge-VL-7B (Baseline)	0.404	0.514	0.623	0.517
Perception-Judge-Flex (Ours)	0.466	0.520	0.645	0.505
Qwen3-VL-4B-Thinking (Baseline)	0.419	0.543	0.663	0.498
Perception-Judge-Flex (Ours)	0.457	0.554	0.691	0.444

Our model achieves state-of-the-art evaluation alignment, matching proprietary models like GPT-4o while remaining highly data-efficient (trained on just 3k samples).

Qualitative Examples

Qualitative comparison between Flex-Judge-VL-7B and Perception-Judge-Flex-7B. — Flex-Judge-VL-7B vs. Perception-Judge-Flex-7B

Qualitative comparison between Qwen3-VL-4B-Thinking and Perception-Judge-Qwen-7B. — Qwen3-VL-4B-Thinking vs. Perception-Judge-Qwen-7B

Citation

BibTeX

If you find this work useful, please cite the paper as follows.

@inproceedings{perceptionjudge2026,
  title={Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling},
  author={Park, Seojeong and Choi, Jiho and Kang, Junyong and Lee, Seonho and Shin, Jaeyo and Shim, Hyunjung},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}