Overview
IQBench is a novel benchmark developed to evaluate the fluid intelligence of Vision-Language Models (VLMs) using standardized visual IQ tests. It features:
- 500 human-curated questions
- 10 reasoning domains
- Sources: Images drawn from online repositories, textbooks, and educational content.
- Editing: Images were cleaned or regenerated to ensure quality and legal compliance.
Unlike other benchmarks, IQBench emphasizes both prediction accuracy and interpretability of reasoning.
Tasks
- MDRT: Mechanical Deductive Reasoning Test
- DRTF: Deductive Reasoning Test with Figures
- 3D SPRT: 3D Spatial Deductive Reasoning Test
- VRTS: Verbal Reasoning Test with Syllogisms
- IVRT: Inductive Verbal Reasoning Test
- Num.: Numerical Reasoning
- FS: Figure Series
- NS: Number Series
- Ana5 / Ana3: Anagram Tests (5/3-letter)
Dataset Statistics
Metric | Value |
---|---|
Total samples | 500 |
Number of topics | 10 |
Samples per topic | 50 |
Multiple-choice questions | 110 |
Open-ended questions | 390 |
Avg. question length | 27 words |
Avg. reasoning pattern length | 48 words |
Data Quality Control
- Uniqueness: Lexical overlap analysis to remove duplicate questions (a minimal sketch of one such check follows this list)
- Standardization: Manual review to ensure consistency in format
- Visual emphasis: Questions are vision-centric to minimize reliance on language priors
- Legal compliance: All images are verified or newly created to avoid copyright issues
- Reduced data leakage: Human-generated content decreases the chance of pretraining contamination
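The card names "lexical overlap analysis" without specifying the procedure. The snippet below is a minimal sketch of one plausible duplicate check, assuming token-level Jaccard similarity and an arbitrary 0.8 threshold; both the metric and the threshold are illustrative assumptions, not the authors' documented method.

```python
from typing import List

def jaccard_overlap(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two question strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def deduplicate(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it does not overlap heavily with any already-kept question.
    The 0.8 threshold is an assumption for illustration, not IQBench's documented value."""
    kept: List[str] = []
    for q in questions:
        if all(jaccard_overlap(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```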
Evaluation Methods
IQBench adopts a dual-metric evaluation framework to assess both the accuracy and reasoning quality of Vision-Language Models (VLMs):
1. Accuracy Score
- Measures whether the final answer is correct (exact match).
- Applies to both multiple-choice and open-ended questions.
- Score: `1` (correct), `0` (incorrect).
2. Reasoning Score
- Evaluates how well the model's explanation aligns with the expected reasoning path.
- Uses an LLM-as-judge approach (e.g., `gpt-4o-mini`) to compare model explanations with human-annotated reasoning patterns (see the scoring sketch after this list).
- Score: `1` (aligned), `0` (misaligned).
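The scoring pipeline itself is not included in this card. The sketch below is a minimal, assumed implementation of the dual-metric scheme using the OpenAI Python client with `gpt-4o-mini` as the judge; the function names, prompt wording, and normalization are illustrative assumptions, not IQBench's released code.

```python
# Minimal sketch of the dual-metric scoring (illustrative, not IQBench's released code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def accuracy_score(predicted: str, gold: str) -> int:
    """Exact match after light normalization: 1 (correct), 0 (incorrect)."""
    return int(predicted.strip().lower() == gold.strip().lower())

def reasoning_score(explanation: str, reference_pattern: str) -> int:
    """LLM-as-judge: 1 if the explanation aligns with the annotated pattern, else 0.
    The judge prompt below is an assumption for illustration."""
    prompt = (
        "Reference reasoning pattern:\n"
        f"{reference_pattern}\n\n"
        "Model explanation:\n"
        f"{explanation}\n\n"
        "Does the explanation follow the same reasoning as the reference? "
        "Answer with exactly one word: ALIGNED or MISALIGNED."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return int(verdict.startswith("ALIGNED"))
```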
Human Evaluation
- Human judges assess reasoning quality for qualitative insights, complementing automated scores.
This combined approach ensures a more comprehensive evaluation of both answer accuracy and interpretability of reasoning.
IQBench Evaluation Results
Reasoning Evaluation
Model | MDRT | DRTF | 3D SPRT | VRTS | IVRT | Num. | FS | NS | Ana5 | Ana3 | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
`gemini-2.5-flash` | 0.60 | 0.78 | 0.22 | 0.72 | 0.78 | 0.72 | 0.54 | 0.94 | 0.14 | 0.42 | 0.586 |
`gemini-2.0-flash` | 0.50 | 0.52 | 0.16 | 0.74 | 0.74 | 0.28 | 0.32 | 0.58 | 0.16 | 0.08 | 0.408 |
`claude-3.7-sonnet` | 0.58 | 0.82 | 0.36 | 0.72 | 0.70 | 0.50 | 0.54 | 0.76 | 0.04 | 0.14 | 0.516 |
`claude-3.5-sonnet` | 0.60 | 0.64 | 0.16 | 0.74 | 0.72 | 0.16 | 0.28 | 0.52 | 0.12 | 0.06 | 0.400 |
`gpt-4o` | 0.60 | 0.44 | 0.56 | 0.78 | 0.80 | 0.30 | 0.68 | 0.44 | 0.04 | 0.02 | 0.466 |
`o4-mini` | 0.92 | 0.88 | 0.82 | 0.78 | 0.80 | 0.72 | 0.90 | 0.90 | 0.10 | 0.14 | 0.696 |
`gpt-o3` | 0.70 | 0.88 | – | – | – | – | – | – | 0.12 | – | – |
Accuracy Evaluation
Model | MDRT | DRTF | 3D SPRT | VRTS | IVRT | Num. | FS | NS | Ana5 | Ana3 | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
`gemini-2.5-flash` | 0.60 | 0.78 | 0.18 | 0.66 | 0.78 | 0.74 | 0.60 | 0.88 | 0.14 | 0.42 | 0.578 |
`gemini-2.0-flash` | 0.58 | 0.56 | 0.22 | 0.70 | 0.74 | 0.52 | 0.44 | 0.84 | 0.16 | 0.14 | 0.490 |
`claude-3.7-sonnet` | 0.64 | 0.90 | 0.40 | 0.72 | 0.68 | 0.46 | 0.66 | 0.82 | 0.04 | 0.16 | 0.548 |
`claude-3.5-sonnet` | 0.62 | 0.68 | 0.20 | 0.74 | 0.74 | 0.32 | 0.42 | 0.76 | 0.12 | 0.10 | 0.470 |
`gpt-4o` | 0.56 | 0.42 | 0.20 | 0.80 | 0.74 | 0.36 | 0.26 | 0.66 | 0.06 | 0.02 | 0.408 |
`o4-mini` | 0.72 | 0.86 | 0.34 | 0.66 | 0.76 | 0.82 | 0.60 | 0.94 | 0.02 | 0.14 | 0.615 |
`gpt-o3` | 0.70 | 0.88 | – | – | – | – | – | – | 0.12 | – | – |