MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities


TL;DR

This paper introduces MM-Vet, a new benchmark designed to systematically measure the integrated intelligence of large multimodal models (LMMs) by evaluating how well they combine multiple core vision-language (VL) capabilities to solve complex tasks, together with an automated evaluator based on a large language model (LLM).

Related Work

Large multimodal models (LMMs) such as GPT-4V show a striking ability to solve complex problems. However, existing vision-language (VL) benchmarks, such as VQA and COCO, typically target only one or two specific capabilities (e.g., recognition or captioning) and cannot systematically evaluate how a model performs on complex tasks that require several capabilities to work together. This is a key bottleneck: we lack an effective tool for measuring the "integrated intelligence" of LMMs.

In addition, because LMM outputs are usually open-ended free-form text that varies in length and style, designing a unified, fair, and scalable evaluation metric is another major challenge.

This paper addresses two questions:

  1. How can complex multimodal tasks that require integrating multiple core capabilities be systematically constructed and evaluated?
  2. How can a unified and effective evaluation metric be designed for open-ended answers with diverse formats?

Method

The paper's core contribution is a new evaluation framework, consisting of the MM-Vet benchmark design and a novel evaluator.

MM-Vet Benchmark Design

MM-Vet's design stems from a core insight: the "emergent" ability of LMMs to solve complex multimodal tasks fundamentally arises from mastering and seamlessly integrating multiple core vision-language capabilities.

Based on this, MM-Vet is constructed in two steps:

  1. Define core capabilities: by analyzing complex real-world scenarios, six core VL capabilities are distilled: recognition (Rec), OCR, knowledge (Know), language generation (Gen), spatial awareness (Spat), and math (Math).
  2. Build integrated tasks: around 16 different combinations of these six capabilities (e.g., "recognition + knowledge + generation" for explaining a visual joke, or "OCR + spatial awareness + math" for totaling a shopping receipt), a test set of 200 images and 218 questions is built. All questions are open-ended and cover a wide range of real-world scenarios.
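This two-step construction implies a straightforward scoring scheme: each sample carries the set of capabilities it integrates, and a capability's score is the average per-sample score over every sample tagged with that capability. A minimal sketch of this aggregation (the sample data, scores, and function name below are hypothetical illustrations, not the official MM-Vet code):

```python
from statistics import mean

def capability_scores(samples):
    """Average per-sample scores (0.0-1.0) over every sample that
    requires a given capability; report each as a percentage."""
    per_cap = {}
    for s in samples:
        for cap in s["caps"]:
            per_cap.setdefault(cap, []).append(s["score"])
    return {cap: 100 * mean(scores) for cap, scores in per_cap.items()}

# Hypothetical scored samples: each lists the capabilities it integrates.
samples = [
    {"caps": {"Rec", "Know", "Gen"}, "score": 0.4},   # e.g., explain a visual joke
    {"caps": {"OCR", "Spat", "Math"}, "score": 1.0},  # e.g., total a receipt
    {"caps": {"OCR", "Math"}, "score": 0.5},
]

scores = capability_scores(samples)
print(scores["OCR"])  # average over the two OCR samples -> 75.0
```

Because a sample contributes to every capability it is tagged with, the per-capability coverage naturally sums to more than 100% across the dataset.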

The figure below shows the distribution of the core capabilities and their combinations in the MM-Vet dataset.

Figure: proportions of capabilities in MM-Vet. (a) Coverage of each single capability (the total exceeds 100% because samples often involve multiple capabilities). (b) Distribution of the capability combinations (the total is 100%).

LLM-Based Evaluator for Open-Ended Outputs

To address the difficulty of evaluating open-ended answers, instead of traditional keyword matching or a multiple-choice format, the paper proposes an automated LLM-based evaluator.

Table 1: The few-shot prompt used by GPT-4 to score model outputs. In the table below, $\mathcal{Q}$ is the sample question, $\mathcal{G}$ the ground truth, and $\mathcal{P}$ the model output. The prompt includes answer examples of varying length and type, enabling the evaluator to handle diverse outputs.

Compare the ground truth and prediction from AI models, to give a correctness score for the prediction. <AND> in the ground truth means it is totally right only when all elements in the ground truth are present in the prediction, and <OR> means it is totally right when any one element in the ground truth is present in the prediction. The correctness score is 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Just complete the last space of the correctness score.
| Question | Ground truth | Prediction | Correctness |
| --- | --- | --- | --- |
| What is x in the equation? | -1 <AND> -5 | x = 3 | 0.0 |
| What is x in the equation? | -1 <AND> -5 | x = -1 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 or 5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -1 or x = -5 | 1.0 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme talks about Iceland and Greenland. It’s pointing out that despite their names, Iceland is not very icy and Greenland isn’t very green. | 0.4 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme is using humor to point out the misleading nature of Iceland’s and Greenland’s names. Iceland, despite its name, has lush green landscapes while Greenland is mostly covered in ice and snow. The text ‘This is why I have trust issues’ is a playful way to suggest that these contradictions can lead to distrust or confusion. The humor in this meme is derived from the unexpected contrast between the names of the countries and their actual physical characteristics. | 1.0 |
| $\mathcal{Q}$ | $\mathcal{G}$ | $\mathcal{P}$ |  |
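Operationally, the evaluator fills this prompt with each sample's question, ground truth, and model output, sends it to GPT-4, and parses the numeric score from the completion. A minimal sketch of the prompt assembly and score parsing, assuming the few-shot header shown in Table 1 (the header below is abridged, the actual GPT-4 API call is omitted, and the function names are illustrative):

```python
import re

# Abridged version of the Table 1 prompt header; the real prompt also
# spells out the full 0.0-1.0 score scale and the few-shot example rows.
PROMPT_HEADER = (
    "Compare the ground truth and prediction from AI models, to give a "
    "correctness score for the prediction. <AND> in the ground truth means "
    "it is totally right only when all elements in the ground truth are "
    "present in the prediction, and <OR> means it is totally right when any "
    "one element in the ground truth is present in the prediction. "
    "Just complete the last space of the correctness score.\n\n"
)

def build_eval_prompt(question, ground_truth, prediction):
    """Append the sample to be scored as the final, incomplete table row,
    so the LLM completes the missing Correctness cell."""
    return (PROMPT_HEADER
            + "Question | Ground truth | Prediction | Correctness\n"
            + f"{question} | {ground_truth} | {prediction} | ")

def parse_score(completion):
    """Extract the first score in 0.0 ... 1.0 from the LLM completion."""
    m = re.search(r"\b(0\.\d|1\.0)\b", completion)
    return float(m.group(1)) if m else None

prompt = build_eval_prompt("What is x in the equation?", "-1 <AND> -5", "x = -1")
print(prompt.endswith("| "))  # True: GPT-4 fills in the score in place
print(parse_score("0.5"))     # 0.5
```

Because the score is produced by completing a table cell rather than by keyword matching, the same evaluator handles one-word answers and paragraph-length explanations uniformly.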

Experimental Findings

The paper evaluates two mainstream LMM paradigms: (1) end-to-end tuned LMMs (e.g., LLaVA, InstructBLIP); and (2) LLM-based agent systems that call external tools (e.g., MM-ReAct). The closed-source GPT-4V is also tested.

Core Capability Results

Table 2: Performance of each LMM on the six MM-Vet core capabilities (%). (In the original paper, green, orange, and blue mark the first-, second-, and third-best scores per column.)

| Model | Rec | OCR | Know | Gen | Spat | Math | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformers Agent (GPT-4) | 18.2 | 3.9 | 2.2 | 3.2 | 12.4 | 4.0 | 13.4$\pm$0.5 |
| MiniGPT-4-8B | 27.4 | 15.0 | 12.8 | 13.9 | 20.3 | 7.7 | 22.1$\pm$0.1 |
| BLIP-2-12B | 27.5 | 11.1 | 11.8 | 7.0 | 16.2 | 5.8 | 22.4$\pm$0.2 |
| LLaVA-7B | 28.0 | 17.1 | 16.3 | 18.9 | 21.2 | 11.5 | 23.8$\pm$0.6 |
| MiniGPT-4-14B | 29.9 | 16.1 | 20.4 | 22.1 | 22.2 | 3.8 | 24.4$\pm$0.4 |
| Otter-9B | 27.3 | 17.8 | 14.2 | 13.8 | 24.4 | 3.8 | 24.7$\pm$0.3 |
| OpenFlamingo-9B | 28.7 | 16.7 | 16.4 | 13.1 | 21.0 | 7.7 | 24.8$\pm$0.2 |
| InstructBLIP-14B | 30.8 | 16.0 | 9.8 | 9.0 | 21.1 | 10.5 | 25.6$\pm$0.3 |
| InstructBLIP-8B | 32.4 | 14.6 | 16.5 | 18.2 | 18.6 | 7.7 | 26.2$\pm$0.2 |
| LLaVA-13B | 30.9 | 20.1 | 23.5 | 26.4 | 24.3 | 7.7 | 26.4$\pm$0.1 |
| MM-ReAct-GPT-3.5 | 24.2 | 31.5 | 21.5 | 20.7 | 32.3 | 26.2 | 27.9$\pm$0.1 |
| LLaVA-7B (LLaMA-2) | 32.9 | 20.1 | 19.0 | 20.1 | 25.7 | 5.2 | 28.1$\pm$0.4 |
| LLaMA-Adapter v2-7B | 38.5 | 20.3 | 31.4 | 33.4 | 22.9 | 3.8 | 31.4$\pm$0.1 |
| LLaVA-13B (V1.3, 336px) | 38.1 | 22.3 | 25.2 | 25.8 | 31.3 | 11.2 | 32.5$\pm$0.1 |
| LLaVA-13B (LLaMA-2) | 39.2 | 22.7 | 26.5 | 29.3 | 29.6 | 7.7 | 32.9$\pm$0.1 |
| MM-ReAct-GPT-4 | 33.1 | 65.7 | 29.0 | 35.0 | 56.8 | 69.2 | 44.6$\pm$0.2 |
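The $\pm$ values in the Total column reflect variance across repeated runs of the GPT-4 evaluator, whose sampling is not fully deterministic: the total score is reported as the mean and standard deviation over multiple evaluation rounds. A minimal sketch of that summary (the run scores below are made up for illustration):

```python
from statistics import mean, stdev

def report_total(run_totals):
    """Summarize repeated evaluator runs as 'mean±std', as in Table 2."""
    return f"{mean(run_totals):.1f}±{stdev(run_totals):.1f}"

# Hypothetical total scores from five evaluation rounds of one model.
runs = [44.4, 44.6, 44.8, 44.5, 44.7]
print(report_total(runs))  # -> 44.6±0.2
```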

Capability Integration Results

Table 3: Performance of selected LMMs on MM-Vet capability-integration tasks (%) (selected columns shown).

| Model | Rec+Know+Gen | Rec | OCR+Spat | OCR+Spat+Math | OCR+Math | Total |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-13B (LLaMA-2) | 29.8 | 59.5 | 21.2 | 14.3 | 36.2 | 32.9$\pm$0.1 |
| LLaMA-Adapter v2-7B | 35.3 | 54.1 | 13.5 | 7.1 | 38.5 | 31.4$\pm$0.1 |
| MM-ReAct-GPT-4 | 22.5 | 33.0 | 69.2 | 78.6 | 83.0 | 44.6$\pm$0.2 |

Comparison with GPT-4V

Table 5: GPT-4V performance on the six MM-Vet core capabilities (%), with the strongest baselines for reference.

| Model | Rec | OCR | Know | Gen | Spat | Math | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-13B (LLaMA-2) | 39.2 | 22.7 | 26.5 | 29.3 | 29.6 | 7.7 | 32.9$\pm$0.1 |
| MM-ReAct-GPT-4 | 33.1 | 65.7 | 29.0 | 35.0 | 56.8 | 69.2 | 44.6$\pm$0.2 |
| GPT-4V | 67.5 | 68.3 | 56.2 | 60.7 | 69.4 | 58.6 | 67.7$\pm$0.3 |
| GPT-4V-Turbo-detail:high | 62.9 | 75.9 | 53.7 | 57.3 | 76.8 | 69.5 | 67.6$\pm$0.1 |