
Arena LeaderBoard

The table below shows the Elo rating ranks of the selected LVLM models, based on the evaluation data collected from our web Arena; a minimal sketch of how such Elo ratings are updated from pairwise votes is given after the table.

LVLM Arena Leaderboard 🏆. (Timeframe: May 13 - Aug 20, 2023)

| Rank | Model Name | Elo Rating | Affiliation | Description |
|---|---|---|---|---|
| 1 | 🥇 Otter-Image | 1047.5 | NTU | An LVLM with in-context instruction tuning. |
| 2 | 🥈 LLaMA-Adapter V2 | 1038.7 | Shanghai AI Lab | An LVLM with parameter-efficient instruction tuning. |
| 3 | 🥉 MiniGPT-4 | 1009.9 | KAUST | An LVLM with only the FC layer tuned on 3.5K instruction data. |
| 4 | LLaVA | 1002.8 | Wisconsin-Madison | An LVLM with the whole LLM trained on 158K visual instruction data. |
| 5 | VPGTrans | 999.6 | NUS | An LVLM that equips LLaMA with a visual model via transfer learning. |
| 6 | InstructBLIP | 995.5 | Salesforce | An LVLM with the Q-Former trained on many instruction QA datasets. |
| 7 | mPLUG-Owl | 985.6 | DAMO Academy | An LVLM whose LLM is instruction-tuned with the LoRA technique. |
| 8 | Otter | 972.1 | NTU | An LVLM with 1.3B adaptation parameters tuned on 158K instruction data. |
| 9 | BLIP-2 | 948.8 | Salesforce | An LVLM that is not tuned with instruction data. |
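The Elo ratings above are derived from pairwise user votes collected in the Arena. The Arena's exact settings (K-factor, initial rating, tie handling, update schedule) are not specified on this page, so the following is only a minimal sketch of a standard sequential Elo update under assumed values (K = 32, initial rating 1000), not the Arena's actual implementation.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(battles, k: float = 32.0, init: float = 1000.0) -> dict:
    """Sequentially update Elo ratings from (model_a, model_b, outcome) records.

    outcome is 1.0 if model_a wins, 0.0 if model_b wins, and 0.5 for a tie.
    The K-factor and initial rating are assumptions, not the Arena's settings.
    """
    ratings: dict = {}
    for model_a, model_b, outcome in battles:
        r_a = ratings.get(model_a, init)
        r_b = ratings.get(model_b, init)
        e_a = expected_score(r_a, r_b)
        ratings[model_a] = r_a + k * (outcome - e_a)
        ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


# Example: one hypothetical battle in which Otter-Image beats BLIP-2.
print(update_elo([("Otter-Image", "BLIP-2", 1.0)]))
```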

Quantitative LeaderBoard

In LVLM-eHub, we evaluate each LVLM across six categories of ability. The reported Average is consistent with the arithmetic mean of the six per-ability scores (see the check below the table).

| Model Name | Average | Visual Perception | Visual Knowledge Acquisition | Visual Reasoning | Visual Commonsense | Object Hallucination | Embodied Intelligence |
|---|---|---|---|---|---|---|---|
| LLaMA-Adapter V2 | 0.725 | 0.813 | 0.443 | 0.833 | 0.589 | 0.751 | 0.922 |
| LLaVA | 0.671 | 0.615 | 0.377 | 0.771 | 0.791 | 0.595 | 0.879 |
| VPGTrans | 0.624 | 0.563 | 0.720 | 0.588 | 0.522 | 0.565 | 0.789 |
| MiniGPT-4 | 0.594 | 0.727 | 0.346 | 0.527 | 0.565 | 0.594 | 0.805 |
| InstructBLIP | 0.928 | 0.928 | 0.967 | 0.908 | 0.995 | 1.00 | 0.772 |
| mPLUG-Owl | 0.596 | 0.831 | 0.286 | 0.420 | 0.579 | 0.673 | 0.785 |
| Otter | 0.565 | 0.661 | 0.237 | 0.513 | 0.582 | 0.633 | 0.761 |
| BLIP-2 | 0.783 | 0.858 | 0.927 | 0.759 | 0.535 | 0.945 | 0.674 |
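The Average column above is consistent with the arithmetic mean of the six ability scores. A minimal check using the LLaMA-Adapter V2 row:

```python
# Scores copied from the LLaMA-Adapter V2 row of the table above.
abilities = {
    "Visual Perception": 0.813,
    "Visual Knowledge Acquisition": 0.443,
    "Visual Reasoning": 0.833,
    "Visual Commonsense": 0.589,
    "Object Hallucination": 0.751,
    "Embodied Intelligence": 0.922,
}
average = sum(abilities.values()) / len(abilities)
print(round(average, 3))  # 0.725, matching the Average column
```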

Visual Perception

The visual perception ability of LVLM models is evaluated on three tasks: image classification, object counting (OC), and multi-class identification (MCI). Accuracy is used as the metric for all tasks.

| Model Name | ImageNet1K | CIFAR10 | Pets37 | Flowers102 | COCO (OC) | VCR (OC) | COCO (MCI) | VCR (MCI) |
|---|---|---|---|---|---|---|---|---|
| BLIP2 | 23.71 | 58.20 | 34.79 | 19.44 | 48.90 | 25.05 | 86.06 | 66.59 |
| InstructBLIP | 24.51 | 67.24 | 38.86 | 21.78 | 46.65 | 29.29 | 87.81 | 76.49 |
| LA-V2 | 25.89 | 64.86 | 24.39 | 22.34 | 38.50 | 26.51 | 82.90 | 50.66 |
| LLaVA | 23.50 | 67.96 | 9.09 | 8.38 | 20.56 | 24.60 | 49.66 | 66.90 |
| MiniGPT-4 | 21.17 | 61.39 | 18.90 | 21.70 | 20.86 | 25.26 | 72.70 | 66.02 |
| mPLUG-Owl | 26.68 | 59.66 | 43.16 | 22.91 | 34.14 | 27.98 | 58.30 | 55.56 |
| Otter | 19.29 | 65.42 | 5.79 | 6.13 | 46.14 | 41.06 | 51.03 | 51.60 |
| VPGTrans | 19.75 | 60.88 | 10.88 | 7.97 | 27.30 | 19.46 | 52.34 | 48.80 |
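Since LVLMs produce free-form text, the exact answer-matching rule behind these accuracies is not stated on this page. The snippet below is only an illustrative accuracy computation that assumes a simple normalized exact match between the model output and the ground-truth label; the official evaluation may use a more permissive matching scheme.

```python
def normalize(text: str) -> str:
    """Lower-case and strip punctuation (an assumed normalization, not the official rule)."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Percentage of predictions that exactly match their labels after normalization."""
    correct = sum(normalize(p) == normalize(l) for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)


print(accuracy(["Golden Retriever.", "two"], ["golden retriever", "3"]))  # 50.0
```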

Visual Knowledge Acquisition

The visual knowledge acquisition ability of LVLM models is evaluated on three kinds of tasks: optical character recognition (OCR), key information extraction (KIE), and image captioning. In the table below, the SROIE and FUNSD datasets evaluate KIE performance with entity-level F1 score, while the NoCaps, Flickr-30k, and WHOOPS datasets test image captioning performance with the CIDEr score. The remaining datasets evaluate LVLM models on OCR tasks with word accuracy.

| Model Name | IIIT5K | IC13 | IC15 | Total-Text | CUTE80 | SVT | SVTP | COCO-Text | WordArt | CTW | HOST | WOST | SROIE | FUNSD | NoCaps | Flickr-30k | WHOOPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP2 | 80.17 | 81.13 | 66.68 | 68.31 | 85.07 | 85.78 | 77.34 | 53.62 | 73.66 | 67.43 | 57.28 | 68.83 | 0.08 | 1.02 | 48.58 | 46.48 | 96.12 |
| InstructBLIP | 83.90 | 82.08 | 73.57 | 71.51 | 86.11 | 86.86 | 80.93 | 58.25 | 75.12 | 68.58 | 61.22 | 73.26 | 0.09 | 1.03 | 46.33 | 50.45 | 97.98 |
| LA-V2 | 36.30 | 20.87 | 29.40 | 30.93 | 35.76 | 20.40 | 31.01 | 20.94 | 38.98 | 18.13 | 16.60 | 21.73 | 0.02 | 2.16 | 41.66 | 30.49 | 57.60 |
| LLaVA | 31.57 | 16.39 | 26.58 | 24.51 | 36.46 | 18.55 | 27.44 | 18.05 | 35.87 | 16.73 | 15.94 | 20.49 | 0.01 | 1.93 | 33.09 | 27.65 | 34.36 |
| MiniGPT-4 | 25.00 | 16.69 | 22.05 | 18.65 | 33.33 | 15.46 | 20.31 | 11.86 | 31.90 | 14.95 | 13.45 | 19.12 | 0.02 | 1.27 | 42.43 | 26.04 | 47.36 |
| mPLUG-Owl | 25.30 | 14.98 | 20.99 | 20.63 | 31.94 | 14.37 | 20.78 | 12.88 | 31.90 | 13.87 | 11.88 | 14.65 | 0.01 | 0.41 | 28.30 | 20.53 | 42.73 |
| Otter | 17.57 | 9.67 | 18.49 | 14.81 | 18.75 | 10.51 | 19.22 | 11.30 | 21.05 | 10.05 | 10.14 | 12.29 | 0.01 | 1.91 | 29.23 | 23.00 | 32.70 |
| VPGTrans | 62.87 | 71.11 | 55.90 | 54.76 | 70.49 | 72.02 | 64.50 | 36.98 | 62.34 | 52.80 | 50.58 | 57.66 | 0.02 | 1.20 | 48.13 | 32.51 | 38.38 |
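For reference, the entity-level F1 used for SROIE and FUNSD treats a predicted (entity, value) pair as correct only when it matches the ground truth for that entity. The sketch below assumes exact string matching; the official scoring scripts may apply additional normalization.

```python
def entity_f1(pred: dict, gold: dict) -> float:
    """Entity-level F1 over (entity, value) pairs, assuming exact-match values."""
    tp = sum(1 for key, value in pred.items() if gold.get(key) == value)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two predicted entities is correct,
# and one of three gold entities is recovered, giving F1 = 0.4.
print(entity_f1({"company": "ACME", "total": "9.99"},
                {"company": "ACME", "total": "10.99", "date": "2023-01-01"}))
```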

Visual Reasoning

In LVLM-eHub, the evaluation of LVLM models' visual reasoning ability encompasses visual question answering (VQA), knowledge-grounded image description (KGID), and visual entailment (VE). In the table below, all datasets belong to the VQA task, except that ScienceQA IMG and VizWiz are used for the KGID task and SNLI-VE is used for the VE task. Mean reciprocal rank (MRR) is used for the Visdial dataset, while accuracy is used for the remaining datasets.

| Model Name | DocVQA | TextVQA | STVQA | OCR-VQA | OKVQA | GQA | Visdial | IconQA | VSR | WHOOPS | ScienceQA IMG | VizWiz | SNLI-VE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP2 | 4.75 | 31.98 | 20.98 | 38.85 | 44.93 | 45.53 | 10.73 | 62.82 | 63.63 | 24.87 | 60.73 | 65.44 | 32.00 |
| InstructBLIP | 5.89 | 39.60 | 28.30 | 60.20 | 60.52 | 49.96 | 45.20 | 56.25 | 41.28 | 30.13 | 46.26 | 65.31 | 59.00 |
| LA-V2 | 8.13 | 43.76 | 32.33 | 38.12 | 55.93 | 43.93 | 12.92 | 41.83 | 50.63 | 24.15 | 54.19 | 62.07 | 58.80 |
| LLaVA | 6.26 | 38.92 | 28.40 | 23.40 | 54.36 | 41.30 | 14.66 | 42.95 | 51.24 | 24.39 | 49.33 | 62.42 | 57.80 |
| MiniGPT-4 | 2.65 | 19.40 | 13.55 | 16.85 | 37.48 | 30.82 | 10.31 | 37.59 | 41.56 | 17.91 | 25.43 | 47.48 | 54.80 |
| mPLUG-Owl | 2.24 | 38.76 | 12.10 | 8.84 | 22.89 | 14.02 | 13.34 | 11.64 | 24.74 | 20.70 | 2.80 | 38.99 | 54.50 |
| Otter | 3.44 | 21.52 | 15.23 | 19.50 | 49.01 | 38.12 | 11.67 | 26.77 | 6.40 | 15.14 | 27.22 | 50.04 | 52.60 |
| VPGTrans | 3.53 | 21.98 | 17.13 | 21.71 | 44.51 | 32.99 | 9.70 | 38.22 | 48.77 | 15.88 | 36.99 | 53.23 | 52.20 |
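Mean reciprocal rank, used for Visdial, averages the reciprocal of the (1-based) rank at which the ground-truth answer appears among the candidate answers; a minimal sketch:

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """MRR over the 1-based ranks of the ground-truth answers among the candidates."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Ground-truth answers ranked 1st, 2nd and 4th across three dialogue turns.
print(mean_reciprocal_rank([1, 2, 4]))  # ~0.583
```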

Visual Commonsense

In the evaluation of visual commonsense ability, two challenging benchmarks are used: ImageNetVC and Visual Commonsense Reasoning (VCR).

| Model Name | ImageNetVC (Color) | ImageNetVC (Shape) | ImageNetVC (Material) | ImageNetVC (Component) | ImageNetVC (Others) | VCR |
|---|---|---|---|---|---|---|
| BLIP2 | 26.22 | 34.21 | 35.79 | 50.71 | 34.48 | 31.60 |
| InstructBLIP | 67.78 | 59.06 | 63.50 | 83.25 | 68.37 | 54.20 |
| LA-V2 | 36.12 | 28.63 | 33.86 | 50.13 | 32.69 | 49.80 |
| LLaVA | 43.70 | 39.10 | 65.58 | 56.73 | 59.38 | 48.20 |
| MiniGPT-4 | 24.49 | 23.54 | 28.56 | 59.26 | 39.38 | 49.00 |
| mPLUG-Owl | 26.20 | 34.19 | 35.82 | 50.73 | 34.50 | 46.00 |
| Otter | 26.21 | 34.20 | 35.81 | 50.72 | 34.49 | 47.00 |
| VPGTrans | 23.34 | 23.92 | 27.26 | 56.43 | 35.83 | 41.40 |

Object Hallucination

Based on the MSCOCO dataset, LVLM-eHub evaluates object hallucination under the MSCOCO-Random, MSCOCO-Popular, and MSCOCO-Adversarial settings. Notably, from Random to Adversarial, the questions become increasingly challenging.

| Model Name | Random (Accuracy) | Random (Precision) | Random (Recall) | Random (F1-Score) | Random (Yes %) | Popular (Accuracy) | Popular (Precision) | Popular (Recall) | Popular (F1-Score) | Popular (Yes %) | Adversarial (Accuracy) | Adversarial (Precision) | Adversarial (Recall) | Adversarial (F1-Score) | Adversarial (Yes %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP2 | 82.21 | 97.48 | 67.27 | 79.61 | 35.58 | 80.10 | 90.49 | 67.27 | 77.17 | 37.17 | 78.52 | 86.83 | 67.27 | 75.81 | 38.73 |
| InstructBLIP | 88.83 | 96.01 | 81.60 | 88.23 | 43.99 | 84.15 | 85.96 | 81.60 | 83.72 | 47.47 | 81.95 | 82.05 | 81.60 | 81.82 | 49.77 |
| LA-V2 | 74.44 | 68.24 | 94.00 | 79.08 | 70.99 | 56.82 | 53.89 | 94.20 | 68.56 | 87.40 | 60.52 | 54.58 | 96.45 | 69.12 | 88.23 |
| LLaVA | 51.52 | 51.54 | 100.00 | 68.03 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 |
| MiniGPT-4 | 52.58 | 68.63 | 57.50 | 62.57 | 44.25 | 49.31 | 63.56 | 58.03 | 60.67 | 48.29 | 49.62 | 62.55 | 58.71 | 68.47 | 48.54 |
| mPLUG-Owl | 61.37 | 57.89 | 97.52 | 72.65 | 87.15 | 55.83 | 53.61 | 97.13 | 69.09 | 91.20 | 54.43 | 52.73 | 97.59 | 72.09 | 92.95 |
| Otter | 61.40 | 57.82 | 95.92 | 72.15 | 85.76 | 49.56 | 50.07 | 95.92 | 65.79 | 96.58 | 50.68 | 50.56 | 95.92 | 66.22 | 95.31 |
| VPGTrans | 48.28 | 74.17 | 56.78 | 64.32 | 47.38 | 47.86 | 70.37 | 55.90 | 62.92 | 51.92 | 47.86 | 69.76 | 59.22 | 64.06 | 52.27 |
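All three settings pose yes/no questions about whether an object is present in the image, so the reported columns follow from the binary confusion matrix, with "yes" as the positive class; the Yes column appears to be the percentage of "yes" answers. The sketch below assumes each model response has already been parsed into a "yes"/"no" decision.

```python
def hallucination_metrics(preds: list[str], labels: list[str]) -> dict:
    """Accuracy, precision, recall, F1 and yes-ratio for yes/no object-presence questions."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(preds, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(preds, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(preds, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": 100 * (tp + tn) / len(preds),
        "Precision": 100 * precision,
        "Recall": 100 * recall,
        "F1-Score": 100 * f1,
        "Yes": 100 * (tp + fp) / len(preds),
    }


# A model that always answers "yes" gets 100% recall but only 50% accuracy on a balanced set.
print(hallucination_metrics(["yes", "yes", "yes", "yes"], ["yes", "no", "yes", "no"]))
```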

Embodied Intelligence

To appraise the quality of LVLM models' planning outputs, we conducted a user study involving 15 participants to evaluate the embodied intelligence of LVLM models. The study comprised six household scenarios carefully selected from VirtualHome.

| Model Name | Object Recognition | Spatial Relationship | Conciseness | Reasonability | Executability |
|---|---|---|---|---|---|
| BLIP2 | 2.03 | 1.68 | 3.25 | 2.78 | 2.88 |
| InstructBLIP | 3.08 | 2.78 | 2.48 | 3.20 | 3.10 |
| LA-V2 | 3.81 | 3.71 | 2.04 | 4.04 | 4.08 |
| LLaVA | 3.88 | 3.61 | 1.86 | 3.70 | 3.82 |
| MiniGPT-4 | 3.70 | 3.47 | 1.62 | 3.54 | 3.11 |
| mPLUG-Owl | 3.42 | 3.22 | 1.48 | 3.44 | 3.54 |
| Otter | 3.38 | 3.10 | 1.86 | 3.07 | 3.12 |
| VPGTrans | 3.43 | 3.22 | 1.76 | 3.35 | 3.35 |