Arena Leaderboard
The table below shows the Elo-rating ranking of the selected LVLM models, based on the evaluation data collected from our web Arena.
LVLM Arena Leaderboard 🏆. (Timeframe: May 13 - Aug 20, 2023)
Rank | Model Name | Elo Rating | Affiliation | Description |
---|---|---|---|---|
1 | 🥇Otter-Image | 1047.5 | NTU | an LVLM with in-context instruction tuning. |
2 | 🥈LLaMA-Adapter V2 | 1038.7 | Shanghai AI Lab | an LVLM with parameter-efficient instruction tuning. |
3 | 🥉MiniGPT-4 | 1009.9 | KAUST | an LVLM with only the FC layer tuned on 3.5K instruction data. |
4 | LLaVA | 1002.8 | Wisconsin-Madison | an LVLM with the whole LLM trained on 158K visual instruction data. |
5 | VPGTrans | 999.6 | NUS | an LVLM that equips LLaMA with a visual model via transfer learning. |
6 | InstructBLIP | 995.5 | Salesforce | an LVLM with the Q-Former trained on many instruction QA datasets. |
7 | mPLUG-Owl | 985.6 | DAMO Academy | an LVLM whose LLM is instruction-tuned with the LoRA technique. |
8 | Otter | 972.1 | NTU | an LVLM with 1.3B adaptation parameters tuned on 158K instruction data. |
9 | BLIP-2 | 948.8 | Salesforce | an LVLM not tuned on instruction data. |
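As a reference for how these ratings arise, here is a minimal sketch of the standard Elo update rule applied to pairwise battle records; the K-factor of 32, the initial rating of 1000, and the sample battle log are illustrative assumptions rather than the Arena's exact configuration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one battle outcome in place; k = 32 is an assumed K-factor."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)


# Hypothetical battle log: (winner, loser) pairs derived from user votes.
battles = [("Otter-Image", "BLIP-2"), ("MiniGPT-4", "LLaVA")]
ratings = {"Otter-Image": 1000.0, "BLIP-2": 1000.0, "MiniGPT-4": 1000.0, "LLaVA": 1000.0}
for winner, loser in battles:
    update_elo(ratings, winner, loser)
```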
Quantitative Leaderboard
In LVLM-eHub, we evaluate LVLM models on six categories of ability.
Model Name | Average | Visual Perception | Visual Knowledge Acquisition | Visual Reasoning | Visual Commonsense | Object Hallucination | Embodied Intelligence |
---|---|---|---|---|---|---|---|
LLaMA-Adapter V2 | 0.725 | 0.813 | 0.443 | 0.833 | 0.589 | 0.751 | 0.922 |
LLaVA | 0.671 | 0.615 | 0.377 | 0.771 | 0.791 | 0.595 | 0.879 |
VPGTrans | 0.624 | 0.563 | 0.720 | 0.588 | 0.522 | 0.565 | 0.789 |
MiniGPT-4 | 0.594 | 0.727 | 0.346 | 0.527 | 0.565 | 0.594 | 0.805 |
InstructBLIP | 0.928 | 0.928 | 0.967 | 0.908 | 0.995 | 1.00 | 0.772 |
mPLUG-Owl | 0.596 | 0.831 | 0.286 | 0.420 | 0.579 | 0.673 | 0.785 |
Otter | 0.565 | 0.661 | 0.237 | 0.513 | 0.582 | 0.633 | 0.761 |
BLIP-2 | 0.783 | 0.858 | 0.927 | 0.759 | 0.535 | 0.945 | 0.674 |
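The Average column is consistent with the unweighted mean of the six ability scores; for example, LLaMA-Adapter V2's row gives (0.813 + 0.443 + 0.833 + 0.589 + 0.751 + 0.922) / 6 ≈ 0.725. A minimal sketch, assuming an unweighted mean:

```python
from statistics import mean

# Per-ability scores for LLaMA-Adapter V2, copied from the table above.
scores = {
    "Visual Perception": 0.813,
    "Visual Knowledge Acquisition": 0.443,
    "Visual Reasoning": 0.833,
    "Visual Commonsense": 0.589,
    "Object Hallucination": 0.751,
    "Embodied Intelligence": 0.922,
}
print(round(mean(scores.values()), 3))  # 0.725, matching the Average column
```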
Visual Perception
The visual perception ability of LVLM models is evaluated on three tasks: image classification, object counting (OC), and multi-class identification (MCI). Accuracy is used as the metric for all tasks.
Model Name | ImageNet1K | CIFAR10 | Pets37 | Flowers102 | COCO (OC) | VCR (OC) | COCO (MCI) | VCR (MCI) |
---|---|---|---|---|---|---|---|---|
BLIP2 | 23.71 | 58.20 | 34.79 | 19.44 | 48.90 | 25.05 | 86.06 | 66.59 |
InstructBLIP | 24.51 | 67.24 | 38.86 | 21.78 | 46.65 | 29.29 | 87.81 | 76.49 |
LA-V2 | 25.89 | 64.86 | 24.39 | 22.34 | 38.50 | 26.51 | 82.90 | 50.66 |
LLaVA | 23.50 | 67.96 | 9.09 | 8.38 | 20.56 | 24.60 | 49.66 | 66.90 |
MiniGPT-4 | 21.17 | 61.39 | 18.90 | 21.70 | 20.86 | 25.26 | 72.70 | 66.02 |
mPLUG-Owl | 26.68 | 59.66 | 43.16 | 22.91 | 34.14 | 27.98 | 58.30 | 55.56 |
Otter | 19.29 | 65.42 | 5.79 | 6.13 | 46.14 | 41.06 | 51.03 | 51.60 |
VPGTrans | 19.75 | 60.88 | 10.88 | 7.97 | 27.30 | 19.46 | 52.34 | 48.80 |
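Because LVLMs answer in free-form text, scoring these tasks requires matching the generated answer to the ground-truth label before computing accuracy. The sketch below uses normalized substring matching, which is an assumption; LVLM-eHub's exact answer-matching rules may differ.

```python
def accuracy(predictions: list, labels: list) -> float:
    """Percentage of model answers that contain the ground-truth label.

    Normalized substring matching is an assumed scheme, not necessarily
    the benchmark's exact rule.
    """
    hits = sum(label.strip().lower() in pred.strip().lower()
               for pred, label in zip(predictions, labels))
    return 100.0 * hits / len(labels)


print(accuracy(["A photo of a Golden Retriever."], ["golden retriever"]))  # 100.0
```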
Visual Knowledge Acquisition
The visual knowledge acquisition ability of LVLM models is evaluated on three kinds of tasks: optical character recognition (OCR), key information extraction (KIE), and image captioning. In the table below, the SROIE and FUNSD datasets evaluate the KIE performance of LVLM models with entity-level F1 score, while the NoCaps, Flickr-30k, and WHOOPS datasets test the image captioning performance of LVLM models with the CIDEr score. The remaining datasets evaluate LVLM models on OCR tasks with word accuracy.
Model Name | IIIT5K | IC13 | IC15 | Total-Text | CUTE80 | SVT | SVTP | COCO-Text | WordArt | CTW | HOST | WOST | SROIE | FUNSD | NoCaps | Flickr-30k | WHOOPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 80.17 | 81.13 | 66.68 | 68.31 | 85.07 | 85.78 | 77.34 | 53.62 | 73.66 | 67.43 | 57.28 | 68.83 | 0.08 | 1.02 | 48.58 | 46.48 | 96.12 |
InstructBLIP | 83.90 | 82.08 | 73.57 | 71.51 | 86.11 | 86.86 | 80.93 | 58.25 | 75.12 | 68.58 | 61.22 | 73.26 | 0.09 | 1.03 | 46.33 | 50.45 | 97.98 |
LA-V2 | 36.30 | 20.87 | 29.40 | 30.93 | 35.76 | 20.40 | 31.01 | 20.94 | 38.98 | 18.13 | 16.60 | 21.73 | 0.02 | 2.16 | 41.66 | 30.49 | 57.60 |
LLaVA | 31.57 | 16.39 | 26.58 | 24.51 | 36.46 | 18.55 | 27.44 | 18.05 | 35.87 | 16.73 | 15.94 | 20.49 | 0.01 | 1.93 | 33.09 | 27.65 | 34.36 |
MiniGPT-4 | 25.00 | 16.69 | 22.05 | 18.65 | 33.33 | 15.46 | 20.31 | 11.86 | 31.90 | 14.95 | 13.45 | 19.12 | 0.02 | 1.27 | 42.43 | 26.04 | 47.36 |
mPLUG-Owl | 25.30 | 14.98 | 20.99 | 20.63 | 31.94 | 14.37 | 20.78 | 12.88 | 31.90 | 13.87 | 11.88 | 14.65 | 0.01 | 0.41 | 28.30 | 20.53 | 42.73 |
Otter | 17.57 | 9.67 | 18.49 | 14.81 | 18.75 | 10.51 | 19.22 | 11.30 | 21.05 | 10.05 | 10.14 | 12.29 | 0.01 | 1.91 | 29.23 | 23.00 | 32.70 |
VPGTrans | 62.87 | 71.11 | 55.90 | 54.76 | 70.49 | 72.02 | 64.50 | 36.98 | 62.34 | 52.80 | 50.58 | 57.66 | 0.02 | 1.20 | 48.13 | 32.51 | 38.38 |
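As a sketch of the KIE metric, entity-level F1 counts a predicted entity as correct only when it matches a ground-truth entity exactly; treating entities as exact (type, value) pairs is an assumption about the scoring details.

```python
def entity_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 over exact (entity_type, value) pairs."""
    tp = len(predicted & gold)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {("company", "ACME LTD"), ("total", "12.50")}
pred = {("company", "ACME LTD"), ("total", "13.50")}  # one value is wrong
print(entity_f1(pred, gold))  # 0.5
```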
Visual Reasoning
In LVLM-eHub, the evaluation of LVLM models' visual reasoning ability encompasses visual question answering (VQA), knowledge-grounded image description (KGID), and visual entailment (VE). In the table below, ScienceQA IMG and VizWiz are used for the KGID task, SNLI-VE is used for the VE task, and all remaining datasets are used for the VQA task. Mean reciprocal rank (MRR) is used as the metric for the Visdial dataset; accuracy is used for the remaining datasets.
Model Name | DocVQA | TextVQA | STVQA | OCR-VQA | OKVQA | GQA | Visdial | IconQA | VSR | WHOOPS | ScienceQA IMG | VizWiz | SNLI-VE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 4.75 | 31.98 | 20.98 | 38.85 | 44.93 | 45.53 | 10.73 | 62.82 | 63.63 | 24.87 | 60.73 | 65.44 | 32.00 |
InstructBLIP | 5.89 | 39.60 | 28.30 | 60.20 | 60.52 | 49.96 | 45.20 | 56.25 | 41.28 | 30.13 | 46.26 | 65.31 | 59.00 |
LA-V2 | 8.13 | 43.76 | 32.33 | 38.12 | 55.93 | 43.93 | 12.92 | 41.83 | 50.63 | 24.15 | 54.19 | 62.07 | 58.80 |
LLaVA | 6.26 | 38.92 | 28.40 | 23.40 | 54.36 | 41.30 | 14.66 | 42.95 | 51.24 | 24.39 | 49.33 | 62.42 | 57.80 |
MiniGPT-4 | 2.65 | 19.40 | 13.55 | 16.85 | 37.48 | 30.82 | 10.31 | 37.59 | 41.56 | 17.91 | 25.43 | 47.48 | 54.80 |
mPLUG-Owl | 2.24 | 38.76 | 12.10 | 8.84 | 22.89 | 14.02 | 13.34 | 11.64 | 24.74 | 20.70 | 2.80 | 38.99 | 54.50 |
Otter | 3.44 | 21.52 | 15.23 | 19.50 | 49.01 | 38.12 | 11.67 | 26.77 | 6.40 | 15.14 | 27.22 | 50.04 | 52.60 |
VPGTrans | 3.53 | 21.98 | 17.13 | 21.71 | 44.51 | 32.99 | 9.70 | 38.22 | 48.77 | 15.88 | 36.99 | 53.23 | 52.20 |
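For the Visdial metric, MRR averages the reciprocal of the rank at which the ground-truth answer appears in the model's sorted candidate list; the sample ranks below are hypothetical.

```python
def mean_reciprocal_rank(ranks: list) -> float:
    """Average of 1/rank of the ground-truth answer (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical 1-based ranks of the ground-truth answer across three dialogs.
print(mean_reciprocal_rank([1, 4, 10]))  # ≈ 0.45
```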
Visual Commonsense
In the evaluation of visual commonsense ability, two challenging benchmarks are used: ImageNetVC and Visual Commonsense Reasoning (VCR).
Model Name | ImageNetVC (Color) | ImageNetVC (Shape) | ImageNetVC (Material) | ImageNetVC (Component) | ImageNetVC (Others) | VCR |
---|---|---|---|---|---|---|
BLIP2 | 26.22 | 34.21 | 35.79 | 50.71 | 34.48 | 31.60 |
InstructBLIP | 67.78 | 59.06 | 63.50 | 83.25 | 68.37 | 54.20 |
LA-V2 | 36.12 | 28.63 | 33.86 | 50.13 | 32.69 | 49.80 |
LLaVA | 43.70 | 39.10 | 65.58 | 56.73 | 59.38 | 48.20 |
MiniGPT-4 | 24.49 | 23.54 | 28.56 | 59.26 | 39.38 | 49.00 |
mPLUG-Owl | 26.20 | 34.19 | 35.82 | 50.73 | 34.50 | 46.00 |
Otter | 26.21 | 34.20 | 35.81 | 50.72 | 34.49 | 47.00 |
VPGTrans | 23.34 | 23.92 | 27.26 | 56.43 | 35.83 | 41.40 |
Object Hallucination
Based on the MSCOCO dataset, LVLM-eHub evaluates object hallucination on the MSCOCO-Random/Popular/Adversarial splits. Notably, from Random to Adversarial, the questions become increasingly challenging.
Model Name | Random (Accuracy) | Random (Precision) | Random (Recall) | Random (F1-Score) | Random (Yes) | Popular (Accuracy) | Popular (Precision) | Popular (Recall) | Popular (F1-Score) | Popular (Yes) | Adversarial (Accuracy) | Adversarial (Precision) | Adversarial (Recall) | Adversarial (F1-Score) | Adversarial (Yes) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 82.21 | 97.48 | 67.27 | 79.61 | 35.58 | 80.10 | 90.49 | 67.27 | 77.17 | 37.17 | 78.52 | 86.83 | 67.27 | 75.81 | 38.73 |
InstructBLIP | 88.83 | 96.01 | 81.60 | 88.23 | 43.99 | 84.15 | 85.96 | 81.60 | 83.72 | 47.47 | 81.95 | 82.05 | 81.60 | 81.82 | 49.77 |
LA-V2 | 74.44 | 68.24 | 94.00 | 79.08 | 70.99 | 56.82 | 53.89 | 94.20 | 68.56 | 87.40 | 60.52 | 54.58 | 96.45 | 69.12 | 88.23 |
LLaVA | 51.52 | 51.54 | 100.00 | 68.03 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 |
MiniGPT-4 | 52.58 | 68.63 | 57.50 | 62.57 | 44.25 | 49.31 | 63.56 | 58.03 | 60.67 | 48.29 | 49.62 | 62.55 | 58.71 | 68.47 | 48.54 |
mPLUG-Owl | 61.37 | 57.89 | 97.52 | 72.65 | 87.15 | 55.83 | 53.61 | 97.13 | 69.09 | 91.20 | 54.43 | 52.73 | 97.59 | 72.09 | 92.95 |
Otter | 61.40 | 57.82 | 95.92 | 72.15 | 85.76 | 49.56 | 50.07 | 95.92 | 65.79 | 96.58 | 50.68 | 50.56 | 95.92 | 66.22 | 95.31 |
VPGTrans | 48.28 | 74.17 | 56.78 | 64.32 | 47.38 | 47.86 | 70.37 | 55.90 | 62.92 | 51.92 | 47.86 | 69.76 | 59.22 | 64.06 | 52.27 |
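The five columns per split follow the standard binary-classification definitions over yes/no object-existence questions, with "yes" as the positive class; the Yes column is the percentage of questions a model answers with "yes", which signals hallucination tendency (e.g., LLaVA's 100.00 means it always answered "yes"). A minimal sketch; the function name and interface below are illustrative, not LVLM-eHub's API.

```python
def hallucination_metrics(preds: list, labels: list) -> dict:
    """Binary yes/no metrics with "yes" as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and t == "yes" for p, t in pairs)
    fp = sum(p == "yes" and t == "no" for p, t in pairs)
    fn = sum(p == "no" and t == "yes" for p, t in pairs)
    tn = sum(p == "no" and t == "no" for p, t in pairs)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": 100.0 * (tp + tn) / n,
        "Precision": 100.0 * precision,
        "Recall": 100.0 * recall,
        "F1-Score": 100.0 * f1,
        "Yes": 100.0 * (tp + fp) / n,
    }
```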
Embodied Intelligence
To appraise the quality of LVLM models' planning outputs, we conducted a user study involving 15 participants to evaluate the embodied intelligence of LVLM models. The study comprised six household scenarios carefully selected from VirtualHome.
Model Name | Object Recognition | Spatial Relationship | Conciseness | Reasonability | Executability |
---|---|---|---|---|---|
BLIP2 | 2.03 | 1.68 | 3.25 | 2.78 | 2.88 |
InstructBLIP | 3.08 | 2.78 | 2.48 | 3.20 | 3.10 |
LA-V2 | 3.81 | 3.71 | 2.04 | 4.04 | 4.08 |
LLaVA | 3.88 | 3.61 | 1.86 | 3.70 | 3.82 |
MiniGPT-4 | 3.70 | 3.47 | 1.62 | 3.54 | 3.11 |
mPLUG-Owl | 3.42 | 3.22 | 1.48 | 3.44 | 3.54 |
Otter | 3.38 | 3.10 | 1.86 | 3.07 | 3.12 |
VPGTrans | 3.43 | 3.22 | 1.76 | 3.35 | 3.35 |