Arena Leaderboard
The table below shows the Elo-rating ranking of the selected LVLM models, based on the evaluation data collected from our web Arena.
LVLM Arena Leaderboard 🏆. (Timeframe: May 13 - Aug 20, 2023)
Rank | Model Name | Elo Rating | Affiliation | Description |
---|---|---|---|---|
1 | 🥇Otter-Image | 1047.5 | NTU | an LVLM with in-context instruction tuning. |
2 | 🥈LLaMA-Adapter V2 | 1038.7 | Shanghai AI Lab | an LVLM with parameter-efficient instruction tuning. |
3 | 🥉MiniGPT-4 | 1009.9 | KAUST | an LVLM with only the FC layer tuned on 3.5K instruction data. |
4 | LLaVA | 1002.8 | Wisconsin-Madison | an LVLM with the whole LLM trained on 158K visual instruction data. |
5 | VPGTrans | 999.6 | NUS | an LVLM that equips LLaMA with a visual model via transfer learning. |
6 | InstructBLIP | 995.5 | Salesforce | an LVLM with the Q-Former trained on many instruction QA datasets. |
7 | mPLUG-Owl | 985.6 | DAMO Academy | an LVLM whose LLM is instruction-tuned with the LoRA technique. |
8 | Otter | 972.1 | NTU | an LVLM with 1.3B adaptation parameters tuned on 158K instruction data. |
9 | BLIP-2 | 948.8 | Salesforce | an LVLM not tuned on instruction data. |
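As a reference for how these ratings arise, here is a minimal sketch of the standard Elo update rule applied to pairwise battle records; the K-factor of 32, the initial rating of 1000, and the sample battle log are illustrative assumptions rather than the Arena's exact configuration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one battle outcome in place; k = 32 is an assumed K-factor."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)


# Hypothetical battle log: (winner, loser) pairs derived from user votes.
battles = [("Otter-Image", "BLIP-2"), ("MiniGPT-4", "LLaVA")]
ratings = {"Otter-Image": 1000.0, "BLIP-2": 1000.0, "MiniGPT-4": 1000.0, "LLaVA": 1000.0}
for winner, loser in battles:
    update_elo(ratings, winner, loser)
```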
Quantitative Leaderboard
In LVLM-eHub, we evaluate LVLM models on six categories of ability.
Model Name | Average | Visual Perception | Visual Knowledge Acquisition | Visual Reasoning | Visual Commonsense | Object Hallucination | Embodied Intelligence |
---|---|---|---|---|---|---|---|
LLaMA-Adapter V2 | 0.725 | 0.813 | 0.443 | 0.833 | 0.589 | 0.751 | 0.922 |
LLaVA | 0.671 | 0.615 | 0.377 | 0.771 | 0.791 | 0.595 | 0.879 |
VPGTrans | 0.624 | 0.563 | 0.720 | 0.588 | 0.522 | 0.565 | 0.789 |
MiniGPT-4 | 0.594 | 0.727 | 0.346 | 0.527 | 0.565 | 0.594 | 0.805 |
InstructBLIP | 0.928 | 0.928 | 0.967 | 0.908 | 0.995 | 1.00 | 0.772 |
mPLUG-Owl | 0.596 | 0.831 | 0.286 | 0.420 | 0.579 | 0.673 | 0.785 |
Otter | 0.565 | 0.661 | 0.237 | 0.513 | 0.582 | 0.633 | 0.761 |
BLIP-2 | 0.783 | 0.858 | 0.927 | 0.759 | 0.535 | 0.945 | 0.674 |
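The Average column is consistent with the unweighted mean of the six ability scores; for example, LLaMA-Adapter V2's row gives (0.813 + 0.443 + 0.833 + 0.589 + 0.751 + 0.922) / 6 ≈ 0.725. A minimal sketch, assuming an unweighted mean:

```python
from statistics import mean

# Per-ability scores for LLaMA-Adapter V2, copied from the table above.
scores = {
    "Visual Perception": 0.813,
    "Visual Knowledge Acquisition": 0.443,
    "Visual Reasoning": 0.833,
    "Visual Commonsense": 0.589,
    "Object Hallucination": 0.751,
    "Embodied Intelligence": 0.922,
}
print(round(mean(scores.values()), 3))  # 0.725, matching the Average column
```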
Visual Perception
The visual perception ability of LVLM models is evaluated on three tasks: image classification, object counting (OC), and multi-class identification (MCI). Accuracy is used as the metric for all tasks.
Model Name | ImageNet1K | CIFAR10 | Pets37 | Flowers102 | COCO (OC) | VCR (OC) | COCO (MCI) | VCR (MCI) |
---|---|---|---|---|---|---|---|---|
BLIP2 | 23.71 | 58.20 | 34.79 | 19.44 | 48.90 | 25.05 | 86.06 | 66.59 |
InstructBLIP | 24.51 | 67.24 | 38.86 | 21.78 | 46.65 | 29.29 | 87.81 | 76.49 |
LA-V2 | 25.89 | 64.86 | 24.39 | 22.34 | 38.50 | 26.51 | 82.90 | 50.66 |
LLaVA | 23.50 | 67.96 | 9.09 | 8.38 | 20.56 | 24.60 | 49.66 | 66.90 |
MiniGPT-4 | 21.17 | 61.39 | 18.90 | 21.70 | 20.86 | 25.26 | 72.70 | 66.02 |
mPLUG-Owl | 26.68 | 59.66 | 43.16 | 22.91 | 34.14 | 27.98 | 58.30 | 55.56 |
Otter | 19.29 | 65.42 | 5.79 | 6.13 | 46.14 | 41.06 | 51.03 | 51.60 |
VPGTrans | 19.75 | 60.88 | 10.88 | 7.97 | 27.30 | 19.46 | 52.34 | 48.80 |
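Because LVLMs answer in free-form text, scoring these tasks requires matching the generated answer to the ground-truth label before computing accuracy. The sketch below uses normalized substring matching, which is an assumption; LVLM-eHub's exact answer-matching rules may differ.

```python
def accuracy(predictions: list, labels: list) -> float:
    """Percentage of model answers that contain the ground-truth label.

    Normalized substring matching is an assumed scheme, not necessarily
    the benchmark's exact rule.
    """
    hits = sum(label.strip().lower() in pred.strip().lower()
               for pred, label in zip(predictions, labels))
    return 100.0 * hits / len(labels)


print(accuracy(["A photo of a Golden Retriever."], ["golden retriever"]))  # 100.0
```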
Visual Knowledge Acquisition
The visual knowledge acquisition ability of LVLM models is evaluated on three kinds of tasks: optical character recognition (OCR), key information extraction (KIE), and image captioning. In the table below, the SROIE and FUNSD datasets evaluate the KIE performance of LVLM models with entity-level F1 score, while the NoCaps, Flickr-30k, and WHOOPS datasets test the image captioning performance of LVLM models with the CIDEr score. The remaining datasets evaluate LVLM models on OCR tasks with word accuracy.
Model Name | IIIT5K | IC13 | IC15 | Total-Text | CUTE80 | SVT | SVTP | COCO-Text | WordArt | CTW | HOST | WOST | SROIE | FUNSD | NoCaps | Flickr-30k | WHOOPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 80.17 | 81.13 | 66.68 | 68.31 | 85.07 | 85.78 | 77.34 | 53.62 | 73.66 | 67.43 | 57.28 | 68.83 | 0.08 | 1.02 | 48.58 | 46.48 | 96.12 |
InstructBLIP | 83.90 | 82.08 | 73.57 | 71.51 | 86.11 | 86.86 | 80.93 | 58.25 | 75.12 | 68.58 | 61.22 | 73.26 | 0.09 | 1.03 | 46.33 | 50.45 | 97.98 |
LA-V2 | 36.30 | 20.87 | 29.40 | 30.93 | 35.76 | 20.40 | 31.01 | 20.94 | 38.98 | 18.13 | 16.60 | 21.73 | 0.02 | 2.16 | 41.66 | 30.49 | 57.60 |
LLaVA | 31.57 | 16.39 | 26.58 | 24.51 | 36.46 | 18.55 | 27.44 | 18.05 | 35.87 | 16.73 | 15.94 | 20.49 | 0.01 | 1.93 | 33.09 | 27.65 | 34.36 |
MiniGPT-4 | 25.00 | 16.69 | 22.05 | 18.65 | 33.33 | 15.46 | 20.31 | 11.86 | 31.90 | 14.95 | 13.45 | 19.12 | 0.02 | 1.27 | 42.43 | 26.04 | 47.36 |
mPLUG-Owl | 25.30 | 14.98 | 20.99 | 20.63 | 31.94 | 14.37 | 20.78 | 12.88 | 31.90 | 13.87 | 11.88 | 14.65 | 0.01 | 0.41 | 28.30 | 20.53 | 42.73 |
Otter | 17.57 | 9.67 | 18.49 | 14.81 | 18.75 | 10.51 | 19.22 | 11.30 | 21.05 | 10.05 | 10.14 | 12.29 | 0.01 | 1.91 | 29.23 | 23.00 | 32.70 |
VPGTrans | 62.87 | 71.11 | 55.90 | 54.76 | 70.49 | 72.02 | 64.50 | 36.98 | 62.34 | 52.80 | 50.58 | 57.66 | 0.02 | 1.20 | 48.13 | 32.51 | 38.38 |
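As a sketch of the KIE metric, entity-level F1 counts a predicted entity as correct only when it matches a ground-truth entity exactly; treating entities as exact (type, value) pairs is an assumption about the scoring details.

```python
def entity_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 over exact (entity_type, value) pairs."""
    tp = len(predicted & gold)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {("company", "ACME LTD"), ("total", "12.50")}
pred = {("company", "ACME LTD"), ("total", "13.50")}  # one value is wrong
print(entity_f1(pred, gold))  # 0.5
```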
Visual Reasoning
In LVLM-eHub, the evaluation of LVLM models' visual reasoning ability encompasses visual question answering (VQA), knowledge-grounded image description (KGID), and visual entailment (VE). In the table below, ScienceQA IMG and VizWiz are used for the KGID task, SNLI-VE is used for the VE task, and all remaining datasets are used for the VQA task. Mean reciprocal rank (MRR) is used as the metric for the Visdial dataset; accuracy is used for the remaining datasets.
Model Name | DocVQA | TextVQA | STVQA | OCR-VQA | OKVQA | GQA | Visdial | IconQA | VSR | WHOOPS | ScienceQA IMG | VizWiz | SNLI-VE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 4.75 | 31.98 | 20.98 | 38.85 | 44.93 | 45.53 | 10.73 | 62.82 | 63.63 | 24.87 | 60.73 | 65.44 | 32.00 |
InstructBLIP | 5.89 | 39.60 | 28.30 | 60.20 | 60.52 | 49.96 | 45.20 | 56.25 | 41.28 | 30.13 | 46.26 | 65.31 | 59.00 |
LA-V2 | 8.13 | 43.76 | 32.33 | 38.12 | 55.93 | 43.93 | 12.92 | 41.83 | 50.63 | 24.15 | 54.19 | 62.07 | 58.80 |
LLaVA | 6.26 | 38.92 | 28.40 | 23.40 | 54.36 | 41.30 | 14.66 | 42.95 | 51.24 | 24.39 | 49.33 | 62.42 | 57.80 |
MiniGPT-4 | 2.65 | 19.40 | 13.55 | 16.85 | 37.48 | 30.82 | 10.31 | 37.59 | 41.56 | 17.91 | 25.43 | 47.48 | 54.80 |
mPLUG-Owl | 2.24 | 38.76 | 12.10 | 8.84 | 22.89 | 14.02 | 13.34 | 11.64 | 24.74 | 20.70 | 2.80 | 38.99 | 54.50 |
Otter | 3.44 | 21.52 | 15.23 | 19.50 | 49.01 | 38.12 | 11.67 | 26.77 | 6.40 | 15.14 | 27.22 | 50.04 | 52.60 |
VPGTrans | 3.53 | 21.98 | 17.13 | 21.71 | 44.51 | 32.99 | 9.70 | 38.22 | 48.77 | 15.88 | 36.99 | 53.23 | 52.20 |
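For the Visdial metric, MRR averages the reciprocal of the rank at which the ground-truth answer appears in the model's sorted candidate list; the sample ranks below are hypothetical.

```python
def mean_reciprocal_rank(ranks: list) -> float:
    """Average of 1/rank of the ground-truth answer (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical 1-based ranks of the ground-truth answer across three dialogs.
print(mean_reciprocal_rank([1, 4, 10]))  # ≈ 0.45
```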
Visual Commonsense
In the evaluation of visual commonsense ability, two challenging benchmarks are used: ImageNetVC and Visual Commonsense Reasoning (VCR).
Model Name | ImageNetVC (Color) | ImageNetVC (Shape) | ImageNetVC (Material) | ImageNetVC (Component) | ImageNetVC (Others) | VCR |
---|---|---|---|---|---|---|
BLIP2 | 26.22 | 34.21 | 35.79 | 50.71 | 34.48 | 31.60 |
InstructBLIP | 67.78 | 59.06 | 63.50 | 83.25 | 68.37 | 54.20 |
LA-V2 | 36.12 | 28.63 | 33.86 | 50.13 | 32.69 | 49.80 |
LLaVA | 43.70 | 39.10 | 65.58 | 56.73 | 59.38 | 48.20 |
MiniGPT-4 | 24.49 | 23.54 | 28.56 | 59.26 | 39.38 | 49.00 |
mPLUG-Owl | 26.20 | 34.19 | 35.82 | 50.73 | 34.50 | 46.00 |
Otter | 26.21 | 34.20 | 35.81 | 50.72 | 34.49 | 47.00 |
VPGTrans | 23.34 | 23.92 | 27.26 | 56.43 | 35.83 | 41.40 |
Object Hallucination
Based on the MSCOCO dataset, LVLM-eHub evaluates object hallucination on the MSCOCO-Random/Popular/Adversarial splits. Notably, from Random to Adversarial, the questions become increasingly challenging.
Model Name | Random (Accuracy) | Random (Precision) | Random (Recall) | Random (F1-Score) | Random (Yes) | Popular (Accuracy) | Popular (Precision) | Popular (Recall) | Popular (F1-Score) | Popular (Yes) | Adversarial (Accuracy) | Adversarial (Precision) | Adversarial (Recall) | Adversarial (F1-Score) | Adversarial (Yes) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLIP2 | 82.21 | 97.48 | 67.27 | 79.61 | 35.58 | 80.10 | 90.49 | 67.27 | 77.17 | 37.17 | 78.52 | 86.83 | 67.27 | 75.81 | 38.73 |
InstructBLIP | 88.83 | 96.01 | 81.60 | 88.23 | 43.99 | 84.15 | 85.96 | 81.60 | 83.72 | 47.47 | 81.95 | 82.05 | 81.60 | 81.82 | 49.77 |
LA-V2 | 74.44 | 68.24 | 94.00 | 79.08 | 70.99 | 56.82 | 53.89 | 94.20 | 68.56 | 87.40 | 60.52 | 54.58 | 96.45 | 69.12 | 88.23 |
LLaVA | 51.52 | 51.54 | 100.00 | 68.03 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 | 50.00 | 50.00 | 100.00 | 66.67 | 100.00 |
MiniGPT-4 | 52.58 | 68.63 | 57.50 | 62.57 | 44.25 | 49.31 | 63.56 | 58.03 | 60.67 | 48.29 | 49.62 | 62.55 | 58.71 | 68.47 | 48.54 |
mPLUG-Owl | 61.37 | 57.89 | 97.52 | 72.65 | 87.15 | 55.83 | 53.61 | 97.13 | 69.09 | 91.20 | 54.43 | 52.73 | 97.59 | 72.09 | 92.95 |
Otter | 61.40 | 57.82 | 95.92 | 72.15 | 85.76 | 49.56 | 50.07 | 95.92 | 65.79 | 96.58 | 50.68 | 50.56 | 95.92 | 66.22 | 95.31 |
VPGTrans | 48.28 | 74.17 | 56.78 | 64.32 | 47.38 | 47.86 | 70.37 | 55.90 | 62.92 | 51.92 | 47.86 | 69.76 | 59.22 | 64.06 | 52.27 |
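The five columns per split follow the standard binary-classification definitions over yes/no object-existence questions, with "yes" as the positive class; the Yes column is the percentage of questions a model answers with "yes", which signals hallucination tendency (e.g., LLaVA's 100.00 means it always answered "yes"). A minimal sketch; the function name and interface below are illustrative, not LVLM-eHub's API.

```python
def hallucination_metrics(preds: list, labels: list) -> dict:
    """Binary yes/no metrics with "yes" as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and t == "yes" for p, t in pairs)
    fp = sum(p == "yes" and t == "no" for p, t in pairs)
    fn = sum(p == "no" and t == "yes" for p, t in pairs)
    tn = sum(p == "no" and t == "no" for p, t in pairs)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": 100.0 * (tp + tn) / n,
        "Precision": 100.0 * precision,
        "Recall": 100.0 * recall,
        "F1-Score": 100.0 * f1,
        "Yes": 100.0 * (tp + fp) / n,
    }
```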
Embodied Intelligence
To appraise the quality of LVLM models' planning outputs, we conducted a user study involving 15 participants to evaluate the embodied intelligence of LVLM models. The study comprised six household scenarios carefully selected from VirtualHome.
Model Name | Object Recognition | Spatial Relationship | Conciseness | Reasonability | Executability |
---|---|---|---|---|---|
BLIP2 | 2.03 | 1.68 | 3.25 | 2.78 | 2.88 |
InstructBLIP | 3.08 | 2.78 | 2.48 | 3.20 | 3.10 |
LA-V2 | 3.81 | 3.71 | 2.04 | 4.04 | 4.08 |
LLaVA | 3.88 | 3.61 | 1.86 | 3.70 | 3.82 |
MiniGPT-4 | 3.70 | 3.47 | 1.62 | 3.54 | 3.11 |
mPLUG-Owl | 3.42 | 3.22 | 1.48 | 3.44 | 3.54 |
Otter | 3.38 | 3.10 | 1.86 | 3.07 | 3.12 |
VPGTrans | 3.43 | 3.22 | 1.76 | 3.35 | 3.35 |