
Datasets for Visual Instruction Tuning[1]

MiniGPT-4 MiniGPT-4 [37] curates an image description dataset of 3439 image-text pairs for instruction fine-tuning. It randomly selects 5000 images from the Conceptual Caption dataset [38], [39] and prompts its pre-trained VLM to generate a detailed description for each image. The generated descriptions are then **refined and filtered** both manually and with ChatGPT, resulting in 3439 high-quality image-text pairs.
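
The curation steps above can be sketched as a sample, generate, filter pipeline. This is a minimal sketch, not MiniGPT-4's actual code: `curate_pairs`, `describe_image`, and `is_high_quality` are hypothetical names, and the toy stand-ins below (including the word-count quality rule) are assumptions made only so the example runs.

```python
import random
from typing import Callable, List, Tuple


def curate_pairs(
    image_ids: List[str],
    describe_image: Callable[[str], str],
    is_high_quality: Callable[[str], bool],
    sample_size: int = 5000,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample images, generate one description each, keep those that pass the filter."""
    rng = random.Random(seed)
    sampled = rng.sample(image_ids, min(sample_size, len(image_ids)))
    pairs = []
    for image_id in sampled:
        description = describe_image(image_id)  # hypothetical model call
        if is_high_quality(description):  # stand-in for manual + ChatGPT filtering
            pairs.append((image_id, description))
    return pairs


# Toy stand-ins (assumptions, not real components).
def toy_describe(image_id: str) -> str:
    # Pretend the model returns nothing useful for ids ending in "0".
    return "" if image_id.endswith("0") else f"A detailed description of {image_id}."


def toy_filter(description: str) -> bool:
    # Placeholder quality rule: keep descriptions with at least three words.
    return len(description.split()) >= 3


pool = [f"img_{i}" for i in range(100)]
pairs = curate_pairs(pool, toy_describe, toy_filter, sample_size=50)
print(len(pairs), "pairs kept out of 50 sampled")
```

In the real pipeline the filter is the expensive part (human review plus ChatGPT), which is why 5000 sampled images shrink to 3439 pairs.
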
MultiInstruct MultiInstruct [43] builds a comprehensive instruction dataset that covers 62 diverse multimodal tasks from 10 broad categories, such as VQA, image-text matching, and grounded generation. These comprise 34 existing tasks derived from 21 public datasets and 28 new tasks extended from them. Each task comes with 5 instruction templates that prompt the model to perform it.
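
To make the template mechanism concrete, here is a minimal sketch of how one task instance might be rendered with a randomly chosen instruction template. The template wording, the `{question}` placeholder, and the name `build_prompt` are illustrative assumptions, not MultiInstruct's actual templates or code.

```python
import random
from typing import List

# Five hypothetical templates for a VQA-style task
# (the text states that each task has 5 templates).
VQA_TEMPLATES = [
    "Answer the question about the image: {question}",
    "Look at the image and answer: {question}",
    "Based on the image, respond to this question: {question}",
    "Question: {question} Answer based on the image.",
    "Use the visual content to respond to: {question}",
]


def build_prompt(question: str, templates: List[str], rng: random.Random) -> str:
    """Pick one template at random and fill in the task input."""
    template = rng.choice(templates)
    return template.format(question=question)


rng = random.Random(0)
prompt = build_prompt("What color is the bus?", VQA_TEMPLATES, rng)
print(prompt)

# Task count as stated in the text: existing tasks plus extended tasks.
total_tasks = 34 + 28
print(total_tasks)  # 62
```

Sampling among several phrasings of the same task is a common way to keep a model from overfitting to one instruction wording.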

LLaVA LLaVA-Instruct-158k [9] contains 158k image-text instruction samples: 58k conversation samples asking about the visual content of the image, 23k description samples, and 77k complex reasoning samples in which the question may require multi-step reasoning.
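
As a quick check of the composition, the three subsets add up to the 158k total. The sketch below uses only the rounded counts quoted above; the dictionary keys are labels chosen here for illustration.

```python
# Rounded subset sizes (in thousands) as quoted in the text.
subsets_k = {
    "conversation": 58,
    "description": 23,
    "complex_reasoning": 77,
}

total_k = sum(subsets_k.values())
shares = {name: count / total_k for name, count in subsets_k.items()}

print(total_k)  # 158
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Complex reasoning is the largest subset, at a little under half of the data.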

| Method | Training Paradigm[2] | Vision Encoder | Language Encoder | Inst[2] | Tuning Data |
| --- | --- | --- | --- | --- | --- |
| MiniGPT-4 | FA → VLIT | EvaCLIP ViT | Vicuna | AA | CC3M, CC12M, SBU, LAION 400M, MiniGPT-3.5K |
| MiniGPT-v2 | | EVA | LLaMA2-chat | AA+SI | LAION, CC3M, SBU, GRIT-20M, COCO caption, Text Captions, RefCOCO, RefCOCO+, RefCOCOg, GQA, VQA-v2, OCR-VQA, OKVQA, AOK-VQA, Flickr30k Dataset, Unnatural Instruction Dataset |
| LLaVA | FA → VLIT | CLIP ViT | Vicuna | SI | CC3M Concept-balanced 595K, LLaVA-Instruct-158K |
| LLaVA-1.5 | FA → VLIT | CLIP ViT | Vicuna | | LLaVA, ShareGPT, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, TextCaps, RefCOCO, VG |
| MultiInstruct | VLIT | OFA | OFA | AA | VQAv2, Visual7w, GQA, OK-VQA, Visual Genome, MSCOCO, RefCOCO, COCO-Text, TDIUC, IQA, VAW, MOCHEG, WikiHow |
| Otter | | CLIP ViT | MPT | SI | MIMIC-IT |
| LAMM | VLIT | CLIP ViT-L/14 | Vicuna | SI | Language-Assisted Multi-Modal Instruction-Tuning Dataset |
| Qwen-VL | FA → VLIT (Multi-Task Tuning) | ViT | Qwen-7B | | LAION-en&zh, DataComp, Coyo, CC12M&3M, SBU, COCO, In-house Data, GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D, GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg, SynthDoG-en&zh, Common Crawl pdf&HTML |
| CogVLM | FA → VLIT | EVA2-CLIP-E | Vicuna-7B-v1.5 | | VQAv2, TextVQA |
| StableLLaVA | FA → VLIT | CLIP-ViT-L/14 | LLaMA | AA | Synthesized Image-Dialogue Dataset |
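
For readers who want to query the comparison programmatically, the rows can be held as plain records. The following is a minimal sketch that copies only the Method, Training Paradigm, and Inst columns from the table; using `None` for empty cells and the label "unspecified" are choices made here, not part of the source.

```python
from collections import defaultdict

# (method, training paradigm, Inst) copied from the table; None marks an empty cell.
ROWS = [
    ("MiniGPT-4", "FA → VLIT", "AA"),
    ("MiniGPT-v2", None, "AA+SI"),
    ("LLaVA", "FA → VLIT", "SI"),
    ("LLaVA-1.5", "FA → VLIT", None),
    ("MultiInstruct", "VLIT", "AA"),
    ("Otter", None, "SI"),
    ("LAMM", "VLIT", "SI"),
    ("Qwen-VL", "FA → VLIT (Multi-Task Tuning)", None),
    ("CogVLM", "FA → VLIT", None),
    ("StableLLaVA", "FA → VLIT", "AA"),
]

# Group method names by their training paradigm, preserving table order.
by_paradigm = defaultdict(list)
for method, paradigm, _inst in ROWS:
    by_paradigm[paradigm or "unspecified"].append(method)

for paradigm, methods in by_paradigm.items():
    print(f"{paradigm}: {', '.join(methods)}")
```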

"Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey" *** Chapters 4 and 5 (Nanyang Technological University)
"Vision-Language Instruction Tuning: A Review and Analysis" *** Chapters 2, 3, 4, and 5 (Tencent)
"Instruction Tuning for Large Language Models: A Survey", Chapter 5