目录 #
Datasets for Visual Instruction Tuning[1] #
Single-turn #
-
MiniGPT-4 MiniGPT-4 [37] curates an image description dataset that contains 3439 image-text pairs for instruction fine-tuning. MiniGPT-4 randomly selects 5000 images from the Conceptual Caption dataset [38], [39] and prompts its pre-trained VLM model to generate detailed descriptions for each image. The generated descriptions are then** refined and filtered** both manually and by using ChatGPT, resulting in 3439 highquality image-text pairs.
-
MultiInstruct MultiInstruct [43] build a comprehensive instruction dataset that covers 62 diverse multimodal tasks from 10 broad categories, such VQA, Image-text matching, grounded generation, and so on. These tasks include 34 existing tasks derived from 21 public dataset and 28 new tasks extended from them. Each task is equipped with 5 instruction templates to prompt the model to perform the specific task.
Multi-turn #
- LLaVA LLaVA-Instruct-158k [9] contains 158 image-text instruction data, including 58k conversation data asking about the visual content of the image,23k description data, and 77k complex reasoning data where the question may involve multi-step reasoning process.
VLIT Data Construction Strategy[2] #
Annotation Adaption #
- MiniGPT-4
Self-Instruct #
- LLaVA
High-Quality VLIT Data[2] #
Correctness #
Diversity #
Complexity #
Method [1][2] #
| Method | Training Paradigm[2] | Vision Encoder | Language Encoder | Inst[2] | Tuning Data |
|---|---|---|---|---|---|
| MiniGPT-4 | FA → VLIT | EvaCLIP ViT | Vicuna | AA | CC3M, CC12M, SBU, LAION 400M, MiniGPT-3.5K |
| MiniGPT-v2 | EVA | LLaMA2-chat | AA+SI | LAION, CC3M, SBU, GRIT-20M, COCO caption, Text Captions, RefCOCO, RefCOCO+, RefCOCOg, GQA, VQA-v2, OCR-VQA, OKVQA, AOK-VQA, Flickr30k Dataset, Unnatural Instruction Dataset | |
| LLaVa | FA → VLIT | CLIP ViT | Vicuna | SI | CC3M Concept-balanced 595K, LLaVA-Instruct-158K |
| LLaVA-1.5 | FA → VLIT | CLIP ViT | Vicuna | LLaVA, ShareGPT, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, TextCaps, RefCOCO, VG | |
| MultiInstruct | VLIT | OFA | OFA | AA | VQAv2, Visual7w, GQA, OK-VQA, Visual Genome, MSCOCO, RefCOCO, COCO-Text, TDIUC, IQA, VAW, MOCHEG, WikiHow |
| Otter | CLIP ViT | MPT | SI | MIMIC-IT | |
| LAMM | VLIT | CLIP ViT-L/14 | Vicuna | SI | Language-Assisted Multi-Modal Instruction-Tuning Dataset |
| Qwen-VL | FA → VLIT(Multi-Task Tuning) | ViT | Qwen-7B | LAION-en&zh, DataComp, Coyo, CC12M&3M, SBU, COCO, In-house Data, GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D, GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg, SynthDoG-en&zh, Common Crawl pdf&HTML | |
| CogVLM | FA → VLIT | EVA2-CLIP-E | Vicuna-7Bv-1.5 | VQAv2, TextVQA | |
| StableLLaVA | FA → VLIT | CLIP-ViT-L/14 | LLaMA | AA | Synthesized Image-Dialogue Dataset |
Annotation Adaption-> SI #
Self-Instruct -> AA #
参考 #
-
《Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey》 *** 第4 5章 南洋大学
-
《Vision-Language Instruction Tuning: A Review and Analysis》 *** 第2 3 4 5章 腾讯
-
《Instruction Tuning for Large Language Models: A Survey》 第5章