Multi-modal RAG: Multi-Vector Retriever [10][11] #

semi-structured (tables + text) RAG [20] #

Parsing tables in PDFs

Parsing images in PDFs

Option 1 [CLIP-based] [23] [30][32][33]
- Use multimodal embeddings (such as CLIP) to embed images and text
- Retrieve both using similarity search
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
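  {Option 1: generate summaries for the text and tables, then use a multimodal embedding model to embed the text/table summaries and the raw images, and store the embeddings in the multi-vector retriever. At query time, retrieve the raw text/tables/images that match the query, then pass them to a multimodal LLM to generate the answer.}[10]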
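The retrieval step of Option 1 can be sketched as a single similarity search over one index that holds both text and image vectors. This is a minimal illustration, not the cited implementation: the 3-dimensional vectors below are hand-written, hypothetical stand-ins for real CLIP embeddings, and `search` is a name introduced here for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Assumption: a multimodal embedding model such as CLIP would produce these
# vectors; the numbers here are made up so the example runs without a model.
index = [
    {"kind": "text", "content": "Quarterly revenue grew 12%.", "vector": [0.9, 0.1, 0.0]},
    {"kind": "image", "content": "revenue_chart.png", "vector": [0.8, 0.2, 0.1]},
    {"kind": "text", "content": "The office moved to Berlin.", "vector": [0.0, 0.1, 0.9]},
]

def search(query_vector, k=2):
    """Return the k items (text or image) closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:k]

# A query about revenue lands near both the revenue text and the revenue chart.
hits = search([1.0, 0.1, 0.0])
for hit in hits:
    print(hit["kind"], hit["content"])
```

Because text and images share one vector space, a single query returns both kinds of item, and both can then be handed to the multimodal LLM.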
Option 2 [21]
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve text
- Pass text chunks to an LLM for answer synthesis
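  【Convert images into summaries and retrieve them together with the other text, at the text level】[12]
  {Option 2: first use a multimodal LLM (GPT4-V, LLaVA, FUYU-8b) to generate image summaries. Then embed the text/table/image summaries and store them in the multi-vector retriever. When no multimodal LLM is available for answer generation, retrieve the raw text/tables plus the image summaries that match the query.}[10]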
Option 3 [24] [31][34]
- Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve image summaries with a reference to the raw image
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
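  【The model's actual input is the raw image】
  【Image summaries (from GPT-4V, LLaVA, or FUYU-8b) are still used, but only for retrieval】[12]
  {Option 3: the indexing stage is the same as in Option 2. At query time, retrieve the raw text/tables/images that match the query, build the full prompt, and call a multimodal LLM to generate the answer.}[10]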
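Options 2 and 3 share the same index of summaries and differ only in what reaches the answer model. The sketch below illustrates that difference under stated assumptions: `build_model_input`, the `hits` layout, and the prompt wording are all hypothetical and not taken from the cited notebooks, and the image bytes are a placeholder rather than a real PNG.

```python
import base64

def build_model_input(question, hits, multimodal):
    """Return (prompt_text, images) for the answer-synthesis model.

    multimodal=False mimics Option 2: image summaries go in as text.
    multimodal=True mimics Option 3: raw images are attached instead.
    """
    context = []
    images = []
    for hit in hits:
        if hit["kind"] == "image":
            if multimodal:
                # Option 3: follow the reference and attach the raw image.
                images.append(base64.b64encode(hit["raw"]).decode("ascii"))
            else:
                # Option 2: the summary stands in for the image.
                context.append(hit["summary"])
        else:
            # Text and tables are passed in their raw form in both options.
            context.append(hit["raw"])
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
    return prompt, images

# Hypothetical retrieval results: each hit keeps its summary and its raw payload.
hits = [
    {"kind": "table", "summary": "Revenue by quarter.", "raw": "Q1: 10 | Q2: 12"},
    {"kind": "image", "summary": "Bar chart of revenue rising from Q1 to Q2.", "raw": b"\x89PNG-bytes"},
]

text_prompt, no_images = build_model_input("How did revenue change?", hits, multimodal=False)
mm_prompt, images = build_model_input("How did revenue change?", hits, multimodal=True)
```

In both calls the summaries were what made the hits retrievable; only the multimodal call forwards the image itself.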

Components #

PDF parsing
unstructured
store
MultiVectorRetriever - metadata + data
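The "metadata + data" split can be pictured as two stores linked by an id: a searchable index of summaries whose metadata carries the id, and a document store that maps the id to the raw content. The class below is a toy illustration of that layout, not LangChain's `MultiVectorRetriever`; the class name, the `doc_id` field, and the word-overlap scoring (a stand-in for vector similarity) are assumptions made for this sketch.

```python
import uuid

class TinyMultiVectorStore:
    """Toy two-store layout: summaries are searched, raw data is returned."""

    def __init__(self, id_key="doc_id"):
        self.id_key = id_key
        self.index = []      # searchable entries: summary + metadata
        self.docstore = {}   # id -> raw text, table, or image reference

    def add(self, summary, raw, kind):
        """Index a summary and store its raw counterpart under a shared id."""
        doc_id = str(uuid.uuid4())
        self.index.append({"summary": summary, "metadata": {self.id_key: doc_id, "kind": kind}})
        self.docstore[doc_id] = raw
        return doc_id

    def retrieve(self, query, k=1):
        """Rank summaries by word overlap with the query, return raw data."""
        query_words = set(query.lower().split())

        def overlap(entry):
            return len(query_words & set(entry["summary"].lower().split()))

        ranked = sorted(self.index, key=overlap, reverse=True)[:k]
        return [self.docstore[entry["metadata"][self.id_key]] for entry in ranked]

store = TinyMultiVectorStore()
store.add("table of quarterly revenue figures", "Q1: 10 | Q2: 12", "table")
store.add("photo of the berlin office building", "office.jpg", "image")

result = store.retrieve("quarterly revenue")
print(result)
```

Keeping the raw data out of the searchable index is what lets a short summary drive retrieval while the full table or image is still what gets returned.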

References #

Hands-on #

1xx. Using Multi-Modal LLMs page21

notebook #

Semi_Structured_RAG notebook
Advanced-RAG semi_structured_data notebook {semi-structured: parses the tables in a PDF; runs without problems and can answer questions about the table data}
Semi_structured_and_multi_modal_RAG notebook
Private Semi-structured and Multi-modal RAG w/ LLaMA2 and LLaVA notebook {multi-modal: parses the images in a PDF; runs into problems}
Private Semi-structured and Multi-modal RAG w/ LLaMA2 and LLaVA notebook
Chroma multi-modal RAG notebook
Multi-modal RAG notebook

template (no longer working) #

rag-multi-modal-local
OpenCLIP (image embedding) + bakllava (answer synthesis)
rag-multi-modal-mv-local
bakllava (image summaries embedding) + bakllava (answer synthesis)
rag-chroma-multi-modal
OpenCLIP (image embedding) + GPT-4V (answer synthesis)
rag-gemini-multi-modal
OpenCLIP (image embedding) + Gemini (answer synthesis)
rag-chroma-multi-modal-multi-vector
GPT-4V (image summaries embedding) + GPT-4V (answer synthesis)

llamaindex #

1xx. How is naive multi-modal RAG implemented? Plus a look at the RAG context-filtering approach FILCO and the 2024-02 LLM morning briefing
1xx. Advanced Multi-Modal Retrieval using GPT4V and Multi-Modal Index/Retriever
1xx. Multimodal RAG pipeline with LlamaIndex and Neo4j
1xx. neo4j_llama_multimodal.ipynb git

multi-modal (text + tables + images) RAG [13] #

private multi-modal (text + tables + images) RAG [22] #
