
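Building a Multimodal RAG System with Elasticsearch: The Story of Gotham City

news · Source: original · 2025/8/15 9:15:16

Author: Alex Salgado, Elastic

Learn how to build a multimodal Retrieval-Augmented Generation (RAG) system that integrates text, audio, video, and image data to provide richer, contextualized information retrieval.

In this blog, you will learn how to build a multimodal RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch. We will explore how to use ImageBind to generate embeddings for several data types (text, images, audio, and depth maps), and how to store and retrieve those embeddings efficiently with dense_vector and k-NN search. Finally, we will integrate a large language model (LLM) to analyze the retrieved evidence and produce a comprehensive report.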

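How does the pipeline work?

🔍 Collect clues → Extract image, audio, text, and depth-map data from the Gotham crime scene.
📌 Generate embeddings → Use the ImageBind multimodal model to convert each file into a vector.
📂 Index into Elasticsearch → Store the vectors for efficient retrieval.
🔎 Similarity search → Given a new clue, retrieve the most similar vectors.
🕵️ LLM analyzes the evidence → GPT-4 synthesizes the findings and pins down the suspect!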

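Technologies used

ImageBind → Generates unified embeddings across modalities.
Elasticsearch → Provides fast, efficient vector search.
LLM (GPT-4, OpenAI) → Analyzes the evidence and generates the final report.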

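Who is this blog for?

✅ Elasticsearch users interested in multimodal vector search.
✅ Developers who want to see how a multimodal RAG system is built in practice.
✅ Engineers looking for scalable analysis solutions that handle data from multiple sources.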

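Prerequisites: setting up the environment

Want to crack the Gotham case? First, set up your technical environment by following the steps below.

1. Technical requirements

Component	Specification
Operating system	Linux, macOS, or Windows
Python	3.10 or later
RAM	Minimum 8GB (16GB recommended)
GPU	Optional but recommended for ImageBind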

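2. Set up the project

All investigation materials are available on GitHub, and we will run this interactive case in a Jupyter Notebook (Google Colab). Follow the steps below to get started.

Setting up with Jupyter Notebook (Google Colab)

1) Open the notebook
Open our prepared Google Colab notebook: Multimodal RAG with Elasticsearch.
The notebook contains all the code and instructions you need to follow along.

2) Clone the repository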

# Clone the repository with the multimodal RAG code
!git clone https://github.com/elastic/elasticsearch-labs.git

# Navigate to the project directory (%cd makes the change persist across notebook cells)
%cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham

3) Install dependencies

# Install PyTorch and related libraries (quoted so the shell does not treat ">" as a redirect)
!pip install "torch>=2.1.0" "torchvision>=0.16.0" "torchaudio>=2.1.0"

# Install vision processing libraries
!pip install opencv-python-headless pillow numpy

# Install the specific ImageBind fork
!pip install git+https://github.com/hkchengrex/ImageBind.git

# Install Elasticsearch and environment management
!pip install elasticsearch python-dotenv

# This solves the problem: Couldn't find appropriate backend to handle uri data/audios/joker_laugh.wav
!pip install torchaudio soundfile

4) Configure credentials

# Input your credentials securely
import getpass
import os

ELASTICSEARCH_URL = input("Enter the Elasticsearch endpoint url: ")
ELASTICSEARCH_API_KEY = getpass.getpass("Enter the Elasticsearch API key: ")
OPENAI_API_KEY = getpass.getpass("Enter the OpenAI API key: ")

# Configure environment variables
os.environ["ELASTICSEARCH_API_KEY"] = ELASTICSEARCH_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ELASTICSEARCH_URL"] = ELASTICSEARCH_URL

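Note: The ImageBind model (about 2GB) is downloaded automatically on the first run.

Now that everything is set up, let's dive into the details and solve the case!

Introduction: Crime in Gotham City

On a rainy night, a shocking crime takes place in Gotham City. Commissioner Gordon needs your help to unravel the mystery. The clues are scattered across different formats: blurry images, mysterious audio, encrypted text, and even depth maps. Are you ready to crack the case with state-of-the-art AI?

In this blog, we will guide you step by step through building a multimodal RAG (Retrieval-Augmented Generation) system that unifies different types of data (images, audio, text, and depth maps) in a single search space. We will use ImageBind to generate multimodal embeddings, Elasticsearch to store and retrieve them, and a large language model (LLM) to analyze the evidence and generate the final report.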

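Fundamentals: Multimodal RAG architecture

What is multimodal RAG?

The rise of multimodal Retrieval-Augmented Generation (RAG) is changing how we interact with AI models. Traditional RAG systems handle only text: they retrieve relevant information from a database and then generate a response. But the world is not limited to text; images, video, and audio also carry valuable knowledge. That is why multimodal architectures are becoming increasingly important: they let AI systems combine information from different formats to produce richer, more accurate responses.

Three main approaches to multimodal RAG

Three strategies are commonly used to implement multimodal RAG. Each has its own strengths and limitations, depending on the use case: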

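1. Shared vector space

Data from different modalities is mapped into a common vector space by a multimodal model such as ImageBind. This lets a text query retrieve images, video, and audio without explicit format conversion.

Pros:

Enables cross-modal retrieval without explicit format conversion.
Integrates modalities smoothly, allowing direct retrieval across text, images, audio, and video.
Scales to many data types, which suits large-scale retrieval applications.

Cons:

Training requires large multimodal datasets, which are not always available.
A shared embedding space can introduce semantic drift, so relationships between modalities are not fully preserved.
Bias in the multimodal model can affect retrieval accuracy, depending on the dataset distribution.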

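2. Single grounded modality

All modalities are converted into a single format, usually text, before retrieval. For example, images are described by automatically generated captions and audio is transcribed to text.

Pros:

Simplifies retrieval, because everything is converted into a uniform text representation.
Works with existing text-based search engines, with no dedicated multimodal infrastructure.
Improves interpretability, because retrieved results are in a human-readable format.

Cons:

Information loss: some details (such as spatial relationships in an image or tone of voice in audio) may not be fully captured in a text description.
Dependence on caption and transcription quality: errors in automatic annotation can degrade retrieval.
Not ideal for purely visual or auditory queries, because the conversion may remove key information.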

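3. Separate retrieval

A separate model is maintained for each modality. The system searches each data type independently and then merges the results.

Pros:

Allows custom optimization per modality, improving retrieval accuracy for each data type.
Depends less on complex multimodal models, which makes it easier to integrate existing retrieval systems.
Gives fine-grained control over ranking and re-ranking, because results from different modalities can be merged dynamically.

Cons:

Requires result fusion, which makes retrieval and ranking more complex.
Can produce inconsistent responses if different modalities return conflicting information.
Costs more to compute, because each modality is searched separately, which increases processing time.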
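
The result-fusion step that separate retrieval requires can be sketched with reciprocal rank fusion (RRF), one common way to merge ranked lists. This is an illustration only and is not part of the Gotham pipeline: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the per-modality hit lists are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in
    (rank starts at 1), and the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical per-modality results, best match first
image_hits = ["crime_scene1.jpg", "suspect_spotted.jpg"]
audio_hits = ["joker_laugh.wav"]
text_hits = ["riddle.txt", "crime_scene1.jpg"]

fused = reciprocal_rank_fusion([image_hits, audio_hits, text_hits])
print(fused)
```

A document that appears in several lists, even at a lower rank, rises above documents that appear in only one, which is the behavior a fusion step is usually expected to have.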

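Our choice: a shared vector space with ImageBind

Among these approaches, we chose the shared vector space, a strategy well suited to efficient multimodal search. Our implementation is based on ImageBind, a model that can represent multiple modalities (text, images, audio, and video) in a common vector space. This lets us:

Run cross-modal searches across media formats without converting everything to text.
Use highly expressive embeddings that capture relationships between modalities.
Keep the system scalable and efficient by storing optimized embeddings for fast retrieval in Elasticsearch.

With this approach we built a powerful multimodal search pipeline in which a text query can directly retrieve images or audio with no extra preprocessing. Its practical applications range from intelligent search in large libraries to advanced multimodal recommendation systems.

[Figure: data flow in the multimodal RAG pipeline, highlighting indexing, retrieval, and response generation over multimodal data.]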

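How does the embedding space work?

Traditionally, text embeddings come from language models (for example BERT or GPT). Now, with natively multimodal models such as Meta AI's ImageBind, we have a foundation for generating vectors for several modalities:

Text: sentences and paragraphs are converted into vectors of the same dimension.
Images (vision): pixels are mapped into the same dimensional space as text.
Audio: sound signals are converted into embeddings comparable with those of images and text.
Depth maps: depth data is also processed into vectors.

As a result, any clue (text, image, audio, or depth map) can be compared with any other using a vector similarity metric such as cosine similarity. If a laughter audio sample and an image of a suspect's face are "close" in this space, we can infer some association (for example, the same identity).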
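
To make "close in this space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three 4-dimensional vectors are made-up stand-ins for embeddings (real ImageBind vectors in this post have 1024 dimensions), and the variable names are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values
laugh_audio = [0.9, 0.1, 0.0, 0.4]
suspect_image = [0.8, 0.2, 0.1, 0.5]
unrelated_text = [-0.1, 0.9, 0.3, -0.2]

print(cosine_similarity(laugh_audio, suspect_image))
print(cosine_similarity(laugh_audio, unrelated_text))
```

The first pair points in nearly the same direction and scores close to 1, while the second pair scores below 0, which is the kind of contrast the similarity search relies on.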

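Stage 1 - Collecting crime scene clues

Before analyzing the evidence, we need to collect it. The Gotham crime left traces that may be hidden in images, audio, text, and even depth data. Let's organize these clues so they can be fed into the system.

What do we have?

Commissioner Gordon sent us the following files, containing evidence collected at the crime scene in four different modalities:

Clue descriptions by modality:

a) Images (3 photos)

crime_scene1.jpg, crime_scene2.jpg → Photos taken at the crime scene, showing suspicious marks on the ground.
suspect_spotted.jpg → Security camera image showing a figure fleeing the scene.

b) Audio (1 recording)

joker_laugh.wav → A microphone near the crime scene captured a sinister laugh.

c) Text (2 messages)

riddle.txt, note2.txt → Mysterious notes found at the scene, possibly left by the perpetrator.

d) Depth (2 depth maps)

depth_suspect.png → A security camera with a depth sensor captured a suspect in a nearby alley.
jdancing-depth.png → A security camera with a depth sensor captured a suspect walking toward a subway station.

These pieces of evidence exist in different formats and cannot be analyzed directly in the same way. We need to convert them into embeddings, numerical vectors that allow cross-modal comparison.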

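File organization

Before processing, we need to make sure all clues are correctly organized in the data/ directory so the pipeline runs smoothly.

Expected directory structure: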

data/
├── images/
│   ├── crime_scene1.jpg
│   ├── suspect_spotted.jpg
│   ...
├── audios/
│   ├── joker_laugh.wav
│   ...
├── texts/
│   ├── riddle.txt
│   ... 
├── depths/
│   ├── depth_suspect.png

Code to verify clue organization

Before continuing, let's make sure all required files are in the right place.

import os

# Base directory for clues
data_dir = "data"

# List of expected files
evidences = {
    "images": ["crime_scene1.jpg", "crime_scene2.jpg", "joker_alley.jpg"],
    "audios": ["joker_laugh.wav"],
    "texts": ["riddle.txt", "note2.txt"],
    "depths": ["depth_suspect.png", "jdancing-depth.png"],
}

# Create directories if they don't exist and report any missing files
all_found = True
for category, files in evidences.items():
    category_path = os.path.join(data_dir, category)
    os.makedirs(category_path, exist_ok=True)
    for file in files:
        file_path = os.path.join(category_path, file)
        if not os.path.exists(file_path):
            print(f"Warning: {file} not found in {category_path}/")
            all_found = False

# Only report success when nothing is missing
if all_found:
    print("All files are correctly organized!")

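Run the file:

python stages/01-stage/files_check.py

Expected output (if all files are in place):

All files are correctly organized!

Expected output (if any file is missing):

Warning: joker_laugh.wav not found in data/audios/
Warning: depth_suspect.png not found in data/depths/

This script helps prevent errors before we start generating embeddings and indexing them into Elasticsearch.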
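
Once the files are in place, each one has to be tagged with the modality key that the embedding code expects ("vision", "audio", "text", or "depth"). The sketch below shows one way that pairing might be done. The `FOLDER_TO_MODALITY` mapping and the `discover_evidence` function are assumptions for illustration; the repository's own scripts may organize this differently.

```python
import os

# Assumed mapping from the data/ sub-folders to the modality keys used by the embedding code
FOLDER_TO_MODALITY = {
    "images": "vision",
    "audios": "audio",
    "texts": "text",
    "depths": "depth",
}


def discover_evidence(data_dir="data"):
    """Return a sorted list of (modality, file_path) pairs found under data_dir."""
    found = []
    for folder, modality in FOLDER_TO_MODALITY.items():
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # a missing folder simply contributes no evidence
        for name in sorted(os.listdir(folder_path)):
            path = os.path.join(folder_path, name)
            if os.path.isfile(path):
                found.append((modality, path))
    return sorted(found)


if __name__ == "__main__":
    for modality, path in discover_evidence():
        print(f"{modality:>6}  {path}")
```

Keeping the folder-to-modality mapping in one dictionary means a new evidence type only needs one extra entry, provided the embedding model supports it.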

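Stage 2 - Organizing the evidence

Generating embeddings with ImageBind

To unify the clues, we need to convert them into embeddings: vector representations that capture the meaning of each modality. We will use ImageBind, a model from Meta AI that generates embeddings for different data types (images, audio, text, and depth maps) in a shared vector space.

How does ImageBind work?

To compare different types of evidence (images, audio, text, and depth maps), we use ImageBind to convert them into numerical vectors. The model converts any supported input type into the same embedding format, which is what makes cross-modal search possible.

Below is the optimized code (src/embedding_generator.py) that generates embeddings for each modality with the appropriate processor: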

import logging

import torch
from imagebind import data
from imagebind.models import imagebind_model

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    """Class for generating multimodal embeddings using ImageBind."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self._load_model()

    def _load_model(self):
        """Loads the ImageBind model and sets it to inference mode."""
        model = imagebind_model.imagebind_huge(pretrained=True)
        model.eval()
        model.to(self.device)
        return model

    def generate_embedding(self, input_data, modality):
        """Generates embedding for different modalities"""
        # Note: process_depth is a method of this class in the full source file; it is not shown in this excerpt
        processors = {
            "vision": lambda x: data.load_and_transform_vision_data(x, self.device),
            "audio": lambda x: data.load_and_transform_audio_data(x, self.device),
            "text": lambda x: data.load_and_transform_text(x, self.device),
            "depth": self.process_depth,
        }
        try:
            # Input type verification
            if not isinstance(input_data, list):
                raise ValueError(f"Input data must be a list. Received: {type(input_data)}")

            # Convert input data to a tensor format that the model can process
            # For images: [batch_size, channels, height, width]
            # For audio: [batch_size, channels, time]
            # For text: [batch_size, sequence_length]
            inputs = {modality: processors[modality](input_data)}
            with torch.no_grad():
                embedding = self.model(inputs)[modality]
            return embedding.squeeze(0).cpu().numpy()
        except Exception as e:
            logger.error(f"Error generating {modality} embedding: {str(e)}", exc_info=True)
            raise

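A tensor is the basic data structure in machine learning and deep learning, especially when working with models like ImageBind. In our context:

input_tensor = processors[modality](input_data)

Here, the tensor represents the input data (image, audio, or text) converted into a mathematical format the model can process. Specifically:

For images: the tensor represents the image as a multidimensional matrix whose values are pixels (organized by height, width, and color channel).
For audio: the tensor represents the sound waveform as a sequence of amplitudes over time.
For text: the tensor represents words or tokens as numerical vectors.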

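Testing embedding generation:

Let's test embedding generation with the code below. Save it as stages/02-stage/test_embedding_generation.py and run it with:

python stages/02-stage/test_embedding_generation.py

generator = EmbeddingGenerator()
# generate_embedding expects a list of inputs, even for a single file
image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")
print(image_embedding.shape)

Expected output:

(1024,)

The image has now been converted into a 1024-dimensional vector.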

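Stage 3 - Storing and searching in Elasticsearch

Now that we have generated embeddings for the evidence, we need to store them in a vector database for efficient search. For this we will use Elasticsearch, which supports dense vectors (dense_vector) and similarity search.

This step involves two main processes:

Indexing embeddings → Store the generated vectors in Elasticsearch.
Similarity search → Retrieve the records most similar to a new piece of evidence.

Indexing evidence in Elasticsearch

Every piece of evidence processed by ImageBind (image, audio, text, or depth map) is converted into a 1024-dimensional vector. We need to store these vectors in Elasticsearch for later searches.

The following code (src/elastic_manager.py) creates an index in Elasticsearch and configures the mapping that stores the embeddings.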

import base64
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers
...


class ElasticsearchManager:
    """Manages multimodal operations in Elasticsearch"""

    def __init__(self):
        load_dotenv()  # Load variables from .env
        self.es = self._connect_elastic()
        self.index_name = "multimodal_content"
        self._setup_index()

    def _connect_elastic(self):
        """Connects to Elasticsearch"""
        return Elasticsearch(
            os.getenv("ELASTICSEARCH_URL"),  # Elasticsearch endpoint
            api_key=os.getenv("ELASTICSEARCH_API_KEY"),
        )

    def _setup_index(self):
        """Sets up the index if it doesn't exist"""
        if not self.es.indices.exists(index=self.index_name):
            mapping = {
                "mappings": {
                    "properties": {
                        "embedding": {
                            "type": "dense_vector",
                            "dims": 1024,
                            "index": True,
                            "similarity": "cosine",
                        },
                        "modality": {"type": "keyword"},
                        "content": {"type": "binary"},
                        "description": {"type": "text"},
                        "metadata": {"type": "object"},
                        "content_path": {"type": "text"},
                    }
                }
            }
            self.es.indices.create(index=self.index_name, body=mapping)

    def index_content(self, embedding, modality, content=None, description="", metadata=None, content_path=None):
        """Indexes multimodal content"""
        doc = {
            "embedding": embedding.tolist(),
            "modality": modality,
            "description": description,
            "metadata": metadata or {},
            "content_path": content_path,
        }
        if content:
            # Binary content is stored base64-encoded
            doc["content"] = base64.b64encode(content).decode() if isinstance(content, bytes) else content
        return self.es.index(index=self.index_name, document=doc)

    def search_similar(self, query_embedding, modality=None, k=5):
        """Searches for similar contents"""
        query = {
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding.tolist(),
                "k": k,
                "num_candidates": 100,
                "filter": [{"term": {"modality": modality}}] if modality else [],
            }
        }
        try:
            response = self.es.search(index=self.index_name, query=query, size=k)
            # Return both source data and score for each hit
            return [{**hit["_source"], "score": hit["_score"]} for hit in response["hits"]["hits"]]
        except Exception as e:
            print(f"Error processing search_similar: {str(e)}")
            # Return an empty list so callers can still iterate over the result
            return []

Running the indexing

Now let's index one piece of evidence to test the process. Create a file named test_index.py in the project root with the following content:

# Example: Indexing an image from the crime scene
import os
import sys

# Make the modules in src/ importable
sys.path.append(os.path.join(os.path.dirname(__file__), "src"))

from elastic_manager import ElasticsearchManager
from embedding_generator import EmbeddingGenerator

generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")

response = es_manager.index_content(
    embedding=image_embedding,
    modality="vision",
    description="Photo of the crime scene with suspicious traces",
    content_path="data/images/crime_scene1.jpg",
)
print(response)

Expected Elasticsearch output (summary of the indexed document):

{
  "embedding": [0.12, -0.53, 0.89, ...],
  "modality": "vision",
  "description": "Photo of the crime scene with suspicious traces",
  "content_path": "data/images/crime_scene1.jpg"
}

To index all the multimodal evidence, run the following command:

python stages/03-stage/index_all_modalities.py

The evidence is now stored in Elasticsearch and can be retrieved whenever needed.
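
When many files are indexed at once, sending one request per document can be slow. The `helpers` module imported in elastic_manager.py provides `helpers.bulk`, which accepts an iterable of action dictionaries. The sketch below only builds such actions and does not contact a cluster. The function name `build_bulk_actions` and the sample records are hypothetical, and the actual index_all_modalities.py script may be organized differently.

```python
def build_bulk_actions(records, index_name="multimodal_content"):
    """Yield one bulk 'index' action per evidence record.

    Each record is a dict with 'embedding' (list of floats), 'modality',
    and optionally 'description', 'metadata', and 'content_path',
    mirroring the fields written by index_content.
    """
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                "embedding": list(record["embedding"]),
                "modality": record["modality"],
                "description": record.get("description", ""),
                "metadata": record.get("metadata") or {},
                "content_path": record.get("content_path"),
            },
        }


# Toy records with 3-dimensional vectors (the real index expects 1024 dimensions)
records = [
    {"embedding": [0.1, 0.2, 0.3], "modality": "vision",
     "description": "Photo of the crime scene",
     "content_path": "data/images/crime_scene1.jpg"},
    {"embedding": [0.4, 0.5, 0.6], "modality": "audio",
     "description": "A sinister laugh",
     "content_path": "data/audios/joker_laugh.wav"},
]
actions = list(build_bulk_actions(records))
print(len(actions), actions[0]["_source"]["modality"])

# With a connected client `es`, the actions could then be sent in one request:
#   from elasticsearch import helpers
#   helpers.bulk(es, actions)
```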

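Verifying the indexing process

After running the indexing script, let's verify that all the evidence is stored correctly in Elasticsearch. You can run a few validation queries from Kibana Dev Tools:

1) First, check that the index was created:

GET _cat/indices/multimodal_content?v

2) Then, verify the number of documents per modality (modality is mapped as keyword, so the aggregation uses the field directly):

GET multimodal_content/_search
{
  "size": 0,
  "aggs": {
    "modalities": {
      "terms": {
        "field": "modality"
      }
    }
  }
}

3) Finally, inspect the structure of an indexed document:

GET multimodal_content/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  }
}

Expected results:

An index named multimodal_content should exist.
About 7 documents spread across the modalities (vision, audio, text, depth).
Each document should contain the embedding, modality, description, metadata, and content_path fields.

This verification step ensures that our evidence database is set up correctly before we run similarity searches.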

Searching for similar evidence in Elasticsearch

Now that the evidence is indexed, we can run searches to find the records most similar to a new clue. The search uses vector similarity to return the closest records in the embedding space.

The following code performs this search:

def search_similar_evidence(self, query_embedding, k=5, modality=None):
    """Performs a kNN search to find the most similar clues."""
    knn_query = {
        "field": "embedding",
        "query_vector": query_embedding.tolist(),
        "k": k,
        "num_candidates": 100,
    }
    query_body = {"knn": knn_query}
    if modality:
        # Restrict the results to a single modality
        query_body = {"bool": {"must": [query_body, {"term": {"modality": modality}}]}}
    try:
        results = self.es.search(
            index=self.index_name,
            query=query_body,
            _source_includes=["description", "modality", "content_path"],
            size=k,
        )
    except Exception as e:
        print(f"Error processing search_evidence: {str(e)}")
        # Return an empty list so callers can still iterate over the result
        return []
    return results["hits"]["hits"]
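
An alternative worth knowing about is the top-level `knn` option of the Elasticsearch search API, which accepts an optional `filter` that is applied during the vector search itself. The sketch below only builds that request body. The function name `build_knn_body` is hypothetical, and passing the result as `es.search(index=..., knn=body, size=k)` assumes an 8.x Python client; it is not how the code in this post is written.

```python
def build_knn_body(query_vector, k=5, num_candidates=100, modality=None):
    """Build the body of a top-level kNN search, optionally restricted to one modality."""
    if k > num_candidates:
        raise ValueError("num_candidates must be at least k")
    body = {
        "field": "embedding",
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }
    if modality:
        # Only add the filter when a modality is requested
        body["filter"] = {"term": {"modality": modality}}
    return body


unfiltered = build_knn_body([0.1, 0.2, 0.3], k=3)
vision_only = build_knn_body([0.1, 0.2, 0.3], k=3, modality="vision")
print("filter" in unfiltered, vision_only["filter"])
```

Building the body in a small pure function keeps the query shape easy to unit test without a running cluster.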

Testing the search - using audio as the query for multimodal results

Now let's test the evidence search with a suspicious audio file. We generate the embedding for this file in the same way and search for similar embeddings:

python stages/03-stage/search_by_audio.py

# Initialize classes (credentials are read from the environment)
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

# Generate embedding for a suspicious audio (inputs are always passed as a list)
audio_embedding = generator.generate_embedding(["data/audios/mysterious_laugh.wav"], "audio")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar_evidence(audio_embedding, k=3)

# Display the retrieved results
print("\n🔎 Similar evidence found:\n")
for i, evidence in enumerate(similar_evidences, start=1):
    description = evidence['_source']['description']
    modality = evidence['_source']['modality']
    score = evidence['_score']
    content_path = evidence['_source'].get('content_path', 'N/A')
    print(f"{i}. {description} ({modality})")
    print(f"   Similarity: {score:.4f}")
    print(f"   File path: {content_path}\n")

Expected output (terminal):

🔎 Similar evidence found:

1. A sinister laugh captured near the crime scene (audio)
   Similarity: 0.9985
   File path: data/audios/joker_laugh.wav

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.6068
   File path: data/images/joker_laughing.png

3. Suspect dancing (vision)
   Similarity: 0.5591
   File path: data/images/jdancing.png

Now we can analyze the retrieved evidence and determine how relevant it is to the case.
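
One detail helps when reading these numbers: for a dense_vector field that uses cosine similarity, Elasticsearch reports the kNN `_score` as (1 + cosine) / 2, so scores fall between 0 and 1. The sketch below converts a score back to the raw cosine and keeps only the stronger hits. The helper names and the 0.6 cut-off are assumptions chosen for illustration, not values from the original pipeline.

```python
def score_to_cosine(score):
    """Invert Elasticsearch's cosine kNN scoring: _score = (1 + cosine) / 2."""
    return 2.0 * score - 1.0


def strong_matches(hits, min_score=0.6):
    """Keep hits whose _score reaches min_score (an arbitrary illustrative cut-off)."""
    return [hit for hit in hits if hit["_score"] >= min_score]


# Scores and paths copied from the sample output above
hits = [
    {"_score": 0.9985, "content_path": "data/audios/joker_laugh.wav"},
    {"_score": 0.6068, "content_path": "data/images/joker_laughing.png"},
    {"_score": 0.5591, "content_path": "data/images/jdancing.png"},
]
for hit in strong_matches(hits):
    print(hit["content_path"], round(score_to_cosine(hit["_score"]), 4))
```

Seen this way, a score of 0.5 corresponds to a cosine of 0, meaning no directional similarity, which is a useful baseline when judging cross-modal matches.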

Beyond audio - exploring multimodal search

Reversing the roles: any modality can be the "question"

In our multimodal RAG system, every modality is a potential search query. Let's go beyond the audio example and see how other data types can start an investigation.

1) Search by text (deciphering the criminal's notes)
Scenario: you found an encrypted text message and want to find related evidence.

python stages/03-stage/search_by_text.py

# Generate embedding from text
text = "Why so serious?"
embedding_text = generator.generate_embedding([text], "text")

# Search for related evidence
similar_evidences = es_manager.search_similar(
    query_embedding=embedding_text,
    k=3
)

Expected result:

🔎 Similar evidence found:

1. Mysterious note found at the location (text)
   Similarity: 0.7639
   File path: data/texts/riddle.txt

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.7161
   File path: data/images/joker_laughing.png

3. Why so serious (text)
   Similarity: 0.7132
   File path: data/texts/note2.txt

2) Search by image (tracking a suspicious crime scene)
Scenario: a new crime scene image (crime_scene2.jpg) needs to be compared with the other evidence.

python stages/03-stage/search_by_image.py

# Generate embedding for a suspicious image
vision_embedding = generator.generate_embedding(["data/images/crime_scene2.jpg"], "vision")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar(
    query_embedding=vision_embedding,
    k=3
)

Output:

🔎 Similar evidence found:

1. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.8258
   File path: data/images/crime_scene1.jpg

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.6897
   File path: data/images/joker_laughing.png

3. Suspect dancing (vision)
   Similarity: 0.6588
   File path: data/images/jdancing.png

3) Search by depth map (3D tracking)
Scenario: a depth map (jdancing-depth.png) reveals the visual pattern of an escape route.

python stages/03-stage/search_by_depth.py

# Generate embedding for a suspicious depth map
depth_embedding = generator.generate_embedding(["data/depths/jdancing-depth.png"], "depth")

# Search for similar evidence in Elasticsearch, restricted to the vision modality
similar_evidences = es_manager.search_similar(
    query_embedding=depth_embedding,
    modality="vision",
    k=3
)

Output:

🔎 Similar evidence found:

1. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.5329
   File path: data/images/joker_laughing.png

2. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.5053
   File path: data/images/crime_scene1.jpg

3. Suspect dancing (vision)
   Similarity: 0.4859
   File path: data/images/jdancing.png

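Why does this matter?

Each modality reveals a distinct kind of connection:

Text → The suspect's language patterns.
Images → Recognition of locations and objects.
Depth → 3D scene reconstruction.

We now have a structured evidence database in Elasticsearch that lets us store and retrieve multimodal evidence efficiently.

Summary of what we did:

Stored multimodal embeddings in Elasticsearch.
Ran similarity searches to find evidence related to new clues.
Tested the search with a suspicious audio file to confirm the system works.

Next step:

We will use a large language model (LLM) to analyze the retrieved evidence and generate the final report.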

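Stage 4 - Connecting the clues with an LLM

Now that the evidence is indexed in Elasticsearch and can be retrieved by similarity, we need a large language model (LLM) to analyze it and generate a final report for Commissioner Gordon. The LLM is responsible for identifying patterns, connecting clues, and proposing a likely suspect based on the retrieved evidence.

For this task we will use GPT-4 Turbo with a detailed prompt that lets the model interpret the results effectively.

LLM integration

To integrate the LLM into our system, we created the LLMAnalyzer class (src/llm_analyzer.py), which takes the evidence retrieved from Elasticsearch and uses it as prompt context to generate a forensic report.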

import logging
import os

from dotenv import load_dotenv
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class LLMAnalyzer:
    """Evidence analyzer using GPT-4"""

    def __init__(self):
        load_dotenv()
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def analyze_evidence(self, evidence_results):
        """
        Analyzes multimodal search results and generates a report

        Args:
            evidence_results: Dict with results by modality
                {
                    'vision': [...],
                    'audio': [...],
                    'text': [...],
                    'depth': [...]
                }
        """
        # Format evidence for the prompt
        # (_format_evidence is part of the full source file; it is not shown in this excerpt)
        evidence_summary = self._format_evidence(evidence_results)

        # final prompt
        prompt = f"""
You are a highly experienced forensic detective specializing in multimodal evidence analysis. Your task is to analyze the collected evidence (audio, images, text, depth maps) and conclusively determine the **prime suspect** responsible for the Gotham Central Bank case.

---

### **Collected Evidence:**
{evidence_summary}

### **Task:**
1. **Analyze all the evidence** and identify cross-modal connections.
2. **Determine the exact identity of the criminal** based on behavioral patterns, visual/auditory/textual clues, and symbolic markers.
3. **Justify your conclusion** by explaining why this suspect is definitively responsible.
4. **Assign a confidence score (0-100%)** to your conclusion.

---

### **Final Output Format (Strictly Follow This Format):**
- **Prime Suspect:** [Full Name or Alias]
- **Evidence Supporting Conclusion:** [Detailed breakdown of visual, auditory, textual, and behavioral evidence]
- **Behavioral Patterns:** [Key actions, motives, and criminal signature]
- **Confidence Level:** [0-100%]
- **Next Steps (if any):** [What additional evidence would further confirm the identity? If none, state "No further evidence required."]

If there is **insufficient evidence**, specify exactly what is missing and suggest what additional data would be needed for a conclusive identification.

This report must be **direct and definitive**--avoid speculation and provide a final, actionable determination of the suspect's identity.
"""
        try:
            response = self.client.chat.completions.create(
                model="gpt-4-turbo-preview",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a forensic detective specialized in multimodal evidence analysis.",
                    },
                    {"role": "user", "content": prompt},
                ],
                temperature=0.5,
                max_tokens=1000,
            )
            report = response.choices[0].message.content
            logger.info("\n📋 Forensic Report Generated:")
            logger.info("=" * 50)
            logger.info(report)
            logger.info("=" * 50)
            return report
        except Exception as e:
            logger.error(f"Error generating report: {str(e)}")
            return None

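The `_format_evidence` helper called by analyze_evidence is not shown in the listing above. Below is a minimal sketch of what such a helper might do, written as a standalone function named `format_evidence` (a hypothetical name). It assumes each result is a dictionary with `description`, `content_path`, and `score` keys, which is the shape returned by search_similar; the repository's own version may differ.

```python
def format_evidence(evidence_results):
    """Render search results, grouped by query modality, as plain text for the prompt."""
    lines = []
    for modality, results in evidence_results.items():
        lines.append(f"{modality.upper()} evidence:")
        if not results:
            lines.append("  (no matches)")
        for result in results:
            description = result.get("description", "No description")
            path = result.get("content_path", "N/A")
            score = result.get("score", 0.0)
            lines.append(f"  - {description} (similarity: {score:.2f}, file: {path})")
    return "\n".join(lines)


sample = {
    "audio": [
        {"description": "A sinister laugh captured near the crime scene",
         "content_path": "data/audios/joker_laugh.wav", "score": 0.9985},
    ],
    "text": [],
}
print(format_evidence(sample))
```

Rounding the similarity to two decimals matches how the scores are quoted in the sample report later in this post (for example 0.83 and 0.69).
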
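Temperature setting in the LLM analysis:

For our forensic analysis system we use a moderate temperature of 0.5. We chose this balanced setting because:

It sits between deterministic output (too rigid) and highly random output.
At 0.5, the model stays structured enough to give logical, defensible forensic conclusions.
It lets the model recognize patterns and draw connections while staying within reasonable bounds for forensic analysis.
It balances the need for consistent, reliable output with the ability to generate insightful analysis.

This moderate temperature helps keep our forensic analysis both reliable and insightful, avoiding conclusions that are either too rigid or too speculative.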
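
The effect of temperature can be seen on a toy example: logits are divided by the temperature before the softmax, so lower values sharpen the distribution and higher values flatten it. This is a generic illustration with made-up logits, not a description of how OpenAI implements sampling, and it requires a strictly positive temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens
logits = [2.0, 1.0, 0.5]
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature falls from 1.0 to 0.2, the probability of the top candidate grows, which is why lower settings give more repeatable reports and higher settings give more varied ones.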

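Running the evidence analysis

Now that the LLM is integrated, we need a script that connects all the system components. This script will:

Search Elasticsearch for similar evidence.
Use the LLM to analyze the retrieved evidence and generate the final report.

Code: evidence analysis script

python stages/04-stage/rag_crime_analyze.py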

import sys
import os
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(__file__)), 'src'))

from embedding_generator import EmbeddingGenerator
from elastic_manager import ElasticsearchManager
from llm_analyzer import LLMAnalyzer

import json
import logging
from dotenv import load_dotenv

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Initialize classes
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()
llm = LLMAnalyzer()
logger.info("✅ All components initialized successfully")

try:
    evidence_data = {}

    # Get data for each modality
    test_files = {
        'vision': 'data/images/crime_scene2.jpg',
        'audio': 'data/audios/joker_laugh.wav',
        'text': 'Why so serious?',
        'depth': 'data/depths/jdancing-depth.png'
    }

    logger.info("🔍 Collecting evidence...")
    for modality, test_input in test_files.items():
        try:
            if modality == 'text':
                embedding = generator.generate_embedding([test_input], modality)
            else:
                embedding = generator.generate_embedding([str(test_input)], modality)

            results = es_manager.search_similar(embedding, k=2)
            if results:
                evidence_data[modality] = results
                logger.info(f"✅ Data retrieved for {modality}: {len(results)} results")
            else:
                logger.warning(f"⚠️ No results found for {modality}")
        except Exception as e:
            logger.error(f"❌ Error retrieving {modality} data: {str(e)}")

    if not evidence_data:
        raise ValueError("No evidence data found in Elasticsearch!")

    # Test forensic report generation
    logger.info("\n📝 Generating forensic report...")
    report = llm.analyze_evidence(evidence_data)

    if report:
        logger.info("✅ Forensic report generated successfully")
        logger.info("\n📊 Report Preview:")
        logger.info("+" * 50)
        logger.info(report)
        logger.info("+" * 50)
    else:
        raise ValueError("Failed to generate forensic report")
except Exception as e:
    logger.error(f"❌ Error in analysis : {str(e)}")

Expected LLM output:

**Prime Suspect:** The Joker

**Evidence Supporting Conclusion:**

- **Visual Evidence:**
  - The photo of the crime scene with playing cards scattered around and the graffiti of the Joker laughing matches the Joker's known calling cards and thematic elements. The similarity score of 0.83 indicates a high likelihood that these elements are directly associated with the Joker.
  - The image of the Joker with green hair, white face paint, and a sinister smile in an urban night setting, although with a lower similarity score of 0.69, still supports the presence or recent activity of the Joker in areas consistent with the crime scene's characteristics.

- **Auditory Evidence:**
  - The captured sinister laugh with a similarity score of 1.00 perfectly matches known audio profiles of the Joker, making it a direct auditory signature of his presence at or near the crime scene.
  - Despite the lower similarity score of 0.61, the second audio piece further corroborates the Joker's involvement through thematic consistency.

- **Textual Evidence:**
  - The mysterious note found at the location, with a similarity score of 0.76, likely contains thematic or direct references to the Joker's modus operandi or signature phrases, further implicating him in the crime.
  - The similarity score of 0.72 for the Joker's description in textual evidence reinforces the thematic connection to the crime scene.

- **Depth Evidence:**
  - Depth sensor capture of the suspect with a similarity score of 0.77 suggests a physical presence matching the Joker's known dimensions or characteristic movements.
  - The lower similarity score of 0.53 in the second depth evidence still contributes to the overall pattern of evidence pointing towards the Joker, albeit with less certainty.

**Behavioral Patterns:**
- The Joker is known for his theatrical crimes, often leaving behind a signature trail of chaos, including playing cards, sinister laughter, and thematic graffiti. These elements are not only consistent with his known criminal signature but also directly observed at the crime scene.
- His motives often include creating chaos, drawing attention to his acts, and challenging his arch-nemesis, Batman, making a high-profile bank heist fitting within his behavioral patterns.

**Confidence Level:** 95%

**Next Steps:** No further evidence required.

The combination of visual, auditory, textual, and depth evidence strongly points to the Joker as the prime suspect. The thematic consistency across multiple modes of evidence, combined with known behavioral patterns and criminal signature, leaves little doubt regarding his involvement. While there is always a small margin of uncertainty in forensic analysis, the evidence at hand provides a compelling case against the Joker with a high degree of confidence.

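Conclusion: Case closed

By collecting and analyzing all the clues, the multimodal RAG system identified the suspect: the Joker.

By using ImageBind to map images, audio, text, and depth maps into a shared vector space, the system can surface connections that manual inspection would miss. Elasticsearch provides fast, efficient search, and the LLM synthesizes the evidence into a clear report with a decisive conclusion.

The real potential of this system, however, goes far beyond Gotham City. The multimodal RAG architecture opens the door to many real-world applications:

Urban surveillance: identifying suspects from image, audio, and sensor data.
Forensic analysis: correlating evidence from multiple sources to solve complex cases.
Multimedia recommendation: building recommendation systems that understand multimodal context (for example, recommending music based on an image or text).
Social media trends: detecting trending topics across different data formats.

Now that you have learned how to build a multimodal RAG system, why not test it with your own clues?

Share your findings and help the community advance multimodal AI!

Special thanks

I would like to thank Adrian Cole for his valuable contributions and review while defining the deployment architecture for this code.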



Original article: Building a Multimodal RAG system with Elasticsearch: The story of Gotham City - Elasticsearch Labs
