
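Semantic, Vector, and Hybrid Search All in the Kibana Console

Source: original, 2025/8/4 23:13:58

Author: Mark_Laney, from Elastic

Want to combine your regular Elasticsearch queries with the new AI search capabilities? Don't you need to connect to some third-party large language model (LLM) for that? No. You can use Elastic's ELSER model to create semantic vectors and combine them with your regular searches to get more relevant results!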

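Introduction

When it comes to using AI and applying vector search to data in Elastic, I see a lot of tutorials and articles, and they can get pretty complicated. I want to show that you can take the existing searches you are running today and enhance them with semantic and vector search. You don't need to connect to any external LLM or pay a big company to process your data. Using only Elasticsearch and Kibana, you can download Elastic's semantic model ELSER, process your data to add descriptive vectors, and augment your searches to explore the improvement. You don't need LangChain, Python, ChatGPT, or any other external tool.

Note: ELSER currently supports English only. It is a sparse-vector approach to search. For further reading, see "Elasticsearch: Unleashing the power of semantic search with ELSER: Elastic Learned Sparse Encoder".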

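Platform

The searches we will run were done on Elasticsearch and Kibana version 8.13.4. You can run them on any platform you like, on-premises or in the cloud.

Data

The data we will search comes from an openly licensed Kaggle dataset called "recipes". The original dataset can be downloaded here (it is discussed on page 173 of "Vector Search for Practitioners with Elastic"). I downloaded the file and named it "allrecipes.csv".

Alternatively, you can download the dataset from my GitHub. Keep that link open to download the other files mentioned in the instructions that follow.

Removing duplicates

The original dataset contains duplicate entries. I wrote a small Python script to deduplicate them. The script is in the GitHub project folder and is named dedupecsv.py. The deduplicated result file is called allrecipes_dedupe.csv, and it is also among the project files. If you want to run dedupecsv.py yourself, you need to install the Python library pandas (for example, pip install pandas). You don't have to do the deduplication, though, because it has already been done and the result is allrecipes_dedupe.csv. Download that file.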
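For readers curious what such a deduplication step might look like, here is a minimal sketch. It is not the author's dedupecsv.py (which uses pandas); it uses only the standard library, assumes a "duplicate" means a row identical in every column, and the sample column names and rows are invented for illustration. Nothing in the rest of the tutorial depends on it.

```python
import csv
import io


def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_csv_text(text):
    """Deduplicate the data rows of CSV text, leaving the header row in place."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = dedupe_rows(list(reader))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data: the first and third data rows are identical.
sample = "name,group\nOld Fashioned,drinks\nShrimp Scampi,main\nOld Fashioned,drinks\n"
deduped = dedupe_csv_text(sample)
print(deduped)
```

Keeping the first occurrence preserves the original row order, which makes the output easy to diff against the input.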

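Indexing the data

I was able to import the data into an Elasticsearch index using both Filebeat and (even easier) the file uploader in Kibana, most recently with version 8.13.4, but older versions work too. If you use Filebeat to import allrecipes_dedupe.csv into Elasticsearch, the configuration file is filebeat.yml, and a copy is in the project folder. I installed Filebeat by following the "quick install" documentation and used it as a "self-hosted" system. If you are working on a remote system (such as Cloud or Strigo), scp both filebeat.yml and allrecipes_dedupe.csv to your instance.

Using the file uploader is easier, though: just download allrecipes_dedupe.csv and drag and drop it into Kibana.

With either method, create an Elasticsearch index named "recipes". Filebeat will do this for you; in the file uploader, you type in the name.

Kibana

Step 1: Let's verify that the recipes index was successfully imported into Elasticsearch.

Run this command in the Kibana Console (Kibana main menu -> Management -> Dev Tools):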

GET _cat/indices?v&s=index

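The _cat API provides output in rows and columns. The characters after the question mark (?v&s=index) are options to show header labels (v) and to sort by the "index" column (&s=index).

Another way to verify the recipes index is to use the _count API.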

GET recipes/_count

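Step 2: Download ELSER

The Kibana UI lets us download the model in the Machine Learning area. Go to the main menu -> Machine Learning. Find and click "Trained Models". You should see a short list of available models. For non-Intel environments, download .elser_model_2. If your platform uses Intel x86 chips, download the specially prepared .elser_model_2_linux-x86_64.

Step 3: Start the model

Start deploying the model for inference. Use one allocation and four threads. In most cases, an allocation consumes one core, and the threads are threads on that core. The API call below gives the deployment the name "elser_model".

If you are running on an Intel x86 chip: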

POST _ml/trained_models/.elser_model_2_linux-x86_64/deployment/_start?deployment_id=elser_model&number_of_allocations=1&threads_per_allocation=4

Otherwise:

POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_model&number_of_allocations=1&threads_per_allocation=4

If you ever need to stop the model, you can do this:

POST _ml/trained_models/elser_model/deployment/_stop

If a pipeline is calling the model (we will get to that later):

POST _ml/trained_models/elser_model/deployment/_stop?force=true

Make sure the model is in the started state:

GET _ml/trained_models/_stats/
GET _ml/trained_models/_stats?filter_path=trained_model_stats.deployment_stats.state

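Step 4: Explore the recipes index

Run these commands to examine the contents and properties of the recipes index.

Check some of the documents in the index: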

GET recipes/_search

To see the fields and their data types:

GET recipes/_mapping

To see all of the index's properties (mappings, settings, aliases), it is actually even easier:

GET recipes

Step 5: Improve the mapping

If we run this:

GET recipes/_search?filter_path=hits.hits._source.id

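Notice that all of the IDs are small integers. However, if we go back and check the mapping, we see that the data type of id is probably text/keyword or long (depending on which tool we used to index the recipes).

Also notice that the summary field is of type text, which uses the standard analyzer by default. It would help our searches to use the english analyzer, which provides stemming. That way, if we search the summary for "stewed" tomatoes, the results will include stewed, stew, stews, stewing, and so on, which may make it easier to find relevant recipes.

In addition, we need to create a field to hold the results of vectorizing the data. Later we will configure an ingest pipeline processor that tells the model to write its results to a field named ml.tokens. As explained here, ELSER model results should be stored with the data type "sparse_vector" for our vector embeddings.

With these changes in mind, let's create a new index, one that does not exist yet, with all of the improved mapping properties.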

PUT recipes_embeddings
{"mappings": {"properties": {"id": {"type": "short"},"group": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"ingredient": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"n_rater": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"n_reviewer": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"name": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"process": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"rating": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"summary": {"type": "text","analyzer": "english","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"ml.tokens": {"type": "sparse_vector"}}}
}
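To build some intuition for what a sparse_vector field such as ml.tokens holds, the sketch below models an embedding as a map of tokens to weights and scores a document by summing the weight products over the tokens it shares with the query. This is only a rough illustration of how sparse-vector matching behaves, not Elasticsearch's actual implementation; the tokens and weights are made up, and the Python is not needed for the tutorial itself.

```python
def sparse_dot(query_tokens, doc_tokens):
    """Score a document as the sum of weight products over tokens both sides share."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# Hypothetical token weights, loosely imitating what a sparse encoder might emit.
doc_shrimp = {"shrimp": 2.0, "prawn": 1.5, "seafood": 1.0}
doc_pork = {"pork": 2.0, "tenderloin": 1.5, "meat": 1.0}
query = {"shrimp": 1.5, "seafood": 0.5, "tempura": 1.0}

shrimp_score = sparse_dot(query, doc_shrimp)  # shares "shrimp" and "seafood"
pork_score = sparse_dot(query, doc_pork)      # shares no tokens
print(shrimp_score, pork_score)
```

Because the model can assign weight to related tokens that never appear in the original text, a document can match a query without sharing its literal words, which is what the searches later in this article rely on.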

Step 6: Other document fixes

Let's take another look at some of the recipe documents.

GET recipes/_search
{"size": 2000,"_source": ["ingredient"]
}

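If you press Ctrl+F in the right-hand pane of the Console and search, you may notice that some documents contain doubled or tripled quotation marks.

Here we create an ingest pipeline to clean them up; the gsub pattern below removes every double-quote character from the ingredient field.
We will call our pipeline "doublequotes".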

PUT _ingest/pipeline/doublequotes
{"processors": [{"gsub": {"field": "ingredient","pattern": "\"(.*?)","replacement": ""}}]
}

To test our pipeline, we can use the _simulate API.

POST _ingest/pipeline/doublequotes/_simulate
{"docs": [{"_index": "recipes","_id": "id","_source": {"ingredient": """shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)""","tags": 2342}}]
}
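To see what the gsub pattern in the doublequotes pipeline does to a string, here is a small stand-alone sketch. It uses Python's re module as a stand-in for the Java regular expressions that the gsub processor runs on; for this particular pattern the two behave the same way. The lazy group (.*?) matches the empty string, so each match is a single double-quote character and the replacement simply deletes it. The sample text is a shortened, illustrative version of the ingredient value used in the simulation.

```python
import re

# Same pattern as the gsub processor, after JSON unescaping: "(.*?)
GSUB_PATTERN = r'"(.*?)'


def strip_double_quotes(text):
    """Apply the pipeline's pattern with an empty replacement."""
    return re.sub(GSUB_PATTERN, "", text)


# Shortened, illustrative sample with tripled and embedded quotes.
sample = '"""prep: 15 mins","2 (10 inch) flour tortillas"""'
cleaned = strip_double_quotes(sample)
print(cleaned)
```

Since the group never captures anything, the simpler pattern consisting of just a double quote would have the same effect.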

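Step 7: Pipeline with inference

Below we create a pipeline that cleans up the quotes and applies an inference processor. The inference processor is what we use to run our model (ELSER, in this case) against the ingredient field in the index. Recall that we named our deployment elser_model; here we see it referenced as model_id. We will call this pipeline elser_clean_recipes.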

PUT _ingest/pipeline/elser_clean_recipes
{"processors": [{"pipeline": {"name": "doublequotes"},"inference": {"model_id": "elser_model","target_field": "ml","field_map": {"ingredient": "text_field"},"inference_config": {"text_expansion": {"results_field": "tokens"}}}}]
}

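Note that the model vectorizes a field named "text_field" by default. In the "field_map" line, we configure the inference processor to use a different field (ingredient, in this case).

Be sure to test it.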

POST _ingest/pipeline/elser_clean_recipes/_simulate
{"docs": [{"_index": "recipes","_id": "id","_source": {"ingredient": """prep: 20 mins,cook: 20 mins,total: 40 mins,Servings: 4,Yield: 4 servings","½ small onion, chopped + ½ tomato, chopped + 1 jalapeno pepper, seeded and minced + 1 sprig fresh cilantro, chopped + 6 eggs, beaten + 4 (10 inch) flour tortillas + 2 cups shredded Cheddar cheese + ¼ cup sour cream, for topping + ¼ cup guacamole, for topping""","tags": 2342}},{"_index": "recipes","_id": "id","_source": {"ingredient":""""shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)"""}},{"_index": "recipes","_id": "id","_source": {  "ingredient": """shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)"""}}]
}

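Step 8: Process the data through the pipeline

Now that we have the pipeline, it is time to process our index. We will use the _reindex API to send the data from the recipes index to recipes_embeddings. Along the way, the data passes through our pipeline to create the embeddings.

Reindexing can take a long time, so we run it here with the option wait_for_completion=false.

When we run the command, it returns a task ID that we can use to check progress. Be sure to copy this ID and paste it somewhere.

In addition, the reindex uses the options...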

requests_per_second=-1&timeout=60m

...to run as fast as possible and to avoid timing out too soon, respectively.

Warning: this may take about 15-30 minutes.

POST _reindex?wait_for_completion=false&requests_per_second=-1&timeout=60m
{"conflicts": "proceed", "source": {"index": "recipes","size": 500},"dest": {"index": "recipes_embeddings","pipeline": "elser_clean_recipes"}
}

Copy the task ID and use it to track the reindex process.

GET _tasks/< paste task number here >

For example, I copied and pasted this ID: LAV3l8oZTmaR9p8VUVqO3g:373447

If you need to, you can delete the recipes_embeddings index like this.

DELETE recipes_embeddings

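Step 9: Check the processed documents

Once at least one batch of documents has completed, check the results.

GET _cat/indices?v&s=index
GET recipes_embeddings/_search

The lengthy ml.tokens field makes the output hard to read. You can suppress it like this.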

GET recipes_embeddings/_search
{ "_source": { "excludes": "ml"} }

GET recipes_embeddings/_count
4808 documents

We can run this aggregation to find all of the distinct values of "group" that we can query.

GET recipes_embeddings/_search?size=0
{"aggs": {"all the groups": {"terms": {"field": "group.keyword","size": 200}}}
}

How many buckets are there?

GET recipes_embeddings/_search
{"size": 0, "aggs": {"how many buckets": {"cardinality": {"field": "group.keyword","precision_threshold": 200}}}
}

I got 174 distinct groups.

To isolate the group values in the output so we can see them in the documents, you can run the following:

GET recipes_embeddings/_search?size=1000&_source=group

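Searches

Finally, the data is ready and we can run searches against the recipes_embeddings index. Let's compare the results of running without ELSER and with ELSER.

Search 1: Old Fashion (cocktail)

First, let's look for recipes for a cocktail called the Old Fashion. Here is how you make one.

-- Old Fashion bourbon cocktail --
1. Place one or two cherries in an old-fashioned glass and gently crush them with a muddler.
2. Take an orange peel and wipe the inside rim of the glass, then drop the peel on top of the cherries.
3. Add ice, rye whiskey, sugar, and bitters.
4. Stir and serve.

First, without ELSER: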

GET recipes_embeddings/_search
{"_source": {"excludes": "ml", "includes": ["name","group","summary","ingredient"]},"query": {"bool": {"should": [ { "wildcard": { "group": {"value": "drinks*" }}},{"multi_match": {"type": "phrase", "query": "old fashion","fields": ["summary","name"]}},{"match": {"summary": "delicious sensational"}}]}}
}

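Run it in your Console and you will see some pretty bad results among the top hits.
For me, none of them are drinks (even though we searched for the drinks* group).
Many of my results contain phrases like "good old fashion meals and dishes...", which is not what we are looking for.

Now let's search with ELSER... and see the great results!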

GET recipes_embeddings/_search
{"_source": {"excludes": "ml","includes": ["name","group","summary","ingredient"]},"sub_searches": [{"query": {"bool": {"should": [{"wildcard": {"group": {"value": "drinks*"}}},{"multi_match": {"query": "old fashioned","type": "phrase","fields": ["summary","name"]}}]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "old fashioned bourbon whiskey whisky drink"}}}}],"rank": {"rrf": {"window_size": 500,"rank_constant": 60}}
}

Notice that these are all adult beverages, and several have names that begin with "Old Fashion".
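The "rank": {"rrf": ...} clause in the hybrid query merges the two sub-searches with reciprocal rank fusion (RRF). As a minimal sketch of the idea, assuming the usual RRF formula in which each result list contributes 1 / (rank_constant + rank) to a document's score, the code below fuses two small ranked lists. The document names and orderings are invented for illustration, and the Python is not part of the tutorial's workflow.

```python
def rrf_fuse(rankings, rank_constant=60, window_size=500):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Only the top `window_size` entries of each list are considered, and each
    appearance adds 1 / (rank_constant + rank) to the document's score,
    where rank starts at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:window_size], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists: one from the term query, one from the ELSER query.
lexical = ["meatloaf", "old_fashioned", "pot_roast"]
semantic = ["old_fashioned", "manhattan", "meatloaf"]

fused = rrf_fuse([lexical, semantic])
print(fused)
```

A document that ranks well in both lists, like old_fashioned here, beats one that tops a single list, which is why the hybrid query can push the cocktails above the "good old fashion meals" hits.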

Search 2: Shrimp dishes

Without ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"query": {"bool": {"should": [{"wildcard": {"group": {"value": "main*"}}},{"multi_match": {"type": "phrase", "query": "tempura shrimp","fields": ["ingredient","name^2"]}},{"match": {"summary": "tasty delightful"}}]}}
}

The top five results are pretty bad: I got recipes like pork tenderloin, Spanish sauce, and carrot salad... not even any shrimp.

With ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"sub_searches": [{"query": {"bool": {"should": [{"wildcard": {"group": {"value": "main*"}}},{"multi_match": {"query": "tempura shrimp","type":"phrase","fields": ["ingredient","name^2"]}}/*,{"match": {"summary": "tasty delightful"}}*/]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "tempura shrimp"}}}}],"rank": {"rrf": {"window_size": 500,"rank_constant": 60}}
}

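The top five dishes are much better: Shrimp with Pasta, Shrimp Scampis, Penne with Shrimp, Penne with Shrimp, Shrimp Quiche

Notice how Elasticsearch, through semantic search, finds that other terms such as scampi and prawns are also relevant to shrimp.

Search 3: Spaghetti dishes

Without ELSER: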

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"query": {"bool": {"should": [{"wildcard": {"group": {"value": "main*"}}},{"multi_match": {"type": "phrase", "query": "Spaghetti Bolognese","fields": ["ingredient","name^2"]}},{"match": {"summary": "tasty delightful"}}]}}
}

The top five are bad again: pork tenderloin, Med. sauce, carrot salad, lime chicken...

I don't see any spaghetti at all.

With ELSER:

GET recipes_embeddings/_search?filter_path=hits.hits._source
{"_source": {"excludes": "ml"},"sub_searches": [{"query": {"bool": {"should": [{"wildcard": {"group": {"value": "main*"}}},{"multi_match": {"query": "Spaghetti Bolognese","type":"phrase","fields": ["ingredient","name^2"]}}/*,{"match": {"summary": "tasty delightful"}}*/]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "main Spaghetti Bolognese"}}}}],"rank": {"rrf": {"window_size": 500,"rank_constant": 60}}
}

Much improved. Many of the ingredient lists include spaghetti or macaroni.

At the top again: pennes, pastas, spaghettis.

Search 4: Chocolate

Without ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"query": {"bool": {"should": [{"match": {"name": "dessert"}},{"match": {"ingredient": "chocolate"}},{"match": {"summary": "tasty delightful"}}]}}
}

Milkshakes and pancakes...?

I can't find chocolate in any of the ingredients.

With ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"}, "sub_searches": [{"query": {"bool": {"should": [{"match": {"name": "dessert"}},{"match": {"ingredient": "chocolate"}},{"match": {"summary": "tasty delightful"}}]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "dessert chocolate"}}}}],"rank": {"rrf": {"window_size": 50,"rank_constant": 20}}
}

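Peppermint bark is an obscure term for chocolate.

Nanaimo bars are a chocolate-topped cookie. I also see hot chocolate, chocolate muffins, chocolate cake, Oreo truffles, and cake balls, all with chocolate among their ingredients.

I see chocolate in many of the ingredient lists.

Let's take another look at the "groups" we can query.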

GET recipes_embeddings/_search?size=0
{"aggs": {"all the groups": {"terms": {"field": "group.keyword","size": 200}}}
}

How many recipes are there in everyday-cooking?

GET recipes_embeddings/_count
{"query": {"wildcard": {"group.keyword": {"value": "everyday-cooking*"}}}
}

I get 310.

Fish Sandwich

Let's look for a "Fish Sandwich" in everyday cooking.

Without ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"query": {"bool": {"should": [{"wildcard": {"group": {"value": "everyday-cooking*"}}},{"multi_match": {"type": "phrase", "query": "fish sandwich","fields": ["ingredient","name^2"]}}]}}
}

Nothing!

Wow, no fish sandwiches at all...?

Add the embeddings search using ELSER.

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"sub_searches": [{"query": {"bool": {"should": [{"wildcard": {"group": {"value": "everyday-cooking*"}}},{"multi_match": {"query": "fish sandwich","type":"phrase","fields": ["ingredient","name^2"]}}]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "fish sandwich"}}}}],"rank": {"rrf": {"window_size": 500,"rank_constant": 60}}
}

Search for "sandwich" in the right-hand pane.

I see tuna patties, tuna salads - lots of fish, and many of them are sandwiches.

Let's check how many recipes are in the "Main" category.

GET recipes_embeddings/_search
{"_source": {"excludes": "ml"},"query": {"bool": {"must": [{"match": {"group": "main*"}}]}}
}

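I got 458.

Tenderloin Steak

Note that in the United States we say "tenderloins or tenderloin steaks", but in France it is called Chateaubriand. Also note that "Chateaubriand" is not in any of the recipes. You can check this with a multi_match query, which searches across several fields, like this: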

GET recipes_embeddings/_search
{"query": {"multi_match": {"query": "Chateaubriand","fields": ["ingredient","summary","name"]}}
}

Zero results.

Without ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml", "includes": ["name","group","summary","ingredient"]},"query": {"bool": {"should": [ { "wildcard": { "group": {"value": "main*" }}},{"multi_match": {/*"type": "phrase",*/ "query": "tenderloin steak Chateaubriand beef","fields": ["ingredient","name^2"]}},{"match": {"summary": "delicious Chateaubriand"}},{"match": {"ingredient": "beef"}}]}}
}

The top ten results are pretty bad: salt and pepper fries, salt bread, Pork Tenderloin, and so on.

With ELSER:

GET recipes_embeddings/_search
{"_source": {"excludes": "ml", "includes": ["name","group","summary","ingredient"]},"sub_searches": [{"query": {"bool": {"should": [{"wildcard": {"group": {"value": "main*"}}},{"multi_match": {"query": "tenderloin steak Chateaubriand beef",/*"type":"phrase",*/"fields": ["ingredient","name^2"]}},{"match":{"ingredient":{"query": "beef"}}}]}}},{"query": {"text_expansion": {"ml.tokens": {"model_id": "elser_model","model_text": "tenderloin steak Chateaubriand beef"}}}}],"rank": {"rrf": {"window_size": 500,"rank_constant": 60}}
}

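Many more actual steaks.

Congratulations! We have worked through a number of examples where vectorizing the data with ELSER enabled us to do semantic search. We combined regular, possibly pre-existing term searches with vector search and found that the hybrid results are often more relevant than the term search alone.

Original article: Dec 13th, 2024: [EN] Semantic, Vector, and Hybrid Search all in Kibana Console - Advent Calendar - Discuss the Elastic Stack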
