
From LightRAG's prompt to an optimized implementation based on OpenAI Structured Outputs

Source: original. 2025/8/18 23:57:50

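LightRAG is a configuration and management class for building the core components of a RAG system. It integrates document processing, storage, vectorization, knowledge-graph construction, and LLM interaction. You customize the RAG system's behavior by configuring the parameters of a LightRAG instance.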

Entity and relationship extraction in LightRAG is currently driven by the following prompt:

PROMPTS["entity_extraction"] = """---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use {language} as output language.

---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

5. When finished, output {completion_delimiter}

######################
---Examples---
######################
{examples}

#############################
---Real Data---
######################
Entity_types: [{entity_types}]
Text:
{input_text}
######################
Output:"""

Pain points of the original approach:

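Custom delimiters: tuple_delimiter, record_delimiter, and completion_delimiter. The LLM has to follow these non-standard format conventions exactly, and it slips easily: it forgets a delimiter, uses the wrong one, or adds extra text where none should appear.
Parsing complexity: a dedicated parser is needed for the custom text format, and such parsers tend to be fragile and hard to maintain.
Poor robustness: a small deviation in the LLM output can make parsing fail.
Readability and standardization: the output is not a standard format, so it is hard to read by hand and cannot be consumed directly by standard tools.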
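To make the fragility concrete, here is a minimal sketch of the kind of parser the delimiter format requires. It is not LightRAG's actual parser. The delimiter values (`<|>`, `##`, `<|COMPLETE|>`) and the function name `parse_delimited_output` are assumptions chosen for illustration, and only entity records are handled.

```python
import re

# Assumed delimiter values for illustration; the prompt above only names the placeholders.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_delimited_output(raw: str) -> tuple[list[dict], list[str]]:
    """Split delimiter-formatted output into entity records and chunks that failed to parse."""
    entities, unparsed = [], []
    # Everything after the completion delimiter is ignored.
    body = raw.split(COMPLETION_DELIMITER)[0]
    for chunk in body.split(RECORD_DELIMITER):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Each record must be wrapped in one pair of parentheses.
        match = re.fullmatch(r"\((.*)\)", chunk, flags=re.DOTALL)
        if match is None:
            unparsed.append(chunk)
            continue
        fields = [f.strip().strip('"') for f in match.group(1).split(TUPLE_DELIMITER)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append({
                "entity_name": fields[1],
                "entity_type": fields[2],
                "entity_description": fields[3],
            })
        else:
            # This sketch handles entity records only; anything else is set aside.
            unparsed.append(chunk)
    return entities, unparsed


good = '("entity"<|>"Einstein"<|>"person"<|>"Physicist")##<|COMPLETE|>'
bad = '("entity"<|>"Einstein"|"person"<|>"Physicist")##<|COMPLETE|>'  # one malformed delimiter

print(parse_delimited_output(good))
print(parse_delimited_output(bad))
```

A single malformed delimiter in `bad` changes the field count, so the whole record is lost rather than partially recovered.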

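In modern application architectures, JSON is the de facto standard for API communication and data exchange. Engineers rely on precise JSON structures to keep systems interoperable and data intact. When a large language model (LLM) is integrated into such systems, a common pain point is getting its output to follow a predefined JSON format stably and reliably. Free-text output often leads to parse errors, mismatched fields, and inconsistent data types, which forces teams to spend significant effort on fragile parsing logic, validation code, and complex retry mechanisms.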

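**Structured Outputs** was introduced to address this core engineering challenge. The feature lets you specify a JSON Schema for the LLM and constrains the model to generate responses that strictly conform to it. This is more than a formatting convention: it establishes a clear, reliable data contract with the LLM.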
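As a rough illustration of what such a contract could look like for this extraction task, the sketch below builds a JSON Schema and wraps it in the `response_format` shape that OpenAI documents for Structured Outputs (`type: "json_schema"` with `strict: True`). The field names come from the extraction prompt above. The schema name `entity_extraction` and the helper `build_response_format` are assumptions, and whether a given OpenAI-compatible endpoint accepts `json_schema` needs to be checked against that provider's documentation.

```python
import json

# Strict mode expects every property to be listed in "required"
# and "additionalProperties" to be False on each object.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "entity_name": {"type": "string"},
        "entity_type": {"type": "string"},
        "entity_description": {"type": "string"},
    },
    "required": ["entity_name", "entity_type", "entity_description"],
    "additionalProperties": False,
}

RELATIONSHIP_SCHEMA = {
    "type": "object",
    "properties": {
        "source_entity": {"type": "string"},
        "target_entity": {"type": "string"},
        "relationship_description": {"type": "string"},
        "relationship_strength": {"type": "number"},
        "relationship_keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "source_entity",
        "target_entity",
        "relationship_description",
        "relationship_strength",
        "relationship_keywords",
    ],
    "additionalProperties": False,
}

EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": ENTITY_SCHEMA},
        "relationships": {"type": "array", "items": RELATIONSHIP_SCHEMA},
        "content_keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["entities", "relationships", "content_keywords"],
    "additionalProperties": False,
}


def build_response_format(schema: dict, name: str = "entity_extraction") -> dict:
    """Wrap a JSON Schema in the response_format payload used for Structured Outputs."""
    return {"type": "json_schema", "json_schema": {"name": name, "strict": True, "schema": schema}}


response_format = build_response_format(EXTRACTION_SCHEMA)
print(json.dumps(response_format, indent=2, ensure_ascii=False))
```

The resulting dictionary would be passed as `response_format=` to `client.chat.completions.create(...)` on a model that supports Structured Outputs.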
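Following the unified-contract idea behind OpenAI's Structured Outputs, I implemented the extraction as follows. Note that the code requests JSON mode (response_format={"type": "json_object"}), which ensures syntactically valid JSON but not adherence to a schema; the expected structure is described in the prompt and checked after parsing.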

import json
from openai import OpenAI  # assumes you keep using this library

# --- 1. Build the text prompt sent to the LLM (revised) ---
def _fill_placeholders_for_llm_prompt(text_template: str, values: dict) -> str:
    """Fill placeholders in a text template with values from a dict.

    A placeholder can be written as {key} or [{key}].
    """
    filled_text = text_template
    for key, value in values.items():
        # Handle [{key}] first (mainly used for entity_types). Replacing {key} first
        # would consume the inner placeholder, and this branch would never run.
        placeholder_square_curly = "[{" + key + "}]"
        if placeholder_square_curly in filled_text:
            if isinstance(value, list):
                # Render a list as the contents of a JSON-style array
                value_str = ", ".join([f'"{v}"' if isinstance(v, str) else str(v) for v in value])
            else:
                # Assume a comma-separated string or a single type
                value_str = str(value)
            filled_text = filled_text.replace(placeholder_square_curly, f"[{value_str}]")
        placeholder_curly = "{" + key + "}"
        if placeholder_curly in filled_text:
            filled_text = filled_text.replace(placeholder_curly, str(value))
    return filled_text


def build_llm_prompt_for_json_output(template_data: dict, document_text: str, task_params: dict) -> str:
    """Build the full text prompt from the JSON template, the document text and the
    fixed task parameters, and instruct the LLM to output JSON.

    task_params contains: language, entity_types_string, examples
    """
    prompt_lines = []
    # entity_types_string is passed through as a plain string;
    # _fill_placeholders_for_llm_prompt accepts either a string or a list.
    placeholders_to_fill = {
        **task_params,
        "input_text": document_text,
        "entity_types": task_params.get("entity_types_string", ""),
    }

    # --- Goal ---
    prompt_lines.append("---Goal---")
    goal_desc = _fill_placeholders_for_llm_prompt(template_data["goal"]["description"], placeholders_to_fill)
    prompt_lines.append(goal_desc)
    # The literal "{language}" is the fallback if language is missing from task_params
    language = task_params.get("language", "{language}")
    prompt_lines.append(f"Use {language} as output language for any textual descriptions within the JSON.")
    prompt_lines.append("\nIMPORTANT: Your entire response MUST be a single, valid JSON object. Do not include any text or formatting outside of this JSON object (e.g., no markdown backticks like ```json ... ```).")
    prompt_lines.append("")

    # --- Output JSON structure ---
    prompt_lines.append("---Output JSON Structure---")
    prompt_lines.append("The JSON object should have the following top-level keys: \"entities\", \"relationships\", and \"content_keywords\".")

    # Entity structure
    entity_step = next((step for step in template_data["steps"] if step["name"] == "Identify Entities"), None)
    if entity_step and "extraction_details" in entity_step:
        prompt_lines.append("\n1. The \"entities\" key should contain a JSON array. Each element in the array must be a JSON object representing an entity with the following keys:")
        for detail in entity_step["extraction_details"]:
            field_key = detail["field_name"]
            desc_filled = _fill_placeholders_for_llm_prompt(detail["description"], placeholders_to_fill)
            prompt_lines.append(f"    - \"{field_key}\": (string/number as appropriate) {desc_filled}")

    # Relationship structure
    relationship_step = next((step for step in template_data["steps"] if step["name"] == "Identify Relationships"), None)
    if relationship_step and "extraction_details" in relationship_step:
        prompt_lines.append("\n2. The \"relationships\" key should contain a JSON array. Each element must be a JSON object representing a relationship with the following keys:")
        for detail in relationship_step["extraction_details"]:
            field_key = detail["field_name"]
            desc_filled = _fill_placeholders_for_llm_prompt(detail["description"], placeholders_to_fill)
            type_hint = "(string)"  # default
            if "strength" in field_key.lower():
                type_hint = "(number, e.g., 0.0 to 1.0)"
            if "keywords" in field_key.lower() and "relationship_keywords" in field_key:
                type_hint = "(string, comma-separated, or an array of strings)"
            prompt_lines.append(f"    - \"{field_key}\": {type_hint} {desc_filled}")

    # Content keyword structure
    keywords_step = next((step for step in template_data["steps"] if step["name"] == "Identify Content Keywords"), None)
    if keywords_step:
        prompt_lines.append("\n3. The \"content_keywords\" key should contain a JSON array of strings. Each string should be a high-level keyword summarizing the main concepts, themes, or topics of the entire text.")
        keywords_desc = _fill_placeholders_for_llm_prompt(keywords_step["description"], placeholders_to_fill)
        prompt_lines.append(f"   The description for these keywords is: {keywords_desc}")
    prompt_lines.append("\nEnsure all string values within the JSON are properly escaped.")
    prompt_lines.append("")

    # --- Examples ---
    if task_params.get("examples"):
        prompt_lines.append("######################")
        prompt_lines.append("---Examples (Content Reference & Expected JSON Structure)---")
        prompt_lines.append("######################")
        # Examples are either a structure (serialized to JSON here) or an already-formatted string
        examples_content = task_params.get("examples", "")
        if isinstance(examples_content, (dict, list)):
            prompt_lines.append(json.dumps(examples_content, indent=2, ensure_ascii=False))
        else:
            # Assume it is already a string (ideally a valid JSON string)
            prompt_lines.append(_fill_placeholders_for_llm_prompt(str(examples_content), placeholders_to_fill))
        prompt_lines.append("\nNote: The above examples illustrate the type of content and the desired JSON output format. Your output MUST strictly follow this JSON structure.")
        prompt_lines.append("")

    # --- Real data ---
    prompt_lines.append("#############################")
    prompt_lines.append("---Real Data---")
    prompt_lines.append("######################")
    entity_types = task_params.get("entity_types_string", "")
    prompt_lines.append(f"Entity types to consider: [{entity_types}]")
    prompt_lines.append("Text:")
    prompt_lines.append(document_text)
    prompt_lines.append("######################")
    prompt_lines.append("\nOutput JSON:")
    return "\n".join(prompt_lines)


# --- 2. Call the LLM (revised) ---
def get_llm_response_json(
    api_key: str,
    user_prompt: str,
    system_prompt: str = "You are an assistant for structured data extraction that outputs JSON only.",
    model: str = "deepseek-chat",
    base_url: str = "https://api.deepseek.com/v1",
    use_json_mode: bool = True,
    stop_sequence: str | None = None,
) -> str | None:
    """Call the LLM and return its text response, requesting JSON mode when enabled."""
    client = OpenAI(api_key=api_key, base_url=base_url)
    try:
        response_params = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "stream": False,
            "temperature": 0.0,
        }
        if use_json_mode:
            # Setting a dict key cannot fail, so no try/except is needed here. If the
            # model or endpoint rejects response_format, create() raises and the
            # outer except below reports it.
            response_params["response_format"] = {"type": "json_object"}
        elif stop_sequence:
            # JSON mode is off and a stop sequence was provided
            response_params["stop"] = [stop_sequence]
        response = client.chat.completions.create(**response_params)
        if response.choices and response.choices[0].message and response.choices[0].message.content:
            return response.choices[0].message.content.strip()
        return None
    except Exception as e:  # pylint: disable=broad-except
        print(f"Error while calling the LLM: {e}")
        return None


# --- 3. Parse the LLM's JSON response (revised) ---
def parse_llm_json_output(llm_json_string: str) -> dict:
    """Parse the JSON string returned by the LLM."""
    if not llm_json_string:
        return {"error": "LLM did not return any content."}
    # Defined before the try block so the except handlers can always reference it
    processed_string = llm_json_string.strip()
    try:
        # The LLM sometimes still wraps the JSON in markdown fences
        if processed_string.startswith("```json"):
            processed_string = processed_string[7:]
        if processed_string.endswith("```"):
            processed_string = processed_string[:-3]
        processed_string = processed_string.strip()
        data = json.loads(processed_string)
        if not isinstance(data, dict) or \
                not all(k in data for k in ["entities", "relationships", "content_keywords"]):
            print(f"Warning: the LLM's JSON is missing expected top-level keys: {processed_string}")
            # Return the parsed data if it is a dict, otherwise the string that was parsed
            return {
                "error": "LLM JSON output structure mismatch for top-level keys.",
                "raw_output": data if isinstance(data, dict) else processed_string,
            }
        if not isinstance(data.get("entities"), list) or \
                not isinstance(data.get("relationships"), list) or \
                not isinstance(data.get("content_keywords"), list):
            print(f"Warning: entities, relationships or content_keywords is not a list: {processed_string}")
            return {"error": "LLM JSON output type mismatch for arrays.", "raw_output": data}
        return data
    except json.JSONDecodeError as e:
        print(f"Error: the LLM output is not valid JSON: {e}")
        print(f"Raw output (before processing): {llm_json_string}")
        print(f"String passed to the parser: {processed_string}")
        return {"error": "Invalid JSON from LLM.", "details": str(e), "raw_output": llm_json_string}
    except Exception as e:  # pylint: disable=broad-except
        print(f"Unexpected error while parsing the LLM JSON output: {e}")
        return {"error": "Unknown error parsing LLM JSON output.", "details": str(e), "raw_output": llm_json_string}


# --- 4. Main coordinating function (revised) ---
def extract_and_complete_json_direct_output(
    json_template_string: str,
    document_text: str,
    task_fixed_params: dict,
    llm_api_config: dict,
) -> dict:
    """Coordinate the whole extraction, expecting the LLM to output JSON directly."""
    try:
        template_data = json.loads(json_template_string)
    except json.JSONDecodeError as e:
        print(f"Error: failed to decode the JSON template: {e}")
        return {"error": "Invalid JSON template", "details": str(e)}

    llm_prompt = build_llm_prompt_for_json_output(template_data, document_text, task_fixed_params)
    # (Optional) print the generated prompt for debugging
    # print("--- Generated LLM JSON prompt ---")
    # print(llm_prompt)
    # print("--- End of prompt ---")

    use_json_mode_for_llm = task_fixed_params.get("use_json_mode_for_llm", True)
    llm_json_response_str = get_llm_response_json(
        api_key=llm_api_config["api_key"],
        user_prompt=llm_prompt,
        model=llm_api_config.get("model", "deepseek-chat"),
        base_url=llm_api_config.get("base_url", "https://api.deepseek.com/v1"),
        system_prompt=task_fixed_params.get(
            "system_prompt_llm",
            "You are a specialized assistant that extracts information from text and returns it as JSON.",
        ),
        use_json_mode=use_json_mode_for_llm,
        stop_sequence=None,  # completion_delimiter was removed from task_fixed_params
    )

    # Build a new dict for the result instead of mutating the template data;
    # it can carry metadata over from the template.
    output_result = {
        "prompt_name": template_data.get("prompt_name", "unknown_extraction"),
        "goal_description_from_template": template_data.get("goal", {}).get("description"),
        # ... any other metadata from template_data you want to carry over
    }
    if not llm_json_response_str:
        output_result["extraction_results"] = {"error": "No response received from the LLM."}
        output_result["llm_raw_output_debug"] = None
        return output_result

    parsed_json_data = parse_llm_json_output(llm_json_response_str)
    output_result["extraction_results"] = parsed_json_data
    output_result["llm_raw_output_debug"] = llm_json_response_str
    return output_result


# --- Template with the custom delimiters removed ---
json_template_str_input_no_delimiters = """{
  "prompt_name": "entity_extraction_json",
  "goal": {
    "description": "Given a text document that is potentially relevant to this activity and a list of entity types [{entity_types}], identify all entities of those types from the text and all relationships among the identified entities. The output must be a single, valid JSON object.",
    "output_language_variable": "{language}"
  },
  "steps": [
    {
      "step_number": 1,
      "name": "Identify Entities",
      "description": "Identify all entities. For each identified entity, extract the information as specified in the Output JSON Structure section under 'entities'.",
      "extraction_details": [
        {"field_name": "entity_name", "description": "Name of the entity, use same language as input text. If English, capitalize the name."},
        {"field_name": "entity_type", "description": "One of the types from the provided list: [{entity_types}]"},
        {"field_name": "entity_description", "description": "Comprehensive description of the entity's attributes and activities based on the input text."}
      ]
    },
    {
      "step_number": 2,
      "name": "Identify Relationships",
      "description": "From the entities identified, identify all pairs of clearly related entities. For each pair, extract the information as specified in the Output JSON Structure section under 'relationships'.",
      "extraction_details": [
        {"field_name": "source_entity", "description": "Name of the source entity, as identified in the 'entities' list."},
        {"field_name": "target_entity", "description": "Name of the target entity, as identified in the 'entities' list."},
        {"field_name": "relationship_description", "description": "Explanation as to why the source entity and the target entity are related, based on the input text."},
        {"field_name": "relationship_strength", "description": "A numeric score indicating strength of the relationship (e.g., from 0.0 for weak to 1.0 for strong)."},
        {"field_name": "relationship_keywords", "description": "One or more high-level keywords summarizing the relationship, focusing on concepts or themes from the text."}
      ]
    },
    {
      "step_number": 3,
      "name": "Identify Content Keywords",
      "description": "Identify high-level keywords that summarize the main concepts, themes, or topics of the entire input text. This should be a JSON array of strings under the 'content_keywords' key in the output."
    }
  ],
  "examples_section": {"placeholder": "{examples}"},
  "real_data_section": {"entity_types_variable": "[{entity_types}]", "input_text_variable": "{input_text}"},
  "output_format_notes": {
    "final_output_structure": "A single, valid JSON object as described in the prompt.",
    "language_variable_for_output": "{language}"
  },
  "global_placeholders": ["{language}", "{entity_types}", "{examples}", "{input_text}"]
}"""

# --- Task parameters with the custom delimiters removed ---
task_configuration_params_json_output_no_delimiters = {
    "language": "简体中文",
    "entity_types_string": "人物, 地点, 日期, 理论, 奖项, 组织",  # fills [{entity_types}]
    # The example is a Python dict; it is serialized to a JSON string in the prompt
    "examples": {
        "entities": [
            {"entity_name": "阿尔伯特·爱因斯坦", "entity_type": "人物", "entity_description": "理论物理学家，创立了相对论，并因对光电效应的研究而闻名。"},
            {"entity_name": "狭义相对论", "entity_type": "理论", "entity_description": "由爱因斯坦在1905年提出的物理学理论，改变了对时间和空间的理解。"},
        ],
        "relationships": [
            {"source_entity": "阿尔伯特·爱因斯坦", "target_entity": "狭义相对论", "relationship_description": "阿尔伯特·爱因斯坦发表了狭义相对论。", "relationship_strength": 0.9, "relationship_keywords": ["发表", "创立"]},
        ],
        "content_keywords": ["物理学", "相对论", "爱因斯坦", "诺贝尔奖"],
    },
    "system_prompt_llm": "You are a specialized assistant that extracts entities, relationships and content keywords from text. Your entire output must be a single, valid JSON object. Do not include any extra text or markdown markup.",
    "use_json_mode_for_llm": True,
}

# --- Example usage ---
if __name__ == "__main__":
    # Use the template with the custom delimiters removed
    json_template_to_use = json_template_str_input_no_delimiters
    document_to_analyze = "爱因斯坦（Albert Einstein）于1879年3月14日出生在德国乌尔姆市一个犹太人家庭。他在1905年，即所谓的“奇迹年”，发表了四篇划时代的论文，其中包括狭义相对论的基础。后来，他因对理论物理的贡献，特别是发现了光电效应的定律，获得了1921年度的诺贝尔物理学奖。他的工作深刻影响了现代物理学，尤其是量子力学的发展。爱因斯坦在普林斯顿高等研究院度过了他的晚年，并于1955年4月18日逝世。"
    # Use the task parameters with the custom delimiters removed
    task_params_to_use = task_configuration_params_json_output_no_delimiters
    llm_config = {
        "api_key": "YOUR_DEEPSEEK_API_KEY",
        "base_url": "https://api.deepseek.com/v1",
        "model": "deepseek-chat",  # or another model with JSON mode, e.g. gpt-4o, gpt-3.5-turbo-0125
    }

    if "YOUR_DEEPSEEK_API_KEY" in llm_config["api_key"] or not llm_config["api_key"]:
        print("Error: set your real API key in llm_config to run this example.")
        print("You can get a key from DeepSeek (https://platform.deepseek.com/api_keys) or OpenAI.")
    else:
        result_data = extract_and_complete_json_direct_output(
            json_template_string=json_template_to_use,
            document_text=document_to_analyze,
            task_fixed_params=task_params_to_use,
            llm_api_config=llm_config,
        )
        print("\n--- Completed data (JSON output directly from the LLM, no custom delimiters) ---")
        # Make sure extraction_results exists and is not an error payload
        extraction_results = result_data.get("extraction_results", {})
        if isinstance(extraction_results, dict) and "error" in extraction_results:
            print(f"An error occurred: {extraction_results['error']}")
            if "details" in extraction_results:
                print(f"Details: {extraction_results['details']}")
            if "raw_output" in extraction_results:
                # On a parsing error, raw_output lives inside extraction_results
                print(f"Raw (or partial) LLM output (from the parsing error): {extraction_results['raw_output']}")
            elif "llm_raw_output_debug" in result_data:
                # If get_llm_response_json failed, it lives one level up
                print(f"Raw LLM response (from the LLM call): {result_data['llm_raw_output_debug']}")
        else:
            # Print the full result only when there is no error
            print(json.dumps(result_data, indent=2, ensure_ascii=False))
        # Also print the raw LLM output for debugging, unless the error branch already printed it
        if not (isinstance(extraction_results, dict) and "error" in extraction_results and "raw_output" in extraction_results):
            if "llm_raw_output_debug" in result_data and result_data["llm_raw_output_debug"]:
                print("\n--- Raw LLM response (Debug) ---")
                print(result_data["llm_raw_output_debug"])

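Advantages of the improved approach (OpenAI Structured Outputs / JSON mode):
1. Standardized output: the LLM is asked directly for a JSON object, a widely accepted structured data-exchange format.
2. Built-in LLM support: many modern LLMs (OpenAI's gpt-3.5-turbo-0125 and later, gpt-4o, and the DeepSeek model used in the example) offer a mode that forces JSON output, which makes the output format much more reliable. JSON mode guarantees parseable JSON only; strict adherence to a schema requires Structured Outputs with a JSON Schema.
3. Simpler parsing: a standard JSON library (such as Python's json.loads()) is enough, with no custom parsing logic.
4. Better robustness: even if the LLM occasionally adds a little text around the JSON (such as a markdown ```json ... ``` wrapper), the parse_llm_json_output function handles it. And because the LLM is explicitly instructed to output JSON, it is more likely to produce valid JSON.
5. Clear schema definition: build_llm_prompt_for_json_output describes the expected JSON structure explicitly in the prompt (top-level keys, the structure of the entity array, the structure of the relationship array, and so on), which gives the LLM clear guidance.
6. Better maintainability and extensibility: changing the output structure usually means updating only the descriptions and examples in the JSON template, with few code changes.
7. Structured template and parameters: json_template_str_input_no_delimiters and task_configuration_params_json_output_no_delimiters make the prompt engineering itself more structured and easier to manage.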
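Because JSON mode checks syntax rather than structure, a field-level check after parsing is a natural complement to points 3 and 4. The following is a minimal, standard-library-only sketch. The function name `validate_extraction`, the key tuples, and the sample payload are assumptions for illustration and are not part of the code above.

```python
REQUIRED_ENTITY_KEYS = ("entity_name", "entity_type", "entity_description")
REQUIRED_RELATIONSHIP_KEYS = (
    "source_entity",
    "target_entity",
    "relationship_description",
    "relationship_strength",
    "relationship_keywords",
)


def validate_extraction(data: dict, allowed_types: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the payload passed."""
    problems = []
    names = set()
    for i, entity in enumerate(data.get("entities", [])):
        if not isinstance(entity, dict):
            problems.append(f"entities[{i}] is not an object")
            continue
        missing = [k for k in REQUIRED_ENTITY_KEYS if k not in entity]
        if missing:
            problems.append(f"entities[{i}] is missing {missing}")
            continue
        if entity["entity_type"] not in allowed_types:
            problems.append(f"entities[{i}] has unknown type {entity['entity_type']!r}")
        names.add(entity["entity_name"])
    for i, rel in enumerate(data.get("relationships", [])):
        if not isinstance(rel, dict):
            problems.append(f"relationships[{i}] is not an object")
            continue
        missing = [k for k in REQUIRED_RELATIONSHIP_KEYS if k not in rel]
        if missing:
            problems.append(f"relationships[{i}] is missing {missing}")
            continue
        # The prompt allows a comma-separated string or an array; normalize to a list in place.
        if isinstance(rel["relationship_keywords"], str):
            rel["relationship_keywords"] = [
                k.strip() for k in rel["relationship_keywords"].split(",") if k.strip()
            ]
        if not isinstance(rel["relationship_strength"], (int, float)):
            problems.append(f"relationships[{i}].relationship_strength is not a number")
        # Both endpoints should refer to an entity that was actually extracted.
        for end in ("source_entity", "target_entity"):
            if rel[end] not in names:
                problems.append(f"relationships[{i}].{end} {rel[end]!r} is not a listed entity")
    return problems


sample = {
    "entities": [
        {"entity_name": "Einstein", "entity_type": "person", "entity_description": "Physicist"},
    ],
    "relationships": [
        {
            "source_entity": "Einstein",
            "target_entity": "Relativity",
            "relationship_description": "Published it",
            "relationship_strength": 0.9,
            "relationship_keywords": "published, founded",
        },
    ],
    "content_keywords": ["physics"],
}
problems = validate_extraction(sample, allowed_types={"person", "theory"})
print(problems)
```

In this sample the relationship points at an entity that was never extracted, which a syntax-only check would not notice.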

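Notes on specific parts of the code:

_fill_placeholders_for_llm_prompt: handles both the {key} and [{key}] placeholder forms, the latter mainly for the entity_types list.
build_llm_prompt_for_json_output: combines the structured template, the task parameters, and the dynamic text into a detailed, clear prompt aimed at JSON output. The explicit description of the JSON structure matters most for the LLM.
get_llm_response_json: sets response_format={"type": "json_object"} on the request when JSON mode is enabled, and otherwise applies an optional stop sequence (the final call passes stop_sequence=None, since a stop sequence is generally unnecessary in JSON mode). An endpoint that rejects response_format surfaces as an exception from the API call, which the function catches and reports.
parse_llm_json_output: checks for common problems (markdown wrapping, missing top-level keys, expected arrays that are not arrays) and returns an error payload for each, which is practical.
extract_and_complete_json_direct_output: coordinates the whole flow.
Example: the sample input text, JSON template, and task parameters are clear and help with understanding and testing.
Removing the custom delimiters: this is the right direction. completion_delimiter and the others are no longer needed in JSON mode, because the LLM is expected to return one complete JSON object.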
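If an endpoint might not support `response_format`, one option is to retry the request without it and rely on the prompt instructions alone. The sketch below shows that pattern with an injected callable, so it runs without network access. `create_with_json_mode_fallback` and `fake_create_without_json_mode` are hypothetical names, and a real client would catch its specific bad-request error rather than every exception.

```python
from typing import Any, Callable


def create_with_json_mode_fallback(create: Callable[..., Any], params: dict) -> tuple[Any, bool]:
    """Call create(**params) with JSON mode; on failure, retry once without it.

    Returns (response, json_mode_used).
    """
    json_params = {**params, "response_format": {"type": "json_object"}}
    try:
        return create(**json_params), True
    except Exception as exc:  # assumption: broad catch for the sketch; narrow it in real code
        print(f"JSON mode request failed ({exc}); retrying with prompt-only JSON instructions")
        plain_params = {k: v for k, v in params.items() if k != "response_format"}
        return create(**plain_params), False


def fake_create_without_json_mode(**kwargs):
    """Stand-in for an endpoint that does not support response_format (hypothetical)."""
    if "response_format" in kwargs:
        raise ValueError("unsupported parameter: response_format")
    return '{"entities": [], "relationships": [], "content_keywords": []}'


response, json_mode_used = create_with_json_mode_fallback(
    fake_create_without_json_mode,
    {"model": "demo-model", "messages": []},
)
print(json_mode_used, response)
```

With the real SDK, `create` would be `client.chat.completions.create`, and the fallback response would still go through a parser that tolerates markdown wrapping, such as parse_llm_json_output.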

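Summary:

LightRAG is described as a configuration and management class because it aims to provide a flexible framework: users build and customize complex RAG systems by defining components (such as the LLM caller, the parser, and the data loader) and configuring their parameters.

The Python code and JSON template above apply this idea to one specific core component, entity and relationship extraction, through:

Configuration (the JSON template and task parameters), which defines the task details.
Management (Python functions that coordinate prompt construction, the LLM call, and result parsing), which runs the whole flow.
Integration (the interaction with the LLM API), which connects the external service.

Compared with the original prompt based on custom delimiters, this implementation uses the JSON output capability of modern LLMs to improve the reliability and parseability of the output and the robustness of the overall solution. It is sound engineering practice.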
