
Heavily modifying RagFlow

news · Source: original · 2025/7/19 13:29:13


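I. RAG theory overview
II. RagFlow parsing parameters
III. ♥ RagFlow source code analysis
- Core code flow walkthrough
  - 1. OCR
  - 2. Layout analysis
  - 3. Parser functionality
    - 3.1 PdfParser
      - 3.1.1 Initialization
      - 3.1.2 Converting the PDF to images
  - Key points from [RAG knowledge-base services from industry (2): an in-depth walkthrough of the full RagFlow source flow](https://blog.csdn.net/hustyichi/article/details/139162109?ops_request_misc=&request_id=&biz_id=102&utm_term=ragflow%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-1-139162109.nonecase&spm=1018.2226.3001.4187)
Task 1: blank out certain fields in the parsed document
- Core file-parsing function: the build() method
- Parsers
- Text blanking
- Table content extraction
- Modifying the source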

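I. RAG theory overview

See: "The current state and future of RAG, as seen through RAGFlow's recent popularity"

II. RagFlow parsing parameters

III. ♥ RagFlow source code analysis

Core code flow walkthrough

Reference: "An in-depth look at DeepDoc, RAGFlow's deep document understanding"

DeepDoc's models appear to be fine-tuned from PaddleOCR models; the open-sourced models are in ONNX format.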

1. OCR

The main code is in ocr.py: "E:\ragflow-main\deepdoc\vision\ocr.py"
TextRecognizer performs text recognition, TextDetector performs text-box detection, and OCR combines detection and recognition and exposes them to callers.

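2. Layout analysis

Layout analysis lives mainly in recognizer.py and layout_recognizer.py:
"E:\ragflow-main\deepdoc\vision\recognizer.py" and "E:\ragflow-main\deepdoc\vision\layout_recognizer.py"
LayoutRecognizer is a subclass of Recognizer. It performs layout analysis on document images and identifies different region types such as tables, titles, and paragraphs. The model used here is probably still an optimized version of PaddleOCR's layout analysis model.
Recognizer's __call__ method takes a list of images and a confidence threshold.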

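3. Parser functionality

OCR and layout analysis both serve the parser, which parses a document and splits it into chunks.

The framework provides five parsers: PdfParser, PlainParser, DocxParser, ExcelParser, and PptParser. There is also dedicated parsing for resumes.

3.1 PdfParser

We focus on the most important one, PdfParser (also known as HuParser).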

3.1.1 Initialization

    def __init__(self):
        self.ocr = OCR()
        if hasattr(self, "model_speciess"):
            self.layouter = LayoutRecognizer("layout." + self.model_speciess)
        else:
            self.layouter = LayoutRecognizer("layout")
        self.tbl_det = TableStructureRecognizer()
        self.updown_cnt_mdl = xgb.Booster()

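Recognizing table structure requires OCR, LayoutRecognizer, and TableStructureRecognizer to work together. In general, the models are paired with a lot of engineering tricks, with rules covering the edge cases. Document parsing works the same way: several models cooperate, combined with rules that are usually accumulated experience. In plain terms, you run all sorts of cases and add a new rule whenever a problem shows up.

The core of PdfParser: __call__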

    def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
        # Convert pages to images, process the text, run OCR
        self.__images__(fnm, zoomin)
        # Layout analysis
        self._layouts_rec(zoomin)
        # Table box processing
        self._table_transformer_job(zoomin)
        # Merge text blocks
        self._text_merge()
        self._concat_downward()
        # Filter out pagination info
        self._filter_forpages()
        # Extract tables and figures
        tbls = self._extract_table_figure(
            need_image, zoomin, return_html, False)
        # Returns the extracted text (with tables removed) and the tables
        return self.__filterout_scraps(deepcopy(self.boxes), zoomin), tbls


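3.1.2 Converting the PDF to images

First, read the PDF file. pdfplumber is tried first; if it raises, the code falls back to fitz (PyMuPDF), in which case no per-character data is available.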

    def __images__(self, fnm, zoomin=3, page_from=0,
                   page_to=299, callback=None):
        self.lefted_chars = []
        self.mean_height = []
        self.mean_width = []
        self.boxes = []
        self.garbages = {}
        self.page_cum_height = [0]
        self.page_layout = []
        self.page_from = page_from
        try:
            self.pdf = pdfplumber.open(fnm) if isinstance(
                fnm, str) else pdfplumber.open(BytesIO(fnm))
            self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
                                enumerate(self.pdf.pages[page_from:page_to])]
            self.page_chars = [[c for c in page.chars if self._has_color(c)] for page in
                               self.pdf.pages[page_from:page_to]]
            self.total_page = len(self.pdf.pages)
        except Exception as e:
            self.pdf = fitz.open(fnm) if isinstance(
                fnm, str) else fitz.open(
                stream=fnm, filetype="pdf")
            self.page_images = []
            self.page_chars = []
            mat = fitz.Matrix(zoomin, zoomin)
            self.total_page = len(self.pdf)
            for i, page in enumerate(self.pdf):
                if i < page_from:
                    continue
                if i >= page_to:
                    break
                pix = page.get_pixmap(matrix=mat)
                img = Image.frombytes("RGB", [pix.width, pix.height],
                                      pix.samples)
                self.page_images.append(img)
                self.page_chars.append([])

Next, read the PDF's outline (its table of contents).
The PyPDF2 library is used to read the outline information.

    self.outlines = []
    try:
        self.pdf = pdf2_read(fnm if isinstance(fnm, str) else BytesIO(fnm))
        outlines = self.pdf.outline

        def dfs(arr, depth):
            for a in arr:
                if isinstance(a, dict):
                    self.outlines.append((a["/Title"], depth))
                    continue
                dfs(a, depth + 1)
        dfs(outlines, 0)
    except Exception as e:
        logging.warning(f"Outlines exception: {e}")
    if not self.outlines:
        logging.warning(f"Miss outlines")
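
The depth-first walk in the snippet above can be tried in isolation. The following is a minimal sketch, assuming the outline is a nested list in which dict entries are bookmarks carrying a "/Title" key and nested lists hold child bookmarks, which is how the snippet treats `self.pdf.outline`. The name `flatten_outline` and the sample data are made up for illustration.

```python
def flatten_outline(outline):
    """Flatten a nested outline into (title, depth) pairs, depth-first."""
    flat = []

    def dfs(items, depth):
        for item in items:
            if isinstance(item, dict):
                # A bookmark entry: record its title at the current depth.
                flat.append((item["/Title"], depth))
                continue
            # A nested list holds child bookmarks one level deeper.
            dfs(item, depth + 1)

    dfs(outline, 0)
    return flat


# Hypothetical sample data standing in for a real PDF outline.
sample_outline = [
    {"/Title": "Chapter 1"},
    [{"/Title": "1.1 Background"}, {"/Title": "1.2 Scope"}],
    {"/Title": "Chapter 2"},
]
print(flatten_outline(sample_outline))
```

The recorded depth is what later lets a parser tell chapter titles from subsection titles.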

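Next comes English-document detection, done with a regex: each page is sampled, a page counts as English if the sample contains a run of at least 30 ASCII letters, digits, or punctuation marks, and the document is English if more than half of its pages are.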

    logging.info("Images converted.")
    self.is_english = [re.search(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(
        random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i]))))) for i in
        range(len(self.page_chars))]
    if sum([1 if e else 0 for e in self.is_english]) > len(
            self.page_images) / 2:
        self.is_english = True
    else:
        self.is_english = False
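
The same per-page test and majority vote can be written as two small functions. This is a minimal sketch, assuming each page is given as a plain list of single-character strings. The names `page_looks_english` and `document_is_english` are illustrative and not part of RAGFlow.

```python
import random
import re

# Same character class as the snippet above: a run of 30+ "English-like" characters.
ENGLISH_RUN = re.compile(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}")


def page_looks_english(chars, rng=random):
    """Sample up to 100 characters (with replacement) and look for a long ASCII run."""
    if not chars:
        return False
    sample = rng.choices(chars, k=min(100, len(chars)))
    return ENGLISH_RUN.search("".join(sample)) is not None


def document_is_english(pages, rng=random):
    """Majority vote: English only if more than half of the pages look English."""
    votes = sum(1 for chars in pages if page_looks_english(chars, rng))
    return votes > len(pages) / 2


# Hypothetical pages: one made of Latin letters, one made of Chinese characters.
english_page = list("Retrievalaugmentedgeneration" * 2)
chinese_page = list("检索增强生成" * 10)
print(document_is_english([english_page, english_page, chinese_page]))
```

Because the sample is random and the space character is not in the class, a mixed-language page can go either way between runs. That is presumably acceptable here, since the result is only used as a document-level majority.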

Next comes per-page processing.
The callback updates the document parsing progress, which can be watched in real time on the document page.

    for i, img in enumerate(self.page_images):
        chars = self.page_chars[i] if not self.is_english else []
        # Compute the median character height and width for the page
        self.mean_height.append(
            np.median(sorted([c["height"] for c in chars])) if chars else 0
        )
        self.mean_width.append(
            np.median(sorted([c["width"] for c in chars])) if chars else 8
        )
        self.page_cum_height.append(img.size[1] / zoomin)
        j = 0
        while j + 1 < len(chars):
            # Append a space when the pair starts with a digit, letter, or one of
            # , . : ; ! % and the gap between the two characters is at least half
            # the width of the narrower one
            if chars[j]["text"] and chars[j + 1]["text"] \
                    and re.match(r"[0-9a-zA-Z,.:;!%]+", chars[j]["text"] + chars[j + 1]["text"]) \
                    and chars[j + 1]["x0"] - chars[j]["x1"] >= min(chars[j + 1]["width"],
                                                                   chars[j]["width"]) / 2:
                chars[j]["text"] += " "
            j += 1
        # if i > 0:
        #     if not chars:
        #         self.page_cum_height.append(img.size[1] / zoomin)
        #     else:
        #         self.page_cum_height.append(
        #             np.max([c["bottom"] for c in chars]))
        # OCR
        self.__ocr(i + 1, img, chars, zoomin)
        if callback:
            callback(prog=(i + 1) * 0.6 / len(self.page_images), msg="")
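
The space-insertion rule in the loop above is easier to see on a tiny input. Below is a minimal sketch, assuming each character is a dict with the pdfplumber-style keys "text", "x0", "x1", and "width" that the loop uses. The helper `insert_spaces` is hypothetical, and unlike the original loop it works on copies rather than mutating its input.

```python
import re


def insert_spaces(chars):
    """Return copies of chars, with a space appended where a visible gap follows."""
    chars = [dict(c) for c in chars]  # work on copies, leave the input untouched
    for cur, nxt in zip(chars, chars[1:]):
        if not (cur["text"] and nxt["text"]):
            continue
        # re.match anchors only at the start of the pair, as in the loop above.
        if not re.match(r"[0-9a-zA-Z,.:;!%]+", cur["text"] + nxt["text"]):
            continue
        gap = nxt["x0"] - cur["x1"]
        if gap >= min(cur["width"], nxt["width"]) / 2:
            cur["text"] += " "
    return chars


# Hypothetical characters for "AI is": a 4-unit gap separates "I" from "i".
sample_chars = [
    {"text": "A", "x0": 0, "x1": 5, "width": 5},
    {"text": "I", "x0": 5, "x1": 10, "width": 5},
    {"text": "i", "x0": 14, "x1": 19, "width": 5},
    {"text": "s", "x0": 19, "x1": 24, "width": 5},
]
print("".join(c["text"] for c in insert_spaces(sample_chars)))
```

PDF text layers often store no explicit space characters, so word boundaries have to be inferred from horizontal gaps like this.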

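__ocr processing
This step mainly runs detect to find the text boxes, then applies empirical rules to the text blocks: characters from the PDF text layer are merged into the box they overlap, and only boxes that end up with no text are sent to the recognizer.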

    def __ocr(self, pagenum, img, chars, ZM=3):
        # Detect text boxes
        bxs = self.ocr.detect(np.array(img))
        if not bxs:
            self.boxes.append([])
            return
        bxs = [(line[0], line[1][0]) for line in bxs]
        # Sort by Y coordinate first
        bxs = Recognizer.sort_Y_firstly(
            [{"x0": b[0][0] / ZM, "x1": b[1][0] / ZM,
              "top": b[0][1] / ZM, "text": "", "txt": t,
              "bottom": b[-1][1] / ZM,
              "page_number": pagenum} for b, t in bxs if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]],
            self.mean_height[-1] / 3
        )
        # merge chars in the same rect
        for c in Recognizer.sort_X_firstly(
                chars, self.mean_width[pagenum - 1] // 4):
            ii = Recognizer.find_overlapped(c, bxs)
            if ii is None:
                self.lefted_chars.append(c)
                continue
            ch = c["bottom"] - c["top"]
            bh = bxs[ii]["bottom"] - bxs[ii]["top"]
            if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != ' ':
                self.lefted_chars.append(c)
                continue
            if c["text"] == " " and bxs[ii]["text"]:
                if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
                    bxs[ii]["text"] += " "
            else:
                bxs[ii]["text"] += c["text"]
        # Run recognition only on boxes that received no text from the PDF layer
        for b in bxs:
            if not b["text"]:
                left, right, top, bott = b["x0"] * ZM, b["x1"] * \
                    ZM, b["top"] * ZM, b["bottom"] * ZM
                b["text"] = self.ocr.recognize(np.array(img),
                                               np.array([[left, top], [right, top], [right, bott], [left, bott]],
                                                        dtype=np.float32))
            del b["txt"]
        bxs = [b for b in bxs if b["text"]]
        if self.mean_height[-1] == 0:
            self.mean_height[-1] = np.median([b["bottom"] - b["top"]
                                              for b in bxs])
        self.boxes.append(bxs)

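File parsing is triggered through the /v1/document/run endpoint. The actual work is done in queue_tasks() in api/db/services/task_service.py, which creates one or more asynchronous tasks per file so that they can run asynchronously.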


Prompts

To learn how a RAG service works from source code, reading QAnything first is recommended.

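References

A source-code analysis of ragflow-main/deepdoc

Embedding custom file-parsing code in RAGFlow

For vector search, if you would rather not introduce a third-party vector database, the open-source Faiss is a good choice: the Faiss vector database on git

Key points from "RAG knowledge-base services from industry (2): an in-depth walkthrough of the full RagFlow source flow"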


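File parsing is split into multiple tasks according to the content. The tasks are staged in a Redis message queue and then processed offline, asynchronously.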
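
The split-then-queue idea can be illustrated without Redis. The following is a sketch under stated assumptions, not RAGFlow's queue_tasks(): the function `split_into_tasks`, the task fields, and the page size of 12 are all hypothetical, and a `collections.deque` stands in for the Redis message queue.

```python
from collections import deque


def split_into_tasks(doc_id, total_pages, pages_per_task=12):
    """Split a document into page-range tasks; to_page is exclusive."""
    if pages_per_task <= 0:
        raise ValueError("pages_per_task must be positive")
    return [
        {
            "doc_id": doc_id,
            "from_page": start,
            "to_page": min(start + pages_per_task, total_pages),
        }
        for start in range(0, total_pages, pages_per_task)
    ]


# A deque stands in for the Redis message queue in this sketch.
queue = deque()
queue.extend(split_into_tasks("doc-1", total_pages=30))

# A worker would pop tasks and parse each page range independently.
processed = []
while queue:
    task = queue.popleft()
    processed.append((task["from_page"], task["to_page"]))
print(processed)
```

Splitting by page range keeps each task small, so a long PDF can be parsed by several workers and report progress incrementally.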

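The consumer of that message queue is the main() method in rag/svr/task_executor.py.

RAGFlow's code contains quite a few **data cleaning** operations. For example, deepdoc/vision/layout_recognizer.py includes a check for useless content in documents, def __is_garbage(b).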

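File retrieval support (including hybrid retrieval)

RAGFlow currently implements hybrid retrieval, meaning text retrieval plus vector retrieval, and relies entirely on ElasticSearch to do so.

Retrieval support can be traced through the conversation flow. The conversation API is /v1/conversation/completion, and the conversation itself is handled by the chat() method in api/db/services/dialog_service.py.

Following the conversation flow further, file retrieval is done in the search() method in rag/nlp/search.py.

Reranking of the retrieval results is done in rerank() in rag/nlp/search.py. It sorts by a mix of the text-match score and the vector-match score; by default the text-match weight is 0.3 and the vector-match weight is 0.7. Once the hybrid similarity is computed, results are filtered and reranked on it, and by default anything with a hybrid score below 0.2 is dropped.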
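
The weighting and filtering described above amount to a weighted sum followed by a threshold and a sort. Here is a minimal sketch using the stated defaults (0.3, 0.7, 0.2). The function `hybrid_rerank` and the candidate fields "text_sim" and "vector_sim" are assumptions for illustration and do not mirror the actual rerank() signature.

```python
def hybrid_rerank(candidates, text_weight=0.3, vector_weight=0.7, threshold=0.2):
    """Score candidates by a weighted sum, drop low scores, sort best first."""
    scored = []
    for cand in candidates:
        score = text_weight * cand["text_sim"] + vector_weight * cand["vector_sim"]
        if score >= threshold:
            scored.append({**cand, "score": score})
    return sorted(scored, key=lambda cand: cand["score"], reverse=True)


# Hypothetical candidates with similarities already normalized to [0, 1].
candidates = [
    {"id": "a", "text_sim": 0.9, "vector_sim": 0.2},  # 0.27 + 0.14 = 0.41
    {"id": "b", "text_sim": 0.1, "vector_sim": 0.9},  # 0.03 + 0.63 = 0.66
    {"id": "c", "text_sim": 0.3, "vector_sim": 0.1},  # 0.09 + 0.07 = 0.16, dropped
]
print([cand["id"] for cand in hybrid_rerank(candidates)])
```

With these weights a chunk that matches only on keywords needs a text score of at least about 0.67 to survive the 0.2 cut, while a purely semantic match needs only about 0.29.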

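The retrieval and rerank stages above only apply the necessary filtering; they do not limit the number of matched documents.

The resulting content may exceed the LLM's input token limit, so before the model is called, message_fit_in() in api/db/services/dialog_service.py filters it according to the number of tokens the model has available.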
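
One simple way to fit ranked content into a token budget is to keep chunks in order until the budget runs out. The sketch below shows that idea only; it is not the logic of message_fit_in(). The helper names are hypothetical, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def fit_chunks(chunks, max_tokens):
    """Keep chunks in ranked order until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical ranked chunks costing 3, 2, and 4 "tokens".
ranked_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_chunks(ranked_chunks, max_tokens=5))
```

Stopping at the first chunk that does not fit preserves the ranking order; a variant could skip oversized chunks and keep trying smaller ones.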

The retrieved content, the chat history, and the question are then assembled into a prompt, which becomes the input to the LLM.

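Task 1: blank out certain fields in the parsed document

Core file-parsing function: the build() method

"E:\ragflow-main\rag\svr\task_executor.py"

This is RAGFlow's core file-parsing method. Based on parser_id, it selects the appropriate parser and performs the parsing and chunking of the document.

Parsers

rag/app/naive.py

Taking the default naive type as an example, its chunk() implementation lives in rag/app/naive.py. This method covers parsing of the currently supported formats, such as docx, pdf, xlsx, and md.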

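Text blanking

def __ocr: "E:\ragflow-main\deepdoc\parser\pdf_parser.py"
The __ocr method detects and processes the text boxes. Logic that recognizes specific fields and blanks them out can be added here.

Table content extraction

__extract_table_content "E:\ragflow-main\deepdoc\parser\docx_parser.py"
__compose_table_content "E:\ragflow-main\deepdoc\parser\docx_parser.py"
If the document contains tables, the __extract_table_content and __compose_table_content functions handle the table content. Logic that recognizes specific fields in a table and blanks them out can be added here.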
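
As a rough illustration of table blanking, the sketch below assumes the table has already been extracted into a list of rows whose first row is the header. That shape is an assumption made for this example, not the internal format used by docx_parser.py, and `blank_table_columns` is a hypothetical helper.

```python
def blank_table_columns(rows, target_headers, placeholder="____"):
    """Return a copy of the table with cells under the target headers blanked."""
    if not rows:
        return []
    header = rows[0]
    targets = {i for i, name in enumerate(header) if name in target_headers}
    blanked = [list(header)]
    for row in rows[1:]:
        blanked.append(
            [placeholder if i in targets else cell for i, cell in enumerate(row)]
        )
    return blanked


# Hypothetical extracted table.
table = [
    ["Name", "Phone", "City"],
    ["Alice", "123-456", "Paris"],
    ["Bob", "987-654", "Rome"],
]
print(blank_table_columns(table, {"Phone"}))
```

Matching on header names rather than column positions keeps the rule stable when tables in different documents order their columns differently.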

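Modifying the source

Logic has to be added to the relevant parser to recognize the target fields and replace them with a placeholder (blanking). This will likely involve regular expressions as well as an understanding of the document's structure.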
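
A regex-based blanking step might look like the following minimal sketch. It assumes the fields appear as "Name: value" on a single line; the function `blank_fields`, the placeholder, and the sample text are all illustrative, and where to call it inside a parser is left open.

```python
import re


def blank_fields(text, field_names, placeholder="____"):
    """Replace the value after "<field>:" with a placeholder for each named field."""
    for name in field_names:
        # Match "<name>: <value>" up to the end of the line; accept ASCII or
        # full-width colons, since parsed Chinese documents often use the latter.
        pattern = re.compile(rf"({re.escape(name)}\s*[:：]\s*)([^\n]+)")
        text = pattern.sub(lambda m: m.group(1) + placeholder, text)
    return text


# Hypothetical parsed text.
parsed = "Name: Alice\nPhone: 123-456\nNote: keep me"
print(blank_fields(parsed, ["Name", "Phone"]))
```

One possible integration point, under the same assumptions, is to apply such a function to each box's "text" once __ocr has filled it in, so that every downstream chunk already carries the placeholder.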
