
Java: A Technical Guide to PDF Text Extraction

news Source: original 2025/8/3 23:00:38

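1. Overview

As more information goes digital, PDF has become an important medium for distributing it. In many scenarios, such as data migration, content analysis, and information retrieval, we need to extract text from PDF files. The Java ecosystem offers several libraries for working with PDF files; PDFBox and iText are two of the most commonly used.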


In this post, we explore several ways to extract text from PDFs, weigh their strengths and weaknesses, and discuss which approach suits which scenario.

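2. Preparation

Before you start, prepare the following:

Baidu developer account: register on the Baidu AI Open Platform and create an application to obtain an API Key and a Secret Key (needed only for the Baidu OCR approach).
Java development environment: make sure your environment is set up, including a JDK and an IDE such as IntelliJ IDEA or Eclipse.
Dependencies: Baidu provides an official Java SDK, or you can call the API directly with HttpClient.

Add the Maven dependencies: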

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>
<dependency>
    <groupId>org.apache.directory.studio</groupId>
    <artifactId>org.apache.commons.codec</artifactId>
    <version>1.8</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.83</version>
</dependency>
<!-- spring-boot-actuator -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.0</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>font-asian</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>kernel</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>io</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>layout</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>forms</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>pdfa</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>

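3. Parsing with PDFBox

The PDFBox library can parse a PDF file and extract its text content. Once the text is extracted, you can split it into lines and write your own logic to search for a given string.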

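import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PDFReader {
    public static void main(String[] args) {
        String filePath = "path/to/your/pdf-file.pdf";
        String keyword = "target text"; // the text to search for

        // PDFBox 3.x loads documents through Loader.loadPDF (PDDocument.load was removed)
        try (PDDocument document = Loader.loadPDF(new File(filePath))) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            int pageCount = document.getNumberOfPages();

            // Extract one page at a time so that each match can be reported with its page number
            for (int page = 1; page <= pageCount; page++) {
                pdfStripper.setStartPage(page);
                pdfStripper.setEndPage(page);
                String text = pdfStripper.getText(document);

                // Split the page text into lines (handles \n, \r\n and \r)
                for (String line : text.split("\\R")) {
                    if (line.contains(keyword)) {
                        System.out.println("Found keyword on page " + page + ": " + keyword);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}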

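4. Parsing PDFs with Tesseract OCR

Converting the PDF pages to images and running Tesseract OCR on them is an effective way to handle PDFs with complex layouts or irregular tables. The steps in Java are:

Convert the PDF pages to images: use PDFBox to render each PDF page as an image.
Recognize the text with Tesseract OCR: run Tesseract OCR on each image and extract the text.
Find the keyword and extract the information: search the recognized text for a keyword (such as "图号", the drawing number) and take the value of the adjacent cell.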

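// Imports required by the controller method below
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.multipart.MultipartFile;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Method of a Spring @RestController class
@PostMapping("/pdf2Excel")
public ResponseEntity<String> pdf2Excel(@RequestParam("keyword") String keyword,
                                        @RequestParam("file") MultipartFile file) throws IOException {
    if (file.isEmpty()) {
        return new ResponseEntity<>("File is empty", HttpStatus.BAD_REQUEST);
    }

    // Store the upload and the rendered page images under the system temp directory
    String basePath = System.getProperty("java.io.tmpdir");
    File directory = new File(basePath, "images");
    if (!directory.exists()) {
        directory.mkdirs();
    }
    File convFile = new File(basePath, file.getOriginalFilename());
    file.transferTo(convFile);

    // Output location on the author's machine; adjust it for your environment
    String excelPath = "C:\\Users\\WIN10\\Desktop\\fsdownload\\excel\\MapData.xlsx";
    File excelFile = new File(excelPath);

    Map<String, Object> objectMap = new HashMap<>();
    try (PDDocument document = Loader.loadPDF(convFile)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        int numberOfPages = document.getNumberOfPages();

        ITesseract instance = new Tesseract();
        instance.setDatapath("D:\\soft\\Tesseract-OCR\\tessdata"); // path to Tesseract's tessdata directory
        instance.setLanguage("chi_sim"); // recognition language: Simplified Chinese

        for (int pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
            // Render the page at 144 DPI and save it as a JPEG
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(pageIndex, 144);
            int currentPageNum = pageIndex + 1;
            File imageFile = new File(directory, "page_" + currentPageNum + ".jpg");
            ImageIO.write(bufferedImage, "jpg", imageFile);

            // Run OCR on the image. processOCRResult is not shown in this article; it is
            // expected to pick the wanted value out of the OCR text and store it in objectMap.
            String result = instance.doOCR(imageFile);
            processOCRResult(result, objectMap, currentPageNum);
            System.out.println("Page " + currentPageNum + " converted to image.");
        }
        System.out.println("Finished collecting drawing numbers");
        System.out.println("Excel conversion started");

        // Build the workbook; try-with-resources closes both the workbook and the output stream
        try (Workbook workbook = new XSSFWorkbook();
             FileOutputStream fileOut = new FileOutputStream(excelFile)) {
            Sheet sheet = workbook.createSheet("Map Data");

            // Header row
            Row headerRow = sheet.createRow(0);
            headerRow.createCell(0).setCellValue("Key");
            headerRow.createCell(1).setCellValue("Value");

            // Data rows
            int rowNum = 1;
            for (Map.Entry<String, Object> entry : objectMap.entrySet()) {
                Row row = sheet.createRow(rowNum++);
                row.createCell(0).setCellValue(entry.getKey());
                row.createCell(1).setCellValue((String) entry.getValue());
            }

            // Fit the column widths to their content
            sheet.autoSizeColumn(0);
            sheet.autoSizeColumn(1);
            workbook.write(fileOut);
        }
        return ResponseEntity.ok().body("Data converted to Excel successfully");
    } catch (Exception e) {
        return new ResponseEntity<>("File upload error: " + e.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
    }
}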
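
The endpoint above calls processOCRResult, which this article does not show. The sketch below is one possible shape for it, not the author's implementation. It assumes that the value we want sits on the same line as the keyword, right after it, and that results are stored under a key such as "page_3". The KEYWORD constant, the key format, and the extractAfterKeyword helper are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OcrResultProcessor {
    // Hypothetical keyword; the article's example looks for "图号" (drawing number)
    static final String KEYWORD = "图号";

    /** Returns the first token that follows the keyword on the same line, or null if there is none. */
    public static String extractAfterKeyword(String ocrText, String keyword) {
        for (String line : ocrText.split("\\R")) {
            int idx = line.indexOf(keyword);
            if (idx < 0) {
                continue;
            }
            // Drop separators (spaces, colons, table bars) between the keyword and its value
            String rest = line.substring(idx + keyword.length()).replaceFirst("^[\\s:：|]+", "");
            if (!rest.isEmpty()) {
                return rest.split("[\\s|]+")[0];
            }
        }
        return null;
    }

    /** Same parameter order as the call in the endpoint; the key format is an assumption. */
    public static void processOCRResult(String ocrText, Map<String, Object> objectMap, int pageNum) {
        String value = extractAfterKeyword(ocrText, KEYWORD);
        if (value != null) {
            objectMap.put("page_" + pageNum, value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> result = new LinkedHashMap<>();
        processOCRResult("名称 支架\n图号: A-1024 比例 1:2\n", result, 3);
        System.out.println(result);
    }
}
```

OCR output for real tables is often noisier than this, with the value on the next line or the keyword split by stray spaces, so treat the matching rule as a starting point to adapt to your documents.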

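4. Parsing PDFs with Baidu OCR

OCR (Optical Character Recognition) is widely used in image processing, document management, and similar fields. Baidu's OCR API is powerful and easy to use, and lets developers quickly extract text from images. This section shows how to call it from Java to extract text from images.

4.1 Obtaining an Access Token

Before calling the OCR API, you need an Access Token. This step is usually performed when the application starts, and the token is valid only for a limited time.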

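// Imports required by token() (Apache HttpClient 4.x and fastjson)
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// API_KEY and SECRET_KEY are fields of the enclosing class that hold the application's credentials
public Map<String, Object> token() throws Exception {
    Map<String, Object> map = new HashMap<>();
    HttpPost httpPost = new HttpPost("https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials");
    httpPost.addHeader("Accept", "application/json");

    // POST form parameters. No explicit Content-Type header is set: the form entity
    // supplies application/x-www-form-urlencoded, which matches the body it sends.
    List<NameValuePair> formparams = new ArrayList<>();
    formparams.add(new BasicNameValuePair("client_id", API_KEY));
    formparams.add(new BasicNameValuePair("client_secret", SECRET_KEY));
    httpPost.setEntity(new UrlEncodedFormEntity(formparams, StandardCharsets.UTF_8));

    // Send the request; try-with-resources closes the client and the response
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpPost)) {
        int statusCode = response.getStatusLine().getStatusCode();
        String responseString = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        JSONObject obj = JSON.parseObject(responseString);
        if (statusCode == 200) {
            map.put("access_token", obj.getString("access_token"));
            map.put("refresh_token", obj.getString("refresh_token"));
            map.put("expires_in", obj.getIntValue("expires_in"));
        } else {
            map.put("error", obj.getString("error"));
            map.put("error_description", obj.getString("error_description"));
        }
        map.put("stateCode", statusCode);
    }
    return map;
}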
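
Because the token is valid only for a limited time and the response carries an expires_in value in seconds, it can be worth caching the token rather than requesting one per upload. The sketch below is one way to do that and is not part of the original code. TokenCache, its Token holder, and the 60-second safety margin are hypothetical names and choices. The fetcher would wrap a call such as token() above, and the clock is injected only to make the behaviour easy to check.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class TokenCache {
    /** Token value plus its lifetime in seconds (the expires_in field of the response). */
    public static final class Token {
        final String value;
        final long expiresInSeconds;

        public Token(String value, long expiresInSeconds) {
            this.value = value;
            this.expiresInSeconds = expiresInSeconds;
        }
    }

    // Refresh slightly before the real expiry; the margin is an arbitrary choice
    private static final long SAFETY_MARGIN_MS = 60_000L;

    private final Supplier<Token> fetcher;
    private final LongSupplier clockMillis;
    private String cached;
    private long expiresAtMillis;

    public TokenCache(Supplier<Token> fetcher, LongSupplier clockMillis) {
        this.fetcher = fetcher;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached token, fetching a new one when none is cached or it is about to expire. */
    public synchronized String get() {
        long now = clockMillis.getAsLong();
        if (cached == null || now >= expiresAtMillis - SAFETY_MARGIN_MS) {
            Token fresh = fetcher.get();
            cached = fresh.value;
            expiresAtMillis = now + fresh.expiresInSeconds * 1000L;
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in fetcher; a real one would call the token endpoint
        TokenCache cache = new TokenCache(
                () -> new Token("token-" + (++calls[0]), 3600), System::currentTimeMillis);
        System.out.println(cache.get());
        System.out.println(cache.get());
        System.out.println("fetches: " + calls[0]);
    }
}
```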

4.2 Calling the Baidu OCR API

With the Access Token in hand, we can call Baidu's OCR API: we send the image data in a POST request and receive the recognition result.

// Needs the Apache HttpClient 4.x and fastjson classes used below, plus:
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// token: the access token from 4.1; image: the Base64-encoded image content
public Map<String, Object> accurate(String token, String image) throws Exception {
    Map<String, Object> map = new HashMap<>();
    HttpPost httpPost = new HttpPost("https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic?access_token=" + token
            + "&language_type=CHN_ENG&detect_direction=false&paragraph=false&probability=false");
    httpPost.addHeader("Content-Type", "application/x-www-form-urlencoded");
    httpPost.addHeader("Accept", "application/json");

    // POST form parameters; the form entity URL-encodes the image value
    List<NameValuePair> formparams = new ArrayList<>();
    formparams.add(new BasicNameValuePair("image", image));
    httpPost.setEntity(new UrlEncodedFormEntity(formparams, StandardCharsets.UTF_8));

    // Send the request; try-with-resources closes the client and the response
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpPost)) {
        String responseString = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        JSONObject obj = JSON.parseObject(responseString);

        // An error response carries error_code and error_msg instead of words_result
        String errorCode = obj.getString("error_code");
        if (Objects.nonNull(errorCode)) {
            map.put("stateCode", 500);
            map.put("error_code", errorCode);
            map.put("error_msg", obj.getString("error_msg"));
        } else {
            map.put("stateCode", 200);
            map.put("words_result", obj.getJSONArray("words_result"));
            map.put("words_result_num", obj.getString("words_result_num"));
            map.put("log_id", obj.getString("log_id"));
        }
    }
    return map;
}
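
The image argument of accurate is the image content as a Base64 string. The endpoint in 4.3 obtains it from ocrAPIFactory.getFileContentAsBase64, a helper this article does not show. The sketch below is a plausible version of that helper, written under the assumption that its second parameter controls URL-encoding; the class name ImageEncoding is hypothetical.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

class ImageEncoding {
    /**
     * Reads a file and returns its content Base64-encoded.
     * When urlEncode is true the Base64 text is additionally URL-encoded,
     * which is only needed if the caller builds the request body by hand.
     */
    public static String getFileContentAsBase64(String path, boolean urlEncode) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        String base64 = Base64.getEncoder().encodeToString(bytes);
        return urlEncode ? URLEncoder.encode(base64, "UTF-8") : base64;
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the path of a rendered page image, e.g. page_1.jpg
        if (args.length == 1) {
            String encoded = getFileContentAsBase64(args[0], false);
            System.out.println("Base64 length: " + encoded.length());
        }
    }
}
```

Passing false fits the accurate method above, because UrlEncodedFormEntity already URL-encodes each form value; encoding twice would corrupt the image data.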

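4.3 Converting the Recognized Text to Excel

The Baidu OCR API returns its result as JSON. We can parse it with Gson or another JSON library (the code in this post uses fastjson), extract the recognized text, and write it to an Excel file.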

// Imports required by the controller method below
import com.alibaba.fastjson.JSONArray;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.multipart.MultipartFile;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Method of a Spring @RestController class; ocrAPIFactory is an injected bean
// that exposes token() from 4.1 and accurate() from 4.2
@PostMapping("/pdf2Excel")
public ResponseEntity<byte[]> pdf2Excel(@RequestParam("keyword") String keyword,
                                        @RequestParam("file") MultipartFile file) throws Exception {
    if (file.isEmpty()) {
        return new ResponseEntity<>("File is empty".getBytes(), HttpStatus.BAD_REQUEST);
    }

    // Store the upload and the rendered page images under the system temp directory
    String basePath = System.getProperty("java.io.tmpdir");
    File directory = new File(basePath, "images");
    if (!directory.exists()) {
        directory.mkdirs();
    }
    File convFile = new File(basePath, file.getOriginalFilename());
    file.transferTo(convFile);

    // Obtain the access token first
    Map<String, Object> tokenMap = ocrAPIFactory.token();
    int stateCode = (int) tokenMap.get("stateCode");
    if (stateCode != 200) {
        return new ResponseEntity<>("ERROR".getBytes(), HttpStatus.OK);
    }
    Map<Integer, String> objectMap = new HashMap<>();
    String accessToken = String.valueOf(tokenMap.get("access_token"));

    System.out.println("Collecting drawing numbers: start");
    try (PDDocument document = Loader.loadPDF(convFile)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        int numberOfPages = document.getNumberOfPages();
        for (int pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
            // Render the page at 144 DPI and save it as a JPEG
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(pageIndex, 144);
            int currentPageNum = pageIndex + 1;
            File imageFile = new File(directory, "page_" + currentPageNum + ".jpg");
            ImageIO.write(bufferedImage, "jpg", imageFile);

            // getFileContentAsBase64 is not shown in this article; it presumably returns
            // the file content as a Base64 string
            String imgStr = ocrAPIFactory.getFileContentAsBase64(imageFile.getPath(), false);
            Map<String, Object> fileMap = ocrAPIFactory.accurate(accessToken, imgStr);
            stateCode = (int) fileMap.get("stateCode");
            if (stateCode != 200) {
                System.out.println("Baidu OCR failed to process the page: " + fileMap.get("error_msg"));
                return new ResponseEntity<>("ERROR".getBytes(), HttpStatus.OK);
            }

            // processOCRResult is not shown in this article; it is expected to pick
            // the wanted value out of words_result and store it in objectMap by page number
            JSONArray wordsResults = (JSONArray) fileMap.get("words_result");
            if (Objects.nonNull(wordsResults)) {
                processOCRResult(wordsResults, currentPageNum, objectMap);
            }
            System.out.println("Page " + currentPageNum + " converted to image.");
        }
        System.out.println("Collecting drawing numbers: done");
        System.out.println("Excel conversion: start");

        // Sort by page number (the map key)
        Map<Integer, String> sortedMap = new TreeMap<>(objectMap);

        // Build the workbook and write it to an in-memory byte stream
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        try (Workbook workbook = new XSSFWorkbook()) {
            Sheet sheet = workbook.createSheet("Map Data");

            // Header row
            Row headerRow = sheet.createRow(0);
            headerRow.createCell(0).setCellValue("Page");
            headerRow.createCell(1).setCellValue("Drawing No.");

            // Data rows
            int rowNum = 1;
            for (Map.Entry<Integer, String> entry : sortedMap.entrySet()) {
                Row row = sheet.createRow(rowNum++);
                row.createCell(0).setCellValue(String.valueOf(entry.getKey()));
                row.createCell(1).setCellValue(Objects.nonNull(entry.getValue()) ? entry.getValue() : "");
            }

            // Fit the column widths to their content
            sheet.autoSizeColumn(0);
            sheet.autoSizeColumn(1);
            workbook.write(outputStream);
        } catch (IOException e) {
            e.printStackTrace();
            return ResponseEntity.status(500).build();
        }

        // Build the HTTP response as a file download
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_OCTET_STREAM);
        headers.setContentDispositionFormData("attachment", "MapData.xlsx");
        System.out.println("Excel conversion: done");

        // Recursively delete the image folder and its content
        try {
            Files.walkFileTree(directory.toPath(), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    Files.delete(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                    Files.delete(dir);
                    return FileVisitResult.CONTINUE;
                }
            });
            System.out.println("Directory deleted successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return ResponseEntity.ok().headers(headers).body(outputStream.toByteArray());
    } catch (Exception e) {
        System.out.println("Excel conversion failed: " + e);
        return new ResponseEntity<>("File upload error: ".getBytes(), HttpStatus.INTERNAL_SERVER_ERROR);
    }
}
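
The endpoint above writes the page images into one fixed images folder under the system temp directory and deletes that folder at the end, so two uploads handled at the same time would share, and could delete, each other's files. One way around this is to give each request its own temporary directory. The sketch below shows the idea with the standard library only; the class name TempImageDir and the "pdf-ocr-images-" prefix are hypothetical, and it is an alternative to the article's approach rather than what the article does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TempImageDir {
    /** Creates a uniquely named directory under the system temp directory. */
    public static Path create() throws IOException {
        return Files.createTempDirectory("pdf-ocr-images-");
    }

    /** Path of the image for a given page, using the same page_N.jpg naming as the endpoint. */
    public static Path pageImage(Path dir, int pageNum) {
        return dir.resolve("page_" + pageNum + ".jpg");
    }

    /** Deletes the directory and everything inside it. */
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = create();
        try {
            // Stand-in for ImageIO.write(...) on a rendered page
            Files.write(pageImage(dir, 1), new byte[] {1, 2, 3});
            System.out.println("Wrote " + pageImage(dir, 1));
        } finally {
            // Clean up even if OCR or Excel generation fails
            deleteRecursively(dir);
        }
        System.out.println("Directory still exists: " + Files.exists(dir));
    }
}
```

Putting the cleanup in a finally block also covers the early returns on OCR errors, which in the endpoint above leave the images behind.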

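5. Conclusion

In Java, extracting text from PDFs is straightforward with PDFBox. PDFBox suits simple documents, while documents with complex structure are better handled through OCR. When choosing an approach, weigh your project requirements, the complexity of the documents, and your performance needs.

This post covered the basic ways to extract text from PDFs and showed how to deal with complex document structures. I hope it helps with your project! If you have questions or suggestions, feel free to leave a comment.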