

Practical Tips | Reading and Writing PDF Files with Python

Python训练室

2025-11-09


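In this article, I will show how to use Python PDF libraries to extract text, tables, images and other data from PDF documents. PDF is a file format we deal with every day, and sometimes we need to edit a document or pull useful data out of it. We use Python to extract, edit and write PDFs because it has a large community and many modules for working with this format.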

This article covers the following topics:

Extracting text
Extracting tables
Extracting images
Writing a PDF

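Extracting Text

Python has many libraries that can help us read PDFs; the best known are PyPDF2 and PDFMiner. This section looks at reading text with PyPDF2.

Extracting Text with PyPDF2

PyPDF2 is a third-party module built for Python 3. It offers the same functionality as its predecessor pyPdf, which supported Python 2.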

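# Import the reader class
# (PyPDF2 3.x names; 1.x used PdfFileReader, getPage and extractText)
from PyPDF2 import PdfReader

# Open the PDF file in binary mode
pdfFileObj = open('file.pdf', 'rb')

# Create a PDF reader object
pdfReader = PdfReader(pdfFileObj)

# Get the first page (pages are indexed from 0)
pageObj = pdfReader.pages[0]

# Extract the text from the page
print(pageObj.extract_text())

# Close the PDF file object
pdfFileObj.close()

Let's break the code down line by line.

Import the class we need.
Open the file in binary mode with the built-in open() function.
Pass the open file to PdfReader so that PyPDF2 can read it.
Get the page at index 0 (the first page) from pdfReader.pages and store it in pageObj.
Call extract_text() on pageObj to extract its text.
Finally, close pdfFileObj.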

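Always close the file when you are done with it. A handle that is left open keeps holding operating-system resources and can lead to errors in later file operations.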
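One way to make sure the file is always closed, even if an error occurs while reading, is Python's with statement. The sketch below is only an illustration: the helper name looks_like_pdf and the demo file are hypothetical, and instead of calling PyPDF2 it checks the %PDF- marker that PDF files start with, so it runs with the standard library alone.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    # The with statement closes the file automatically, even on errors.
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    return header == b"%PDF-"


# Demo on a throwaway file that only contains a PDF-style header.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.pdf")
with open(demo_path, "wb") as f:
    f.write(b"%PDF-1.4\n")

print(looks_like_pdf(demo_path))
```

The same with open(...) pattern can wrap the PyPDF2 calls, which makes the explicit close() call unnecessary.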

So far we have extracted the text of a single page, but with a loop we can extract every page in one go.

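# Import the reader class
# (PyPDF2 3.x names; 1.x used PdfFileReader, numPages, getPage and extractText)
from PyPDF2 import PdfReader

# Open the PDF file in binary mode
pdfFileObj = open('file.pdf', 'rb')

# Create a PDF reader object
pdfReader = PdfReader(pdfFileObj)

# Extract and print the text of every page
for i in range(len(pdfReader.pages)):
    pageObj = pdfReader.pages[i]
    print(pageObj.extract_text())

# Close the PDF file object
pdfFileObj.close()

len(pdfReader.pages) returns the number of pages in the PDF file, so we can use range() to iterate over the page indexes and extract the text one page at a time.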
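To keep the text instead of printing it, the loop can collect each page's text into a single string. The sketch below is a minimal illustration under assumptions: it uses the PyPDF2 3.x names PdfReader, pages and extract_text(), the helper names join_page_text and extract_pdf_text are hypothetical, and the PyPDF2 import sits inside the function so that the joining logic can be tried without PyPDF2 installed.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page-like object, in order."""
    # "or ''" guards against a page that yields no text at all.
    return separator.join((page.extract_text() or "") for page in pages)


def extract_pdf_text(path):
    """Read all pages of a PDF file and return their text as one string."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x is installed

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return join_page_text(reader.pages)


class FakePage:
    """Stand-in for a PDF page, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


print(join_page_text([FakePage("page one"), FakePage("page two")]))
```

With a real file, extract_pdf_text('file.pdf') would return the text of the whole document, which can then be searched or written to disk.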

Let's test this code on a sample PDF.

Output:

A Simple PDF File  This is a small demonstration .pdf file -  just for usein the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ... Simple PDF File2  ...continued from page 1. Yet more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. Oh, how boring typing this stuff. But notas boring as watching  paint dry. And more text. And more text. And more text. And more text.  Boring.  More, a little more text. The end, and just as well.

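Extracting Tables from a PDF

PyPDF2 is not a good fit for extracting tables: it returns plain text and loses the row and column structure. To extract a table properly, a tool first has to detect where the table is on the page and then work out its rows and columns.

For this task there is a library called Tabula. Its Python wrapper is a third-party module (installed as tabula-py) designed specifically to read tables from PDF files and save them in CSV format.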

import tabula

# Read the PDF into a list of DataFrames
df = tabula.read_pdf("test.pdf", pages='all')

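Let's break down the code above and see how it works.

First, import the required module.
Call tabula.read_pdf() with the PDF file name and set pages to 'all', which means tables are extracted from every page.

If you print df, you will see the tables in the console. Usually, though, printing to the screen is not enough and we need to store the extracted data somewhere.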

import tabula

# Convert the tables in the PDF into a CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

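To convert straight to CSV, I used the convert_into() function. The input file name and the pages argument are the same as in the first snippet; the two new arguments are the output file name and the output format.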
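After the conversion, it can be useful to load the CSV back and check what was extracted. The sketch below uses only the standard-library csv module; the helper name read_csv_rows is hypothetical, and the demo writes a small placeholder file in place of the real output.csv so that it runs without Tabula installed.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Return the rows of a CSV file as a list of lists of strings."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        return list(csv.reader(csv_file))


# Demo: a small placeholder file standing in for output.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "output.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["name", "qty"], ["apple", "3"]])

rows = read_csv_rows(demo_path)
print(len(rows), rows[0])
```

Checking the row count and the header row this way is a quick sanity test that the table boundaries were detected as expected.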

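Extracting Images from a PDF

PyPDF2 can pull the contents of a table out as text, but the result is not in tabular form and is hard to read, which is why we did not use it for tables. Images are a different matter again: they are visual data, not text, so extracting them requires other third-party modules.

Use the following command to install the two modules we need: PyMuPDF and Pillow, Python's image-processing library.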

pip install PyMuPDF Pillow

# Import libraries
import io

import fitz  # PyMuPDF
from PIL import Image

pdf_file = fitz.open("test2.pdf")

# Iterate over the PDF pages
for page_index in range(len(pdf_file)):
    # Get the page itself
    page = pdf_file[page_index]
    # get_images() is the current name of getImageList()
    image_list = page.get_images()

    for image_index, img in enumerate(image_list, start=1):
        # Get the XREF of the image
        xref = img[0]

        # Extract the image bytes (extract_image() is the current name of extractImage())
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]

        # Get the image extension
        image_ext = base_image["ext"]

        # Load it into PIL
        image = Image.open(io.BytesIO(image_bytes))

        # Save it; passing a file name lets Pillow open and close the file itself
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
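Since base_image["image"] already holds the encoded image bytes and base_image["ext"] holds the matching file extension, the bytes can also be written to disk as they are, without decoding them through Pillow. A minimal sketch of that idea, assuming a hypothetical helper save_image_bytes and the same file-naming scheme as the extraction loop; the bytes in the demo are a placeholder, not a real image.

```python
import os
import tempfile


def save_image_bytes(image_bytes, image_ext, page_index, image_index, out_dir="."):
    """Write already-encoded image bytes to out_dir and return the file path."""
    # Same naming scheme as the extraction loop: image<page>_<n>.<ext>
    filename = f"image{page_index + 1}_{image_index}.{image_ext}"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as image_file:
        image_file.write(image_bytes)
    return path


# Demo with placeholder bytes standing in for base_image["image"].
demo_dir = tempfile.mkdtemp()
saved = save_image_bytes(b"\x89PNG placeholder", "png", 0, 1, out_dir=demo_dir)
print(os.path.basename(saved))  # image1_1.png
```

Writing the raw bytes keeps the image exactly as it was stored in the PDF; going through Pillow is the better choice when the image needs to be converted or resized first.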

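Let's test this code on a real PDF file that contains images.

Writing a PDF

So far we have covered several ways to extract data from a PDF. Now it is time to learn how to write one. To create a PDF file we will use the fpdf2 module.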

pip install fpdf2

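fpdf2 is easy to use
It supports page formats and margins
It supports managing headers and footers
Font formatting
Support for PNG, GIF and JPG images
Templates with a visual designer, and basic html2pdf conversion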

# Import the module
from fpdf import FPDF

pdf = FPDF()

# Add a page
pdf.add_page()

# Set the font family and size
pdf.set_font("Arial", size=15)

# Create a cell
pdf.cell(200, 10, txt="Medium Article", ln=1, align='C')

# Add another cell
pdf.cell(200, 10, txt="How To Read and Write PDF files in Python", ln=2, align='C')

# Save the PDF
pdf.output("medium.pdf")