Мастерская работы с PDF в Python: от извлечения данных до генерации документов

CORP · 28 Май 2025

🔥 Полезные библиотеки Python

Python PDF Handling Tutorial — интересная подборка скриптов для работы с PDF-файлами в Python:

Вы научитесь:
➡️ Извлекать текст и изображения из PDF файлов;
➡️ Извлекать таблицы и URL адреса из PDF файлов;
➡️ Извлекать страницы из PDF файлов как изображения;
➡️ Создавать PDF файлы;
➡️ Добавлять текст, изображения и таблицы в PDF файлы;
➡️ Выделять текст в PDF файлах и многое другое.

Код:

from io import StringIO
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
 
# PDFMiner Analyzers
rsrcmgr = PDFResourceManager()
sio = StringIO()
codec = "utf-8"
laparams = LAParams()
device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
 
# path to our input file
pdf_file = "sample.pdf"
 
# Extract text
pdfFile = open(pdf_file, "rb")
for page in PDFPage.get_pages(pdfFile):
   interpreter.process_page(page)
pdfFile.close()
 
# Return text from StringIO
text = sio.getvalue()
 
print(text)
 
# Freeing Up
device.close()
sio.close()

Код:

import fitz
import io
from PIL import Image
 
# path to our input file
pdf_file = "sample.pdf"
 
# Input PDF file
pdf_file = fitz.open(pdf_file)
 
for page_no in range(len(pdf_file)):
   curr_page = pdf_file[page_no]
   images = curr_page.getImageList()
 
   for image_no, image in enumerate(curr_page.getImageList()):
       # get the XREF of the image
       xref = image[0]
       # extract the image bytes
       curr_image = pdf_file.extractImage(xref)
       img_bytes = curr_image["image"]
       # get the image extension
       img_extension = curr_image["ext"]
       # load it to PIL
       image = Image.open(io.BytesIO(img_bytes))
       # save it to local disk
       image.save(open(f"page{page_no+1}_img{image_no}.{img_extension}", "wb"))

⚙️ GitHub/Инструкция -

Скрытое содержимое доступно для зарегистрированных пользователей!

Мастерская работы с PDF в Python: от извлечения данных до генерации документов

CORP

Administrator

Что-то похожее