Automating PDFs with PyPDF2

PDF (Portable Document Format) files are widely used for sharing documents. Python’s PyPDF2 library helps automate PDF-related tasks such as:
Reading PDF files
Extracting text from PDFs
Merging and splitting PDFs
Adding watermarks
Encrypting and decrypting PDFs

Install PyPDF2:

pip install pypdf2

2. Reading a PDF File

To read and extract text from a PDF:

import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)  # Load the PDF
    num_pages = len(pdf_reader.pages)  # Get number of pages
    print(f"Total Pages: {num_pages}")

    # Extract text from the first page
    text = pdf_reader.pages[0].extract_text()
    print(text)

Now you can extract text from PDFs!
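
Image-only pages, such as scans, may return no text at all, so a loop over the whole document is safer when it treats an empty result as an empty string. The sketch below shows one way to do that. The helper name extract_all_text is hypothetical, and it only assumes the reader exposes pages and each page has extract_text(), as in the example above.

```python
def extract_all_text(reader):
    """Return the text of every page, separated by blank lines."""
    parts = []
    for page in reader.pages:
        # Fall back to an empty string when a page yields no text
        parts.append(page.extract_text() or "")
    return "\n\n".join(parts)
```

Call it with a reader created as shown earlier, for example extract_all_text(PyPDF2.PdfReader("sample.pdf")).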


3. Writing to a New PDF

You can copy pages from one PDF and write them to a new PDF:

import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    # Add first page to new PDF
    pdf_writer.add_page(pdf_reader.pages[0])

    # Write the output while the source file is still open
    with open("new_pdf.pdf", "wb") as output:
        pdf_writer.write(output)

Now a new PDF file is created!


4. Merging Multiple PDFs

To combine multiple PDFs into one:

import PyPDF2

pdf_writer = PyPDF2.PdfWriter()

for pdf in ["file1.pdf", "file2.pdf"]:
    # Pass the path itself so the reader does not depend on a file
    # handle that is already closed when write() runs
    pdf_reader = PyPDF2.PdfReader(pdf)
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

with open("merged.pdf", "wb") as output:
    pdf_writer.write(output)

Now multiple PDFs are merged!
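
A hard-coded list of file names works for two files, but automation usually starts from a folder. Here is a minimal sketch of that idea. The names collect_pdfs and merge_folder are hypothetical, and sorting by file name is an assumption about the order you want.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the .pdf files in `folder`, sorted by name (case-insensitive)."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )


def merge_folder(folder, output_path):
    """Merge every PDF found in `folder` into one file at `output_path`."""
    import PyPDF2  # imported here so collect_pdfs also works without PyPDF2

    writer = PyPDF2.PdfWriter()
    for path in collect_pdfs(folder):
        reader = PyPDF2.PdfReader(str(path))
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, "wb") as output:
        writer.write(output)
```

Write the merged file outside the source folder, otherwise a later run would pick up the previous output as one of the inputs.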


5. Splitting a PDF

To extract specific pages from a PDF:

import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    # Extract pages 1 to 3 (page indices are zero-based, so 0 to 2)
    for page_num in range(0, 3):
        pdf_writer.add_page(pdf_reader.pages[page_num])

    # Write the output while the source file is still open
    with open("split.pdf", "wb") as output:
        pdf_writer.write(output)

Now a portion of the PDF is saved separately!
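
People think of pages as 1-based ("pages 1 to 3"), while the pages list is indexed from 0. A small parser can bridge the two. This is a sketch only: parse_page_ranges and the "1-3,5" spec format are assumptions made for illustration, not part of PyPDF2.

```python
def parse_page_ranges(spec, total_pages):
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces such as a trailing comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        # Page N lives at index N - 1; range() excludes its end value
        indices.extend(range(start - 1, end))
    return indices
```

In the splitting example, the loop could then become: for i in parse_page_ranges("1-3", len(pdf_reader.pages)), followed by pdf_writer.add_page(pdf_reader.pages[i]).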


6. Adding Watermarks to PDFs

To overlay a watermark on each page:

import PyPDF2

with open("document.pdf", "rb") as doc, open("watermark.pdf", "rb") as watermark:
    pdf_reader = PyPDF2.PdfReader(doc)
    watermark_reader = PyPDF2.PdfReader(watermark)
    pdf_writer = PyPDF2.PdfWriter()

    for page in pdf_reader.pages:
        page.merge_page(watermark_reader.pages[0])  # Apply watermark
        pdf_writer.add_page(page)

    # Write the output while both source files are still open
    with open("watermarked.pdf", "wb") as output:
        pdf_writer.write(output)

Now all pages have a watermark!
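
Sometimes a cover page should stay clean. The following sketch wraps the same merge_page loop in a function with an option to leave the first page untouched. The function name add_watermark and the skip_first parameter are hypothetical; the function only assumes the reader and writer behave like the PyPDF2 objects used above.

```python
def add_watermark(reader, writer, watermark_page, skip_first=False):
    """Copy all pages to `writer`, stamping `watermark_page` onto each one."""
    for index, page in enumerate(reader.pages):
        if not (skip_first and index == 0):
            page.merge_page(watermark_page)  # draws the watermark onto this page
        writer.add_page(page)
    return writer
```

With the objects from the example above, the call would be add_watermark(pdf_reader, pdf_writer, watermark_reader.pages[0], skip_first=True).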


7. Encrypting and Decrypting PDFs

Encrypt a PDF with a password

import PyPDF2

pdf_writer = PyPDF2.PdfWriter()

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)

    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

    pdf_writer.encrypt("mypassword")  # Set password

    # Write the output while the source file is still open
    with open("encrypted.pdf", "wb") as output:
        pdf_writer.write(output)

Decrypt a PDF

with open("encrypted.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_reader.decrypt("mypassword")  # Provide password

    for page in pdf_reader.pages:
        print(page.extract_text())  # Extract decrypted text

Now PDFs can be protected and unlocked!
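
The decrypt example above ignores whether the password was actually accepted. A small guard can check that before reading pages. This sketch uses the reader's is_encrypted property and assumes decrypt() returns a zero (falsy) value when the password does not match; the helper name unlock is hypothetical.

```python
def unlock(reader, password):
    """Return True if `reader` is readable, decrypting it first if needed."""
    if not reader.is_encrypted:
        return True  # nothing to decrypt
    # Assumption: decrypt() signals a wrong password with a zero result
    return bool(reader.decrypt(password))
```

A script can then stop early with a clear message when unlock(pdf_reader, "mypassword") is False, instead of failing later during text extraction.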


8. Rotating Pages in a PDF

To rotate pages in a PDF:

import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    rotated_page = pdf_reader.pages[0].rotate(90)  # Rotate first page 90 degrees clockwise
    pdf_writer.add_page(rotated_page)

    # Write the output while the source file is still open
    with open("rotated.pdf", "wb") as output:
        pdf_writer.write(output)

Now the page is rotated!
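
Rotation works in steps of 90 degrees, so it helps to validate the angle before touching any page. The sketch below rotates every page rather than just the first. The helper name rotate_all is hypothetical, and it assumes rotate() returns the rotated page, as the example above relies on.

```python
def rotate_all(reader, writer, angle):
    """Add every page of `reader` to `writer`, rotated clockwise by `angle`."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    for page in reader.pages:
        writer.add_page(page.rotate(angle))
    return writer
```

For example, rotate_all(pdf_reader, pdf_writer, 180) turns a document that was scanned upside down the right way up.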


9. Extracting Images from a PDF

If a PDF contains images, extract them:

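from PyPDF2 import PdfReader

with open("sample.pdf", "rb") as file:
    pdf_reader = PdfReader(file)
    for page_num, page in enumerate(pdf_reader.pages, start=1):
        for image in page.images:
            # Prefix the page number so images with the same name on
            # different pages do not overwrite each other
            with open(f"page{page_num}_{image.name}", "wb") as img_file:
                img_file.write(image.data)  # Save image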

Now images are extracted from PDFs!


10. Automating PDF Tasks with Schedule

To run PDF processing every day, use the third-party schedule library (install it with pip install schedule):

import schedule
import time

def process_pdfs():
    print("Processing PDFs...")
    # Add your PDF automation tasks here

schedule.every().day.at("08:00").do(process_pdfs)

while True:
    schedule.run_pending()
    time.sleep(60)

Now PDF tasks run automatically!
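
One thing to watch in a long-running loop: schedule does not catch exceptions raised inside a job, so a single corrupt PDF could stop the whole script. A decorator that logs the error and carries on is one way around this. The name keep_running is hypothetical, and this is a sketch rather than a full error-handling strategy.

```python
import functools
import logging


def keep_running(job):
    """Wrap `job` so an exception is logged instead of stopping the loop."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception:
            # Log the full traceback, then let the scheduler loop continue
            logging.exception("Scheduled job %s failed", job.__name__)
            return None
    return wrapper


@keep_running
def process_pdfs():
    print("Processing PDFs...")
    # Add your PDF automation tasks here
```

The decorated process_pdfs can be registered exactly as before with schedule.every().day.at("08:00").do(process_pdfs).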
