PDF (Portable Document Format) files are widely used for sharing documents. Python’s PyPDF2 library helps automate PDF-related tasks such as:
Reading PDF files
Extracting text from PDFs
Merging and splitting PDFs
Adding watermarks
Encrypting and decrypting PDFs
Install PyPDF2 (the examples below use the PdfReader and PdfWriter classes found in recent releases):
pip install pypdf2
2. Reading a PDF File
To read and extract text from a PDF:
import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)  # Load the PDF
    num_pages = len(pdf_reader.pages)  # Get number of pages
    print(f"Total Pages: {num_pages}")

    # Extract text from the first page
    text = pdf_reader.pages[0].extract_text()
    print(text)
Now you can extract text from PDFs!
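Text extraction can come back empty for scanned or image-only pages, so it can help to collect text page by page and note which pages yielded nothing. The following is a minimal sketch, assuming only that the reader object exposes `pages` and that each page has `extract_text()` as in the example above. The helper name `extract_all_text` is hypothetical and not part of PyPDF2.

```python
def extract_all_text(reader):
    """Return (text, empty_pages) for any reader exposing .pages.

    `reader` is expected to behave like PyPDF2.PdfReader: iterating
    reader.pages yields objects with an extract_text() method.
    extract_all_text is an illustrative helper, not a PyPDF2 function.
    """
    chunks = []
    empty_pages = []
    for page_num, page in enumerate(reader.pages, start=1):
        # Treat a missing result the same as an empty string
        text = (page.extract_text() or "").strip()
        if text:
            chunks.append(f"--- Page {page_num} ---\n{text}")
        else:
            empty_pages.append(page_num)
    return "\n\n".join(chunks), empty_pages
```

With the reader from the example above, `text, empty_pages = extract_all_text(pdf_reader)` would be called inside the `with` block, while the file is still open.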
3. Writing to a New PDF
You can copy pages from one PDF and write them to a new PDF:
import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    # Add first page to new PDF
    pdf_writer.add_page(pdf_reader.pages[0])

    # Write while the source file is still open, since page data may be read lazily
    with open("new_pdf.pdf", "wb") as output:
        pdf_writer.write(output)
Now a new PDF file is created!
4. Merging Multiple PDFs
To combine multiple PDFs into one:
import PyPDF2

pdf_writer = PyPDF2.PdfWriter()

for pdf in ["file1.pdf", "file2.pdf"]:
    # Passing the path lets PdfReader load the file itself, so no file
    # handle is closed before the merged output is written
    pdf_reader = PyPDF2.PdfReader(pdf)
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

with open("merged.pdf", "wb") as output:
    pdf_writer.write(output)
Now multiple PDFs are merged!
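A hard-coded list works for two files. For a whole folder, the list of inputs can be built first. The sketch below uses only the standard library; the helper name `collect_pdfs` and the choice of case-insensitive alphabetical order are assumptions, so adjust the sort key if the merge order matters differently.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name.

    collect_pdfs is an illustrative helper; subfolders are not searched.
    """
    folder = Path(folder)
    return sorted(
        (p for p in folder.iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )
```

The result can replace the literal list in the merge loop, for example `for pdf in collect_pdfs("invoices"):`, where "invoices" is a placeholder folder name.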
5. Splitting a PDF
To extract specific pages from a PDF:
import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    # Extract pages 1 to 3 (indices 0 to 2, since pages are zero-indexed)
    for page_num in range(0, 3):
        pdf_writer.add_page(pdf_reader.pages[page_num])

    with open("split.pdf", "wb") as output:
        pdf_writer.write(output)
Now a portion of the PDF is saved separately!
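The example above hard-codes the first three pages. To accept page selections the way a print dialog does, a small parser can turn a string such as "1-3,5" into the zero-based indices that `pdf_reader.pages` expects. This is a sketch under stated assumptions: the function name `parse_page_ranges` and the spec format are hypothetical, not something PyPDF2 provides.

```python
def parse_page_ranges(spec, num_pages):
    """Convert a spec like "1-3,5" into zero-based page indices.

    Page numbers in `spec` are 1-based and ranges are inclusive.
    Raises ValueError for malformed parts or pages outside 1..num_pages.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # Ignore empty parts such as a trailing comma
        if "-" in part:
            start_str, end_str = part.split("-", 1)
            start, end = int(start_str), int(end_str)
        else:
            start = end = int(part)
        if start < 1 or end > num_pages or start > end:
            raise ValueError(f"Invalid page range: {part!r}")
        # Shift to zero-based; range() excludes its end, so `end` stays as-is
        indices.extend(range(start - 1, end))
    return indices
```

In the split example, the loop could then read `for page_num in parse_page_ranges("1-3", len(pdf_reader.pages)):`.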
6. Adding Watermarks to PDFs
To overlay a watermark on each page, using the first page of watermark.pdf as the overlay:
import PyPDF2

with open("document.pdf", "rb") as doc, open("watermark.pdf", "rb") as watermark:
    pdf_reader = PyPDF2.PdfReader(doc)
    watermark_reader = PyPDF2.PdfReader(watermark)
    pdf_writer = PyPDF2.PdfWriter()

    for page in pdf_reader.pages:
        page.merge_page(watermark_reader.pages[0])  # Apply watermark
        pdf_writer.add_page(page)

    with open("watermarked.pdf", "wb") as output:
        pdf_writer.write(output)
Now all pages have a watermark!
7. Encrypting and Decrypting PDFs
Encrypt a PDF with a password
import PyPDF2

pdf_writer = PyPDF2.PdfWriter()

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

    pdf_writer.encrypt("mypassword")  # Set password

    with open("encrypted.pdf", "wb") as output:
        pdf_writer.write(output)
Decrypt a PDF
import PyPDF2

with open("encrypted.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    if pdf_reader.is_encrypted:
        pdf_reader.decrypt("mypassword")  # Provide password

    for page in pdf_reader.pages:
        print(page.extract_text())  # Extract decrypted text
Now PDFs can be protected and unlocked!
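Writing "mypassword" directly in a script keeps the secret in source control. One common alternative is to read it from an environment variable and fall back to a prompt. The sketch below uses only the standard library; the function name `get_pdf_password` and the variable name `PDF_PASSWORD` are assumptions chosen for illustration.

```python
import getpass
import os


def get_pdf_password(env_var="PDF_PASSWORD"):
    """Return the password from the environment, or prompt for it.

    get_pdf_password and the PDF_PASSWORD variable name are
    illustrative choices, not part of PyPDF2.
    """
    password = os.environ.get(env_var)
    if password:
        return password
    # Fall back to an interactive prompt that does not echo input
    return getpass.getpass("PDF password: ")
```

The returned value can be passed wherever the examples above use the literal string, for instance `pdf_writer.encrypt(get_pdf_password())`.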
8. Rotating Pages in a PDF
To rotate a page and save it to a new PDF:
import PyPDF2

with open("sample.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()

    # Rotate the first page 90 degrees clockwise
    rotated_page = pdf_reader.pages[0].rotate(90)
    pdf_writer.add_page(rotated_page)

    with open("rotated.pdf", "wb") as output:
        pdf_writer.write(output)
Now the page is rotated!
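PDF page rotation is stored in quarter turns, so angles that are not multiples of 90 are not meaningful here. When the angle comes from user input, it may be worth validating it before calling `rotate()`. This is a minimal sketch; `normalize_rotation` is a hypothetical helper name.

```python
def normalize_rotation(angle):
    """Return `angle` reduced to 0, 90, 180, or 270 degrees clockwise.

    Raises ValueError if `angle` is not a multiple of 90.
    normalize_rotation is an illustrative helper, not a PyPDF2 function.
    """
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90, got {angle}")
    # Python's % always returns a non-negative result for a positive modulus,
    # so a counter-clockwise -90 becomes the equivalent clockwise 270
    return angle % 360
```

For example, `pdf_reader.pages[0].rotate(normalize_rotation(user_angle))`, where `user_angle` is a placeholder for a value supplied at run time.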
9. Extracting Images from a PDF
If a PDF contains images, extract them:
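from PyPDF2 import PdfReader

with open("sample.pdf", "rb") as file:
    pdf_reader = PdfReader(file)
    for page_num, page in enumerate(pdf_reader.pages, start=1):
        for image in page.images:
            # Prefix the page number so images with the same name
            # on different pages do not overwrite each other
            with open(f"page{page_num}_{image.name}", "wb") as img_file:
                img_file.write(image.data)  # Save image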
Now images are extracted from PDFs!
10. Automating PDF Tasks with Schedule
To run PDF processing every day at 08:00 with the schedule library (pip install schedule):
import schedule
import time

def process_pdfs():
    print("Processing PDFs...")
    # Add your PDF automation tasks here

schedule.every().day.at("08:00").do(process_pdfs)

while True:
    schedule.run_pending()
    time.sleep(60)
Now PDF tasks run automatically!
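In a long-running loop like the one above, an exception raised inside the job (for example, a corrupt PDF) would propagate out of `schedule.run_pending()` and stop the script. One way to guard against that is to wrap the job so failures are logged instead. This is a sketch using only the standard library; the decorator name `safe_job` and the logger name "pdf_jobs" are assumptions.

```python
import functools
import logging

logger = logging.getLogger("pdf_jobs")  # Hypothetical logger name


def safe_job(func):
    """Wrap a job so an exception is logged instead of propagating."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message plus the full traceback
            logger.exception("Job %s failed", func.__name__)
            return None
    return wrapper
```

The wrapped function can be registered in place of the original, for instance `schedule.every().day.at("08:00").do(safe_job(process_pdfs))`.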