PDF text extraction using Python

In this tutorial, you will learn how to extract text from a given PDF in Python. We will be using the PyPDF2 module for extracting the text from PDF files.

Installing the module

To install the PyPDF2 module and some other related dependencies, we can use the pip command:

pip install pypdf2
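
Several class and method names changed in PyPDF2 3.0, so it can help to confirm which version pip actually installed. The snippet below is a small sketch using only the standard library (Python 3.8 or later); the helper name installed_version is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if PyPDF2 is installed, otherwise None
print(installed_version("PyPDF2"))
```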

The Details

To extract text from a PDF file, we will use the PdfFileReader class (renamed PdfReader in PyPDF2 3.0). Its constructor takes a stream parameter, to which we pass the file stream of the PDF file.
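
The stream does not have to come from a file on disk: any binary file-like object should work, including an in-memory io.BytesIO buffer. The following is a rough sketch under that assumption, written for PyPDF2 3.0 or later (where the reader class is named PdfReader). The helper names looks_like_pdf and reader_from_bytes are hypothetical, and the header check is only a cheap sanity test based on the fact that well-formed PDF files begin with the bytes %PDF-.

```python
import io

PDF_HEADER = b"%PDF-"  # well-formed PDF files start with these bytes


def looks_like_pdf(data):
    """Cheap sanity check: does the byte string start with the PDF header?"""
    return data.startswith(PDF_HEADER)


def reader_from_bytes(data):
    """Wrap raw PDF bytes in an in-memory stream and build a reader from it."""
    # Imported here so looks_like_pdf can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 3.0

    if not looks_like_pdf(data):
        raise ValueError("data does not start with the %PDF- header")
    return PdfReader(io.BytesIO(data))
```

For example, reader_from_bytes(open("sample.pdf", "rb").read()) would give a reader equivalent to one built directly from the open file.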

Link to Sample File

http://www.africau.edu/images/default/sample.pdf

The Code (pdf_extractor.py)

from PyPDF2 import PdfReader

# open the pdf file in binary read mode; the with block closes it automatically
with open("sample.pdf", "rb") as file:
    # instantiate the object
    # (PdfReader is the PyPDF2 3.0 name for the PdfFileReader class)
    reader = PdfReader(file)

    # metadata can be None when the PDF carries no document info
    info = reader.metadata
    print(f"Printing the document info: {info}")

    print("*******************************************************")

    # get number of pages of the document
    pages = len(reader.pages)
    print(f"Number of Pages: {pages}")

    print(f"PDF file created By: {info.creator if info else None}")

    print("*******************************************************")

    for i in range(0, pages):
        print(f"Page Number: {i+1}")
        print("--------------------------------------")
        page_obj = reader.pages[i]
        print(page_obj.extract_text())
        print("--------------------------------------")
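
The same page loop can be wrapped in reusable functions so the extracted text can be saved or processed instead of only printed. This is a minimal sketch, assuming PyPDF2 3.0 or later (where the reader class is named PdfReader); the function names extract_page_texts and format_report are invented for this example.

```python
def extract_page_texts(path):
    """Return a list with the extracted text of each page of the PDF at path."""
    # Imported here so format_report below can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 3.0

    with open(path, "rb") as file:
        reader = PdfReader(file)
        # Fall back to an empty string if a page yields no text
        return [page.extract_text() or "" for page in reader.pages]


def format_report(page_texts):
    """Join per-page texts into one string with a numbered header per page."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"Page Number: {number}\n{'-' * 38}\n{text.strip()}")
    return "\n\n".join(sections)
```

Calling print(format_report(extract_page_texts("sample.pdf"))) would then produce output similar to the script above, and the returned string could just as well be written to a text file.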

Other applications of PyPDF2 Module

  • Rotating a PDF page by a given angle (in multiples of 90 degrees).
  • Merging two or more PDF files at a defined page number.
  • Appending two or more PDF files, one after another.
  • Reading the meta information of any PDF, such as creator, author, and date of creation.
  • We can even create a new PDF file using the text coming from some text file.
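
Two of the applications above, rotating pages and appending files, can be sketched as follows. This is an illustrative outline rather than a complete tool: it assumes PyPDF2 3.0 or later (PdfReader, PdfWriter and PdfMerger), the function names are made up for this example, and the angle check reflects the assumption that pages can only be rotated in 90-degree steps.

```python
def checked_angle(angle):
    """Validate a rotation angle and reduce it to the range 0..359."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90")
    return angle % 360


def rotate_pdf(src_path, dst_path, angle):
    """Write a copy of src_path with every page rotated clockwise by angle."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 >= 3.0

    angle = checked_angle(angle)
    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(angle)      # rotates the page object in place
        writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)


def append_pdfs(paths, dst_path):
    """Concatenate the PDFs listed in paths, in order, into dst_path."""
    from PyPDF2 import PdfMerger  # assumes PyPDF2 >= 3.0

    merger = PdfMerger()
    for path in paths:
        merger.append(path)     # adds all pages of this file at the end
    merger.write(dst_path)
    merger.close()
```

For instance, rotate_pdf("sample.pdf", "rotated.pdf", 90) followed by append_pdfs(["sample.pdf", "rotated.pdf"], "combined.pdf") would produce a file containing the original pages followed by their rotated copies.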