Daily Blog Post: January 24th, 2023

Jan 24th, 2023

Working with PDF and CSV files in Python using the PyPDF2 and csv modules

Python offers a wide range of modules for working with common file formats such as PDF and CSV, both in its standard library and as third-party packages. In this post, we'll take a look at two popular ones: PyPDF2, a third-party package you install with pip install PyPDF2, and csv, which ships with Python. We'll cover the main classes and functions each one provides.

For many employees, particularly at financial firms, a large part of the job is receiving emails with PDF attachments and entering the data from them into CSV files. Automating this work can save firms thousands of dollars in labor and speed up the flow of documents through the company, which in turn pleases customers. Knowing how to do this is valuable, and in this post we will explain the basics.

First, let's start with the PyPDF2 module. This module allows you to work with PDF files in Python, including reading, writing, and manipulating them. The class you will use most often is PdfReader, which reads a PDF file. (Older tutorials use PdfFileReader, numPages, and getPage(); those names were deprecated and then removed in PyPDF2 3.0.) Here's an example of how to use it:

import PyPDF2

# Open the PDF file in binary mode; the with block closes it automatically
with open('sample.pdf', 'rb') as pdf_file:
    # Create a PdfReader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Print the number of pages in the PDF file
    print(len(pdf_reader.pages))

In this example, we first import the PyPDF2 module. Then, we open a PDF file using the open() function, passing in the filename and the mode 'rb' (reading in binary mode). Next, we create a PdfReader object, passing in the pdf_file object as an argument. The pages attribute of the pdf_reader object behaves like a list of pages, so len(pdf_reader.pages) gives the total number of pages in the PDF file.
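The example above only reports the page count. As a minimal sketch, the helper below collects the text of every page. The function name extract_all_text is hypothetical, and it assumes the reader object exposes a pages sequence whose items have an extract_text() method, which is how PdfReader behaves in recent PyPDF2 releases (older releases used getPage() and extractText() instead).

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, one string per page.

    `reader` is assumed to look like a PyPDF2 PdfReader: it has a `pages`
    sequence, and each page has an `extract_text()` method.
    """
    texts = []
    for page in reader.pages:
        # Scanned or image-only pages may yield no text; store "" for those
        texts.append(page.extract_text() or "")
    return texts


# Hypothetical usage with PyPDF2 (requires `pip install PyPDF2` and a real file):
#
#     import PyPDF2
#     with open('sample.pdf', 'rb') as pdf_file:
#         page_texts = extract_all_text(PyPDF2.PdfReader(pdf_file))
#     print(len(page_texts))
```

The helper only relies on the pages and extract_text() names, so it can be tried out with simple stand-in objects before pointing it at a real PDF.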

Another useful class in PyPDF2 is PdfWriter (called PdfFileWriter before version 3.0), which is used to build a new PDF file, for example from the pages of an existing one. Here's an example of how to use it:

import PyPDF2

# Read the source PDF (PdfReader also accepts a file path)
pdf_reader = PyPDF2.PdfReader('sample.pdf')

# Create a new, empty PDF writer
pdf_writer = PyPDF2.PdfWriter()

# Add the first page of the source PDF to the new file
pdf_writer.add_page(pdf_reader.pages[0])

# Save the PDF file
with open('new.pdf', 'wb') as pdf_file:
    pdf_writer.write(pdf_file)

In this example, we first import the PyPDF2 module and read sample.pdf with PdfReader so that we have a page to copy. Then, we create a new PdfWriter object. Next, we add a page to the new PDF using the add_page() method of the pdf_writer object, passing in the first page of the pdf_reader object (pdf_reader.pages[0]) as an argument. Finally, we use the write() method of the pdf_writer object to save the new PDF file.
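The same idea extends to copying several pages at once. The sketch below is one way to do it; copy_pages is a hypothetical helper name, and it assumes PyPDF2-style objects where reader.pages is a sequence and writer.add_page(page) appends a page (add_page() is the name used by recent PyPDF2 releases in place of addPage()).

```python
def copy_pages(reader, writer, page_indexes):
    """Append the pages at `page_indexes` (0-based) from `reader` to `writer`.

    Assumes PyPDF2-style objects: `reader.pages` is a sequence of pages and
    `writer.add_page(page)` appends one page to the output document.
    """
    for index in page_indexes:
        writer.add_page(reader.pages[index])
    return writer


# Hypothetical usage with PyPDF2 (requires the package and a real file):
#
#     import PyPDF2
#     writer = copy_pages(PyPDF2.PdfReader('sample.pdf'), PyPDF2.PdfWriter(), [0, 2])
#     with open('selected.pdf', 'wb') as pdf_file:
#         writer.write(pdf_file)
```

Passing the page indexes as a list keeps the helper usable for splitting, reordering, or duplicating pages without changing its code.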

Next, let's take a look at the csv module. This module, part of Python's standard library, allows you to read and write CSV files. One of its most commonly used functions is reader(), which is used to read a CSV file. Here's an example of how to use it:

import csv

# Open the CSV file
with open('sample.csv', 'r', newline='') as csv_file:
    # Create a CSV reader object
    csv_reader = csv.reader(csv_file)

    # Iterate over the rows in the CSV file
    for row in csv_reader:
        print(row)

In this example, we first import the csv module. Then, we open a CSV file using the open() function, passing in the filename, the mode 'r' (reading in text mode), and newline='' (which the csv documentation recommends so that line breaks inside quoted fields are handled correctly). Next, we create a csv.reader object, passing in the csv_file object as an argument. Finally, we use a for loop to iterate over the rows in the CSV file and print each row, which arrives as a list of strings.
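When the first row of a CSV file is a header, csv.DictReader can be more convenient than csv.reader because each row comes back as a dictionary keyed by column name. The sketch below reads from an in-memory string (io.StringIO) so it runs without any file on disk; the sample data and the read_rows helper name are made up for illustration.

```python
import csv
import io


def read_rows(csv_file):
    """Return every data row of an open CSV file as a dict keyed by the header row."""
    return [dict(row) for row in csv.DictReader(csv_file)]


# Sample data standing in for a real file object returned by open()
sample = io.StringIO("Name,Age,City\nJohn,30,New York\nJane,25,Chicago\n")

rows = read_rows(sample)
for row in rows:
    # Columns are looked up by name instead of by position
    print(row['Name'], row['City'])
```

Looking fields up by name means the code keeps working if the columns in the file are later reordered.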

Another useful function in the csv module is writer(), which is used to write data to a CSV file. Here's an example of how to use it:

import csv

# Create some data to write to the CSV file
data = [['Name', 'Age', 'City'], ['John', '30', 'New York'], ['Jane', '25', 'Chicago']]

# Open the CSV file
with open('new.csv', 'w', newline='') as csv_file:
    # Create a CSV writer object
    csv_writer = csv.writer(csv_file)
    
    # Write the data to the CSV file
    csv_writer.writerows(data)

In this example, we first create some sample data that we want to write to the CSV file. Then, we open the CSV file using the open() function, passing in the filename, mode 'w' (writing in text mode) and newline='' (to avoid extra empty rows). Next, we create a csv.writer object, passing in the csv_file object as an argument. Finally, we use the writerows() method of the csv_writer object to write the data to the CSV file.
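The csv module also takes care of quoting. As a small sketch, the code below uses csv.DictWriter to write dictionaries as rows and shows that a field containing a comma is wrapped in quotes automatically. It writes to an in-memory io.StringIO buffer instead of a file so the result can be inspected directly; the sample records are invented.

```python
import csv
import io

# Invented sample records; note the comma inside the second City value
records = [
    {'Name': 'John', 'Age': '30', 'City': 'New York'},
    {'Name': 'Jane', 'Age': '25', 'City': 'Chicago, IL'},
]

# newline='' mirrors how a real output file should be opened
buffer = io.StringIO(newline='')
csv_writer = csv.DictWriter(buffer, fieldnames=['Name', 'Age', 'City'])

csv_writer.writeheader()       # writes the header row from fieldnames
csv_writer.writerows(records)  # writes one row per dictionary

output = buffer.getvalue()
print(output)
```

This is the main reason to prefer the csv module over joining strings with commas by hand: a value such as "Chicago, IL" stays in a single column when the file is read back.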

It is possible to use both the PyPDF2 and csv modules together to extract data from a PDF file and automatically fill it into a CSV file.

Here's an example of how to do this:

import PyPDF2
import csv

# Open the PDF file
with open('sample.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Extract the text from the first page of the PDF file
    page = pdf_reader.pages[0]
    pdf_text = page.extract_text()

# Open the CSV file
with open('sample.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    # Split the PDF text by new lines
    lines = pdf_text.split('\n')
    for line in lines:
        # Split each line by commas
        fields = line.split(',')
        csv_writer.writerow(fields)

In this example, we first open the PDF file using the open() function and create a PdfReader object. Then, we take the first page from pdf_reader.pages and call its extract_text() method to extract the text from the page. Next, we open the CSV file using the open() function and create a csv.writer object. Then, we use the split() method to split the PDF text by new lines, and for each line we split it by commas. Finally, we use the writerow() method to write each line as a new row in the CSV file. Note that this assumes the text on the page is already laid out as comma-separated values.

This example demonstrates how to extract data from a PDF and use it to fill a CSV file. However, it is a simple example; in real-world use cases, the data extraction and formatting steps are usually more complex.
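As one illustration of that extra complexity, text extracted from a PDF is often laid out as labelled fields rather than comma-separated values. The sketch below assumes a hypothetical invoice layout in which each line looks like "Label: value"; the field names, the sample text, and the helper name parse_labelled_fields are all invented for the example. It works on a plain string, so the same function could be applied to whatever text the PDF extraction step returns.

```python
import csv
import io
import re

# Matches lines such as "Invoice Number: 1042" (label, colon, value)
FIELD_PATTERN = re.compile(r'^\s*([^:]+?)\s*:\s*(.+?)\s*$')


def parse_labelled_fields(text):
    """Turn 'Label: value' lines into a dict, ignoring lines that do not match."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            label, value = match.groups()
            fields[label] = value
    return fields


# Invented text standing in for the output of the PDF extraction step
pdf_text = """ACME Corp
Invoice Number: 1042
Customer: Jane Doe
Total: 1,250.00
"""

record = parse_labelled_fields(pdf_text)

# Write the record as one CSV row under a header built from the labels
buffer = io.StringIO(newline='')
csv_writer = csv.DictWriter(buffer, fieldnames=list(record))
csv_writer.writeheader()
csv_writer.writerow(record)
print(buffer.getvalue())
```

Real documents rarely follow one pattern this cleanly, so a regular expression like this would need to be adapted to the layout of the PDFs you actually receive.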

In conclusion, PyPDF2 and csv provide easy-to-use tools for working with PDF and CSV files respectively. Both are widely used and can help you automate repetitive document-handling tasks.

Note: If you're looking for more advanced functionality for working with PDFs, such as creating PDFs from scratch or working with PDF forms, you may also want to check out other libraries such as reportlab, pdfrw, and PyMuPDF.