How to Extract PDF Text with Python

Learn how to extract text from a PDF with Python and export it as JSON for data processing, using the pdfRest Extract Text API tool.

Why Extract PDF Text with Python?

The pdfRest Extract Text API Tool is designed to help users extract text from PDF documents programmatically. This tutorial will demonstrate how to use Python to send an API call to the Extract Text endpoint.

This functionality can be particularly useful in scenarios where text needs to be extracted for data analysis, content repurposing, or for feeding into other software systems for further processing.

Extract PDF Text Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'

# The /extracted-text endpoint can take a single PDF file or id as input.
# This sample demonstrates extracting the text from a document and returning it as JSON.
mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'word_style': 'on',
    }
)

# Set the headers that the extracted-text endpoint expects.
# Because MultipartEncoder is used, the 'Content-Type' header is taken from its content_type attribute,
# which is 'multipart/form-data' plus the boundary string generated for this request body.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to extracted-text endpoint...")
response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

Source: GitHub - datalogics/pdf-rest-api-samples

A Breakdown of the Code

The code begins by importing the necessary modules:

from requests_toolbelt import MultipartEncoder
import requests
import json

The MultipartEncoder class from requests_toolbelt encodes the multipart form data. The requests library makes the HTTP request and parses the JSON response (via response.json()), and the json module pretty-prints the parsed result.

The API endpoint URL is defined:

extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'

Next, we create a MultipartEncoder object with the PDF file and additional parameters:

mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'word_style': 'on',
    }
)

The fields dictionary includes the file to be uploaded and the word_style parameter. The file entry is a tuple of the file name to report, the open file object, and the MIME type. Setting word_style to 'on' includes style information for each word in the output.

Headers are set for the request:

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

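The 'Content-Type' header is taken from the content_type attribute of the MultipartEncoder, which supplies 'multipart/form-data' together with the boundary string that separates the parts of the request body. The placeholder 'Api-Key' value must be replaced with your actual API key.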
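Instead of pasting the key into the script, it can be read from the environment. The following is a minimal sketch of that idea, not part of the pdfRest samples: the build_headers helper and the PDFREST_API_KEY variable name are assumptions chosen for illustration.

```python
import os


def build_headers(content_type, env_var='PDFREST_API_KEY'):
    """Build the request headers, reading the API key from the environment.

    build_headers and the PDFREST_API_KEY name are hypothetical; use whatever
    variable name fits your deployment.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f'Set the {env_var} environment variable to your pdfRest API key.'
        )
    return {
        'Accept': 'application/json',
        'Content-Type': content_type,
        'Api-Key': api_key,
    }


# Usage with the encoder from the sample:
# headers = build_headers(mp_encoder_extractText.content_type)
```

Failing early with a clear message when the variable is missing avoids sending a request that the API would reject anyway.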

The POST request is sent, and the response is handled:

response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)
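The call above waits indefinitely if the server never answers and raises if the network fails. A hedged sketch of one way to guard against both follows; post_with_timeout, timeout_seconds, and the injectable post argument are hypothetical names introduced here, while timeout and requests.exceptions.RequestException come from the requests library itself.

```python
import requests


def post_with_timeout(url, body, headers, timeout_seconds=60, post=requests.post):
    """Send the POST with a timeout and return None if no response arrives.

    post_with_timeout is a hypothetical helper; the 60-second default is an
    arbitrary choice for this sketch, not a pdfRest recommendation.
    """
    try:
        return post(url, data=body, headers=headers, timeout=timeout_seconds)
    except requests.exceptions.RequestException as exc:
        # Covers connection errors and timeouts raised by requests.
        print(f'Request failed before a response was received: {exc}')
        return None


# Usage with the names from the sample:
# response = post_with_timeout(extract_text_endpoint_url, mp_encoder_extractText, headers)
```

A None return means there is no status code or body to inspect, so the caller should check for it before reading the response.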

If the response is successful, the JSON response is printed; otherwise, the error text is printed:

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)
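To keep the extracted text for later processing rather than only printing it, the parsed response can be written to a JSON file. This is a small sketch under that assumption: save_json and the output path are hypothetical, and the helper makes no assumptions about which keys the response contains.

```python
import json
from pathlib import Path


def save_json(response_json, out_path):
    """Write the parsed API response to out_path as indented UTF-8 JSON.

    save_json is a hypothetical helper name used for this sketch.
    """
    out_file = Path(out_path)
    out_file.write_text(
        json.dumps(response_json, indent=2, ensure_ascii=False),
        encoding='utf-8',
    )
    return out_file


# Usage inside the success branch of the sample:
# save_json(response_json, 'extracted_text.json')
```

Passing ensure_ascii=False keeps non-ASCII characters from the PDF readable in the saved file instead of escaping them.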

Beyond the Tutorial

In this tutorial, we've walked through how to make a multipart API call to the pdfRest Extract Text endpoint using Python. This allows for the extraction of text from a PDF document and can be used in various applications where text data is needed from PDF files.

For further exploration, you're encouraged to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub - datalogics/pdf-rest-api-samples.
