How to Extract PDF Text with Python
Why Extract PDF Text with Python?
The pdfRest Extract Text API Tool is designed to help users extract text from PDF documents programmatically. This tutorial will demonstrate how to use Python to send an API call to the Extract Text endpoint.
This is particularly useful when text must be extracted for data analysis, content repurposing, or input to other software systems for further processing.
Extract PDF Text Python Code Example
from requests_toolbelt import MultipartEncoder
import requests
import json

extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'

# The /extracted-text endpoint can take a single PDF file or id as input.
# This sample demonstrates extracting the text from a document to return as JSON
mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'word_style': 'on',
    }
)

# Let's set the headers that the extracted-text endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to extracted-text endpoint...")
response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)
Source: GitHub - datalogics/pdf-rest-api-samples
A Breakdown of the Code
The code begins by importing the necessary modules:
from requests_toolbelt import MultipartEncoder
import requests
import json
MultipartEncoder from requests_toolbelt encodes the multipart form data, the requests library makes the HTTP request, and json formats the parsed JSON response for printing.
The API endpoint URL is defined:
extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'
Next, we create a MultipartEncoder object with the PDF file and additional parameters:
mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'word_style': 'on',
    }
)
The fields dictionary includes the file to be uploaded and the word_style parameter, which, when set to 'on', includes style information for each word in the output.
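The sample opens the PDF inline and never closes it. One possible refinement, sketched below, is to build the fields dictionary in a small helper and let the caller manage the file with a with block. The helper name build_fields and its word_style flag are hypothetical, not part of the pdfRest samples, and leaving word_style out of the form when the flag is False assumes the endpoint treats the parameter as optional.

```python
from pathlib import Path


def build_fields(pdf_path, file_handle, word_style=True):
    """Build the form fields for the extracted-text request.

    pdf_path    -- path of the PDF, used only to derive the uploaded file name
    file_handle -- the PDF, already opened in binary mode by the caller
    word_style  -- when True, send word_style='on' as in the sample above
                   (omitting the field otherwise is an assumption)
    """
    fields = {
        'file': (Path(pdf_path).name, file_handle, 'application/pdf'),
    }
    if word_style:
        fields['word_style'] = 'on'
    return fields


# Possible usage with the encoder from this tutorial (not executed here):
#
#   from requests_toolbelt import MultipartEncoder
#
#   with open('/path/to/file', 'rb') as pdf:
#       encoder = MultipartEncoder(fields=build_fields('/path/to/file', pdf))
#       ...  # send the request here, while the file is still open
```

Because MultipartEncoder streams the file when the request is sent, the POST call would need to happen inside the with block, before the file is closed.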
Headers are set for the request:
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}
The 'Content-Type' header is taken from the content_type attribute of MultipartEncoder, which supplies 'multipart/form-data' together with the boundary string that separates the form parts. The placeholder 'Api-Key' value must be replaced with your actual API key.
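Pasting a real key into the script makes it easy to leak through version control. As a minimal sketch, the headers could instead be built by a helper that reads the key from the environment. The function name build_headers and the environment variable name PDFREST_API_KEY are assumptions made for this illustration; pdfRest does not prescribe them.

```python
import os


def build_headers(content_type, api_key=None):
    """Return the request headers used in this tutorial.

    content_type -- normally the encoder's content_type attribute
    api_key      -- explicit key; if omitted, the hypothetical
                    PDFREST_API_KEY environment variable is used
    """
    key = api_key or os.environ.get('PDFREST_API_KEY')
    if not key:
        raise RuntimeError('No API key given and PDFREST_API_KEY is not set')
    return {
        'Accept': 'application/json',
        'Content-Type': content_type,
        'Api-Key': key,
    }
```

With this in place, the headers from the sample could be produced with build_headers(mp_encoder_extractText.content_type), keeping the key itself out of the source file.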
The POST request is then sent with the encoder as the request body:
response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)
If the response is successful, the JSON response is printed; otherwise, the error text is printed:
if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)
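Printing is enough for a demonstration, but a script that feeds other systems will usually want to keep the result. The sketch below shows one way to do that under stated assumptions: save_response is a hypothetical helper, it accepts any object with the ok, status_code, text, and json() members that requests.Response provides, and it makes no assumption about which keys the returned JSON contains.

```python
import json
from pathlib import Path


def save_response(response, output_path):
    """Write a successful JSON response to disk and return the parsed data.

    Raises RuntimeError with the status code and body text if the
    request did not succeed.
    """
    if not response.ok:
        raise RuntimeError(
            f'Request failed with status {response.status_code}: {response.text}'
        )
    data = response.json()
    # Store the same pretty-printed form that the sample prints.
    Path(output_path).write_text(json.dumps(data, indent=2), encoding='utf-8')
    return data
```

Called as save_response(response, 'extracted_text.json') after the POST request, this would leave the extracted text on disk for later processing; the output file name is illustrative.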
Beyond the Tutorial
In this tutorial, we've walked through how to make a multipart API call to the pdfRest Extract Text endpoint using Python. This allows for the extraction of text from a PDF document and can be used in various applications where text data is needed from PDF files.
For further exploration, you're encouraged to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation.
Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub - datalogics/pdf-rest-api-samples.