Tutorial: How to Check PDF Conditions and Metadata with Python


Why Use Query PDF with Python?

The pdfRest Query PDF API Tool lets developers extract information from PDF files programmatically. This tutorial shows how to send an API call to the Query PDF endpoint using Python.

Querying a PDF is particularly useful in automated document processing systems, where you might need to retrieve metadata such as the title, page count, or author of a PDF document before taking further action.

Query PDF with Python Code Example

The following code is a complete example of how to call the Query PDF API endpoint with Python. This code is sourced from the pdfRest API samples available on GitHub at pdf-rest-api-samples.

from requests_toolbelt import MultipartEncoder
import requests
import json

pdf_info_endpoint_url = 'https://api.pdfrest.com/pdf-info'

# The /pdf-info endpoint can take a single PDF file or id as input.
# This sample demonstrates querying the title, page count, document language, and author.
mp_encoder_pdfInfo = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'queries': 'title,page_count,doc_language,author',
    }
)

# Set the headers that the pdf-info endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header is taken from the encoder's content_type attribute,
# which holds 'multipart/form-data' together with the boundary string for this request.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdfInfo.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to pdf-info endpoint...")
response = requests.post(pdf_info_endpoint_url, data=mp_encoder_pdfInfo, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

Breaking Down the Code

The code snippet above is broken down into several parts:

MultipartEncoder: This is used to encode the files and fields for a multipart/form-data POST request.
fields: Here we define the data to be sent. 'file' is the PDF file to be queried and 'queries' is a comma-separated list of the PDF attributes we want to retrieve.
headers: These are the HTTP headers sent with the request. 'Content-Type' is taken from the MultipartEncoder's content_type attribute, which includes the multipart boundary. Replace the placeholder 'Api-Key' value with your actual API key.
requests.post: This line sends the POST request to the pdf-info endpoint with the data and headers.
response handling: After the request, the response is checked. If successful, the JSON response is printed; otherwise, the error text is shown.
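
The 'queries' field and the response handling described above can be sketched with two small helpers. This is a minimal illustration, not part of the pdfRest samples: build_queries and extract_fields are hypothetical names, and the code assumes that each query result appears in the JSON response as a top-level key named after the query, which you should confirm against an actual response or the API Reference.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_queries(names: Iterable[str]) -> str:
    """Join query names into the comma-separated string used for 'queries'."""
    # Strip stray whitespace and skip empty entries.
    cleaned = [name.strip() for name in names if name.strip()]
    return ",".join(cleaned)


def extract_fields(
    response_json: Dict[str, Any], names: Iterable[str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Split the requested names into found values and missing names.

    Assumption: each query result is a top-level key named after the query.
    """
    found: Dict[str, Any] = {}
    missing: List[str] = []
    for name in names:
        if name in response_json:
            found[name] = response_json[name]
        else:
            missing.append(name)
    return found, missing


if __name__ == "__main__":
    wanted = ["title", "page_count", "doc_language", "author"]
    print(build_queries(wanted))

    # Made-up sample data standing in for response.json(); not real API output.
    sample = {"title": "Quarterly Report", "page_count": 12}
    found, missing = extract_fields(sample, wanted)
    print("found:", found)
    print("missing:", missing)
```

In the example above, build_queries(wanted) could replace the hard-coded 'queries' string, and extract_fields(response.json(), wanted) could replace the plain print of the response when you need to know which values actually came back.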

Beyond the Tutorial

By following the steps above, you've learned how to use Python to call the Query PDF API endpoint to retrieve information from a PDF file. You can now integrate this functionality into your applications, enabling you to automate and streamline processes that involve PDFs.
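
As one possible way to integrate the call into a larger workflow, the sketch below filters a batch of files by page count before further processing. It is an illustration under stated assumptions: select_short_documents is a hypothetical helper, query_fn stands for any function you write that wraps the request shown earlier and returns the parsed JSON as a dict, and the 'page_count' key is assumed to mirror the query name.

```python
from typing import Any, Callable, Dict, Iterable, List


def select_short_documents(
    paths: Iterable[str],
    query_fn: Callable[[str], Dict[str, Any]],
    max_pages: int,
) -> List[str]:
    """Return the paths whose reported page count is at most max_pages.

    query_fn is a hypothetical wrapper around the pdf-info request; it takes
    a file path and returns the parsed JSON response as a dict.
    """
    selected: List[str] = []
    for path in paths:
        info = query_fn(path)
        # Assumption: the page count comes back under the 'page_count' key.
        page_count = info.get("page_count")
        # Skip files with a missing or non-integer page count.
        if isinstance(page_count, int) and page_count <= max_pages:
            selected.append(path)
    return selected


if __name__ == "__main__":
    # Stand-in for a real query function, using made-up data.
    fake_results = {
        "a.pdf": {"page_count": 3},
        "b.pdf": {"page_count": 40},
        "c.pdf": {},
    }
    print(select_short_documents(fake_results, lambda p: fake_results[p], max_pages=10))
```

Passing the query function in as a parameter keeps the filtering logic testable without network access; in production you would supply a function that performs the actual POST request.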

We encourage you to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation for more details on what you can achieve with pdfRest.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at JSON Payload Examples.