How to Validate PDF/A Conformance with Python, Tutorial

Why Validate PDF/A with Python?

pdfRest's Query PDF API Tool lets developers and businesses extract specific information from PDF documents programmatically. Among its features is the ability to validate PDF/A conformance. This tutorial demonstrates how to send a PDF/A validation API call to the Query PDF endpoint (pdf-info) using Python, a popular language for scripting and automation thanks to its readability and extensive library support.

Businesses across industries can benefit from PDF/A conformance validation for several reasons. Imagine a large legal firm managing a vast digital archive of case documents. Confirming that these PDFs conform to PDF/A, the ISO standard for long-term archiving, gives the firm confidence that they will remain accessible and usable far into the future, regardless of the software used to create them. This supports reliable document retrieval and accurate search based on embedded metadata, and it reduces compatibility issues when sharing documents with external parties. By automating PDF/A validation, legal firms (and any business with long-term document storage needs) save time, reduce errors, and help preserve the integrity and accessibility of their critical information.

Validate PDF/A with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

pdf_info_endpoint_url = 'https://api.pdfrest.com/pdf-info'

# The /pdf-info endpoint can take a single PDF file or id as input.
# This sample demonstrates querying the pdfa check, which validates conformance and returns true or false
mp_encoder_pdfInfo = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'queries': 'pdfa',
    }
)

# Let's set the headers that the pdf-info endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdfInfo.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to pdf-info endpoint...")
response = requests.post(pdf_info_endpoint_url, data=mp_encoder_pdfInfo, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

Reference: pdf-rest-api-samples on GitHub

Breaking Down the Code

The provided Python code demonstrates how to make a POST request to the pdf-info endpoint of the pdfRest API. The code uses the requests_toolbelt and requests libraries to send a multipart/form-data request, which is suitable for file uploads.

The MultipartEncoder object is created with a dictionary specifying the fields to be included in the request. The 'file' field contains a tuple with the file name, file object, and MIME type. The 'queries' field is a string listing the information to be retrieved from the PDF - in this case, only "pdfa" to perform a PDF/A validation check.

Headers are set to accept JSON responses and to include the content type generated by MultipartEncoder. An 'Api-Key' header is also included, which should be replaced with an actual API key provided by pdfRest.
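
Rather than pasting the key into the script, it can be read from the environment. The following is a minimal sketch of that idea, not part of the pdfRest sample: the function name build_headers and the environment variable PDFREST_API_KEY are hypothetical names chosen for illustration, and the content_type argument is expected to be the value produced by MultipartEncoder.

```python
import os


def build_headers(content_type, env_var="PDFREST_API_KEY"):
    """Build the request headers, reading the API key from the environment.

    PDFREST_API_KEY is a hypothetical variable name used only in this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your pdfRest API key"
        )
    return {
        "Accept": "application/json",
        "Content-Type": content_type,  # e.g. mp_encoder_pdfInfo.content_type
        "Api-Key": api_key,
    }
```

With this helper, the headers dictionary in the sample could be built as build_headers(mp_encoder_pdfInfo.content_type), which keeps the key out of source control.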

The requests.post function sends the request to the specified URL with the encoded data and headers. If the response status indicates success (response.ok), the JSON body is pretty-printed for readability; otherwise the raw response text is printed so the error can be inspected.
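
To act on the result rather than just print it, the parsed JSON needs to be reduced to a single true/false value. The sketch below shows one way to do that. It assumes the result appears under a top-level key named after the query ("pdfa") and that the value is either a boolean or the string "true"/"false"; the tutorial does not show the response body, so check the actual output before relying on this. The function name pdfa_result is hypothetical.

```python
def pdfa_result(response_json, key="pdfa"):
    """Return True or False for the PDF/A check, or None if it cannot be read.

    Assumption: the result is stored under a top-level key named after the
    query ("pdfa"). This key name is not confirmed by the tutorial.
    """
    value = response_json.get(key)
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "false"):
            return lowered == "true"
    # Missing key or an unexpected value: let the caller decide what to do.
    return None
```

Returning None for anything unexpected keeps a malformed or changed response from being silently treated as a failed validation.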

Beyond the Tutorial

By following the steps in this tutorial, you've learned how to send a multipart API request to the pdfRest API to validate conformance of a PDF/A file using Python. This is particularly useful for triggering conditional processing based on the validation results.
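
As an illustration of such conditional processing, the sketch below sorts a batch of already-validated files into groups for follow-up. It assumes you have collected one result per file as True, False, or None (for a result that could not be determined); the function name and the file names in the usage note are hypothetical.

```python
def partition_by_conformance(results):
    """Split a {file_name: True | False | None} mapping into three lists.

    Returns (conforming, non_conforming, unknown), preserving input order.
    """
    conforming, non_conforming, unknown = [], [], []
    for name, is_pdfa in results.items():
        if is_pdfa is True:
            conforming.append(name)
        elif is_pdfa is False:
            non_conforming.append(name)
        else:
            # No usable result, e.g. the request failed for this file.
            unknown.append(name)
    return conforming, non_conforming, unknown
```

For example, calling partition_by_conformance({"brief.pdf": True, "exhibit.pdf": False, "scan.pdf": None}) places each file in a separate list, so conforming files can go straight to the archive while the others are queued for conversion or a retry.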

Feel free to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference Guide for further details and capabilities of the pdfRest API.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at pdf-rest-api-samples on GitHub.