How to Redact PDF Text with Python

Learn how to use the pdfRest Redact PDF API tool with Python to redact sensitive text in a PDF document.

Why Redact PDF Text with Python?

The pdfRest Redact PDF API Tool lets developers automate the redaction of sensitive information in PDF documents. With it, you can programmatically remove or obscure email addresses, phone numbers, and specific words or phrases. This tutorial shows how to send an API call to the Redact PDF endpoint using Python so that you can build this functionality into your own applications.

Redaction matters in many practical scenarios. For example, a law firm may need to share documents with clients or opposing counsel while keeping certain sensitive information, such as client contact details or proprietary terms, out of view. Similarly, a business may need to circulate reports internally without exposing confidential data. With the Redact PDF API, these tasks can be automated instead of handled by hand.

Redact PDF Text with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

pdf_with_redacted_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-redacted-text-preview'

redaction_options = [{
        "type": "preset",
        "value": "email",
    },
    {
        "type": "regex",
        "value": "(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}",
    },
    {
        "type": "literal",
        "value": "word",
    }]

mp_encoder_redactedtextPDF = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'redactions': json.dumps(redaction_options),
        'output' : 'example_out'
    }
)

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_redactedtextPDF.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to pdf-with-redacted-text-preview endpoint...")
response = requests.post(pdf_with_redacted_text_endpoint_url, data=mp_encoder_redactedtextPDF, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)

Source: GitHub

Breaking Down the Code

The code begins by importing the libraries it needs: MultipartEncoder from requests_toolbelt to build the multipart form data, requests to make the HTTP request, and json to encode and decode JSON.

pdf_with_redacted_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-redacted-text-preview'

This line sets the URL the request is sent to, here the pdf-with-redacted-text-preview endpoint of the Redact PDF API.

redaction_options = [{
        "type": "preset",
        "value": "email",
    },
    {
        "type": "regex",
        "value": "(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}",
    },
    {
        "type": "literal",
        "value": "word",
    }]

The redaction_options variable is a list of the redactions to apply, each given as a type and a value. It includes:

  • preset: Uses the "email" preset to redact email addresses.
  • regex: Uses a regular expression to redact phone numbers.
  • literal: Redacts the literal text "word".
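Before sending a regex rule to the API, it can help to check locally what the pattern matches. The sketch below runs the tutorial's phone-number pattern through Python's built-in re module. The helper name find_phone_numbers and the sample text are illustrative, and the API's regex engine is not guaranteed to behave exactly like Python's, so treat the result as a sanity check only.

```python
import re

# Same pattern as the "regex" entry in redaction_options, written as a raw string.
PHONE_PATTERN = re.compile(r"(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}")


def find_phone_numbers(text):
    """Return every substring of text that the phone pattern matches (hypothetical helper)."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


sample = "Call 555-123-4567 or +1 (555) 987-6543. Internal ref: 5551234567."
print(find_phone_numbers(sample))
# ['555-123-4567', '+1 (555) 987-6543']
```

Note that the pattern requires a separator (whitespace, a dot, or a hyphen) between digit groups, so the unseparated number 5551234567 is not matched.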

mp_encoder_redactedtextPDF = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'redactions': json.dumps(redaction_options),
        'output' : 'example_out'
    }
)

The MultipartEncoder is used to create a multipart form-data payload. It includes:

  • file: The PDF file to redact, given as a tuple of file name, open file object, and MIME type.
  • redactions: The redaction options, serialized to a JSON string with json.dumps.
  • output: The name for the output file.
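Because the redactions field is sent as a JSON string, it can be convenient to build it from a short list of rules. The following is a minimal sketch, not part of the pdfRest sample: build_redactions_field is a hypothetical helper, and KNOWN_TYPES lists only the three types shown in this tutorial, so the API may accept others.

```python
import json

# Assumption: only the three types used in this tutorial's sample are listed here.
KNOWN_TYPES = {"preset", "regex", "literal"}


def build_redactions_field(rules):
    """Serialize (type, value) pairs into the JSON string sent as 'redactions'."""
    entries = []
    for rule_type, value in rules:
        if rule_type not in KNOWN_TYPES:
            raise ValueError(f"unexpected redaction type: {rule_type!r}")
        entries.append({"type": rule_type, "value": value})
    return json.dumps(entries)


print(build_redactions_field([("preset", "email"), ("literal", "word")]))
```

The returned string can be passed as the 'redactions' value in the MultipartEncoder fields in place of json.dumps(redaction_options).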

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_redactedtextPDF.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

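The headers dictionary sets the request headers: Accept asks for a JSON response, Content-Type is taken from the encoder so that it includes the multipart boundary, and Api-Key supplies the API key used for authentication.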
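Hard-coding an API key in a script makes it easy to commit the key by accident. One common alternative, sketched below, is to read it from an environment variable. The helper build_headers and the variable name PDFREST_API_KEY are assumptions made for this example, not names defined by pdfRest.

```python
import os


def build_headers(content_type, env_var="PDFREST_API_KEY"):
    """Build the request headers, reading the API key from the environment.

    The environment variable name is a hypothetical choice for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Accept": "application/json",
        "Content-Type": content_type,
        "Api-Key": api_key,
    }
```

In the tutorial's script, this would be called as build_headers(mp_encoder_redactedtextPDF.content_type).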

response = requests.post(pdf_with_redacted_text_endpoint_url, data=mp_encoder_redactedtextPDF, headers=headers)

This line sends a POST request to the API endpoint with the prepared data and headers.

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)

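The code then checks response.ok, which is true unless the status code is a 4xx or 5xx error. If the request succeeded, the JSON response is pretty-printed; otherwise, the raw response body, which contains the error message, is printed.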
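The same check can be wrapped in a small function that also copes with error bodies that are not valid JSON. This is a sketch under stated assumptions: summarize_response is a hypothetical helper, and it deliberately reports only which top-level fields came back rather than assuming any particular field names in the pdfRest response.

```python
import json


def summarize_response(status_code, body_text):
    """Describe a response without assuming any particular JSON field names."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        body = None  # error bodies are not always JSON

    succeeded = status_code < 400  # response.ok is false for 4xx and 5xx codes
    if succeeded and isinstance(body, dict):
        return f"OK ({status_code}): fields returned: {sorted(body)}"
    if succeeded:
        return f"OK ({status_code}) but body was not a JSON object"
    # Truncate long error bodies so logs stay readable.
    return f"Failed ({status_code}): {body_text[:200]}"
```

With the tutorial's variables, it could be used as print(summarize_response(response.status_code, response.text)).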

Beyond the Tutorial

In this tutorial, you learned how to use Python to send an API call to the pdfRest Redact PDF endpoint, allowing you to automate the redaction of sensitive information from PDF documents. This is just one of the many tools available through pdfRest. To explore further, you can demo all of the pdfRest API Tools in the API Lab and refer to the API Reference Guide for more detailed information.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found on GitHub.
