How to Redact PDF Text with Python
Why Redact PDF Text with Python?
The pdfRest Redact PDF API Tool lets developers automate the redaction of sensitive information in PDF documents. With it, you can programmatically remove or obscure information such as email addresses, phone numbers, and specific words or phrases. This tutorial walks through sending an API call to the Redact PDF endpoint using Python so that you can build this functionality into your own applications.
Redaction matters in many everyday workflows. For example, a legal firm may need to share documents with clients or opposing counsel while ensuring that sensitive information, such as client contact details or proprietary terms, is not visible. Similarly, businesses handling confidential information may need to distribute reports internally while keeping sensitive data protected. The Redact PDF API lets you automate these tasks instead of redacting each document by hand.
Redact PDF Text with Python Code Example
from requests_toolbelt import MultipartEncoder
import requests
import json

pdf_with_redacted_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-redacted-text-preview'

redaction_options = [{
    "type": "preset",
    "value": "email",
}, {
    "type": "regex",
    "value": "(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}",
}, {
    "type": "literal",
    "value": "word",
}]

mp_encoder_redactedtextPDF = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'redactions': json.dumps(redaction_options),
        'output' : 'example_out'
    }
)

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_redactedtextPDF.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to pdf-with-redacted-text endpoint...")
response = requests.post(pdf_with_redacted_text_endpoint_url, data=mp_encoder_redactedtextPDF, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)
Source: GitHub
Breaking Down the Code
The code begins by importing the necessary libraries: requests_toolbelt for handling multipart form data, requests for making HTTP requests, and json for handling JSON data.
pdf_with_redacted_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-redacted-text-preview'
This line sets the endpoint URL for the Redact PDF API.
redaction_options = [{
    "type": "preset",
    "value": "email",
}, {
    "type": "regex",
    "value": "(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}",
}, {
    "type": "literal",
    "value": "word",
}]
The redaction_options variable defines the types of redactions to apply. It includes:
preset: Redacts email addresses.
regex: Uses a regular expression to redact phone numbers.
literal: Redacts the word "word".
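Before sending the regex entry to the API, it can help to see what the pattern matches. The following is a minimal local sketch using Python's built-in re module; the sample strings are made up for illustration, and the service's regex engine may not behave identically to Python's, so treat the result as a sanity check rather than a guarantee.

```python
import re

# Same pattern as the "regex" entry in redaction_options, written as a raw string.
PHONE_PATTERN = re.compile(r"(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}")


def find_phone_numbers(text):
    """Return every substring of text that the pattern matches."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


# Made-up sample strings, used only for this local check.
samples = [
    "Call 555-123-4567 today",
    "Office: +1 (555) 123-4567",
    "Fax 555.123.4567",
    "Order 5551234567 shipped",
]

for sample in samples:
    print(sample, "->", find_phone_numbers(sample))
```

Note that the pattern requires a separator (whitespace, a period, or a hyphen) between digit groups, so a run of ten digits with no separators is not matched.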
mp_encoder_redactedtextPDF = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'redactions': json.dumps(redaction_options),
        'output' : 'example_out'
    }
)
The MultipartEncoder is used to create a multipart form-data payload. It includes:
file: The PDF file to redact.
redactions: The JSON-encoded redaction options.
output: The name for the output file.
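Because the redactions field travels as a JSON string, a malformed entry would otherwise only surface when the server rejects the request. The sketch below shows one way to check the list locally before encoding it. The encode_redactions helper is hypothetical and not part of pdfRest or requests_toolbelt, and the set of allowed types is an assumption limited to the three types shown in this tutorial; the API may accept others.

```python
import json
import re

# Assumption: only the three types shown in this tutorial are checked here.
KNOWN_TYPES = {"preset", "regex", "literal"}


def encode_redactions(options):
    """Validate a list of redaction dicts locally and return it as a JSON string."""
    for index, option in enumerate(options):
        if option.get("type") not in KNOWN_TYPES:
            raise ValueError(f"option {index}: unexpected type {option.get('type')!r}")
        value = option.get("value")
        if not isinstance(value, str) or not value:
            raise ValueError(f"option {index}: 'value' must be a non-empty string")
        if option["type"] == "regex":
            # Catch syntax errors early; Python's regex dialect may differ from the service's.
            try:
                re.compile(value)
            except re.error as exc:
                raise ValueError(f"option {index}: invalid regex: {exc}") from exc
    return json.dumps(options)


redaction_options = [
    {"type": "preset", "value": "email"},
    {"type": "literal", "value": "word"},
]
print(encode_redactions(redaction_options))
```

The returned string could then be passed as the 'redactions' value in the MultipartEncoder fields in place of the direct json.dumps call.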
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_redactedtextPDF.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}
The headers dictionary sets the request headers, including the API key for authentication.
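Hardcoding the API key is fine for a quick test, but in a shared codebase you may prefer to keep it out of the source file. One possible approach, sketched below, reads the key from an environment variable. The variable name PDFREST_API_KEY and the build_headers helper are assumptions made for this example, not names defined by pdfRest.

```python
import os


def build_headers(content_type, environ=os.environ):
    """Build the request headers, reading the API key from the environment."""
    # PDFREST_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = environ.get("PDFREST_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Set the PDFREST_API_KEY environment variable first")
    return {
        "Accept": "application/json",
        "Content-Type": content_type,
        "Api-Key": api_key,
    }


# Example with a stand-in environment and a placeholder key.
demo_env = {"PDFREST_API_KEY": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"}
headers = build_headers("multipart/form-data; boundary=example", demo_env)
print(sorted(headers))
```

In the tutorial's script, the content_type argument would be mp_encoder_redactedtextPDF.content_type, and the second argument can be omitted so the real environment is used.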
response = requests.post(pdf_with_redacted_text_endpoint_url, data=mp_encoder_redactedtextPDF, headers=headers)
This line sends a POST request to the API endpoint with the prepared data and headers.
if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)
The response is checked for success. If successful, the JSON response is printed; otherwise, the error message is displayed.
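The same branching can be pulled into a small function that also copes with a successful status whose body is not valid JSON. This is a sketch under assumptions: summarize_response is a hypothetical helper, it takes the status code and raw body text (response.status_code and response.text in the script above), and the sample bodies are made up rather than real pdfRest responses.

```python
import json


def summarize_response(status_code, body_text):
    """Return a printable summary for a status code and raw response body."""
    # Mirrors response.ok in requests, which is true for status codes below 400.
    if status_code >= 400:
        return f"Request failed ({status_code}): {body_text}"
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return f"Status {status_code}, but the body was not JSON: {body_text}"
    return json.dumps(payload, indent=2)


# Made-up bodies for illustration only.
print(summarize_response(200, '{"example_field": "example_value"}'))
print(summarize_response(401, "unauthorized"))
```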
Beyond the Tutorial
In this tutorial, you learned how to use Python to send an API call to the pdfRest Redact PDF endpoint, allowing you to automate the redaction of sensitive information from PDF documents. This is just one of the many tools available through pdfRest. To explore further, you can demo all of the pdfRest API Tools in the API Lab and refer to the API Reference Guide for more detailed information.
Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub.