How to Use OCR to Make PDF Image Text Searchable with Python

Learn how to use the pdfRest OCR PDF API Tool with Python to make image text in PDFs searchable

Why Use OCR to Make Searchable PDFs with Python?

The pdfRest OCR PDF API Tool uses Optical Character Recognition (OCR) to convert scanned documents and images into PDFs whose text can be searched and extracted. This tutorial shows how to send an API call to the OCR PDF endpoint using Python, so you can automate the conversion of image-based text into machine-readable text.

Imagine you work in a law firm that frequently receives scanned legal documents. By using OCR, you can convert these scanned documents into searchable PDFs, making it easier to find and reference specific information. This can save time and improve efficiency, especially when dealing with large volumes of documents.

OCR PDF with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

pdf_with_ocr_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'

# The /pdf-with-ocr-text endpoint can take a single PDF file or id as input.
# This sample demonstrates a request to add text to a document by using OCR on images of text.
mp_encoder_pdf_with_ocr_text = MultipartEncoder(
    fields={
        'file': ('file_name', open('/path/to/file', 'rb'), 'application/pdf'),
        'output' : 'example_pdf-with-ocr-text_out',
    }
)

# Let's set the headers that the pdf-with-ocr-text endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf_with_ocr_text.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to pdf-with-ocr-text endpoint...")
response = requests.post(pdf_with_ocr_text_endpoint_url, data=mp_encoder_pdf_with_ocr_text, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)

# If you would like to download the file instead of getting the JSON response, please see the 'get-resource-id-endpoint.py' sample.

Source: GitHub Repository

Breaking Down the Code

Let's break down the code to understand how it works:

from requests_toolbelt import MultipartEncoder
import requests
import json

This section imports the necessary libraries. requests_toolbelt provides MultipartEncoder, which builds the multipart form data; requests sends the HTTP request; and json is used to pretty-print the JSON response.

pdf_with_ocr_text_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'

This line sets the endpoint URL for the OCR PDF API.

mp_encoder_pdf_with_ocr_text = MultipartEncoder(
    fields={
        'file': ('file_name', open('/path/to/file', 'rb'), 'application/pdf'),
        'output' : 'example_pdf-with-ocr-text_out',
    }
)

Here, we create a MultipartEncoder object to handle the multipart form data. The fields dictionary includes:

  • file: The PDF file to be processed. Replace /path/to/file with the actual file path.
  • output: The name of the output file.
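
The sample opens the input file inline and never closes it. As a minimal sketch, assuming you would rather manage the handle with a with block and send the file's real name, a small helper can assemble the same fields dictionary. build_fields is a hypothetical name used only for illustration; it is not part of pdfRest or requests_toolbelt.

```python
from pathlib import Path


def build_fields(pdf_path, handle, output_name):
    """Assemble the multipart fields for the /pdf-with-ocr-text request.

    build_fields is a hypothetical helper for illustration; the field
    names 'file' and 'output' are the ones used in the sample.
    """
    return {
        # (file name, open binary handle, MIME type)
        'file': (Path(pdf_path).name, handle, 'application/pdf'),
        'output': output_name,
    }


# Usage sketch (commented out so the snippet runs without extra packages):
# from requests_toolbelt import MultipartEncoder
# with open('/path/to/file', 'rb') as handle:
#     encoder = MultipartEncoder(
#         fields=build_fields('/path/to/file', handle,
#                             'example_pdf-with-ocr-text_out'))
#     ...  # send the POST request here, while the file is still open
```

Because MultipartEncoder reads the file as the request is streamed, the POST call should happen inside the with block, before the handle is closed.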

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf_with_ocr_text.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

This section sets the headers for the API request. The Content-Type value is taken from the encoder's content_type attribute, which is multipart/form-data followed by the boundary string that separates the form parts. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with your actual API key.
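
Hard-coding the key is fine for a quick test, but you may prefer to keep it out of source control. The following is a minimal sketch that reads the key from an environment variable. Both the build_headers helper and the PDFREST_API_KEY variable name are assumptions made for this example; pdfRest does not define them.

```python
import os


def build_headers(content_type, api_key=None):
    """Build the request headers, falling back to the environment for the key.

    PDFREST_API_KEY is an assumed variable name chosen for this sketch.
    """
    key = api_key or os.environ.get('PDFREST_API_KEY')
    if not key:
        raise RuntimeError('Set PDFREST_API_KEY or pass api_key explicitly')
    return {
        'Accept': 'application/json',
        # Pass the encoder's content_type so the boundary is included.
        'Content-Type': content_type,
        'Api-Key': key,
    }
```

With the encoder from the sample, this would be called as build_headers(mp_encoder_pdf_with_ocr_text.content_type).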

response = requests.post(pdf_with_ocr_text_endpoint_url, data=mp_encoder_pdf_with_ocr_text, headers=headers)

This line sends a POST request to the OCR PDF endpoint with the specified data and headers.

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)

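Finally, this block checks whether the request was successful (response.ok is true for status codes below 400). If so, it pretty-prints the JSON response. Otherwise, it prints the raw response body, which normally contains the error message.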
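
An error response is not guaranteed to be JSON, so it can be useful to handle both cases in one place. The sketch below is one possible way to do that with only the standard library. describe_response is a hypothetical helper: its ok argument mirrors response.ok and body_text mirrors response.text.

```python
import json


def describe_response(ok, body_text):
    """Return a printable summary of an API response.

    Hypothetical helper: ok mirrors requests' response.ok and
    body_text mirrors response.text.
    """
    prefix = 'OK' if ok else 'ERROR'
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Body is not JSON (for example a plain-text or HTML error page),
        # so return it unchanged.
        return prefix + ': ' + body_text
    # Body is JSON: pretty-print it the same way the sample does.
    return prefix + ':\n' + json.dumps(payload, indent=2)
```

In the sample, this could replace the if/else block with print(describe_response(response.ok, response.text)).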

Beyond the Tutorial

In this tutorial, we demonstrated how to use Python to send an API call to the pdfRest OCR PDF endpoint. By following the steps outlined above, you can convert scanned documents into searchable PDFs using OCR.

We encourage you to explore all the pdfRest API Tools in the API Lab. For more detailed information, refer to the API Reference Guide.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found in the GitHub Repository.
