How to Use OCR to Extract Text from PDF Images with Python
Why Use OCR to Extract Text from PDF with Python?
The pdfRest OCR PDF API Tool uses Optical Character Recognition (OCR) to convert scanned documents into PDFs with searchable, extractable text. This tutorial shows how to extract text with OCR using Python, so you can automate extracting both machine-readable and image-based text from a PDF.
Imagine you have a large number of scanned documents, such as invoices or historical records, and you need the text they contain. With OCR, you can convert the scanned images into searchable text and then extract it right away, which streamlines your workflow and data management.
PDF OCR Text Extraction with Python Code Example
from requests_toolbelt import MultipartEncoder
import requests

api_key = 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here

ocr_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'
mp_encoder_pdf = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file.pdf', 'rb'), 'application/pdf'),
        'output': 'example_pdf-with-ocr-text_out',
    }
)

image_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf.content_type,
    'Api-Key': api_key
}

print("Sending POST request to OCR endpoint...")
response = requests.post(ocr_endpoint_url, data=mp_encoder_pdf, headers=image_headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    ocr_pdf_id = response_json["outputId"]
    print("Got the output ID: " + ocr_pdf_id)

    extract_endpoint_url = 'https://api.pdfrest.com/extracted-text'
    mp_encoder_extract_text = MultipartEncoder(
        fields={
            'id': ocr_pdf_id
        }
    )

    extract_text_headers = {
        'Accept': 'application/json',
        'Content-Type': mp_encoder_extract_text.content_type,
        'Api-Key': api_key
    }

    print("Sending POST request to extract text endpoint...")
    extract_response = requests.post(extract_endpoint_url, data=mp_encoder_extract_text, headers=extract_text_headers)

    print("Response status code: " + str(extract_response.status_code))

    if extract_response.ok:
        extract_json = extract_response.json()
        print(extract_json["fullText"])
    else:
        print(extract_response.text)
else:
    print(response.text)
Source: GitHub Repository
Breaking Down the Code
The provided code demonstrates how to use the pdfRest OCR PDF API Tool to convert a scanned document into a PDF with searchable text and then extract that text. Here is a detailed breakdown of how the code works:
from requests_toolbelt import MultipartEncoder
import requests
These lines import the necessary libraries. requests_toolbelt is used to handle multipart form data, and requests is used to make HTTP requests.
api_key = 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
Replace the placeholder with your actual API key from pdfRest.
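Hardcoding the key is fine for a quick test, but a script that ends up in version control is usually better off reading it from the environment. A minimal sketch, assuming an environment variable named PDFREST_API_KEY (the variable name and the load_api_key helper are illustrative choices, not something pdfRest prescribes):

```python
import os


def load_api_key(env_var="PDFREST_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical convention for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your pdfRest API key"
        )
    return key
```

With this in place, the assignment in the tutorial code could become api_key = load_api_key().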
ocr_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'
mp_encoder_pdf = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file.pdf', 'rb'), 'application/pdf'),
        'output': 'example_pdf-with-ocr-text_out',
    }
)
This sets the OCR endpoint URL and prepares the multipart form data. The fields dictionary includes the PDF file to upload and an output value that names the resulting file.
image_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf.content_type,
    'Api-Key': api_key
}
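These headers ask for a JSON response (Accept), declare the multipart content type, including the boundary string generated by the encoder (Content-Type), and pass your API key (Api-Key).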
print("Sending POST request to OCR endpoint...")
response = requests.post(ocr_endpoint_url, data=mp_encoder_pdf, headers=image_headers)
This sends a POST request to the OCR endpoint with the prepared data and headers.
if response.ok:
    response_json = response.json()
    ocr_pdf_id = response_json["outputId"]
    print("Got the output ID: " + ocr_pdf_id)
If the request succeeds, the code reads the outputId of the OCR-processed PDF from the response JSON. The next request uses this ID to refer to that file.
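Note that response.ok only reports that the HTTP status code was below 400. If you would rather fail with a clear message than a bare KeyError when the body lacks the expected field, a small guard can help. This is a sketch: get_output_id is a hypothetical helper, and the only response field it assumes is the outputId key used in the tutorial code.

```python
def get_output_id(response_json):
    """Return the "outputId" value, or raise a descriptive error if it is absent."""
    output_id = response_json.get("outputId")
    if not output_id:
        # Include the whole body so the cause is visible in the traceback.
        raise ValueError(f"No outputId in response: {response_json!r}")
    return output_id
```

In the tutorial code this would replace the direct lookup, for example ocr_pdf_id = get_output_id(response.json()).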
extract_endpoint_url = 'https://api.pdfrest.com/extracted-text'
mp_encoder_extract_text = MultipartEncoder(
    fields={
        'id': ocr_pdf_id
    }
)
This sets the extraction endpoint URL and prepares the multipart form data with the output ID.
extract_text_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extract_text.content_type,
    'Api-Key': api_key
}
These headers specify the content type, accept type, and API key for the extraction request.
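The two header dictionaries in this tutorial differ only in their Content-Type value, so if you adapt the script you may want to build them in one place. A minimal sketch, where build_headers is a hypothetical helper name and the header names are the ones used in the tutorial code:

```python
def build_headers(api_key, content_type):
    """Build the request headers shared by both pdfRest calls in this tutorial."""
    return {
        'Accept': 'application/json',
        'Content-Type': content_type,  # e.g. the encoder's content_type value
        'Api-Key': api_key,
    }
```

For example, image_headers could be created with build_headers(api_key, mp_encoder_pdf.content_type), and extract_text_headers with build_headers(api_key, mp_encoder_extract_text.content_type).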
print("Sending POST request to extract text endpoint...")
extract_response = requests.post(extract_endpoint_url, data=mp_encoder_extract_text, headers=extract_text_headers)
This sends a POST request to the extraction endpoint with the prepared data and headers.
if extract_response.ok:
    extract_json = extract_response.json()
    print(extract_json["fullText"])
If the extraction request succeeds, the code prints the fullText field of the response JSON. If either request fails, the script prints the response body instead.
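Printing is handy for a demo, but when processing many documents you will probably want to keep the text. The sketch below writes it to a file instead. It assumes fullText is a plain string, as the print call in the tutorial code suggests; save_full_text and the output path are hypothetical names for illustration.

```python
from pathlib import Path


def save_full_text(extract_json, out_path):
    """Write the "fullText" field to a UTF-8 text file and return its path."""
    text = extract_json["fullText"]  # assumed to be a plain string
    out_file = Path(out_path)
    out_file.write_text(text, encoding="utf-8")
    return out_file
```

In the tutorial code, the print(extract_json["fullText"]) call could then become save_full_text(extract_json, "extracted.txt").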
Beyond the Tutorial
In this tutorial, you learned how to use the pdfRest OCR PDF API Tool to convert a scanned document into a searchable PDF and extract the text using Python. This process can significantly enhance your document management and data extraction workflows.
To explore more functionalities, you can demo all of the pdfRest API Tools in the API Lab. For detailed information on each endpoint and parameter, refer to the API Reference Guide.