How to Use OCR to Extract Text from PDF Images with cURL
Why Use OCR to Extract Text from PDF with cURL?
The pdfRest OCR PDF API Tool is a powerful resource that allows you to convert scanned documents into PDFs with searchable and extractable text using Optical Character Recognition (OCR). This tutorial will guide you through the process of sending API calls to OCR PDF and Extract Text with cURL, a command-line tool for transferring data using various network protocols.
Imagine you have a stack of scanned documents that you need to analyze in aggregate. For instance, a law firm might have hundreds of pages of legal documents that need to be assessed. Using the pdfRest OCR PDF and Extract Text API Tools, you can automate this process, extracting all text for further processing or analysis.
PDF OCR Text Extraction with cURL Code Example
#!/bin/sh

# In this sample, we will show how to convert a scanned document into a PDF with
# searchable and extractable text using Optical Character Recognition (OCR), and then
# extract that text from the newly created document.
#
# First, we will upload a scanned PDF to the /pdf-with-ocr-text route and capture the
# output ID. Then, we will send the output ID to the /extracted-text route, which will
# return the newly added text.

API_KEY="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # Replace with your API key

# Upload PDF for OCR
OCR_PDF_ID=$(curl -s -X POST "https://api.pdfrest.com/pdf-with-ocr-text" \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Api-Key: $API_KEY" \
  -F "file=@/path/to/file.pdf" \
  -F "output=example_pdf-with-ocr-text_out" \
  | jq -r '.outputId')

# Extract text from OCR'd PDF
EXTRACT_TEXT_RESPONSE=$(curl -s -X POST "https://api.pdfrest.com/extracted-text" \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Api-Key: $API_KEY" \
  -F "id=$OCR_PDF_ID")

# printf passes the JSON through unchanged; echo in some sh implementations
# expands backslash sequences such as \n and would corrupt the response.
FULL_TEXT=$(printf '%s' "$EXTRACT_TEXT_RESPONSE" | jq -r '.fullText')

echo "Extracted text: $FULL_TEXT"
Source: GitHub
Breaking Down the Code
The script begins by defining the API key:
API_KEY="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # Replace with your API key
Replace the placeholder with your actual API key from pdfRest.
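Hardcoding the key in the script makes it easy to commit by accident. A minimal sketch of one alternative, assuming the key is exported in an environment variable; the name PDFREST_API_KEY and the helper load_api_key are hypothetical choices for this example, not something pdfRest defines:

```shell
#!/bin/sh
# Sketch: read the API key from the environment instead of hardcoding it.
# PDFREST_API_KEY is a hypothetical variable name chosen for this example.
load_api_key() {
  if [ -z "${PDFREST_API_KEY:-}" ]; then
    echo "PDFREST_API_KEY is not set" >&2
    return 1
  fi
  API_KEY="$PDFREST_API_KEY"
}
```

Calling `load_api_key || exit 1` near the top of the script would stop early with a clear message instead of sending a request without a key.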
Next, we upload the PDF for OCR processing:
OCR_PDF_ID=$(curl -s -X POST "https://api.pdfrest.com/pdf-with-ocr-text" \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Api-Key: $API_KEY" \
  -F "file=@/path/to/file.pdf" \
  -F "output=example_pdf-with-ocr-text_out" \
  | jq -r '.outputId')
This cURL command sends a POST request to the https://api.pdfrest.com/pdf-with-ocr-text endpoint. The headers specify that the request accepts JSON responses and that the content type is multipart/form-data. The -F flag is used twice: once to specify the file to upload and once to set the desired output name. The jq command extracts the outputId from the JSON response.
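If the upload fails, the response will not contain an outputId, and jq -r prints the literal string "null" for a missing field. A small guard, sketched here with a hypothetical helper named is_valid_output_id, can catch that before the second request is sent:

```shell
#!/bin/sh
# Sketch: reject an empty or missing output ID before calling the next endpoint.
# jq -r prints the literal string "null" when the field is absent from the JSON.
is_valid_output_id() {
  case "$1" in
    ""|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Possible use right after the upload step:
#   is_valid_output_id "$OCR_PDF_ID" || { echo "OCR upload failed" >&2; exit 1; }
```

This only checks that some ID came back; it makes no assumption about what a pdfRest ID looks like.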
Then, we extract the text from the OCR'd PDF:
EXTRACT_TEXT_RESPONSE=$(curl -s -X POST "https://api.pdfrest.com/extracted-text" \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Api-Key: $API_KEY" \
  -F "id=$OCR_PDF_ID")
This cURL command sends another POST request, this time to the https://api.pdfrest.com/extracted-text endpoint. The id parameter is set to the outputId obtained from the previous step.
Finally, we extract and print the full text:
FULL_TEXT=$(printf '%s' "$EXTRACT_TEXT_RESPONSE" | jq -r '.fullText')
echo "Extracted text: $FULL_TEXT"
The jq command extracts the fullText field from the JSON response, and the script prints the extracted text.
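How the response is piped into jq matters. Under some sh implementations, echo expands backslash sequences such as \n, which can turn a valid JSON string into an invalid one before jq sees it. The sketch below wraps the extraction in a hypothetical helper, full_text_from_response, that uses printf instead and prints nothing when fullText is absent:

```shell
#!/bin/sh
# Sketch: pull fullText out of a response without letting echo alter the JSON.
# printf '%s' writes the string exactly as stored in the variable.
full_text_from_response() {
  # "// empty" makes jq print nothing (rather than "null") if the field is missing
  printf '%s' "$1" | jq -r '.fullText // empty'
}

# Possible use: save the text to a file instead of printing it
#   full_text_from_response "$EXTRACT_TEXT_RESPONSE" > extracted.txt
```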
Beyond the Tutorial
In this tutorial, you learned how to use cURL to call the pdfRest OCR PDF API Tool and then the Extract Text API Tool to pull the text out of the OCR'd PDF. The same two-step flow applies to any image-based PDF whose text you need in machine-readable form.
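For a stack of scanned documents like the law firm example, the two requests can be wrapped in a function and run once per file. The loop below is a minimal sketch of that idea; for_each_pdf is a hypothetical helper, and it assumes you supply your own function (called ocr_and_extract in the usage comment) that takes a file path and performs the upload and extraction steps shown above:

```shell
#!/bin/sh
# Sketch: run a per-file command over every PDF in a directory.
# for_each_pdf DIR COMMAND calls COMMAND once per *.pdf file found in DIR.
for_each_pdf() {
  dir=$1
  handler=$2
  for pdf in "$dir"/*.pdf; do
    [ -f "$pdf" ] || continue   # skip the unexpanded glob when no PDFs exist
    "$handler" "$pdf" || echo "failed: $pdf" >&2
  done
}

# Possible use, with ocr_and_extract defined elsewhere in the script:
#   for_each_pdf ./scans ocr_and_extract
```

A failure on one file is reported on stderr and the loop moves on, so a single bad scan does not stop the whole batch.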
To explore further, you can demo all of the pdfRest API Tools in the API Lab. For detailed information on each endpoint, refer to the API Reference Guide.