How to Use OCR to Extract Text from PDF Images with PHP

Why Use OCR to Extract Text from PDF with PHP?

The pdfRest OCR PDF API Tool converts scanned documents into PDFs with searchable, extractable text using Optical Character Recognition (OCR). This tutorial shows how to send API calls to OCR PDF and Extract Text with PHP, so that you can add OCR functionality to your PHP applications.

Imagine you have a large archive of scanned documents, such as invoices or historical records, that you need to search through or analyze. By using OCR, you can extract all text from these image-based documents, making it easier to find specific information. This can save significant time and effort compared to manually reviewing each document.

PDF OCR Text Extraction with PHP Code Example


<?php
require 'vendor/autoload.php'; // Composer autoloader (loads Guzzle)

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

/* In this sample, we will show how to convert a scanned document into a PDF with
 * searchable and extractable text using Optical Character Recognition (OCR), and then
 * extract that text from the newly created document.
 *
 * First, we will upload a scanned PDF to the /pdf-with-ocr-text route and capture the
 * output ID. Then, we will send the output ID to the /extracted-text route, which will
 * return the newly added text.
 */

$client = new Client();

$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' // Replace with your API key
];

// Upload PDF for OCR
$pdfToOCROptions = [
  'multipart' => [
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file.pdf', 'r'),
      'filename' => 'file.pdf',
      'headers' => [
        'Content-Type' => 'application/pdf'
      ]
    ],
    [
      'name' => 'output',
      'contents' => 'example_pdf-with-ocr-text_out'
    ]
  ]
];

$pdfToOCRRequest = new Request('POST', 'https://api.pdfrest.com/pdf-with-ocr-text', $headers);

echo "Sending POST request to OCR endpoint...\n";
$pdfToOCRResponse = $client->sendAsync($pdfToOCRRequest, $pdfToOCROptions)->wait();

echo "Response status code: " . $pdfToOCRResponse->getStatusCode() . "\n";

$ocrPDFID = json_decode($pdfToOCRResponse->getBody())->outputId;
echo "Got the output ID: " . $ocrPDFID . "\n";

// Extract text from OCR'd PDF
$extractTextOptions = [
  'multipart' => [
    [
      'name' => 'id',
      'contents' => $ocrPDFID
    ]
  ]
];

$extractTextRequest = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);

echo "Sending POST request to extract text endpoint...\n";
$extractTextResponse = $client->sendAsync($extractTextRequest, $extractTextOptions)->wait();

echo "Response status code: " . $extractTextResponse->getStatusCode() . "\n";

$fullText = json_decode($extractTextResponse->getBody())->fullText;
echo $fullText . "\n";

Source: GitHub Repository

Breaking Down the Code

The provided PHP code uses the Guzzle HTTP client to interact with the pdfRest API. Here's a detailed breakdown of the code:

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

This section includes the necessary dependencies and namespaces. It uses Composer's autoload feature to load the Guzzle HTTP client and related classes.

$client = new Client();

$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' // Replace with your API key
];

Here, a new Guzzle HTTP client is instantiated, and the headers for the API request are defined. Replace 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with your actual API key.
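Hard-coding the key is fine for a quick test, but you may prefer to keep it out of source control. The following is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The variable name PDFREST_API_KEY and the helper buildHeaders() are hypothetical names chosen for this example; they are not defined by pdfRest or Guzzle.

```php
<?php
// Hypothetical helper: build the request headers from an environment variable
// instead of a hard-coded string. PDFREST_API_KEY is an assumed name.
function buildHeaders(string $envVar = 'PDFREST_API_KEY'): array
{
    $apiKey = getenv($envVar);

    // getenv() returns false when the variable is not set.
    if ($apiKey === false || $apiKey === '') {
        throw new RuntimeException("Environment variable {$envVar} is not set.");
    }

    return ['Api-Key' => $apiKey];
}

// Usage in the tutorial script would then be:
//   $headers = buildHeaders();
```

With this in place, the key can be supplied when the script is launched rather than edited into the file.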

$pdfToOCROptions = [
  'multipart' => [
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file.pdf', 'r'),
      'filename' => 'file.pdf',
      'headers' => [
        'Content-Type' => 'application/pdf'
      ]
    ],
    [
      'name' => 'output',
      'contents' => 'example_pdf-with-ocr-text_out'
    ]
  ]
];

This section sets up the multipart form data for the OCR request. It includes the PDF file to be uploaded and a name for the output. Make sure to replace '/path/to/file.pdf' with the actual path to your PDF file.
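If the path is wrong, the failure only surfaces when the file is opened. As a rough sketch, the same options array could be built by a small function that checks the file first. The function name buildOcrMultipart() is hypothetical, and the example uses PHP's built-in fopen() rather than Utils::tryFopen() so that it runs without Guzzle loaded; Guzzle's multipart option accepts a plain fopen() resource as 'contents'.

```php
<?php
// Hypothetical helper: build the multipart options for the OCR request,
// failing early with a clear message if the PDF cannot be read.
function buildOcrMultipart(string $pdfPath, string $outputName): array
{
    if (!is_file($pdfPath) || !is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read PDF at {$pdfPath}");
    }

    $handle = fopen($pdfPath, 'r');
    if ($handle === false) {
        throw new InvalidArgumentException("Cannot open PDF at {$pdfPath}");
    }

    return [
        'multipart' => [
            [
                'name' => 'file',
                'contents' => $handle,
                'filename' => basename($pdfPath),
                'headers' => ['Content-Type' => 'application/pdf'],
            ],
            [
                'name' => 'output',
                'contents' => $outputName,
            ],
        ],
    ];
}

// Usage in the tutorial script would then be:
//   $pdfToOCROptions = buildOcrMultipart('/path/to/file.pdf', 'example_pdf-with-ocr-text_out');
```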

$pdfToOCRRequest = new Request('POST', 'https://api.pdfrest.com/pdf-with-ocr-text', $headers);

A new POST request is created for the OCR endpoint using the specified headers.

$pdfToOCRResponse = $client->sendAsync($pdfToOCRRequest, $pdfToOCROptions)->wait();

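The request is sent with sendAsync(), which returns a promise; calling wait() on that promise blocks until the response arrives, so the script continues only once the OCR result is available. In the full listing, the status code of the response is then printed.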

$ocrPDFID = json_decode($pdfToOCRResponse->getBody())->outputId;

The output ID from the OCR response is extracted and printed. This ID will be used in the next request to extract text.
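The one-line json_decode call assumes the body is valid JSON and contains outputId. A slightly more defensive version might look like the sketch below. The function name extractOutputId() is hypothetical, and how pdfRest shapes an unsuccessful response is not covered in this tutorial, so the sketch only checks that the field used above is present.

```php
<?php
// Hypothetical helper: pull the outputId out of a JSON response body,
// raising a descriptive error instead of a null-property notice.
function extractOutputId(string $jsonBody): string
{
    $data = json_decode($jsonBody, true);

    // json_decode() returns null for invalid JSON; scalars are also rejected here.
    if (!is_array($data)) {
        throw new UnexpectedValueException('Response body is not a JSON object.');
    }

    if (!isset($data['outputId']) || !is_string($data['outputId'])) {
        throw new UnexpectedValueException("Response does not contain a string 'outputId'.");
    }

    return $data['outputId'];
}

// Usage in the tutorial script would then be:
//   $ocrPDFID = extractOutputId((string) $pdfToOCRResponse->getBody());
```

Casting the body to a string before decoding makes the conversion from Guzzle's stream object explicit.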

$extractTextOptions = [
  'multipart' => [
    [
      'name' => 'id',
      'contents' => $ocrPDFID
    ]
  ]
];

$extractTextRequest = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);

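This section sets up the multipart form data for the text extraction request and creates the POST request for the extracted-text endpoint. The form data contains the output ID obtained from the previous OCR request, so the file does not need to be uploaded again.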

$extractTextResponse = $client->sendAsync($extractTextRequest, $extractTextOptions)->wait();

As with the OCR request, sendAsync() returns a promise and wait() blocks until the text extraction response arrives. In the full listing, the status code of the response is then printed.

$fullText = json_decode($extractTextResponse->getBody())->fullText;
echo $fullText . "\n";

The full text extracted from the OCR'd PDF is obtained from the response and printed.
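Once the text is in a PHP string, it can be searched like any other string. As an illustration of the archive-search use case described earlier, here is a minimal sketch that returns the lines containing a keyword. The function name findMatchingLines() and the sample text are made up for this example; real OCR output depends on the scanned document.

```php
<?php
// Hypothetical helper: return the lines of the extracted text that contain
// a keyword (case-insensitive), keyed by 1-based line number.
function findMatchingLines(string $fullText, string $keyword): array
{
    $matches = [];
    if ($keyword === '') {
        return $matches;
    }

    // \R matches any line-break sequence (\n, \r\n, \r).
    foreach (preg_split('/\R/', $fullText) as $index => $line) {
        if (stripos($line, $keyword) !== false) {
            $matches[$index + 1] = trim($line);
        }
    }

    return $matches;
}

// Usage in the tutorial script would then be, for example:
//   print_r(findMatchingLines($fullText, 'invoice'));
```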

Beyond the Tutorial

In this tutorial, you learned how to use the pdfRest OCR PDF and Extract Text API Tools with PHP to convert a scanned document into a searchable PDF and extract its text. The same two-step process can be applied when digitizing and processing large volumes of scanned documents.

To explore more functionalities offered by pdfRest, you can demo all the API Tools in the API Lab. For detailed information on all available endpoints and parameters, refer to the API Reference Guide.
