How to Use OCR to Make PDF Image Text Searchable with PHP

Why Use OCR to Make a Searchable PDF with PHP?

The pdfRest OCR PDF API Tool lets developers integrate Optical Character Recognition (OCR) into their applications. This tutorial demonstrates how to send an API call to OCR a PDF using PHP, so that the text in scanned pages or images within a PDF file becomes searchable.

Imagine you have a large collection of scanned documents, such as invoices or historical records, and you need to make the text within these documents searchable and extractable. By using the OCR PDF API, you can automate the process of converting these scanned images into text, saving time and effort while improving data accessibility and usability.

OCR PDF with PHP Code Example

<?php
require 'vendor/autoload.php'; // Require the Composer autoload file to load the Guzzle HTTP client.

use GuzzleHttp\Client; // Import the Guzzle HTTP client namespace.
use GuzzleHttp\Psr7\Request; // Import the PSR-7 Request class.
use GuzzleHttp\Psr7\Utils; // Import the PSR-7 Utils class for working with streams.

$client = new Client(); // Create a new instance of the Guzzle HTTP client.

$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' // Set the API key in the headers for authentication.
];

$options = [
  'multipart' => [
    [
      'name' => 'file', // Specify the field name for the file.
      'contents' => Utils::tryFopen('/path/to/file', 'r'), // Open the file specified by the '/path/to/file' for reading.
      'filename' => '/path/to/file', // Set the filename for the file to be processed, in this case, '/path/to/file'.
      'headers' => [
        'Content-Type' => 'application/pdf' // Set the Content-Type header for the file (assumes the input is a PDF).
      ]
    ],
    [
      'name' => 'output', // Specify the field name for the output option.
      'contents' => 'pdfrest_pdf-with-ocr-text' // Set the value for the output option (in this case, 'pdfrest_pdf-with-ocr-text').
    ]
  ]
];

$request = new Request('POST', 'https://api.pdfrest.com/pdf-with-ocr-text', $headers); // Create a new HTTP POST request with the API endpoint and headers.

$res = $client->sendAsync($request, $options)->wait(); // Send the asynchronous request and wait for the response.

echo $res->getBody(); // Output the response body, which contains the document with text from OCR added.

Source: GitHub

Breaking Down the Code

The provided code demonstrates how to use the Guzzle HTTP client to send a multipart API request to the pdfRest OCR PDF endpoint. Here’s a detailed breakdown:

require 'vendor/autoload.php'; // Require the autoload file to load Guzzle HTTP client.

This line includes the autoload file generated by Composer, which loads all necessary dependencies, including Guzzle.

use GuzzleHttp\Client; // Import the Guzzle HTTP client namespace.
use GuzzleHttp\Psr7\Request; // Import the PSR-7 Request class.
use GuzzleHttp\Psr7\Utils; // Import the PSR-7 Utils class for working with streams.

These lines import the necessary classes from the Guzzle HTTP client library and PSR-7, which are used to create and send HTTP requests.

$client = new Client(); // Create a new instance of the Guzzle HTTP client.

This line creates a new instance of the Guzzle HTTP client, which will be used to send the API request.

$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' // Set the API key in the headers for authentication.
];

Here, the API key is set in the headers array for authentication purposes. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with your actual API key.
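
Hardcoding the key is fine for a quick test, but a common alternative is to read it from the environment so it stays out of source control. The following is a minimal sketch under that assumption; the variable name PDFREST_API_KEY and the helper buildHeaders() are hypothetical and do not come from the pdfRest samples.

```php
<?php
// Hypothetical helper: builds the request headers from an environment variable.
// PDFREST_API_KEY is an assumed variable name, not one defined by pdfRest.
function buildHeaders(string $envVar = 'PDFREST_API_KEY'): array
{
    $apiKey = getenv($envVar); // Returns false when the variable is not set.
    if ($apiKey === false || trim($apiKey) === '') {
        throw new RuntimeException("Environment variable {$envVar} is not set.");
    }
    return ['Api-Key' => trim($apiKey)];
}
```

With this in place, $headers = buildHeaders(); could stand in for the hardcoded array, and a missing key fails early with a clear message instead of an authentication error from the API.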

$options = [
  'multipart' => [
    [
      'name' => 'file', // Specify the field name for the file.
      'contents' => Utils::tryFopen('/path/to/file', 'r'), // Open the file specified by the '/path/to/file' for reading.
      'filename' => '/path/to/file', // Set the filename for the file to be processed, in this case, '/path/to/file'.
      'headers' => [
        'Content-Type' => 'application/pdf' // Set the Content-Type header for the file (assumes the input is a PDF).
      ]
    ],
    [
      'name' => 'output', // Specify the field name for the output option.
      'contents' => 'pdfrest_pdf-with-ocr-text' // Set the value for the output option (in this case, 'pdfrest_pdf-with-ocr-text').
    ]
  ]
];

The $options array is configured with multipart form data. The first part carries the file to be processed and specifies its field name, contents, filename, and content type. The second part sets the output option, which in this example has the value pdfrest_pdf-with-ocr-text.
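
If the same request is built for many files, the multipart structure described above can be wrapped in a small function that also checks the input path before any HTTP call is made. This is only a sketch: the function name buildOcrOptions(), the use of basename() for the filename, and the application/pdf content type are assumptions made for illustration, not part of the pdfRest sample.

```php
<?php
// Hypothetical helper: builds the 'multipart' options for one input PDF.
// Name, defaults, and validation are illustrative assumptions.
function buildOcrOptions(string $path, string $output = 'pdfrest_pdf-with-ocr-text'): array
{
    // Fail early if the file is missing or unreadable.
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("Cannot read input file: {$path}");
    }

    return [
        'multipart' => [
            [
                'name'     => 'file',               // Form field name for the upload.
                'contents' => fopen($path, 'r'),    // Guzzle accepts a stream resource here.
                'filename' => basename($path),      // Send only the file name, not the full path.
                'headers'  => ['Content-Type' => 'application/pdf'], // Assumes a PDF input.
            ],
            [
                'name'     => 'output',             // Form field name for the output option.
                'contents' => $output,
            ],
        ],
    ];
}
```

A call such as $options = buildOcrOptions('/path/to/file'); would then produce an array with the same two parts as the one shown in the tutorial.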

$request = new Request('POST', 'https://api.pdfrest.com/pdf-with-ocr-text', $headers); // Create a new HTTP POST request with the API endpoint and headers.

This line creates a new HTTP POST request to the pdfRest OCR PDF endpoint, including the headers for authentication.

$res = $client->sendAsync($request, $options)->wait(); // Send the asynchronous request and wait for the response.

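The sendAsync method returns a promise rather than a response. Calling wait() on that promise blocks until the request completes and then returns the response object, so the script behaves synchronously here. If the request fails, wait() throws the corresponding exception.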

echo $res->getBody(); // Output the response body, which contains the document with text from OCR added.

Finally, echo prints the response body, which contains the OCR-processed document.
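
Before relying on the body, it can help to look at the status code and at whether the body parses as JSON. The sketch below assumes nothing about the exact response shape, which is defined in the API Reference Guide; describeBody() is a hypothetical helper written for this illustration.

```php
<?php
// Hypothetical helper: summarizes a response from its status code and raw body.
// It makes no assumption about which fields the API returns.
function describeBody(int $status, string $body): string
{
    // Treat anything outside the 2xx range as a failure.
    if ($status < 200 || $status >= 300) {
        return "Request failed with HTTP {$status}";
    }

    // If the body is a JSON object or array, list its top-level keys.
    $decoded = json_decode($body, true);
    if (is_array($decoded)) {
        return 'JSON response with keys: ' . implode(', ', array_keys($decoded));
    }

    // Otherwise report only the size of the raw body.
    return 'Non-JSON response of ' . strlen($body) . ' bytes';
}
```

In the tutorial script this could be called as echo describeBody($res->getStatusCode(), (string) $res->getBody());. Note that Guzzle throws an exception for 4xx and 5xx responses by default, so the failure branch is mainly relevant when the http_errors request option is set to false.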

Beyond the Tutorial

In this tutorial, you learned how to send a multipart API request to the pdfRest OCR PDF endpoint using PHP. This process allows you to convert scanned documents into searchable and editable text efficiently.

To further explore the capabilities of pdfRest API tools, visit the API Lab and refer to the API Reference Guide for detailed documentation.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub.