How to Extract PDF Text with PHP, Tutorial

Why Extract PDF Text with PHP?

The pdfRest Extract Text API Tool extracts text from PDF documents programmatically. This tutorial shows how to send an API call to the Extract Text endpoint using PHP.

This is useful when you need text from a large number of PDFs quickly, for example to index documents for search, mine data, or automate content extraction for further processing.

Extract PDF Text PHP Code Example

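<?php
require 'vendor/autoload.php'; // Load Guzzle through Composer's autoloader

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

$client = new Client();

// The API key authenticates the request
$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'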
];

$options = [
  'multipart' => [
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file', 'r'),
      'filename' => '/path/to/file',
      'headers' => [
        'Content-Type' => 'application/pdf'
      ]
    ],
    [
      'name' => 'word_style',
      'contents' => 'on'
    ]
  ]
];

$request = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);

$res = $client->sendAsync($request, $options)->wait();

echo $res->getBody();

Source code reference: GitHub - datalogics/pdf-rest-api-samples

Breaking Down the Code

The code begins by loading the necessary classes using the Composer autoload feature:

require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

Next, an instance of the Guzzle HTTP client is created, and the API key is set in the headers for authentication:

$client = new Client();
$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
];
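
Hard-coding the key works for a quick test, but a script that is shared or committed would usually read it from the environment instead. A minimal sketch, assuming a hypothetical environment variable named PDFREST_API_KEY (the name is an illustration, not something pdfRest defines) and a hypothetical helper called apiKeyHeaders:

```php
<?php
// Hypothetical helper: build the auth headers from an environment variable.
// PDFREST_API_KEY is an assumed name; any variable name would work.
function apiKeyHeaders(string $envVar = 'PDFREST_API_KEY'): array
{
    $key = getenv($envVar);
    // getenv() returns false when the variable is not set
    if ($key === false || $key === '') {
        throw new RuntimeException("Environment variable {$envVar} is not set");
    }
    return ['Api-Key' => $key];
}
```

With this in place, `$headers = apiKeyHeaders();` can stand in for the hard-coded array.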

The options array is prepared with multipart form data. This includes the file to be processed, the content type of the file, and the 'word_style' parameter, which specifies how the text should be extracted:

$options = [
  'multipart' => [
    // File details
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file', 'r'),
      'filename' => '/path/to/file',
      'headers' => [
        'Content-Type' => 'application/pdf'
      ]
    ],
    // Word style option
    [
      'name' => 'word_style',
      'contents' => 'on'
    ]
  ]
];
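
The same array can be produced by a small function so that the file path and the word_style value are passed in rather than hard-coded. This is only a sketch: buildExtractTextOptions is a hypothetical name, the input is assumed to be a PDF (hence application/pdf), the 'off' value for word_style is an assumption not shown in this tutorial, and PHP's built-in fopen stands in for Utils::tryFopen so the snippet has no dependencies.

```php
<?php
// Hypothetical helper: build the Guzzle multipart options for one PDF.
function buildExtractTextOptions(string $path, bool $wordStyle = true): array
{
    // Fail early with a clear message instead of a warning from fopen()
    if (!is_readable($path)) {
        throw new InvalidArgumentException("Cannot read file: {$path}");
    }

    return [
        'multipart' => [
            // File details
            [
                'name' => 'file',
                'contents' => fopen($path, 'r'),
                'filename' => basename($path),
                'headers' => ['Content-Type' => 'application/pdf'],
            ],
            // Word style option; 'off' is an assumed counterpart to 'on'
            [
                'name' => 'word_style',
                'contents' => $wordStyle ? 'on' : 'off',
            ],
        ],
    ];
}
```

A call such as `$options = buildExtractTextOptions('/path/to/file');` would then yield an array with the same shape as the one above.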

A new POST request is created with the API endpoint and headers, then sent asynchronously. Calling wait() blocks until the response arrives, and the response body, which contains the extracted text, is printed:

$request = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);
$res = $client->sendAsync($request, $options)->wait();
echo $res->getBody();
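
The snippet above assumes the request succeeds. With Guzzle's default settings, an HTTP error response or a connection failure surfaces as an exception when wait() is called, so a script that should fail gracefully may want to catch it. A sketch of one way to do that, where sendOrReport is a hypothetical wrapper and Guzzle is assumed to be installed through Composer as in the rest of this tutorial:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Psr7\Request;

// Hypothetical wrapper: send the request and return the body, or null on failure.
function sendOrReport(Client $client, Request $request, array $options): ?string
{
    try {
        $res = $client->sendAsync($request, $options)->wait();
        return (string) $res->getBody();
    } catch (GuzzleException $e) {
        // Only a RequestException can carry a response from the server
        $status = ($e instanceof RequestException && $e->hasResponse())
            ? $e->getResponse()->getStatusCode()
            : 'no response';
        error_log("Extract Text request failed ({$status}): {$e->getMessage()}");
        return null;
    }
}
```

It would be called with the objects built earlier, for example `$body = sendOrReport($client, $request, $options);`, printing `$body` only when it is not null.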

Beyond the Tutorial

This tutorial walked through calling the pdfRest Extract Text API from PHP. A single multipart POST request extracts the text from a PDF document, ready for use in other applications.

To explore and demo all of the pdfRest API Tools, visit the API Lab. For more detailed information on the API, refer to the API Reference documentation.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub - datalogics/pdf-rest-api-samples.