How to Extract PDF Text with PHP

How and Why to Extract Data from a PDF?

The pdfRest Extract Text API Tool extracts text from PDF documents programmatically. This tutorial shows how to send an API call to the Extract Text endpoint using PHP.

This is particularly useful when you need text from a large number of PDFs quickly, for example to index documents for search, mine data, or automate content extraction for further processing.

PHP Code Sample for Extract Text API

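<?php
require 'vendor/autoload.php'; // Composer autoloader (provides Guzzle)

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

$client = new Client();

// Authenticate with your pdfRest API key (placeholder shown)
$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
];

// Multipart form data: the input file plus the word_style option
$options = [
  'multipart' => [
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file', 'r'),
      'filename' => '/path/to/file',
      'headers' => [
        'Content-Type' => '<Content-type header>'
      ]
    ],
    [
      'name' => 'word_style',
      'contents' => 'on'
    ]
  ]
];

// Build the POST request for the Extract Text endpoint
$request = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);

// Send the request and block until the response arrives
$res = $client->sendAsync($request, $options)->wait();

// Print the response body, which contains the extracted text
echo $res->getBody();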

Source code reference: GitHub - datalogics/pdf-rest-api-samples

Breaking Down the Code

The code begins by loading Composer's autoloader and importing the Guzzle classes it uses. Guzzle must be installed first, for example with composer require guzzlehttp/guzzle:

require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Utils;

Next, an instance of the Guzzle HTTP client is created, and the API key is set in the headers for authentication:

$client = new Client();
$headers = [
  'Api-Key' => 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
];
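
Hard-coding the key is fine for a quick test, but a common alternative is to read it from the environment. The following is a minimal sketch of that idea; the helper name pdfrestHeaders and the variable name PDFREST_API_KEY are assumptions made for illustration, not part of pdfRest or Guzzle.

```php
<?php
// Hypothetical helper: build the headers array from an environment variable.
// The variable name PDFREST_API_KEY is an assumption, not a pdfRest convention.
function pdfrestHeaders(string $envVar = 'PDFREST_API_KEY'): array
{
    $key = getenv($envVar);
    if ($key === false || $key === '') {
        throw new RuntimeException("Environment variable {$envVar} is not set.");
    }
    // Same header name as in the sample above
    return ['Api-Key' => $key];
}

// Usage: $headers = pdfrestHeaders();
```

This keeps the key out of source control and fails early with a clear message when the variable is missing.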

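The options array holds the multipart form data: the file to process, the content type of that file (application/pdf for a PDF, in place of the <Content-type header> placeholder), and the 'word_style' parameter. Setting 'word_style' to 'on' asks the API to include style information for the extracted words; see the API Reference for the exact output: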

$options = [
  'multipart' => [
    // File details
    [
      'name' => 'file',
      'contents' => Utils::tryFopen('/path/to/file', 'r'),
      'filename' => '/path/to/file',
      'headers' => [
        'Content-Type' => '<Content-type header>'
      ]
    ],
    // Word style option
    [
      'name' => 'word_style',
      'contents' => 'on'
    ]
  ]
];
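
If you process many files, the options array above can be wrapped in a small function. This is a sketch under stated assumptions: the function name buildExtractTextOptions is hypothetical, the input is assumed to be a PDF (hence application/pdf), and 'off' is assumed to be the counterpart of 'on' for word_style. It uses PHP's built-in fopen, which Guzzle's multipart option also accepts, so the helper itself has no Guzzle dependency.

```php
<?php
// Hypothetical helper that assembles the multipart options for one PDF.
// Assumes the input is a PDF, so the part's Content-Type is application/pdf.
function buildExtractTextOptions(string $path, bool $wordStyle = true): array
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read file: {$path}");
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open file: {$path}");
    }

    return [
        'multipart' => [
            [
                'name' => 'file',
                'contents' => $handle,
                // Send only the file name, not the full local path
                'filename' => basename($path),
                'headers' => ['Content-Type' => 'application/pdf'],
            ],
            [
                'name' => 'word_style',
                // 'off' is an assumed counterpart to the documented 'on'
                'contents' => $wordStyle ? 'on' : 'off',
            ],
        ],
    ];
}

// Usage: $options = buildExtractTextOptions('/path/to/file');
```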

A POST request is created with the API endpoint and headers, then sent with sendAsync(); calling wait() blocks until the response arrives. Finally, the response body, which contains the extracted text, is printed:

$request = new Request('POST', 'https://api.pdfrest.com/extracted-text', $headers);
$res = $client->sendAsync($request, $options)->wait();
echo $res->getBody();
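
Rather than echoing the body directly, you may want to work with it as data. The sketch below assumes nothing about the response layout: it decodes the body when it is valid JSON and otherwise keeps the raw text under a 'raw' key. The function name parseResponseBody and the 'raw' key are hypothetical choices for this example.

```php
<?php
// Hypothetical helper: interpret a response body as JSON when possible,
// otherwise return the untouched text under the 'raw' key.
function parseResponseBody(string $body): array
{
    $decoded = json_decode($body, true);
    if (json_last_error() === JSON_ERROR_NONE && is_array($decoded)) {
        return $decoded;
    }
    return ['raw' => $body];
}

// Usage with the $res object from the snippet above:
// $result = parseResponseBody((string) $res->getBody());
// print_r($result);
```

Casting getBody() to a string reads the whole response stream, which is convenient for small responses like this one.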

More Utility with pdfRest

This tutorial walked through calling the pdfRest Extract Text API from PHP. A single multipart POST request extracts the text from a PDF document, ready for use in your own applications.

To explore and demo all of the pdfRest API Tools, visit the API Lab at https://pdfrest.com/apilab/. For more detailed information on the API, refer to the API Reference documentation at https://pdfrest.com/documentation/.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub - datalogics/pdf-rest-api-samples.
