How to Use OCR to Extract Text from PDF Images with JavaScript in Node.js

Why Use OCR to Extract Text from PDF with JavaScript?

The pdfRest OCR PDF API Tool allows developers to convert scanned documents into PDFs with searchable and extractable text using Optical Character Recognition (OCR). This tutorial demonstrates how to send an API call to OCR a PDF and then use the Extract Text API Tool to extract the recognized text with JavaScript, which makes your documents more accessible and easier to manage.

Imagine you have a large collection of scanned documents, such as invoices or contracts, that you need to search through for specific information. By using OCR, you can convert these scanned images into text-searchable PDFs and then extract all text, enabling you to quickly locate the information you need without manually sifting through each document.

PDF OCR Text Extraction with JavaScript Code Example

var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");

/* In this sample, we will show how to convert a scanned document into a PDF with
* searchable and extractable text using Optical Character Recognition (OCR), and then
* extract that text from the newly created document.
*
* First, we will upload a scanned PDF to the /pdf-with-ocr-text route and capture the
* output ID. Then, we will send the output ID to the /extracted-text route, which will
* return the newly added text.
*/

var apiKey = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Replace with your API key

var ocrData = new FormData();
ocrData.append("file", fs.createReadStream("/path/to/file.pdf"), "file_name.pdf");
ocrData.append("output", "example_pdf-with-ocr-text_out");

var ocrConfig = {
  method: "post",
  maxBodyLength: Infinity,
  url: "https://api.pdfrest.com/pdf-with-ocr-text",
  headers: {
    "Api-Key": apiKey,
    ...ocrData.getHeaders(),
  },
  data: ocrData,
};

console.log("Sending POST request to OCR endpoint...");
axios(ocrConfig)
  .then(function (response) {
    console.log("Response status code: " + response.status);

    if (response.status === 200) {
      var ocrPDFID = response.data.outputId;
      console.log("Got the output ID: " + ocrPDFID);

      var extractData = new FormData();
      extractData.append("id", ocrPDFID);

      var extractConfig = {
        method: "post",
        maxBodyLength: Infinity,
        url: "https://api.pdfrest.com/extracted-text",
        headers: {
          "Api-Key": apiKey,
          ...extractData.getHeaders(),
        },
        data: extractData,
      };

      console.log("Sending POST request to extract text endpoint...");
      axios(extractConfig)
        .then(function (extractResponse) {
          console.log("Response status code: " + extractResponse.status);

          if (extractResponse.status === 200) {
            console.log(extractResponse.data.fullText);
          } else {
            console.log(extractResponse.data);
          }
        })
        .catch(function (error) {
          console.log(error.response ? error.response.data : error.message);
        });
    } else {
      console.log(response.data);
    }
  })
  .catch(function (error) {
    console.log(error.response ? error.response.data : error.message);
  });

Source: GitHub

Breaking Down the Code

The provided code demonstrates how to use the pdfRest OCR PDF API Tool to convert a scanned document into a PDF with searchable text and then extract that text. Let's break down the code step-by-step:

var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");

Here, we import the necessary modules: axios for making HTTP requests, form-data (imported as FormData) for building multipart form bodies, and fs for reading the input file from disk.

var apiKey = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Replace with your API key

Replace the placeholder with your actual API key from pdfRest.
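Hardcoding the key is fine for a quick test, but it is easy to commit by accident. A minimal sketch of an alternative is shown below, assuming the key is stored in an environment variable; the variable name PDFREST_API_KEY and the helper getApiKey are hypothetical and not part of the pdfRest sample.

```javascript
// Hypothetical helper: reads the API key from an environment variable.
// The name PDFREST_API_KEY is an assumption; use whatever name you prefer.
function getApiKey(env) {
  // Fall back to the real process environment when no object is passed in.
  var source = env || process.env;
  var key = source.PDFREST_API_KEY;
  if (typeof key !== "string" || key.trim() === "") {
    throw new Error("Set the PDFREST_API_KEY environment variable first.");
  }
  return key.trim();
}
```

With this in place, the assignment could become var apiKey = getApiKey(); and the script would fail early with a clear message if the variable is missing.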

var ocrData = new FormData();
ocrData.append("file", fs.createReadStream("/path/to/file.pdf"), "file_name.pdf");
ocrData.append("output", "example_pdf-with-ocr-text_out");

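We create a new FormData object and append the scanned PDF file and an output name. The file field is the PDF to be processed, streamed from disk, and the third argument to append sets the file name sent with the upload. The output field sets the base name of the resulting file; it is separate from the outputId that the API returns later.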

var ocrConfig = {
  method: "post",
  maxBodyLength: Infinity,
  url: "https://api.pdfrest.com/pdf-with-ocr-text",
  headers: {
    "Api-Key": apiKey,
    ...ocrData.getHeaders(),
  },
  data: ocrData,
};

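We configure the POST request to the /pdf-with-ocr-text endpoint. The API key and the multipart headers generated by ocrData.getHeaders() go in the headers, and the form data goes in the body. Setting maxBodyLength to Infinity removes the client-side cap on request body size so that large PDFs can be uploaded.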
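The same request shape is used again for the /extracted-text call, so the configuration could be factored into one function. This is a sketch only: buildRequestConfig is a hypothetical helper name, and it assumes the form argument exposes getHeaders() the way a form-data instance does.

```javascript
// Hypothetical helper: builds an axios request config for a pdfRest endpoint.
// `form` is assumed to be a form-data instance (anything with getHeaders()).
function buildRequestConfig(url, apiKey, form) {
  return {
    method: "post",
    maxBodyLength: Infinity, // do not cap the upload size on the client
    url: url,
    // Merge the API key with the multipart headers (content type + boundary).
    headers: Object.assign({ "Api-Key": apiKey }, form.getHeaders()),
    data: form,
  };
}
```

For example, ocrConfig could then be written as buildRequestConfig("https://api.pdfrest.com/pdf-with-ocr-text", apiKey, ocrData).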

axios(ocrConfig)
  .then(function (response) {
    console.log("Response status code: " + response.status);

    if (response.status === 200) {
      var ocrPDFID = response.data.outputId;
      console.log("Got the output ID: " + ocrPDFID);

We send the POST request and check if the response status is 200 (OK). If successful, we capture the outputId from the response, which is used to identify the processed PDF.

      var extractData = new FormData();
      extractData.append("id", ocrPDFID);

      var extractConfig = {
        method: "post",
        maxBodyLength: Infinity,
        url: "https://api.pdfrest.com/extracted-text",
        headers: {
          "Api-Key": apiKey,
          ...extractData.getHeaders(),
        },
        data: extractData,
      };

      console.log("Sending POST request to extract text endpoint...");
      axios(extractConfig)
        .then(function (extractResponse) {
          console.log("Response status code: " + extractResponse.status);

          if (extractResponse.status === 200) {
            console.log(extractResponse.data.fullText);
          } else {
            console.log(extractResponse.data);
          }
        })
        .catch(function (error) {
          console.log(error.response ? error.response.data : error.message);
        });
    } else {
      console.log(response.data);
    }
  })
  .catch(function (error) {
    console.log(error.response ? error.response.data : error.message);
  });

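We create another FormData object containing the outputId and configure a POST request to the /extracted-text endpoint. If successful, the extracted text, returned in the fullText field, is printed to the console. Failures from either request land in the .catch callbacks, since axios rejects non-2xx responses by default; each callback prints the API's error body when one is available and the error message otherwise.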
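The nested .then calls can also be flattened with async/await. The sketch below shows only the control flow of the two steps; ocrThenExtract and its post parameter are hypothetical names, and post is assumed to be a function you supply that sends a form to a URL and resolves with an object shaped like an axios response ({ status, data }).

```javascript
// Sketch of the OCR-then-extract flow using async/await.
// `post(url, fields)` is a hypothetical function supplied by the caller; it
// is assumed to resolve with an axios-like response: { status, data }.
async function ocrThenExtract(post, file) {
  // Step 1: OCR the scanned PDF and capture the ID of the processed file.
  var ocrResponse = await post("https://api.pdfrest.com/pdf-with-ocr-text", {
    file: file,
    output: "example_pdf-with-ocr-text_out",
  });
  var ocrPDFID = ocrResponse.data.outputId;
  if (!ocrPDFID) {
    throw new Error("OCR response did not include an outputId");
  }

  // Step 2: pass that ID to the extraction endpoint and return the text.
  var extractResponse = await post("https://api.pdfrest.com/extracted-text", {
    id: ocrPDFID,
  });
  return extractResponse.data.fullText;
}
```

In a real script, post would typically build a FormData object from the fields and hand it to axios with the Api-Key header, as in the sample above. Any rejected request propagates out of ocrThenExtract, so a single try/catch (or one .catch) around the call can replace the two separate error handlers.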

Beyond the Tutorial

In this tutorial, you learned how to use the pdfRest OCR PDF and Extract Text API Tools to convert a scanned document into a searchable PDF and extract the text using JavaScript. This process can be invaluable for managing and searching through large collections of scanned documents.

To explore more functionalities, try out all the pdfRest API Tools in the API Lab. For detailed information on each endpoint, refer to the API Reference Guide.