How to Use OCR to Extract Text from PDF Images with JavaScript in Node.js
Why Use OCR to Extract Text from PDF with JavaScript?
The pdfRest OCR PDF API Tool allows developers to convert scanned documents into PDFs with searchable and extractable text using Optical Character Recognition (OCR). This tutorial will demonstrate how to send an API call to OCR a PDF and then use the Extract Text API Tool to extract the text using JavaScript, making your documents more accessible and easier to manage.
Imagine you have a large collection of scanned documents, such as invoices or contracts, that you need to search through for specific information. By using OCR, you can convert these scanned images into text-searchable PDFs and then extract all text, enabling you to quickly locate the information you need without manually sifting through each document.
PDF OCR Text Extraction with JavaScript Code Example
var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");

/* In this sample, we will show how to convert a scanned document into a PDF with
 * searchable and extractable text using Optical Character Recognition (OCR), and then
 * extract that text from the newly created document.
 *
 * First, we will upload a scanned PDF to the /pdf-with-ocr-text route and capture the
 * output ID. Then, we will send the output ID to the /extracted-text route, which will
 * return the newly added text.
 */

var apiKey = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Replace with your API key

var ocrData = new FormData();
ocrData.append("file", fs.createReadStream("/path/to/file.pdf"), "file_name.pdf");
ocrData.append("output", "example_pdf-with-ocr-text_out");

var ocrConfig = {
  method: "post",
  maxBodyLength: Infinity,
  url: "https://api.pdfrest.com/pdf-with-ocr-text",
  headers: {
    "Api-Key": apiKey,
    ...ocrData.getHeaders(),
  },
  data: ocrData,
};

console.log("Sending POST request to OCR endpoint...");
axios(ocrConfig)
  .then(function (response) {
    console.log("Response status code: " + response.status);
    if (response.status === 200) {
      var ocrPDFID = response.data.outputId;
      console.log("Got the output ID: " + ocrPDFID);

      var extractData = new FormData();
      extractData.append("id", ocrPDFID);

      var extractConfig = {
        method: "post",
        maxBodyLength: Infinity,
        url: "https://api.pdfrest.com/extracted-text",
        headers: {
          "Api-Key": apiKey,
          ...extractData.getHeaders(),
        },
        data: extractData,
      };

      console.log("Sending POST request to extract text endpoint...");
      axios(extractConfig)
        .then(function (extractResponse) {
          console.log("Response status code: " + extractResponse.status);
          if (extractResponse.status === 200) {
            console.log(extractResponse.data.fullText);
          } else {
            console.log(extractResponse.data);
          }
        })
        .catch(function (error) {
          console.log(error.response ? error.response.data : error.message);
        });
    } else {
      console.log(response.data);
    }
  })
  .catch(function (error) {
    console.log(error.response ? error.response.data : error.message);
  });
Source: GitHub
Breaking Down the Code
The provided code demonstrates how to use the pdfRest OCR PDF API Tool to convert a scanned document into a PDF with searchable text and then extract that text. Let's break down the code step-by-step:
var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");
Here, we import the necessary modules: axios for making HTTP requests, FormData (from the form-data package) for building multipart form data, and fs for file system operations.
var apiKey = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Replace with your API key
Replace the placeholder with your actual API key from pdfRest.
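Hardcoding the key is fine for a quick test, but one common alternative is to read it from an environment variable so it stays out of source control. The sketch below assumes a variable named PDFREST_API_KEY and a helper named getApiKey; both names are chosen for illustration and are not something pdfRest requires.

```javascript
// Minimal sketch: read the API key from the environment instead of hardcoding it.
// PDFREST_API_KEY and getApiKey are hypothetical names chosen for this example.
function getApiKey(env) {
  var source = env || process.env;
  var key = source.PDFREST_API_KEY;
  if (typeof key !== "string" || key.trim() === "") {
    throw new Error("PDFREST_API_KEY is not set");
  }
  // Trim stray whitespace that can sneak in from shell exports or .env files.
  return key.trim();
}
```

With this in place, the assignment in the sample could become var apiKey = getApiKey(); and the script fails early with a clear message if the variable is missing.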
var ocrData = new FormData();
ocrData.append("file", fs.createReadStream("/path/to/file.pdf"), "file_name.pdf");
ocrData.append("output", "example_pdf-with-ocr-text_out");
We create a new FormData object and append the scanned PDF file and an output name. The file field is the PDF to be processed, and the output field sets the name to use for the resulting file.
var ocrConfig = {
  method: "post",
  maxBodyLength: Infinity,
  url: "https://api.pdfrest.com/pdf-with-ocr-text",
  headers: {
    "Api-Key": apiKey,
    ...ocrData.getHeaders(),
  },
  data: ocrData,
};
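We configure the POST request to the /pdf-with-ocr-text endpoint, sending the API key in the Api-Key header and the form data in the request body. The call to ocrData.getHeaders() adds the multipart Content-Type header, including its boundary.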
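Both requests in this sample use the same config shape and differ only in the endpoint and the form data. As a rough sketch, that shared shape could be factored into a small helper. The name buildRequestConfig is hypothetical and is not part of axios or pdfRest; the helper only assumes the form object exposes getHeaders(), as the form-data package does.

```javascript
// Minimal sketch: build the axios request config for a pdfRest endpoint.
// buildRequestConfig is a hypothetical helper introduced for this example.
function buildRequestConfig(endpoint, apiKey, formData) {
  return {
    method: "post",
    maxBodyLength: Infinity, // same setting as the sample, so large uploads are not cut off
    url: "https://api.pdfrest.com/" + endpoint,
    headers: {
      "Api-Key": apiKey,
      ...formData.getHeaders(), // multipart Content-Type with boundary
    },
    data: formData,
  };
}
```

With this helper, the OCR request would be sent as axios(buildRequestConfig("pdf-with-ocr-text", apiKey, ocrData)), and the extraction request would reuse it with "extracted-text" and its own form data.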
axios(ocrConfig)
  .then(function (response) {
    console.log("Response status code: " + response.status);
    if (response.status === 200) {
      var ocrPDFID = response.data.outputId;
      console.log("Got the output ID: " + ocrPDFID);
We send the POST request and check if the response status is 200 (OK). If successful, we capture the outputId from the response, which is used to identify the processed PDF.
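By default, axios rejects the promise for statuses outside the 2xx range, so most failures end up in the catch handler instead of the else branch. If you want the success path to fail loudly when the ID is missing, a small guard can help. The following is a sketch only; getOutputId is a hypothetical helper, and it assumes nothing about the response beyond the status and outputId fields the sample already reads.

```javascript
// Minimal sketch: pull the output ID out of an axios-style response object.
// getOutputId is a hypothetical helper; outputId is the field the sample reads.
function getOutputId(response) {
  if (!response || response.status !== 200) {
    var status = response ? response.status : "unknown";
    throw new Error("OCR request failed with status " + status);
  }
  var id = response.data && response.data.outputId;
  if (!id) {
    throw new Error("OCR response did not include an outputId");
  }
  return id;
}
```

Inside the then callback, var ocrPDFID = getOutputId(response); would replace the status check, and any thrown error would flow to the existing catch handler.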
      var extractData = new FormData();
      extractData.append("id", ocrPDFID);

      var extractConfig = {
        method: "post",
        maxBodyLength: Infinity,
        url: "https://api.pdfrest.com/extracted-text",
        headers: {
          "Api-Key": apiKey,
          ...extractData.getHeaders(),
        },
        data: extractData,
      };

      console.log("Sending POST request to extract text endpoint...");
      axios(extractConfig)
        .then(function (extractResponse) {
          console.log("Response status code: " + extractResponse.status);
          if (extractResponse.status === 200) {
            console.log(extractResponse.data.fullText);
          } else {
            console.log(extractResponse.data);
          }
        })
        .catch(function (error) {
          console.log(error.response ? error.response.data : error.message);
        });
    } else {
      console.log(response.data);
    }
  })
  .catch(function (error) {
    console.log(error.response ? error.response.data : error.message);
  });
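We create another FormData object that carries the outputId in its id field and configure a POST request to the /extracted-text endpoint. If the request succeeds, the extracted text (the fullText field of the response) is printed to the console; otherwise, the error response or message is logged.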
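The nested then callbacks can also be written with async/await, which keeps the two steps at the same indentation level. The sketch below is one possible restructuring, not the official sample. The function ocrAndExtract and its parameters post and makeForm are hypothetical: post is any function that takes an axios-style config and returns a promise for a response (axios itself fits), and makeForm returns an empty form object with append() and getHeaders(), such as a FormData instance from the form-data package.

```javascript
// Minimal sketch of the same two-step flow written with async/await.
// ocrAndExtract, post, and makeForm are hypothetical names for this example.
async function ocrAndExtract(post, makeForm, apiKey, fileStream, fileName) {
  // Build the request config shared by both calls.
  function config(endpoint, form) {
    return {
      method: "post",
      maxBodyLength: Infinity,
      url: "https://api.pdfrest.com/" + endpoint,
      headers: { "Api-Key": apiKey, ...form.getHeaders() },
      data: form,
    };
  }

  // Step 1: upload the scanned PDF and run OCR on it.
  var ocrForm = makeForm();
  ocrForm.append("file", fileStream, fileName);
  ocrForm.append("output", "example_pdf-with-ocr-text_out");
  var ocrResponse = await post(config("pdf-with-ocr-text", ocrForm));

  // Step 2: send the output ID to the text extraction endpoint.
  var extractForm = makeForm();
  extractForm.append("id", ocrResponse.data.outputId);
  var extractResponse = await post(config("extracted-text", extractForm));

  return extractResponse.data.fullText;
}
```

With the modules from the sample, a call might look like ocrAndExtract(axios, function () { return new FormData(); }, apiKey, fs.createReadStream("/path/to/file.pdf"), "file_name.pdf"), followed by then and catch handlers. Passing the HTTP client in as an argument also makes the flow easy to exercise with a stub instead of a live API call.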
Beyond the Tutorial
In this tutorial, you learned how to use the pdfRest OCR PDF and Extract Text API Tools to convert a scanned document into a searchable PDF and extract the text using JavaScript. This process can be invaluable for managing and searching through large collections of scanned documents.
To explore more functionalities, try out all the pdfRest API Tools in the API Lab. For detailed information on each endpoint, refer to the API Reference Guide.