How to Extract PDF Text with JavaScript in Node.js

Learn how to extract text from a PDF with JavaScript and export it to JSON for data processing, using the pdfRest Extract Text API tool.

Why Extract PDF Text with JavaScript?

The pdfRest Extract Text API Tool lets developers extract text from PDF documents programmatically. It is useful for content analysis, data migration, and repurposing text from PDF files in other formats or applications.

In this tutorial, we will demonstrate how to make an API call to the Extract Text endpoint using JavaScript to retrieve text from a PDF document.

Extract PDF Text JavaScript Code Example

// This request demonstrates how to extract text from a PDF document.
var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");

// Create a new form data instance and append the PDF file and parameters to it
var data = new FormData();
data.append("file", fs.createReadStream("/path/to/file"));
data.append("word_style", "on");

// Define configuration options for the Axios request
var config = {
  method: "post",
  maxBodyLength: Infinity, // do not cap the request body size, so large PDFs can be uploaded
  url: "https://api.pdfrest.com/extracted-text",
  headers: {
    "Api-Key": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", // Replace with your API key
    ...data.getHeaders(), // add the multipart Content-Type header, including its boundary
  },
  data: data, // the form data (file and parameters) to send as the request body
};

// send request and handle response or error
axios(config)
  .then(function (response) {
    console.log(JSON.stringify(response.data));
  })
  .catch(function (error) {
    console.log(error);
  });

Source of the provided code: pdf-rest-api-samples on GitHub

Breaking Down the Code

The code uses the Axios library to make the HTTP request and the form-data package to build the multipart/form-data body that file uploads require.

var axios = require("axios");
var FormData = require("form-data");
var fs = require("fs");

These lines import the required modules. Axios is used for making the HTTP request, FormData for constructing the multipart request body, and fs (file system) for accessing the file system to read the PDF file.

var data = new FormData();
data.append("file", fs.createReadStream("/path/to/file"));
data.append("word_style", "on");

A new FormData instance is created, and the PDF file is appended along with the 'word_style' parameter. The 'word_style' parameter, when set to 'on', includes style information for each word in the output.
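Because fs.createReadStream only reports a missing or unreadable file once the stream is consumed, it can help to check the path before building the form. The following is a minimal sketch, not part of the pdfRest sample: checkPdfPath is a hypothetical helper name, and the "%PDF-" header test is a simple heuristic that assumes the file begins with the standard PDF header.

```javascript
const fs = require("fs");

// Hypothetical helper, not part of the pdfRest sample.
// Returns filePath if it points at a regular file that starts with the
// "%PDF-" header; throws a descriptive Error otherwise.
function checkPdfPath(filePath) {
  if (!fs.existsSync(filePath) || !fs.statSync(filePath).isFile()) {
    throw new Error("No such file: " + filePath);
  }

  // Read only the first five bytes instead of loading the whole file
  const header = Buffer.alloc(5);
  const fd = fs.openSync(filePath, "r");
  try {
    fs.readSync(fd, header, 0, 5, 0);
  } finally {
    fs.closeSync(fd);
  }

  if (header.toString("latin1") !== "%PDF-") {
    throw new Error("Not a PDF (missing %PDF- header): " + filePath);
  }
  return filePath;
}
```

With this in place, the file could be appended as data.append("file", fs.createReadStream(checkPdfPath("/path/to/file"))), so a bad path fails before any request is sent.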

var config = {
  method: "post",
  maxBodyLength: Infinity,
  url: "https://api.pdfrest.com/extracted-text",
  headers: {
    "Api-Key": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    ...data.getHeaders(),
  },
  data: data,
};

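This is the configuration for the Axios request. It specifies the HTTP method, the endpoint URL, the headers (the API key plus the multipart headers generated by FormData), and the request body. Setting maxBodyLength to Infinity removes the cap on request body size so that large PDFs can be uploaded.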
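The sample places the API key directly in the config object. One common alternative is to read it from an environment variable so the key stays out of source control. This is a sketch under assumptions: PDFREST_API_KEY is a variable name chosen for illustration, and buildHeaders is a hypothetical helper, not something the pdfRest samples define.

```javascript
// Hypothetical helper, not part of the pdfRest sample.
// PDFREST_API_KEY is an assumed variable name chosen for this sketch.
// formHeaders is the object returned by data.getHeaders(); env defaults to
// process.env and can be swapped for a plain object.
function buildHeaders(formHeaders, env = process.env) {
  const apiKey = env.PDFREST_API_KEY;
  if (!apiKey) {
    throw new Error("Set the PDFREST_API_KEY environment variable first");
  }
  // Same shape as the headers object in the config: API key plus form headers
  return { "Api-Key": apiKey, ...formHeaders };
}
```

In the config, the headers entry would then become headers: buildHeaders(data.getHeaders()), with the key supplied when the script is launched.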

axios(config)
  .then(function (response) {
    console.log(JSON.stringify(response.data));
  })
  .catch(function (error) {
    console.log(error);
  });

Axios sends the request with the given configuration. On success, the response body is serialized with JSON.stringify and logged to the console. If the request fails, the error is caught and logged.
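Logging the whole error object can be noisy. Axios errors carry a response property when the server replied with an error status and a request property when no reply arrived, so the catch handler can report something shorter. The helper below is a minimal sketch; describeRequestError is a hypothetical name, and the message wording is an assumption of this example.

```javascript
// Hypothetical helper, not part of the pdfRest sample.
// Turns an Axios-style error into a one-line summary.
function describeRequestError(error) {
  if (error.response) {
    // The server replied with a status code outside the 2xx range
    const body =
      typeof error.response.data === "string"
        ? error.response.data
        : JSON.stringify(error.response.data);
    return "HTTP " + error.response.status + ": " + body;
  }
  if (error.request) {
    // The request was sent but no response was received
    return "No response received: " + error.message;
  }
  // The failure happened while the request was being set up
  return "Request setup failed: " + error.message;
}
```

It could be used as .catch(function (error) { console.error(describeRequestError(error)); }) in place of logging the raw error.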

Beyond the Tutorial

In this tutorial, we used JavaScript to call pdfRest's Extract Text endpoint and retrieve the text of a PDF document for use in other applications and services. The code shows how to build a multipart request containing a file and parameters, send it, and handle the API's response.
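To pass the extracted text on to other tools, the response body can be written to a JSON file instead of only being logged. The sketch below makes no assumption about the fields in the response. It serializes whatever object it is given. saveJson and the output path in the usage note are hypothetical names chosen for illustration.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper, not part of the pdfRest sample.
// Writes any JSON-serializable value to outPath, creating parent
// directories as needed, and returns outPath.
function saveJson(data, outPath) {
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  fs.writeFileSync(outPath, JSON.stringify(data, null, 2) + "\n", "utf8");
  return outPath;
}
```

Inside the .then handler, a call such as saveJson(response.data, "output/extracted-text.json") would store the result for later processing.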

For further exploration, you are encouraged to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at pdf-rest-api-samples on GitHub.
