Extract Text

Extract Text is a REST API tool designed to efficiently extract all text from PDF documents, with options to include detailed style and positional information. This tool is ideal for developers and businesses looking to streamline data extraction processes and integrate text content into various applications, including AI-driven workflows.

Key Benefits of Extract Text API

Extract text from PDFs with precision, capturing all content for further analysis or integration into databases and other systems.
Include optional style information, such as font type, size, and color, to preserve the original document's appearance in the extracted text.
Utilize positional data to maintain text layout and structure, essential for applications requiring exact text placement, such as digital archiving and document conversion.
Automate text extraction workflows to enhance efficiency and reduce manual data entry, perfect for high-volume document processing.
Enhance data accessibility by converting static PDF content into dynamic, usable data for business intelligence, compliance, and reporting purposes.
Supply large language models (LLMs) with rich content from PDF archives, enabling advanced AI applications like natural language processing and sentiment analysis. Transform static documents into valuable data sources for AI-driven insights and decision-making.

Try Now with API Lab

Start right from your browser: upload files, choose parameters, generate code, and send API calls directly from API Lab!

Request

POST https://api.pdfrest.com/extracted-text

Headers

Api-Key

Don't have a key? Create an account to get one.

Response-Type

Choose between a full response after processing completes or an immediate response containing only the requestId to poll for the processing status later.

Full Response

Request ID
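
When the Response-Type header asks for only a requestId, the client has to poll until processing completes. The following is a minimal sketch of such a polling loop, not pdfRest's own client code. `fetch_status` and `is_done` are hypothetical callables supplied by the caller, because the status endpoint and the fields of its response are not described on this page.

```python
import time
from typing import Any, Callable


def poll_until_done(
    fetch_status: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch_status() repeatedly until is_done(result) is true.

    Both hooks are hypothetical: fetch_status should ask the server about a
    requestId, and is_done should decide whether the returned payload
    represents a finished job.
    """
    for attempt in range(max_attempts):
        result = fetch_status()
        if is_done(result):
            return result
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")
```

Injecting `sleep` keeps the loop easy to test and lets a caller swap in a backoff strategy without changing the function.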

Required Parameters

file

File to be uploaded and processed

id

Alphanumeric ID (UUID) of existing file on server to be processed

Optional Parameters

full_text

Extract the full text from the document

preserve_line_breaks

When enabled, this feature identifies and maintains the original line breaks within the text, inserting a newline character ("\n") at each break point.

word_style

Extract styling information for the words

word_coordinates

Extract coordinate information for the words

output_type

Specify whether to save output as a file with .json extension or return output directly in the JSON response
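
The optional parameters above are sent as ordinary form fields. The sketch below collects them into a dictionary and rejects a full_text value outside the three options this page names (off, by_page, document). The helper name `build_extract_options` is hypothetical, and two things are assumptions rather than documented facts: that every toggle takes the strings "on" and "off", and that output_type takes "json" or "file".

```python
from typing import Dict

FULL_TEXT_MODES = ("off", "by_page", "document")


def _toggle(enabled: bool) -> str:
    # Assumption: boolean-style parameters take the strings "on" and "off".
    return "on" if enabled else "off"


def build_extract_options(
    full_text: str = "document",
    preserve_line_breaks: bool = False,
    word_style: bool = False,
    word_coordinates: bool = False,
    output_type: str = "json",
) -> Dict[str, str]:
    """Return form fields for an Extract Text request (hypothetical helper).

    The defaults are this sketch's choices, not necessarily the API's.
    """
    if full_text not in FULL_TEXT_MODES:
        raise ValueError(f"full_text must be one of {FULL_TEXT_MODES}")
    # Assumption: "json" returns output inline, "file" saves a .json file.
    if output_type not in ("json", "file"):
        raise ValueError("output_type must be 'json' or 'file'")
    return {
        "full_text": full_text,
        "preserve_line_breaks": _toggle(preserve_line_breaks),
        "word_style": _toggle(word_style),
        "word_coordinates": _toggle(word_coordinates),
        "output_type": output_type,
    }
```

The resulting dictionary could be passed alongside the file or id field of the multipart request.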

Code

curl -X POST "https://api.pdfrest.com/extracted-text" \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Api-Key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
  -F "file=@/path/to/file"

Response

Once you've sent your POST request and received a valid response, you can download your output file using the output URL.

Build Your Solution

You have document processing problems; we have solutions. Explore the many ways pdfRest can align your documents with your business objectives.

Browse all solutions

Generating Summaries of PDF Documents using ChatGPT

Translate PDF Text to New Language with ChatGPT

Parse PDF Files to Streamline Data Extraction

Create Searchable PDF Files with OCR

Discover Sentiment Insights from PDF Documents with pdfRest and ChatGPT

Convert PDF to Text to Unlock Trapped Data

Why is pdfRest the best API to extract text from PDF?

pdfRest offers the best solution for extracting text from PDF documents because it preserves positional data, includes text style information, and unlocks the text data held in PDF archives.

Extract Text with Precise Positional Data from PDFs

Unlike most PDF text extraction tools, pdfRest's Extract Text API can optionally include page and coordinate metadata for each word extracted from the PDF in an easy-to-parse JSON format. By enabling the word_coordinates parameter, you gain access to:

Detailed positional data that preserves the exact layout of text, crucial for applications requiring precise text placement.
The ability to create PDF viewers with searchable and selectable text, enhancing user interaction and accessibility.
Positional input that helps AI models understand document structure, improving tasks like document classification and layout analysis.

This capability is essential for developers and businesses aiming to maintain the integrity of text layout when converting PDFs to other formats or integrating into complex systems.

Preserve Detailed Styling with PDF Extracted Text

With the word_style option, pdfRest's Extract Text API provides detailed style information about each word extracted from the PDF, including font type, size, color, and color space. This feature supports:

Preservation of the original document's appearance, ensuring that text maintains its intended look and feel in other formats or user interfaces.
Enhanced document fidelity, particularly important for industries where visual consistency is critical, such as publishing and design.
The option to combine style and positional data for a comprehensive extraction, or to disable style information when not needed, offering flexibility in your data processing.

Unlock and Aggregate Valuable Text Data from PDF Archives

The world's collective archive of PDFs is estimated to contain over 2.5 trillion documents, representing a vast opportunity for discovering new sources of untapped data. pdfRest's Extract Text API empowers you to:

Batch process large volumes of PDFs, automating workflows to efficiently extract and aggregate text data.
Facilitate easy database entry and integration with other services, transforming static documents into dynamic data sources.
Leverage advanced extraction capabilities to support AI-driven insights, business intelligence, and data-driven decision-making.
Supply extracted text data to large language models (LLMs) for natural language processing, enabling advanced AI applications such as automated content generation and sentiment analysis.

Start from Code Examples

See more code examples in our GitHub repository

Need more help?

Start with a Tutorial for step-by-step guidance

How to Extract PDF Text in .NET with C#

How to Extract PDF Text with cURL

How to Extract PDF Text with JavaScript in NodeJS

How to Extract PDF Text with PHP

How to Extract PDF Text with Python

How to Programmatically Extract Text from PDF

How to Use OCR to Extract Text from PDF Images in .NET with C#

How to Use OCR to Extract Text from PDF Images with cURL

Customize Your Solution

Learn about the parameters for this tool to create your custom solution.

File

The file parameter allows you to select a local file to be uploaded to pdfRest’s processing server.

ID

The id parameter allows you to submit a resource ID generated by one of our API Tools. Each of our API Tools assigns a unique resource ID to your output file(s), allowing you to chain requests together without having to download intermediate files between requests.
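
Since the request form describes the id as a UUID, a client can validate it locally before chaining it into the next request. The sketch below works under that assumption. `input_field` is a hypothetical helper, not part of any pdfRest library, and it only chooses which form field to send; uploading the file itself is left to the caller.

```python
import uuid
from typing import Dict, Optional


def input_field(
    file_path: Optional[str] = None,
    resource_id: Optional[str] = None,
) -> Dict[str, str]:
    """Pick the input for a request: a local file path or an existing id."""
    if (file_path is None) == (resource_id is None):
        raise ValueError("provide exactly one of file_path or resource_id")
    if resource_id is not None:
        # uuid.UUID raises ValueError when the string is not a valid UUID.
        return {"id": str(uuid.UUID(resource_id))}
    # The caller still has to open this path and attach it as the upload.
    return {"file": file_path}
```

Validating the id up front turns a malformed value into a local error instead of a failed API call partway through a chain.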

Word Style

The word_style parameter allows you to toggle whether or not to extract styling information about font and color for individual words in the document.

Word Coordinates

The word_coordinates parameter allows you to toggle whether or not to extract coordinate information for the text boxes of individual words in the document.

Full Text

The full_text parameter allows you to specify whether to extract the full text of a document. The three options are as follows:

off: Do not extract the full text
by_page: Extract the full text of each page and return them as separate chunks
document: Extract the full text of the document and return it as a single block of text

Preserve Line Breaks

When preserve_line_breaks is set to on, this feature identifies and maintains the original line breaks within the text, inserting a newline character ("\n") at each break point.

Safe & Secure

Confidently process your sensitive data with pdfRest. Our platform is built for enterprise-grade security and compliance, including GDPR and HIPAA, with SOC 2 Type 2 certification in progress. Your data's protection is our priority.

How We Protect Your Data

Frequently Asked Questions

Need more help? Contact Us or visit our documentation.

Generate a self-service API Key now!

Create your FREE API Key to start processing PDFs in seconds, only possible with pdfRest.

What is the Extract Text API and how does it work for PDF text extraction?

Why should I use pdfRest's Extract Text API for extracting text from PDFs?

What is positional data in PDF text extraction and why is it important?

Can I extract text with style information from a PDF document?

How can I preserve line breaks when extracting text from a PDF?

What output formats are available for extracted text from PDFs?

Can I extract text from specific sections of a PDF document?

How do I integrate the Extract Text API into my workflow for PDF text extraction?

Can I extract PDF text without writing code?

How does pdfRest ensure quality and accuracy in PDF text extraction?

Are there any limitations on the type of text that can be extracted from PDFs?

Can pdfRest extract text from PDFs under GDPR compliance?

Is there a self-hosted option for pdfRest's text extraction?