How to Extract PDF Text in .NET with C#
Why Extract PDF Text with C#?
The pdfRest Extract Text API Tool extracts text from PDF documents programmatically. This tutorial walks through calling the Extract Text endpoint from C# to retrieve the text content of a PDF file.
For instance, a user might use the Extract Text API to automate the process of converting PDF reports into editable text for further data analysis or integration into other applications.
Extract PDF Text C# Code Example
using System.Text;

using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })
{
    using (var request = new HttpRequestMessage(HttpMethod.Post, "extracted-text"))
    {
        request.Headers.TryAddWithoutValidation("Api-Key", "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx");
        request.Headers.Accept.Add(new("application/json"));
        var multipartContent = new MultipartFormDataContent();

        var byteArray = File.ReadAllBytes("/path/to/file");
        var byteAryContent = new ByteArrayContent(byteArray);
        multipartContent.Add(byteAryContent, "file", "file_name");
        byteAryContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");

        var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes("on"));
        multipartContent.Add(byteArrayOption, "word_style");

        request.Content = multipartContent;
        var response = await httpClient.SendAsync(request);

        var apiResult = await response.Content.ReadAsStringAsync();

        Console.WriteLine("API response received.");
        Console.WriteLine(apiResult);
    }
}
Source code reference: pdfRest API Samples on GitHub
Breaking Down the Code
The code block above demonstrates how to make a multipart/form-data POST request to the pdfRest Extract Text API endpoint. Let's break down the key parts of the code:
using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })
This initializes a new instance of the HttpClient class and sets the base address to the pdfRest API.
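To see how the relative path "extracted-text" from the request combines with the base address, the following minimal sketch resolves the two with the Uri class directly. It performs no network call; the variable names are illustrative and not part of the original sample.

```csharp
using System;

// Resolve the relative request path against the base address,
// which is how a relative request URI ends up as a full URL.
var baseAddress = new Uri("https://api.pdfrest.com");
var endpoint = new Uri(baseAddress, "extracted-text");

Console.WriteLine(endpoint.AbsoluteUri); // https://api.pdfrest.com/extracted-text
```

Relative resolution follows standard URI rules, so if a base address carried a path without a trailing slash, the last path segment would be replaced rather than extended. The base address used here has no path, so that does not come into play.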
request.Headers.TryAddWithoutValidation("Api-Key", "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx");
The API key is added to the request headers. Replace the placeholder with your actual API key to authenticate your request.
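One way to avoid pasting a real key into source code is to read it from an environment variable. This is a sketch under assumptions: the variable name PDFREST_API_KEY and the helper RequireApiKey are hypothetical and are not defined by pdfRest.

```csharp
using System;

// Hypothetical variable name; use whatever your environment defines.
const string ApiKeyVariable = "PDFREST_API_KEY";

// Returns the trimmed key, or throws if the variable is missing or blank.
static string RequireApiKey(string variableName)
{
    var value = Environment.GetEnvironmentVariable(variableName);
    if (string.IsNullOrWhiteSpace(value))
    {
        throw new InvalidOperationException(
            $"Set the {variableName} environment variable to your pdfRest API key.");
    }
    return value.Trim();
}

try
{
    var apiKey = RequireApiKey(ApiKeyVariable);
    // Avoid printing the key itself; its length is enough to confirm it loaded.
    Console.WriteLine($"Loaded an API key of length {apiKey.Length}.");
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}
```

The returned value could then be passed as the second argument of the TryAddWithoutValidation call in place of the placeholder string.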
var multipartContent = new MultipartFormDataContent();
A new instance of MultipartFormDataContent is created to hold the parts of the multipart request.
var byteArray = File.ReadAllBytes("/path/to/file");
var byteAryContent = new ByteArrayContent(byteArray);
multipartContent.Add(byteAryContent, "file", "file_name");
byteAryContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");
The PDF file is read into a byte array, wrapped in a ByteArrayContent object, and added to the multipart content with the name "file". The content type is set to "application/pdf".
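The file part can also be assembled in a small helper that sets the content type through the typed MediaTypeHeaderValue property. The sketch below is illustrative only: AddPdfPart and the path /path/to/report.pdf are hypothetical, and placeholder bytes stand in for a real document so the snippet runs without any file on disk.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: wraps the bytes and attaches them under the "file" field.
static void AddPdfPart(MultipartFormDataContent target, byte[] pdfBytes, string fileName)
{
    var pdfContent = new ByteArrayContent(pdfBytes);
    // Typed alternative to TryAddWithoutValidation("Content-Type", "application/pdf").
    pdfContent.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
    target.Add(pdfContent, "file", fileName);
}

// Stand-in bytes so the sketch runs without a real document;
// in practice these would come from File.ReadAllBytes(path).
var placeholderBytes = Encoding.ASCII.GetBytes("%PDF-1.7");

using var form = new MultipartFormDataContent();
AddPdfPart(form, placeholderBytes, Path.GetFileName("/path/to/report.pdf"));

// Inspect the headers that were attached to each part.
foreach (var item in form)
{
    Console.WriteLine(item.Headers.ContentType);
    Console.WriteLine(item.Headers.ContentDisposition);
}
```

Deriving the file name with Path.GetFileName keeps directory names out of the uploaded part, which may or may not matter for your use case.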
var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes("on"));
multipartContent.Add(byteArrayOption, "word_style");
An additional option is added to the request as its own form field. The "word_style" parameter is set to "on", which asks the API to preserve word styling in the extracted text.
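If a request carries several text options, the encode-and-add step can be wrapped in a helper. The sketch below assumes a hypothetical AddTextField function and uses only the "word_style" field from the sample; it introduces no other pdfRest parameters.

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical helper: adds a plain text form field as UTF-8 bytes,
// mirroring the encoding step used in the sample.
static void AddTextField(MultipartFormDataContent target, string name, string value)
{
    target.Add(new ByteArrayContent(Encoding.UTF8.GetBytes(value)), name);
}

using var form = new MultipartFormDataContent();
AddTextField(form, "word_style", "on");

// Read each field back to confirm what will be sent.
foreach (var field in form)
{
    var fieldValue = await field.ReadAsStringAsync();
    Console.WriteLine($"{field.Headers.ContentDisposition?.Name}: {fieldValue}");
}
```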
var response = await httpClient.SendAsync(request);
The request is sent asynchronously, and the response is awaited.
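The sample prints whatever comes back, so it can be useful to check the status code before treating the body as a result. The following sketch assumes a hypothetical helper, ReadBodyOrThrowAsync, and uses a hand-built HttpResponseMessage with an arbitrary failing status in place of a real call, so it runs offline.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: returns the body on success, throws with details otherwise.
static async Task<string> ReadBodyOrThrowAsync(HttpResponseMessage response)
{
    var body = await response.Content.ReadAsStringAsync();
    if (!response.IsSuccessStatusCode)
    {
        throw new HttpRequestException(
            $"Request failed with status {(int)response.StatusCode} {response.ReasonPhrase}: {body}");
    }
    return body;
}

// Canned response standing in for the result of httpClient.SendAsync(request).
using var cannedFailure = new HttpResponseMessage(HttpStatusCode.Unauthorized)
{
    Content = new StringContent("stand-in error body")
};

try
{
    var text = await ReadBodyOrThrowAsync(cannedFailure);
    Console.WriteLine(text);
}
catch (HttpRequestException ex)
{
    Console.WriteLine(ex.Message);
}
```

Reading the body before throwing keeps any error detail the server sent, which HttpResponseMessage.EnsureSuccessStatusCode would not include in its exception message.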
var apiResult = await response.Content.ReadAsStringAsync();
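The response content is read as a string. Because the request sets the Accept header to application/json, this string is the JSON response body, which contains the extracted text from the PDF.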
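Since the response string is requested as JSON, it can be inspected with System.Text.Json before deciding which fields to use. The sketch below makes no claim about the real response schema: the JSON literal and its field names are hypothetical stand-ins for apiResult, and the code only lists the top-level properties it finds.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for apiResult; real field names come from the API response.
var apiResult = "{\"exampleField\":\"example value\",\"exampleCount\":1}";

try
{
    using var document = JsonDocument.Parse(apiResult);
    if (document.RootElement.ValueKind == JsonValueKind.Object)
    {
        // List each top-level property with the kind of value it holds.
        foreach (var property in document.RootElement.EnumerateObject())
        {
            Console.WriteLine($"{property.Name}: {property.Value.ValueKind}");
        }
    }
}
catch (JsonException ex)
{
    Console.WriteLine($"Response was not valid JSON: {ex.Message}");
}
// With the stand-in above, this prints:
// exampleField: String
// exampleCount: Number
```

Running this against an actual response shows which properties are present; the API Reference documentation remains the authority on what each one means.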
Beyond the Tutorial
In this tutorial, you learned how to construct and send a multipart/form-data POST request to the pdfRest Extract Text API using C#. By executing the code, you can extract text from a PDF file and use it for various purposes in your applications.
I encourage you to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation for more details on how to use the various endpoints provided by pdfRest.
Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at pdfRest API Samples on GitHub.