How to Use OCR to Make PDF Image Text Searchable in .NET with C#
Why Use OCR to Make Searchable PDFs with C#?
The pdfRest OCR PDF API Tool is a powerful resource for converting scanned documents and images into searchable PDF files. This tutorial will show you how to send an API call to OCR PDF with C#, allowing you to integrate this functionality into your applications seamlessly.
Imagine you have a large archive of scanned documents that you need to make searchable for easy retrieval. Using OCR (Optical Character Recognition), you can convert these scanned documents into searchable PDFs. This is invaluable for businesses and organizations that need to manage large volumes of documents efficiently.
OCR PDF with C# Code Example
using System.Text;

using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })
{
    using (var request = new HttpRequestMessage(HttpMethod.Post, "pdf-with-ocr-text"))
    {
        request.Headers.TryAddWithoutValidation("Api-Key", "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx");
        request.Headers.Accept.Add(new("application/json"));
        var multipartContent = new MultipartFormDataContent();

        var byteArray = File.ReadAllBytes("/path/to/file");
        var byteAryContent = new ByteArrayContent(byteArray);
        multipartContent.Add(byteAryContent, "file", "file_name");
        byteAryContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");

        var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes("converted"));
        multipartContent.Add(byteArrayOption, "output");

        request.Content = multipartContent;
        var response = await httpClient.SendAsync(request);

        var apiResult = await response.Content.ReadAsStringAsync();

        Console.WriteLine("API response received.");
        Console.WriteLine(apiResult);
    }
}
Source: GitHub
Breaking Down the Code
Let's break down the provided code to understand how it works:
using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })
This line initializes an HttpClient object with the base address set to the pdfRest API.
using (var request = new HttpRequestMessage(HttpMethod.Post, "pdf-with-ocr-text"))
Here, an HttpRequestMessage is created with the HTTP method set to POST and the endpoint specified as "pdf-with-ocr-text".
request.Headers.TryAddWithoutValidation("Api-Key", "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx");
request.Headers.Accept.Add(new("application/json"));
These lines add the API key to the request headers and set the Accept header to "application/json". The API key is essential for authenticating the request.
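Hardcoding the key, as the placeholder above does, is fine for a quick test, but an application would more likely load it from configuration. Here is a minimal sketch, assuming the key lives in an environment variable. The variable name PDFREST_API_KEY and the helper TryAddApiKey are hypothetical and not part of pdfRest.

```csharp
using System;
using System.Net.Http;

// PDFREST_API_KEY is a hypothetical variable name chosen for this sketch.
var request = new HttpRequestMessage(HttpMethod.Post, "pdf-with-ocr-text");
var apiKey = Environment.GetEnvironmentVariable("PDFREST_API_KEY");

if (TryAddApiKey(request, apiKey))
    Console.WriteLine("Api-Key header set from the environment.");
else
    Console.WriteLine("PDFREST_API_KEY is not set; no Api-Key header was added.");

// Hypothetical helper: adds the Api-Key header only when a non-empty key is supplied.
static bool TryAddApiKey(HttpRequestMessage message, string? key)
{
    if (string.IsNullOrWhiteSpace(key))
        return false;

    return message.Headers.TryAddWithoutValidation("Api-Key", key.Trim());
}
```

Failing early when the key is missing keeps a configuration mistake from surfacing later as an authentication error from the API.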
var multipartContent = new MultipartFormDataContent();
A MultipartFormDataContent object is created to hold the file and other form data.
var byteArray = File.ReadAllBytes("/path/to/file");
var byteAryContent = new ByteArrayContent(byteArray);
multipartContent.Add(byteAryContent, "file", "file_name");
byteAryContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");
This snippet reads the file into a byte array and adds it to the multipart content. The content type is set to "application/pdf".
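The same step can be wrapped in a small helper that derives the uploaded file name from the path and sets the content type through the typed header property. This is only a sketch: AddPdfPart is a hypothetical helper, the path is a placeholder, and in-memory bytes stand in for File.ReadAllBytes so that it runs without a real file.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Placeholder bytes stand in for File.ReadAllBytes(path) so the sketch runs anywhere.
var fakePdfBytes = Encoding.ASCII.GetBytes("%PDF-1.7");
var form = new MultipartFormDataContent();
AddPdfPart(form, fakePdfBytes, "/path/to/scan.pdf");

// Print the headers that would accompany each part of the upload.
foreach (var part in form)
{
    Console.WriteLine($"{part.Headers.ContentDisposition} ({part.Headers.ContentType})");
}

// Hypothetical helper: attaches PDF bytes as the "file" form field.
static void AddPdfPart(MultipartFormDataContent target, byte[] pdfBytes, string path)
{
    var filePart = new ByteArrayContent(pdfBytes);
    filePart.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");

    // Use the last path segment as the file name sent with the upload.
    target.Add(filePart, "file", Path.GetFileName(path));
}
```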
var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes("converted"));
multipartContent.Add(byteArrayOption, "output");
Here, the "output" form field is added to the multipart content with the value "converted", which pdfRest uses as the name of the resulting file.
request.Content = multipartContent;
var response = await httpClient.SendAsync(request);
The multipart content is assigned to the request, and the request is sent asynchronously. The response is then awaited.
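The sample prints whatever comes back, so a rejected request looks much like a successful one. Below is a sketch of checking the status code first. ReadResultAsync is a hypothetical helper, and a hand-built 401 response with a made-up body stands in for a real call.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Simulated response; the body text is invented for the demo, not a real pdfRest error.
var simulated = new HttpResponseMessage(HttpStatusCode.Unauthorized)
{
    Content = new StringContent("simulated error body")
};

var (ok, status, body) = await ReadResultAsync(simulated);
Console.WriteLine(ok
    ? body
    : $"Request failed with HTTP {status}: {body}");

// Hypothetical helper: reads the body and reports whether the status code was 2xx.
static async Task<(bool Ok, int Status, string Body)> ReadResultAsync(HttpResponseMessage response)
{
    var text = await response.Content.ReadAsStringAsync();
    return (response.IsSuccessStatusCode, (int)response.StatusCode, text);
}
```

In the tutorial's code, the same helper could be given the value returned by httpClient.SendAsync(request).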
var apiResult = await response.Content.ReadAsStringAsync();
Console.WriteLine("API response received.");
Console.WriteLine(apiResult);
The response content is read as a string and printed to the console.
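Because the request asks for application/json, the response string can be parsed instead of only printed. The sketch below runs System.Text.Json on a made-up sample. The property name outputUrl is an assumption about the response shape, so check the API Reference Guide for the actual fields. TryGetString is a hypothetical helper.

```csharp
using System;
using System.Text.Json;

// Made-up sample; "outputUrl" is an assumed field name, not confirmed by this tutorial.
var sampleResult = "{\"outputUrl\":\"https://example.invalid/converted.pdf\"}";
var url = TryGetString(sampleResult, "outputUrl");
Console.WriteLine(url ?? "(no outputUrl field in response)");

// Hypothetical helper: returns a top-level string property, or null if it is absent.
static string? TryGetString(string json, string propertyName)
{
    using var document = JsonDocument.Parse(json); // throws JsonException on malformed JSON
    var root = document.RootElement;

    if (root.ValueKind != JsonValueKind.Object)
        return null;
    if (!root.TryGetProperty(propertyName, out var value))
        return null;

    return value.ValueKind == JsonValueKind.String ? value.GetString() : null;
}
```

Returning null for a missing field, rather than throwing, lets the caller decide whether an absent property is an error.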
Beyond the Tutorial
In this tutorial, you learned how to make an API call to the pdfRest OCR PDF endpoint using C#. This allows you to convert scanned documents into searchable PDFs programmatically.
To explore more features and tools offered by pdfRest, visit the API Lab. For detailed information on the API, refer to the API Reference Guide.
Note: This example demonstrates a multipart API call. For code samples using JSON payloads, visit the GitHub repository.