How to Extract Images from PDF Files with Java, Tutorial

Why Extract PDF Images with Java?

The pdfRest Extract Images API Tool lets developers extract images from PDF documents programmatically. Automating this step is especially useful in applications that process large volumes of documents. This tutorial shows how to send an Extract Images API call from Java, with a complete example you can adapt to your own projects.

Imagine a scenario where a digital marketing agency receives hundreds of PDF brochures from clients every month. These brochures contain images that need to be extracted and used in various marketing campaigns. By using the Extract Images API, the agency can automate the extraction process, saving time and reducing the potential for human error. This allows the team to focus on more strategic tasks, knowing that the image extraction is handled efficiently and accurately.

Extract PDF Images with Java Code Example

import io.github.cdimascio.dotenv.Dotenv;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import okhttp3.MediaType;
import okhttp3.MultipartBody;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;
import org.json.JSONObject;

public class ExtractedImages {

  // Specify the path to your file here, or as the first argument when running the program.
  private static final String DEFAULT_FILE_PATH = "/path/to/file.pdf";

  // Specify your API key here, or in the environment variable PDFREST_API_KEY.
  // You can also put the environment variable in a .env file.
  private static final String DEFAULT_API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx";

  // Page range to extract images from; "1-last" covers every page of the document.
  private static final String PAGES = "1-last";

  public static void main(String[] args) {
    File inputFile;
    if (args.length > 0) {
      inputFile = new File(args[0]);
    } else {
      inputFile = new File(DEFAULT_FILE_PATH);
    }

    final Dotenv dotenv = Dotenv.configure().ignoreIfMalformed().ignoreIfMissing().load();

    final RequestBody inputFileRequestBody =
        RequestBody.create(inputFile, MediaType.parse("application/pdf"));
    RequestBody requestBody =
        new MultipartBody.Builder()
            .setType(MultipartBody.FORM)
            .addFormDataPart("file", inputFile.getName(), inputFileRequestBody)
            .addFormDataPart("pages", PAGES)
            .addFormDataPart("output", "pdfrest_extracted_images")
            .build();
    Request request =
        new Request.Builder()
            .header("Api-Key", dotenv.get("PDFREST_API_KEY", DEFAULT_API_KEY))
            .url("https://api.pdfrest.com/extracted-images")
            .post(requestBody)
            .build();
    try {
      OkHttpClient client =
          new OkHttpClient().newBuilder().readTimeout(60, TimeUnit.SECONDS).build();
      Response response = client.newCall(request).execute();
      System.out.println("Result code " + response.code());
      if (response.body() != null) {
        System.out.println(prettyJson(response.body().string()));
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  private static String prettyJson(String json) {
    // https://stackoverflow.com/a/9583835/11996393
    return new JSONObject(json).toString(4);
  }
}

Source: GitHub

Breaking Down the Code

The code begins by importing the libraries it needs: OkHttp for the HTTP request, Dotenv for loading environment variables (including from a `.env` file), and org.json for pretty-printing the response. `DEFAULT_FILE_PATH` and `DEFAULT_API_KEY` are placeholders for the PDF file path and the API key. The file path can be overridden by the first command-line argument, and the API key by the `PDFREST_API_KEY` environment variable.

File inputFile;
if (args.length > 0) {
  inputFile = new File(args[0]);
} else {
  inputFile = new File(DEFAULT_FILE_PATH);
}

This snippet determines the file to be processed. If an argument is provided, it uses that as the file path; otherwise, it defaults to `DEFAULT_FILE_PATH`.
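Before uploading, it can help to confirm that the chosen path really points at a readable, non-empty file, since a bad path would otherwise typically surface only as an exception once the request is executed. The following is a minimal sketch of such a check, not part of the pdfRest sample; the class `InputFileCheck` and its method `describeProblem` are hypothetical names introduced here for illustration.

```java
import java.io.File;
import java.util.Optional;

// Hypothetical helper, not part of the pdfRest sample code.
final class InputFileCheck {

  private InputFileCheck() {}

  // Returns an empty Optional when the file looks usable,
  // otherwise a short description of what is wrong with it.
  static Optional<String> describeProblem(File file) {
    if (!file.isFile()) {
      return Optional.of("Not an existing regular file: " + file.getPath());
    }
    if (!file.canRead()) {
      return Optional.of("File is not readable: " + file.getPath());
    }
    if (file.length() == 0) {
      return Optional.of("File is empty: " + file.getPath());
    }
    return Optional.empty();
  }
}
```

In `main`, one could call `InputFileCheck.describeProblem(inputFile)` right after the file is chosen and stop with the returned message if one is present.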

final RequestBody inputFileRequestBody =
    RequestBody.create(inputFile, MediaType.parse("application/pdf"));

This line creates a request body for the PDF file, specifying its media type as "application/pdf".

RequestBody requestBody =
    new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("file", inputFile.getName(), inputFileRequestBody)
        .addFormDataPart("pages", PAGES)
        .addFormDataPart("output", "pdfrest_extracted_images")
        .build();

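The multipart request body is constructed here. It includes the file, the `pages` value ("1-last" selects every page), and the `output` value (`pdfrest_extracted_images`), which, judging by its value, names the results rather than selecting a file format. Consult the pdfRest Cloud API Reference Guide for the full list of parameters the Extract Images endpoint accepts and which of them are required.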
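To make the shape of that multipart body more concrete, here is a rough sketch that assembles an equivalent `multipart/form-data` payload by hand using only the JDK. It is an illustration of the general format, not what OkHttp emits byte for byte: OkHttp generates its own boundary and may add headers such as `Content-Length` to each part. The class `MultipartSketch` and its methods are hypothetical, and the sketch skips escaping of special characters in the file name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the multipart/form-data layout.
// In the tutorial code, OkHttp's MultipartBody.Builder does this work.
final class MultipartSketch {

  private MultipartSketch() {}

  static byte[] build(
      String boundary, String fileName, byte[] fileBytes, String pages, String output) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // File part: carries a filename and the PDF media type, then the raw bytes.
    text(out, "--" + boundary + "\r\n");
    text(out, "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n");
    text(out, "Content-Type: application/pdf\r\n\r\n");
    out.write(fileBytes, 0, fileBytes.length);
    text(out, "\r\n");

    // Plain text parts, matching the two addFormDataPart(name, value) calls.
    field(out, boundary, "pages", pages);
    field(out, boundary, "output", output);

    // Closing delimiter marks the end of the body.
    text(out, "--" + boundary + "--\r\n");
    return out.toByteArray();
  }

  private static void field(ByteArrayOutputStream out, String boundary, String name, String value) {
    text(out, "--" + boundary + "\r\n");
    text(out, "Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n");
    text(out, value + "\r\n");
  }

  private static void text(ByteArrayOutputStream out, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.write(bytes, 0, bytes.length);
  }
}
```

Each part starts with the boundary line, followed by its headers, a blank line, and the content. This is why the file part needs both a name and a media type, while `pages` and `output` are sent as simple text fields.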

Request request =
    new Request.Builder()
        .header("Api-Key", dotenv.get("PDFREST_API_KEY", DEFAULT_API_KEY))
        .url("https://api.pdfrest.com/extracted-images")
        .post(requestBody)
        .build();

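This snippet builds the HTTP request. It sets the `Api-Key` header, using the value of `PDFREST_API_KEY` when that variable is set and `DEFAULT_API_KEY` otherwise, points the request at the `extracted-images` endpoint, and sends the multipart body with a POST.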
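Because the sample falls back to `DEFAULT_API_KEY` when `PDFREST_API_KEY` is not set, a forgotten key means the placeholder string is what gets sent. One way to catch that before making the call is sketched below. It reads from a plain `Map` (for example `System.getenv()`) rather than from Dotenv so that it has no dependencies. `ApiKeyResolver` and `resolve` are hypothetical names, and treating the placeholder as an error is a design assumption, not something the pdfRest sample does.

```java
import java.util.Map;

// Hypothetical helper, not part of the pdfRest sample code.
final class ApiKeyResolver {

  // The placeholder value used for DEFAULT_API_KEY in the tutorial listing.
  static final String PLACEHOLDER = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx";

  private ApiKeyResolver() {}

  // Prefers PDFREST_API_KEY from the given environment, then the fallback.
  // Throws if the result is missing, blank, or still the placeholder.
  static String resolve(Map<String, String> env, String fallback) {
    String fromEnv = env.get("PDFREST_API_KEY");
    String key = (fromEnv == null || fromEnv.trim().isEmpty()) ? fallback : fromEnv.trim();
    if (key == null || key.trim().isEmpty() || PLACEHOLDER.equals(key)) {
      throw new IllegalStateException(
          "No API key configured: set PDFREST_API_KEY or replace the placeholder.");
    }
    return key;
  }
}
```

A call such as `ApiKeyResolver.resolve(System.getenv(), DEFAULT_API_KEY)` would then fail fast with a clear message instead of sending a request that cannot be authenticated.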

OkHttpClient client =
    new OkHttpClient().newBuilder().readTimeout(60, TimeUnit.SECONDS).build();
Response response = client.newCall(request).execute();

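An `OkHttpClient` is built with a 60-second read timeout, and the request is executed synchronously. In the full listing, the response code is then printed to the console, followed by the response body formatted as indented JSON. An `IOException` raised during the call is rethrown as a `RuntimeException`.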

Beyond the Tutorial

In this tutorial, you learned how to extract images from a PDF using the pdfRest API with Java. This process involves setting up a multipart API call and handling the response. To further explore the capabilities of pdfRest, consider trying out other API tools available in the API Lab. For more detailed information, refer to the API Reference Guide.

Note: This example demonstrates a multipart API call. For examples using JSON payloads, visit this GitHub repository.