How to Extract Images from PDF Files with Python: A Tutorial


Why Extract PDF Images with Python?

The pdfRest Extract Images API Tool lets developers extract images from PDF documents programmatically. Automating the extraction is particularly useful when doing it by hand would be too time-consuming or cumbersome, for example across many documents or very long ones. This tutorial walks through making an API call to extract images using Python.

Imagine you are working for a digital marketing agency that frequently receives PDF reports from clients. These reports often contain valuable images that need to be used in presentations or social media posts. Instead of manually extracting each image, you can use the pdfRest Extract Images API to automate this process, saving time and reducing the risk of errors.

Extract PDF Images with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

extracted_images_endpoint_url = 'https://api.pdfrest.com/extracted-images'

# The /extracted-images endpoint can take a single PDF file or id as input.
# This sample demonstrates image extraction from all pages of a document.
mp_encoder_extractedImages = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'output': 'example_extractedImages_out',
        'pages': '1-last',
    }
)

# Let's set the headers that the extracted-images endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractedImages.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to extracted-images endpoint...")
response = requests.post(extracted_images_endpoint_url, data=mp_encoder_extractedImages, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

# If you would like to download the file instead of getting the JSON response, please see the 'get-resource-id-endpoint.py' sample.

Source: GitHub

Breaking Down the Code

The code begins by importing the libraries it needs: requests_toolbelt, whose MultipartEncoder encodes the multipart form data; requests, for making the HTTP request; and json, for formatting the JSON response.

extracted_images_endpoint_url = 'https://api.pdfrest.com/extracted-images'

This line sets the endpoint URL for the Extract Images API.

mp_encoder_extractedImages = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'output': 'example_extractedImages_out',
        'pages': '1-last',
    }
)

This snippet creates a MultipartEncoder object, which is used to encode the fields for the multipart form data. The fields include:

file: The PDF file to be processed, given as a tuple of the file name, the file object opened in binary read mode ('rb'), and the MIME type 'application/pdf'.
output: The desired name for the output file.
pages: Specifies the range of pages to extract images from, here set to '1-last' to extract from all pages.
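
The fields described above can also be assembled by a small helper, which makes it easier to open the PDF in a with block so that the file handle is closed after the request. This is only a minimal sketch: build_fields is a hypothetical helper name, not part of pdfRest or requests_toolbelt, and it assumes the file name sent to the API can simply be taken from the local path.

```python
from pathlib import Path


def build_fields(pdf_path, file_handle, output_name, pages="1-last"):
    """Return the multipart form fields as a plain dict.

    build_fields is a hypothetical helper used for illustration only.
    """
    return {
        # Tuple of (file name, open binary file object, MIME type).
        "file": (Path(pdf_path).name, file_handle, "application/pdf"),
        "output": output_name,
        "pages": pages,
    }


# Intended usage (commented out because it needs requests_toolbelt and a real PDF):
#
# from requests_toolbelt import MultipartEncoder
#
# with open("/path/to/file", "rb") as pdf:
#     encoder = MultipartEncoder(
#         fields=build_fields("/path/to/file", pdf, "example_extractedImages_out")
#     )
#     # Send the POST request inside the with block: the encoder reads the
#     # file lazily while the request body is being streamed.
```

Keeping the request inside the with block matters because the encoder does not read the file until the body is actually sent.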

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractedImages.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

The headers dictionary includes:

Accept: Specifies that the client expects a JSON response.
Content-Type: Taken from the encoder's content_type attribute, which is 'multipart/form-data' followed by the boundary string that MultipartEncoder generated for this request.
Api-Key: A placeholder for your API key, which authenticates your request.
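
Rather than pasting the API key into the script, the headers can be built from an environment variable. The sketch below is one possible approach, not something the pdfRest sample prescribes: build_headers is a hypothetical helper, and the variable name PDFREST_API_KEY is an assumption chosen for this example.

```python
import os


def build_headers(content_type, env_var="PDFREST_API_KEY"):
    """Build the request headers, reading the API key from the environment.

    Both this helper and the PDFREST_API_KEY variable name are assumptions
    made for illustration; use whatever name suits your deployment.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return {
        "Accept": "application/json",
        # Must be the encoder's content_type so the boundary string matches the body.
        "Content-Type": content_type,
        "Api-Key": api_key,
    }
```

With the encoder from the sample, this would be called as build_headers(mp_encoder_extractedImages.content_type).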

response = requests.post(extracted_images_endpoint_url, data=mp_encoder_extractedImages, headers=headers)

This line sends a POST request to the API endpoint with the encoded data and headers. The response is stored in the response variable.

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

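If the request succeeds (response.ok is true for status codes below 400), the JSON response is parsed and pretty-printed with a two-space indent. Otherwise, the raw response body is printed so that you can read the error details.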
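
Because an error body may or may not be JSON, the two branches can be folded into one helper that pretty-prints JSON when it can and falls back to the raw text otherwise. This is a minimal sketch using only the standard library; format_response_body is a hypothetical name, and nothing here assumes any particular fields in the pdfRest response.

```python
import json


def format_response_body(body_text):
    """Pretty-print body_text if it is valid JSON; otherwise return it unchanged.

    format_response_body is a hypothetical helper used for illustration only.
    """
    try:
        parsed = json.loads(body_text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return body_text
    return json.dumps(parsed, indent=2)
```

In the sample it would be used as print(format_response_body(response.text)) for both successful and failed requests.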

Beyond the Tutorial

In this tutorial, you learned how to extract images from a PDF document using the pdfRest Extract Images API with Python. This process can be a great time-saver when dealing with multiple PDFs or large documents. To explore more functionalities, consider trying out all the pdfRest API Tools in the API Lab. For detailed information on each API endpoint, refer to the API Reference Guide.

Note: This is an example of a multipart API call. For code samples using JSON payloads, visit GitHub.