How to Extract Pages from PDF Files with Python
Why Extract PDF Pages with Python?
The pdfRest Split PDF API Tool allows users to programmatically extract pages from PDF documents to separate files. This can be particularly useful in scenarios where you have a large document that needs to be divided into smaller sections, such as when distributing individual chapters of a book to different reviewers, or when extracting specific pages from a report to share with a team.
By using Python, you can automate this process and integrate it into your workflow or application.
Extract PDF Pages with Python Code Example
The following code is a complete example of how to call the Split PDF API using Python. It was sourced from the pdfRest API samples available on GitHub:
from requests_toolbelt import MultipartEncoder
import requests
import json

split_pdf_endpoint_url = 'https://api.pdfrest.com/split-pdf'

# The /split-pdf endpoint can take one PDF file or id as input.
# This sample takes one PDF file that has at least 5 pages and splits it into two documents when given two page ranges.

# Create a list of tuples for data that will be sent to the request
split_request_data = []
split_request_data.append(('file',('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf')))
split_request_data.append(('pages', '1,2,5'))
split_request_data.append(('pages', '3,4'))
split_request_data.append(('output', 'example_splitPdf_out'))

mp_encoder_splitPdf = MultipartEncoder(
    fields=split_request_data
)

# Let's set the headers that the split-pdf endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_splitPdf.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to split-pdf endpoint...")
response = requests.post(split_pdf_endpoint_url, data=mp_encoder_splitPdf, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent = 2))
else:
    print(response.text)

# If you would like to download the file instead of getting the JSON response, please see the 'get-resource-id-endpoint.py' sample.
Reference: GitHub Repository
Breaking Down the Code
The code snippet above demonstrates how to split a PDF document using the pdfRest API in Python. Let's break it down:
from requests_toolbelt import MultipartEncoder
import requests
import json
This imports the necessary modules. MultipartEncoder builds the multipart/form-data payload required for file uploads, requests sends the HTTP request, and json is used to pretty-print the response.
split_pdf_endpoint_url = 'https://api.pdfrest.com/split-pdf'
This sets the API endpoint URL for splitting PDFs.
split_request_data = []
split_request_data.append(('file',('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf')))
split_request_data.append(('pages', '1,2,5'))
split_request_data.append(('pages', '3,4'))
split_request_data.append(('output', 'example_splitPdf_out'))
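Here, we're creating the data to be sent with the request: the PDF file, the page selections, and the output name. Each 'pages' entry describes one output document, so the two entries in this sample ('1,2,5' and '3,4') split the input into two files. In the full example, this list is then passed to MultipartEncoder to create mp_encoder_splitPdf.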
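When the page groups come from elsewhere in a program, the field list can be built in a loop instead of with hard-coded append calls. The following is a minimal sketch, not part of the pdfRest sample: the helper name build_split_fields and its parameters are hypothetical, and it relies only on the field names shown above ('file', 'pages', 'output').

```python
import io


def build_split_fields(file_name, file_obj, page_groups, output_prefix):
    """Build the list of (name, value) tuples to pass as MultipartEncoder fields.

    Hypothetical helper: it mirrors the field names used in the sample
    and adds one 'pages' entry per group of page numbers.
    """
    if not page_groups:
        raise ValueError("at least one page group is required")

    fields = [("file", (file_name, file_obj, "application/pdf"))]
    for group in page_groups:
        # A group such as [1, 2, 5] becomes the string '1,2,5'
        fields.append(("pages", ",".join(str(page) for page in group)))
    fields.append(("output", output_prefix))
    return fields


if __name__ == "__main__":
    # In-memory stand-in for a real PDF opened with open(path, 'rb')
    fake_pdf = io.BytesIO(b"%PDF-1.4 placeholder")
    fields = build_split_fields(
        "file_name.pdf", fake_pdf, [[1, 2, 5], [3, 4]], "example_splitPdf_out"
    )
    for name, value in fields[1:]:
        print(name, value)
```

The returned list has the same shape as split_request_data, so it could be passed to MultipartEncoder(fields=...) in the same way.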
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_splitPdf.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
}
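The headers include the API key, which you need to replace with your own, and an Accept header asking for a JSON response. The Content-Type value is taken from the MultipartEncoder's content_type attribute, so it is set automatically and includes the multipart boundary.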
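Hard-coding the key is fine for a quick test, but in shared code it is common to read it from the environment instead. This is a sketch under assumptions: the function name build_headers and the variable name PDFREST_API_KEY are hypothetical choices, not something the pdfRest sample defines.

```python
import os


def build_headers(content_type, env_var="PDFREST_API_KEY"):
    """Return the request headers, reading the API key from the environment.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your pdfRest API key"
        )
    return {
        "Accept": "application/json",
        "Content-Type": content_type,
        "Api-Key": api_key,
    }
```

With this in place, the headers line in the sample could become headers = build_headers(mp_encoder_splitPdf.content_type).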
response = requests.post(split_pdf_endpoint_url, data=mp_encoder_splitPdf, headers=headers)
This sends the POST request to the API endpoint with the data and headers.
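The full example prints the response either way; in an application you may prefer to raise an error on failure so that callers cannot mistake an error body for a result. The sketch below is one way to do that. The function name parse_api_response is hypothetical, and FakeResponse is only a stand-in so the snippet runs without a network call; a real requests.Response exposes the same ok, status_code, text and json() members.

```python
import json


def parse_api_response(response):
    """Return the decoded JSON body, or raise with the server's message."""
    if not response.ok:
        raise RuntimeError(
            f"Request failed with status {response.status_code}: {response.text}"
        )
    return response.json()


class FakeResponse:
    """Minimal stand-in for requests.Response, used only for this demo."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text
        self.ok = status_code < 400

    def json(self):
        return json.loads(self.text)


if __name__ == "__main__":
    # The body here is made up for the demo, not a real pdfRest response
    demo = FakeResponse(200, '{"demo": true}')
    print(json.dumps(parse_api_response(demo), indent=2))
```

In the sample, this would be called as parse_api_response(response) right after requests.post returns.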
Beyond the Tutorial
In this tutorial, we've learned how to split a PDF into separate documents using the pdfRest API and Python. You can now use this code as a starting point to integrate PDF splitting functionality into your applications.
I encourage you to demo all of the pdfRest API Tools in the API Lab and refer to the API Reference documentation for further exploration.
Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub Repository.