Tutorial: How to Convert PDF to Markdown with Python

Why Convert PDF to Markdown with Python?

The pdfRest PDF to Markdown API Tool lets developers convert PDF documents into Markdown format programmatically. This tutorial walks through sending an API call to the PDF to Markdown endpoint using Python, so you can automate the conversion of PDF files into a more editable and web-friendly format.

A user might need to convert a PDF document to Markdown for easier editing and integration into web content management systems. For instance, a technical writer could use this tool to convert PDF documentation into Markdown to be hosted on a website or a wiki, ensuring that the content remains accessible and easy to update.

PDF to Markdown with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests
import json

markdown_endpoint_url = 'https://api.pdfrest.com/markdown'

# The /markdown endpoint can take a single PDF file or id as input.
# This sample demonstrates converting the document to markdown and returning it as JSON.
mp_encoder_markdown = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'page_break_comments': 'on',
    }
)

# Let's set the headers that the markdown endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_markdown.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

print("Sending POST request to markdown endpoint...")
response = requests.post(markdown_endpoint_url, data=mp_encoder_markdown, headers=headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    print(json.dumps(response_json, indent=2))
else:
    print(response.text)

Source: GitHub Repository

Breaking Down the Code

The code begins by importing the libraries it needs: requests_toolbelt for encoding multipart form data, requests for making the HTTP request, and json for pretty-printing the JSON response. The markdown_endpoint_url variable stores the API endpoint URL.

mp_encoder_markdown = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file', 'rb'), 'application/pdf'),
        'page_break_comments': 'on',
    }
)

This snippet creates a MultipartEncoder object to handle the multipart form data. The fields dictionary includes the PDF file to be converted and the page_break_comments parameter, which is set to 'on' to include comments at page breaks in the Markdown output.
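The fields dictionary described above can also be assembled by a small helper, which makes it easier to reuse for different files. This is a minimal sketch, not part of the pdfRest sample: the name build_markdown_fields is hypothetical, and leaving page_break_comments out when it is disabled assumes the API then falls back to its default behavior.

```python
import io
import os


def build_markdown_fields(pdf_path, file_obj, page_break_comments=True):
    """Build the form fields for a /markdown request (hypothetical helper)."""
    fields = {
        # (file name, file object, MIME type): the same tuple shape as in the tutorial.
        'file': (os.path.basename(pdf_path), file_obj, 'application/pdf'),
    }
    if page_break_comments:
        # 'on' is the value used in the tutorial. When disabled, the field is
        # omitted (assumption: the API then applies its default).
        fields['page_break_comments'] = 'on'
    return fields


# An in-memory stand-in for a real PDF, so the sketch runs without a file on disk.
demo_fields = build_markdown_fields('/path/to/report.pdf', io.BytesIO(b'%PDF-1.7'))
print(demo_fields['file'][0])  # report.pdf
```

With a real file, opening it in a with statement (with open(pdf_path, 'rb') as pdf_file:) and building the MultipartEncoder and sending the request inside that block ensures the file handle is closed once the upload finishes.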

headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_markdown.content_type,
    'Api-Key': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here
}

The headers dictionary specifies the request headers. The 'Accept' header indicates that the client expects a JSON response. The 'Content-Type' value is read from the encoder's content_type attribute, which yields 'multipart/form-data' together with the boundary string that separates the form parts. The 'Api-Key' value is a placeholder; replace it with your actual API key, which is required for authentication.
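Rather than pasting the key into the script, the headers can be built from an environment variable. The sketch below is one possible approach under stated assumptions: the helper name build_headers and the variable name PDFREST_API_KEY are hypothetical choices for illustration, not something pdfRest prescribes.

```python
import os


def build_headers(content_type, env_var='PDFREST_API_KEY'):
    """Return the request headers, reading the API key from the environment.

    Both this helper and the default variable name are hypothetical.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f'Set the {env_var} environment variable to your pdfRest API key.'
        )
    return {
        'Accept': 'application/json',
        # Pass the encoder's content_type here so the boundary is included.
        'Content-Type': content_type,
        'Api-Key': api_key,
    }
```

In the tutorial script this would be called as build_headers(mp_encoder_markdown.content_type), which keeps the key out of source control.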

response = requests.post(markdown_endpoint_url, data=mp_encoder_markdown, headers=headers)

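This line sends a POST request to the markdown endpoint with the specified data and headers. The script then prints the status code. If response.ok is true, the response contains the converted Markdown data in JSON format and the script pretty-prints it; otherwise it prints the raw response text so you can see the error.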
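If the conversion runs inside a larger program, it may be more convenient to raise an error on failure than to print the body. The following is a minimal sketch of that idea: parse_markdown_response is a hypothetical helper, and the JSON string in the demo is a made-up payload, not the actual shape of a pdfRest response.

```python
import json


def parse_markdown_response(status_code, body_text):
    """Return the parsed JSON body, or raise if the request failed.

    Mirrors response.ok from requests, which is false for 4xx and 5xx codes.
    """
    if status_code >= 400:
        raise RuntimeError(f'pdfRest request failed ({status_code}): {body_text}')
    return json.loads(body_text)


# Made-up payload for demonstration only.
print(parse_markdown_response(200, '{"example": "value"}'))
```

In the tutorial script the call would be parse_markdown_response(response.status_code, response.text).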

Beyond the Tutorial

In this tutorial, you learned how to use Python to send an API call to the pdfRest PDF to Markdown endpoint, converting a PDF document into Markdown format. This example demonstrates a multipart API call, which is useful for handling file uploads.

To explore more, you can demo all of the pdfRest API Tools in the API Lab. For further details, refer to the API Reference Guide.

Note: This is an example of a multipart API call. Code samples using JSON payloads can be found at GitHub Repository.