
Generating Summaries of PDF Documents using ChatGPT

Learn how to use pdfRest with ChatGPT to Extract Text from a PDF and Summarize the Complete Document

ChatGPT and other Large Language Models (LLMs), like Google's Bard, can process and analyze large amounts of text far faster than a person can. Businesses can use that speed to summarize the contents of PDF documents and gain several advantages, including:

  • Efficiency and Time-Saving: Manually summarizing lengthy PDF documents can be time-consuming and tedious. Using a text summarization service can automate this process, significantly reducing the time it takes to extract key information from these documents. This can free up employees to focus on more strategic tasks.
  • Accuracy and Consistency: Text summarization services utilize advanced algorithms and natural language processing techniques to identify the most important information and generate concise summaries. These services can often produce more accurate and consistent summaries than manual efforts, reducing the risk of errors or omissions.
  • Scalability and Large Volume Processing: When dealing with large volumes of PDF documents, manually summarizing each document becomes impractical. Text summarization services can handle large batches of documents efficiently, making it easier to process and analyze large datasets.
  • Knowledge Management and Information Retrieval: Summarizing documents can make it easier to organize and store information, making it more readily searchable and accessible. This can improve knowledge management and information retrieval, allowing employees to quickly find relevant information when needed.
  • Decision-Making and Strategic Planning: By quickly extracting key information from documents, businesses can make more informed decisions based on accurate and up-to-date data. This can improve strategic planning and decision-making processes.

To use AI services to summarize PDF documents, you first need a reliable way to extract the text from your documents. The step-by-step demonstration below integrates the pdfRest Extract Text API Tool into an AI workflow, creating a practical solution that generates a summary of an entire document's contents.


API keys

First, you'll need to sign up for an API key for each service used in this example, pdfRest and OpenAI, so that you can send calls to both.
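One common way to keep keys out of source code is to read them from environment variables. The following is a minimal sketch of that approach; the variable names PDFREST_API_KEY and OPENAI_API_KEY and the helper load_api_keys are assumptions made for illustration, not names required by either service.

```python
import os


def load_api_keys(env=os.environ):
    """Read both API keys from the environment, failing early if one is missing."""
    # Hypothetical variable names; use whatever names your deployment defines.
    names = {"pdfrest": "PDFREST_API_KEY", "openai": "OPENAI_API_KEY"}
    keys = {}
    for service, var in names.items():
        value = env.get(var, "").strip()
        if not value:
            raise RuntimeError(f"Missing environment variable: {var}")
        keys[service] = value
    return keys
```

With this in place, the value used as pdfRest_api_key in the snippets that follow could come from load_api_keys()["pdfrest"].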

Extracting the text

Next, configure code to send API calls to the pdfRest /extracted-text endpoint:

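import requests
from requests_toolbelt import MultipartEncoder

extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'

# The /extracted-text endpoint can take a single PDF file or id as input.
# file_path, file_name, and pdfRest_api_key are assumed to be defined earlier in the script.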
mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': (
            file_name,
            open(file_path + file_name, 'rb'),
            'application/pdf'
        )
    }
)


# Let's set the headers that the /extracted-text endpoint expects. Since MultipartEncoder is used, the 'Content-Type' header gets set to 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': pdfRest_api_key
}

# Send the POST request to the /extracted-text endpoint
response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)

Summarizing the text

Once you get a response from the pdfRest service, pass the contents of the document to ChatGPT for summarization. OpenAI offers many different models, and you can choose one based on your needs.

For this example, we used the new gpt-4-1106-preview model, as it offers a much larger 128,000 token context. This larger context allows for documents to be summarized in fewer calls to OpenAI's service, and is sufficient for most medium-to-large sized documents.
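To see how context size translates into the number of calls, here is a rough sketch of the arithmetic. The function name estimate_requests and its parameters are hypothetical, and the token counts are assumed to come from whatever tokenizer you use for the chosen model.

```python
import math


def estimate_requests(document_tokens, max_context_tokens, prompt_tokens):
    """Estimate how many chunked requests a document needs."""
    # Each request must carry the instruction prompt, so only the remainder
    # of the context budget is available for document text.
    usable = max_context_tokens - prompt_tokens
    if usable <= 0:
        raise ValueError("Prompt does not fit in the context budget")
    return max(1, math.ceil(document_tokens / usable))
```

For example, a 300,000-token document with a 128,000-token budget and a 200-token prompt works out to 3 requests, while the same document with a 4,096-token budget and a 96-token prompt works out to 75. In practice it is worth setting the budget somewhat below the model's limit to leave room for the reply.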

if response.ok:
    print("Building prompt...")
    response_json = response.json()

    # To get the full text of the document, we grab the fullText attribute out of the resulting JSON
    fullText = response_json["fullText"]

    # In order to keep track of where we are in the document, we're going to split the resulting string into a list, delimited by spaces.
    fullTextArray = fullText.split()

    # Start the query with query_prompt; text from the /extracted-text output is appended after it.
    query_string = query_prompt

    # Loop until the whole document has been processed, collecting each summary returned by ChatGPT.
    # enc (the tokenizer), MAX_CONTEXT_SIZE, and query_prompt are assumed to be defined earlier in the script.
    shouldLoop = True
    summaryList = []
    i = 0
    while shouldLoop:
        shouldLoop = False
        while len(enc.encode(query_string)) < MAX_CONTEXT_SIZE and i < len(fullTextArray):
            query_string += fullTextArray[i] + " "
            i += 1
            shouldLoop = True

        # For visual feedback, just printing out how much of the document has been processed by each request being sent
        print(f"Got to element #{i} out of {len(fullTextArray)}. \n")

        # Send the query off to ChatGPT using the gpt-4-1106-preview model (also known as GPT 4 turbo)
        chat_completion = completion_with_backoff(model="gpt-4-1106-preview",
                                                  messages=[{"role": "user", "content": query_string}])

        # Reset query_string back to the default value of query_prompt
        query_string = query_prompt

        # Add the newly returned summary to the summaryList
        summaryList.append(chat_completion.choices[0].message.content)

        # Stop once the whole document has been sent, or if the prompt alone already exceeds the context limit
        if len(enc.encode(query_string)) > MAX_CONTEXT_SIZE or i >= len(fullTextArray):
            break
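The loop above re-encodes the whole query string every time a word is added, which can get slow on long documents. As one possible alternative, the sketch below keeps a running token total and builds all the prompts up front. The function chunk_words and its count_tokens parameter are hypothetical; with the tokenizer from the example you might pass something like lambda s: len(enc.encode(s)). Summing per-word counts is only an approximation of the token count of the joined text, so treat the budget as conservative.

```python
def chunk_words(words, prompt, count_tokens, max_tokens):
    """Group words into prompts that each stay within max_tokens."""
    base = count_tokens(prompt)
    chunks, current, used = [], [], base
    for word in words:
        cost = count_tokens(word + " ")
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(prompt + " ".join(current))
            current, used = [], base
        current.append(word)
        used += cost
    # Flush whatever is left over as the final chunk.
    if current:
        chunks.append(prompt + " ".join(current))
    return chunks
```

Each returned string can then be sent as the content of one request, and an empty document yields no requests at all.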

Finally, check to see if the document was processed in one go. If it wasn't, compile the summaries and generate a singular, more cohesive summary:

    # If it took multiple summaries to process the entire document, compile the summaries and summarize them again into a more cohesive singular summary.
    if len(summaryList) > 1:
        summary_string = ""
        for summary in summaryList:
            summary_string += " " + summary

        summary_query = ("Assuming the following text is a compilation of summaries about the contents of a single PDF "
                         "document, create a detailed comprehensive summary of the given text. \n\n") + summary_string
        final_chat_completion = completion_with_backoff(model="gpt-4-1106-preview",
                                                        messages=[{"role": "user", "content": summary_query}])

        print("\n" + final_chat_completion.choices[0].message.content + "\n")
    elif len(summaryList) == 1:
        print("\n" + summaryList[0] + "\n")
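The overall pattern here is to summarize each chunk, then summarize the summaries. A compact sketch of that flow follows; summarize_document is a hypothetical helper, and summarize stands in for any callable that sends text to the model and returns its reply (for instance, a thin wrapper around completion_with_backoff).

```python
def summarize_document(chunks, summarize, combine_prompt):
    """Summarize each chunk, then merge partial summaries if there are several."""
    partials = [summarize(chunk) for chunk in chunks]
    if not partials:
        return ""
    # A single chunk needs no second pass.
    if len(partials) == 1:
        return partials[0]
    # Several chunks: ask for one cohesive summary of the partial summaries.
    return summarize(combine_prompt + " ".join(partials))
```

Passing the model call in as a parameter also makes the flow easy to exercise with a stub before spending any tokens.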

If you intend to process very large documents, or to use a model with a smaller context window (such as gpt-3.5-turbo), it can be useful to include a helper function that lets the chat completion call back off and retry, since OpenAI enforces rate limits on the number of tokens you can send in a given window of time:

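# Helper function that retries with exponential backoff so larger documents can be processed without failing on OpenAI's rate limits
# retry, wait_random_exponential, and stop_after_attempt come from the tenacity package.
from tenacity import retry, stop_after_attempt, wait_random_exponential
import openai

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    # openai.ChatCompletion.create is the interface of the pre-1.0 openai package.
    return openai.ChatCompletion.create(**kwargs)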
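For readers who want to see the idea behind the decorator, here is a rough standard-library sketch of randomized exponential backoff. It is not tenacity's actual implementation; call_with_backoff and its parameters are hypothetical, and it retries on any exception, whereas a production version would likely retry only on rate-limit errors.

```python
import random
import time


def call_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0,
                      sleep=time.sleep):
    """Call func, retrying on any exception with randomized exponential waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # The upper bound doubles each attempt, capped at max_wait.
            ceiling = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, ceiling))
```

The random component spreads retries out so that several clients hitting the same limit do not all retry at the same moment.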

You can find the full code example in our GitHub repository.


Example Output

We used the steps above to extract text from a PDF containing the complete Constitution of the United States of America with pdfRest and to summarize that text with ChatGPT. Here is the final summary it produced:

The document contains the full text of the Declaration of Independence and the United States Constitution, including the Bill of Rights and subsequent amendments. The Declaration of Independence, adopted on July 4, 1776, articulates the Enlightenment principles of individual rights and government by consent, declaring the American colonies' independence from Britain. The Constitution, which began implementation on March 4, 1789, structures the federal government into three branches, enumerates its powers, and provides a system of checks and balances.


The Constitution's first three articles establish the legislative (Congress), executive (President), and judicial (Federal Courts) branches. It also addresses states' powers, procedural rules, and provisions for amendments. The first ten amendments, known as the Bill of Rights (ratified on December 15, 1791), safeguard freedoms such as speech, assembly, and religion; protections against arbitrary arrest and seizure; and rights to fair trials and due process.


Successive amendments cover various rights and government functions, including abolishing slavery (13th Amendment), defining citizenship and equal protection (14th Amendment), women's suffrage (19th Amendment), voting rights for 18-year-olds (26th Amendment), and terms for Congressional pay adjustments (27th Amendment). The document concludes with historical dates and a message affirming the importance of returning to fundamental principles to preserve free government.

Conclusion

Your business may benefit from an advanced PDF summarization workflow that saves time and money, surfaces key information quickly, and supports data-driven decision-making. The pdfRest Extract Text API Tool pairs well with OpenAI's ChatGPT API for summarizing PDF document text. Give the above example a try, and let us know if there's anything we can do to help.



