ChatGPT and other Large Language Models (LLMs), like Google's Bard, excel at processing and analyzing vast amounts of text at speeds beyond human capability. Businesses can use that speed to summarize the contents of PDF documents efficiently, which saves time and money, helps teams learn quickly and accurately, and supports data-driven decision-making.
To use AI services to summarize PDF documents, you'll need to start with a reliable solution for extracting text from your documents. Join us as we share a step-by-step demonstration of integrating the pdfRest Extract Text API Tool into your AI workflows, creating a practical solution that generates a summary of an entire document's contents.
First, sign up for an API key for each service you'll call: pdfRest and OpenAI.
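One common way to keep both keys out of source code is to read them from environment variables. The sketch below is a minimal illustration of that approach; the helper name `load_api_key` and the variable names `PDFREST_API_KEY` and `OPENAI_API_KEY` are assumptions made for this example, not names required by either service.

```python
import os


def load_api_key(env_var: str) -> str:
    """Return the key stored in env_var, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Hypothetical variable names; use whatever names your environment defines.
# pdfRest_api_key = load_api_key("PDFREST_API_KEY")
# openai_api_key = load_api_key("OPENAI_API_KEY")
```

Failing early with a named variable in the error message tends to be easier to debug than an authentication error returned by the remote service.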
Next, configure code to send API calls to the pdfRest /extracted-text endpoint:
import requests
from requests_toolbelt import MultipartEncoder

# file_path, file_name, and pdfRest_api_key must be defined before this snippet runs.
extract_text_endpoint_url = 'https://api.pdfrest.com/extracted-text'

# The /extracted-text endpoint can take a single PDF file or id as input.
mp_encoder_extractText = MultipartEncoder(
    fields={
        'file': (file_name, open(file_path + file_name, 'rb'), 'application/pdf')
    }
)

# Let's set the headers that the /extracted-text endpoint expects.
# Since MultipartEncoder is used, the 'Content-Type' header gets set to
# 'multipart/form-data' via the content_type attribute below.
headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extractText.content_type,
    'Api-Key': pdfRest_api_key
}

# Send the POST request to the /extracted-text endpoint
response = requests.post(extract_text_endpoint_url, data=mp_encoder_extractText, headers=headers)
Once you get a response from the pdfRest service, pass the extracted text to ChatGPT for summarization. OpenAI offers many different models, so you can choose one based on your needs.
For this example, we used the new gpt-4-1106-preview model, as it offers a much larger 128,000-token context window. The larger context allows documents to be summarized in fewer calls to OpenAI's service and is sufficient for most medium-to-large documents.
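As a rough illustration of why a larger context means fewer calls, the sketch below estimates how many requests a document of a given token count would need. The function name and parameters are hypothetical, and the figures in the usage lines (a 300,000-token document, a 50-token prompt, a 16,000-token comparison window, and the room reserved for each reply) are made-up assumptions rather than measurements.

```python
import math


def estimate_calls(doc_tokens: int, context_size: int,
                   prompt_tokens: int, reply_tokens: int) -> int:
    """Estimate how many requests are needed to cover doc_tokens of text."""
    # The prompt and the model's reply share the context window with the
    # document text, so only the remainder is usable per request.
    usable = context_size - prompt_tokens - reply_tokens
    if usable <= 0:
        raise ValueError("The prompt and reply leave no room for document text.")
    return math.ceil(doc_tokens / usable)


# Illustrative numbers only (assumptions, not measurements).
large = estimate_calls(300_000, context_size=128_000, prompt_tokens=50, reply_tokens=4_096)
small = estimate_calls(300_000, context_size=16_000, prompt_tokens=50, reply_tokens=1_000)
print(large, small)
```

Under these assumed numbers, the larger window needs 3 calls and the smaller one needs 21, before counting the final call that merges the partial summaries.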
# enc (the tokenizer), MAX_CONTEXT_SIZE, and query_prompt must be defined before this snippet runs.
if response.ok:
    print("Building prompt...")
    response_json = response.json()

    # To get the full text of the document, we grab the fullText attribute out of the resulting JSON
    fullText = response_json["fullText"]

    # In order to keep track of where we are in the document, we split the resulting string
    # into a list of words, delimited by whitespace.
    fullTextArray = fullText.split()

    # Start each query with query_prompt; the text returned from /extracted-text is appended after it
    query_string = query_prompt

    # This logic sets up a loop that will continue until all the contents of the document have
    # been processed, keeping track of any summaries returned by ChatGPT.
    shouldLoop = True
    summaryList = []
    i = 0
    while shouldLoop:
        shouldLoop = False
        while len(enc.encode(query_string)) < MAX_CONTEXT_SIZE and i < len(fullTextArray):
            query_string += fullTextArray[i] + " "
            i += 1
            shouldLoop = True

        # For visual feedback, print how much of the document has been processed by each request
        print(f"Got to element #{i} out of {len(fullTextArray)}. \n")

        # Send the query off to ChatGPT using the gpt-4-1106-preview model (also known as GPT 4 turbo)
        chat_completion = completion_with_backoff(model="gpt-4-1106-preview", messages=[{"role": "user", "content": query_string}])

        # Reset query_string back to the default value of query_prompt
        query_string = query_prompt

        # Add the newly returned summary to the summaryList
        summaryList.append(chat_completion.choices[0].message.content)

        # If either of these conditions happens, we should break from the loop
        if len(enc.encode(query_string)) > MAX_CONTEXT_SIZE or i >= len(fullTextArray):
            break
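The chunking step in the loop above can also be isolated into a small generator, which makes it easier to test without calling either API. This is only a sketch under stated assumptions: `chunk_words` and `count_whitespace_tokens` are hypothetical names, and the whitespace counter is a stand-in for a real tokenizer such as the `enc.encode` call used above.

```python
from typing import Callable, Iterator, List


def chunk_words(words: List[str], prompt: str, max_tokens: int,
                count_tokens: Callable[[str], int]) -> Iterator[str]:
    """Yield prompt-prefixed chunks, each filled until it reaches max_tokens."""
    if count_tokens(prompt) >= max_tokens:
        # Without this guard no word could ever be added and the loop would never end.
        raise ValueError("The prompt alone already fills the token budget.")
    i = 0
    while i < len(words):
        query = prompt
        # Same stopping rule as the loop above: keep appending while under budget.
        while i < len(words) and count_tokens(query) < max_tokens:
            query += words[i] + " "
            i += 1
        yield query


def count_whitespace_tokens(text: str) -> int:
    # Stand-in for illustration only; a real run would use len(enc.encode(text)).
    return len(text.split())


chunks = list(chunk_words("a b c d e f g".split(), "Summarize: ", 4,
                          count_whitespace_tokens))
print(len(chunks))
```

Each yielded string can then be sent as the content of one chat completion request, with the returned summary appended to summaryList as before.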
Finally, check whether the document was processed in one go. If it wasn't, compile the summaries and generate a single, more cohesive summary:
# If it took multiple summaries to process the entire document, compile the summaries
# and summarize them again into a more cohesive singular summary.
if len(summaryList) > 1:
    summary_string = ""
    for summary in summaryList:
        summary_string += " " + summary

    summary_query = ("Assuming the following text is a compilation of summaries about the contents of a single PDF "
                     "document, create a detailed comprehensive summary of the given text. \n\n") + summary_string
    final_chat_completion = completion_with_backoff(model="gpt-4-1106-preview", messages=[{"role": "user", "content": summary_query}])
    print("\n" + final_chat_completion.choices[0].message.content + "\n")
elif len(summaryList) == 1:
    print("\n" + summaryList[0] + "\n")
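The same merge step can be written as a function that takes the summarizer as a parameter, so the branching logic can be checked with a fake summarizer before any tokens are spent. The names `reduce_summaries` and `COMBINE_PROMPT` are hypothetical; in a real run, the `summarize` argument would wrap the chat completion call shown above.

```python
from typing import Callable, List

COMBINE_PROMPT = (
    "Assuming the following text is a compilation of summaries about the "
    "contents of a single PDF document, create a detailed comprehensive "
    "summary of the given text. \n\n"
)


def reduce_summaries(summaries: List[str],
                     summarize: Callable[[str], str]) -> str:
    """Return one summary, calling summarize only when several must be merged."""
    if not summaries:
        return ""
    if len(summaries) == 1:
        # The document fit in one request, so no extra call is needed.
        return summaries[0]
    return summarize(COMBINE_PROMPT + " ".join(summaries))


# A fake summarizer for a dry run; it only reports how much text it received.
def fake_summarize(text: str) -> str:
    return f"summary of {len(text)} characters"


print(reduce_summaries(["Part one.", "Part two."], fake_summarize))
```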
If you plan to process very large documents, or to use a model with a smaller context window (such as gpt-3.5-turbo), it can be useful to include a helper function that lets the chat completion call back off and retry, since OpenAI enforces rate limits on the number of tokens you can send in a given window of time:
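import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Helper function that allows larger documents to process without hitting the rate limits of ChatGPT.
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0; versions 1.0
# and later expose the equivalent call as client.chat.completions.create on an OpenAI client object.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)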
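To show roughly what a backoff decorator like this does, here is a simplified, standard-library-only sketch of retrying with randomized exponential waits. It is an illustration of the general technique, not a reproduction of the retry library's internals; `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 6,
                      base_delay: float = 1.0, max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with randomized exponential waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # The upper bound doubles on each attempt and is capped at max_delay;
            # a random wait below it keeps clients from retrying in lockstep.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a function that records the waits instead of actually pausing.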
You can find the full code example in our GitHub repository.
We used the steps above to extract text from a PDF containing the complete Constitution of the United States of America with pdfRest and summarize the text with ChatGPT. Here is the final summary output:
The document contains the full text of the Declaration of Independence and the United States Constitution, including the Bill of Rights and subsequent amendments. The Declaration of Independence, adopted on July 4, 1776, articulates the Enlightenment principles of individual rights and government by consent, declaring the American colonies' independence from Britain. The Constitution, which began implementation on March 4, 1789, structures the federal government into three branches, enumerates its powers, and provides a system of checks and balances.
The Constitution's first three articles establish the legislative (Congress), executive (President), and judicial (Federal Courts) branches. It also addresses states' powers, procedural rules, and provisions for amendments. The first ten amendments, known as the Bill of Rights (ratified on December 15, 1791), safeguard freedoms such as speech, assembly, and religion; protections against arbitrary arrest and seizure; and rights to fair trials and due process.
Successive amendments cover various rights and government functions, including abolishing slavery (13th Amendment), defining citizenship and equal protection (14th Amendment), women's suffrage (19th Amendment), voting rights for 18-year-olds (26th Amendment), and terms for Congressional pay adjustments (27th Amendment). The document concludes with historical dates and a message affirming the importance of returning to fundamental principles to preserve free government.
Your business may benefit from an advanced PDF summarization workflow to save time and money, learn quickly and accurately, and support data-driven decision-making. The pdfRest Extract Text API Tool pairs perfectly with OpenAI's ChatGPT API to summarize PDF document text. Give the above example a try, and let us know if there's anything we can do to help.