Translate PDF Text to New Language with ChatGPT
Translating PDF documents into different languages allows businesses to communicate with their customers and partners in their preferred language. For example, a business can translate its customer support documentation into different languages so that its customers can easily find the information they need.
PDF text translation is especially important in countries where businesses are required to provide certain documents, such as contracts and invoices, in the local language. Translating PDF documents into different languages allows businesses to comply with these legal requirements.
Businesses can also use translated PDF documents to reach new markets and increase sales. For example, a business can translate its marketing materials into different languages to target new customers in other countries.
Let's step through an example of translating extracted text from a PDF file using pdfRest and OpenAI's ChatGPT.
Environment
For convenience, we will set up an environment running Jupyter. One way to do that is to create a Python environment and activate it:
python -m venv .venv
. ./.venv/bin/activate
Then install Jupyter.
python -m pip install jupyter
You'll also need to install the other Python packages required by these sample notebooks. They are listed in a file called requirements.txt, available in our GitHub repository.
python -m pip install -r requirements.txt
Run Jupyter, opening this notebook.
jupyter notebook extract-and-translate.ipynb
API keys
You'll need to sign up for API keys from both OpenAI and pdfRest in order to use this example.
Create a file called .env in the same directory as this notebook, and place the keys into it, like this:
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx PDFREST_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
First, we will start by importing some Python modules that we need, and acquiring API keys.
import os
from pathlib import Path

import openai
import requests
from dotenv import load_dotenv
from IPython.display import display_markdown
from requests_toolbelt import MultipartEncoder

load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")
pdfrest_api_key = os.getenv("PDFREST_API_KEY")

REQUEST_TIMEOUT = 30
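If either key is missing from .env, os.getenv() returns None and the problem only surfaces later as an authentication error. A minimal sketch of an early check, assuming the two variable names shown above; the helper missing_keys is hypothetical and not part of either API:

```python
import os


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical helper; in the notebook, call it after load_dotenv().
missing = missing_keys(["OPENAI_API_KEY", "PDFREST_API_KEY"])
if missing:
    print("Missing API keys: " + ", ".join(missing))
```

Passing a plain dictionary as environ makes the check easy to try without touching the real environment.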
Extracting the text
Below, we'll define a function that extracts the text from a PDF document represented by a path on disk. It will get the full text by page, returning the JSON data from the endpoint.
def extract_text(document: Path) -> dict:
    """Extract text on a page-by-page basis from a document, and return the extracted text"""
    extract_endpoint_url = "https://api.pdfrest.com/extracted-text"

    # Open the file in a with block so the handle is closed once the upload finishes
    with document.open(mode="rb") as pdf_file:
        # Define the file to upload, and request full text on a per-page basis
        request_data = [
            ("file", (document.name, pdf_file, "application/pdf")),
            ("full_text", "by_page"),
        ]
        mp_encoder_upload = MultipartEncoder(fields=request_data)

        # Let's set the headers that the upload endpoint expects.
        # Since MultipartEncoder is used, the 'Content-Type' header gets set to
        # 'multipart/form-data' via the content_type attribute below.
        headers = {
            "Accept": "application/json",
            "Content-Type": mp_encoder_upload.content_type,
            "Api-Key": pdfrest_api_key,
        }

        print("Sending POST request to extract text endpoint...")
        response = requests.post(
            extract_endpoint_url,
            data=mp_encoder_upload,
            headers=headers,
            timeout=REQUEST_TIMEOUT,
        )

    # Print the response status code and raise an exception if the request fails
    print("Response status code: " + str(response.status_code))
    response.raise_for_status()
    return response.json()
TranslationChatbot
Let's define a chatbot whose main purpose is translation. This is a Python class, which makes a persistent object that can be used for a continuing conversation.
We start with a system instruction. The system instruction indicates to OpenAI what the purpose of the conversation is, what role it should take, and any additional instructions.
When translating, we also prepend the material to be translated with an instruction to translate to English.
Each interaction is recorded in self.messages, which contains content and a role:
- system means that the content is a system instruction. System instructions are usually present at the start of a conversation, but are typically not presented to the user, for instance, in ChatGPT.
- user means that the content is part of the conversation that was uttered by the user.
- assistant means that the content is a reply from the AI.
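To make the roles concrete, here is a sketch of what the message list might hold after a single translation request. The message texts are invented placeholders, not real model output:

```python
# Illustrative only: the contents are placeholders, not real API output.
messages = [
    {"role": "system", "content": "You are a helpful translator."},
    {"role": "user", "content": "Please translate the following to English: Καλημέρα"},
    {"role": "assistant", "content": "Good morning"},
]

# After the initial system instruction, user and assistant entries alternate.
roles = [message["role"] for message in messages]
print(roles)
```

Every follow-up question appends another user entry, and every reply appends another assistant entry, which is how the conversation history accumulates.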
This class makes it easy to have a conversation with GPT-4. We call translate_text() to supply text to be translated, and chat() if we want to continue the conversation.
class TranslationChatbot:
    """A chatbot that specializes in translation, but can have a continuing conversation."""

    SYSTEM_INSTRUCTION = """
    You are a helpful translator. Given an input text, translate it to the requested language.
    If there are any ambiguities, or things that couldn't be translated, please mention them
    after the translation. The output can use Markdown for formatting.
    """

    TRANSLATION_INSTRUCTION = """
    Please translate the following to English:

    """

    def __init__(self):
        self.messages = [
            {"content": self.SYSTEM_INSTRUCTION, "role": "system"},
        ]

    def get_openai_response(self, new_message):
        """Request chat completion from OpenAI, and update the messages with the reply.
        Returns the response from OpenAI."""
        self.messages.append(new_message)
        # openai.ChatCompletion is the interface of openai package versions before 1.0
        response = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=self.messages,
        )
        message = response["choices"][0]["message"]
        self.messages.append(message)
        return response

    def translate_text(self, text: str) -> str:
        """Translate text, and return OpenAI's reply."""
        response = self.get_openai_response(
            {"content": f"{self.TRANSLATION_INSTRUCTION}{text}", "role": "user"}
        )
        message = response["choices"][0]["message"]
        return message["content"]

    def converse(self, text: str) -> str:
        """Add a message to the conversation, and return OpenAI's reply."""
        response = self.get_openai_response({"content": text, "role": "user"})
        message = response["choices"][0]["message"]
        return message["content"]

    def chat(self, text: str) -> None:
        """A simple method for chatting. OpenAI returns results formatted with Markdown,
        and may contain text styling and lists. The reply is displayed, not returned."""
        display_markdown(self.converse(text), raw=True)
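The class above always asks for English. One way to target other languages, sketched here as an assumption rather than as part of the original notebook, is to build the instruction from a language name. build_translation_prompt is a hypothetical helper:

```python
def build_translation_prompt(text: str, target_language: str = "English") -> dict:
    """Build a user message asking for a translation into target_language."""
    instruction = f"Please translate the following to {target_language}:\n\n"
    return {"content": f"{instruction}{text}", "role": "user"}


# The resulting dict has the same shape as the messages the chatbot sends.
prompt = build_translation_prompt("Hello, world", target_language="French")
print(prompt["content"])
```

A message built this way could be passed to get_openai_response() in place of the one that translate_text() constructs.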
Extract the text
Here, we simply call extract_text() with the path to the input document. In this case, the PDF file contains Article 1 of the Universal Declaration of Human Rights in Greek.
After that, we get the text of the first page. As you can see from the code, the fullText dictionary contains an array, pages, which holds each page. The code gets the first page, indexed by 0, and retrieves the text from it.
extracted_text = extract_text(Path("pdf/UDHR_Article_1_Greek.pdf"))
page_1_text = extracted_text["fullText"]["pages"][0]["text"]
Sending POST request to extract text endpoint...
Response status code: 200
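For documents with more than one page, the same structure can be walked in a loop. A minimal sketch, assuming the response keeps the fullText / pages / text layout used above; the sample dictionary is made up for illustration and page_texts is a hypothetical helper:

```python
def page_texts(extracted: dict) -> list:
    """Return the text of every page, in order, from the extracted-text JSON."""
    return [page["text"] for page in extracted["fullText"]["pages"]]


# Made-up sample with the same shape as the response used above.
sample = {"fullText": {"pages": [{"text": "first page"}, {"text": "second page"}]}}
print(page_texts(sample))
```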
Using the TranslationChatbot
Create a TranslationChatbot and ask it to translate the text of the page.
The chatbot retains the history of the conversation, so that we can make further inquiries about the text that was translated.
Since this code is running in the context of a Jupyter notebook, we use display_markdown() to print output with style attached. GPT-4 also provides Markdown-formatted content, so if the response has any lists or tables in it, they will render nicely.
chatbot = TranslationChatbot()
display_markdown(f"**Text before translation:** {page_1_text}", raw=True)
translated_text = chatbot.translate_text(page_1_text)
display_markdown(f"**Text after translation:** {translated_text}", raw=True)
Text before translation: ΑΡΘΡΟ 1 ' Ολοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα. Είναι προικισμένοι με λογική και συνείδηση, και οφείλουν να συμπεριφέρονται μεταξύ τους με πνεύμα αδελφοσύνης.
Text after translation: ARTICLE 1: All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
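The example translates a single page. For longer documents, one possible approach is to translate page by page and skip blank pages. The sketch below is an assumption about how that could look, not part of the original notebook; translate_pages is a hypothetical helper that accepts any callable taking and returning a string:

```python
from typing import Callable, Iterable, List


def translate_pages(pages: Iterable[str], translate: Callable[[str], str]) -> List[str]:
    """Translate each non-empty page with the supplied callable, preserving order."""
    return [translate(page) if page.strip() else "" for page in pages]


# Stand-in translator so the sketch runs without an API key;
# in the notebook you might pass chatbot.translate_text instead.
print(translate_pages(["hello", "   ", "world"], str.upper))
```

Because the chatbot keeps the whole conversation history, each translated page adds to the messages sent with the next request, so a fresh TranslationChatbot per page may be preferable for large documents.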
Conclusion
You may wish to consider translating your PDF documents into different languages to reach new markets, improve customer service, and comply with legal requirements. The pdfRest Extract Text API Tool pairs well with OpenAI's ChatGPT API to translate PDF document text into new languages. Give the above example a try, and let us know if there's anything we can do to help.