
Compare revisions

Target project: tech/soft/tools/edu-ai-tools/ChatGPT-Long-Text-Translator
Commits on Source (3)
@@ -11,6 +11,14 @@
# Translator script
## Translating text in latex format
ChatGPT can keep the correct syntax of latex while translating text. Please use GPT-4 as the model for this.
Important note: If the translated text compiles without errors but the PDF preview is missing part of the text, check that an additional `\end{document}` has not appeared in the translation.
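One way to catch this is to count occurrences of `\end{document}` in the output before compiling. The helper below is a hypothetical sketch, not part of the script:

```python
# Hypothetical helper: warn if translated LaTeX contains more than one
# \end{document}, since everything after the first one is silently dropped
# from the compiled PDF.
def check_end_document(tex: str) -> int:
    count = tex.count(r"\end{document}")
    if count > 1:
        print(f"Warning: found {count} occurrences of \\end{{document}}; "
              "everything after the first one is ignored by the compiler.")
    return count
```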
## Translating and localizing study materials
ChatGPT can also localize the translation. For example, it can automatically:
- Localize names, e.g. Ossi -> Oliver.
@@ -43,12 +51,14 @@ ChatGPT can also localize the translation:
## Initial set-up
1. Change the API key.
-2. Check the initial prompt. Currently it is set to do Finnish to English translation. IMPORTANT: wite the initial prompt with the target language! (so if the translation is Finnish to Swedish, replace the initial prompt with english language.)
+2. In code line 206, select one of the prompts as the initial prompt. Currently it is set to do Finnish-to-English translation. IMPORTANT: write the initial prompt in the target language! (So if the translation is Finnish to Swedish, replace the initial prompt with a prompt written in Swedish.)
3. In code line 228, set the chunk size to the correct one.
4. Install required libraries.
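Under these assumptions, the set-up steps boil down to editing a few constants. The sketch below uses the constant names visible in the script's diff; the line numbers 206/228 mentioned in the steps are approximate and may drift between revisions, and the prompt strings are abbreviated here:

```python
# Sketch of the SET-UP section of translator_script.py (constant names taken
# from the diff; prompt texts abbreviated).
LATEX_GENERAL_PROMPT = (
    "You are a translator. Translate material in the latex file to English. "
    "Don't translate the comments. Do not alter the latex syntax."
)
TRANSLATE_AND_LOCALIZE_STUDY_MATERIAL_PROMPT = (
    "You are a translator. Localize and translate the study materials to English."
)

# Step 2: select one of the prompts above as the initial prompt.
INITIAL_PROMPT = LATEX_GENERAL_PROMPT

# Step 3: pick the chunk size matching the input format and model.
CHUNK_SIZE_LATEX_GPT_4 = 240
CHUNK_SIZE_PLAIN_TEXT_OR_MD_GPT_4 = 290
CHUNK_SIZE = CHUNK_SIZE_LATEX_GPT_4
```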
## Workflow
-1. Paste the text to the `input.md`. You don't need to worry about exceeding maximum context, the script handles that for you. The text can be in plain text or in mardown format. [Here](https://github.com/rytilahti-juuso/ChatMD) you can find a good Python and Javascript templates on how to extract also the "syntactic sugar" from websites. The script requires some alteration if you want to extract e.g. code syntax from the website.
-2. Run code in `translator_script.py`. (First run might take long time.) The console prints some of the progress when the process is running. The GPT-4 translation can take up to 2min/chunk, GPT-3.5 is quicker.
+1. Complete the initial set-up. Update the selected prompt and chunk size if necessary.
+2. Paste the text into `input.md`. You don't need to worry about exceeding the maximum context; the script handles that for you. The text can be in plain text, latex, or markdown format. [Here](https://github.com/rytilahti-juuso/ChatMD) you can find good Python and JavaScript templates for also extracting the "syntactic sugar" from websites. The script requires some alteration if you want to extract e.g. code syntax from the website.
+3. Run the code in `translator_script.py`. (The first run might take a long time.) The console prints some of the progress while the process is running. GPT-4 translation can take up to 2 min/chunk; GPT-3.5 is quicker. For latex, always use GPT-4.
4. Finally, to see the translated text, open `output.md` with a viewer that supports the `markdown` file format.
## Translator FAQ
......
@@ -72,7 +72,7 @@ def create_messages(prompt, serverAnswer):
def count_words(input_string):
    return len(input_string.split())
-def split_into_chunks(input_string, chunk_size=290):
+def split_into_chunks(input_string, chunk_size=240):
"""
Args:
input_string: Whole input string, should be in md-format
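The function body is collapsed in this diff. A minimal sketch of word-count chunking consistent with `count_words` above might look like the following; the real script may additionally split on paragraph boundaries to keep markdown/latex structure intact:

```python
def split_into_chunks(input_string, chunk_size=240):
    """Split the input into chunks of at most `chunk_size` words.

    Illustrative sketch only, not the script's actual implementation.
    """
    words = input_string.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```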
@@ -199,8 +199,12 @@ def are_texts_similar(text1, text2, threshold=0.987):
print("similarity is: " + similarity.astype(str))
return similarity > threshold
+LATEX_GENERAL_PROMPT = "You are a translator. Translate material in the latex file to English. Don't translate the comments. Do not alter the latex syntax, even if you deem it, for example, to miss some elements."
+TRANSLATE_AND_LOCALIZE_STUDY_MATERIAL_PROMPT = "You are a translator. Localize and translate the study materials to English. Keep the meaning of the exercise in translation, but it does not need to be literal translation. If there are Finnish names change them to names used in England. Keep every actor the same."
+
# ------------ SET-UP ------------
-INITIAL_PROMPT = "You are a translator. Localize and translate the study materials to English. Keep the meaning of the exercise in translation, but it does not need to be literal translation. If there are Finnish names change them to names used in England. Keep every actor the same."
+# Set the initial prompt
+INITIAL_PROMPT = ""
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
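`are_texts_similar` gates on the cosine similarity of BERT embeddings of the two texts. A minimal sketch of that similarity gate, with plain Python vectors standing in for the BERT embeddings (illustrative only, not the script's code):

```python
import math

def cosine_similarity(v1, v2):
    # Cosine similarity of two equal-length vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

def are_vectors_similar(v1, v2, threshold=0.987):
    # Same gate as are_texts_similar, but on pre-computed embeddings.
    return cosine_similarity(v1, v2) > threshold
```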
@@ -210,9 +214,19 @@ file_path = "input.md"
file_content = read_from_file(file_path)
# ---------------------------------
+if not INITIAL_PROMPT:
+    print("There seems to be some additional steps that you need to take.")
+    print("1.) In code line 206, select one of the prompts as the initial prompt")
+    print("2.) In the code line 228, set the chunk size to correct one.")
+    print("3.) Run the program again.")
+    print("Program terminating...")
+    exit(1)
+
if file_content:
-    chunks = split_into_chunks(file_content)
+    USE_DEBUG_TEXT_IN_THE_OUTPUT = False
+    CHUNK_SIZE_LATEX_GPT_4 = 240
+    CHUNK_SIZE_PLAIN_TEXT_OR_MD_GPT_4 = 290
+    chunks = split_into_chunks(file_content, chunk_size=CHUNK_SIZE_LATEX_GPT_4)
    final_text = ""
    previous_messages = None
    print("input.md has been broken down to "+str(len(chunks)) + " chunks.")
@@ -232,9 +246,11 @@ if file_content:
        # Latest element, value of content property
        trans = messages[len(messages)-1]["content"]
-        # Divination between chuns to add readability (Normally if the translation fails, the translation of the whole chunk fails)
+        # Divider between chunks to add readability (normally, with GPT-3.5, if the translation fails, the translation of the whole chunk fails)
        chunk_divination = "\n\n---\n# Chunk "+ str(i)+"\n---\n\n"
        if not USE_DEBUG_TEXT_IN_THE_OUTPUT:
            final_text = final_text + trans  # exclude the debug text
        else:
            final_text = final_text + chunk_divination + trans
        print(" ")
        print(" ")
......