block
columns 1
block:before_outer:1
columns 1
block:before:1
a("Script file") space
b("Loose novel") space
c("Formatted novel") space
d("Formatted XML")
end
before_label["Before translation"]
end
block:after_outer:1
columns 1
after_label["After translation"]
block:after:1
h("Script file") space
g("Loose novel") space
f("XML") space
e("Formatted XML")
end
end
a --"parse"--> b
b --"replace"--> c
c --"pack"--> d
d --"translate"--> e
e --"substitute back"--> f
f --"split lines"--> g
g --"wrap"--> h
style before fill:none,stroke:none,font-weight:bold
style after fill:none,stroke:none,font-weight:bold
style before_label fill:none,stroke:none,font-weight:bold
style after_label fill:none,stroke:none,font-weight:bold
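Note that the processing before and after translation is not fully symmetric. After "multi-line packing", splitting the text back into lines involves word segmentation, and segment length is hard to control when plain text is mixed with XML. For that reason, proper nouns and commands are substituted back first, and only then is the text segmented.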
res = ""
for script in scripts:
    name = script['name']
    text = script['text']
    line = script['line']
    res += f"<pack line='{line}'>{name}{colon}{text}</pack>"

return res
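To try the packing step on its own, the fragment above can be wrapped in a function. This is only a minimal sketch: the name `pack_scripts`, the list-of-dicts shape of `scripts`, the reading of `line` as the number of lines the text originally occupied, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List


def pack_scripts(scripts: List[Dict[str, object]], colon: str = ": ") -> str:
    """Join dialogue entries into one string of <pack> elements.

    Hypothetical helper: each entry is assumed to carry 'name', 'text'
    and 'line' (taken here to be the original line count of the text).
    """
    res = ""
    for script in scripts:
        name = script["name"]
        text = script["text"]
        line = script["line"]
        res += f"<pack line='{line}'>{name}{colon}{text}</pack>"
    return res


# Illustrative data only, not taken from a real script file.
sample = [
    {"name": "Man", "text": "It's raining.", "line": 1},
    {"name": "Woman", "text": "Let's go back.", "line": 2},
]
packed = pack_scripts(sample)
print(packed)
```

Storing the line count in the tag is what lets the later line-splitting step know how many lines each translated entry has to be broken back into.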
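AI Translation
AI translation is convenient, but it comes with the following limitations:
1. The AI can always make mistakes
2. Given the same input, the AI gives the same output
3. The GPU in a personal computer can only hold a limited number of tokens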
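Taken together: when the AI model makes a mistake (by 1), feeding the conversation history back in would exceed the available tokens (by 3), and simply resending the same request is pointless because the same input yields the same result (by 2).
So my prompts use the following approach: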
You should translate the HTML to English, but keep HTML tag information. (<retry_number>)
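The retry count is inserted into the system-role prompt, where it acts as a random perturbation that changes the model's output. Then, inside a try...except block, the translation is decoded step by step back into a script file. If any step fails, it is treated as a model error and the translation is generated again: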
def translate(origin: str,
              propers: List[Any],
              colon: str = ': ',
              line_splitter: Callable = en_line_splitter,
              settings: str = "You should translate the HTML to English, but keep HTML tag information.",
              retry: int = 3) -> str:
    id = 0
    for proper in propers:
        proper.update({'id': id})
        jieba.add_word(proper['origin'], freq=0x7fffffff)
        jieba.add_word(proper['translation'], freq=0x7fffffff)
        id += 1

    formated_colon = format(colon, propers)

    # Preprocess original text
    origin = format(origin, propers)
    origin = pack(origin, formated_colon)

    # Try to translate original text
    while retry > 0:
        try:
            translation = chat(origin, settings + f"({retry})")
            translation = deformat(translation, propers)
            translation = depack(translation, colon, line_splitter)
            if re.search(r'<[^>]+>', translation):
                raise RuntimeError('Found XML tag')
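The retry idea can also be looked at in isolation. The sketch below is not the author's full function: it leaves out the `deformat` and `depack` steps and keeps only the leftover-tag check. `translate_with_retry` and `fake_chat` are hypothetical names, and `fake_chat` is a deterministic stub standing in for the real model call.

```python
import re
from typing import Callable, Optional


def translate_with_retry(
    origin: str,
    chat: Callable[[str, str], str],
    settings: str = "You should translate the HTML to English, but keep HTML tag information.",
    retry: int = 3,
) -> Optional[str]:
    """Call `chat` until its output passes validation or retries run out."""
    while retry > 0:
        try:
            # The counter is appended only to perturb a deterministic model.
            translation = chat(origin, settings + f"({retry})")
            if re.search(r"<[^>]+>", translation):
                raise RuntimeError("Found XML tag")
            return translation
        except RuntimeError:
            retry -= 1  # treat as a model error and generate again
    return None  # every attempt failed validation


def fake_chat(text: str, system: str) -> str:
    """Stub model (assumption): leaves a stray tag in on the first attempt."""
    if system.endswith("(3)"):
        return "It's raining, <proper>"
    return "It's raining."


result = translate_with_retry("下雨了。", fake_chat)
print(result)
```

Because the system prompt differs on every attempt, a deterministic model no longer sees the same input twice, and no conversation history has to be carried along.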
<pack line="3">
    <speaker>Man</speaker>
    <text>It's raining, <command id="4" wait="500" />we need to head back to <proper id="1" /> to find an umbrella.</text>
</pack>
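Substituting proper nouns back into a translated string like the one above might look roughly as follows. This is a sketch under assumptions: `restore_propers` is a hypothetical name rather than the author's `deformat`, the placeholder is assumed to have the form `<proper id="N" />`, and the sample term is invented for illustration.

```python
import re
from typing import Dict, List


def restore_propers(text: str, propers: List[Dict[str, object]]) -> str:
    """Replace <proper id="N" /> placeholders with the translated term."""
    by_id = {str(p["id"]): str(p["translation"]) for p in propers}

    def substitute(match: re.Match) -> str:
        # Unknown ids are left untouched so a later tag check can catch them.
        return by_id.get(match.group(1), match.group(0))

    return re.sub(r'<proper\s+id="(\d+)"\s*/>', substitute, text)


# Illustrative entry; the keys mirror those used for `propers` in this post.
propers = [{"id": 1, "origin": "旅館", "translation": "the inn"}]
restored = restore_propers(
    'we need to head back to <proper id="1" /> to find an umbrella.', propers
)
print(restored)
```

Leaving unmatched placeholders in place means the check for leftover XML tags still fires and triggers a retry.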
pattern = r'([^.,!?:;\n]+[.,!?:;\n]*)'
segments = re.findall(pattern, text)
chunks = [s.strip() for s in segments if s.strip()]
while len(chunks) != line_count:
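The body of that loop is not shown here, so the following is only one possible way to rebalance the chunks, not the author's implementation. The function name `split_into_lines` and both strategies are assumptions: when there are too many chunks it merges the shortest adjacent pair, and when there are too few it splits the longest chunk at the space nearest its middle.

```python
import re
from typing import List


def split_into_lines(text: str, line_count: int) -> List[str]:
    """Split `text` at punctuation, then rebalance to exactly `line_count` lines."""
    if line_count < 1:
        raise ValueError("line_count must be at least 1")
    pattern = r'([^.,!?:;\n]+[.,!?:;\n]*)'
    chunks = [s.strip() for s in re.findall(pattern, text) if s.strip()]
    if not chunks:
        return [""] * line_count
    while len(chunks) > line_count:
        # Too many chunks: merge the adjacent pair with the shortest combined length.
        i = min(range(len(chunks) - 1),
                key=lambda k: len(chunks[k]) + len(chunks[k + 1]))
        chunks[i:i + 2] = [chunks[i] + " " + chunks[i + 1]]
    while len(chunks) < line_count:
        # Too few chunks: split the longest one at the space nearest its middle.
        i = max(range(len(chunks)), key=lambda k: len(chunks[k]))
        longest = chunks[i]
        spaces = [m.start() for m in re.finditer(" ", longest)]
        if not spaces:
            chunks.append("")  # nothing left to split; pad with an empty line
            continue
        cut = min(spaces, key=lambda p: abs(p - len(longest) // 2))
        chunks[i:i + 1] = [longest[:cut], longest[cut + 1:]]
    return chunks


sentence = "It's raining, we need to head back to the inn to find an umbrella."
lines = split_into_lines(sentence, 3)
print(lines)
```

Both loops change the chunk count by exactly one per iteration, so the function always terminates with `line_count` entries.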