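ChatGPT Home Cooking: Translate the Content, Not the Formatting. A Detailed Guide to Translating an E-book with the OpenAI API

How to translate an entire e-book elegantly

Preface

The context window of the OpenAI API is limited in length, and the limit depends on the model. An e-book easily runs to tens or hundreds of kilobytes, so the API cannot process it in a single request.

Plain text with no formatting needs no special handling. This article instead works with a formatted document: the LaTeX source of a Slovenian book, Euclidean Plane Geometry, which it translates into English.

Workflow

Because of the context-length limit, the source file has to be split into chunks of roughly one page each.

1. Read the data

Use the following Python code: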

import openai
from transformers import GPT2Tokenizer

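# Count tokens with the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Slovenian text contains non-ASCII letters; this assumes the file is UTF-8 encoded
with open("data/geometry_slovenian.tex", "r", encoding="utf-8") as f:
    text = f.read()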

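2. Split on blank lines and count tokens

The previous snippet opens the file read-only and loads its entire contents into the text variable. Next, split the raw text on the \n\n separator and count the tokens in each chunk.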

chunks = text.split('\n\n')
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)

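Two consecutive newlines were chosen as the separator because, for this particular text, they turn out to work well: a blank line marks a paragraph break in LaTeX source. The translation uses the text-davinci-002 model, whose context window is 4K tokens, shared between the prompt and the completion.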
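To make the effect of the separator concrete, here is a small sketch on a made-up LaTeX fragment. The sample string and the `rough_token_count` helper are illustrative assumptions, not part of the original code: the helper simply counts whitespace-separated words as a crude stand-in for the GPT-2 tokenizer, so the snippet runs without downloading a model.

```python
# Hypothetical LaTeX fragment: blank lines separate its three paragraphs.
sample = (
    "\\poglavje{Osnove Geometrije} \\label{osn9Geom}\n"
    "\n"
    "Naj bodo $P$, $Q$ in $R$ notranje točke stranic trikotnika.\n"
    "\n"
    "\\begin{enumerate}\n"
    "\\item Prva naloga.\n"
    "\\end{enumerate}"
)


def rough_token_count(chunk):
    """Crude stand-in for len(tokenizer.encode(chunk)): counts words only."""
    return len(chunk.split())


# Same split as in the article: only blank lines cut the text.
sample_chunks = sample.split("\n\n")
sample_ntokens = [rough_token_count(c) for c in sample_chunks]

for chunk, n in zip(sample_chunks, sample_ntokens):
    print(n, repr(chunk[:40]))
```

Because single newlines do not split the text, the enumerate environment stays in one chunk, with its opening and closing commands kept together.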

The grouping logic below caps each chunk at 1,000 tokens.

def group_chunks(chunks, ntokens, max_len=1000, hard_max_len=3000):
    """
    Group very short chunks, to form approximately page long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0
    
    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # discard chunks that exceed hard max length
        if ntoken > hard_max_len:
            print(f"Warning: Chunk discarded for being too long ({ntoken} tokens > {hard_max_len} token limit). Preview: '{chunk[:50]}...'")
            continue

        # the first chunk opens a batch, so no empty batch or leading separator is produced
        if not cur_batch:
            cur_batch = chunk
            cur_tokens = ntoken
        # if room in current batch, add new chunk
        elif cur_tokens + 1 + ntoken <= max_len:
            cur_batch += "\n\n" + chunk
            cur_tokens += 1 + ntoken  # adds 1 token for the two newlines
        # otherwise, record the batch and start a new one
        else:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken
            
    if cur_batch:  # add the last batch if it's not empty
        batches.append(cur_batch)
        
    return batches


chunks = group_chunks(chunks, ntokens)
len(chunks)

Inside group_chunks, zip(chunks, ntokens) pairs each chunk with its token count. The result is a sequence of (chunk, token count) tuples, similar to:

[('Alice', 85), ('Bob', 92), ('Charlie', 78)]

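If a chunk (say, Alice) has more tokens than hard_max_len (3000 here), it is discarded outright. A chunk that long cannot be processed, so edit the source file by hand first and add blank lines wherever you can.

The loop then walks the chunks in order and concatenates consecutive ones. If Alice, Bob, and Charlie together stay within max_len (1000 tokens), they are merged into a single batch and stored in the batches list.

If adding one more chunk would push the total past 1000, that chunk starts the next batch.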
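The grouping rule just described can be traced on toy data. The sketch below is a simplified stand-in for group_chunks, not the article's function: `group_by_budget`, the extra entries Dave and Eve, and their token counts are hypothetical, and it tracks names rather than concatenating text so the result is easy to read.

```python
def group_by_budget(items, max_len=1000, hard_max_len=3000):
    """Group (name, ntoken) pairs into batches of at most max_len tokens."""
    batches = []
    cur_names, cur_tokens = [], 0
    for name, ntoken in items:
        # anything above the hard limit is dropped, as in the article
        if ntoken > hard_max_len:
            print(f"discarded {name}: {ntoken} tokens")
            continue
        # one extra token is charged for the "\n\n" that joins two chunks
        sep = 1 if cur_names else 0
        if cur_tokens + sep + ntoken <= max_len:
            cur_names.append(name)
            cur_tokens += sep + ntoken
        else:
            # no room left: close the current batch and start a new one
            if cur_names:
                batches.append((cur_names, cur_tokens))
            cur_names, cur_tokens = [name], ntoken
    if cur_names:
        batches.append((cur_names, cur_tokens))
    return batches


demo_items = [("Alice", 85), ("Bob", 92), ("Charlie", 78),
              ("Dave", 800), ("Eve", 3500)]
demo_batches = group_by_budget(demo_items)
print(demo_batches)
```

Alice, Bob, and Charlie add up to 257 tokens including the two separators, so they share a batch. Dave would raise the total to 1058 and therefore opens the next batch, while Eve exceeds the hard limit and is dropped.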

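3. The prompt

Given how ChatGPT works, the model needs a suitable prompt that supplies context, along with a simple example. The requirements are:

Translate only the content, and leave the LaTeX formatting commands untranslated
Give ChatGPT a few simple examples

Write the prompt. Here is a sample: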

Translate only the text from the following LaTeX document into English. Leave all LaTeX commands unchanged
    
"""
\poglavje{Osnove Geometrije} \label{osn9Geom}
\item Naj bodo $P$, $Q$ in $R$ notranje točke stranic trikotnika
"""

\poglavje{The basics of Geometry} \label{osn9Geom}

Pull the variable parts of the prompt out into function parameters. Here is sample Python code:

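def translate_chunk(chunk, engine='text-davinci-002', dest_language='English',
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}",
                                        r"\poglavje{The basics of Geometry} \label{osn9Geom}")):
    # The sample pair is passed in as raw strings: written directly inside the
    # f-string, its LaTeX braces would be parsed as replacement fields.
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
    
"""
{sample_translation[0]}
{chunk}"""

{sample_translation[1]}
'''
    # openai.Completion is the legacy interface of the openai package (versions before 1.0)
    response = openai.Completion.create(
        prompt=prompt,
        engine=engine,
        temperature=0,  # deterministic output
        top_p=1,
        max_tokens=1500,  # upper bound on the length of the translation
    )
    result = response['choices'][0]['text'].strip()
    # drop any triple quotes the model echoes back from the prompt
    return result.replace('"""', '')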
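Since the prompt and the completion share one context window, it is worth checking that the numbers used in this article add up. The sketch below is back-of-the-envelope arithmetic only: `CONTEXT_LIMIT` takes the article's "4K" as 4096, and `PROMPT_OVERHEAD` is an assumed rough size for the instruction and the sample pair, not a measured value.

```python
CONTEXT_LIMIT = 4096    # the "4K" context of text-davinci-002, as stated in the article
MAX_COMPLETION = 1500   # max_tokens passed to openai.Completion.create
PROMPT_OVERHEAD = 60    # ASSUMPTION: rough token count of the instruction + sample pair


def request_fits(chunk_tokens):
    """True if the prompt (overhead + chunk) plus the reserved completion fits."""
    return PROMPT_OVERHEAD + chunk_tokens + MAX_COMPLETION <= CONTEXT_LIMIT


# Largest chunk that still leaves room for a 1500-token completion
largest_safe_chunk = CONTEXT_LIMIT - MAX_COMPLETION - PROMPT_OVERHEAD

print(request_fits(1000))   # a batch at max_len
print(request_fits(3000))   # a single chunk at hard_max_len
print(largest_safe_chunk)
```

Under these assumptions a 1000-token batch fits comfortably, but a lone chunk close to hard_max_len does not: anything above roughly 2,536 tokens would pass the grouping step and still overflow the context window. Lowering hard_max_len, or splitting such chunks by hand, avoids that case.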

With the function ready, loop over the grouped chunks:

dest_language = "English"

translated_chunks = []
for i, chunk in enumerate(chunks):
    print(str(i+1) + " / " + str(len(chunks)))
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, engine='text-davinci-002', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result
with open(f"data/geometry_{dest_language}.tex", "w", encoding="utf-8") as f:
    f.write(result)

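This sample only demonstrates the idea: it uses a plain for...in loop and keeps every translated chunk in the in-memory translated_chunks list. That is not good enough for a script you actually run, so add some fault tolerance to make the program more robust:

If a chunk fails, record its index so it can be handled separately later;
Write each chunk to the output as soon as it is translated, so nothing is lost;
OpenAI API tokens cost money, so spend as few as possible.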
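One possible way to cover the three points above is sketched here. It is an illustration under stated assumptions rather than the author's implementation: `translate_all`, `join_parts`, and the per-chunk file naming are hypothetical, and the translator is passed in as a parameter so that the demo can run with a fake function instead of calling the API.

```python
import tempfile
from pathlib import Path


def translate_all(chunks, translate, out_dir):
    """Translate chunks one by one and return the indices that failed.

    `translate` is any callable mapping a chunk to its translation,
    for example the article's translate_chunk.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for i, chunk in enumerate(chunks):
        part = out_dir / f"chunk_{i:05d}.tex"
        if part.exists():
            continue  # translated in an earlier run: no tokens spent again
        try:
            translated = translate(chunk)
        except Exception as exc:  # e.g. a rate-limit or network error
            print(f"chunk {i} failed: {exc!r}")
            failed.append(i)  # remember the index for a later retry
            continue
        part.write_text(translated, encoding="utf-8")  # saved immediately
    return failed


def join_parts(out_dir, n_chunks):
    """Concatenate the per-chunk files in order; raises if a part is missing."""
    parts = [(Path(out_dir) / f"chunk_{i:05d}.tex").read_text(encoding="utf-8")
             for i in range(n_chunks)]
    return "\n\n".join(parts)


# Demo with a fake translator that fails on one chunk
def flaky(chunk):
    if chunk == "b":
        raise RuntimeError("simulated API error")
    return chunk.upper()


demo_dir = tempfile.mkdtemp()
first_run = translate_all(["a", "b", "c"], flaky, demo_dir)       # chunk 1 fails
second_run = translate_all(["a", "b", "c"], str.upper, demo_dir)  # only chunk 1 is retried
demo_result = join_parts(demo_dir, 3)
```

Writing one file per chunk keeps the output in order even when a chunk in the middle fails, and rerunning the script only pays for the chunks that are still missing.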

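Finally

Depending on the number of chunks and the length of the document, translating an entire book with synchronous, blocking calls often takes a long time. You can improve the code above with the batch-processing approach described in the previous article.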
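The previous article's approach is not reproduced here. As one generic alternative to a blocking loop, the sketch below runs several requests at once with a thread pool from the standard library. `translate_concurrently` and the `max_workers` value are assumptions for illustration, and the translator is again a parameter (the demo uses `str.upper` in place of a real API call).

```python
from concurrent.futures import ThreadPoolExecutor


def translate_concurrently(chunks, translate, max_workers=4):
    """Run `translate` over chunks in worker threads, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which call happens to finish first
        return list(pool.map(translate, chunks))


demo_translations = translate_concurrently(["a", "b", "c"], str.upper)
print(demo_translations)
```

An exception raised by any call surfaces when the results are collected, and real API accounts have rate limits, so the number of workers should stay small.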

I'm @程序员小助手, an original content creator in the IT field who focuses on programming knowledge and community news.