
Machine Translation With Unsupervised Learning From Generative Language Models Only!

Natural Language Processing

3 main points
✔️ Derive a machine translation function from a language model that has only been trained on a single language
✔️ Produce translation examples through back translation
✔️ Synthesize a dataset by amplifying the translation examples

Unsupervised Neural Machine Translation with Generative Language Models Only
written by 
Jesse Michael Han, Igor Babuschkin, Harrison Edwards, Arvind Neelakantan, Tao Xu, Stanislas Polu, Alex Ray, Pranav Shyam, Aditya Ramesh, Alec Radford, Ilya Sutskever
(Submitted on 11 Oct 2021)
Comments: Published on arxiv.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)



The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Unsupervised neural machine translation includes methods that bootstrap a weak translation model and then amplify its translation capability through back-translation. In this work, existing unsupervised neural machine translation research is simplified further: only generative language models are used. The authors show that state-of-the-art unsupervised neural machine translation can be derived from pre-trained language models alone.


Back-translation was introduced as a data augmentation method that exploits monolingual data on the target side by sampling synthetic source-to-target data from a target-to-source translation model.

In this work, we view machine translation as a language modeling task, where we jointly train and sample from a single language model for both source-to-target and target-to-source translation.

Given a bitext <seq1, seq2> in languages L1 and L2, we formulate the translation task as follows.

[L1] <seq1> [[TRANSLATE]] [L2] <seq2>

At test time, the language model is given the input [L1] <seq> [[TRANSLATE]] [L2], and the candidate translation <sampledSeq> is parsed from the output. Back-translation is implemented by reversing seq and sampledSeq and fine-tuning on the bitext ⟨sampledSeq, seq⟩. Note that the same language model is used for both translation directions.
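The prompt format and the reversal step described above can be sketched as follows. This is a minimal illustration: the [L1]/[L2]/[[TRANSLATE]] tags follow the paper's template, but the helper function names are our own, not the authors' code.

```python
# Sketch of the translation-as-language-modeling prompt format.
# Tags follow the paper; function names are illustrative.

TRANSLATE = "[[TRANSLATE]]"

def format_bitext(l1: str, seq1: str, l2: str, seq2: str) -> str:
    """Training example: [L1] <seq1> [[TRANSLATE]] [L2] <seq2>."""
    return f"[{l1}] {seq1} {TRANSLATE} [{l2}] {seq2}"

def format_prompt(l1: str, seq: str, l2: str) -> str:
    """Test-time prompt: [L1] <seq> [[TRANSLATE]] [L2]."""
    return f"[{l1}] {seq} {TRANSLATE} [{l2}]"

def parse_translation(prompt: str, model_output: str) -> str:
    """Strip the prompt prefix to recover the sampled candidate <sampledSeq>."""
    assert model_output.startswith(prompt)
    return model_output[len(prompt):].strip()

def reverse_example(l1: str, seq: str, l2: str, sampled_seq: str) -> str:
    """Back-translation example: fine-tune on the reversed bitext <sampledSeq, seq>."""
    return format_bitext(l2, sampled_seq, l1, seq)
```

Both directions share one template, which is why a single model can serve as both the forward and backward translator.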

A single language model is used for both forward and backward translation, and it is trained jointly in both directions at each iteration. There are several ways to train a model with back-translation.

Algorithm 1 is an implementation of back translation using the generative language model pθ.

Assume that pθ has already been trained to complete prompts of the form

[L1] <seq1> [[TRANSLATE]] [L2]

into

[L1] <seq1> [[TRANSLATE]] [L2] <seq2>
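Algorithm 1's back-translation loop can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyModel` is a stand-in that "translates" by reversing words and merely records the bitext it is fine-tuned on; a real run would sample from the language model and take gradient steps.

```python
# Minimal sketch of iterative back-translation with a single model p_theta
# used in both directions (Algorithm 1). All learning machinery is stubbed.

class ToyModel:
    def __init__(self):
        self.finetune_log = []

    def translate(self, src_lang, seq, tgt_lang):
        # stand-in for sampling from [srcLang] <seq> [[TRANSLATE]] [tgtLang]
        return " ".join(reversed(seq.split()))

    def finetune(self, bitext):
        # stand-in for a fine-tuning step on formatted bitext examples
        self.finetune_log.extend(bitext)

def backtranslate(model, mono_src, mono_tgt, l1, l2, iterations=2):
    for _ in range(iterations):
        # target -> source: sample synthetic sources for the target monotext,
        # then fine-tune on the reversed bitext <sampledSeq, seq> (src -> tgt)
        pairs = [(model.translate(l2, t, l1), t) for t in mono_tgt]
        model.finetune([(l1, s, l2, t) for s, t in pairs])
        # and symmetrically for the source-side monotext (tgt -> src)
        pairs = [(model.translate(l1, s, l2), s) for s in mono_src]
        model.finetune([(l2, t, l1, s) for t, s in pairs])
    return model
```

Because the same `model` object is updated in both directions each iteration, improvements in one direction immediately improve the synthetic data for the other.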

To run this back-translation, we need such a pre-trained language model. Here, the GPT-3 family of language models trained on large-scale Internet data is used. Large-scale generative language models are known to have strong in-context meta-learning capabilities; two special cases are (1) instruction following and (2) few-shot prompting.

Since large-scale language models benefit from detailed natural language task descriptions, they can achieve strong performance on a variety of tasks (question answering, reasoning, translation) when given examples in context. The few-shot translation capability of the pre-trained model needs to be adapted to the zero-shot format used in back-translation, which is done in two steps. First, a small number of zero-shot translations are sampled from GPT-3.

Given a bitext ⟨srcSeq, tgtSeq⟩ in languages srcLang and tgtLang and a stop sequence <sep>, the following zero-shot prompt format is used:

<sep> Given the following passage in <srcLang>: <sep> <srcSeq> <sep> a good <tgtLang> translation is: <sep> <tgtSeq> <sep>

At test time, we sample until the stop sequence <sep> is detected, with <sep> set to \n---\n throughout. These zero-shot translations are then used as few-shot prompts to amplify the translation ability, sampling a larger synthetic dataset from a smaller model.
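The zero-shot prompt format above can be sketched as follows, with <sep> fixed to `"\n---\n"` as in the paper. The function names are our own illustration.

```python
# Sketch of the zero-shot translation prompt and stop-sequence handling.
# SEP matches the separator described in the text; names are illustrative.

SEP = "\n---\n"

def zero_shot_prompt(src_lang: str, src_seq: str, tgt_lang: str, sep: str = SEP) -> str:
    """Build the zero-shot prompt; the model is expected to complete <tgtSeq>."""
    return (sep + f"Given the following passage in {src_lang}:"
            + sep + src_seq
            + sep + f"a good {tgt_lang} translation is:" + sep)

def truncate_at_stop(generated: str, sep: str = SEP) -> str:
    """At test time, keep the sampled text only up to the stop sequence."""
    return generated.split(sep, 1)[0]
```

A distinctive multi-line separator like `\n---\n` makes a reliable stop sequence because it is unlikely to occur inside a natural-language translation.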

Next, a language model for this task is created by fine-tuning on the resulting bitext.

The bootstrap is implemented as follows.

1. Generatively pre-train the language model pθ(⋅) on a large corpus.

2. Using few-shot prompts, sample a pool of NS synthetic target-side translations and NT synthetic source-side translations from another language model q(⋅). Then, using k-shot examples randomly drawn from NS (or NT), sample CS synthetic target-side translations (or CT synthetic source-side translations) from pθ(⋅) over the source-side corpus MS (or target-side corpus MT).

3. Reformat the (prompt, sampled translation) data and fine-tune the language model pθ(⋅) on it.

4. Reverse all the data into (sampled translation, prompt) pairs and continue fine-tuning the language model pθ(⋅) with back-translation.
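The bootstrap steps above can be sketched as follows, shown for the target-side direction only (the source-side direction is symmetric). This is an illustration under stated assumptions: `ToyLM` and the zero-shot sampler `q` are stubs standing in for pθ and GPT-3, and all names are our own, not the authors' code.

```python
# High-level sketch of bootstrap steps 2-4: zero-shot sampling from q,
# few-shot amplification with p_theta, distillation, and reversal.
import random

class ToyLM:
    def __init__(self):
        self.finetuned_on = []

    def sample(self, demos, src_seq):
        # stand-in for few-shot sampling a translation given k demonstrations
        return src_seq.upper()

    def finetune(self, pairs):
        self.finetuned_on.extend(pairs)

def bootstrap(p_theta, q, mono_src, k=2, n_zero_shot=3):
    # Step 2a: sample a small pool of zero-shot translations from q (GPT-3)
    pool = [(s, q(s)) for s in random.sample(mono_src, n_zero_shot)]
    # Step 2b: few-shot amplification -- use k pool examples as demonstrations
    # and sample many synthetic translations from p_theta over the corpus M_S
    demos = random.sample(pool, k)
    synthetic = [(s, p_theta.sample(demos, s)) for s in mono_src]
    # Step 3: distillation -- fine-tune p_theta on (prompt, sampled translation)
    p_theta.finetune(synthetic)
    # Step 4: reverse the pairs and keep fine-tuning (initial back-translation)
    p_theta.finetune([(t, s) for s, t in synthetic])
    return p_theta
```

After this bootstrap, the model is ready for the iterative back-translation loop of Algorithm 1.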

Why amplification and distillation?

While few-shot prompts are flexible and can be used to extract capabilities from generative models for a variety of tasks, their benefits are most pronounced for large models with large data.

It remains unclear how to iteratively fine-tune the language model in a way that preserves the few-shot capability while still adapting to the zero-shot format. Few-shot amplification generates the unsupervised bootstrap data, while distillation enables iterative back-translation with a smaller model pθ(⋅), avoiding the overhead of few-shot sampling from GPT-3 itself.


The experiments will focus on the well-known WMT14 English-French benchmark.

For Algorithm 1, only the English text from one half of the data is used as the source corpus MS and only the French text from the other half as the target corpus MT, which avoids any implicit sentence-level alignment between source and target. At each back-translation iteration, 1 million translations are sampled in each direction. Unless otherwise specified, 40 back-translation iterations are run after bootstrapping, and BLEU is measured with the final model.

To implement bootstrapping, an additional 2048 training examples are prepared, and 1024 zero-shot English-French (and 1024 French-English) translations are sampled from GPT-3 to use as few-shot prompts. During few-shot amplification, 4 million initial target-side and source-side translations are sampled, respectively. The model is then fine-tuned for two forward epochs (distillation) and two reverse epochs (initial back-translation).

During bootstrapping, we sample from a single model, train it to mimic its own few-shot-prompted generations, and then back-translate. In these experiments, the few-shot demonstrations themselves are generated zero-shot by GPT-3. This is followed by the iterative back-translation procedure described above. The method is applied to the 125M, 350M, 760M, and 1.3B parameter models of the GPT-3 family.

Prior research has shown that pre-trained English language models perform much better when translating into English than from English into another language.

Interestingly, after only two epochs of back-translation with a small amount of few-shot prompt data, the gap reverses, and all models achieve English-French BLEU significantly higher than French-English BLEU. This suggests that the models do not lack knowledge of French; rather, their latent ability to translate from English simply needs to be surfaced, which back-translation accomplishes. Relatedly, high-quality samples from one back-translation round yield high-quality synthetic bitext for training the next round.

We compare BLEU scores from the best model (a 1.3B parameter model distilled with self-amplified GPT-3 and subsequent 40 rounds of back-translation) with prior unsupervised neural machine translation work on the WMT14 English-French benchmark.


Note that back-translation, like reinforcement learning, is simply a way to trade computation for data. This research can be seen as part of the recent trend toward data-driven architecture engineering.

Here, task-specific inductive biases are not hard-coded into the model but are instead incorporated into, and learned from, the training data. Formulating translation as a language modeling task shows that the input-output inductive biases imposed by an encoder-decoder architecture can be simulated through the prompt format. Although this work focuses only on machine translation, the method can be applied to any sequence-to-sequence task whose forward and reverse directions

(1) can be learned jointly by a decoder-only autoregressive transformer, and

(2) can be elicited with few-shot prompts after large-scale generative pre-training.

Back-translation is simply a form of round-trip self-training and is not intrinsically tied to the translation task. The authors encourage future research into general applications of this approach with transformer architectures beyond translation.

Tak
Ph.D. (Informatics)
