Catch up on the latest AI articles

What Is A Good Vocabulary In Machine Translation?

What Is A Good Vocabulary In Machine Translation?

Natural Language Processing

3 main points
✔️ Introduces a method to solve the "problem of not fully accounting for the effect of vocabulary size".
✔️ Proposes a VOLT algorithm that uses a measure called MUV, which is analogous to marginal utility in economics.
✔️ VOLT efficiently finds high-performing vocabularies in a variety of environments.

Vocabulary Learning via Optimal Transport for Neural Machine Translation
wrriten by 
Jingjing XuHao ZhouChun GanZaixiang ZhengLei Li
(Submitted on 31 Dec 2020)
Comments:Accepted by ACL 2021
Subjects:Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.


The purpose of this paper is to investigate token vocabulary selection as it affects machine translation performance and to explore how to find the optimal vocabulary. It also examines whether it is possible to find the optimal vocabulary without trial training.

Text vocabulary construction is important in neural machine translation(NMT) and othernatural languageprocessing (NLP), andsubword approaches such as byte-pair encod ing (BPE) have yielded promising results. These approaches resemble data compression and facilitate learning and prediction.

The problem with current approaches is that only criteria such as frequency and entropy are considered, and the impact of vocabulary size is not properly taken into account. Previous work has shown that vocabulary size also affects the downstream performance of the system, especially for resource-constrained tasks. However, because searching for an appropriate vocabulary size requires trial training for all sizes and is computationally expensive, existing studies have used only general settings. For example, the range of30000-40000 is the most common size setting in the Machine Translation Conference (WMT) paper.

This paper proposes a method that simultaneously considers entropy and vocabulary size to achieve automatic lexicalization without trial training for all sizes. However, there are two problems with the design of this method.

(⑴ Difficulty in finding a suitable objective function. (2) Increasing vocabulary size decreases the entropy of the corpus, which is advantageous for model learning, but too many tokens can cause dilution of tokens, which negatively affects model learning.

⑵ It is difficult to solve the"discrete optimization problem"of finding the optimal solution from a finite number of alternatives, and there is an exponential search space.

VOLT (Vocabulary Learning via Optimal Transport)

A lexical decision method called VOLT was proposed to address the above problem.

This approach finds an appropriate vocabulary by considering the entropy of the corpus and the vocabulary size. In particular, we use a concept called marginal utility of economics (MUV ) to find the balance of the vocabulary; MUV is defined as the negative derivative of entropy with respect to vocabulary size.

Definition of MUV

Formally, MUV represents the negative derivative of entropy with respect to size. For the sake of simplicity, we estimate the MUV at implementation using a smaller vocabulary. In particular, the MUV is computed as follows

Here, ) (�)and(�+�) represent the vocabulary at vocabulary sizeandk+m�+�, respectively. Also ,(��) represents the entropy when the vocabulary( ���) . To avoid the effect of token length, entropy is normalized here by the average length of the tokens, and the final entropy is defined as

p(j) is the relative frequency of tokens andlvisthe average length of tokens in the vocabulary.

Higher MUVs are desired due to Pareto optimality.Figure 1 shows an example regarding marginal utility.

Figure 1.

Sampling BPE-generated vocabularies of various sizes from Eo-En translations and drawing their entropy and BLEU lines. The "star" represents the vocabulary with the largest marginal utility. Marginal utilityevaluates the increase in benefit (decrease in entropy) with increasing cost (size).

Next, our goal is to generate a vocabulary with maximum diversity at a tractable time-computational cost. To solve the discrete optimization problem, we use linear programming to solve the optimal transport problem.

This approach has been shown to be successful in machine translation tasks, where the method VOLT has been shown to outperform widely used vocabularies. VOLT is also an efficient solution and does not require expensive computational resources.

Lexicalization Limitations Usefulness

In this section, suggestions are made for finding appropriate lexical measures while considering vocabulary size and entropy. Increasing vocabulary size reduces entropy, which benefits model learning, while a large vocabulary can lead to parameter explosion and token dilution problems, which negatively affect model learning.

To address this problem, there is a proposal to use the marginal utility of lexification (MUV) as an optimization goal; the MUV evaluates the cost of a corpus versus the benefit gained; the higher the MUV, the higher the benefit-to-cost ratio is expected.

Tentative results confirm that MUV correlates with downstream performance in two-thirds of translation tasks (see Figure 2); MUV correlates with translation task performance, and the goal is to maximize MUV with a tractable time computation rate.

Figure 2.

The x-axis classifies Spearman scores into different groups; the y-axis shows the number of tasks within each group. The central Spearman score is 0.4.

This paper views vocabulary construction as a discrete optimization problem to find the best vocabulary.

However, the discrete nature of the vocabulary makes discrete optimization difficult. Therefore, the paper proposes to simplify the original problem by searching for the optimal vocabulary from a fixed-size vocabulary.

Specifically, the MUV is defined as the derivative of entropy with respect to vocabulary size, and auxiliary variables are introduced to approximate the calculation. This allows the optimal vocabulary to be found simply by computing the MUV between vocabulary sizes.

VOLT Algorithm

The author points out that there is a problem with the method for maximizing MUV: MUV is the amount of entropy change in the vocabulary, and some method must be used to build the vocabulary. The author proposes a method using optimal transport.

In other words, VOLT calculates MUV by constructing a vocabulary at a particular vocabulary size, measuring the entropy of that vocabulary, and then measuring the vocabulary and entropy at another vocabulary size as well.

The algorithm for VOLT is as follows.

The algorithm uses L as the words in the subword segmented training corpus in order of frequency, C as each letter in the training corpus, and S as the vocabulary size. VOLT is performed using Sinkhorn's theorem as the optimal transport.

VOLT subwords the training data with a large vocabulary size in advance and sets up multiple candidate vocabulary sizes. For each vocabulary size, it uses the optimal transport and calculates the entropy at that time. It then compares the change in entropy at each vocabulary size and determines the vocabulary size with the highest MUV.

Experimental results

To evaluate the performance of VOLT, experiments will be conducted on three datasets, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation.

Table 1: Comparison of VOLT vocabulary search with widely used BPE vocabulary.
Here the vocabulary size is adopted from the X-En setting.

The vocabulary retrieved by VOLT is significantly reduced in size and achieves a higher BLEU score (a measure of machine translation accuracy), indicating that VOLT is a practical approach that can find high BLEU, small size, and high performance vocabulary.

The size of the retrieved vocabulary is about 110K; VOLT achieves higher BLEU scores for most pairs.

Table 2: Comparison of lexical retrieval with VOLT and BPE-1K as recommended by Ding et al.
for low-resource datasets. Here we use the TED X-En bilingual as an example.

This table shows that the vocabulary retrieved by VOLT is equivalent to the heuristically retrieved vocabulary in terms of BLEU scores.

In terms of BLEU scores, VOLT finds vocabularies that are as good as those retrieved heuristically; BPE-1K was chosen based on a large number of experiments. In contrast, VOLT requires only one trial for evaluation and takes only 0.5 CPU hours and 30 GPU hours to find the best vocabulary.

Table 3: Comparison of VOLT and widely used BPE vocabulary in multilingual translation.

The size of the retrieved vocabulary is approximately 110K. As you can see, VOLT achieves higher BLEU scores for most pairs.

Table 4: VOLT, MUV-Search, and BPE-Search results.

A comparison of VOLT and BPE-Search shows that VOLT is a lightweight solution that can find competing vocabularies within 0.5 hours on a single CPU, compared to BPE-Search, which takes several hundred hours on a GPU.

MUV-Search combines MUV with a general approach by selecting the vocabulary with the highest MUV as the final vocabulary; MUV-Search does not require full downstream training, but still takes a lot of time to generate the vocabulary and calculate the MUV. In this context, VOLT is the most efficient approach.

Table 5: VOLT vs. Strong Baseline

Compared to the block approaches above, VOLT achieves nearly the best performance with a much smaller vocabulary. These results show that with a well-defined vocabulary, good results can be achieved with a simple baseline; VOLT performs significantly better than SentencePiece and WordPiece, with more than one BLEU improvement.

Table 6: Vocabulary sizes for different architectures

The VOLT retrieved vocabulary also works with the competing BLEU Convolutional Seq2Seq, but in a much smaller size, indicating that the model with the VOLT retrieved vocabulary (11.6k tokens) can process 133 sentences per second, while the model with BPE-30K (33.6k tokens) can only run 101 per second. The model with the VOLT search vocabulary (11.6k tokens) can process 133 sentences per second, while the model with BPE-30K (33.6k tokens) can execute only 101 sentences per second.

It can also be seen that although VOLT reduces the Softmax computation, it does not significantly increase the Softmax runtime because the parallel computation on the GPU is optimized.

These two lexicons show a great deal of overlap, especially in high-frequency words. They perform similarly on the downstream side. Thus, from an empirical perspective, a VOLT-sized BPE would also be a good choice.


In this study, we proposed a new approach to lexical retrieval without trial training. Based on information theory, lexicalization is formulated as two separate optimization goals, and an efficient solution, a lexical decision method called VOLT, is designed.

Experimental results show that VOLT can effectively find high-performing vocabulary in a variety of environments.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us