
Meta Representation Transformation for Transfer Learning to Low-resource Languages: MetaXL


Natural Language Processing

3 main points
✔️ Transfer learning is possible even to languages with little or no training data
✔️ The key idea is to transform the linguistic representations of the source language
✔️ A representation transformation network is introduced and trained by meta-learning

MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning
written by Mengzhou Xia, Guoqing Zheng, Subhabrata Mukherjee, Milad Shokouhi, Graham Neubig, Ahmed Hassan Awadallah
(Submitted on 16 Apr 2021)
Comments: Accepted by NAACL 2021.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)


Introduction

Advances in multilingual models have brought success on a wide range of natural language processing tasks, but transferring these models to extremely low-resource languages remains difficult.

For example, multilingual BERT (mBERT) is pretrained on 104 languages with many Wikipedia articles, and XLM-R is pretrained on 100 languages. However, these models still leave out more than 200 languages for which very little data is available, not to mention the roughly 6,700 languages that have no Wikipedia text at all.

Transfer learning to such extremely low-resource languages is essential for better information access, but it has not been well studied.

Research on cross-lingual transfer with pretrained models has mainly focused on transfer between languages with sufficient training data, so fine-tuning cannot be performed effectively for these low-resource languages. Even learning word embeddings alone requires a sufficiently large monolingual corpus, and, as noted above, such corpora are difficult to obtain.

Furthermore, recent research shows that the representations of different languages do not always lie close together and can occupy very distant regions of the representation space, especially for languages with little data. MetaXL, a meta-learning method, bridges this representation gap and enables effective cross-lingual transfer to low-resource languages.

Proposed Method

The standard transfer learning approach for language models is to jointly fine-tune a multilingual language model on labeled data from both the source and target languages. However, in our setting, there is not enough labeled data available in the target language.

The key idea of the proposed method is to explicitly learn to transform the source language representations. In addition to the existing multilingual pretrained model, we introduce an additional network, called the representation transformation network, to explicitly model this transformation.

  1. A source-language batch passes through the first Transformer layer, the representation transformation network (RTN), and the remaining Transformer layers, and the training loss is computed from the corresponding source-language labels.
  2. The training loss is backpropagated to update only the Transformer layers, not the representation transformation network.
  3. A meta-loss is computed from the model's output on target-language data and the target-language labels, and only the representation transformation network is updated.
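To make the data flow in step 1 concrete, the sketch below shows where the RTN sits in the forward pass, in PyTorch. The layer counts, hidden size, and module structure here are illustrative assumptions for a toy model, not the authors' implementation (mBERT, for instance, has 12 layers with hidden size 768):

```python
import torch
import torch.nn as nn

d = 16  # hidden size (toy value; mBERT uses 768)

# First Transformer layer, the remaining layers, and a task head.
first_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
remaining_layers = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)
# Representation transformation network: a small d -> d mapping
# (a bottleneck MLP is an assumption, not the paper's exact architecture).
rtn = nn.Sequential(nn.Linear(d, d // 2), nn.Tanh(), nn.Linear(d // 2, d))
classifier = nn.Linear(d, 3)  # e.g. 3 output classes

def forward_source(x):
    """Source-language batch: the RTN is applied after the first layer."""
    h = first_layer(x)
    h = rtn(h)               # transform source representations toward the target
    h = remaining_layers(h)
    return classifier(h.mean(dim=1))

def forward_target(x):
    """Target-language batch: the RTN is bypassed (it is discarded at test time)."""
    h = remaining_layers(first_layer(x))
    return classifier(h.mean(dim=1))

src = torch.randn(2, 5, d)   # (batch, seq_len, hidden)
print(forward_source(src).shape)
```

Only source-language inputs pass through the RTN; target-language inputs use the unmodified Transformer stack, which is why the RTN can be thrown away after training.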

A representation transformation network takes a d-dimensional linguistic representation as input and outputs a transformed representation in d dimensions.

Assuming the representation transformation network can appropriately transform representations from the source language toward the target language, the source data can be regarded as roughly equivalent to target data at the representation level.


We use the pretrained model to initialize the model parameters θ and randomly initialize the parameters Φ of the representation transformation network.

The update equations for Φ and θ can be interpreted as follows.

First, if a representation transformation network Φ effectively transforms a source language representation, then such a transformed representation f(x; Φ,θ) should be more beneficial to the target language than the original representation f(x;θ).

Since we want the loss on the target language to be small under this constraint, this can be formulated as a bilevel optimization problem (Equation (2)).

L(·) denotes a loss function. The parameters Φ of the representation transformation network are meta-parameters: they are used only during training and discarded at test time.
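Equation (2) itself does not appear in the text above. A plausible reconstruction consistent with the surrounding description, writing $\mathcal{D}_{src}$ and $\mathcal{D}_{tgt}$ for the source and target datasets (our notation), is:

```latex
\Phi^{*} \;=\; \arg\min_{\Phi}\; \mathcal{L}\!\left(\theta^{*}(\Phi);\, \mathcal{D}_{tgt}\right)
\qquad \text{s.t.} \qquad
\theta^{*}(\Phi) \;=\; \arg\min_{\theta}\; \mathcal{L}\!\left(\theta, \Phi;\, \mathcal{D}_{src}\right)
```

The outer problem chooses Φ to minimize the target-language loss of the model parameters θ*(Φ) that the inner problem would learn on the source data under that Φ.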

Solving this exactly would require computing the optimal θ* every time Φ is updated, which is computationally prohibitive for a model as complex as a Transformer language model.

As in existing work on such bilevel optimization problems, instead of solving for the optimal θ* for each Φ, we take a stochastic gradient descent step on θ as an estimate of the optimum for the current Φ.

We update θ with Eq. (3) and Φ with Eq. (4) until convergence.
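The alternating updates can be sketched on a toy problem. Here scalar `theta` and `phi` stand in for the language-model and RTN parameters, and the quadratic losses are placeholders, not the paper's actual losses; the key mechanic is that the graph of the inner update is kept (`create_graph=True`) so the meta-gradient for Φ flows through the updated θ:

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)  # stands in for model parameters
phi = torch.tensor(0.0, requires_grad=True)    # stands in for RTN meta-parameters
alpha, beta = 0.1, 0.1                         # inner / outer learning rates

def source_loss(th, ph):
    # toy source-language training loss; depends on both theta and phi
    return (th - ph) ** 2

def target_loss(th):
    # toy target-language meta-loss; depends on theta only
    return (th - 2.0) ** 2

for _ in range(200):
    # Eq. (3)-style inner step: one SGD step on theta; create_graph=True keeps
    # the dependence of the updated theta on phi differentiable.
    g_theta = torch.autograd.grad(source_loss(theta, phi), theta, create_graph=True)[0]
    theta_updated = theta - alpha * g_theta

    # Eq. (4)-style outer step: update phi with the meta-gradient of the
    # target loss evaluated at the updated theta.
    g_phi = torch.autograd.grad(target_loss(theta_updated), phi)[0]
    with torch.no_grad():
        phi -= beta * g_phi
        theta -= alpha * g_theta  # commit the inner update

print(theta.item(), phi.item())
```

In this toy setup the inner loss pulls θ toward Φ, and the meta-update drives Φ so that the updated θ minimizes the target loss, so both converge toward 2; the same mechanism lets MetaXL shape the RTN purely from target-language feedback.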

Training and Evaluation

We conduct experiments on two tasks: named entity recognition (NER) and sentiment analysis. For the NER task, we use the cross-lingual WikiAnn dataset; the amount of data per language ranges from 100 to 20k examples.

The sentiment analysis task uses the 200k-example English portion of the Multilingual Amazon Review Corpus (MARC), together with Persian and Telugu corpora: SentiPers, a Persian sentiment corpus of 26k sentences of user opinions on digital products, and Sentiraama, a Telugu (tel) sentiment analysis dataset.

The NER results are shown below, compared with joint training (JT) using 5k examples of source-language data.

Accuracy is much better when a source language is used in addition to the target language alone, and the gain is clearly larger when a related language, rather than English, is used as the source.

For sentiment analysis, we show an accuracy comparison using 1k examples of English data.


The proposed method, MetaXL, enables effective transfer from data-rich source languages and can reduce the gap between multilingual representations. Future work includes transferring from multiple source languages to further improve performance, and placing multiple representation transformation networks at multiple layers of the pretrained model.

Tak (Ph.D., Informatics)
