
PALO: Innovative Multilingual Large-scale Multimodal Model With 10 Languages Covering Two-thirds Of The World's Population


Large Language Models

3 main points
✔️ Development of the multilingual large-scale multimodal model "PALO": The first open-source multilingual large-scale multimodal model, "PALO", was developed for 10 major languages covering 65% of the world's population. It primarily targets language groups that have not been adequately represented by multimodal models in the past.

✔️ Building an extensive instruction-tuning dataset: A high-quality multilingual vision-language instruction dataset was created across 10 languages. The dataset is essential for improving the accuracy of language processing and generation across multiple languages and is built on translations produced by state-of-the-art large-scale language models.

✔️ Improved multilingual performance and demonstrated scalability: The multilingual performance of state-of-the-art large multimodal models was improved at three different scales: 1.7B, 7B, and 13B parameters. Significant improvements were achieved in comprehension and content generation for low-resource languages, while high performance was maintained for high-resource languages and performance improved across a variety of language tasks.

PALO: A Polyglot Large Multimodal Model for 5B People
written by Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
(Submitted on 22 Feb 2024 (v1), last revised 5 Mar 2024 (this version, v2))
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


The dramatic evolution of generative AI has led to the emergence of large-scale multimodal models that seamlessly link visual and linguistic tasks, resulting in revolutionary progress in this area. However, while pioneering efforts such as LLaVA and MiniGPT-4 have achieved remarkable results in generating effective text responses from visual input, their focus has been largely limited to English, leaving multimodal comprehension underdeveloped for non-English speakers. As a result, current large-scale multimodal models tend to overlook the linguistic diversity of language groups that make up much of the world's population, such as speakers of Chinese, Hindi, Spanish, and French.

To address this imbalance, and with a focus on languages that have been underrepresented in multimodal models, this paper develops PALO, the first open-source multilingual large-scale multimodal model, encompassing 10 major languages that cover 65% of the world's population.

This initiative addresses the lack of high-quality multilingual multimodal data in languages other than English. In particular, data is limited for Hindi, Arabic, Bengali, and Urdu. This paper addresses these challenges through careful analysis and refinement of translations generated by state-of-the-art large-scale language models for each target language. Through human intervention to check and correct the accuracy of the translations, we have created a high-quality multilingual dataset that ensures precision and subtlety across languages.

In addition, we are developing a unified model, PALO, that can simultaneously answer questions in 10 different languages, utilizing a high-quality multilingual vision language instruction data set and the latest advances in large-scale multimodal modeling technology. This model can maintain or even improve performance in high-resource languages while providing significant performance gains in low-resource languages.

PALO: Large multimodal model with multilingual support

Here, we introduce the Polyglot Large Multimodal Model (PALO), which supports 10 major languages covering about two-thirds of the world's population and aims for global accessibility. The model is designed to be versatile across different computational environments: it is based on LLaVA for the larger models (7B/13B) and on MobileVLM for the lightweight model (1.7B).

PALO seamlessly integrates a vision encoder and a language model to generate accurate natural language responses based on input images and user text queries. The model uses the CLIP ViT-L/14 vision encoder and passes the resulting vision tokens through a projector that converts them into a form the language model can readily consume. Of particular note is the lightweight downsample projector (LDP) designed for the mobile model, which significantly reduces training and inference time and allows the model to run efficiently.
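The flow above (encode the image, project the vision tokens into the language model's embedding space, then prepend them to the text tokens) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: all dimensions and the two-layer MLP projector are assumptions in the spirit of LLaVA-style designs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not the paper's exact values):
# the vision encoder emits 576 patch tokens of width 1024;
# the language model embeds text tokens at width 4096.
NUM_PATCHES, VISION_DIM, LLM_DIM = 576, 1024, 4096

def project_vision_tokens(patches, w1, b1, w2, b2):
    """Two-layer MLP projector mapping vision tokens into the
    language model's embedding space (hypothetical sketch)."""
    hidden = np.maximum(patches @ w1 + b1, 0.0)  # GELU in practice; ReLU here
    return hidden @ w2 + b2

patches = rng.standard_normal((NUM_PATCHES, VISION_DIM))
w1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

vision_embeds = project_vision_tokens(patches, w1, b1, w2, b2)

# The projected vision tokens are concatenated with the embedded
# text query before being fed to the language model.
text_embeds = rng.standard_normal((32, LLM_DIM))  # e.g. a 32-token query
llm_input = np.concatenate([vision_embeds, text_embeds], axis=0)
print(llm_input.shape)
```

The mobile variant's LDP would replace the MLP here with a downsampling projector that emits fewer vision tokens, which is where the training and inference savings come from.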

PALO is trained in 10 languages and leverages a rich multimodal instruction-tuning dataset. This allows the model to learn from more challenging examples in a richer context, greatly improving its ability to understand and generate responses across diverse language sets.

The large-scale models use Vicuna as the large-scale language model, and the mobile model uses MobileLLaMA; both have been trained or fine-tuned on large amounts of text data collected from a variety of sources, including the ShareGPT dataset. The figure below provides an overview of PALO's architecture.

PALO's comprehensive language support and advanced technology make its model more accessible to users around the world. By doing so, PALO helps open up new possibilities for global communication by bridging diverse languages and cultures.

The paper also develops a multilingual vision-language instruction-tuning dataset. This dataset covers a wide range of language diversity and aims to maximize the potential of the state-of-the-art large-scale multimodal model by Liu et al. (2023b). Specifically, we introduce a semi-automatic translation pipeline based on the large-scale language model by Brown et al. (2020) to optimize the translation process from English into multiple languages. This approach allows us to build high-quality multilingual datasets while addressing the unique challenges of each language, such as punctuation errors and grammatical subtleties.

The pipeline combines automated scripting with manual review by native speakers of each language to increase translation accuracy and consistency across languages. In particular, it addresses detailed language-specific issues such as accurate use of gender and overall linguistic consistency.

In addition, recognizing the limitations of large-scale language models, we fine-tune them using high-quality datasets consisting of 1K conversations per language that have been manually validated and corrected. This fine-tuning focuses not only on improving translation accuracy but also on consistency with the characteristics of each language, such as tone and notation. The improved large-scale language model is then used to translate an extensive VLM instruction-tuning dataset containing approximately 150K instructions; the fine-tuning itself is performed on OpenAI's fine-tuning platform.
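The overall pipeline (machine translation with the fine-tuned model, automatic quality checks, then native-speaker review of flagged items) might be organized along these lines. Everything here is a hypothetical sketch: the function names, the language codes, and the length-based heuristic are illustrative assumptions, not the paper's actual tooling.

```python
from dataclasses import dataclass, field

# Nine non-English target languages (codes are illustrative).
TARGET_LANGUAGES = ["zh", "hi", "es", "fr", "ar", "bn", "ru", "ur", "ja"]

@dataclass
class Instruction:
    english: str
    translations: dict = field(default_factory=dict)
    needs_review: dict = field(default_factory=dict)

def machine_translate(text: str, lang: str) -> str:
    """Stand-in for the fine-tuned LLM translator (hypothetical)."""
    return f"[{lang}] {text}"

def looks_suspicious(source: str, translated: str) -> bool:
    """Cheap automatic check; flagged items go to native-speaker review.
    Flagging suspiciously short outputs is just one example heuristic."""
    return len(translated) < 0.5 * len(source)

def run_pipeline(samples):
    for s in samples:
        for lang in TARGET_LANGUAGES:
            out = machine_translate(s.english, lang)
            s.translations[lang] = out
            s.needs_review[lang] = looks_suspicious(s.english, out)
    return samples

corpus = run_pipeline([Instruction("Describe the image in detail.")])
print(len(corpus[0].translations))
```

The key design point is that the expensive human step is applied only where automatic checks (or sampling) flag problems, which is what makes translating roughly 150K instructions per language tractable.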

This meticulous preparatory process results in a comprehensive, high-quality multilingual dataset that is essential for fine-tuning PALO.

This dataset significantly improves the model's ability to produce contextual and grammatically accurate content in all the languages it contains. For example, the figure below highlights two major improvements in English to Arabic translation. In the first example, lexical accuracy is improved, and in the second example, grammatical agreement is improved.

Integrating this dataset into the learning process of a large-scale multimodal model is key to expanding the ability to effectively include English and nine other languages.


The research team conducted a comprehensive validation across a variety of languages to assess multilingual proficiency. The validation used a high-quality evaluation set that was translated with GPT-4-Turbo and then manually refined. The set consisted of 24 images of a wide variety of indoor and outdoor scenes, memes, and works of art, along with 60 questions that measure the ability to understand and generalize from these images.

Experimental results show that "PALO" performs robustly in high-resource languages, particularly the 7B and 13B models, which score an average of 59.0 and 63.8 points in these languages, respectively. This indicates that the multilingual extensions were effectively incorporated without compromising the models' inherent capabilities. Furthermore, for the low-resource languages, the performance of both models improved significantly, increasing their scores to 55.6 and 59.2, respectively. The results are shown in the table below.

Overall performance across all 10 languages also improved, with the 7B model achieving an average score of 57.65 and the 13B model 61.97. This indicates that we have successfully developed a more comprehensive, diverse, and higher-performing visual language model (VLM) that is capable of handling the complex landscape of the world's languages in visual language tasks.
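As a quick consistency check, the reported overall averages follow as weighted means of the two group averages, assuming a split of 6 high-resource and 4 low-resource languages (this 6/4 split is our reading of the language list, not a figure stated in this article):

```python
# Overall score as the weighted mean of the high- and low-resource
# group averages, assuming 6 high-resource and 4 low-resource languages.
def overall_score(high_avg, low_avg, n_high=6, n_low=4):
    return (n_high * high_avg + n_low * low_avg) / (n_high + n_low)

print(round(overall_score(59.0, 55.6), 2))  # ~57.64 (reported: 57.65)
print(round(overall_score(63.8, 59.2), 2))  # ~61.96 (reported: 61.97)
```

The small residual differences from the reported 57.65 and 61.97 are consistent with rounding of the per-group averages.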

The figure below also shows qualitative results that demonstrate PALO's multilingual capabilities. In response to user queries, the model generates accurate textual responses related to the visual content and associated language. This illustration highlights the ability to link visual and linguistic understanding across a variety of languages. The illustration examines interactions in two high-resource languages (Spanish and Chinese) and two low-resource languages (Hindi and Arabic).

PALO accurately interprets an unusual aspect of the image, which features two individuals in medieval costumes inside a modern supermarket. In Chinese, the model shows creative imagination, suggesting a backstory in which these characters could be the king and queen of a storybook. In Hindi, PALO demonstrates scenario building by describing a situation in which the medieval couple may have reached the present day as time travelers. In Arabic, PALO displays a touch of humor, imagining a playful line the king might deliver, demonstrating a subtle grasp of context- and culture-specific humor. This example effectively illustrates a sophisticated ability to process and produce content in multiple languages, reflecting a high degree of linguistic accuracy and cultural intelligence.

The figure below also depicts qualitative results that demonstrate PALO's visual inference and proficiency in multiple languages; PALO responds accurately to visual content in a contextually appropriate manner for each language. The figure depicts conversations in three high-resource languages (French, Russian, and Japanese) and one low-resource language (Urdu). In the French segment, the model demonstrates pragmatic reasoning by suggesting recipes using the ingredients available in the refrigerator, linking visual recognition to food suggestions. In Russian, PALO identifies items rich in vitamin C. In the Urdu example, the model organizes the refrigerator contents into food groups, demonstrating the ability to classify items and apply nutritional knowledge. This effectively highlights the ability to switch between languages while maintaining conversational context and reflects the ability to produce relevant and culturally aware content in both high- and low-resource languages.

Notably, the mobile model also showed consistent improvements in both high- and low-resource languages, resulting in a significant overall score improvement of 33.9 points on average over the MobileVLM baseline. Interestingly, this mobile version also improved performance in high-resource languages such as English and Chinese. This difference can be attributed to the language models' pre-training data: LLaMA-2 was trained on 2 trillion tokens with better representation of high-resource languages, whereas MobileLLaMA was trained primarily on 1.3 trillion English tokens.

This research opens up new possibilities for model performance and versatility in multilingual visual language tasks. These results suggest that our approach has the potential to significantly improve our understanding and ability to respond to visual language tasks in languages around the world.


This paper develops a new multilingual large-scale multimodal model, PALO. This innovative model was developed to serve approximately two-thirds of the world's population, or 5 billion people. PALO takes both images and text queries as input and can interact effectively in a wide range of languages, from major ones such as English and Chinese to less-supported ones such as Arabic and Hindi. The model was refined by translating 150,000 instructions across 10 languages, with 1,000 human-annotated conversations per language. Through training at three different scales (1.7 billion, 7 billion, and 13 billion parameters), PALO improved its overall performance in visual-language evaluation and demonstrated its versatility and scalability.

However, semi-automatic translation processes may not fully capture the cultural nuances unique to each language, which may affect cultural depth and accuracy. Furthermore, while the 10 selected languages provide extensive coverage, they also suggest room for expansion to more languages. In addition, the inherent biases of large-scale language models, especially in low-resource languages, may pose risks related to nuanced interpretation of visual data, such as misinterpretation of cultural symbols and gestures. Careful evaluation and application of learning are needed to address these challenges and ensure accuracy in culturally sensitive contexts.

The development and implementation of PALO is a major step toward reducing interlanguage barriers and enriching communication around the world, but its implementation requires careful consideration and improvement. The authors of this paper state that the code, model, and dataset will be made publicly available. Further contributions to the further development of this field are expected.
