[IndoMMLU] Dataset For Performance Evaluation Of LLM In Indonesian Language

Large Language Models 21/03/2024

3 main points
✔️ Multilingual Proficiency Assessment: Moving away from English-biased assessments, we assessed the proficiency of large language models such as GPT-3.5 and Falcon in Indonesian and regional languages, based on test questions used in Indonesian educational settings.
✔️ IndoMMLU Dataset: Created the first Indonesian-specific dataset of multiple-choice questions covering a wide variety of subjects and educational levels from elementary school to university entrance examinations in Indonesia to assess language proficiency and knowledge of large-scale language models across a wide range of subjects.
✔️ Performance Analysis Based on Real-World Knowledge and Education Level: Analyzed the performance of large-scale language models by subject and education level, with GPT-3.5 in particular showing the highest accuracy, but with challenges in understanding local language and culture.

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
written by Fajri Koto, Nurul Aisyah, Haonan Li, Timothy Baldwin
(Submitted on 7 Oct 2023 (v1), last revised 21 Oct 2023 (this version, v2))
Comments: Accepted at EMNLP 2023
Subjects: Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

English-biased datasets have been used predominantly to assess the ability of large-scale language models (LLMs). These models have shown their performance on tests designed to assess linguistic ability, reasoning ability, and real-world knowledge. However, since the advent of LLMs trained in multiple languages, such as GPT-3.5, Falcon, and BLOOMZ, there is a need to assess performance in languages other than English. In particular, school tests have been carefully designed by education experts and have proven useful in assessing not only language proficiency, but also advanced cognitive skills such as comprehension, analytical skills, and the ability to apply knowledge in a variety of scenarios.

In addition to traditional English-based evaluations, new attempts to reflect region-specific language and culture are required. These are needed to address translation noise, the lack of region-specific content, and the failure to capture language-specific nuances. Against this background, this paper assesses LLM competence using test questions from Indonesian educational settings. The study collects exam questions from a wide range of educational stages, from elementary school to university level, and analyzes them across a variety of subject areas, including STEM, social sciences, humanities, Indonesian language, and regional language and culture.

The study introduces the first Indonesian MMLU dataset, IndoMMLU, which includes 64 different tasks, including nine regional languages and cultures specific to Indonesia. The dataset includes test questions from grades 1 through 12 as well as university entrance exams, allowing for a detailed assessment of LLM proficiency in Indonesian. In addition, the paper assesses several multilingual LLMs, including GPT-3.5 and LLaMA, to test the extent to which these models understand local languages and cultures. Such an effort is a step forward in the evolution of multilingual LLMs and toward a better understanding of languages in a broader cultural context.

IndoMMLU

IndoMMLU is a multiple-choice question set specifically designed for the Indonesian education system. The dataset covers 64 subjects across various educational levels, from elementary school to university entrance exams, and follows the English MMLU format, but is built upon the more finely categorized Indonesian education curriculum.

The Indonesian education system is divided into three levels: elementary school (6 years), middle school (3 years), and high school (3 years), with different subjects taught at each level. In elementary school, students in all grades are taught Indonesian language, civics, mathematics, arts, sports, and religion; in grades 4-6 and middle school, students additionally study foreign languages, local language/culture, science, and social sciences. In high school, students study more specialized natural and social science subjects such as physics, chemistry, biology, geography, sociology, economics, and history. In IndoMMLU, mathematics is explicitly excluded because the questions consist primarily of symbols and have little language content.

Regional language/culture subjects also vary in each Indonesian province and depend on local government policy. For example, in West Sumatra, Minangkabau culture is taught in Indonesian, while in West Java, students are exposed to Sundanese language and culture. This means that IndoMMLU reflects the diversity of education in each region.

To create IndoMMLU, seven professional teachers with a Bachelor's degree in Education were asked to collect publicly available Indonesian school exam questions from web sources. For each question, they recorded metadata such as source URL, school level, class level, question text, choices, and the correct answer. To ensure the quality of the collection effort, workshops on data collection procedures were held, and the collected data went through a rigorous quality control process.

Questions collected by each teacher were randomly checked and manually verified for data accuracy. In addition, an automatic filtering process was used to eliminate duplicate questions and questions with no answers. Finally, the data was organized into 14,981 questions, which were categorized into elementary, middle school, high school, and college entrance exam levels. 30% of IndoMMLU questions were from elementary schools, 24% from middle schools, 32% from high schools, and 14% from college entrance exams. The average length of questions varies by educational level and subject, with questions at the elementary school level tending to be relatively short and those at the university entrance exam level longer.
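The automatic filtering step can be pictured with a short Python sketch. This is an illustration, not the authors' pipeline: the `Question` fields simply mirror the metadata listed above, and the duplicate test (normalised question text plus choices) is an assumption, since the article does not say how duplicates were detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    # Hypothetical record layout mirroring the metadata described in the text.
    source_url: str
    school_level: str
    class_level: Optional[int]
    question: str
    choices: List[str]
    answer: Optional[str]  # e.g. "A"; None or "" when the answer key is missing


def _normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def clean(questions: List[Question]) -> List[Question]:
    """Drop questions with no answer and (assumed) exact duplicates."""
    seen = set()
    kept = []
    for q in questions:
        if not q.answer:
            continue  # no answer key: cannot be scored
        key = (_normalise(q.question), tuple(_normalise(c) for c in q.choices))
        if key in seen:
            continue  # same question text and choices already kept
        seen.add(key)
        kept.append(q)
    return kept
```

A fuzzier duplicate test (for example, similarity thresholds) would catch near-duplicates as well; the exact-match key above is only the simplest option.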

This dataset reflects the complexity and diversity of the Indonesian education system and will be a valuable resource for educational research and machine learning applications.

Experiment

The paper evaluates 24 large language models of different sizes in zero-shot and few-shot settings. These include GPT-3.5, XGLM, Falcon, BLOOMZ, mT0, LLaMA, and Bactrian-X. Each question and its choices are preceded by a simple prompt in Indonesian: "Ini adalah soal [subject] untuk [level]. Pilihlah salah satu jawaban yang dianggap benar!" ("This is a [subject] question for [level]. Please pick the correct answer!").
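As a rough illustration of how such a prompt might be assembled, the Python sketch below fills the Indonesian template quoted above. The function name `build_prompt`, the "A. / B. / C." layout of the choices, and the trailing "Jawaban:" ("Answer:") cue are assumptions made for this example; the article only gives the header sentence.

```python
from typing import List


def build_prompt(subject: str, level: str, question: str, choices: List[str]) -> str:
    """Assemble a zero-shot prompt from the template quoted in the article."""
    header = (
        f"Ini adalah soal {subject} untuk {level}. "
        "Pilihlah salah satu jawaban yang dianggap benar!"
    )
    letters = "ABCDE"
    if len(choices) > len(letters):
        raise ValueError("this sketch supports at most five choices")
    lines = [header, "", question]
    # Assumed layout: one lettered choice per line.
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Jawaban:")  # assumed answer cue, not taken from the article
    return "\n".join(lines)


example = build_prompt(
    "Biologi", "SMA",
    "Organ pernapasan ikan adalah ...",
    ["Paru-paru", "Insang", "Kulit"],
)
```

For a few-shot prompt, solved examples in the same layout would be concatenated before the target question.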

For the closed-source models, answers are evaluated by comparing the first generated tokens (e.g., A, B, C) with the answer key using regular expressions. For the open-source models, two strategies are benchmarked. Given a question and its corresponding choices, we compute (1) the probability of a complete generated answer (Full Answer Probability) and (2) the probability of the first token of the generated answer (First Token Probability).
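The three scoring routes described above can be sketched in Python as follows. Everything here is illustrative: the regular expression is one plausible pattern rather than the one used in the paper, and the log-probability inputs are assumed to have been obtained from a model beforehand (no model is called). The full-answer score is taken as the plain sum of token log-probabilities, without length normalisation, which is also an assumption.

```python
import re
from typing import Dict, List, Optional


def extract_choice(generated: str, n_choices: int) -> Optional[str]:
    """Closed-source route: read the leading answer letter from generated text."""
    letters = "ABCDE"[:n_choices]
    # Optional whitespace and "(" followed by a single letter at a word boundary.
    match = re.match(rf"\s*\(?([{letters}])\b", generated)
    return match.group(1) if match else None


def pick_by_first_token(letter_logprobs: Dict[str, float]) -> str:
    """First Token Probability: choose the letter whose token is most likely."""
    return max(letter_logprobs, key=lambda k: letter_logprobs[k])


def pick_by_full_answer(answer_token_logprobs: Dict[str, List[float]]) -> str:
    """Full Answer Probability: choose the answer with the highest joint log-probability."""
    return max(answer_token_logprobs, key=lambda k: sum(answer_token_logprobs[k]))


# Made-up numbers, purely to show the call shapes.
first_token = pick_by_first_token({"A": -1.2, "B": -0.4, "C": -2.0})
full_answer = pick_by_full_answer({"A": [-0.5, -0.3], "B": [-0.4, -1.5, -0.9]})
```

Because the full-answer score sums over every token, longer answers are penalised in this sketch; the two strategies can therefore disagree on the same question, as the made-up numbers show.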

Accuracy in the zero-shot setup is shown in the figure below. Among the open-source models XGLM (7.5B), Falcon (40B), BLOOMZ (7.1B), mT0xxl (13B), LLaMA (65B), and Bactrian-X (13B), estimating answers based on First Token Probability generally performs better (XGLM is a notable exception).

The table below shows the average accuracy for each subject area for the 24 models. To calculate the scores, we ignore the educational level of the question, compute an average score per subject, and finally calculate a score across all subject areas.
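One reading of this scoring procedure is a two-stage average: accuracy per subject first, then a mean over subjects within each subject area, then a mean over areas. The Python sketch below follows that reading; whether the paper's final score is averaged over areas or directly over subjects is not stated here, so treat the last step as an assumption. The record format is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_accuracy(
    records: Iterable[Tuple[str, str, bool]]
) -> Tuple[Dict[str, float], float]:
    """records: (subject_area, subject, is_correct); education level is ignored."""
    per_subject = defaultdict(list)
    area_of = {}
    for area, subject, correct in records:
        per_subject[subject].append(1.0 if correct else 0.0)
        area_of[subject] = area

    # Stage 1: accuracy of each subject.
    subject_acc = {s: mean(v) for s, v in per_subject.items()}

    # Stage 2: average the subject accuracies within each subject area.
    per_area = defaultdict(list)
    for subject, acc in subject_acc.items():
        per_area[area_of[subject]].append(acc)
    area_acc = {a: mean(v) for a, v in per_area.items()}

    # Stage 3 (assumed): overall score as the mean over subject areas.
    overall = mean(area_acc.values())
    return area_acc, overall


toy = [
    ("STEM", "Physics", True),
    ("STEM", "Physics", False),
    ("STEM", "Biology", True),
    ("Humanities", "History", False),
]
area_scores, overall_score = average_accuracy(toy)
```

Averaging per subject before pooling keeps subjects with many questions from dominating the score.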

Random performance varied between 20% and 27% due to the different number of options.
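The reason the chance level is not a single number can be seen from a small calculation: a uniformly random guess on a question with k options is right with probability 1/k, so the expected accuracy of a subject is the mean of 1/k over its questions. The option counts below are invented for illustration and are not taken from the dataset.

```python
from typing import Sequence


def random_baseline(option_counts: Sequence[int]) -> float:
    """Expected accuracy of uniform random guessing, given options per question."""
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / k for k in option_counts) / len(option_counts)


# Hypothetical subjects: all five-option questions vs. a mix of four and five.
only_five = random_baseline([5, 5, 5, 5])
mixed = random_baseline([4, 4, 5, 5])
```

A subject made up only of five-option questions sits at 20%, and the baseline rises as questions with fewer options are mixed in.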

Overall, GPT-3.5 achieves the highest accuracy, but only 53.2%. GPT-3.5 also has the highest accuracy in each subject area, with the exception of local language and culture subjects. Among the open source models, mT0xxl (13B) achieves an average accuracy of 42.5%; Falcon (40B) performs worse than mT0xxl (13B) and BLOOMZ (7B).

Performance does not scale simply with model size: smaller models such as BLOOMZ (7B) and mT0xxl perform better than Falcon (40B) and LLaMA (65B). This may be due to the lack of Indonesian in the Falcon and LLaMA pre-training data. The lower performance of the 13B and 30B LLaMA models may be an indication that the "emergent abilities" of large language models generally appear in the same or closely related languages. This is further illustrated by the fact that Bactrian-X-LLaMA (13B), a LLaMA model fine-tuned on a 52-language instruction dataset that includes Indonesian, improves average accuracy by 5% compared to LLaMA (13B).

IndoMMLU also includes detailed education level metadata, which allows for a deeper understanding of large language model competence in terms of human education levels. In the Indonesian context, the minimum passing score for an exam varies from subject to subject, usually between 65 and 70. With the passing score set at 65, the real-world knowledge of GPT-3.5 is assessed as shown in the table below. Green indicates that the model passed the subject; red indicates that the model failed the subject.
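The pass/fail marking can be expressed as a simple threshold check. In the Python sketch below, the threshold of 65 comes from the text, while the scores are invented placeholders (not results from the paper) and treating a score of exactly 65 as a pass is an assumption.

```python
from typing import Dict, Tuple

PASSING_SCORE = 65.0  # passing score used in the article


def pass_table(scores: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], str]:
    """Map (subject, school level) -> 'pass' or 'fail' at the 65-point threshold."""
    return {
        key: "pass" if score >= PASSING_SCORE else "fail"
        for key, score in scores.items()
    }


# Placeholder accuracies in percent, for illustration only.
hypothetical_scores = {
    ("Civics", "Elementary"): 80.0,
    ("Civics", "High school"): 65.0,
    ("Local language", "Elementary"): 40.0,
}
table = pass_table(hypothetical_scores)
```

"pass" corresponds to the green cells and "fail" to the red cells described above.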

The table reveals that GPT-3.5 performs well on elementary school exams, but lacks an understanding of the local language and culture. In subjects that do not require as much analytical thinking, such as civics and religion, GPT-3.5 tends to achieve higher scores on the high school exams.

IndoMMLU includes a variety of Indonesian language exams across all grades and educational levels, which allows us to assess Indonesian language proficiency in a large language model. The results are shown in the figure below.

GPT-3.5 reaches its highest accuracy in first grade, approaching 90%. However, as the education level increases, the model's performance gradually declines: scores drop below 75 in grades 3 and above, and in grades 7 and above the model fails to pass. This trend also holds for mT0xxl and BLOOMZ, which pass only in grades 1, 2, and 3. This detailed assessment provides a valuable benchmark for the Indonesian competence of large language models.

Summary

This paper introduces IndoMMLU, a new multi-task benchmark for language understanding in Indonesian languages. The benchmark is used to assess how well current large language models understand local language and cultural knowledge. The results show that GPT-3.5 can pass the Indonesian elementary school exams, but smaller models struggle at almost all educational levels. None of the 24 models evaluated performs well in the area of local language and culture, underscoring the need for greater understanding of those cultures and languages if large language models are to be effective in diverse cultural and linguistic contexts.

The paper also mentions some limitations of IndoMMLU. The current benchmark does not include multimodal questions, arithmetic reasoning tasks, or essay-style questions. These areas should therefore be addressed in future studies to assess model comprehension and critical thinking skills in greater depth.

It is hoped that further evaluations of diverse languages and cultures will lead to the development of language models that can be used more universally.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
