
Developed By NAVER! HyperCLOVA X, A Large-scale Language Model Specialized For The Korean Language


Large Language Models

3 main points
✔️ Focusing on the Korean language and culture, NAVER developed HyperCLOVA X, a large-scale language model that also performs well in other languages
✔️ Excellent multilingual capabilities, including reasoning and problem solving in Korean and English, cross-lingual reasoning, and machine translation
✔️ Developed safely and ethically, addressing social biases and other issues to deliver a safe and reliable AI assistant

HyperCLOVA X Technical Report
written by Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, Youngkyun Jin, Hyein Jun, Jaeseung Jung, Chanwoong Kim, Jinhong Kim, Jinuk Kim, Dokyeong Lee, Dongwook Park, Jeong Min Sohn, Sujung Han, Jiae Heo, Sungju Hong, Mina Jeon, Hyunhoon Jung, Jungeun Jung, Wangkyo Jung, Chungjoon Kim, Hyeri Kim, Jonghyun Kim, Min Young Kim, Soeun Lee, Joonhee Park, Jieun Shin, Sojin Yang, Jungsoon Yoon, Hwaran Lee, Sanghwan Bae, Jeehwan Cha, Karl Gylleus, Donghoon Ham, Mihak Hong, Youngki Hong, Yunki Hong, Dahyun Jang, Hyojun Jeon, Yujin Jeon, Yeji Jeong, Myunggeun Ji, Yeguk Jin, Chansong Jo, Shinyoung Joo, Seunghwan Jung, Adrian Jungmyung Kim, Byoung Hoon Kim, Hyomin Kim, Jungwhan Kim, Minkyoung Kim, Minseung Kim, Sungdong Kim, Yonghee Kim, Youngjun Kim, Youngkwan Kim, Donghyeon Ko, Dughyun Lee, Ha Young Lee, Jaehong Lee, Jieun Lee, Jonghyun Lee, Jongjin Lee, Min Young Lee, Yehbin Lee, Taehong Min, Yuri Min, Kiyoon Moon, Hyangnam Oh, Jaesun Park, Kyuyon Park, Younghun Park, Hanbae Seo, Seunghyun Seo, Mihyun Sim, Gyubin Son, Matt Yeo, Kyung Hoon Yeom, Wonjoon Yoo et al. (296 additional authors not shown)
(Submitted on 2 Apr 2024)
Comments: 44 pages; updated authors list and fixed author names
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)



The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The evolution of large-scale language models (LLMs) has focused primarily on understanding and producing English text. While this has produced a number of powerful large-scale language models that handle English skillfully, these models are limited in their ability to handle non-English languages, particularly Korean, because they mainly reflect the values of North American culture. Korean has unique cultural nuances and region-specific characteristics that such English-centric models struggle to capture.

To meet these challenges, this paper presents HyperCLOVA X, which includes HCX-L, the strongest model, and HCX-S, the lightest model. These models are tailored to the linguistic and cultural characteristics of Korean, while retaining the ability to understand and produce several other languages, including English. In the initial stage, they are pre-trained on an evenly balanced mix of Korean, English, and programming source code data, and then instruction-tuned using high-quality human-annotated datasets.

HyperCLOVA X's capabilities have been demonstrated through benchmark testing on reasoning, knowledge, common sense, factuality, coding, math, chat, instruction following, and harmlessness. Experiments conducted in both Korean and English show that HyperCLOVA X has knowledge specific to the Korean language and culture and demonstrates strong reasoning abilities not found in existing models. It also adheres to strict safety guidelines and performs as well as other strong English-centric large-scale language models.

In addition, HyperCLOVA X has excellent multilingual capabilities. It achieves state-of-the-art performance in cross-lingual reasoning between multiple Asian languages and in machine translation between Korean and other major languages. In particular, cross-lingual transfer between Korean and English is effective, and instruction tuning in one language contributes to the ability to follow instructions in other languages.

The company states that the development of this large-scale language model was conducted with a focus on safety and in accordance with NAVER's AI ethics principles. It also states that assessments were conducted to monitor and mitigate the risk of generating harmful or toxic content, using red teaming and safety data collection processes.

HyperCLOVA X's high performance in Korean as well as other languages provides valuable guidance for regions and countries to develop their own language models. This initiative will also contribute to the realization of "safe, secure, and reliable" AI systems as promoted by the United Nations.

The paper provides a very extensive report on HyperCLOVA X, including the training process, key benchmark evaluations, demonstrations of multilingual capability, the development process and safety concerns, and future directions. This article presents some of the findings.

Training Methods - Pre-training

HyperCLOVA X is a large-scale language model dedicated to the Korean language and its culture, with excellent performance in English and programming code. It is available in two variants, HCX-L (large model) and HCX-S (small model), initially pre-trained on a mix of Korean, English, and code data. After pre-training, their instruction-following ability is enhanced by supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

We begin with the pre-training process: HyperCLOVA X is an updated version of HyperCLOVA (Kim et al., 2021), based on the transformer decoder architecture (Vaswani et al., 2017) with several improvements. It employs rotary position embedding (Su et al., 2024) to extend the context length, and pre-normalization and grouped-query attention (Ainslie et al., 2023) to improve training stability and efficiency.
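As a rough illustration of the grouped-query attention idea mentioned above, the sketch below shows how several query heads can share a single key/value head in a causal decoder. The dimensions, head counts, and function names are illustrative assumptions and do not reflect HyperCLOVA X's actual configuration; rotary position embedding is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """x: (batch, seq, d_model); n_q_heads must be a multiple of n_kv_heads."""
    b, t, d = x.shape
    head_dim = d // n_q_heads
    # Project to many query heads but only a few key/value heads.
    q = (x @ wq).view(b, t, n_q_heads, head_dim).transpose(1, 2)    # (b, Hq, t, hd)
    k = (x @ wk).view(b, t, n_kv_heads, head_dim).transpose(1, 2)   # (b, Hkv, t, hd)
    v = (x @ wv).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    # Each group of query heads reuses the same key/value head.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                           # (b, Hq, t, hd)
    v = v.repeat_interleave(group, dim=1)
    att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    att = att.masked_fill(causal, float("-inf"))                    # decoder-style causal mask
    out = F.softmax(att, dim=-1) @ v                                # (b, Hq, t, hd)
    return out.transpose(1, 2).reshape(b, t, d)

# Example: 8 query heads sharing 2 key/value heads (illustrative sizes only).
d_model, n_q, n_kv = 64, 8, 2
x = torch.randn(1, 16, d_model)
wq = torch.randn(d_model, d_model)
wk = torch.randn(d_model, (d_model // n_q) * n_kv)
wv = torch.randn(d_model, (d_model // n_q) * n_kv)
print(grouped_query_attention(x, wq, wk, wv, n_q, n_kv).shape)  # torch.Size([1, 16, 64])
```

Compared with standard multi-head attention, the key/value projections and cache shrink by the group factor, which is the main efficiency benefit of this design.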

The pre-training data consists of Korean, multilingual (primarily English), and code segments. The multilingual data also includes Japanese, German, and French. Korean data is particularly emphasized, accounting for about one-third of the total. Data is collected from a variety of sources and filtered to remove repetition, low-quality documents, documents containing hate speech or advertising, and personally identifiable information (PII). In addition, knowledge-rich data is upsampled to improve the performance of the large-scale language model.

The key to designing an effective Korean-centric large-scale language model is a well-prepared tokenizer. Korean is an agglutinative language, meaning that words are formed by combining morphemes. Reflecting this, the authors trained a morpheme-aware byte-level BPE (Sennrich et al., 2015) with a vocabulary size of 100,000. The tokenizer has a significant impact on the performance and inference cost of large-scale language models, and HyperCLOVA X's tokenizer is designed to tokenize Korean documents efficiently.
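As a loose illustration of this step, the sketch below trains a byte-level BPE tokenizer with a 100,000-token vocabulary using the Hugging Face tokenizers library. The morpheme-aware part is represented only by a hypothetical morpheme_segment placeholder (for example, pre-segmenting text with a Korean morphological analyzer), and the corpus file and special token are also assumptions; the paper's actual tokenizer pipeline is not public.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

def morpheme_segment(line: str) -> str:
    # Hypothetical placeholder: insert boundaries between Korean morphemes
    # (e.g., with a morphological analyzer) before BPE training sees the text.
    return line

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=100_000,                # vocabulary size mentioned in the paper
    special_tokens=["<|endoftext|>"],  # hypothetical special token
)

# "corpus.txt" stands in for the Korean/English/code pre-training text.
lines = (morpheme_segment(l) for l in open("corpus.txt", encoding="utf-8"))
tokenizer.train_from_iterator(lines, trainer=trainer)
tokenizer.save("byte_level_bpe.json")
```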

In order to acquire both left-to-right generation and infilling capabilities, joint PSM and SPM training (prefix-suffix-middle and suffix-prefix-middle formats) is employed. With this approach, the large-scale language model improves its infilling performance and can be adapted to a variety of applications, including coding assistants. Ninety percent of training is performed at a context length of 4,096 and the remaining 10% at 32,768. Training uses bf16 precision, with flash attention and 3D parallelism.
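The PSM/SPM idea can be sketched as a fill-in-the-middle data transformation: a document is split into prefix, middle, and suffix, rearranged with sentinel tokens, and the model learns to generate the middle last. The sentinel names and random splitting below follow the common recipe and are assumptions, not the paper's exact format.

```python
import random

PRE, MID, SUF = "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"  # hypothetical sentinels

def to_infilling_example(doc: str, mode: str = "psm") -> str:
    # Split the document into prefix / middle / suffix at two random cut points.
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    if mode == "psm":   # Prefix-Suffix-Middle: the middle is generated last
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    else:               # Suffix-Prefix-Middle variant
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

# A fraction of documents would be transformed this way during pre-training,
# while the rest remain ordinary left-to-right text.
print(to_infilling_example("def add(a, b):\n    return a + b\n"))
```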

Training Methods - Alignment Learning

Aligning pre-trained large-scale language models with human intentions and values is important for their use as AI assistants. HyperCLOVA X is trained with two alignment techniques: SFT (supervised fine-tuning) and RLHF (reinforcement learning from human feedback).

The first stage of alignment learning is SFT. In this phase, the pre-trained HyperCLOVA X models are trained to generate the best response to each prompt. SFT improves the model's ability to follow instructions and solve tasks such as coding and creative writing. It also enables the model to draw on a wide range of knowledge, from common sense to science and ethics.

The SFT dataset defines special tokens, '<<user>>', '<<assistant>>', and '<<end>>', to distinguish between user and assistant turns. This ensures that each role within a context is clearly distinguished. For training multi-turn samples, loss masking is applied to text outside of the assistant's turn.
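The turn markers and loss masking can be pictured with a short sketch: each turn is wrapped in its role tokens, and the loss mask is 1 only over assistant tokens. The encoding function and formatting details are illustrative assumptions.

```python
def build_sft_example(turns, encode):
    """turns: list of (role, text) pairs with role in {'user', 'assistant'};
    encode: any function mapping a string to a list of token ids."""
    input_ids, loss_mask = [], []
    for role, text in turns:
        ids = encode(f"<<{role}>>{text}<<end>>")
        input_ids.extend(ids)
        # The loss is computed only on assistant tokens; everything else is masked out.
        loss_mask.extend([1 if role == "assistant" else 0] * len(ids))
    return input_ids, loss_mask

# Toy usage with a byte-level "encoder" just to show the masking pattern.
ids, mask = build_sft_example(
    [("user", "Hello"), ("assistant", "Hi there!")],
    encode=lambda s: list(s.encode("utf-8")),
)
```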

It also uses an efficient batching strategy that groups sequences of similar length to minimize padding within mini-batches and maximize GPU utilization. The maximum number of tokens per mini-batch is kept constant, so the number of sequences in each mini-batch is determined by their average length.
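A minimal sketch of this batching strategy, assuming an illustrative token budget: samples are sorted by length and packed into mini-batches whose padded size stays under a fixed token count, so shorter sequences yield larger batches.

```python
def length_grouped_batches(samples, max_tokens_per_batch=65_536):
    """samples: list of token-id lists. The budget value is illustrative."""
    batches, current, longest = [], [], 0
    for sample in sorted(samples, key=len):
        # Padding every sequence to the longest one costs longest * batch_size tokens.
        if current and max(longest, len(sample)) * (len(current) + 1) > max_tokens_per_batch:
            batches.append(current)
            current, longest = [], 0
        current.append(sample)
        longest = max(longest, len(sample))
    if current:
        batches.append(current)
    return batches
```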

The next step is RLHF (reinforcement learning from human feedback). The model after SFT can perform many tasks but may still produce inaccurate output or harmful content; in RLHF, the model is further aligned with human values (helpfulness, factuality, safety). The method uses human preference data to train a reward model, and then trains the post-SFT model with PPO (proximal policy optimization) to maximize the reward returned by the reward model.

The reward model is initialized from the post-SFT model with a randomly initialized linear head that outputs a scalar reward. Based on the Bradley-Terry model, it is trained with a ranking loss that minimizes the negative log-likelihood of the difference between the rewards of the chosen and rejected responses. The reward model is trained for only one epoch, with the optimization steps adjusted for the comparison data to prevent overfitting.
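The ranking objective described here has a compact standard form. The sketch below shows the Bradley-Terry style negative log-likelihood, assuming a reward_model that returns a scalar score for a (prompt, response) pair; it is the textbook formulation rather than code from the paper.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar reward for the preferred response
    r_rejected = reward_model(prompt, rejected)  # scalar reward for the rejected response
    # L = -log sigmoid(r_chosen - r_rejected): push the chosen reward above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```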

Reward model datasets are collected based on diverse product requirements. Differences in reward distributions across data sources pose a risk of reward hacking and make training difficult; to mitigate this, normalization and clipping are applied during inference.

In addition, PPO (proximal policy optimization) is used for reinforcement learning. A Kullback-Leibler (KL) penalty term is added to the reward with a coefficient of 0.04, and the policy network is initialized from the post-SFT model. Many prior studies have reported an increase in output length after RLHF, and this phenomenon is observed in this paper as well, indicating that the model tends to favor longer sequences. To prevent this, a set of instructions that constrain the length and format of responses is used, and an early-stopping mechanism is introduced.
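The KL-penalized reward can be sketched as follows, using the 0.04 coefficient mentioned above; the per-token log-probabilities and the sequence-level KL estimate are illustrative assumptions about how the penalty is computed.

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.04):
    """rm_score: scalar reward-model score for a sampled response;
    *_logprobs: per-token log-probabilities under the policy and the frozen post-SFT reference."""
    kl = (policy_logprobs - ref_logprobs).sum()  # simple sequence-level KL estimate
    return rm_score - kl_coef * kl
```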

Due to the nature of the transformer architecture, large language models are known to be prone to repetition. To address this, sequence-level unlikelihood training is integrated with PPO, which effectively reduces repetition.
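As a rough sketch of the sequence-level unlikelihood idea (Welleck et al., 2019): tokens that complete an n-gram already seen earlier in the generated sequence receive a -log(1 - p) penalty, discouraging repetition. The n-gram size and the way this term would be combined with PPO are assumptions.

```python
import torch

def sequence_unlikelihood_loss(logprobs, tokens, n=4):
    """logprobs: (seq_len,) log-probability of each generated token; tokens: (seq_len,) token ids."""
    penalties, seen = [], set()
    for t in range(n - 1, len(tokens)):
        ngram = tuple(tokens[t - n + 1: t + 1].tolist())
        if ngram in seen:
            # Penalize the token that completes a repeated n-gram.
            p = logprobs[t].exp().clamp(max=1 - 1e-6)
            penalties.append(-torch.log(1 - p))
        seen.add(ngram)
    return torch.stack(penalties).sum() if penalties else logprobs.new_zeros(())
```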

The PPO phase requires four models where SFT requires only one, each operating sequentially within each iteration. To optimize this process, each model's devices are partitioned in a multi-node setup, and asynchronous processing is implemented to reduce training time and improve efficiency.

Alignment learning involves a variety of synchronous and asynchronous phases. To automate these workflows, an event-driven pipeline is implemented to optimize the process in terms of human resources, computational resources, and time. For example, evaluation at intermediate checkpoints is automated to reduce training time.

It also automates the SFT, RM, and PPO learning processes, reducing human intervention. Training runs on NAVER Smart Machine Learning (NSML), NAVER's high-performance computing system. Metadata is securely stored and shared in an in-house machine learning operations tool and efficiently analyzed using MLflow.

Benchmarks

Many benchmarks targeting different capabilities have been proposed to objectively evaluate the performance of large language models. Here we summarize HyperCLOVA X's performance on the core benchmarks.

A major limitation in the evaluation of multilingual language models is the lack of a comprehensive evaluation framework for languages other than English. Competence in a particular language requires not only linguistic proficiency, but also a deep understanding of the cultural and social nuances specific to the speakers of that language. Therefore, this paper systematically uses widely recognized English and Korean benchmarks to assess HyperCLOVA X's bilingual and general competence.

Because core competencies such as reasoning, world knowledge, and mathematics transcend language, some of the benchmarks assessing these skills are in English. On the other hand, to assess language-specific questions and cultural nuances, we use benchmark categories tailored to each language. For the Korean assessment, we use benchmarks that have been meticulously created by experts or curated from existing, widely recognized ones. This includes a comprehensive internally constructed Korean benchmark, KoBigBench (KBB), as well as a set of Korean-specific questions from KMMLU (Son et al., 2024). This ensures that the model's understanding of the cultural and social context of the Korean language is rigorously assessed.

Because HyperCLOVA X has unique capabilities in both Korean and English, and because no directly comparable models exist, it is compared to Korean-specific large-scale language models and general base models to assess its diverse capabilities.

To assess proficiency in Korean, we use large-scale language models designed specifically for the Korean language, as well as models further trained for it. For example, Polyglot-Ko is an open-source language model built specifically for Korean. The SOLAR chat variant is based on the LLaMA 2 architecture and is further trained on Korean datasets. LLaMA 2 Ko and LLaMA 2 KoEn are also used as Korean language models. KORani is an open-source language model built on Polyglot-Ko and on Korean models further trained from LLaMA 2, and EEVE-Korean-v1 (Kim et al., 2024b) extends the Korean vocabulary for more efficient tokenization.

HyperCLOVA X is also compared with powerful general-purpose base models, such as Falcon and LLaMA 2, which are known as competitive models in terms of multilingual support and comprehensive capabilities.

To assess a model's knowledge and reasoning ability, questions are posed and the responses obtained are analyzed. There are two main evaluation approaches.

One is an open-ended question-answering approach, in which the model generates free-form answers that are checked against the correct answers (e.g., BigBench-Hard). The other is a closed-ended question-answering approach, in which the model must select one or more answers from a given set of candidates (e.g., MMLU).

Free-form answer generation is relatively simple, but candidate selection requires instruction-following ability and a few in-context examples. Multiple-choice questions can be reformatted as likelihood tests, but these are sensitive to the prompt, and small changes can cause scores to fluctuate. To reduce this sensitivity and increase the reliability of the assessment, the prompts are presented in an actual multiple-choice format consistent with the intent of each benchmark.
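To make the evaluation setup concrete, here is a minimal sketch of presenting an item in actual multiple-choice format rather than scoring each option by likelihood; the prompt wording and example question are purely illustrative.

```python
def format_multiple_choice(question, options):
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_multiple_choice(
    "Which city is the capital of South Korea?",
    ["Busan", "Seoul", "Incheon", "Daegu"],
)
# The letter the model generates (e.g., "B") is compared with the gold answer.
```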

Performance comparisons between HyperCLOVA X and other leading open-source large-scale language models have been conducted on a wide range of benchmarks combining Korean and English tests, with the largest model in the HyperCLOVA X family used for comparison. The results show that HyperCLOVA X significantly outperforms all other Korean-focused models on the comprehensive Korean benchmarks. In addition, HyperCLOVA X performs as well as the largest LLaMA 2 model on the English-focused benchmarks. Overall, HyperCLOVA X proves to be a large language model with superior capabilities in bilingual environments covering both Korean and English.

The following benchmarks are used to assess comprehension of the Korean language from multiple perspectives:

  • KoBigBench (KBB)
    • KoBigBench is a comprehensive benchmark specific to the Korean language and based on BigBench (Srivastava et al., 2022). The benchmark covers knowledge exploration tasks across disciplines such as law, history, mathematics, and computer science, as well as tasks involving common sense reasoning and bias.
  • KMMLU
    • KMMLU (Korean Massive Multitask Language Understanding) was developed to measure large-scale multitask language understanding in Korean, consisting of 35,030 expert-level multiple-choice questions across 45 subjects to capture linguistic and cultural aspects of the Korean language. Assessments follow the original set-up (5 shots), with some assessments administered internally.
  • HAE-RAE Bench
    • The HAE-RAE Bench is a benchmark designed to assess Korean cultural and linguistic knowledge. It consists of tasks across four main areas: vocabulary, history, general knowledge, and reading comprehension. It uses a zero-shot problem-solving template and follows the original paper setup.

The results from the benchmark are shown in the table below. There is a noticeable performance difference between the Korean-specific and non-Korean-specific models. The differences are particularly large for the HAE-RAE, KBB, and KMMLU benchmarks, which require a deep understanding of the Korean social context. This indicates that acquiring large amounts of high-quality data from the target language community is essential for building successful region-specific large-scale language models.


In addition, the following benchmarks are used to assess English comprehension:

  • Massive Multi-Task Language Understanding (MMLU)
    • MMLU (Hendrycks et al., 2020) is a benchmark covering 57 real-world subjects that assesses broad knowledge and problem-solving skills; evaluation uses a 5-shot example scheme.
  • BigBench-Hard (BBH)
    • Part of BIG-Bench (Srivastava et al., 2023), BBH consists of 23 particularly challenging tasks. Each task uses 3-shot examples to elicit responses from the base model without chain-of-thought reasoning.
  • AGIEval
    • AGIEval (Zhong et al., 2023) tests the model using standardized tests such as college entrance exams and bar examinations; it uses a 0-shot example and utilizes an English subset in multiple-choice format.

Benchmark results are shown in the table below, with little difference in performance between HCX-L and the largest model in the LLaMA 2 family. HyperCLOVA X improves its problem-solving ability by using intermediate inference steps: when chain-of-thought (CoT) prompting is employed, HCX-L's MMLU score improves by 1.87 points, reaching 69.78, and by sampling self-consistent reasoning chains 10 times, the score reaches 70.79. In contrast, applying CoT to LLaMA 2 70b reduces its MMLU score by 2.62 points.

In addition, the following benchmarks are used to assess common sense reasoning and comprehension skills in English:

  • HellaSwag
    • HellaSwag (Zellers et al., 2019) is a common benchmark for assessing common-sense abilities. It asks language models to complete an ordinary sentence from several candidate options. Questions that seem easy to a human may be challenging for the model. The questions are in multiple-choice format and use a 5-shot example.
  • Winogrande
    • Winogrande (Sakaguchi et al., 2021), a large-scale extension of the Winograd Schema Challenge (WSC), is a set of cloze-style pronoun resolution problems. These problems are specifically designed to assess common sense reasoning ability: unlike approaches that rely on simple word association, Winogrande requires deeper reasoning. The benchmark consists of binary-choice questions, and the evaluation protocol uses a 5-shot approach.
  • PIQA
    • The Physical Interaction Question Answering (PIQA) benchmark (Bisk et al., 2020) tests physical common sense reasoning. In this task, models are asked to answer questions about the physical world. Due to the lack of training and validation sets, the evaluation protocol uses a 0-shot learning scheme.
  • AI2 Reasoning Challenge (ARC)
    • ARC (Clark et al., 2018) is a common benchmark for assessing common sense reasoning. The dataset consists of grade-school-level questions and answers, and is available in two subsets: easy and challenge. The evaluation protocol uses both subsets and employs a prefix-matching scheme for fair comparison with the base models.
  • CommonsenseQA (CSQA)
    • CommonsenseQA (Talmor et al., 2019) is a question-answer dataset that requires the use of prior common sense knowledge to predict the correct answer, rather than simple word association. The evaluation protocol uses a 5-shot example to provide a reliable assessment.

The results for common sense reasoning ability are shown in the table below. The performance on Winogrande and CSQA is particularly noteworthy, as these benchmarks eliminate superficial word associations and require a deep understanding of the world and common sense. SOLAR and EEVE, on the other hand, are further trained from the Mistral (Jiang et al., 2023) backbone and show an advantage in physical-interaction common sense reasoning and on HellaSwag.

In addition, the following benchmarks are used to assess the knowledge that language models possess:

  • Natural Questions (NQ)
    • Natural Questions (Kwiatkowski et al., 2019) is a collection of open-ended questions collected from real search engine queries. Each question has multiple candidate answers, and a response is considered correct if it matches one of them. A prefix-matching evaluation method is employed so that base models that have not been trained on instruction datasets can also be evaluated, and 5-shot examples are used.
  • TriviaQA
    • TriviaQA (Joshi et al., 2017) is a large reading comprehension dataset consisting of over 600,000 question-answer-evidence triples. Recent assessments test the knowledge of language models using question-answer pairs without context. This benchmark is well suited for assessing the knowledge capacity of models because it includes questions about a variety of facts from around the world; it uses 5-shot examples and prefix matching so that non-instruction-tuned base models can also be included as baselines.
  • CLIcK
    • This new dataset (Kim et al., 2024a) is designed to assess linguistic and cultural knowledge of the Korean language. It curates categories related to Korean popular culture, politics, and traditions and assesses them in a zero-shot setting.
  • Factscore
    • Factscore (Min et al., 2023) assesses the ability to generate factual information about a given entity, such as the biography of a particular person. HyperCLOVA X and other LLMs are analyzed for factuality on English and Korean datasets. Measuring a Korean Factscore requires translating the prompts and using the Korean Wikipedia dataset, which was curated to include only comprehensive documents.

However, base models and low-performing large language models often repeat the same sentence at the end of their output. To ensure content quality, these repetitions are removed. Also, if a large language model produces nonsense words, it is considered to have failed to provide an adequate response. If the model generates an English description of a Korean Wikipedia title, the output is translated and the Factscore is then calculated.

The results are presented in the table below, showing HyperCLOVA X's assessment on NQ, TriviaQA, a subset of CLIcK, and Factscore on the Korean Wikipedia dataset. Because the NQ and TriviaQA datasets were collected from English-speaking users, Korean-focused models are at a disadvantage; however, Korean models such as KORani and EEVE are further trained from English-centric base models (Mistral and LLaMA 2) and are therefore less affected.

LLaMA 2 and the Polyglot-Ko models are limited in their ability to provide reliable accounts of the biographies of Korean and other Asian figures. On the other hand, the HyperCLOVA X models and EEVE-Korean-v1 show a greater ability to accurately convey information about a given entity. This result indicates that the fact-generation ability of the HCX-L model on the Korean dataset is superior to the other baseline models.

Scores computed from translated output are marked with an asterisk (*).

Summary

HyperCLOVA X has made important advances in the area of large-scale language modeling. While focusing specifically on the Korean language and culture, HyperCLOVA X maintains high proficiency in English and other languages. Through a training process that balances Korean, English, and programming code, followed by supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), HyperCLOVA X excels in a wide variety of tasks.

HyperCLOVA X has demonstrated high performance in a wide range of benchmarks, including Korean and English reasoning, coding and math problem solving. It also has excellent multilingual capabilities, particularly in cross-language reasoning and machine translation, demonstrating its versatility and applicability in diverse language environments. Furthermore, its commitment to responsible AI development and deployment is demonstrated through safety assessments and adherence to ethical principles. Through its advanced handling of ethical issues such as toxicity and social bias, and through systematic red teaming and safety data collection processes, HyperCLOVA X demonstrates its potential as a safe and reliable AI assistant. Overall, HyperCLOVA X sets new standards for bilingual and multilingual large-scale language models and demonstrates the potential of more inclusive and culturally sensitive AI technologies.

The paper states that future work includes exploring multimodality and expanding the ability to seamlessly process and integrate diverse data types such as text, images, and audio. The paper also states that it seeks to explore the effectiveness of model quantization techniques to optimize HyperCLOVA X's inference without compromising accuracy or output quality.

By actively researching the integration of external tools and APIs, HyperCLOVA X is expected to be able to access specialized data sets and services that will greatly improve the factuality of its answers.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us