Is It Useful to "Imitate" Language Models?
3 main points
✔️ Recent research suggests that imitating state-of-the-art language models is harder than it appears: fine-tuning on their outputs yields little real improvement, and the base model's underlying knowledge changes little.
✔️ This makes it harder for smaller firms to catch up with larger ones; companies that exploit new data, algorithmic advances, and the resulting capability gaps may be able to build a competitive advantage.
✔️ The introduction of new methods and data is important, and attention to technological constraints will contribute to sustainable development.
The False Promise of Imitating Proprietary LLMs
written by Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song
(Submitted on 25 May 2023)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
This paper examines an approach to improving inexpensive, weaker language models by fine-tuning them on the outputs of more powerful models. Specifically, it seeks to create a new model that mimics a stronger model, such as ChatGPT, by training on that model's output.
The authors trained a variety of models under different conditions and evaluated them. At first, the imitation models appeared superior at following human instructions, but more targeted evaluations revealed that they fell short of the real ChatGPT on specific tasks.
The researchers noted that although the imitation models could mimic ChatGPT's style, there were significant differences in actual performance. They showed that imitation is not as effective as claimed and that important capability gaps remain between open-source and closed models.
Ultimately, the paper concludes that model imitation is not an easy solution; developing a better base model is. It argues that tackling the hard problems, rather than taking easy shortcuts, is the most effective way to improve open-source models.
Recent advances in AI have seen the emergence of powerful language models such as ChatGPT, Bard, and Claude, offered primarily as paid API services by major companies. At the same time, open-source language models (e.g., LLaMA, FLAN-T5) have evolved, offering much of the same basic functionality. Researchers are debating whether the most powerful models will be open source (available to everyone) or closed source (restricted use). Both have advantages and disadvantages, and the outcome could have major implications for policy, corporate strategy, and scientific research.
The research focuses on a technique called model imitation, which attempts to improve open-source models by training new models on the outputs of a stronger model. However, the study shows that while imitation models may appear superior on some tasks, significant gaps remain in their underlying capabilities that are difficult to close with current methods.
The researchers argue that enhancing the underlying capabilities of open-source models is more effective than imitating closed ones, for example by using more diverse, higher-quality training data. As noted above, they conclude that imitation is not an easy solution and that improving base capabilities matters more.
Model imitation is a technique that trains a new model to match a powerful language model, such as ChatGPT, with equivalent or comparable performance. It treats a model exposed through an API as a black box and aims to build a similar model from its outputs alone: users can send queries through the API but cannot see the model's training data or internal structure.
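The black-box setup described above can be sketched in a few lines. The `query_model` wrapper below is hypothetical (a stand-in for a real API call, stubbed out here); the record format is an assumption for illustration, not the authors' actual pipeline:

```python
# Sketch of black-box imitation data collection. The imitator only sees
# the prompt -> response interface of the target model, never its weights
# or training data.

def query_model(prompt: str) -> str:
    """Hypothetical API wrapper: send a prompt, receive the model's text.
    In practice this would call the provider's API; stubbed for illustration."""
    return f"response to: {prompt}"

def collect_imitation_data(prompts):
    """Pair each input prompt with the black-box model's output, producing
    (instruction, output) records to fine-tune the imitation model on."""
    return [{"instruction": p, "output": query_model(p)} for p in prompts]

dataset = collect_imitation_data([
    "What is the capital of France?",
    "Explain photosynthesis briefly.",
])
```

The key point is that everything the imitator learns must come from these query-response pairs, which is why style transfers easily while underlying knowledge does not.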
The purposes of model imitation vary: researchers may use it to advance new projects, companies to develop competing services, and malicious actors to misuse model capabilities. There are two approaches: local imitation, which focuses on a specific task, and broad imitation, which mimics the model across the board.
A growing number of recent studies have attempted to imitate models locally for specific tasks or to perform broad imitation, and many claim that the imitated model is roughly equivalent to the target. The goal of this paper is to evaluate those claims rigorously by training models that imitate ChatGPT and validating their performance in a variety of experiments.
Construction of imitation datasets
Building an imitation dataset is essential for model imitation. There are two possible approaches: task-specific imitation and broad imitation. In both cases, choosing the right set of inputs to send to the target model is key.
For task-specific imitation, the authors constructed a dataset of knowledge questions based on Natural Questions drawn from Wikipedia and other sources. They first selected a set of QA (question-answer) pairs from the validation set, then had ChatGPT generate similar but distinct examples. Each example consists of a single interaction; this dataset is referred to as NQ-synthetic.
Broad imitation made use of the large and diverse samples already widely available on the web. Specifically, the authors collected examples from the ShareGPT website, the Human-ChatGPT Comparison Corpus (HC3), and a Discord ChatGPT bot. These sources provided a large, diverse imitation dataset for free, without submitting queries through the API.
This imitation dataset was used to test the performance of models that imitate ChatGPT.
In summary, two methods of constructing an imitation dataset are presented: imitation focused on a specific task, and imitation built from a wide and diverse set of inputs. If a large input pool is hard to prepare, another option is to have an LM generate samples from a small seed set of inputs.
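Assembling the broad dataset amounts to merging the public sources, dropping duplicates, and shuffling. A minimal sketch follows; the record format, toy examples, and merge logic are assumptions for illustration, not the authors' actual preprocessing:

```python
import random

# Toy stand-ins for the three public sources named in the paper.
sharegpt = [{"source": "ShareGPT", "dialogue": ("user: hi", "assistant: hello")}]
hc3 = [{"source": "HC3", "dialogue": ("user: define AI", "assistant: a field of...")}]
discord = [{"source": "Discord", "dialogue": ("user: help me", "assistant: sure")}]

def build_imitation_mix(*sources, seed=0):
    """Merge sources, drop exact-duplicate dialogues, shuffle into one set."""
    merged, seen = [], set()
    for src in sources:
        for ex in src:
            if ex["dialogue"] not in seen:   # deduplicate on the full dialogue
                seen.add(ex["dialogue"])
                merged.append(ex)
    random.Random(seed).shuffle(merged)      # fixed seed for reproducibility
    return merged

mix = build_imitation_mix(sharegpt, hc3, discord)
```

Deduplication matters here because scraped sources overlap heavily; shuffling prevents the fine-tuning run from seeing one source's style in a contiguous block.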
The models were then trained on the ShareGPT-Mix and NQ-synthetic datasets and evaluated both by humans and automatically. The authors investigated whether imitation could be improved by increasing the amount of imitation data or by scaling up the underlying base LM. Results showed that increasing the imitation data yielded little improvement on automatic evaluations, and sometimes hurt performance, while increasing the size of the base LM did help.
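The training step itself is ordinary supervised fine-tuning: the base LM is optimized to reproduce the target model's outputs token by token with next-token cross-entropy. The sketch below uses a tiny placeholder model and random token ids, assumptions standing in for the LLaMA/GPT-2 bases and real imitation data used in the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM = 32, 16  # toy vocabulary and hidden size, not realistic scales

class TinyLM(nn.Module):
    """Placeholder 'base LM': embedding + linear head over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        return self.head(self.emb(ids))  # (batch, seq, vocab) logits

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# One imitation example: a token sequence whose later tokens play the role
# of the target model's output that the imitator must reproduce.
ids = torch.randint(0, VOCAB, (1, 8))
for _ in range(5):  # a few gradient steps on the imitation data
    logits = model(ids[:, :-1])                       # predict next tokens
    loss = loss_fn(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the objective only rewards matching surface token distributions, it is consistent with the paper's finding that style is captured while factual capability is not.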
They also found that while the imitation models were good at learning style, they were less factually accurate. Crowdworker evaluations rated the imitation models as good as or better than ChatGPT, while NLP benchmarks revealed weak factuality.
In other words, imitation models mainly pick up the "style" or "persona" of the target model without much improvement in actual knowledge or capability. Local, task-specific imitation was reported to be more successful.
Experimental results confirmed this: Figure 4 shows that automatic-evaluation scores remain flat as the amount of imitation data increases, while model quality improves when the base LM is scaled up.
Table 1 shows that the broad-coverage models do not improve zero-shot NQ accuracy, suggesting that local imitation with the NQ-synthetic data is the more feasible route.
Figure 5 shows the low toxicity of the imitation models, highlighting that they inherit the safety and toxicity behavior of the target model.
Recent research thus suggests that attempts to imitate newly developed large language models (LLMs) are harder than expected. The researchers found that fine-tuning existing models on imitation data was largely ineffective and that the models' underlying knowledge changed little.
This could make it difficult for smaller firms to compete on equal footing, as larger firms are expected to stay ahead. However, companies that take advantage of new data, algorithmic advances, and capability gaps may still be able to build competitive advantages.
On the other hand, caution is needed when imitating models: a proprietary model is hard to imitate faithfully, which can lead to unpredictable behavior and the spread of misinformation.
To address this, the researchers suggest not only fine-tuning models but also using other training methods and new datasets. They also warn that model imitation may have implications in other areas; as advances in artificial intelligence diversify, imitation methods may need to adapt.
Introducing new methods and training data is also important when advancing imitation techniques, and attention to applications in other areas and to technological constraints will contribute to sustainable development.