
Chain Of Thoughts Attribute Manipulation (CoTAM), A New Efficient Few-shot Learning Method That Leverages Large-scale Language Models


3 main points
✔️ Propose CoTAM, a novel few-shot learning method driven by large-scale language models
✔️ Efficiently trains small-scale models with data manipulated by large-scale language models
✔️ Fine-tuning and instance-based results on multiple tasks, including text classification, demonstrate CoTAM's benefits

Generating Efficient Training Data via LLM-based Attribute Manipulation
written by Letian Peng, Yuwei Zhang, Jingbo Shang
(Submitted on 14 Jul 2023)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In recent years, large-scale language models have shown a remarkable ability to learn from only a handful of examples. However, this capability requires expensive large-scale models, and their operational cost is a major challenge. In addition, the need to concatenate contexts, including demonstrations, to every test input during inference increases the computational burden. To address this, methods are being explored that leverage large-scale language models to develop smaller-scale language models.

Previous work has achieved efficient few-shot learning by having a large-scale language model generate new data based on few-shot demonstrations and then using that data to fine-tune a small-scale, pre-trained language model. With this approach, the small-scale model can be deployed offline without querying the large-scale language model, increasing the efficiency of inference. However, the generated data is uncontrolled and limited in information content, which can lead to spurious correlations. As shown on the left in the figure below, uncontrolled data is highly variable, making it difficult for small-scale models to learn.


This paper investigates a more controlled and efficient method of generation. The proposed approach is inspired by attribute manipulation in computer vision, where attributes are manipulated in the encoder's latent space to reconstruct new instances. The same idea is applied to the language domain: task-specific semantics (e.g., sentiment) are manipulated while the rest of the sentence's meaning is preserved. As shown in the figure below, controlled attribute manipulation can efficiently find precise decision boundaries by moving along the direction of the task-specific attribute while preserving the other attributes.


Attribute manipulation in language presents two main challenges. First, it is difficult to select an appropriate set of attributes: sentences contain a variety of attributes (e.g., topic, sentiment, intent), which may vary from domain to domain and dataset to dataset, so using a predefined set of attributes is labor-intensive and limited. Second, reconstructing sentences with manipulated attributes requires a sophisticated understanding of semantics. Traditional methods rely on random masking to reconstruct sentences, which significantly reduces the diversity and validity of the generated sentences.

To address these challenges, this paper proposes a Chain-of-Thoughts (CoT) based method called Chain-of-Thoughts Attribute Manipulation (CoTAM).

It uses an instruction-tuned large-scale language model to manipulate attributes and reconstruct new sentences. Specifically, the large-scale language model is prompted in three steps. Step 1 queries the large-scale language model to decompose the sentence into multiple attributes that are independent of the task-specific attribute; this dynamic set of attributes captures the uniqueness of a single sentence and fits any domain without model fine-tuning. Step 2 instructs the large-scale language model to output guidelines for switching the task-specific attribute while maintaining the others. Finally, Step 3 prompts the large-scale language model to reconstruct the sentence based on the guidelines from Step 2.

All of these steps are performed in a single query to the large-scale language model, ensuring consistency between attribute manipulation and reconstruction. Furthermore, the use of a large-scale language model improves the interpretability of the proposed framework, because the attributes are completely transparent to the user.

In this paper, CoTAM is evaluated in a few-shot setting on four natural language tasks (text classification, natural language inference, text similarity, and multiple-choice question answering). It is compared against strong baselines that use the same large-scale language model and generate the same amount of data. The quality of the generated data is evaluated by fine-tuning a small-scale language model, and the evaluation is extended to more parameter-efficient, instance-based methods. Both sets of results show significant and consistent performance gains.

About CoTAM

Language modeling is the basis of the human-like linguistic competence of large-scale language models. The training objective is to maximize the probability of predicting the next token in human text, where <sos> denotes the starting token of the sequence. By training on very large corpora, current large-scale language models achieve excellent zero-shot performance and can process natural language according to human prompts. To give instructions to a large-scale language model, one inputs a prompt Z, and the model generates a response W based on it, which is decoded and returned as the output. In CoT (Chain of Thought) prompting, the large-scale language model is first guided to solve simple premise problems so that it can then better achieve the instructed goal.
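For reference, the next-token objective described here can be written as follows (a standard formulation in the notation of this summary, not an equation reproduced from the paper):

\[
\max_{\theta} \; \sum_{i=1}^{|W|} \log P_{\theta}\!\left(w_i \mid \langle \mathrm{sos} \rangle, w_1, \ldots, w_{i-1}\right)
\]

where W = (w_1, ..., w_{|W|}) is a human-written token sequence and θ denotes the model parameters.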

Fine-tuning is typically used to train a small-scale model. This model consists of a text embedder E and a classifier C. Given an input text W, the embedder E maps it to a representation X ∈ R^d, where d is the size of the latent space. C then maps X to a probability distribution P ∈ R^c, where c is the number of classes. The cross-entropy loss between P and the ground-truth label Y is computed, and the parameters of the model are updated by backpropagation.
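As a concrete illustration of this embedder-classifier setup, here is a minimal PyTorch-style sketch; the model choice, hyperparameters, and variable names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SmallClassifier(nn.Module):
    """Embedder E maps text to X in R^d; classifier C maps X to logits over c classes."""
    def __init__(self, backbone="roberta-large", num_classes=2):
        super().__init__()
        self.embedder = AutoModel.from_pretrained(backbone)   # E
        d = self.embedder.config.hidden_size                  # latent size d
        self.classifier = nn.Linear(d, num_classes)           # C

    def forward(self, input_ids, attention_mask):
        hidden = self.embedder(input_ids=input_ids, attention_mask=attention_mask)
        x = hidden.last_hidden_state[:, 0]    # X: [CLS]-style sentence representation
        return self.classifier(x)             # logits; softmax into P is applied by the loss

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = SmallClassifier()
loss_fn = nn.CrossEntropyLoss()               # cross-entropy between P and label Y
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer(["an example sentence"], return_tensors="pt", padding=True)
labels = torch.tensor([1])
logits = model(**batch)
loss = loss_fn(logits, labels)
loss.backward()                               # backpropagation updates E and C
optimizer.step()
```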

The goal of CoTAM is to generate, from a large-scale language model, data efficient enough to fine-tune a well-performing small-scale model with minimal training data. The idea is to generate pairs (or groups) of data that differ in the classification target but share all other attributes. Variations in the prediction P between the members of such a pair (or group) can then be attributed primarily to variations in the target attribute, which reduces the complexity typically introduced by noisy real-world data. As a result, the cross-entropy loss between P and the ground-truth label Y, used to update the model parameters by backpropagation, becomes a more accurate indicator of the influence of the target attribute on the classification.
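As a purely hypothetical illustration of such a pair (this example is ours, not taken from the paper), only the task attribute (sentiment) flips while the other attributes, such as topic and style, stay fixed:

```python
# Hypothetical CoTAM-style training pair for sentiment classification.
# Only the task attribute (sentiment) is switched; other attributes
# (topic: restaurant review, length, style) are preserved.
pair = [
    {"text": "The food at this little bistro was absolutely delightful.", "label": "positive"},
    {"text": "The food at this little bistro was thoroughly disappointing.", "label": "negative"},
]
```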

To create such data, the paper draws on attribute manipulation, which has mainly been applied to facial attributes. As shown in the figure below, a learned encoder maps the input image to a representation in latent space. That representation is then transformed in latent space and reconstructed into an image. As a result, the reconstructed image undergoes a distinct change in the manipulated attribute while retaining the other attributes. The difference between the initial image and the reconstructed image thus allows a classifier to be trained efficiently on the switched attribute.

Attribute manipulation is applied to text by leveraging the powerful text manipulation capabilities of large-scale language models (OpenAI, 2023). Specifically, a CoT query is created to decompose the input text into a number of attributes, which approximate the latent space. The CoT then switches the target attribute of the task and prompts the large-scale language model to reconstruct the manipulated sentence. The main issue is how to approximate the latent space with attributes. The latent space in facial attribute manipulation represents a fixed set of explicit or implicit attributes, but a fixed attribute set cannot be applied to text, because general attributes are not shared across texts the way they are across face images.

Fortunately, large-scale language models can propose text attributes (Wang et al., 2023) and thus meet the requirement of dynamic attribute decomposition. This means that different input texts can each be represented by the dynamic attribute set proposed by the large-scale language model. The CoT is therefore built with the large-scale language model, drawing inspiration from facial attribute manipulation. First, the large-scale language model is asked to suggest attributes other than the known one (the annotation label). Next, it is instructed to consider how to generate sentences in which only the switched label differs. Finally, it is guided to complete the attribute manipulation by writing such sentences. Throughout the CoT, the large-scale language model serves as both decomposer and reconstructor, thanks to its powerful text manipulation capabilities.

Following the macro-level design of CoTAM, the first step of the CoT is to decompose the sentence into its various attributes. The instruction for this step is as follows:

What are some other attributes of the above sentence except "<Attr>"?

Here, <Attr> refers to a known attribute in the dataset. For example, <Attr> is "positive sentiment" in a sentiment analysis task. As a result, the large-scale language model suggests a set of other attributes that should be maintained during reconstruction.

The second step has the large-scale language model suggest how to reconstruct the sentence using the switched attribute and the other attributes from the decomposition step. This step is included so that the model works out how to accomplish the goal, and it is critical to CoT reasoning. The instruction for this step is as follows:

How to write a similar sentence with these attributes and "<New Attr>"?

Here, <New Attr> is the switched <Attr>; for example, if <Attr> is "sentiment: positive," then <New Attr> is "sentiment: negative." This step outputs the guidelines that the large-scale language model uses for the final reconstruction.

The third and final step asks the large-scale language model to reconstruct a sentence with the one switched attribute, using the following instruction:

Write such a sentence without any other explanation.

Here, the constraint "without any other explanation" is added only to improve sampling efficiency. These are the CoT steps implemented in CoTAM, which allow the large-scale language model to effectively decompose and reconstruct sentences.
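To make the single-query pipeline concrete, the sketch below assembles the three instructions quoted above into one prompt. The prompt wording follows this article, but the exact message formatting, model name, and client usage are assumptions for illustration rather than the authors' implementation.

```python
from openai import OpenAI   # assumes the OpenAI Python client; any chat LLM API would do

client = OpenAI()

def cotam_query(sentence: str, attr: str, new_attr: str) -> str:
    """One chain-of-thought query: decompose -> plan the attribute switch -> reconstruct."""
    prompt = (
        f'"{sentence}"\n'
        f'1. What are some other attributes of the above sentence except "{attr}"?\n'
        f'2. How to write a similar sentence with these attributes and "{new_attr}"?\n'
        f'3. Write such a sentence without any other explanation.'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,   # the experiments use temperature 0 for quality and reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: flip the sentiment attribute of an SST-2-style sentence (illustrative input).
# new_sentence = cotam_query("a gorgeous, witty, seductive movie.",
#                            "sentiment: positive", "sentiment: negative")
```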

Experiment

In the experiments, six datasets are used to test CoTAM's benefits on text classification and other tasks. The text classification datasets are SST-2 (sentiment polarity), TweetEmo (fine-grained emotion), and AGNEWS (topic). The other datasets are MNLI (natural language inference), MRPC (semantic textual similarity), and CSQA (multiple-choice question answering); MNLI includes matched (MNLIm) and mismatched (MNLImm) evaluation sets. To obtain multiple results efficiently, results are reported on the validation set when the test set is not publicly available.

The dataset is constructed by querying GPT-4 with the CoT. The temperature of the large-scale language model is set to 0, aiming for high quality and reproducibility. In each dataset, CoTAM is applied to 200 sentences to create a small subset from which the training data is sampled. For a fair comparison, this subset is also used to generate data for the other baselines.

CoT Data Augmentation (CoTDA) is a large-scale language model-based augmentation strategy refined with the CoT scheme of this paper. Rather than asking for augmentations directly, the large-scale language model is asked, via the CoT, to suggest ways to write sentences with the same attributes as the input sentence. CoTDA is the main baseline for examining the importance of attribute switching in CoTAM. Each seed example is augmented N-1 times (where N is the number of classes in the dataset) at a temperature of 0.1, so CoTDA generates the same number of new examples as CoTAM, ensuring a fair comparison.

FlipDA is a traditional label-switching augmentation method based on conditional generation with a fully tuned T5. Specifically, a sentence is combined with a switched label and fed into T5. Several spans of the sentence are then randomly masked and recovered by T5 according to the new label, switching the meaning of the sentence. Since the original FlipDA requires a large supervised dataset and cannot be applied in the few-shot setting, a large-scale language model-based FlipDA (FlipDA++) baseline is built by sending span-substitution instructions to a large-scale language model.

Text labeled directly by humans or by the large-scale language model is also used. Human annotation includes K-shot and NK-shot settings. The K-shot setting represents the baseline before integrating data generated by the large-scale language model; the NK-shot setting has the same amount of training data as CoTAM and, because it uses human annotations, is expected to be an upper bound for CoTAM. However, CoTAM can exceed this upper bound thanks to the high data quality obtained through attribute manipulation. Large-scale language model annotation in the NK-shot setting represents a simple baseline that is typically applied when large amounts of unlabeled in-domain data are available. By default, K is set to 10, and all reported results are the average of 10 runs to eliminate bias.

A straightforward way to assess data quality is to fine-tune a model on the data and check its performance. In this paper, RoBERTa-Large is chosen as the learner for the different datasets. If a validation set is not available, the model is trained for 32 epochs and then evaluated.

As shown in the table below, CoTAM achieves the best fine-tuning results for all seven tasks compared to other large-scale language model-based data generation methods.

In six of the seven tasks, CoTAM exceeds the expected upper bound of (N-way) NK-shot human annotation. This indicates that carefully generated data from large-scale language models may train better models than the same number of human annotations. It also confirms that CoTAM improves data efficiency through attribute manipulation.

For few-shot text classification, text embeddings have proven to be a powerful tool for improving performance and efficiency.

In instance-based inference, a text-embedding model transforms an input sentence into a representation, and the label of the input is determined by its proximity to the representations of annotated sentences. In this experiment, Nearest Centroid (NC) and K-Nearest Neighbors (KNN) are used as tuning-free algorithms and applied to three text classification datasets. In NC, the centroid of each class is defined as the average representation of the sentences sharing that label, and an input sentence is assigned the label of the closest centroid. In contrast, KNN labels an input sentence with the most common label among its K nearest neighbors; here K is set to 5. To encode the text, the Simple Contrastive Sentence Embedding (SimCSE) model with RoBERTa-Large as the backbone is used.
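A minimal sketch of the two instance-based algorithms, assuming a SimCSE checkpoint from Hugging Face and scikit-learn classifiers; the checkpoint name, example sentences, and pooling choice are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# SimCSE encoder with a RoBERTa-Large backbone (checkpoint name is an assumption).
name = "princeton-nlp/sup-simcse-roberta-large"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

def embed(sentences):
    """Encode sentences into SimCSE [CLS] representations."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch)
    return out.last_hidden_state[:, 0].numpy()

# Toy labeled pool (seed + generated data) and a test sentence; examples are illustrative.
train_texts = ["a gorgeous, witty film", "a dull, lifeless film",
               "the acting was superb", "the acting was dreadful",
               "an uplifting story", "a tedious, forgettable plot"]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["a wonderful and moving movie"]

X_train, X_test = embed(train_texts), embed(test_texts)

nc = NearestCentroid().fit(X_train, train_labels)                      # centroid = mean embedding per label
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)   # K = 5 as in the experiments

print("NC prediction: ", nc.predict(X_test))
print("KNN prediction:", knn.predict(X_test))
```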

The table below shows the performance of the different data generation methods when using instance-based algorithms. Compared with the other text generation methods (such as FlipDA and CoTDA), CoTAM shows superior performance in most configurations. This suggests that the data generated by CoTAM also enjoys a better distribution in the latent space of the text-embedding model.

In the AG-NEWS dataset, instance-based algorithms tend to prefer in-domain annotations done by humans or large language models. This underscores the importance of in-domain text when using these algorithms for specific tasks.

Analysis

An ablation study is conducted to confirm the importance of each thought in the CoT. The effect of different large-scale language models is also investigated by experimenting with GPT-3.5-turbo. The results show that GPT-4 yields much better fine-tuning results, and that this gap can be reduced by using text-embedding models for text classification.

The results of the ablation study are shown in the table below. Performance decreases when any thought is removed from the CoT. In particular, the "what" thought (attribute decomposition) proves more important than the "how" thought (reconstruction guidance), indicating the value of attribute suggestion. Label switching requires the CoT, and removing it causes significant performance degradation. Finally, GPT-4 outperforms GPT-3.5-turbo, showing that CoTAM benefits from larger language models with stronger language capabilities, especially for complex tasks such as MNLI.


To confirm the hypothesis that the large-scale language model adjusts a single attribute while holding the others constant, the data-pair representations produced by CoTAM are visualized below: SimCSE's high-dimensional (1024-dimensional) text representations are reduced to two dimensions using Principal Component Analysis (PCA).
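A sketch of how such a visualization can be produced; the embeddings here are random placeholders standing in for SimCSE representations of original and manipulated sentences, and the plotting details are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholders for 1024-dim SimCSE embeddings of original sentences and their
# CoTAM-manipulated counterparts; in practice these come from the encoder.
X_pos = np.random.randn(50, 1024)                         # stand-in: positive originals
X_neg = X_pos + 0.1 * np.random.randn(50, 1024) + 0.5     # stand-in: flipped counterparts

pca = PCA(n_components=2)
Z = pca.fit_transform(np.vstack([X_pos, X_neg]))          # project 1024-dim vectors to 2-D
Z_pos, Z_neg = Z[:50], Z[50:]

plt.scatter(Z_pos[:, 0], Z_pos[:, 1], label="original (positive)")
plt.scatter(Z_neg[:, 0], Z_neg[:, 1], label="manipulated (negative)")
for a, b in zip(Z_pos, Z_neg):                            # draw the pair-wise switch direction
    plt.plot([a[0], b[0]], [a[1], b[1]], linewidth=0.3, color="gray")
plt.legend()
plt.show()
```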

The figure shows a clear boundary between positive and negative representations, highlighting the value of the proposed method for fine-tuning and instance-based inference. Furthermore, the direction of the representation switch is consistent, indicating that the large-scale language model can tune one attribute while keeping the others stable. This consistency in switching direction suggests that the behavior of large-scale language models can be predicted and controlled for specific attribute operations. Compared with CoTDA, CoTAM draws clearer boundaries, allowing more efficient learning from the data than conventional data augmentation.

Summary

In this paper, we propose a new method, Chain of Thoughts Attribute Manipulation (CoTAM). This method uses data generated from large-scale language models to achieve high performance with small amounts of training data.

Inspired by image manipulation techniques, CoTAM generates label-switched data by changing the attribute associated with a particular task and reconstructing new sentences. Experiments confirm that CoTAM is more effective than other large-scale language model-based text generation techniques.

Future work will aim to apply attribute manipulation techniques to smaller language models to increase scalability and accessibility, which is expected to improve efficiency by reducing reliance on the resource-intensive processes associated with large-scale language models. Further improvements are also expected from ensuring output stability and increasing its utility in real-time applications while maintaining performance quality.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
