RiceChem: A Dataset for Evaluating Automated Long-Answer Grading (ALAG) by LLMs
3 main points
✔️ Provides a unique "RiceChem" dataset designed specifically for ALAG to facilitate further research in the important area of educational NLP
✔️ Proposes a rubric-based entailment formulation to address the unique complexity of long-form answers
✔️ Presents a comprehensive evaluation of the ALAG task with large language models and highlights challenges and opportunities for future research in this area
Automated Long Answer Grading with RiceChem Dataset
written by Shashank Sonkar, Kangqi Ni, Lesa Tran Lu, Kristi Kincaid, John S. Hutchinson, Richard G. Baraniuk
(Submitted on 22 Apr 2024)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Research on natural language processing (NLP) in education has largely focused on automated short answer grading and essay grading. This paper takes a new perspective and studies the relatively unexplored area of automated long answer grading (ALAG).
Free-text essays are evaluated on characteristics such as coherence and originality, whereas long answers are factual in nature and require a more sophisticated grading approach. Traditional automated short answer grading (ASAG) classifies responses into five categories: correct, partially correct, contradictory, irrelevant, and non-domain. Long answers, however, may exhibit characteristics of multiple categories simultaneously, making these five categories inadequate.
To advance research on ALAG, the authors created their own dataset, RiceChem, which collects 1,264 long-form responses from college-level chemistry courses. Each response is assessed against the rubric items for its question (27 rubric items in total), yielding 8,392 data points. The average response length is 120 words, much longer than in existing datasets (SciEntsBank: 13, Beetle: 10, Texas 2011: 18), making RiceChem well suited to ALAG research.
Given the limitations of the traditional ASAG formulation, the paper reframes ALAG as a rubric entailment task. In this formulation, each rubric item serves as a criterion that a student response must satisfy, and a natural language inference model determines whether the response entails each rubric item, allowing for more precise and comprehensive grading.
The authors fine-tune transformer models such as BERT, RoBERTa, and BART to establish baselines for the ALAG task on the RiceChem dataset. The results show that the rubric-based formulation captures the subtleties and multifaceted nature of student responses more accurately than a traditional score-based approach.
The paper also examines model performance in cold-start scenarios, providing valuable insights into data efficiency and real-world deployment in educational settings.
In addition, state-of-the-art open-source large language models are benchmarked on RiceChem and compared with GPT models, revealing how much more complex ALAG is than ASAG. The lower performance of large language models on RiceChem than on the ASAG benchmark SciEntsBank, even with the help of rubrics, demonstrates the difficulty of the ALAG task.
This study is one of the first attempts to address automated long answer grading (ALAG) in the field of educational NLP.
Dataset and Methods
This section first introduces the RiceChem dataset and then defines the ALAG task. A schematic of automated long answer grading (ALAG) with the RiceChem dataset, as proposed in the paper, is shown in the figure below.
This figure highlights the novel formulation of ALAG as a rubric entailment problem. Each student response (premise) is paired with a corresponding rubric item (hypothesis), and these pairs are processed by a transformer model fine-tuned for ALAG. The model predicts whether the response entails the rubric item; the use of rubrics in RiceChem allows for detailed point-by-point assessment and makes the grading process interpretable by design.
As mentioned above, the RiceChem dataset was built to study the ALAG task. It is not only a valuable resource for researchers working on ALAG, but also enables more reliable and interpretable grading systems that can use rubrics to provide meaningful feedback to students.
RiceChem contains four exam questions, 27 rubric items, and 1,264 graded student responses collected from college-level chemistry courses. Multiple teaching assistants labeled each student response against the individual rubric items with TRUE or FALSE. In total, there are 4,880 TRUE labels and 3,512 FALSE labels. Each rubric item carries a designated number of points, and the final score is obtained by summing the points of the rubric items the response satisfies.
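To make this structure concrete, here is a minimal sketch of how a RiceChem-style record and the final-score aggregation described above could be represented in code. The field names and example values are illustrative only and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RubricJudgment:
    """One (response, rubric item) pair as labeled by a teaching assistant."""
    question_id: int
    rubric_item: str      # text of the rubric criterion
    rubric_points: float  # points this item is worth
    entailed: bool        # TRUE/FALSE label: does the response satisfy the item?

def final_score(judgments: list[RubricJudgment]) -> float:
    """Total score = sum of points for the rubric items the response satisfies."""
    return sum(j.rubric_points for j in judgments if j.entailed)

# Illustrative values, not taken from the dataset:
judgments = [
    RubricJudgment(1, "States that entropy increases", 2.0, True),
    RubricJudgment(1, "Explains the sign of ΔG", 3.0, False),
]
print(final_score(judgments))  # 2.0
```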
Consider a natural language inference model M : (P, H) → L that takes a premise P and a hypothesis H as input and predicts a label L ∈ {True, False} indicating whether P entails H. To formulate grading as an inference problem, the student response R and a rubric item I are treated as premise and hypothesis, respectively; that is, the pair (R, I) is fed into M to predict the label L.
The ALAG approach proposed in the paper realizes this formulation by training a language model to predict whether a student response entails each rubric item. These predictions effectively identify the rubric items a response addresses correctly and can be used to provide automatic feedback.
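As a rough illustration of this entailment formulation, the sketch below pairs a student response (premise) with a rubric item (hypothesis) and queries an off-the-shelf MNLI model from Hugging Face. The paper fine-tunes such models on RiceChem's binary True/False labels, which this zero-shot sketch omits; the example texts are invented.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Off-the-shelf MNLI checkpoint; the paper additionally fine-tunes on RiceChem.
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entails(response: str, rubric_item: str) -> bool:
    """Treat the student response as premise and the rubric item as hypothesis."""
    inputs = tokenizer(response, rubric_item, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(dim=-1))]
    return label.lower() == "entailment"

# Illustrative pair (not taken from the dataset):
print(entails(
    "As temperature rises the reaction becomes spontaneous because TΔS exceeds ΔH.",
    "The response states that the reaction is spontaneous at high temperature.",
))
```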
Experiments and Results
In this section, we highlight the importance of the entailment-based and rubric-based formulations in ALAG and demonstrate their superiority over a traditional score-based approach. We also investigate the performance of these models in settings with limited labeled data (cold start) and discuss the implications for practical deployment in educational settings. Finally, we evaluate state-of-the-art open-source large language models (LLMs) on RiceChem and compare their results with GPT models, showing the increased complexity of ALAG over ASAG.
We begin with the training procedure for the transformer language models on the RiceChem dataset and the evaluation metrics used throughout the experiments. To fine-tune the transformer models, the data are split 80-10-10 into training, validation, and test sets: for each question, 80% of the student responses are randomly selected for training, 10% for validation, and 10% for testing, ensuring that responses do not overlap across splits.
Experiments are performed with the Hugging Face Transformers library on an NVIDIA A100-PCIE-40GB GPU. Training uses the AdamW optimizer with an initial learning rate of 2e-5, a mini-batch size of 16, and up to 10 epochs; the AdamW hyperparameters β1 and β2 are set to 0.9 and 0.999, respectively. After training, the checkpoint with the highest F1 score on the validation data is selected for evaluation. A comprehensive set of evaluation metrics is reported, including accuracy, precision, recall, and F1 score. To ensure robustness, the mean and standard deviation over five runs with different seeds are reported.
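For readers who want to reproduce a comparable setup, here is a minimal fine-tuning sketch matching the reported hyperparameters with the Hugging Face Trainer (a recent version of the library is assumed). The CSV file names and column names are placeholders, since RiceChem's file format is not described here, and this is not the authors' released code.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data loading: assume CSV files with "response", "rubric_item",
# and a binary "label" column (0 = FALSE, 1 = TRUE).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

name = "roberta-large-mnli"  # or "roberta-large" for the non-MNLI baseline
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, ignore_mismatched_sizes=True)  # fresh binary head

def tokenize(batch):
    # Response as premise, rubric item as hypothesis.
    return tokenizer(batch["response"], batch["rubric_item"], truncation=True)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="alag-roberta",
    learning_rate=2e-5,                 # reported initial learning rate
    per_device_train_batch_size=16,     # reported mini-batch size
    num_train_epochs=10,                # reported maximum number of epochs
    adam_beta1=0.9, adam_beta2=0.999,   # reported AdamW betas
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",         # select the best checkpoint by validation F1
    seed=0,                             # the paper averages over five seeds
)

Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["validation"], tokenizer=tokenizer,
        compute_metrics=compute_metrics).train()
```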
We evaluate the performance of discriminative language models such as BERT, RoBERTa, and BART on the RiceChem dataset. The table below compares the base and large variants. The large models generally outperform the base models, demonstrating the benefit of higher-capacity models, although there are some exceptions for BERT due to fine-tuning instability.
The table below also compares each language model on the RiceChem dataset with its MNLI fine-tuned version. Models fine-tuned on the MNLI (Multi-Genre Natural Language Inference) corpus show significant improvements in both accuracy and F1 score, highlighting the value of formulating ALAG as an entailment problem.
Formulating ALAG as an entailment task allows the use of the MNLI corpus, which contains premise-hypothesis pairs covering a wide range of genres and language styles. With roughly 433,000 examples, MNLI provides a wealth of linguistic knowledge and inference ability that can be effectively transferred to the ALAG task.
The entailment formulation makes it possible to start from models already fine-tuned on MNLI, which have a strong grasp of the entailment relationship between premises and hypotheses, and to adapt them efficiently to the specific domain of long answer grading through further fine-tuning.
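In practice, this transfer amounts to little more than swapping the initial checkpoint before running the fine-tuning sketched above; the snippet below illustrates the idea with standard Hugging Face model names.

```python
from transformers import AutoModelForSequenceClassification

# Vanilla baseline: generic pretraining only.
baseline = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2)

# MNLI-transferred variant: start from a checkpoint already fine-tuned on MNLI,
# then fine-tune on RiceChem exactly as before. The 3-way MNLI head is replaced
# by a fresh binary head, while the encoder retains its entailment knowledge.
mnli_init = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large-mnli", num_labels=2, ignore_mismatched_sizes=True)
```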
The performance gains shown in the table below confirm the effectiveness of this transfer learning approach: RoBERTa, when initialized from its MNLI fine-tuned checkpoint, shows a 3.2% improvement in accuracy and a 2.8% improvement in F1 score. Similarly, BART shows a 1.8% improvement in accuracy and a 1.4% improvement in F1 score. These gains indicate a successful transfer of knowledge from MNLI to the ALAG task, made possible by the entailment formulation.
The entailment formulation not only allows large datasets such as MNLI to be leveraged, but also provides a more natural and interpretable approach to ALAG. Aligning the grading process with the task of determining the entailment relationship between student responses and rubric items creates a more intuitive and explainable framework.
Rubric-based grading has previously been shown to improve performance in automated short answer grading (ASAG) and automated essay grading (AEG), and these experiments confirm that it also helps in automated long answer grading (ALAG): the rubric-based approach shows an average accuracy improvement of 9.2% and an F1 score improvement of 15.4% over the traditional score-based method.
As in previous studies on ASAG and AEG, the experiments confirm the importance of the rubric-based formulation. However, the complexity and multifaceted nature of long answers makes its importance in ALAG even more pronounced.
To illustrate this, we compare the traditional score-based approach to the rubric-based ALAG approach. In the score-based approach, the RiceChem dataset is preprocessed, the data is structured into sentences (student responses) and labels (scores), and the language model predicts integer scores from 0 to 8. The rubric-based ALAG format, on the other hand, decomposes the scoring process into smaller, more manageable components, allowing the model to focus on specific aspects of the responses defined by the rubric items.
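The contrast can be made concrete with a small sketch of how the same graded response would be represented under each formulation; the example text and field names below are invented for illustration.

```python
# Score-based formulation: one example per response, a single integer target.
score_based_example = {
    "text": "The reaction is spontaneous at high T because TΔS outweighs ΔH ...",
    "label": 5,   # integer score from 0 to 8 predicted directly by the model
}

# Rubric-based formulation: one example per (response, rubric item) pair,
# each with a binary entailment target. The 0-8 score is recovered afterwards
# by summing the points of the items predicted True.
rubric_based_examples = [
    {"premise": "The reaction is spontaneous at high T because ...",
     "hypothesis": "States that the reaction becomes spontaneous at high temperature.",
     "label": True},
    {"premise": "The reaction is spontaneous at high T because ...",
     "hypothesis": "Relates the sign of ΔG to TΔS and ΔH.",
     "label": True},
]
```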
The figure below shows a 9.2% improvement in accuracy and a 15.4% improvement in F1 score for the rubric-based method over the traditional score-based method. This significant performance gain underscores the importance of rubrics in ALAG: by breaking the complex task of grading long answers into smaller, clearly defined rubric items, the model can more effectively capture the nuances and multifaceted aspects of student responses.
Creating a high-quality rubric is challenging and requires careful thought and effort. However, this effort only needs to be made once, and the benefits can be reaped repeatedly in subsequent automated grading. Rubrics provide a comprehensive framework for assessing the key aspects of a response, resulting in more accurate and reliable grading. The use of rubrics in ALAG not only improves model performance but also increases the interpretability and transparency of the grading process: by aligning model predictions with specific rubric items, educators and students can more clearly see the strengths and weaknesses of a response, facilitating targeted feedback and improvement.
In addition, educational settings commonly involve new courses, subject areas, and question types for which training data is limited. It is therefore important to evaluate how automated grading models perform in a cold-start setting and to understand how their performance evolves as training data grows. The analysis in this section provides valuable insight into the data efficiency of the models and helps determine the minimum amount of labeled data needed to achieve satisfactory grading results.
First, we evaluate the performance of the RoBERTa-Large-MNLI model on unseen questions: the model is fine-tuned on a subset of questions and then used to grade responses to a new question for which no training data exists. For this study, the model is trained on three of the questions in the dataset and tested on the remaining unseen question.
As shown in the table below, the model achieves an accuracy of 60.6% to 68.7% and an F1 score of 0.629 to 0.717 across the held-out questions, indicating a certain degree of generalization. Models fine-tuned on similar types of questions thus acquire some transferable knowledge for addressing unseen questions, which is valuable in educational settings where labeled data for new questions is scarce.
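A sketch of the leave-one-question-out split assumed in this evaluation is given below; the helper name and record fields are illustrative, not from the paper's code.

```python
def leave_one_question_out(examples, held_out_question):
    """Split (response, rubric item) examples so that one question is held out
    entirely for testing while the others are used for fine-tuning."""
    train = [ex for ex in examples if ex["question_id"] != held_out_question]
    test = [ex for ex in examples if ex["question_id"] == held_out_question]
    return train, test
```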
Next, we investigate the performance of the RoBERTa-Large model and its MNLI fine-tuned version as the amount of training data grows from 5% to 80%. The figure below shows the trends in accuracy and F1 score for both models. As expected, performance improves consistently as training data increases: for RoBERTa-Large, accuracy rises from 73.2% to 84.1% and the F1 score from 0.772 to 0.864; for the MNLI fine-tuned version, accuracy improves from 79.2% to 86.8% and the F1 score from 0.823 to 0.888.
The performance gains decrease after 40% of the training data in the case of RoBERTa-Large and after 20% in the case of RoBERTa-Large-MNLI. This observation suggests that the model can achieve competitive scoring results with relatively small amounts of labeled data, and that the benefit of additional data becomes less pronounced beyond a certain degree. Furthermore, the standard deviations of the accuracy and F1 scores are within 1.12% across the different seeds, indicating the reliability and consistency of the model's performance.
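Such a data-efficiency curve could be produced by subsampling the training split at each fraction and seed, roughly as sketched below; train_and_eval is a hypothetical stand-in for the fine-tuning pipeline sketched earlier.

```python
import random

FRACTIONS = [0.05, 0.10, 0.20, 0.40, 0.80]
SEEDS = [0, 1, 2, 3, 4]  # the paper reports mean and std over five seeds

def data_efficiency_curve(train_examples, train_and_eval):
    """train_and_eval(subset, seed) -> (accuracy, f1); a placeholder for the
    fine-tuning and evaluation pipeline."""
    results = {}
    for frac in FRACTIONS:
        scores = []
        for seed in SEEDS:
            rng = random.Random(seed)
            subset = rng.sample(train_examples, int(frac * len(train_examples)))
            scores.append(train_and_eval(subset, seed))
        results[frac] = scores
    return results
```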
In addition, we evaluate the zero-shot performance of several large language models on the RiceChem dataset to assess their potential in the context of ALAG.
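The paper's exact prompts are not reproduced in this article, but a zero-shot rubric-entailment query might look roughly like the sketch below, which assumes the OpenAI Python SDK; the prompt wording and model string are illustrative.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat LLM client would do

client = OpenAI()

PROMPT = """You are grading a college chemistry exam.
Student response:
{response}

Rubric item:
{rubric_item}

Does the student response satisfy this rubric item? Answer TRUE or FALSE only."""

def zero_shot_judgment(response: str, rubric_item: str, model: str = "gpt-4") -> bool:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(response=response, rubric_item=rubric_item)}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("TRUE")
```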
Despite the strong performance of these large language models in many domains, the RiceChem dataset proved formidable. The best performing model, GPT-4, achieved an accuracy of 70.9% and an F1 score of 0.689, highlighting the complexity of the ALAG task. The result is particularly striking when compared with the results GPT models achieve on ASAG tasks.
The difference in complexity between ASAG and ALAG may be even greater than the roughly 5-point gap in F1 score suggests: the rubrics in RiceChem provide a structured framework that boosts model performance, yet even with this advantage GPT-4 cannot match its performance on the ASAG task, which offers no rubrics at all.
The results in the table below also reveal how differently various large language models perform on the RiceChem dataset: while GPT-4 and GPT-3.5 are the top performers, other models such as Qwen1.5 32B Chat and Mistral also show promising results, with F1 scores of 0.456 and 0.429, respectively. These findings indicate that the architecture and training methods of large language models have a significant impact on their ability to cope with the complexity of ALAG.
In summary, benchmarking large language models on the RiceChem dataset highlights the unique challenges posed by the ALAG task. Even with the benefit of rubrics, the performance gap between ASAG and ALAG underscores the need for further research into models and techniques specifically designed to assess long-form, fact-based responses. As large language models continue to evolve, it is important to explore their potential in the context of ALAG and to develop strategies that improve automated grading systems in educational settings.
Summary
This paper introduces a new task, automated long answer grading (ALAG), together with the RiceChem dataset specifically designed to advance research in this area. The rubric-based formulation of ALAG provides a detailed and pedagogically sound approach to assessing long answers, offering a more comprehensive evaluation than traditional automated short answer grading (ASAG) methods.
Through extensive experimentation, the paper demonstrates the importance of the rubric-based formulation, the value of the entailment formulation, and the challenges posed by cold-start scenarios. In addition, benchmarking of state-of-the-art models, including large language models, confirms that ALAG is a much greater challenge than ASAG.
It is hoped that this research will stimulate further research in the important area of educational NLP and contribute to the development of advanced models that can handle the complexity and sophistication of the ALAG task.