Development Of LLM Chatbot Specialized For Multiple Choice Questions In Physics At Indian High School Level

Large Language Models 09/09/2024

3 main points
✔️ Research conducted on MM-PhyQA dataset to develop an LLM chatbot specifically for multiple choice questions in Indian high school physics
✔️ Two methods, image captioning and RLHF, were introduced; adding image captions was found to significantly improve LLM accuracy
✔️ In the future, various issues need to be addressed, including verification of the effectiveness of RLHF, application to other fields, use in actual educational settings, and ethical considerations

MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering
written by Avinash Anand, Janak Kapuriya, Chhavi Kirtani, Apoorv Singh, Jay Saraf, Naman Lal, Jatin Kumar, Adarsh Raj Shivam, Astha Verma, Rajiv Ratn Shah, Roger Zimmermann
(Submitted on 19 Apr 2024)
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Developments in artificial intelligence are transforming the way we learn. In particular, chatbots built on large language models (LLMs) are greatly expanding educational possibilities by providing personalized instruction and immediate feedback.

However, many challenges remain in applying LLMs to education. For example, solving physics problems requires both working through mathematical equations and understanding concepts, and LLMs have not performed well in these areas. In addition, when a problem statement includes an image, it is difficult for the model to process that information appropriately.

Therefore, this study set out to develop an LLM chatbot specifically for multiple choice questions in physics at the Indian high school level, using reinforcement learning and image captioning to improve the LLM's problem-solving and reasoning abilities. The experiments show that adding image captions significantly improves accuracy. This research is a step toward wider use of LLMs in education.

Related Research

Related research includes the development of Vision Language Models (VLMs). Models such as Flamingo, GPT4, the LLaVA series, and MiniGPT4 can process combined visual and verbal information and have shown excellent performance in visual question-answering tasks. Models with improved visual grounding capabilities, such as VisionLLM, Kosmos-2, and Qwen-VL, are also available.

Reinforcement Learning from Human Feedback (RLHF) was initially focused on tasks such as text summarization and question answering, but has gradually been applied to general-purpose language models to improve their ability to reason and interact with humans.

Image captions have been shown to be effective in reducing the limitations and hallucinations of LLMs in multimodal processing. Captions give the LLM more contextual information and are therefore expected to improve accuracy.

Applications of LLMs in education include providing personalized learning materials, increasing productivity, and improving accessibility. Research is also underway to develop student assistants using LLMs and to automate feedback on programming assignments.

However, evaluations of ChatGPT in mathematics education have indicated that there is still room for improvement in domain adaptation and contextual understanding. Building on this related work, the study develops an LLM chatbot specifically for physics education.

Proposed Method

1. Using the MM-PhyQA dataset

- Dataset of Indian high school level physics multiple choice questions
- Includes question text, choices, correct answers, and explanations
- 3,700 samples for training, 676 samples for testing

2. Adding image captions

- Provide a detailed description for each problem image
- Generate image captions using the Infi-MM model
- Minimize hallucinations and image processing errors
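
To make the idea concrete, the following is a minimal sketch of how a caption might be placed in front of a question before it is passed to the model. The `PhysicsMCQ` class, the `build_prompt` function, the prompt wording, and the example question are all hypothetical illustrations; they are not the MM-PhyQA schema or the prompt used in the paper, and the caption here is written by hand rather than generated by Infi-MM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhysicsMCQ:
    """One multiple choice question (illustrative fields, not the dataset schema)."""
    question: str
    choices: List[str]
    caption: Optional[str] = None  # text description of the problem image, if any


def build_prompt(item: PhysicsMCQ) -> str:
    """Place the image caption ahead of the question so the figure is visible as text."""
    lines = []
    if item.caption:
        lines.append(f"Image description: {item.caption}")
    lines.append(f"Question: {item.question}")
    # Label the choices (A), (B), ... in order; supports up to four choices.
    for label, choice in zip("ABCD", item.choices):
        lines.append(f"({label}) {choice}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)


# Hand-written example for illustration only.
example = PhysicsMCQ(
    question="What is the acceleration of the block along the incline?",
    choices=["g sin(theta)", "g cos(theta)", "g tan(theta)", "g"],
    caption="A block rests on a frictionless incline at angle theta to the horizontal.",
)
prompt = build_prompt(example)
print(prompt)
```

When no caption is available, the same function simply omits the description line, which roughly corresponds to a text-only input.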

3. Applying RLHF

- Incorporate human feedback into the model learning process
- Select 2,000 samples from the MM-PhyQA dataset and run inference on them with 5 models
- Rank the inference results using Gemini Pro
- Pair the highest-ranked response with each of the other responses to create a preference dataset of 8,000 pairs
- Train reward models (RMs) on the preference dataset
- Update the LLMs with the RMs using the PPO algorithm
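
The pairing step explains where the figure of 8,000 comes from: with 5 ranked responses per question, the top response can be paired with each of the other 4, and 2,000 × 4 = 8,000. The sketch below illustrates that count. The function name `build_preference_pairs` and the placeholder response strings are assumptions made for this example; the real responses and rankings are not reproduced here.

```python
from typing import Dict, List, Tuple


def build_preference_pairs(
    ranked: Dict[str, List[str]]
) -> List[Tuple[str, str, str]]:
    """Turn ranked responses into (question_id, chosen, rejected) triples.

    `ranked` maps a question id to its responses ordered best first.
    The best response is paired with every lower-ranked one.
    """
    pairs = []
    for qid, responses in ranked.items():
        chosen, rest = responses[0], responses[1:]
        for rejected in rest:
            pairs.append((qid, chosen, rejected))
    return pairs


# Placeholder data: 2,000 questions with 5 already-ranked responses each.
ranked = {
    f"q{i}": [f"q{i}-resp{rank}" for rank in range(1, 6)]
    for i in range(2000)
}
pairs = build_preference_pairs(ranked)
print(len(pairs))  # 2,000 questions x 4 rejected responses each
```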

4. Fine tuning

- Use the 7B, 13B, and 13B LoRA large versions of the LLaVA 1.5 model
- Fine tune on the MM-PhyQA dataset
- Train parameters efficiently using PEFT
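
As a rough illustration of why parameter-efficient methods such as LoRA reduce training cost, the sketch below counts trainable parameters for a single weight matrix. In LoRA, the update to a d_out × d_in weight is expressed as the product of two low-rank matrices, so only rank × (d_in + d_out) parameters are trained. The hidden size of 4096 and rank of 8 are illustrative assumptions, not the configuration used in the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative values only (assumed, not taken from the paper).
d = 4096   # width of one square projection matrix
rank = 8   # LoRA rank

full = d * d                              # parameters if the full matrix is trained
lora = lora_trainable_params(d, d, rank)  # parameters trained with LoRA
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

Under these assumed values, the LoRA factors hold 1/256 of the parameters of the full matrix.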

An overview of the proposed method is shown in Figure 1. The RLHF process aims to improve the LLM's reasoning ability by creating a preference dataset and training a reward model on it.
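
Reward models in RLHF are commonly trained with a pairwise (Bradley-Terry style) loss, which pushes the score of the preferred response above the score of the rejected one. The article does not state which loss the paper uses, so the following is a sketch of this common formulation rather than the authors' implementation; the function name is hypothetical.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss shrinks as the chosen response is scored further above the rejected one.
for r_c, r_r in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(r_c, r_r, round(pairwise_reward_loss(r_c, r_r), 4))
```

When both responses receive the same score the loss is log 2, and it approaches zero as the margin in favor of the chosen response grows.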

In the experiments, the proposed method is evaluated by comparing performance across the following six settings:

1. Fine tuning using (question text/answer, image, caption)
2. Fine tuning using (question text/answer, caption)
3. Fine tuning using (question text/answer, image)
4. Applying RLHF to 1
5. Applying RLHF to 2
6. Applying RLHF to 3
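
The six settings are the combination of three input configurations with and without RLHF. A small sketch of that grid is shown below; the `Setting` class and its field names are hypothetical and only mirror the numbering used in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    number: int        # numbering follows the list in the article
    use_image: bool
    use_caption: bool
    use_rlhf: bool


# (use_image, use_caption) for settings 1, 2, 3; question text/answer is always used.
INPUTS = [(True, True), (False, True), (True, False)]

SETTINGS = [
    Setting(
        number=i + 1 + (3 if rlhf else 0),
        use_image=img,
        use_caption=cap,
        use_rlhf=rlhf,
    )
    for rlhf in (False, True)
    for i, (img, cap) in enumerate(INPUTS)
]

for s in SETTINGS:
    print(s)
```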

Experiment

Tables 1 through 3 show the test accuracy of each model in the three settings where RLHF is not applied. Table 1 corresponds to setting 3, Table 2 to setting 1, and Table 3 to setting 2.

Table 1 shows the results of fine tuning using only question text, answers, and images. The accuracy of the LLaVA 1.5 7B, 13B, and 13B LoRA large models is 53.3%, 52.7%, and 53.1%, respectively, with no significant differences between them.

Table 2 shows the results of fine tuning using question text and answers, images, and captions. Adding image captions significantly improves accuracy, with LLaVA 1.5 7B, 13B, and 13B LoRA large models achieving 82.52%, 83.28%, and 82.1% accuracy, respectively, indicating that image captions contribute to LLM performance improvement.

Table 3 shows the results of fine tuning using only question text, answers, and captions. Even without images, captions improve accuracy over the image-only setting: the LLaVA 1.5 7B, 13B, and 13B LoRA large models reach 66.95%, 64.0%, and 74.56%, respectively.

These results indicate that image captions play an important role in improving LLM performance. The addition of image captions may have improved problem-solving performance because they provide more contextual information to the LLM.
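
The size of the improvement can be read off directly from the reported figures. The short script below computes the gain in percentage points relative to the image-only setting, using only the accuracies quoted above; the dictionary layout and the `gain` helper are illustrative.

```python
# Test accuracies (%) as reported in the article for LLaVA 1.5.
ACCURACY = {
    "image only": {"7B": 53.3, "13B": 52.7, "13B LoRA large": 53.1},
    "image + caption": {"7B": 82.52, "13B": 83.28, "13B LoRA large": 82.1},
    "caption only": {"7B": 66.95, "13B": 64.0, "13B LoRA large": 74.56},
}


def gain(model: str, setting: str, baseline: str = "image only") -> float:
    """Accuracy gain in percentage points of `setting` over `baseline`."""
    return round(ACCURACY[setting][model] - ACCURACY[baseline][model], 2)


for model in ACCURACY["image only"]:
    print(
        f"{model}: image+caption {gain(model, 'image + caption'):+.2f}, "
        f"caption only {gain(model, 'caption only'):+.2f}"
    )
```

With these figures, combining images and captions adds roughly 29 to 31 percentage points over images alone, while captions without images add roughly 11 to 21 points.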

However, the paper does not present results for settings 4 through 6, where RLHF is applied, so the effect of RLHF cannot be discussed here. RLHF is expected to further improve the LLM's reasoning ability, but verifying this is left for future work.

In addition, since the MM-PhyQA dataset used in this study is specific to Indian high school level physics problems, the effectiveness of the proposed method for problems in other disciplines and difficulty levels requires further investigation.

Conclusion

In this study, two methods, image captioning and RLHF, were implemented on the MM-PhyQA dataset to develop an LLM chatbot specifically for multiple choice questions in Indian high school physics. Experimental results showed that adding image captions significantly improved LLM accuracy. On the other hand, the effectiveness of RLHF needs to be verified in the future.

In the future, various issues need to be addressed, including verification of the effectiveness of RLHF, its application to other fields, its use in actual educational settings, and ethical considerations. This study provides important insights into the application of LLMs to the field of education and is expected to contribute to the development of AI education research.

Sasayama
