GPT Takes the Bar Exam
3 main points
✔️ The bar exam is required to obtain a license to practice law and demands a high level of legal knowledge to pass.
✔️ The paper uses OpenAI's state-of-the-art GPT-3.5 model (text-davinci-003) to assess performance on portions of the exam.
✔️ GPT-3.5 demonstrated strong performance on the multiple-choice (MBE) portion of the bar exam.
GPT Takes the Bar Exam
written by Michael Bommarito II, Daniel Martin Katz
(Submitted on 29 Dec 2022)
Comments: Additional material available online at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
The bar exam is required to obtain a license to practice law, and passing it demands a high level of legal knowledge. Candidates typically spend years of law school education preparing specifically for it. Even so, pass rates are relatively low: roughly one in five test-takers fails on the first attempt. Among the exam's components, the Multistate Bar Examination (MBE) section consists of multiple-choice questions on basic legal principles and how to apply the law. To pass the bar exam, a candidate must generally meet a minimum passing standard on the MBE section.
The paper therefore uses OpenAI's text-davinci-003 model (part of the GPT-3.5 family), a state-of-the-art AI system, to evaluate its performance on the MBE section. GPT-3.5 is known to perform well without task-specific training data, but can it achieve a high rate of correct answers on a practice exam?
Background
The legal system is becoming increasingly complex, and the demand for legal services is growing. To address this situation, AI and process engineering are being introduced for the benefit of legal professionals as well as the general public.
However, legal documents and terminology can be very complex and difficult to understand. Unlike everyday language, legal language is highly formalized and can be hard for both the public and AI systems to parse. Legal terms can also take on different meanings depending on context.
Despite these challenges, advances in AI technology have led to significant progress in the field of natural language processing (NLP). In particular, the advent of transformer-based large language models (LLMs) has enabled sophisticated text processing, and these models are now being challenged with the evaluation of complex legal problems.
Data
Professional licensing examinations exist not only in law but also in medicine, dentistry, pharmacy, accounting, engineering, and other fields. In the United States, each state sets its own requirements for licensing lawyers, but the National Conference of Bar Examiners (NCBE) designs the majority of the bar exam materials used throughout the country.
Passing the bar exam takes substantial preparation: in general, it requires a large body of theoretical knowledge as well as the ability to understand and answer the exam's distinctive question formats.
In recent years, most states have adopted the Uniform Bar Examination (UBE), which consists of three components: a multiple-choice test (the MBE), an essay test, and a scenario-based performance test. The multiple-choice test typically accounts for 50% of the overall bar exam score and is designed to test legal knowledge and reading comprehension.
For this study, the authors purchased standardized test preparation materials from the NCBE and used its official practice questions and simulated bar exams.
Proposed Method
The experimental evaluation accesses GPT-3.5 through the text-davinci-003 text-completion API using a technique called zero-shot prompting, in which the model is applied to a new task or domain as-is, without any task-specific training.
First, an approach called prompt engineering is applied: the process of designing and tuning prompts so that the model produces appropriately structured output. Several prompt styles were tried, and asking the model to rank its top three choices proved the most effective; a sketch of such a prompt follows.
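As an illustration, a zero-shot prompt in the spirit of this ranked-choice style might look like the sketch below. The wording and the question fragments are invented for illustration; the paper's exact templates differ.

```python
# Hypothetical zero-shot prompt in the ranked-choice style described above.
# The template wording and question content are illustrative, not the
# paper's actual prompt.
PROMPT_TEMPLATE = """Please answer the following bar exam question.
Rank your top three choices, from most likely to least likely to be correct.

Question: {question}

(A) {choice_a}
(B) {choice_b}
(C) {choice_c}
(D) {choice_d}

Rank order of the top three choices:"""

prompt = PROMPT_TEMPLATE.format(
    question="A landlord leased a building to a tenant for five years...",
    choice_a="The tenant is liable, because the lease had not expired.",
    choice_b="The tenant is not liable, because the landlord accepted the keys.",
    choice_c="The tenant is liable for double rent.",
    choice_d="The tenant is not liable, because the agreement was oral.",
)
```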
Next, the API hyperparameters are tuned. The GPT model uses the following parameters to control the quality and diversity of its output when generating text (see the API sketch after this list).
temperature: Controls the diversity of the text the model produces. At lower temperatures the model produces more confident, predictable text, while higher temperatures produce more varied text.
top_p: Nucleus sampling; restricts sampling to the smallest set of candidate tokens whose cumulative probability reaches the threshold p. This controls the variety of tokens generated.
best_of: Has the model generate multiple candidate completions and return the one it scores highest. This increases the probability that the most appropriate answer is selected.
max_tokens: Limits the maximum number of tokens of text that will be generated, preventing excessive output.
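Putting the prompt and these parameters together, a minimal call through the legacy OpenAI text-completion interface might look like the following. The parameter values shown are illustrative placeholders, not the combination the paper found best.

```python
import openai  # legacy openai-python (<1.0) interface

openai.api_key = "sk-..."  # placeholder API key

prompt = "..."  # a zero-shot MBE prompt, e.g. built as in the earlier sketch

# Minimal sketch of a text-davinci-003 completion call using the
# hyperparameters discussed above; values here are illustrative defaults.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.0,   # lower -> more deterministic, predictable output
    top_p=1.0,         # nucleus-sampling threshold
    best_of=1,         # number of candidates generated and scored server-side
    max_tokens=16,     # cap on generated tokens; ranked answers are short
)
print(response["choices"][0]["text"].strip())  # e.g. "(B), (A), (D)"
```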
In addition, fine-tuning, i.e., adapting a pre-trained model to the specific task, was attempted using unseen simulated MBE bar exam questions, but it did not improve performance over the off-the-shelf production models.
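For reference, OpenAI fine-tuning as it existed at the time worked roughly as sketched below, on a JSONL file of prompt/completion pairs. The file name, data format assumption, and base-model choice are ours, and, as noted above, this route did not beat the zero-shot model.

```python
import openai  # legacy openai-python (<1.0) interface

# Sketch of the legacy fine-tuning flow, assuming training examples were
# serialized as {"prompt": ..., "completion": ...} JSONL pairs.
# "mbe_train.jsonl" is a hypothetical file of simulated MBE questions.
training_file = openai.File.create(
    file=open("mbe_train.jsonl", "rb"),
    purpose="fine-tune",
)

# text-davinci-003 itself was not fine-tunable; legacy fine-tuning targeted
# base models such as "davinci".
job = openai.FineTune.create(
    training_file=training_file["id"],
    model="davinci",
)
print(job["id"])  # fine-tune job id, used to poll for completion
```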
Results
A total of 107 trials were conducted across prompt and parameter combinations in this study. Prompt style #7 (rank-ordering the top three choices) proved the most effective, with 41 sample runs collected for this prompt-parameter combination. These runs show that GPT does not yet reach the baseline passing rate, but its accuracy is well above random chance. Furthermore, GPT performs on par with human test takers in certain question categories, while the gap is larger in others. The following chart compares GPT-3.5's performance with NCBE-reported student performance by question category.
This performance gap could stem from relevant material being absent from the model's training data, from that body of knowledge having been filtered out of the corpus, or from the complexity of the test's design. To explore these possibilities, the authors investigated whether GPT's wrong answers were at least "close", that is, whether the correct answer still ranked highly among its choices. In certain categories the model's rank ordering correlated poorly with accuracy, raising the possibility that test design contributes to the poor performance there.
Additionally, the model's second-best response proves informative: the rate at which the correct answer appears among its top two choices is well above the random baseline. Overall, GPT's ranked responses significantly outperform random guessing, with the Civil Procedure category the notable exception.
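To make the "top two choices" analysis concrete, here is a minimal sketch of computing top-1 and top-2 accuracy from ranked model answers. The data structure and example records are invented for illustration.

```python
# Minimal sketch of the top-k accuracy analysis described above.
# `runs` pairs each question's correct letter with the model's ranked
# choices; the records here are invented for illustration.
runs = [
    {"correct": "B", "ranked": ["B", "A", "D"]},
    {"correct": "C", "ranked": ["A", "C", "B"]},
    {"correct": "D", "ranked": ["A", "B", "C"]},
]

def top_k_accuracy(runs, k):
    """Fraction of questions whose correct answer is in the model's top k choices."""
    hits = sum(r["correct"] in r["ranked"][:k] for r in runs)
    return hits / len(runs)

print(f"top-1: {top_k_accuracy(runs, 1):.2f}")  # exact-answer accuracy
print(f"top-2: {top_k_accuracy(runs, 2):.2f}")  # correct answer in top two
# With four choices, random guessing gives top-1 = 0.25 and top-2 = 0.50,
# the baselines these results are compared against.
```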
Conclusion
The study showed that GPT-3.5 performs remarkably well on the MBE portion of the bar exam: without any fine-tuning, it significantly exceeds the random-guess baseline and approaches human test-taker performance in some categories, reaching passing-level accuracy in subjects such as Evidence and Torts. This suggests that GPT-3.5's understanding and reasoning abilities in the legal domain are considerably advanced.
Looking ahead, new models such as GPT-4 and the BLOOM family are on the horizon, and they have the potential to further improve legal understanding and reasoning capabilities. The authors also plan to evaluate GPT-3.5 on exam sections beyond the MBE, which should deepen our understanding of how the GPT series and other models perform across the full legal exam.