
SportQA, A New Dataset That Measures The Comprehension Of Sports In Large Language Models



3 main points
✔️ Developed and released SportQA, the first dataset to assess LLMs' ability to understand sports
✔️ Used SportQA to analyze LLMs' strengths and weaknesses in understanding sports
✔️ Demonstrated new possibilities for NLP techniques in improving sports journalism and supporting athletes

SportQA: A Benchmark for Sports Understanding in Large Language Models
written by Haotian Xia, Zhengbang Yang, Yuqing Wang, Rhys Tracy, Yun Zhao, Dongdong Huang, Zezhi Chen, Yan Zhu, Yuan-fang Wang, Weining Shen
(Submitted on 24 Feb 2024 (v1), last revised 18 Jun 2024 (this version, v2))
Comments: NAACL 2024

Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In the world of sports, many factors are intertwined: diverse competitions, rules and tactics, individuals and teams, individual player characteristics, and so on. As a result, building a large language model (LLM) that understands and is thoroughly familiar with sports is extremely difficult.

While LLMs have shown excellent performance in tasks such as natural language understanding, information extraction, and question answering, their application in areas such as sports, where complex data and strategies are involved, is far from satisfactory.

For example, a sports fan could immediately answer the question "Which team won the 2022 FIFA World Cup?" But answering the question "Why do beginners use more float serves in a volleyball match while advanced players use fewer?" requires specialized knowledge.

To assess the ability of LLMs in the field of sports, a sports-specific dataset is needed that includes such general questions as well as questions requiring in-depth analysis.

In the past, datasets such as BIG-bench and LiveQA have been constructed to assess the ability of LLMs in sports. However, these do not adequately cover the breadth of knowledge and complex contexts of sports. Quality has also been an issue, with some questions containing factual inaccuracies, for example, "Tom Brady (a football player) scored a touchdown in a soccer game."

To address this challenge, the study builds a new dataset called SportQA, which contains over 70,000 questions covering a wide range of topics, from basic sports knowledge to questions requiring complex reasoning, in order to accurately assess LLMs' competence in sports.

LLM competence in the field of sports spans different levels of difficulty. In this paper, the questions are classified and defined into three levels (Level-1 to Level-3).

Level-1 consists of 21,385 questions that test factual and historical knowledge. These are questions that sports fans can answer quickly, such as naming Olympic medalists.

Level-2 consists of 45,685 questions that require an understanding of rules and tactics, for example, knowledge of the offside rule in soccer.

Level-3 consists of 3,522 questions requiring analysis of complex scenarios. These questions are designed for experts with years of experience. For example, one question asks for advanced judgment on how to break through three blockers in a volleyball game.

Level-1 and Level-2 questions are multiple-choice questions with a single correct answer, while Level-3 questions are multiple-choice questions with multiple correct answers, which further raises the difficulty.
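
To make the difference in format concrete, the following is a minimal sketch of how a single-answer Level-1/Level-2 item and a multi-answer Level-3 item might be represented in Python. The field names and example content are purely illustrative and are not the dataset's actual schema.

```python
# Hypothetical SportQA-style items; field names and content are illustrative,
# not the dataset's actual schema.

level1_item = {
    "level": 1,
    "question": "Which country hosted the 2022 FIFA World Cup?",
    "choices": ["Qatar", "Russia", "Brazil", "Germany"],
    "answer": "Qatar",  # Level-1/Level-2: exactly one correct choice
}

level3_item = {
    "level": 3,
    "sport": "volleyball",
    "question": "Facing a well-formed triple block, which attacking options are reasonable for the hitter?",
    "choices": [
        "Tool the block (hit off the blockers' hands and out of bounds)",
        "Tip softly over the block into the open court",
        "Hit as hard as possible straight into the middle of the block",
        "Recycle the ball off the block to reset the attack",
    ],
    "answers": [  # Level-3: one to four correct choices
        "Tool the block (hit off the blockers' hands and out of bounds)",
        "Tip softly over the block into the open court",
        "Recycle the ball off the block to reset the attack",
    ],
}
```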

The paper uses SportQA to evaluate the performance of state-of-the-art LLMs, including Llama2, PaLM2, GPT-3.5, and GPT-4. GPT-4 outperforms the other models at all levels, achieving correct response rates of 82.16% at Level-1, 75% at Level-2, and 47.14% at Level-3.

However, this Level-3 correct response rate is roughly 45 percentage points lower than that of human experts, suggesting that there is still room for improvement in this area.

SportQA Dataset

SportQA is built using a combination of automated and manual methods: Level-1 and Level-2 questions are generated automatically from templates and then modified by sports experts to cover a wide range of sports knowledge, while Level-3 questions are all manually created by sports experts and are practical, analytical questions.

To ensure the accuracy and consistency of the dataset, each question was scrutinized by 36 American and Chinese student athletes. They have a minimum of eight years of sports experience and a thorough understanding of the rules and strategies. In recruiting the student athletes, each candidate was interviewed using example questions for each level and trained before being formally assigned to the annotation process.

Level-1

The Level-1 questions are designed to assess the extent to which LLMs know basic facts about sports. Primarily asking for factual and historical information, this level contains 21,385 multiple-choice questions drawn from a variety of QA datasets. These source datasets come in a variety of formats, including true/false, multiple-choice, and free-response, and were standardized into a multiple-choice format.

For example, the TriviaQA, QUASAR, and HotpotQA datasets were converted to a multiple-choice format because many of their questions were open-ended. The questions in KQA Pro were originally in a multiple-choice format, so they were used as-is after checking the accuracy and relevance of the content.
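
As a rough sketch of what this kind of conversion step could look like, the snippet below turns an open-ended QA pair into a single-answer multiple-choice item by pairing the gold answer with sampled distractors. The function and its parameters are hypothetical; the actual SportQA conversion also relied on manual review by sports experts.

```python
import random


def to_multiple_choice(question: str, gold_answer: str,
                       distractor_pool: list[str],
                       num_choices: int = 4, seed: int = 0) -> dict:
    """Convert an open-ended QA pair into a single-answer multiple-choice item.

    Illustrative only: distractors are sampled from a candidate pool, whereas
    SportQA additionally had experts check accuracy and relevance.
    """
    rng = random.Random(seed)
    distractors = rng.sample(
        [d for d in distractor_pool if d != gold_answer], num_choices - 1)
    choices = distractors + [gold_answer]
    rng.shuffle(choices)
    return {"question": question, "choices": choices, "answer": gold_answer}


# Example with a TriviaQA-style open-ended question
item = to_multiple_choice(
    "Which country won the 2018 FIFA World Cup?",
    gold_answer="France",
    distractor_pool=["Croatia", "Brazil", "Germany", "Argentina"],
)
print(item["choices"], "->", item["answer"])
```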

As mentioned earlier, Level-1 questions are created using a combination of automatic and manual methods.

Level-2

The Level-2 questions are designed to assess the LLM's understanding of the rules and tactics of each sport, as well as broader historical and factual knowledge. This level includes 45,685 multiple-choice questions covering a wide range of content across a variety of sports.

Information on 35 different sports was collected from Wikipedia. These include 28 Olympic sports, four sports adopted for the first time at the 2024 Paris Olympics (breaking, sport climbing, skateboarding, and surfing), and sports such as baseball and American football that are not Olympic sports but remain popular.

As mentioned above, Level-2 questions are also created using a combination of automated and manual methods. All questions are checked by the review team for consistency with the original source and verified to be based on the most current information. In addition, questions that are outdated or no longer relevant are removed.

Level-3

Level-3 questions are the most challenging in SportQA. This level contains 3,522 scenario-based questions on six major sports: soccer, basketball, volleyball, tennis, table tennis, and American football.

These questions are designed not to test simple knowledge, but to recreate real sports situations and assess how deep the LLM's understanding and analytical skills are. The questions are in multiple-choice format with one to four correct answers.

As mentioned above, Level-3 questions are not generated automatically but are created manually, because writing questions at this level of difficulty requires specialized knowledge of each sport: not superficial knowledge, but deep insight from people who have actually played the sport and are familiar with its strategies and practical situations.

First, coaches of each sport were asked to suggest what angles should be evaluated; their extensive coaching experience ensures that each question is effective and practical. Then, based on the evaluation angles suggested by the coaches, the review team uses its expertise in the sport and its own athletic experience to develop the questions.

Experiment

The SportQA benchmark is used to evaluate the performance of major LLMs (Llama2-13b-chat, PaLM-bison-chat, GPT-3.5-turbo, GPT-4, etc.). Each experiment was run multiple times and the best results are reported.

For Level-1, 2,000 questions were randomly selected from the test set. For Level-2, a sampling strategy based on the number of questions per sport was employed: 30% of questions were sampled for sports with fewer than 200 questions, 15% for sports with 200 to 800 questions, 5% for sports with 800 to 1,500 questions, 2.5% for sports with 2,500 to 10,000 questions, and 1.5% for sports with more than 10,000 questions, giving a total of 2,243 questions. For Level-3, sampling was again based on the number of questions per sport: 20% for soccer, basketball, and tennis; 30% for volleyball; and 50% for table tennis and American football, giving a total of 980 questions.
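
The per-sport sampling can be pictured as a simple bucketed-rate scheme. The sketch below is an illustrative reimplementation of that idea, not the authors' code; the bucket boundaries follow the description above and should be treated as approximate.

```python
import random

# Sampling rate per sport, chosen by the size of that sport's Level-2 question
# pool (illustrative; boundaries follow the description in the text).
LEVEL2_RATES = [
    (200, 0.30),            # fewer than 200 questions: sample 30%
    (800, 0.15),            # 200-800 questions: 15%
    (1500, 0.05),           # 800-1,500 questions: 5%
    (10000, 0.025),         # up to 10,000 questions: 2.5%
    (float("inf"), 0.015),  # more than 10,000 questions: 1.5%
]


def sample_level2(questions_by_sport: dict[str, list], seed: int = 0) -> list:
    """Sample each sport's pool at a rate that shrinks as the pool grows."""
    rng = random.Random(seed)
    sampled = []
    for sport, pool in questions_by_sport.items():
        rate = next(r for bound, r in LEVEL2_RATES if len(pool) < bound)
        k = max(1, round(len(pool) * rate))
        sampled.extend(rng.sample(pool, k))
    return sampled
```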

We also evaluate the models primarily using Chain-of-Thought (CoT) prompting, a stepwise reasoning method that has been shown to be particularly effective for complex sport-comprehension tasks. Zero-shot CoT and few-shot standard prompting (SP) are also employed for comparison.
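
As an illustration of what few-shot CoT prompting looks like for this kind of question, the sketch below prepends one worked exemplar and then asks the model to reason step by step. The exemplar and wording are invented for illustration and are not the paper's actual prompt template.

```python
# Illustrative few-shot chain-of-thought (CoT) prompt for a sports
# multiple-choice question; not the paper's actual prompt.
COT_EXEMPLAR = """\
Question: In volleyball, why might a beginner rely more on float serves than an advanced player?
Choices:
(A) Float serves require less power and toss control
(B) Float serves are illegal at advanced levels
(C) Advanced players are not allowed to serve
(D) Float serves always score aces
Reasoning: A float serve needs less arm speed and a simpler toss than a jump serve,
so beginners can execute it consistently. Advanced players have the technique for
more aggressive serves, so they rely on floats less. Nothing makes float serves
illegal or guarantees aces.
Answer: (A)
"""


def build_cot_prompt(question: str, choices: list[str]) -> str:
    """Build a few-shot CoT prompt: one worked exemplar, then the target question."""
    lettered = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (COT_EXEMPLAR
            + "\nQuestion: " + question
            + "\nChoices:\n" + lettered
            + "\nReasoning: Let's think step by step.")
```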

We also compare the models' abilities with those of humans: we recruit student athletes who did not participate in the review process and have them manually answer a set of Level-3 test questions, so that model performance can be compared with that of experts familiar with each sport.

A comparison of the performance of the different models at three levels is shown in the table below.

GPT-4 consistently outperforms the other models on all tasks, showing an average performance advantage of more than 15% over the other models. Regarding prompt formats, CoT prompting is confirmed to be effective.

The tendency for few-shot, step-by-step prompting to improve model performance, especially on tasks requiring complex reasoning, has been reported in previous work (Wei et al., 2022) and is supported by the present experiments.

GPT-4 shows the highest accuracy at Level-1, with accuracy decreasing gradually at Level-2 and Level-3. This reflects the increasing complexity of the tasks at each level, with Level-3, which deals with complex scenarios, presenting the greatest challenge to the model.

However, while GPT-4 shows superior performance overall, at Level-3 human experts outperform it by roughly 30% to 65% in correct responses. Compared with the depth of human knowledge and understanding of sports, this indicates that there is still room for improvement in LLMs.

Error Analysis

We randomly select 20 questions from each level and manually analyze the errors. We ask the models to explain the reasoning behind their own answers and review these explanations to identify what errors occurred and the causes behind them.
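
A rough sketch of how this self-explanation step can be set up is shown below. The prompt wording is illustrative, not the paper's; it simply asks the model to justify an answer it has already given so that the resulting rationale can be categorized by hand.

```python
def build_explanation_prompt(question: str, choices: list[str],
                             model_answer: str) -> str:
    """Ask a model to explain a previously given answer (illustrative wording)."""
    lettered = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "You previously answered the following question.\n"
        f"Question: {question}\nChoices:\n{lettered}\n"
        f"Your answer: {model_answer}\n"
        "Explain, step by step, why you chose this answer."
    )
```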

In Level-1 and Level-2, "deficiency in conceptual understanding" is the most common error, accounting for 40% of the total. In Level-3, with its more complex problems, conceptual misunderstanding is again the most common error, accounting for 55% of the total.

This type of error can be seen, for example, in the inability to distinguish between "referee" and "arbitrator", and is likely due to the model misunderstanding the concepts involved in complex scenarios.

Summary

In this paper, we construct a new dataset, SportQA, to assess LLMs' understanding of sports. Whereas previous datasets focused on questions about basic facts and sports-related fundamentals, SportQA covers historical facts, rules, strategies, and even questions that require advanced sports knowledge and insight, such as scenario-based reasoning.

Evaluation results showed that GPT-4 performed well in basic sports knowledge and understanding of rules, but still had challenges with complex scenario-based reasoning, falling short of human expert knowledge.

The results show that further advances in natural language processing (NLP) and AI are necessary for LLMs to gain a deeper understanding of a diverse and changing domain such as sports.

SportQA is expected to be widely used in future research as an important tool for measuring and improving LLM comprehension in the field of sports.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
