
How Close Is ChatGPT To Human Experts?


Large Language Models

3 main points
✔️ ChatGPT has attracted attention in the NLP field and has shown excellent performance on a wide variety of tasks.
✔️ The authors collected the Human ChatGPT Comparison Corpus (HC3), consisting of over 40,000 questions and their answers, and performed evaluation and linguistic analysis of both the human and the ChatGPT answers to provide insight into LLM content generation.
✔️ Detection models were developed and released as open source to promote future research on AI-generated content and the regulation of online platforms.

How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
written by Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, Yupeng Wu
(Submitted on 18 Jan 2023)
Comments: this https URL

Subjects:  Computation and Language (cs.CL)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The release of ChatGPT has received much attention in academia and industry. To understand ChatGPT's capabilities and limitations, this study built a question-based dataset that contrasts ChatGPT's answers with those of human experts. Using this dataset, called the Human ChatGPT Comparison Corpus (HC3), the authors investigated the characteristics of ChatGPT's responses, comprehensively evaluated its generated content, and then experimented with methods to detect whether a given text was written by ChatGPT or by a human. This provides insight into ChatGPT's performance and evolution, and reveals new directions for detecting its adverse effects.

Introduction

ChatGPT has garnered much attention in the field of natural language processing and has demonstrated excellent performance on many NLP tasks. Its range of applications is wide, covering everything from text classification to code generation. ChatGPT is built on the GPT-3.5 series and fine-tuned with human feedback, and its superior capabilities have provoked much debate. On the other hand, people are also concerned about its potential risks: instances of improper use on user-generated content (UGC) platforms can harm their quality and reliability, and particularly in professional fields such as medicine and law, users should be aware of ChatGPT's potential to generate misinformation. Careful evaluation and measures to address these risks are therefore required.

The authors make the following contributions in light of concerns about ChatGPT's transparency and the social risks associated with potential misuse of the model.

1. We collected an extensive dataset, the Human ChatGPT Comparison Corpus (HC3), consisting of over 40,000 questions and their answers, to facilitate comparative research between humans and ChatGPT. It covers a variety of domains (medical, legal, financial, etc.) and is a valuable resource for investigating directions for improving language models.
2. We conducted a comprehensive evaluation and linguistic analysis of the human- and ChatGPT-generated responses and found interesting patterns. These findings help identify LLM-generated content and provide insight into future language model directions.
3. Based on the HC3 dataset and analysis, we developed ChatGPT detection models for a variety of detection scenarios and validated them.
4. We released the comparative data, evaluations, and detection models as open source to facilitate future research on AI-generated content and the regulation of online platforms.

Human ChatGPT Comparison Corpus (HC3)

ChatGPT has been pre-trained on an extensive corpus and has the ability to respond to a wide variety of questions. This study evaluates how well ChatGPT's responses match those of humans, verifying their honesty and appropriateness to the user's needs. Public datasets and wiki texts were used to construct the comparative dataset, and information was obtained from expert responses and web user polls.

ChatGPT's answers were generated from the collected human question data through the preview website. A new thread was opened for each question, and dataset-specific instructions were added where needed so that ChatGPT's answers would match the format of the human ones. The authors note that the resulting differences in answering conditions between humans and ChatGPT are small.

This study will be an interesting source of information to evaluate how closely ChatGPT performs with humans in language generation. It should be noted, however, that ChatGPT responses are based on web-crawled information and wiki text, which may not be accurate in specialized domains.

Focusing on the consistency and honesty of ChatGPT responses provides important insights into comparing the performance of language models with humans. However, ChatGPT's sources of information and accuracy challenges in certain areas are areas for further improvement. A cautious approach to the advancement and validation of language models will continue to be required.

Because there may be multiple human and multiple ChatGPT responses to each question, the comparative data are organized so that each question is stored together with all of its human answers and all of its ChatGPT answers.
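One plausible way to represent such a record is sketched below. The field names (`question`, `human_answers`, `chatgpt_answers`, `source`) are assumptions based on the described structure, not guaranteed to match the released files exactly:

```python
# A hedged sketch of one HC3-style record: each question maps to possibly
# multiple human answers and multiple ChatGPT answers. Field names and
# example text are illustrative assumptions.
record = {
    "question": "What causes a fever?",
    "human_answers": [
        "Usually an infection; your immune response raises body temperature.",
    ],
    "chatgpt_answers": [
        "A fever is typically caused by the immune system responding to an infection.",
    ],
    "source": "medicine",  # domain of origin, e.g. medical, legal, financial
}

def iter_pairs(record):
    """Yield every (question, human answer, ChatGPT answer) triple."""
    for h in record["human_answers"]:
        for c in record["chatgpt_answers"]:
            yield record["question"], h, c

pairs = list(iter_pairs(record))
```

Grouping all answers under one question makes it easy to enumerate human/ChatGPT pairs for comparison, as `iter_pairs` does.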

Overall, the English version collected 24,322 questions, 58,546 human responses, and 26,903 ChatGPT responses. For the Chinese version, we collected 12,853 questions, 22,259 human responses, and 17,522 ChatGPT responses. Meta-information for each dataset split is presented in Table 1.

Comprehensive evaluation and characterization of ChatGPT

In this section, a number of volunteers are invited to evaluate ChatGPT, and conclusions about several of its features are then drawn manually from the data the volunteers provide. The main human evaluations are divided into a Turing test (can evaluators tell ChatGPT's answers from human ones?) and a helpfulness test, which together provide a comprehensive assessment of ChatGPT's performance in different areas.

On the comparison dataset, we evaluated how well both experts and amateurs could detect ChatGPT-generated answers. A group of experts also rated how helpful ChatGPT's answers were. The results showed that ChatGPT's performance differs across fields: its answers were rated favorably in areas such as finance and psychology, but there was clear room for improvement in the medical field.

Based on feedback from the volunteers, a distinctive pattern emerged: while ChatGPT tends to provide organized, detailed answers and to avoid bias and harmful information, it may lack knowledge or fabricate facts, so it should be used with caution, especially for legal questions.

The main difference between ChatGPT and humans is that ChatGPT stays focused on the question and provides neutral answers, while human responses are flexible, subjective, and colloquial, expressing emotion and personality. This makes ChatGPT useful across a wide range of domains, but it clearly differs from humans in flexibility and individuality.

Evaluations of ChatGPT are varied and studies are underway on its performance in different areas. Future improvements are expected to address its limited performance in the medical field. In addition, a careful approach is required when using ChatGPT, with an understanding of its unique and outstanding aspects as well as its limitations.

ChatGPT and human answers

Linguistic features of the ChatGPT and human responses were analyzed in detail. Human responses are shorter on average but draw on a more diverse vocabulary, while ChatGPT produces longer responses with a comparatively smaller vocabulary. Part-of-speech and dependency analysis revealed that ChatGPT uses NOUN (nouns) and VERB (verbs) frequently, while ADV (adverbs) and PUNCT (punctuation) appear less often. Sentiment analysis showed that ChatGPT expresses mostly neutral sentiment, while human answers include more negative sentiment. Analysis of language-model perplexity also suggested that ChatGPT's text has relatively low perplexity, reflecting its ability to reproduce common patterns learned from a large text corpus.
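Two of these measures, average answer length and vocabulary diversity, can be sketched in a few lines of pure Python. This is a minimal illustration with made-up example answers, not the paper's exact pipeline (which uses proper tokenization and POS tagging):

```python
# Minimal sketch of two linguistic measures: mean answer length and
# vocabulary diversity (distinct tokens / total tokens, a simple
# type-token ratio). Whitespace tokenization is a simplification.
def avg_length(answers):
    """Mean number of whitespace tokens per answer."""
    return sum(len(a.split()) for a in answers) / len(answers)

def vocab_diversity(answers):
    """Distinct lowercase tokens divided by total tokens."""
    tokens = [t.lower() for a in answers for t in a.split()]
    return len(set(tokens)) / len(tokens)

# Illustrative (invented) answers: humans vary wording, ChatGPT repeats
# its own phrasing, so its type-token ratio comes out lower.
human = ["Honestly, no idea, ask a doctor!", "Rest, fluids, maybe ibuprofen."]
chatgpt = ["A fever is a common symptom. A fever is usually not dangerous."]
```

On these toy inputs, `vocab_diversity(human)` exceeds `vocab_diversity(chatgpt)`, mirroring the reported pattern of more diverse human vocabulary.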

This in-depth analysis helps to gain a deeper understanding of the differences in linguistic features between ChatGPT and human responses: while ChatGPT has shown an excellent ability to learn from large data sets and reproduce common patterns, human responses contain a wealth of unique expressions and emotions. This difference is important for understanding the advantages and limitations of ChatGPT and provides insight for improving future language models.

Demonstration of AIGC detection method and performance evaluation of ChatGPT

In this section, methods for detecting AI-generated content (AIGC) and distinguishing machine-generated from human-generated text are examined, motivated by the growing popularity of AIGC. Experiments on detecting ChatGPT content are conducted in different ways, and the performance of machine learning and deep learning methods is evaluated under different conditions.

Three detection methods are implemented: a logistic regression model based on GLTR Test-2 features, a deep classifier for single-text detection, and a deep classifier for QA detection. GLTR Test-2 provides features that measure how fluent and predictable a text is to a language model, and all three methods are used to identify ChatGPT-generated content. Their performance is evaluated at different granularities and on different data sources, with detailed results and discussion.
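The core GLTR Test-2 idea can be sketched without a language model: every token is ranked by how probable the LM found it, and the features are the fractions of tokens falling into rank buckets (top-10, top-100, top-1000, beyond). In this hedged sketch the ranks are supplied directly; in the paper they would come from an LM such as gpt2-small:

```python
# Hedged sketch of GLTR Test-2 feature extraction. Input: one rank per
# token (1 = the LM's most likely next token). Output: four fractions,
# one per rank bucket, which feed a logistic regression classifier.
BUCKETS = (10, 100, 1000)

def gltr_features(token_ranks):
    """Return the fraction of tokens in each rank bucket (4 features)."""
    counts = [0, 0, 0, 0]
    for r in token_ranks:
        for i, b in enumerate(BUCKETS):
            if r <= b:
                counts[i] += 1
                break
        else:
            counts[3] += 1  # rank beyond 1000
    n = len(token_ranks)
    return [c / n for c in counts]

# Invented rank lists: fluent "machine-like" text sits in the top bucket,
# more surprising "human-like" text spreads across buckets.
machine_like = gltr_features([1, 2, 1, 3, 5, 8, 2, 1])
human_like = gltr_features([4, 250, 12, 3000, 90, 7, 1500, 40])
```

These four-dimensional feature vectors are what the logistic regression model (e.g., scikit-learn's `LogisticRegression`) is trained on to separate human from ChatGPT text.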

In terms of implementation details, gpt2-small and Wenzhong-GPT2-110M are used as the LMs for GLTR Test-2, and roberta-base and chinese-roberta-wwm-ext for the RoBERTa-based deep classifiers. These models are taken from the Hugging Face Transformers library; the logistic regression is trained with sklearn, and the deep classifiers with the AdamW optimizer.

The experimental design trains binary classifiers to distinguish human from ChatGPT responses on the HC3 dataset, organized into several experimental groups. The influence of indicating words, sentence-level detection, and the usefulness of the paired question are tested, and six dataset versions are generated from different training and test set combinations to evaluate model performance.
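The kinds of variants described above can be sketched as follows. The variant names and splitting logic are illustrative assumptions, not the paper's exact identifiers: full answers as single texts, QA-style question-plus-answer pairs, and sentence-level examples:

```python
# Sketch of building detector training examples in several setups:
# single-text (answer only), QA-style (question + answer), and
# sentence-level. The sentence splitter is deliberately naive.
import re

def to_sentences(text):
    """Naive sentence split on ., ?, ! followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_variants(question, answer, label):
    """Yield (text, label) examples for several detector setups."""
    yield answer, label                   # single-text, full answer
    yield f"{question} {answer}", label   # QA-style, full answer
    for s in to_sentences(answer):        # sentence-level detection
        yield s, label

examples = list(make_variants(
    "Is a fever dangerous?",
    "Usually not. See a doctor if it lasts more than three days.",
    "chatgpt",
))
```

Generating all setups from the same underlying question-answer pairs keeps the training and test combinations comparable across the experimental groups.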

Detection of AI-generated content is important from a reliability and security perspective. The empirical experiments in this section provide insight into evaluating the performance of detection methods using machine learning and deep learning. The distinction between machine-generated and human-generated content is a complex issue and is expected to be developed in future research.

Experimental results

Results from multiple experiments show that the RoBERTa-based detector outperforms GLTR and detects ChatGPT-generated text more robustly thanks to its resistance to perturbations. It is less affected by indicating words and performs effectively in out-of-distribution scenarios, while GLTR is sensitive to ChatGPT's patterns and performs poorly, especially on the Chinese datasets. The deep learning-based RoBERTa model is thus more beneficial than the logistic regression model and superior at detecting AI-generated content.

We also observed that removing indicating words improved model performance on full texts but could hurt models trained at the sentence level, so an appropriate balance is required. Detecting generated text was shown to be harder for single sentences than for full answers, especially for detectors trained on the raw corpus.

In addition, training the model on a sentence corpus was identified as a performance enhancer, and the QA-style detector proved more effective than the single-text detector, making it particularly suited to filtered scenarios. The difficulty of detecting ChatGPT varied by data source, and transfer across open QA datasets lacked consistency.

In general, these experimental results indicate that ChatGPT's detection performance is complex and is affected by a variety of factors in model training.

Conclusion

The study introduced the HC3 dataset and conducted extensive evaluations and experiments on human and ChatGPT responses. The human evaluations and linguistic analysis carried out with HC3 provide insight into the differences between humans and ChatGPT and offer suggestions for future language model directions. The ChatGPT content detection experiments also yield important conclusions for the research and development of AIGC detection tools.

This study also introduces a new dataset in the evaluation of ChatGPT performance and reveals differences between language models and human responses. Looking ahead, the results of these studies will provide a basis for potential improvements and applications of language models. In addition, progress toward more effective and robust AI-generated content detection methods is expected in the research and development of detection tools.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
