Mental Health Care Using Large-Scale Language Models: Effectiveness and Challenges of AI Counselors
3 main points
✔️ Increasing need for accessible text-based counseling for youth, but lack of experienced counselors
✔️ Responses from a GPT-4-based counseling dialogue system were evaluated by professional counselors and found comparable in quality to those of human counselors
✔️ AI shows potential to play a key role in counseling, but further improvements are needed for full automation
Can Large Language Models be Used to Provide Psychological Counselling? An Analysis of GPT-4-Generated Responses Using Role-play Dialogues
written by Michimasa Inaba, Mariko Ukiyo, Keiko Takamizo
(Submitted on 20 Feb 2024)
Comments: Accepted as a conference paper at IWSDS 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In today's society, mental health care issues are becoming increasingly serious. For example, suicide is the leading cause of death among people aged 10 to 39 in Japan. In addition, according to the World Health Organization (WHO), suicide is one of the leading causes of death among young people worldwide. Against this backdrop, text-based counseling, which provides psychological support via a messaging application, is attracting attention.
Text-based counseling has the advantage of being more accessible, especially for the younger generation, with lower psychological hurdles than telephone or email-based counseling. However, there is a lack of experienced counselors. Even those with experience in face-to-face, phone, or email-based counseling find it difficult to provide text-based counseling without proper guidance and training. Furthermore, there is a shortage of personnel who can provide such appropriate guidance.
Against this backdrop, methods of mental health support utilizing natural language processing are being researched. In particular, the automatic detection of mental health problems and disorders is a research area that has attracted much attention, and in the field of dialogue systems, several systems have been developed to improve mental health. On the other hand, while recent developments in large-scale language models have shown their adaptability to various tasks and domains, the performance of counseling dialogue systems using large-scale language models has not yet been fully evaluated.
In this paper, a counseling dialogue system is constructed using GPT-4, and the generated responses are evaluated by professional counselors. To generate appropriate responses, counseling dialogue data were collected through role-play scenarios with professional counselors, and the utterances were annotated with the counselor's intentions. To assess the feasibility of using the dialogue system in a real counseling setting, third-party counselors evaluated the appropriateness of the responses generated by the human counselor and by GPT-4 in the same context of the role-play dialogue data.
Role Play Dialogue Collection and Generation of Counselor Responses
Two counselors participated in the collection of role-play dialogues, one playing the role of client and the other the role of counselor; the dialogues were conducted in Japanese using the "LINE" messaging application. A total of six dialogues, one for each of the six themes listed in the table below, were collected.
To test the effectiveness of large-scale language models in counseling dialogue, the collected role-play dialogue data are used to have GPT-4 generate utterances as the counselor. To obtain high-quality responses, the collected counselor utterances are annotated with response points (Key point) and intentions (Intent), as shown in the table below.
The prompt used to have GPT-4 generate responses is also shown in the table below. It instructs GPT-4 to respond as a counselor and adds guidelines supervised by a professional counselor. The prompt also includes the ongoing dialogue between counselor and client.
The dialogue includes all text from the start of the session up to the client's most recent message, and the key point and intent of the response are inserted before the counselor's utterance, as shown in the table below.
GPT-4 is used via the OpenAI API (gpt-4-0613), with a temperature of 0.0 and other parameters left at their default settings. The statistics of the generated utterances are shown in the table below.
The number of utterances by the human counselor is higher than that of GPT-4 because GPT-4 generates one utterance at a time, whereas the role-play setup allowed the human counselor to send a series of messages.
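The generation setup described above can be sketched as follows. The exact prompt wording, annotation format, and helper names here are illustrative assumptions, not the paper's verbatim prompt; only the model name and temperature come from the paper.

```python
# Sketch of the response-generation setup (prompt wording and annotation
# format are assumptions for illustration, not the paper's actual prompt).

SYSTEM_PROMPT = (
    "You are a professional counselor replying in a text-based counseling "
    "session. Listen empathetically, avoid judgmental language, and follow "
    "the counseling guidelines provided."
)

def build_messages(dialogue, key_point, intent):
    """Assemble a chat-completion payload from the dialogue history and the
    annotated key point / intent for the next counselor utterance."""
    lines = [f"{speaker}: {text}" for speaker, text in dialogue]
    user_content = (
        "\n".join(lines)
        + f"\n[Key point: {key_point}]"
        + f"\n[Intent: {intent}]"
        + "\nCounselor:"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    dialogue=[("Client", "Lately I can't sleep before exams.")],
    key_point="acknowledge the client's anxiety",
    intent="empathy",
)

# The actual call (requires an API key; deterministic settings per the paper):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4-0613", temperature=0.0, messages=messages
# )
```

Setting the temperature to 0.0, as the paper does, makes generation effectively deterministic, which is what allows the same dialogue context to be re-evaluated fairly against the human counselor's response.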
Analysis
Professional counselors evaluated the role-play dialogues and the GPT-4-generated utterances on a 3-point Likert scale from 0 (poor) to 2 (good), with three counselors rating each dialogue. Reasons for the ratings were also recorded, and a total of seven counselors participated in the evaluation. The table below shows a sample of the generated utterances and the average score for each.
Note that Krippendorff's alpha was calculated to measure rater agreement on the dialogues of themes 1 through 3 (counselor utterances: 157, GPT-4 utterances: 124), yielding α = 0.24, which indicates weak agreement among raters.
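For readers unfamiliar with the statistic, Krippendorff's alpha compares observed disagreement against the disagreement expected by chance, computed from a coincidence matrix of rating pairs within each unit. The sketch below implements the nominal-level variant as an illustration; the paper's ratings are ordinal and the measurement level it used is not stated here, so treat this as a simplified assumption.

```python
# Minimal nominal-level Krippendorff's alpha (illustrative sketch only;
# the paper's ordinal ratings would normally use the ordinal variant).
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(data):
    """`data` is a list of units; each unit is a list of ratings
    (one per rater, None for missing)."""
    o = Counter()  # coincidence matrix over ordered value pairs within units
    for unit in data:
        vals = [v for v in unit if v is not None]
        m = len(vals)
        if m < 2:
            continue  # units with fewer than two ratings are not pairable
        for c, k in permutations(vals, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per rating value
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_o = sum(w for (c, k), w in o.items() if c != k)  # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2)) / (n - 1)
    if d_e == 0:
        return 1.0  # only one value ever used: no disagreement possible
    return 1.0 - d_o / d_e
```

An alpha of 1.0 means perfect agreement and 0.0 means chance-level agreement, so the paper's α = 0.24 confirms that raters frequently disagreed on individual utterances.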
The mean rating scores for counselor and GPT-4 utterances are 0.99 (variance: 0.49) and 0.94 (variance: 0.61), respectively. A Mann-Whitney U test at a significance level of 0.05 found no significant difference, indicating that the response quality of GPT-4 is comparable to that of the counselors.
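The Mann-Whitney U test is a rank-based test that compares two independent samples without assuming normality, which suits these discrete 0–2 ratings. A minimal sketch follows; in practice one would use `scipy.stats.mannwhitneyu`, and this simplified version uses a normal approximation without tie correction, so p-values are approximate when many ratings are tied.

```python
# Simplified two-sided Mann-Whitney U test (sketch; no tie correction
# in the variance, so p-values are approximate for heavily tied data).
from statistics import NormalDist

def mann_whitney_u(x, y):
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    # Assign the average rank to each distinct value (handles ties).
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(ranks[v] for v in x)          # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2                        # mean of U under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    p = 2 * NormalDist().cdf(-abs(z))       # two-sided p-value
    return u, p
```

Applied to the two rating samples, a p-value above 0.05 would match the paper's finding of no significant difference between counselor and GPT-4 ratings.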
The figure below shows the rating percentages of utterances by counselors and GPT-4, showing that GPT-4 utterances are rated 0 and 2 more often than counselors' utterances.
More than half of the counselors' utterances received a rating of 1, which was attributed to the fact that short utterances such as "I see" and "Yes" were rated as 1.
The evaluation results indicate individual differences in the raters' scoring tendencies. Therefore, the authors analyze how the same rater rated the counselor's utterances and GPT-4's utterances in the same context. If the counselor sent a series of utterances before the client responded, the average rating of all those utterances is used as the rating of the counselor's utterance. Results are shown in the table below.
The percentage of cases in which the counselor's utterance was rated higher was 34.8%, versus 30.5% for GPT-4; the difference is small, and 34.7% of cases were ties. These results confirm that the quality of GPT-4's responses is very close to that of the counselor's responses. Since the responses generated by the large-scale language model are comparable to those of humans even without fully elaborated prompts, GPT-4's performance could be improved further, and actual counseling delivery using a system based on a large-scale language model can be expected.
Case Study
When actually providing counseling through a dialogue system, it is important to minimize inappropriate responses. The paper therefore analyzes the GPT-4 responses that received low ratings.
Reviewing the responses that received low ratings, the evaluators attributed them to inappropriate or unnatural wording or expressions. For example, the word "interesting" can be offensive to the client, as it could be perceived as the counselor treating the client's problem as an object of curiosity.
The evaluators also noted that GPT-4 responses may treat the client's problem as if it were someone else's; for example, the phrase "it sounds difficult" was judged to seem insincere and best avoided.
Avoiding risky responses is especially important in counseling; although the GPT-4-generated utterances did not include offensive or discriminatory remarks, a small number of risky utterances were identified. For example, the response "kindness causes me pain" risks inculcating the false value that one should not be kind.
Although the number of risky responses identified in this validation was small, when the input prompt contained offensive content, GPT-4 tended to generate offensive statements in response. While no offensive content arose in the role-play dialogues in this paper, clients may include offensive content in actual counseling sessions. Future research should analyze such cases and develop a safer and more effective counseling dialogue system.
Summary
In this paper, role-play counseling dialogue data were collected and annotated, and professional counselors evaluated the appropriateness of the responses generated by GPT-4. The results show that GPT-4's responses are of comparable quality to those of human counselors. The authors also report that the low-rated responses did not include offensive or discriminatory content, though a small number of risky utterances were identified.
This paper is an important first step in exploring how useful AI can be in real counseling settings, and the finding that GPT-4 is nearly equal in quality to human counselors suggests that AI could play an important role in counseling in the future.
However, we also recognize the need for further testing and improvement in order to achieve fully automated counseling services. Our goal is to develop an AI system that can understand human emotions and subtle nuances and respond appropriately, which requires testing in a variety of scenarios and continuous improvement.
It is hoped that the evolution of AI technology and further research along these lines will lead to a society in which people can more easily receive counseling, creating an environment in which people with serious problems can receive prompt support.