Improved UX And Smooth Communication Through "Pepper-GPT", An Integration Of Whisper And GPT
3 main points
✔️ Improved user experience through technology integration: Integrating Whisper ASR and GPT-3.5 API into the Pepper robot bridges the gap between virtual AI and physical robots, significantly improving the user experience. Whisper ASR shows superior performance, especially compared to Google's ASR service, achieving low word error rates and fast processing times.
✔️ Potential of Pepper-GPT: In an evaluation by real users, the majority of participants rated the system as easy to use and the robot's gestures as appropriate, suggesting that Pepper-GPT enriches human-robot interaction and has the potential to further develop the HRI field in the future.
✔️ Future improvements: listening hints to help users get better guidance when interacting with the robot, enhanced multilingual support, designing more physical actions, and enhanced face-tracking capabilities to improve the user experience.
Does ChatGPT and Whisper Make Humanoid Robots More Relatable?
written by Xiaohui Chen, Katherine Luo, Trevor Gee, Mahla Nejati
(Submitted on 11 Feb 2024)
Comments: Published in Australasian Conference on Robotics and Automation (ACRA 2023)
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
As technology evolves at a dizzying pace, it is more important than ever to make human-machine interactions smoother. To meet this challenge, a new field has emerged: human-robot interaction (HRI). Effective human-machine interaction is said to be essential in order to reap the benefits of technology.
User experience (UX) is defined as the impact a machine has on its users: ease of use, intuitiveness, usefulness, and the degree of frustration during interaction. Ensuring a good UX is essential for robots to bring substantial value to our daily lives.
Pepper, the famous humanoid social robot by SoftBank Robotics, is well known for its diverse interactive features. However, when aiming for more human-like communication, Pepper's current capabilities may fall short of expectations. Delays and errors in language processing can degrade the user experience, and Pepper's built-in speech recognition is known to be inadequate for understanding natural language.
To address these challenges, large-scale language models such as ChatGPT are expected to be utilized. These advanced systems will enable more natural and contextually relevant conversations and will contribute to Pepper's speech recognition capabilities.
This paper develops a "Pepper-GPT" system that integrates the Pepper robot with Whisper and the GPT API, and reports experimental results on the method and its interaction with humans. This effort aims to improve the quality of communication with robots and bring greater value to human lives.
Technique
The Pepper-GPT project employs a unique methodology to revolutionize human-robot communication. At the heart of this project are two main elements: the BlackBox and the PepperController.
BlackBox combines advanced speech recognition and natural language processing technologies and is responsible for converting the user's voice into text and generating meaningful responses. It uses OpenAI's Whisper automatic speech recognition system and the gpt-3.5-turbo language model. PepperController, on the other hand, manages the robot's commands to perform actions in the real world.
The client-server model is used for data exchange, using the TCP/IP protocol, which ensures reliability and stability. This system design ensures a smooth process from voice input to response generation, making human-robot interaction more natural.
By extending the potential of AI and robotics, Pepper-GPT goes beyond a mere digital assistant to a real-time, interactive companion. This approach successfully provides advanced communication capabilities even for robots capable of physical actions.
In addition, BlackBox can be divided into two modules: the speech recognition module and the GPT module. Through these two modules, BlackBox records the user's voice input, recognizes the voice content, and generates accurate action commands or contextualized responses through the GPT-3.5 model. The generated results are then sent to PepperController for execution.
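The flow described above can be sketched as follows. This is an illustrative sketch only: the class and method names are assumptions for this example, not taken from the paper's code.

```python
# Hypothetical sketch of the BlackBox pipeline: voice in, text out,
# command sent to PepperController. Names are illustrative assumptions.

class BlackBox:
    def __init__(self, asr, gpt, pepper_socket):
        self.asr = asr               # speech recognition module (Whisper)
        self.gpt = gpt               # GPT module (gpt-3.5-turbo)
        self.pepper = pepper_socket  # TCP connection to PepperController

    def handle_turn(self):
        audio = self.asr.record_until_silence()  # record only while voice is present
        text = self.asr.transcribe(audio)        # Whisper Small speech-to-text
        if not text:
            # transcription failed: ask the user to speak again
            self.pepper.send("say:Sorry, could you repeat that?")
            return
        command = self.gpt.respond(text)         # action command or spoken reply
        self.pepper.send(command)                # PepperController executes it
```

In a real deployment, `asr`, `gpt`, and `pepper_socket` would wrap the Whisper model, the GPT API client, and the TCP link respectively.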
Voice Recognition Module
The speech recognition efforts in the Pepper-GPT project are focused on accurately interpreting the user's voice and generating responses accordingly. At the core of this process is the Whisper ASR system, which was selected from testing of three different Automatic Speech Recognition (ASR) models. Its selection was based on its robustness and remarkable performance, which can significantly improve the speech recognition capabilities of the Pepper robot. In particular, the "Whisper Small" model outperforms other models in its efficiency, balancing processing speed, resource consumption, and accuracy.
The speech recognition module is designed to start recording when it detects a human voice and stop recording when it senses silence, preventing the generation of silent audio. It also incorporates a Silero VAD model that identifies the human voice to avoid accidentally generating phrases such as "thank you" that would result in an inappropriate response.
The recorded audio is saved as a file and converted to text by the Whisper Small model. This text is then transferred to the GPT module, which is responsible for content analysis and response generation. However, there are times when the Whisper Small model is unable to transcribe the text properly, and the system automatically prompts the user to speak again to ensure a smooth and stress-free dialogue.
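The record-on-voice, stop-on-silence behavior might look like the following sketch. Here `is_speech` stands in for the Silero VAD model's per-chunk decision, and the returned chunks would be written to a file and passed to the Whisper Small model; both names and the silence threshold are assumptions for illustration.

```python
# Sketch of the recording logic: start on the first voiced chunk, stop
# after a run of silent chunks, so no all-silence audio is produced.

def record_utterance(chunks, is_speech, silence_limit=3):
    """Collect audio chunks between the first voiced chunk and
    `silence_limit` consecutive silent chunks."""
    recorded, silent_run, started = [], 0, False
    for chunk in chunks:
        if is_speech(chunk):            # VAD says this chunk contains voice
            started, silent_run = True, 0
            recorded.append(chunk)
        elif started:                   # silence after speech has begun
            silent_run += 1
            recorded.append(chunk)
            if silent_run >= silence_limit:
                break
    # drop the trailing silence before handing the audio to Whisper
    return recorded[:-silent_run] if silent_run else recorded
```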
GPT Module
The introduction of the GPT module in the Pepper-GPT project aims to make communication between the user and the robot deeper and more natural. At the heart of this module is response generation with the gpt-3.5-turbo model. This model is chosen for its exceptional comprehension and text generation capabilities. This produces human-like responses and enables Pepper-GPT's goal of highly flexible conversation. This technology can significantly improve the quality of the user experience by understanding user input and creating relevant, authentic conversations.
In this module, the process involves receiving text from the user via the speech recognition module, analyzing the content, and then switching to action mode or speech mode, as appropriate. In action mode, the user's request is translated into action commands that can be executed by the Pepper robot. In speech mode, on the other hand, the GPT module acts as an interlocutor, generating context-sensitive responses and continuing the conversation.
However, misinterpretations can occur. To solve this problem, the GPT module provides a double-checking function. This functionality allows the user to review the generated response for appropriateness and make corrections as necessary. This allows the Pepper robot to respond appropriately to user interactions.
The advanced design of the GPT module further facilitates user-robot interaction by ensuring that when the user requests an action from the Pepper robot, or enjoys a conversation, it accurately captures his or her intentions and responds appropriately.
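The mode switch could be sketched as below. The paper does not specify how action commands are encoded in the model's output, so the `ACTION:` prefix convention, the system prompt, and the injected `chat` function are assumptions for this example.

```python
# Illustrative sketch of the GPT module's action-mode / speech-mode switch.
# The "ACTION:" convention is an assumption, not the paper's actual format.

SYSTEM_PROMPT = (
    "You control a Pepper robot. If the user asks for a physical action, "
    "reply 'ACTION:<name>'; otherwise reply conversationally."
)

def gpt_module(user_text, chat):
    """`chat` takes a message list and returns the model's reply string;
    in production it would wrap a gpt-3.5-turbo API call."""
    reply = chat([{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_text}])
    if reply.startswith("ACTION:"):
        return ("action", reply[len("ACTION:"):].strip())  # action mode
    return ("speech", reply)                               # speech mode
```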
PepperController
The "PepperController" in the Pepper-GPT project acts as the central nervous system for the Pepper robot. This system controls the robot's movements and conversations, turning Pepper into a more engaging and dynamic being. Specifically, the Naoqi ALAnimatedSpeech agent is the core technology that executes both movement and voice commands. For voice commands, the PepperController converts text from the BlackBox into speech, and the Pepper robot is set to perform specific animations in response to the user's words.
All actions that can be performed by the Pepper robot are stored in a pre-coded data set, and the appropriate action is selected in response to the physical action command. In addition, during the speech recognition and response generation process, transition animations are executed as if Pepper were thinking, smoothing the flow of the interaction.
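The action lookup might be sketched as follows. The action names and annotated-speech strings here are illustrative examples, not the paper's actual dataset, and the `animated_speech` object stands in for a Naoqi ALAnimatedSpeech proxy.

```python
# Sketch of PepperController's pre-coded action dataset and dispatch.
# Animation paths follow Naoqi's annotated-text style but are examples only.

ACTIONS = {
    "wave":  "^start(animations/Stand/Gestures/Hey_1) Hello!",
    "think": "^start(animations/Stand/Gestures/Thinking_1) Let me think...",
}

def execute(command, animated_speech):
    """Dispatch a BlackBox command: a known physical action, or plain
    text spoken via ALAnimatedSpeech with its default animations."""
    kind, payload = command
    if kind == "action" and payload in ACTIONS:
        animated_speech.say(ACTIONS[payload])  # gesture and speech together
    else:
        animated_speech.say(payload)           # speech mode
```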
For data transmission, the highly reliable TCP/IP protocol is used to ensure stable data exchange between BlackBox and PepperController. This protocol's retransmission feature ensures that data is sent and received reliably, preventing data loss. The Pepper-GPT design employs a client-server model in which each client has a specific role: after input from the user, the appropriate commands are sent to the PepperController, leading to the robot's next action.
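A minimal sketch of such a TCP link is shown below; the port handling, newline framing, and "OK" acknowledgment are assumptions for this example, not details from the paper.

```python
import socket
import threading

# Minimal sketch of the BlackBox -> PepperController link over TCP/IP.
# Message framing (newline-delimited UTF-8) and the ack are assumptions.

def pepper_controller(server_sock):
    """Accept one connection, read a command, and acknowledge it."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024).decode().strip()
        conn.sendall(b"OK\n")  # ack so BlackBox knows the command arrived
        return data

def send_command(port, command):
    """BlackBox side: send one command and wait for the acknowledgment."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(command.encode() + b"\n")
        return c.recv(1024).decode().strip()
```

TCP's built-in retransmission is what gives the reliability the paper relies on; the explicit ack here just confirms delivery at the application level.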
Experiments and Results
Two analyses are conducted here: one comparing the selected speech recognition API with other APIs, and the other concerning the final results of the experiment.
The first step is to evaluate speech recognition. To assess the accuracy and speed of speech recognition, two tests were conducted with three speech-to-text APIs prior to the experiment.
Word Error Rate (WER) is used to evaluate accuracy. This is a widely used metric for measuring the accuracy of a speech recognition system: WER is calculated from the number of substitution, deletion, and insertion errors divided by the total number of reference words. In addition, recognition time is used as a performance metric, measuring how fast the model converts spoken language into text. This matters in real-world applications, where immediate and effective speech-to-text conversion is required.
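The WER definition above can be computed with a standard edit-distance dynamic program, as in this sketch:

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed via word-level edit distance between reference and hypothesis.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, transcribing "the cat sat" as "the bat sat" is one substitution over three reference words, giving a WER of about 0.33.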
The first test uses the Speech Accent Archive dataset, in which speakers from 177 countries utter the same English sentence. This diverse range of accents is ideal for evaluating the adaptability and performance of the selected speech recognition models, and the test includes both native and non-native English-speaking countries/regions to gauge utility in global communication. The second test uses the "daily-dialog" dataset, which contains everyday conversational dialogues, and aims to evaluate how accurately the speech recognition models can recognize and transcribe common dialogue. Five conversational scenarios that might be encountered in the real world were selected to test the models' performance in practical applications.
The first evaluation analyzed a total of 24 groups, covering both English-speaking and non-English-speaking countries. Three speech-to-text APIs were tested, with average word error rate (WER) and average recognition time as the evaluation criteria. The results show that Whisper achieved a significantly lower WER than the other APIs, demonstrating near-perfect accuracy.
In particular, among English-speaking countries, the U.S. has the lowest WER, while the U.K. has the highest WER. Among non-English speaking countries, Indian accents are shown to be the most difficult to understand, while Arabic and Filipino are the easiest to understand. In terms of average recognition time, Whisper was found to have the ability to convert speech to text in the shortest time.
The second test used the "daily-dialog" dataset to evaluate the accuracy and efficiency of speech recognition in five different conversation scenarios. The results of this test showed that Whisper consistently achieved the lowest WER, demonstrating the highest level of accuracy and maintaining the shortest average recognition time.
Through these results, Whisper performed significantly better than other speech recognition APIs, confirming the appropriateness of our research methodology. This demonstrates Whisper's effectiveness in speech-to-text conversion, even for use in real-world applications where high accuracy and efficiency are required.
Experiments with Pepper
To explore the implications of integrating ChatGPT with the Pepper robot, a trial with actual human participants was required. University of Auckland students were invited to have free conversations with the ChatGPT-integrated Pepper robot, with each session lasting 15 to 20 minutes.
Participants were recruited by distributing flyers on bulletin boards around campus. The only requirements for participation were that participants be at least 18 years old and able to communicate in English.
Informed consent is an essential ethical requirement for research with human participants. It ensures that participants fully understand the purpose, risks, and benefits of the research, as well as their own rights. Participants read the "Participant Information Sheet" and signed a "Consent Form" upon agreement. This procedure protects participants' privacy and confidentiality and ensures that the research is conducted according to ethical standards. Researchers were available to answer participants' questions and help them fully understand the study and make informed decisions about their participation.
Prior to the start of the experiment, participants were briefed on the features and functions of the integrated system, as well as guidelines for initiating a conversation with the robot. Participants were also provided with a microphone to enhance the accuracy of speech recognition.
During the experiment, participants were free to converse with the Pepper-GPT robot placed in front of them, and the system transcribed their conversations into text. If technical assistance was needed, a researcher was reserved in a corner of the room. Interaction with the robot was adjusted to last between 5 and 10 minutes, depending on the participant's response.
After the interaction, participants completed two digital questionnaires, providing information about their age, gender, faculty, ethical considerations, and previous experience with ChatGPT. Feedback on their interactions with the robot was also collected. All participants were rewarded with a $10 gift card.
Quantitative results indicate that participants had different experiences based on their English proficiency, but many found interacting with ChatGPT to be realistic and engaging. However, some participants felt there was room for improvement regarding the intuitiveness of the system. Overall, the results indicate that the presence of a physical robot enriched the ChatGPT interaction.
A correlation between word error rate (WER) and processing time has been observed in the evaluations of speech recognition techniques. In particular, British accents with complex phonological features exhibit high WER and long processing times, while Australian accents show the opposite. While this trend does not hold consistently in all cases, it suggests a broadly linear relationship between WER and processing time.
The experiment revealed that participants' English proficiency had a significant impact on their experience interacting with Pepper-GPT. In general, Whisper's speech recognition performed well on tests involving accents, but participants with lower levels of English comprehension had to repeat questions until the robot accurately grasped their intentions.
Approximately 30% of participants with more experience using ChatGPT had higher expectations for the robot's performance than did occasional users, who tended to be slightly disappointed with the system's capabilities at the end of the experiment. Additional challenges included difficulty in determining when to interact with the robot and the low accuracy of the Pepper robot's facial recognition technology, which required participants to make multiple attempts to get the robot's attention.
This study shows that English proficiency, user expectations, clarity of interaction timing, and Pepper-GPT's face-tracking feature influence the participant's experience. These factors are important areas for improvement in the next iteration of the system and are expected to contribute to increased user satisfaction and engagement.
Summary
In this paper, Whisper ASR and the GPT-3.5 API are integrated into the Pepper robot to reduce the gap between virtual AI and physical robots, greatly improving the user experience. In the ASR performance comparison, Whisper performed best, with an average Word Error Rate (WER) of 1.716% and an average processing time of 2.639 seconds, outperforming Google's ASR service. This improved Pepper-GPT's comprehension capabilities, and the GPT module makes interactions richer and more engaging by allowing the robot to generate contextually relevant responses, understand the user's instructions, and act accordingly.
The results of the survey of participants indicate great potential for Pepper-GPT in the HRI field. More than 90% of participants found the system user-friendly, and more than half rated the robot's gestures as appropriate. Positive feedback from participants indicates that they enjoy the Pepper-GPT and look forward to further interaction with the system in the future.
Through further improvements, Pepper-GPT is expected to evolve into a more natural, efficient, and enjoyable interactive companion, further enhancing the user experience.