"Furhat Robot" Leverages A Large-scale Language Model To Achieve Natural Facial Expressions And Conversation

Large Language Models 13/10/2024

3 main points
✔️ GPT-3.5 is used to develop FurChat, a conversational agent that enables natural conversations with humans
✔️ The robot produces more realistic facial expressions and gestures by using nonverbal as well as verbal cues
✔️ The system combines many elements, such as automatic speech recognition, natural language understanding, and natural language generation, to achieve natural conversations

FurChat: An Embodied Conversational Agent using LLMs, Combining Open and Closed-Domain Dialogue with Facial Expressions
written by Neeraj Cherakara, Finny Varghese, Sheena Shabana, Nivan Nelson, Abhiram Karukayil, Rohith Kulothungan, Mohammed Afil Farhan, Birthe Nesset, Meriam Moujahid, Tanvi Dinkar, Verena Rieser, Oliver Lemon
(Submitted on 29 Aug 2023 (v1), last revised 30 Aug 2023 (this version, v2))
Comments: Accepted at SIGDIAL 2023 (24th Meeting of the Special Interest Group on Discourse and Dialogue), for the demo video, see this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The fields of robotics and artificial intelligence have made remarkable progress, and robots are now playing a variety of roles beyond the boundaries of industry. In particular, receptionist robots play an important role in providing information about locations and services to visitors in offices and hotels.

This paper develops FurChat, a conversational agent that enables natural conversations with humans. The agent utilizes an advanced large-scale language model and is capable of natural open- and closed-domain interaction. The system has been applied to the Furhat robot developed at Furhat Robotics in Sweden, and demonstrations show new possibilities for more natural and intuitive conversations with robots that use non-verbal cues, such as facial expressions, as well as words.

Designed specifically for the National Robotarium, the system provides information about the facility, research, news, and events through natural conversations with visitors. FurChat uses the GPT-3.5 model, guided by prompt engineering, to provide this information, to hold both open- and closed-domain conversations, and to generate facial expressions.

The use of appropriate verbal and nonverbal cues is crucial in robot-human interaction, and this is a major difference from traditional agents. Conventional agents operate in a rule-based fashion and rely on pre-programmed commands and keywords, which limits how flexibly they can respond in dialogue. FurChat, which utilizes a large-scale language model, allows for open-domain interaction, resulting in more natural and personalized interactions for the user.

In the future, robots of this kind are expected to go beyond the receptionist role and become multifunctional conversational agents.

Furhat Robot

Furhat is an innovative social robot developed by Furhat Robotics. The robot utilizes advanced conversational AI and facial expressions to enable natural and intuitive interaction with humans. Furhat's face is a three-dimensional mask onto which an animated face is projected by a microprojector, allowing it to mimic human facial expressions. This technology allows the robot to provide more realistic and emotional facial expressions.

In addition, Furhat is supported by a motorized platform that allows for neck and head movement, rotation, and nodding. This allows for more human-like movements and realistic facial expressions and gestures during interaction. The robot is equipped with a microphone array and speakers, allowing it to identify and respond appropriately to human speech.

However, Furhat's human-like appearance can sometimes cause the "uncanny valley" phenomenon. This is a psychological effect in which the closer a robot's appearance and movements are to those of a human, the more affinity people feel for it, but once the similarity passes a certain level, the reaction reverses and the robot feels creepy. Even so, Furhat is considered an important step toward deeper human interaction. Its expressive and interactive capabilities make it an ideal candidate for receptionist work in a variety of environments.

System Overview

The figure below shows the architecture of a conversational system that allows users to interact with the robot through spoken language. The system consists of several key components.

The system has the following main components: automatic speech recognition (ASR), which converts user speech into text; natural language understanding (NLU), which processes and interprets the text; a dialogue manager (DM), which manages the flow of the dialogue; and natural language generation (NLG), which utilizes GPT-3.5 to generate natural-sounding responses. The generated text is converted back into speech using text-to-speech (TTS) technology and output from the robot's speakers. The system also retrieves relevant data from a database based on user intent.
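The flow through these components can be sketched as a simple chain of stages. This is a minimal illustration, not the paper's implementation: the `Pipeline` class, the stub stages, and the intent labels are all hypothetical stand-ins for the real ASR, NLU, DM, NLG, and TTS modules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the stages named in the text; each stage is a plain callable."""
    asr: Callable[[bytes], str]    # audio -> transcript
    nlu: Callable[[str], str]      # transcript -> intent label
    dm: Callable[[str, str], str]  # (intent, transcript) -> prompt for the LLM
    nlg: Callable[[str], str]      # prompt -> response text
    tts: Callable[[str], bytes]    # response text -> audio

    def turn(self, audio: bytes) -> bytes:
        """Run one user turn through every stage in order."""
        text = self.asr(audio)
        intent = self.nlu(text)
        prompt = self.dm(intent, text)
        reply = self.nlg(prompt)
        return self.tts(reply)


# Stub stages for illustration only; real stages would call external services.
demo = Pipeline(
    asr=lambda audio: audio.decode("utf-8"),
    nlu=lambda text: "ask_events" if "event" in text.lower() else "open_domain",
    dm=lambda intent, text: f"[{intent}] {text}",
    nlg=lambda prompt: f"Reply to: {prompt}",
    tts=lambda reply: reply.encode("utf-8"),
)
```

Keeping each stage behind a plain callable makes it easy to swap a stub for a real service without touching the turn logic.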

Automatic Speech Recognition (ASR) uses the Google Cloud Speech-to-Text module. This module utilizes machine learning algorithms to transcribe spoken words into text and is integrated into the system through the Furhat SDK.

Dialogue Management consists of three sub-modules: Natural Language Understanding (NLU), Dialogue Manager (DM), and Database Storage. Natural Language Understanding (NLU) analyzes the input text from Automatic Speech Recognition (ASR) and uses machine learning techniques to break it into a structured representation. FurhatOS provides an NLU model that classifies text into specific intents based on a confidence score.
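Confidence-based intent selection can be illustrated as follows. This is a sketch under assumptions: the threshold value, the intent labels, and the fallback to an `open_domain` label when no intent is confident enough are hypothetical and are not taken from the paper or from FurhatOS.

```python
from typing import Dict

CONFIDENCE_THRESHOLD = 0.6  # assumed value, for illustration only


def pick_intent(scores: Dict[str, float],
                threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the highest-scoring intent, or 'open_domain' if none is confident."""
    if not scores:
        return "open_domain"
    intent, confidence = max(scores.items(), key=lambda item: item[1])
    return intent if confidence >= threshold else "open_domain"
```

For example, `pick_intent({"ask_events": 0.82, "ask_research": 0.10})` selects `ask_events`, while a top score below the threshold falls through to open-domain handling.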

The Furhat SDK built-in Dialogue Manager maintains the flow of the conversation and manages the state of the dialogue based on the intent identified by the NLU component. This module sends the appropriate prompts to the large language model, receives the response from the model, and then processes it to add the desired facial gesture.

One of the challenges facing large language models today is the generation of non-factual content, which can undermine user trust and raise safety concerns. While it is not a perfect solution, the authors attempt to mitigate this effect with a custom database, built by manually scraping the National Robotarium website. When the appropriate intent is invoked, the Dialogue Manager retrieves information from the database and sends it along with a prompt to elicit a response from the large language model.
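The retrieval step described above might look like the following sketch. The dictionary-backed store, the intent keys, and the wording of the grounding instruction are assumptions made for illustration; the placeholder strings stand in for scraped text, and the paper's actual database schema is not shown in this article.

```python
from typing import Dict, Optional

# Hypothetical store: intent label -> text scraped from the facility website.
FACTS: Dict[str, str] = {
    "ask_events": "<scraped text about upcoming events>",
    "ask_research": "<scraped text about research areas>",
}


def retrieve(intent: str, db: Dict[str, str] = FACTS) -> Optional[str]:
    """Look up stored facts for a closed-domain intent; None if nothing is stored."""
    return db.get(intent)


def grounded_prompt(base_prompt: str, intent: str,
                    db: Dict[str, str] = FACTS) -> str:
    """Append retrieved facts to the prompt so the model answers from them."""
    facts = retrieve(intent, db)
    if facts is None:
        return base_prompt  # open-domain turn: nothing to ground on
    return f"{base_prompt}\nUse only these facts when answering: {facts}"
```

Supplying the facts in the prompt narrows what the model is asked to produce, which is the mitigation the text describes, although it does not guarantee factual output.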

Natural Language Generation (NLG) is responsible for generating responses based on requests from the Dialogue Manager. Prompt engineering is an important part of this process, and large-scale language models are used to elicit appropriate responses.

The system uses text-davinci-003, a very powerful model in the GPT-3.5 series, at a cost of $0.0200 per 1000 tokens. Prompt engineering defines the robot's personality and application context, and uses information extracted from past dialogue history and databases to form the dialogue.
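To get a feel for what this price means per turn, the arithmetic can be sketched as below. The function name and the token counts are hypothetical, and the sketch assumes prompt and completion tokens are billed at the same quoted rate.

```python
PRICE_PER_1K_TOKENS = 0.02  # USD, the rate quoted in the text


def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Estimate the USD cost of one request from its token counts."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * price_per_1k
```

A turn with a 1,200-token prompt and a 300-token reply uses 1,500 tokens, so it costs about 1.5 × $0.02 = $0.03. Because the dialogue history and database text are part of the prompt, longer conversations cost more per turn.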

Emojis are also incorporated into the responses to express emotions that fit the flow of the conversation. For example, a smiling gesture is selected for dialogues that convey joy or humor, while a sad facial expression is selected for dialogues that convey empathy or sadness. This seamlessly integrates the text-based large language model into the embodied Furhat robot, resulting in more natural conversations. Note that the prompt format is "This is a conversation with a robot receptionist, <Robot Personality>, <Data from the Database>, <Dialogue history>, <Response Format along with sample emoticons>".
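Assembling a prompt in this format could be sketched as follows. Only the order of the sections comes from the text; the function name, the newline separator, and the way dialogue history is rendered are assumptions for illustration.

```python
from typing import List, Tuple


def build_prompt(personality: str, facts: str,
                 history: List[Tuple[str, str]],
                 response_format: str) -> str:
    """Join the prompt sections in the order given in the text."""
    history_text = "\n".join(
        f"{speaker}: {utterance}" for speaker, utterance in history
    )
    sections = [
        "This is a conversation with a robot receptionist.",
        personality,      # <Robot Personality>
        facts,            # <Data from the Database>
        history_text,     # <Dialogue history>
        response_format,  # <Response Format along with sample emoticons>
    ]
    # Skip empty sections, e.g. no facts on an open-domain turn.
    return "\n".join(section for section in sections if section)
```

Passing the history on every call is what lets a stateless completion model stay consistent across turns.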

In addition, Furhat SDK provides built-in gestures that allow users to add custom facial gestures tailored to their specific needs. Using a state-of-the-art GPT model, Furhat identifies emotions from text and generates gestures based on them that express the appropriate emotion. After receiving a response from the model, the Dialogue Manager selects the best expression from a set of pre-defined gestures and activates it simultaneously with the generated speech.
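The selection step can be illustrated with a small mapping from emoji to gesture labels. This is a sketch, not the Furhat SDK API: the mapping, the gesture labels, and the rule that the first recognized emoji wins are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mapping from emoji in the model's reply to gesture labels.
EMOJI_TO_GESTURE: Dict[str, str] = {
    "😊": "smile",
    "😢": "sad",
    "😮": "surprise",
}


def split_reply(reply: str) -> Tuple[str, str]:
    """Strip known emoji from the reply and pick a gesture from the first one."""
    gesture = "neutral"
    kept = []
    for char in reply:
        if char in EMOJI_TO_GESTURE:
            if gesture == "neutral":
                gesture = EMOJI_TO_GESTURE[char]
        else:
            kept.append(char)
    # Collapse the whitespace left behind where emoji were removed.
    text = " ".join("".join(kept).split())
    return text, gesture
```

The cleaned text would then go to speech synthesis while the gesture label is triggered alongside it, so the emoji itself is never spoken aloud.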

To convert text to speech, the Amazon Polly service is used. This service is provided by default in FurhatOS and allows for clear and natural speech output.

In this way, the Furhat robot uses advanced technology to deepen engagement with the user, resulting in a natural conversation. The figure below shows an example of a dialogue between a human and a robot.

Summary

This paper describes the development of a conversational robot, FurChat, for use as a receptionist. The robot's conversational agent generates open- and closed-domain dialogues and facial expressions using the advanced large-scale language model GPT-3.5. Built on the Furhat SDK, the system employs a one-to-one visitor interaction scheme.

As for future prospects, the authors aim to support multi-party dialogue, an active research area in the development of receptionist robots. In addition, as a countermeasure to the so-called "hallucination" problem, in which large-scale language models generate inaccurate content, the authors plan to fine-tune the language model and shift to direct dialogue generation that does not rely on the natural language understanding (NLU) component. Further progress in dialogue robots based on large-scale language models is expected.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
