Human-robot Cooperative Assembly Realized By Large-scale Language Models
3 main points
✔️ Effective natural language communication between humans and robots using Large Language Models (LLMs)
✔️ Integration of voice commands and sensors to streamline assembly operations and improve safety on the manufacturing floor
✔️ Improved flexibility in task handling and real-time error recovery in a dynamic manufacturing environment
Enhancing Human-Robot Collaborative Assembly in Manufacturing Systems Using Large Language Models
written by Jonghan Lim, Sujani Patel, Alex Evans, John Pimley, Yifei Li, Ilya Kovalenko
[Submitted on 4 Jun 2024 (v1), last revised 21 Jun 2024 (this version, v2)]
Comments: Accepted by arXiv
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This research proposes a framework for using large-scale language models (LLMs) to improve communication in human-robot collaborative manufacturing systems. In a manufacturing environment, human operators flexibly respond to dynamic situations, while robots perform accurate and repetitive tasks.
However, the communication gap between humans and robots hinders collaboration. To address this, the authors propose a framework that integrates natural language voice commands into task management. An assembly case study shows that the framework can process natural language input and handle real-time assembly tasks, suggesting that LLMs have the potential to improve human-robot interaction in manufacturing assembly applications.
Introduction
Advances in robotics technology have greatly improved manufacturing efficiency, reducing costs and increasing productivity. While robots can quickly and accurately repeat heavy-duty tasks on the manufacturing floor, they lack the adaptability and versatility of human operators.
This has increased the importance of human-robot collaboration (HRC), in which humans and robots complement each other's skills and capabilities; HRC refers to the interaction and cooperation of human operators and robotic systems within a shared workspace.
Prior research has shown that the HRC framework improves the ergonomics of tasks in a manufacturing environment and enables safe human-robot interaction. For example, from handling, installing, and removing large components to complex assembly tasks of small components such as printed circuit boards, human-robot cooperation can significantly improve the efficiency and safety of production lines.
However, there are several challenges related to human-robot interaction that need to be addressed to further advance HRC in manufacturing systems. In particular, interaction with robots can cause psychological stress and tension for operators due to language barriers. Modern manufacturing systems require extensive pre-training and complex code development to ensure that operators work accurately and safely with robots.
These difficulties underscore the need for human-robot communication systems that do not require extensive robot training. HRC systems must also be flexible enough to adapt to changes and errors during the manufacturing and assembly process, and human-robot cooperative assembly applications must combine advanced technologies with human-centered design to improve communication and ease of use.
Large-scale language models (LLMs) have recently been introduced to improve natural language understanding and generation capabilities. These can be extended to improve human-robot interaction in manufacturing facilities; models such as OpenAI's GPT-3 and GPT-4 have shown strong capabilities in natural language processing, understanding, and communication.
LLM integration enables natural language communication between humans and robots. Using a voice interface for this communication improves collaboration and operator safety in dynamic work environments.
The main contributions of this study are as follows
1. use LLM to interpret natural language and allow the operator to coordinate with the robotic arm
2. propose an integrated framework for voice command, robotic arm, and vision systems to improve operational flexibility in HRC.
3. enhance the ability to adapt to task errors and obstacles through human-robot communication and improve efficiency in manufacturing environments.
Related Research
Human-robot collaboration (HRC) has been developed in diverse ways to improve safety and efficiency in manufacturing. For example, Fernandez et al. developed a dual-arm robotic system with multi-sensor capabilities for safe and efficient collaboration. This system integrates gesture recognition. Wei et al. developed a deep learning method to predict human intent using RGB-D video.
In addition, Liu et al. studied improving HRC by integrating different modalities such as voice commands, hand movements, and body movements. This approach uses a deep learning model for voice command recognition, but does not focus on context-dependent communication. Wang et al. also employed a teaching-learning model that uses natural language instructions to predict human intentions and facilitate collaboration. This model uses natural language for multimodal processing, but does not focus on interactions that account for language diversity.
These previous studies have introduced methods that use environmental data and natural language to improve the safety and efficiency of HRC in manufacturing. However, there is limited research on human-robot collaborative assembly that effectively integrates natural language capabilities to handle context-dependent communication and language diversity. The authors aim to close this gap with an LLM-based approach to human-robot communication, a first step toward combining existing techniques such as computer vision and LLMs to leverage human flexibility and robot precision in manufacturing.
Framework
The framework proposed in this study is aimed at human-robot collaborative assembly in a manufacturing environment. The framework is designed to facilitate the interaction between the human operator and the robot during the assembly process.
Physical Layer
The physical layer enables human-robot interaction based on data from the virtual layer. This layer consists of three main components
1. human instructions: the operator controls the robot's movements through voice instructions.
2. robot behavior: the robot performs a behavior based on a predefined set of tasks.
3. sensor data: Data from sensors is used to monitor environmental conditions. This data allows the robot to adjust its movements in response to changes in the workspace (e.g., position and orientation of parts).
When an event or error is detected during a task, the robot notifies the human operator via a communication protocol; the LLM module converts the error information into a natural language message and communicates it to the operator using speech synthesis technology. Once the operator understands and responds to the error, the robot resumes its task.
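The paper does not provide code for this notification path, so the following is a minimal Python sketch of how an error event might be packaged and relayed to the operator. The names (ErrorEvent, describe_error, speak) are hypothetical stand-ins for the LLM and speech-synthesis calls described later in the framework.

```python
from dataclasses import dataclass

@dataclass
class ErrorEvent:
    """Hypothetical container for the error information the robot reports."""
    task_id: str        # task being executed when the error occurred
    subtask_index: int  # subtask at which execution stopped
    detail: str         # raw sensor/controller description of the problem

def describe_error(event: ErrorEvent) -> str:
    """Stand-in for the LLM module: turn raw error data into a natural-language message."""
    # In the paper's framework this step would be handled by the LLM (see the virtual layer below).
    return (f"Task '{event.task_id}' stopped at subtask {event.subtask_index}: "
            f"{event.detail}. Please correct the issue and tell me when to resume.")

def speak(message: str) -> None:
    """Stand-in for the speech synthesis module; printed here instead of spoken."""
    print(f"[TTS] {message}")

# Example: a misoriented wedge is detected during assembly.
speak(describe_error(ErrorEvent("wedge_insertion", 2, "the wedge orientation is incorrect")))
```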
Virtual Layer
The virtual layer holds system functions to facilitate communication between human instructions and robot actions. This layer consists of two main agents
1. human agent:
The human agent converts voice instructions into text in a format the robot can understand. It uses a speech recognition module to convert the voice data into text and sends the instructions and related information to the robot via a communication module.
2. robot agent:
The robotic agent interprets voice instructions received from the human operator and performs tasks. This process is supported by the following functional modules
Initialization Module: initializes the robot agent and provides basic operating guidelines and task execution protocols. It defines the robot's ability to perform tasks and sets protocols for asking the operator for help in the event of errors.
LLM Module: The LLM converts human instructions into tasks and automatically detects and suggests the next task based on context. It also converts error information from the task control module into natural language and communicates it to the operator (a minimal sketch follows this module list).
Sensor Module: processes data from sensors and adjusts robot motion. For example, it recognizes the position and orientation of parts to make precise robot adjustments.
Task Control Module: executes tasks and manages errors. Verifies sensor data and notifies the operator through the LLM module if errors are detected.
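To make the LLM module concrete, here is a minimal sketch assuming the OpenAI Python SDK. The prompt and task vocabulary (pick, place, insert, pause, resume) are illustrative assumptions, not the authors' actual design.

```python
# Minimal sketch of the LLM module: map a transcribed voice instruction to a task.
# Assumes the OpenAI Python SDK (`pip install openai`) with OPENAI_API_KEY set;
# the prompt and task vocabulary are illustrative, not the authors' actual design.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are the task planner of a collaborative assembly robot. "
    "Convert the operator's instruction into bare JSON with the fields "
    "'task' (one of: pick, place, insert, pause, resume) and 'part'."
)

def instruction_to_task(transcript: str) -> dict:
    """Ask the LLM to turn a natural-language instruction into a structured task."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    # Assumes the model returns bare JSON, as requested in the system prompt.
    return json.loads(response.choices[0].message.content)

# Example usage:
# instruction_to_task("Pick up the wedge and insert it into the housing.")
# -> {"task": "insert", "part": "wedge"}
```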
Figure 1 illustrates the Human-Robot Collaborative Assembly Framework. This figure visually shows how human and robot agents work together to perform tasks.
Figure 1: Human-robot cooperative assembly framework using LLM
Human-Robot Cooperative Assembly Workflow
The overall workflow is illustrated in the sequence diagram in Figure 2, which describes the human-robot cooperative assembly process. This diagram shows how voice commands from the human operator are processed by the LLM module to guide the robot's actions.
The process begins with the operator giving a voice command, which the LLM module converts into a discrete set of tasks for the robot. The robot then requests sensor data to perform each task t_i, and the sensor module determines the validity of the data by comparing the detected parameters to predefined criteria. If the data is valid, the robot proceeds to execute the assigned t_i.
If execution is successful, a completion message M_c(t_i) is sent to the operator via the LLM module.
If the data is invalid or an error occurs during t_i, the robot generates an error message M_e(t_i) via the LLM module, notifying the human operator of the specific error and the subtask t_i^(c+1) at which it occurred so that it can be resolved efficiently. After the operator identifies and corrects the error, they issue a new command to the robot.
The robot then resumes execution of t_i from the interrupted subtask t_i^(c+1), based on the new sensor data. This procedure is repeated until t_i is completed (a minimal code sketch of this loop follows Figure 2).
Figure 2: Sequence diagram of human-robot cooperative assembly in a manufacturing system
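As a reading aid for this workflow, here is a minimal Python sketch of the loop described above; all helper functions are stubs, and the subtask names are illustrative rather than taken from the paper.

```python
import random

def get_sensor_data() -> dict:
    """Stub for the sensor module: report whether the required part was detected."""
    return {"part_detected": random.random() > 0.2}

def is_valid(data: dict) -> bool:
    """Stub validity check: compare detected parameters against predefined criteria."""
    return data["part_detected"]

def execute_subtask(name: str, data: dict) -> None:
    print(f"executing subtask: {name}")

def notify(message: str) -> None:
    """Stub for the LLM module's spoken message to the operator."""
    print(f"[to operator] {message}")

def wait_for_operator_command() -> None:
    input("Operator: correct the error and press Enter to resume... ")

def run_task(task_id: str, subtasks: list[str]) -> None:
    """Execute the subtasks of task t_i, pausing for operator help on invalid data."""
    c = 0
    while c < len(subtasks):
        data = get_sensor_data()           # robot requests sensor data for t_i
        if not is_valid(data):
            # error message M_e(t_i): report the interrupted subtask t_i^(c+1)
            notify(f"Error in {task_id} at subtask {c + 1}; please check the parts.")
            wait_for_operator_command()
            continue                       # resume from the interrupted subtask
        execute_subtask(subtasks[c], data)
        c += 1
    notify(f"{task_id} is complete.")      # completion message M_c(t_i)

run_task("cable_shark_assembly",
         ["place housing", "insert wedge", "insert spring", "fit end cap"])
```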
Case Study
In this study, the proposed framework was integrated into a manufacturing assembly system and applied to a cable shark product assembly operation. This case study was conducted to demonstrate the effectiveness of the framework.
LLM and ASR Modules
This section describes how the LLM and ASR modules were implemented in the system. The communication layer is built on OpenAI's speech recognition model "whisper-1" and speech synthesis model "tts-1", which transcribe the operator's voice instructions accurately and allow the robot to respond in spoken form. The LLM module uses OpenAI's pre-trained GPT-4, which translates the transcribed human instructions into tasks the robot can execute.
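For reference, a minimal sketch of such a speech interface with the OpenAI Python SDK is shown below; the GPT-4 prompting itself is sketched earlier in the framework section, and the file names and voice choice here are assumptions.

```python
# Minimal sketch of the speech interface using OpenAI's "whisper-1" (speech
# recognition) and "tts-1" (speech synthesis) models. Requires `pip install openai`
# and OPENAI_API_KEY; file names and the voice choice are illustrative.
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path: str) -> str:
    """Convert the operator's recorded voice command into text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def synthesize(message: str, out_path: str = "robot_reply.mp3") -> str:
    """Convert the robot's natural-language message into speech audio."""
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=message)
    with open(out_path, "wb") as f:
        f.write(speech.content)
    return out_path

# Example:
# text = transcribe("operator_command.wav")
# synthesize("The wedge appears misaligned. Please reposition it.")
```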
Sensor Module: Vision Systems
The sensor module incorporates a vision system. This system provides environmental data during the assembly process and feedback to the task control module. The YOLOv5 model is used for object detection, and custom models are trained using image data sets of individual parts (e.g., housings, wedges, springs, end caps). Figure 4 shows how the vision system extracts features. The system recognizes the location and orientation of parts to assist in accurate assembly operations.
Figure 4: Feature extraction method of the vision system
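As a rough illustration of this detection step, the snippet below loads a custom-trained YOLOv5 model through torch.hub and reads back part labels and positions. The weight-file name "cable_shark_parts.pt" is a hypothetical stand-in for the authors' trained model, and their orientation-estimation details are not reproduced here.

```python
# Rough sketch of the detection step: locate assembly parts with a custom YOLOv5
# model loaded via torch.hub. Requires the YOLOv5 dependencies (torch, pandas, etc.).
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="cable_shark_parts.pt")

def detect_parts(image_path: str) -> list[tuple[str, float, tuple[float, float]]]:
    """Return (label, confidence, bounding-box center) for each detected part."""
    results = model(image_path)
    detections = results.pandas().xyxy[0]  # columns: xmin, ymin, xmax, ymax, confidence, class, name
    parts = []
    for _, row in detections.iterrows():
        center = ((row.xmin + row.xmax) / 2.0, (row.ymin + row.ymax) / 2.0)
        parts.append((row["name"], float(row.confidence), center))
    return parts

# Example:
# detect_parts("workspace_frame.jpg")
# -> [("wedge", 0.91, (412.5, 288.0)), ("housing", 0.88, (530.0, 301.5)), ...]
```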
Task Control Module: Assembly Tasks
The Task Control Module executes tasks as directed by the LLM's interpretation of human instructions and manages errors. It verifies sensor data and proceeds with the task if the data is valid; if the data is invalid, it notifies the operator with error details through the LLM module. The cable shark assembly process is illustrated in Figure 5.
Figure 5: Cable Shark Assembly Process
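The paper does not spell out the validity criteria beyond comparing detected parameters to predefined values, so the check below is a minimal sketch under that assumption; the expected poses, tolerances, and coordinate units are hypothetical.

```python
# Minimal sketch of the task control module's validity check: detected part
# poses (from the vision system) are compared against predefined criteria.
from math import dist

# Hypothetical expected workspace position (pixel coordinates) and tolerance per part.
EXPECTED_POSE = {
    "housing": ((530.0, 300.0), 25.0),
    "wedge":   ((410.0, 290.0), 25.0),
    "spring":  ((350.0, 310.0), 25.0),
    "end_cap": ((600.0, 295.0), 25.0),
}

def is_valid(detections: dict[str, tuple[float, float]], required_part: str) -> bool:
    """Return True if the required part was detected within its position tolerance."""
    if required_part not in detections:
        return False  # missing part, e.g., the absent spring in scenario 3
    expected_center, tolerance = EXPECTED_POSE[required_part]
    return dist(detections[required_part], expected_center) <= tolerance

# Example: the spring was not detected, so the task control module flags an error.
print(is_valid({"housing": (531.0, 302.0), "wedge": (413.0, 288.0)}, "spring"))  # False
```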
Case Study Results
The proposed framework was integrated into a cable shark assembly system. Operators interacted with the robot through voice instructions to perform assembly tasks. In scenario 1, overlapping parts were detected and human intervention was requested. In scenario 2, the robot stopped and requested human correction when a wedge part was incorrectly assembled. In scenario 3, a missing spring part was detected and required a human operator to place the part. Table 1 shows the language variations of the instructions for each task. Table 2 shows the success rate of the language variations for each scenario.
Figure 6: Case study communication results for each scenario
Table 1: Language variations in task instructions
Table 2: Success rates for language variations
Case Study Discussion and Limitations
This case study evaluated how LLM integration makes human-robot collaboration more efficient and flexible. Results show that the more specific the instructions, the better the robot's performance. For example, the vague instruction "Correction is made. Resume the operations." failed due to lack of context and explicit task references. This result demonstrates the limitations of the proposed framework and room for improvement.
Conclusions and Future Issues
The development of large-scale language models (LLMs) has enabled human-robot collaborative assembly systems to execute actions and collaborate based on environmental data. By integrating LLMs, robots can better understand the human operator's instructions, resolve errors, and improve execution by leveraging feedback from the environment. In this study, an LLM was incorporated to enable dynamic responses to task variability in a manufacturing environment.
To address the challenges of human-robot cooperative assembly, this study places particular emphasis on the following
1. development of a communication system that does not require extensive robot training (C1)
2. improvement of flexibility to accommodate changes and errors (C2)
3. integration of advanced technology with human-centered design to improve ease of use (C3)
The cable shark device assembly process was used to validate the effectiveness of the framework and to achieve intuitive human-robot communication via voice commands. By integrating LLMs, sensors, and task control mechanisms, the framework responds dynamically to task variations and errors, ensuring a continuous workflow while maintaining productivity.
For future work, the authors plan to test the framework under real industrial conditions, including operator diversity and varying manufacturing-environment conditions (e.g., noise, dust, brightness). They also plan to supply a wider variety of robot-task and sensor data to enhance the adaptability of the LLM-based framework, improving task flexibility, safety, and the ability to handle unexpected errors. In addition, they plan to incorporate multimodal strategies such as haptics and gestures to improve human-robot interaction.