# We Want To Learn Generic Rewards From Human Motion Videos That We Can Use To Train Our Robots!

3 main points
✔️ Learns reward functions with high generalization performance from diverse, large-scale human videos
✔️ Solves even unknown tasks in unknown environments using the learned reward function
✔️ Demonstrates a high success rate in experiments with real robots

Written by Annie S. Chen, Suraj Nair, Chelsea Finn
(Submitted on 31 Mar 2021)
Subjects: Robotics (cs.RO), Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG)

Code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

For a robot to solve various tasks in various environments, it needs an indicator of task success, i.e. a reward, which is essential for planning and reinforcement learning. In the real world in particular, the reward function must generalize across tasks, objects, and environments using only information from onboard sensors such as RGB images. In areas such as computer vision and natural language processing, a degree of generality has been achieved by training on large, diverse datasets, but in robotics, collecting such large and diverse data is costly and difficult. In this study, we introduce a method called the Domain-agnostic Video Discriminator (DVD), which learns a reward function that can handle unknown tasks by making good use of videos of humans solving tasks, such as those available on YouTube. We show that this method can estimate rewards for unknown tasks and environments using a relatively large dataset of human videos and only a small amount of robot data. In the next section, we explain in detail how the method works.

## Method

Human video data is very different from a robot's observation space: humans and robots have different bodies, and their action spaces differ so much that human actions cannot simply be mapped to robot actions. Moreover, so-called "in-the-wild" data found on the web is noisy, with varying viewpoints and backgrounds. Still, such data is abundant and easy to access, which motivates using it to learn reward functions for robots.

So how do we use human data? The idea of the proposed method is to train a classifier that, given two videos, distinguishes whether they show the same task or different tasks. Using the activity labels attached to a large number of human videos and a small number of robot videos, the classifier learns whether two videos solve the same task even across a large visual domain gap. This method, called the Domain-agnostic Video Discriminator (DVD), is simple and widely applicable. After training, one input to the DVD is a human demonstration of the task to be solved and the other is a video of the robot in action; the classifier's output then serves as the robot's reward. The figure below illustrates the whole process.
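The same-task/different-task classifier described above can be sketched as follows. This is a minimal toy illustration with synthetic feature vectors, not the paper's actual model (which uses a learned video encoder over real human and robot clips); the names `encode`, `make_video`, and `reward` and all numeric settings are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy feature dimension

def encode(video):
    # Toy stand-in for a learned video encoder: mean-pool per-frame features.
    return video.mean(axis=0)

def make_video(task_center, n_frames=8):
    # Hypothetical synthetic "video": frames scattered around a per-task
    # feature center (stands in for real human or robot clips).
    return task_center + 0.1 * rng.normal(size=(n_frames, DIM))

def pair_features(v1, v2):
    # Pair representation: absolute difference of the two video encodings.
    return np.abs(encode(v1) - encode(v2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training pairs: label 1 if both videos show the same task, 0 otherwise,
# mirroring the same-task / different-task signal DVD is trained on.
tasks = [rng.normal(size=DIM) for _ in range(5)]
X, y = [], []
for _ in range(400):
    if rng.random() < 0.5:
        t = rng.integers(5)
        X.append(pair_features(make_video(tasks[t]), make_video(tasks[t])))
        y.append(1.0)
    else:
        t1, t2 = rng.choice(5, size=2, replace=False)
        X.append(pair_features(make_video(tasks[t1]), make_video(tasks[t2])))
        y.append(0.0)
X, y = np.array(X), np.array(y)

# Fit a logistic classifier on the pair features by gradient descent.
w, b, lr = np.zeros(DIM), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * (p - y).mean()

def reward(human_demo, robot_video):
    # DVD-style reward: probability that the robot video solves the
    # same task as the human demonstration.
    return sigmoid(w @ pair_features(human_demo, robot_video) + b)
```

After training, `reward(human_demo, robot_video)` plays the role of the learned reward: it is high when the robot's behavior looks like it solves the same task as the human demonstration, and it can be maximized by a planner or reinforcement-learning agent.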

### Relationship between the amount of data and performance of robots

Finally, we investigated the relationship between the amount of robot data and performance. The previous experiments used 120 robot demonstrations; the figure below shows how performance changes when the amount of robot data is reduced. The number of successful tasks does not drop significantly with less data, indicating that DVD can achieve good performance from only a small amount of robot data.

## Summary

The difficulty of collecting large-scale robot data has been noted for a long time, and there has recently been much research, including this paper, on improving learning efficiency and generalization performance by making good use of human data. However, since the domain gap between robot data and human data is large, how to use this data more effectively remains a question for future work.
