
How To Make A Machine Learn Intuitive Human Understanding?


Machine Learning

3 main points
✔️ Proposes a method for efficiently learning complex physics problems in order to design robots that can interact with the physical world.
✔️ Describes the steps for training a machine on the problem of moving a marble to the center of a Circular Maze Environment (CME).
✔️ Designs an agent that obtains its physics concepts from a physics engine.

Data-Efficient Learning for Complex and Real-Time Physical Problem Solving using Augmented Simulation
written by Kei Ota, Devesh K. Jha, Diego Romeres, Jeroen van Baar, Kevin A. Smith, Takayuki Semitsu, Tomoaki Oiki, Alan Sullivan, Daniel Nikovski, Joshua B. Tenenbaum
(Submitted on 14 Nov 2020 (v1), last revised 16 Feb 2021 (this version, v2))
Comments: Under submission

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper presents an effort to learn, efficiently, how to navigate a marble through a circular maze. For this task, which is intractable with standard reinforcement learning, the algorithm interacts with the real system, initializes a physics engine with parameters estimated from the data, and uses Gaussian process regression to correct the physics engine's remaining errors. In this way, a hybrid model is proposed that learns to move the marble in this very complex environment within minutes.

Figure 1.


This paper pursues the goal of designing robots with artificial intelligence using flexible, data-efficient, and generalizable methods. It combines a predefined model with a data-driven approach to interacting with the physical world, learning the residuals between predictions and actual observations to update the model. The paper presents a novel framework for real-time physical control, demonstrated in a circular maze environment, where the proposed approach learns sample-efficiently. Key contributions include a hybrid model that augments a physics engine with machine learning models and a demonstration of sample-efficient learning in a Circular Maze Environment (CME).

Related Research

This research focuses on advances in reinforcement learning in the fields of computer games and robotics, and proposes a solution to the problem of limited real-world applicability of algorithms that have been successful in simulations. The use of model-based agents and general-purpose physics engines is explored to learn efficiently in real physical environments. This includes transfer from simulation to realistic situations and the use of hybrid learning models, and methods are explored to leverage physical knowledge for efficient control.

Problem Formulation

Consider the problem of moving a marble to the center of a CME. The goal is to study sim-to-real transfer in a model-based setting, where the agent uses a physics engine as its initial knowledge of the environment's physics.

This paper attempts to answer the following three questions in this setting.

(i) What is needed in a model-based sim-to-real architecture for efficient learning in physical systems?
(ii) How can we design sim-to-real agents that operate and learn in a data-efficient manner?
(iii) How does the agent's performance and learning compare to how humans learn to solve these tasks?

sim-to-real ... simulation to real

The research uses the CME as its test environment, with the hope that the proposed techniques will carry over to robotic systems in general. The agent's goal is to learn an accurate model of the marble's dynamics so that the controller can select actions on the CMS based on its state. MuJoCo is used as the physics engine (f_PE), combined with a residual dynamics model (f_GP) to approximate the dynamics of the real system (f_real).


The approach to designing learning agents is inspired by human physical reasoning. That is, a person can solve a new manipulation task in a few attempts. This is primarily because it relies on physics concepts that humans have already learned. Following a similar principle, we design an agent whose physics concepts are obtained from a physics engine. The proposed approach is outlined in Figure 2.

Figure 2.

In this paper, a method is proposed for designing a sim-to-real agent that uses a physics engine to bridge the gap between the simulated and real environments. The initial parameters of the physics engine are set randomly and then estimated from the discrepancies between simulated and real trajectories using an evolutionary strategy (CMA-ES). The remaining errors are corrected with Gaussian process regression, and finally an NMPC policy is used to control the real environment. The gap between simulation and reality stems from physics-engine approximations and system-level issues, and the method corrects for both through parameter estimation and Gaussian process regression.

A: Physics engine

In this paper, MuJoCo is used as the physics engine (PE) to consider a ring-shaped environment (CME) with constrained marble motion. The radial motion of the marbles is ignored and only angular positions are considered in the model. However, in order to study the performance of the agents in the simulation, a complete model without marble constraints is also created.

Two different physics-engine models are constructed: a reduced model suitable for RL (f_PEred) and a model that uses the full internal state of the simulator (f_PEfull). The two models differ in how they represent the marble's position, and f_PEfull serves as an approximation of the real system in the simulation study. Experiments called "sim-to-sim" evaluate how well an agent initialized from one physics engine adapts to the other, more complex, environment.

sim-to-sim ... simulation-to-simulation

B: Model learning

Consider a discrete-time system

x_{k+1} = f(x_k, u_k) + e_k,    (1)

where x_k ∈ R^4 denotes the state, u_k ∈ R^2 the action, and e_k white Gaussian noise with diagonal covariance at discrete time k ∈ [1, ..., T]. In the proposed approach, the unknown dynamics f in Eq. 1 represents the real CMS dynamics f_real, which is modeled as the sum of the following two components:

f_real(x_k, u_k) = f_PEred(x_k, u_k) + f_GP(x_k, u_k)    (2)

Here, f PEred represents the physics engine model defined in the previous section, and f GP represents the Gaussian process model that learns the residuals between the actual dynamics and the simulator dynamics. To improve the accuracy of the model, both components of f PE red and f GP are learned. This approach is presented as a pseudo code for Algorithm 1 and is described as follows

・(1) Estimation of physical parameters

First, the physical parameters of the actual system are estimated. Since it is difficult to measure physical parameters directly on the actual system, we use CMA-ES to estimate the four friction parameters of MuJoCo. As described in Algorithm 1, we first collect multiple episodes in the real system using the NMPC controller. We then use CMA-ES to estimate the optimal friction parameter µ∗ that minimizes the difference in marble motion between the real and simulated systems as follows

µ∗ = argmin_µ Σ_{(x_k, u_k, x_{k+1}) ∈ D} ‖ x_{k+1} − f_PEred(x_k, u_k; µ) ‖²_W

where D represents the transitions collected on the actual system and W is the weight matrix. W contains a single nonzero entry of 1, corresponding to the angular position θ_{k+1} of the marble in the state x_{k+1}.
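The parameter search can be sketched with a simplified elitist evolution strategy standing in for full CMA-ES (the paper uses CMA-ES proper); the reduced simulator, the hidden true friction value, and the collected data below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_rollout(friction, x0, actions):
    """Hypothetical reduced simulator; the paper instead rolls out MuJoCo."""
    xs, x = [], x0
    for u in actions:
        x = 0.95 * x + 0.1 * u - friction * np.tanh(x)
        xs.append(x)
    return np.array(xs)

# "Real" transitions D, generated here with a hidden true friction of 0.3.
x0, actions = 1.0, rng.normal(size=20)
real_traj = sim_rollout(0.3, x0, actions)

def loss(friction):
    # Weighted error on the marble's position (the role of W in the objective).
    return np.sum((real_traj - sim_rollout(friction, x0, actions)) ** 2)

# Simplified evolution strategy: sample around the mean, keep the elite.
mean, sigma = 0.5, 0.2
for _ in range(30):
    pop = mean + sigma * rng.normal(size=16)
    elite = pop[np.argsort([loss(f) for f in pop])[:4]]
    mean, sigma = elite.mean(), max(0.5 * sigma, 1e-3)

# mean should now be close to the hidden friction value 0.3
```

In practice the `cma` package provides the full CMA-ES update (covariance adaptation, step-size control); the loop above only preserves the sample-evaluate-select structure.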

・(2) Residual model learning using Gaussian process

Due to the modeling limitations described at the beginning of this section, discrepancies remain between the simulator and the real system even after the physical parameters are estimated. To obtain a more accurate model, a Gaussian process (GP) model with a standard linear kernel is trained to learn the residuals between the two systems, fitting its hyperparameters by maximizing the marginal likelihood (equivalently, minimizing the negative log marginal likelihood L_GP).

After collecting trajectories on the real system, the physics engine is reset with the estimated physical parameters µ∗ to generate simulator predictions. The GP then learns the input-output relation f_GP(x_k^real, u_k^real) = x_{k+1}^real − x_{k+1}^sim, and two independent GP models are trained, one for the marble's position and one for its velocity. The GP models were found to perform best in terms of prediction accuracy and data efficiency on the real system.
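With a linear kernel the GP posterior mean has a simple closed form; this sketch fits the residuals directly with NumPy (the data, dimensions, and noise level are hypothetical, and the paper additionally tunes hyperparameters by marginal-likelihood maximization):

```python
import numpy as np

rng = np.random.default_rng(1)

# Inputs z = (x_k, u_k); targets are residuals x_real_{k+1} - x_sim_{k+1}.
Z = rng.normal(size=(50, 6))                  # 4 state dims + 2 action dims
true_w = np.array([0.2, -0.1, 0.05, 0.0, 0.3, -0.2])
y = Z @ true_w + 0.01 * rng.normal(size=50)   # hypothetical residual data

noise = 1e-2
K = Z @ Z.T + noise * np.eye(50)              # linear-kernel Gram matrix
alpha = np.linalg.solve(K, y)

def gp_residual(z_new):
    """Posterior mean of the residual at a new (state, action) pair."""
    return (Z @ z_new) @ alpha

z_test = rng.normal(size=6)
pred = gp_residual(z_test)
```

Because the kernel is linear, this is equivalent to ridge regression on the residuals; a separate model of this form would be fit for each output (position and velocity).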

・(3) Modeling of motor motion

The CMS tip-tilt platform is actuated by hobby-grade servo motors operating in position-control mode with a long settling time. Together with the computation time of the control algorithm, this causes an actuation delay that the physics engine does not capture. To address this problem, an inverse model of the motors is learned: given a desired action, it predicts the control signal to send to the tip-tilt platform's motors. The inverse motor model is trained by exciting the CMS with sinusoidal motor inputs and collecting the motor-response data.
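A minimal sketch of fitting such an inverse model by least squares on sinusoidal excitation data; the first-order motor model and the chosen regression features are invented for illustration:

```python
import numpy as np

# Excite a hypothetical first-order motor with sinusoids and log its response.
t = np.linspace(0, 10, 500)
cmd = np.sin(1.5 * t) + 0.5 * np.sin(4.0 * t)
resp = np.zeros_like(cmd)
for k in range(1, len(t)):
    resp[k] = 0.9 * resp[k - 1] + 0.1 * cmd[k - 1]   # lag / settling time

# Inverse model: predict the command from the current and previous response.
X = np.stack([resp[1:], resp[:-1]], axis=1)
w, *_ = np.linalg.lstsq(X, cmd[:-1], rcond=None)

def inverse_motor(resp_now, resp_prev):
    return w[0] * resp_now + w[1] * resp_prev

# For this toy motor, resp[k] = 0.9 resp[k-1] + 0.1 cmd[k-1]
# implies cmd[k-1] = 10 resp[k] - 9 resp[k-1], so w recovers (10, -9).
```

The same idea carries over to richer feature sets (more lags, nonlinear terms) when the motor dynamics are not first-order.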

C: Trajectory optimization using iLQR

In model-based control, the iterative LQR (iLQR) algorithm is used to solve the controller-design optimization problem in a computationally efficient manner. Although other optimization solvers could generate optimal solutions, iLQR provides an efficient one. Formally, the following trajectory optimization problem is solved for the controls u_k over the time steps k ∈ [0, T−1].

For the state cost, we use a quadratic function of the state error measured from the target state x_target (here, the gate closest to the marble's current position):

c_state(x_k) = (x_k − x_target)^T W (x_k − x_target)

where the matrix W contains the weights for the individual states. For the control cost, we likewise use a quadratic penalty on the control:

c_control(u_k) = λ_u ‖u_k‖²

In the iLQR optimization, introducing a smoothed version of the cost function did not change the behavior of iLQR. Using the discrete-time dynamics and the cost function, a local linear model and a quadratic approximation of the cost are computed along the system's trajectory and solved iteratively to obtain the optimal control inputs and local gain matrices. The solution of this optimization is referred to as the reference trajectory. The weights were tuned empirically only once, at the start of training, to W = diag(4, 4, 1, 0.4) and λ_u = 20 in the experiments.
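The two quadratic cost terms with the weights quoted above can be written down directly; the target state and example trajectory here are hypothetical:

```python
import numpy as np

W = np.diag([4.0, 4.0, 1.0, 0.4])   # state-error weights
lam_u = 20.0                         # control penalty lambda_u

def state_cost(x, x_target):
    e = x - x_target
    return e @ W @ e                 # (x - x_target)^T W (x - x_target)

def control_cost(u):
    return lam_u * (u @ u)           # lambda_u * ||u||^2

def trajectory_cost(xs, us, x_target):
    """Total cost of a rollout: summed state and control costs."""
    return sum(state_cost(x, x_target) for x in xs) + sum(control_cost(u) for u in us)

x_target = np.zeros(4)
cost = trajectory_cost([np.ones(4)], [np.full(2, 0.1)], x_target)
# state term: 4 + 4 + 1 + 0.4 = 9.4; control term: 20 * 0.02 = 0.4
```

iLQR then quadratizes exactly these terms around the current trajectory at every iteration.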

D: On-line control using nonlinear model predictive control

Controlling the motion of the marble in the real system is difficult, affected by issues such as static friction and delays. Online model-based feedback control is therefore required, using a trajectory-tracking MPC controller. An iLQR-based NMPC controller controls the system in real time, generating control signals from a least-squares tracking cost function.

The control rate is 30 Hz; the optimizer is warm-started with the pre-computed reference trajectory, and the computations are parallelized to meet the real-time constraints.
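The receding-horizon loop with warm starting can be sketched as follows; the plant and the one-step `optimize` routine are toy placeholders for the real system and the iLQR-based NMPC solver:

```python
import numpy as np

def step(x, u):
    """Hypothetical plant: contractive toy dynamics standing in for the CMS."""
    return 0.9 * x + 0.1 * u

def optimize(x, u_init):
    """Placeholder for one iLQR solve, warm-started at u_init."""
    # Here: drive the state toward zero, saturated to the actuator range.
    return np.clip(-x, -1.0, 1.0) * np.ones_like(u_init)

x = np.array([1.0])
u_plan = np.zeros((10, 1))               # horizon of 10 controls
for _ in range(40):                      # the real loop runs at 30 Hz
    u_plan = optimize(x, u_plan)         # warm start from the previous plan
    x = step(x, u_plan[0])               # apply the first control, then re-plan
    u_plan = np.roll(u_plan, -1, axis=0) # shift: reuse the tail as the next warm start
    u_plan[-1] = u_plan[-2]
```

Warm starting matters because each solve then begins near the previous optimum, which is what makes the 30 Hz deadline reachable.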

Figure 3.

Comparison of actual trajectories (red), predicted trajectories using physical properties estimated using CMA-ES (blue), and trajectories using default physical properties from sim-to-sim experiments (green). Trajectories are generated from random initial points with a random policy.


Experiments

This section tests how the proposed approach performs on the CMS and how it compares to human performance.

A: Estimation of physical properties using CMA-ES

In this section, we show the behavior of the physical parameter estimation in the sim-to-sim and sim-to-real settings. The sim-to-sim experiments confirm that CMA-ES produces sufficiently accurate parameters and that the estimated parameters bridge the gap between the different dynamics; the results are shown in Fig. 3. The sim-to-real experiments also showed that CMA-ES optimization reduced the ball-position error. However, issues related to static friction remain and need to be corrected by GP regression after the initial CMA-ES parameter estimation.

B: Control performance in a real system

Sim-to-sim agents performed well with CMA-ES fine-tuning, and residual learning was introduced for the real system. CMA-ES is well suited to sim-to-sim transfer, but when applied to the robot there are differences between the internal model and the real-world dynamics, so the CMA-ES model was extended in a data-driven fashion and iteratively improved with the GP residual model. As the amount of training data increased, performance improved, especially in the outer and inner rings, reducing the control time in each ring.

Figure 4.

C.: Comparison with human performance

Participants and the learning models (CMA-ES and CMA-ES+GP1) were given the same CME task, in which participants used a joystick to guide the marble through the maze. Participants' solution times decreased slightly, but not statistically significantly; the models' learning proceeded similarly, with a decrease in time that was also not statistically significant. The task in the inner ring was difficult, and humans and models alike spent more time in that ring. In contrast, the SAC algorithm, trained in simulation, learned the shortest path and minimized the time spent in the inner ring.

Conclusions and Future Initiatives

The paper proposes a method, grounded in findings from cognitive science, for building an agent that efficiently controls a marble in a complex circular maze. Physical parameters are estimated on the real system via physics-engine initialization and a Gaussian process regression model, and the marble is controlled using a combination of iLQR and MPC. This approach is shown to be more data-efficient than conventional reinforcement learning and can adapt to the task within minutes; it is also flexible and applicable to other physical control tasks. Future work will examine its generality, apply it to different mazes, and integrate it with general-purpose robot-optimization software to make it more effective.

It is interesting that the combination of iLQR and MPC improves data efficiency by focusing on estimating the actual physical parameters, including initializing the physics engine and using Gaussian process regression models.

Of particular practical value is the result that the method achieves higher data efficiency than conventional reinforcement learning, allowing the agent to adapt to new tasks within minutes. Another strength is its flexibility: it can be applied to other physical control tasks.

Looking ahead, the generality of the proposed method could be verified on different mazes, and integration with general-purpose robot-optimization software could further increase its usefulness. Overall, it is felt that an approach combining principles from cognitive science and physics could lead to new advances in robotics.
