Mind's Eye: Extending Prompts With Simulation To Improve Physical Reasoning

Large Language Models

3 main points
✔️ Proposed a benchmark dataset UTOPIA to investigate the physical reasoning ability of language models
✔️ Proposed a method called Mind's Eye to improve the reasoning ability of language models by reflecting the results of physical simulations in prompts
✔️ Outperforms existing methods for improving reasoning

Mind's Eye: Grounded Language Model Reasoning through Simulation
written by Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai
(Submitted on 11 Oct 2022)
Comments: Published on arXiv.

Subjects:  Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Although large-scale language models have recently achieved superhuman performance in a variety of tasks, various drawbacks have been reported. One such drawback is poor reasoning ability due to lack of knowledge and experience in the physical world.

Because of the way they are trained, current language models can only grasp the phenomena of the physical world from linguistic information, which may lead to inferences that violate the laws of physics.

Several measures have been devised to address this problem. For example, one approach is to use prompts that lead the language model to reason step by step. However, this relies entirely on the knowledge stored within the language model. Another approach, which actively uses knowledge outside the language model, is to augment it through retrieval, but knowledge expressed in written language is still subject to bias.

To address these issues, this paper investigates the extent to which current language models understand physical laws and proposes a method to improve physical reasoning ability using simulation.

A correct understanding of the physical world is important not only for human-level reasoning ability, but also for general-purpose, physical intelligence, and this paper is a contribution in that direction.


To investigate the extent to which current language models understand physical concepts and laws, the paper proposes the UTOPIA dataset as a benchmark.

The dataset asks how objects move in six representative scenes (motion, friction, free fall, projection, collision, and slope) selected from high school physics textbooks and other sources. To probe perceptual abilities similar to those humans use in the real world, the questions are phrased in relative terms (greater than, smaller than, and so on) rather than in absolute numerical values. Because the answers can be computed by a physics engine, the dataset is easy to scale.
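To make the idea of relative questions with computed answers concrete, here is a minimal sketch. It is not the UTOPIA generator: it assumes a single free-fall scene without air resistance and replaces the physics engine with the closed-form fall time. The function names and the question wording are hypothetical.

```python
import math
from typing import Tuple

G = 9.81  # assumed gravitational acceleration in m/s^2, no air resistance


def fall_time(height_m: float) -> float:
    """Time for an object dropped from rest to fall height_m: t = sqrt(2h / g)."""
    return math.sqrt(2.0 * height_m / G)


def relative_label(a: float, b: float, tol: float = 1e-9) -> str:
    """Turn two numeric quantities into a relative comparison phrase."""
    if abs(a - b) <= tol:
        return "the same as"
    return "greater than" if a > b else "smaller than"


def make_free_fall_question(h1: float, h2: float) -> Tuple[str, str]:
    """Build a (question, answer) pair; the numbers never appear in the text."""
    question = (
        "Two balls are dropped from rest. "
        f"The initial height of ball 1 is {relative_label(h1, h2)} that of ball 2. "
        "The time ball 1 takes to reach the ground is ___ that of ball 2."
    )
    # The ground-truth label comes from the computed outcome, not from a human.
    answer = relative_label(fall_time(h1), fall_time(h2))
    return question, answer


question, answer = make_free_fall_question(2.0, 1.0)
print(question)
print(answer)
```

Because the label is derived from the computed outcome, new samples can be produced simply by drawing new heights, which is the sense in which such a dataset scales easily.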

The following table shows a sample of UTOPIA. As shown on the far right side of the table, 39 different subtasks are available.

Mind's Eye

The paper also proposes Mind's Eye, a system that uses physical simulation to improve physical reasoning ability. It has the following structure (see figure below).

Mind's Eye consists of three components: a text-to-code converter, a physics simulation engine, and a foundation model.
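The way the three components fit together can be sketched as a simple composition of functions. This is only an illustration under stated assumptions: the three callables, the function name `minds_eye_answer`, and the stand-in implementations are hypothetical, and the real system uses a trained converter, MuJoCo, and a large language model in their place.

```python
from typing import Callable


def minds_eye_answer(
    question: str,
    text_to_code: Callable[[str], str],      # question text -> rendering code
    simulate: Callable[[str], str],          # rendering code -> textual observation
    foundation_model: Callable[[str], str],  # prompt -> answer
) -> str:
    """Run the three stages in order and return the model's answer."""
    rendering_code = text_to_code(question)
    observation = simulate(rendering_code)
    # The simulation result is placed in the prompt ahead of the question.
    prompt = f"{observation}\n{question}"
    return foundation_model(prompt)


# Hypothetical stand-ins so the sketch runs without any external dependency.
def fake_text_to_code(question: str) -> str:
    return "<mujoco/>"


def fake_simulate(rendering_code: str) -> str:
    return "Hint: Both balls reach the ground at the same time."


def fake_foundation_model(prompt: str) -> str:
    return "the same as" if "same time" in prompt else "unknown"


print(minds_eye_answer(
    "Which ball lands first, the heavy one or the light one?",
    fake_text_to_code,
    fake_simulate,
    fake_foundation_model,
))
```

Keeping each stage behind its own function boundary means any one of them can be swapped out without touching the others.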

Text-to-code converter

To feed a textual question into MuJoCo, the physics engine, the text must first be converted into an XML file. For this purpose, a language model is trained that, given a query text, outputs an XML file that is valid for MuJoCo. It is a decoder-only language model trained from scratch in an autoregressive fashion on 200,000 pairs of query text and XML representation.
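As an illustration of what one text and XML pair might look like, the sketch below builds a small MuJoCo XML (MJCF) scene with two free-falling spheres. It is an assumption-laden example, not output from the paper's converter: a fixed template (the hypothetical `render_two_ball_scene`) stands in for the trained language model, and the XML is only checked for well-formedness with the Python standard library, not executed in MuJoCo.

```python
import xml.etree.ElementTree as ET


def render_two_ball_scene(mass_1: float, mass_2: float, height: float) -> str:
    """Return a minimal MJCF string: a floor and two spheres that can fall freely."""
    return f"""<mujoco model="free_fall">
  <option gravity="0 0 -9.81"/>
  <worldbody>
    <geom name="floor" type="plane" size="5 5 0.1"/>
    <body name="ball_1" pos="-0.5 0 {height}">
      <freejoint/>
      <geom type="sphere" size="0.1" mass="{mass_1}"/>
    </body>
    <body name="ball_2" pos="0.5 0 {height}">
      <freejoint/>
      <geom type="sphere" size="0.1" mass="{mass_2}"/>
    </body>
  </worldbody>
</mujoco>"""


# One hypothetical training pair: query text and its XML representation.
example_pair = {
    "text": "Two balls are dropped from the same height. "
            "The mass of ball 1 is smaller than that of ball 2.",
    "xml": render_two_ball_scene(mass_1=1.0, mass_2=5.0, height=2.0),
}

# Well-formedness check only; a real pipeline would load the XML in MuJoCo.
root = ET.fromstring(example_pair["xml"])
print(root.tag, [body.get("name") for body in root.iter("body")])
```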

Simulation augmented prompting

On receiving the XML file, the physics engine runs the simulation, and the result is inserted into the prompt for the foundation model, the third component of Mind's Eye (blue text on the right side of the figure above).
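A minimal sketch of this prompting step follows. It assumes the simulation yields the time at which each object hits the ground; the helper names, the tolerance, and the "Hints / Question / Answer" layout are assumptions for illustration rather than the exact format used in the paper.

```python
def outcome_to_hint(name_a: str, name_b: str, time_a: float, time_b: float,
                    tol: float = 1e-3) -> str:
    """Describe two simulated landing times as a natural-language hint."""
    if abs(time_a - time_b) <= tol:
        return f"{name_a} and {name_b} take the same time to hit the ground."
    first, second = (name_a, name_b) if time_a < time_b else (name_b, name_a)
    return f"{first} hits the ground before {second}."


def augment_prompt(question: str, hint: str) -> str:
    """Place the simulation-derived hint ahead of the question."""
    return f"Hints: {hint}\nQuestion: {question}\nAnswer:"


# Landing times here are made-up placeholder values, not simulation output.
hint = outcome_to_hint("The heavy ball", "the light ball", 0.639, 0.639)
print(augment_prompt(
    "A heavy ball and a light ball are dropped from the same height. "
    "Which one hits the ground first?",
    hint,
))
```

The language model never sees raw simulator state in this sketch; it only receives a short sentence summarizing the outcome.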


To evaluate existing language models, 100 samples are prepared for each of UTOPIA's 39 subtasks, for a total of 3,900 examples.

The language models under evaluation are GPT-3 and PaLM.

The results are shown in the following graph.

The blue and orange bars show the performance of the model before the prompt was extended by Mind's Eye. Blue is the zero-shot case and orange is the few-shot case.

The performance improves as the model size of the language model increases, but the improvement reaches a plateau, especially in the case of few-shot.

This is because, although few-shot prompting makes in-context learning more effective than zero-shot prompting, the lack of physical reasoning grounded in the real world remains a bottleneck that limits further improvement.

In contrast, the purple and red bars show the performance of the model when the prompts are extended by Mind's Eye. Purple is the zero-shot case and red is the few-shot case.

Thanks to the enhancements made by Mind's Eye, we see a significant increase in reasoning capability.

We also see that even small models that use Mind's Eye achieve better physical reasoning performance than larger models that do not use it.

This demonstrates the effectiveness of decoupling experimentation from reasoning. The language model itself concentrates only on reasoning, while a domain-specific simulation rooted in the physical world is used as an external tool. This appears to allow a much smaller language model to reach the same level of performance.

Comparison with various techniques

In this section, Mind's Eye is compared with other methods for improving the reasoning capability of language models.

The comparison includes prompt improvement methods such as Zero-shot Reasoner, which adds the phrase "Let's think step by step" to the prompt given to the model, as well as methods such as RAG, which retrieves external knowledge.
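For reference, the difference between a plain prompt and a Zero-shot Reasoner prompt can be sketched as follows. The trigger sentence is the one quoted above; the "Q:" and "A:" layout and the function names are assumptions made for this example.

```python
TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Plain zero-shot prompt: the model is asked for the answer directly."""
    return f"Q: {question}\nA:"


def zero_shot_reasoner_prompt(question: str) -> str:
    """Same prompt with the trigger sentence added to elicit step-by-step reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


q = "Two balls are dropped from the same height. Which one lands first?"
print(direct_prompt(q))
print(zero_shot_reasoner_prompt(q))
```

Unlike the simulation-based approach, nothing in this prompt comes from outside the model, so the reasoning still depends entirely on what the model already knows.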

The GPT-3 175B model is used as the basis for the methods being compared.

The results are shown in the following table. It can be seen that the proposed method, Mind's Eye, outperforms other methods in both the zero-shot and few-shot cases.

Comparing the GPT-3 1.3B and 175B results shows that extending prompts with Mind's Eye is more effective than simply increasing the model size.


The Mind's Eye method introduced in this paper runs trials in a simulation and feeds the results into the language model's prompt, thereby unlocking reasoning capabilities hidden in the language model. The approach is not limited to physical simulation and may be applicable in other fields as well.