![SOK-Bench: Situational Video Inference Benchmark Using Real-World Knowledge in Video](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/February2025/sok-bench.png)
SOK-Bench: Situational Video Inference Benchmark Using Real-World Knowledge in Video
3 main points
✔️ Proposed SOK-Bench, a benchmark of more than 44,000 questions over 10,000 videos that integrates dynamic situations with commonsense knowledge.
✔️ Knowledge graphs (SKG, GKG, SCKG) are used to enable reasoning about temporal and causal processes in videos and to generate question-answer pairs.
✔️ State-of-the-art large language models and multimodal models were evaluated, exposing the limits of their reasoning capabilities.
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Improving the ability of artificial intelligence to properly understand real-world video and to perform commonsense reasoning is an extremely important issue in the development of intelligent systems. Tasks involving video data in particular require not only simple object and action recognition, but also the ability to interpret the situation appropriately and draw rational inferences from it. However, most video reasoning benchmarks to date have been limited to simple fact-based question answering or to inference within a narrowly defined situation, and they do not support advanced reasoning that draws on open-world knowledge.
To address this problem, this paper proposes a new video reasoning benchmark called SOK-Bench (Situated Open-world Knowledge Benchmark). Its main features are as follows:
- It contains more than 44,000 questions built over about 10,000 dynamic situations (videos).
- It integrates situated and general knowledge about each video using knowledge graphs: the Situated Knowledge Graph (SKG), the General Knowledge Graph (GKG), and the Situated Commonsense Knowledge Graph (SCKG), as sketched after this list.
- QA (question-answer) data is generated at scale by an automatic method combining LLMs (Large Language Models) and MLLMs (Multimodal Large Language Models), followed by manual quality checks.
- An evaluation of state-of-the-art Vision-Language Models (VLMs) shows that current AI models still have clear limitations in video reasoning.
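To make the three knowledge graphs more concrete, the following is a minimal sketch of how their contents could be represented as typed triples. The class, relation, and entity names are illustrative assumptions for this article, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    # A single (subject, relation, object) edge in a knowledge graph.
    subject: str
    relation: str
    obj: str

# Situated Knowledge Graph (SKG): facts observed in one specific video.
skg = [
    Triple("person", "holds", "cornstarch"),
    Triple("person", "adds", "cornstarch to the sauce"),
]

# General Knowledge Graph (GKG): commonsense facts independent of any video.
gkg = [
    Triple("cornstarch", "is_used_to", "thicken liquids"),
]

# Situated Commonsense Knowledge Graph (SCKG): general knowledge grounded in
# the entities and actions of the observed situation.
sckg = [
    Triple("adding cornstarch", "causes", "the sauce to thicken"),
]
```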
What makes this research particularly notable is that it requires AI not merely to understand the video, but to reason about the knowledge and cause-and-effect relationships implied by it. For example, given a scene in which a person is cooking, AI should ideally be able not only to recognize the ingredients and the flow of the cooking, but also to infer how the dish would change if a certain ingredient were missing. To enable this kind of reasoning, SOK-Bench employs a design that integrates video data with knowledge graphs.
Related Research
Video Question Answering (VQA)
Video question answering is a task in which AI watches a video and generates appropriate answers, and it has been studied extensively. Representative datasets include:
- CLEVR (2017): rule-based visual reasoning dataset.
- VCR (2019): question-answer dataset for understanding human behavior and intentions.
- AGQA (2021): video question and answer dataset requiring temporal and causal inference.
These datasets focus mainly on visual feature extraction and static scene understanding, and they give only limited consideration to temporal change and causal relationships within the video. SOK-Bench, by contrast, is designed to capture the situation in the video and its causal relationships explicitly, and it integrates open-world knowledge to enable more sophisticated reasoning.
Proposed Method
The SOK-Bench dataset consists of question-answer data generated automatically from videos, and it is constructed by the following procedure:
1. Situated knowledge extraction from the video
    - Video scenes are analyzed to extract objects, people, actions, and temporal relationships.
    - For example, in a cooking scene, the ingredients and cooking steps are recorded.
2. General knowledge integration
    - Additional information is attached to the extracted situated knowledge using the General Knowledge Graph (GKG).
    - For example, the knowledge that "cornstarch is used to thicken liquids" is added.
3. Question-answer generation
    - Question-answer data is generated automatically from the knowledge graphs.
    - E.g., "How would not using cornstarch affect the dish?"
4. Quality checks through manual review
    - The automatically generated data is checked manually to ensure quality.
Thus, SOK-Bench is a dataset that can evaluate not only video understanding but also advanced reasoning that makes use of knowledge.
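As a rough illustration of the generation procedure described above, the sketch below turns SCKG-style triples into question-answer candidates via templates. The relation names, templates, and data format are assumptions made for this article; the paper drives this step with LLMs/MLLMs and then applies manual review.

```python
# Minimal sketch of template-based QA generation over SCKG-style triples.
# Templates and relation names are illustrative assumptions, not the paper's.
QUESTION_TEMPLATES = {
    "causes": "How would not {subject} affect the outcome shown in the video?",
    "is_used_to": "What is {subject} used for in this situation?",
}

def generate_qa(triples):
    """Turn (subject, relation, object) triples into question/answer pairs."""
    qa_pairs = []
    for subject, relation, obj in triples:
        template = QUESTION_TEMPLATES.get(relation)
        if template is None:
            continue  # no template for this relation type
        qa_pairs.append({
            "question": template.format(subject=subject),
            "answer": obj,
        })
    return qa_pairs

if __name__ == "__main__":
    sckg = [
        ("adding cornstarch", "causes", "the sauce to thicken"),
        ("cornstarch", "is_used_to", "thicken liquids"),
    ]
    for qa in generate_qa(sckg):
        print(qa["question"], "->", qa["answer"])
```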
Experimental Results
To validate the effectiveness of SOK-Bench, the authors evaluated representative LLMs and MLLMs. The main models evaluated are as follows:
- GPT-4V (OpenAI)
- Video-LLaMA (a LLaMA-based video understanding model)
- PandaGPT (a model integrating video, audio, and text)
- AskAnything (a multimodal question-answering model)
- Valley (a recent video understanding model)
Analysis of Results
The evaluation on SOK-Bench showed that current models still face challenges in causal reasoning and in the use of open-world knowledge.
- Although GPT-4V achieved the highest score, its accuracy remained below 60%.
- Existing video understanding models (e.g., Video-LLaMA) reach accuracies of 40% or lower and in many cases fail to infer the situation in the video correctly.
- In particular, all models struggled with counterfactual reasoning, and they have difficulty correctly understanding what happens before and after events in the video.
These results indicate that SOK-Bench is a useful dataset that clearly exposes the challenges that current LLMs/MLLMs must overcome.
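For reference, the sketch below shows one way the per-question-type accuracy discussed above could be computed. The field names such as "question_type" are assumptions, since the article does not specify the benchmark's data format.

```python
from collections import defaultdict

def accuracy_by_type(predictions, dataset):
    """Compute accuracy per question type.

    predictions: dict mapping question id -> the model's chosen answer.
    dataset: list of dicts with (assumed) keys "id", "question_type", "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in dataset:
        qtype = item["question_type"]  # e.g. "causal", "counterfactual"
        total[qtype] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```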
Conclusion
SOK-Bench provides an important new benchmark for the field of video reasoning. Current models still have difficulty understanding causal relationships and leveraging open-world knowledge, and further improvements are needed in future model development. In particular, we felt that tighter integration of video data and knowledge graphs is essential if multimodal AI is to reason as flexibly as humans.