[VideoAgent] Understanding Long-form Video Using A Large-scale Language Model As An Agent

Computer Vision 21/06/2024

3 main points
✔️ VideoAgent mimics the video comprehension process, emphasizing inferential capabilities over processing long visual input.
✔️ Experiments have shown excellent effectiveness and efficiency in understanding long videos by effectively retrieving and aggregating information through a multi-round iterative process.
✔️ Future work will focus on improving and integrating the model, extending it to real-time applications, applying it to a variety of application areas, and improving the user interface to further advance and broaden VideoAgent's applications.

VideoAgent: Long-form Video Understanding with Large Language Model as Agent
written by Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
(Submitted on 15 Mar 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This study introduces VideoAgent, a new agent-based system for long-form video understanding. A large language model (LLM) sits at its core and acts as the agent, identifying and compiling the key information needed to answer a question, while vision-language foundation models serve as tools for translating and retrieving visual information. VideoAgent was evaluated on the challenging EgoSchema and NExT-QA benchmarks, where it achieved 54.1% and 71.3% zero-shot accuracy while using only 8.4 and 8.2 frames on average, respectively.

Introduction

Understanding long videos requires models that can process diverse information and reason effectively over long sequences. Building a single model that excels at all of these requirements has proven difficult. Current large language models (LLMs) are good at processing long contexts but cannot process visual information on their own. Vision-language models (VLMs), on the other hand, have difficulty processing long visual input. VideoAgent mimics the human video comprehension process and emphasizes reasoning capabilities over the processing of long visual input. According to the authors, it is more effective and efficient than existing methods and represents a significant advance in long-form video understanding.

Related Research

Traditional methods for long-form video understanding rely on either compressive or selective processing of the video. Compressive methods attempt to compress the video into meaningful embeddings or representations. Selective methods attempt to subsample the video based on the input question or text.

An agent is an entity that makes decisions and performs actions. Advances in large language models (LLMs) have led to an increasing number of studies that use LLMs as agents. Such approaches have been used successfully in a variety of scenarios, such as online search and card game playing. Inspired by the way humans understand video, this study reformulates video understanding as a decision-making process.

Proposed Method

1. Initial state acquisition:

First, frames uniformly sampled from the video are presented to the LLM to familiarize it with the context of the video. Because the LLM cannot read images directly, a vision-language model (VLM) is used to translate the visual information into text descriptions. This initial state records an overview of the video's content and meaning.

2. Determination of next action:

Given the current state, the LLM determines the next action. There are two options: answer the question, or search for new information. The LLM considers the question and the existing information, reflects on its own prediction, and chooses an action based on the resulting confidence score.

3. Collection of new observations:

When new information is needed, the LLM uses tools to retrieve it. Some information is collected at the segment level to strengthen temporal reasoning. The retrieved information serves as an observation that updates the current state.

4. Current state update:

Given the new observations, the VLM generates a caption for each retrieved frame. The updated state is then passed back to the LLM, which generates its prediction for the next round.

This approach has several advantages over traditional methods. In particular, the adaptive selection strategy for gathering information locates relevant information and minimizes the costs required to answer questions of different difficulty levels.
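The four steps above can be read as a simple loop. The following is a minimal sketch of that loop, not the authors' implementation: the callables `caption_frame`, `predict`, and `retrieve` are hypothetical stand-ins for the VLM, the LLM, and the retrieval tool, and the integer confidence threshold, the number of initial frames, and the round limit are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


def uniform_indices(num_frames: int, n: int) -> List[int]:
    """Pick up to n frame indices spread evenly over [0, num_frames - 1]."""
    if n <= 1 or num_frames <= 1:
        return [0]
    n = min(n, num_frames)
    step = (num_frames - 1) / (n - 1)
    return sorted({round(i * step) for i in range(n)})


@dataclass
class State:
    """What the agent knows so far: the question plus captions of seen frames."""
    question: str
    captions: Dict[int, str] = field(default_factory=dict)  # frame index -> caption


def run_agent(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],            # hypothetical VLM wrapper
    predict: Callable[[State], Tuple[str, int]],    # hypothetical LLM wrapper: (answer, confidence)
    retrieve: Callable[[State], List[int]],         # hypothetical retrieval tool: new frame indices
    n_initial: int = 5,                             # assumed value
    max_rounds: int = 3,                            # assumed value
    confident: int = 3,                             # assumed threshold
) -> Tuple[Optional[str], List[int]]:
    # Step 1: initial state from uniformly sampled, captioned frames.
    state = State(question)
    for idx in uniform_indices(num_frames, n_initial):
        state.captions[idx] = caption_frame(idx)

    answer = None
    for _ in range(max_rounds):
        # Step 2: predict, then decide whether to answer or keep searching.
        answer, confidence = predict(state)
        if confidence >= confident:
            break
        # Steps 3 and 4: retrieve new frames and fold their captions into the state.
        for idx in retrieve(state):
            if idx not in state.captions:
                state.captions[idx] = caption_frame(idx)
    # If the round budget runs out, the last prediction is returned as-is.
    return answer, sorted(state.captions)


if __name__ == "__main__":
    # Toy stand-ins: the "LLM" becomes confident once frame 42 has been seen.
    result = run_agent(
        "What happens after the person opens the door?",
        num_frames=101,
        caption_frame=lambda i: f"caption of frame {i}",
        predict=lambda s: ("B", 3) if 42 in s.captions else ("A", 1),
        retrieve=lambda s: [42],
    )
    print(result)
```

Easy questions exit the loop in the first round with only the initial frames, while harder ones trigger extra retrieval rounds, which is how the number of frames used adapts to question difficulty.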

Experiment

Datasets and Metrics

The experiments focus primarily on zero-shot understanding and use two datasets: EgoSchema and NExT-QA. EgoSchema consists of egocentric videos with 5,000 questions. NExT-QA consists of natural videos featuring object interactions in daily life and contains 48,000 questions.

Implementation Details

All videos are decoded at 1 fps, and the most relevant frames are retrieved based on the cosine similarity between the visual description and the frame features. In the experiments, LaViLa is used as the captioner for EgoSchema and CogAgent for NExT-QA, with GPT-4 as the LLM.
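The retrieval step described here can be sketched in a few lines. This is an illustrative example rather than the paper's code: it assumes the text description and every frame have already been embedded into vectors of the same dimension (the embedding model is left out), and the `start`, `end`, and `exclude` parameters are hypothetical additions showing one way to restrict the search to a segment and skip frames that were already seen.

```python
import math
from typing import AbstractSet, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_frame(
    query_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    start: int,
    end: int,
    exclude: AbstractSet[int] = frozenset(),
) -> Optional[int]:
    """Return the frame index in [start, end) most similar to query_vec, or None."""
    best_idx, best_score = None, -math.inf
    for idx in range(max(start, 0), min(end, len(frame_vecs))):
        if idx in exclude:
            continue
        score = cosine(query_vec, frame_vecs[idx])
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


if __name__ == "__main__":
    # Toy 2-D features: at 1 fps, index i corresponds to second i of the video.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
    query = [1.0, 0.0]
    print(retrieve_frame(query, frames, 0, 4))
    print(retrieve_frame(query, frames, 1, 4))
```

Because the frame features do not depend on the query, they can be computed once per video and reused in every retrieval round.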

Comparison with state-of-the-art technology

VideoAgent achieved SOTA results on the EgoSchema and NExT-QA datasets, significantly outperforming previous methods. For example, it achieved 54.1% accuracy on the complete EgoSchema dataset and 60.2% on a subset of 500 questions.

Analysis of Iterative Frame Selection

One of the key components of VideoAgent is iterative frame selection. This process dynamically retrieves and aggregates information until enough has been collected to answer the question. To better understand this process, the authors conducted a comprehensive analysis and ablation study.

Ablation of the basic model

LLM (Large Language Model): Different LLMs were compared, and GPT-4 performed better than the other models. GPT-4 is particularly strong at structured prediction and reliably generates output in correct JSON format.
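Reliable JSON output matters because the agent has to parse each LLM reply before it can act on it. The sketch below shows what such parsing might look like. The field names `final_answer` and `confidence`, the 1 to 3 confidence range, and the default of five answer options are assumptions made for illustration and are not taken from this article.

```python
import json
from typing import Optional, Tuple


def parse_prediction(raw: str, num_options: int = 5) -> Optional[Tuple[int, int]]:
    """Return (answer_index, confidence), or None if the reply is malformed.

    The schema is hypothetical: {"final_answer": <int>, "confidence": <int 1..3>}.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    answer = data.get("final_answer")
    confidence = data.get("confidence")
    for value in (answer, confidence):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
    if not 0 <= answer < num_options or not 1 <= confidence <= 3:
        return None
    return answer, confidence


if __name__ == "__main__":
    print(parse_prediction('{"final_answer": 2, "confidence": 3}'))
    print(parse_prediction("The answer is probably C."))
```

A model that often produces replies of the second kind forces the agent to retry or guess, which is one plausible reason structured-output quality shows up in the ablation.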

VLM (Vision-Language Model): Three state-of-the-art VLMs were investigated. CogAgent and LaViLa performed similarly, while BLIP-2 performed poorly. The VLM converts image frames into descriptive captions, which are then fed to the LLM.

CLIP (Contrastive Language-Image Pre-training): Different versions of CLIP were evaluated, and the results suggest that all versions perform comparably. CLIP is well suited to retrieval tasks and is efficient because the image embeddings do not need to be recomputed for each query.

Case study

An example from NExT-QA illustrates how VideoAgent identifies the missing information, determines what additional information is needed, and uses CLIP to retrieve the details.

A second example shows VideoAgent correctly answering a question about an hour-long YouTube video. In this case, the identified frames were provided to GPT-4V, which then answered the question correctly.

Conclusion

This study introduces VideoAgent, a video understanding system that uses a large language model as an agent to retrieve and aggregate information through a multi-round iterative process, and demonstrates its effectiveness and efficiency in understanding long videos. Future work will focus on improving and integrating the model, extending it to real-time applications, applying it to a variety of application areas, and improving the user interface to further advance and broaden VideoAgent's applications.

Sasayama