
Vript-Hard, A New Benchmark For Testing Comprehension Of Long-form Video
3 main points
✔️ Proposed "Vript," a video-text dataset built with high-resolution video and detailed captions
✔️ State-of-the-art video captioning model "Vriptor" developed
✔️ Proposed "Vript-Hard," a new benchmark for evaluating hallucinations and assessing comprehension of long-form video
Vript: A Video Is Worth Thousands of Words
written by Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao
(Submitted on 10 Jun 2024)
Comments: submitted to NeurIPS Dataset & Benchmark track.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Recent advances in multimodal learning have spurred research into models for understanding and generating video, and with it a surge in demand for high-quality datasets built from high-resolution video and detailed captions. However, because of the added element of time, video cannot be paired with text as easily as images: preparing video-text pairs is a much more difficult task. For example, a travel video blog (vlog) includes many scenes of travel preparations, accommodations, and visits to tourist attractions. Providing detailed and appropriate captions for such videos requires an enormous amount of time and effort to review the footage and caption it scene by scene. As a result, most conventional video-text datasets consist of short videos with simple captions.
To overcome this challenge, the paper builds Vript, a video-text dataset that covers longer videos with more detailed captions than previously possible. Vript contains over 12,000 high-resolution videos annotated by GPT-4V. Vript's annotations follow the format of a video script and describe, for each scene, not only the content but also the shot type and camera movement. Unlike traditional datasets, this paper densely annotates untrimmed videos, with a long caption of about 145 words per scene. In addition to the visual information, the narration is transcribed into text and supplied as background information along with the video title, improving both the informativeness and the accuracy of the captions.
Existing research shows that detailed captions can help improve the association between visual and verbal information. However, as noted above, most datasets have only short captions rather than dense, detailed annotations. Therefore, this paper employs three novel approaches to improve the association between video and text:
- Use of video scripts: sample multiple consecutive scenes to form a long video and concatenate the corresponding captions into one long text
- Transcription of narration: feed the transcribed narration into the model together with the video
- Use of video timestamps: add narration and video timestamps as additional information
Based on these approaches, the paper also builds a video captioning model called "Vriptor," which can generate dense and detailed captions for both short and long videos, achieving SOTA performance among open-source models.
The paper also proposes a video comprehension benchmark, Vript-Hard, which consists of three tasks (hallucination evaluation, reasoning over long videos, and reordering of events) that are more challenging than traditional benchmarks.
Video Caption Detailing
In building Vript, the goal is to annotate the videos in as much detail as possible, so that the videos can be visualized through their captions alone. For each scene in the video, the captions describe detailed actions and interactions rather than giving a coarse summary. A variety of information can be captured here, including the appearance of objects and characters, the environment, lighting, and the style of the video.
The annotations also focus on how the camera moves to capture the footage. In previous research, video captions were treated like image captions, so camera movement, which is unique to video, went unused. For example, for a clip of a man riding a bicycle, it is not enough to simply state that "a man in a dark blue shirt is riding a black bicycle down the road." A more specific description, such as "as the camera pans to a close-up shot, a man in a dark blue shirt is riding a black bicycle; as the camera zooms out, we see a general view of the man moving along the road with mountains in the background," is more useful. Thus, to improve the quality of the video captions, information about camera behavior is added alongside the content.
Combining the static situation with information about camera behavior is, in effect, like describing a scene in a video script. Vript therefore follows the format of a video script: each video is divided into scenes with PySceneDetect, and each scene is annotated with captions covering the static situation and the camera behavior. The paper calls this "video scripting." For the video data, 10,000 full-length YouTube videos were selected from HD-VILA-100M, and 1,500 short videos were collected from YouTube Shorts and TikTok. GPT-4V is then used to annotate each scene with the following information.
- Title: Summary of the scene (10 words or less)
- Content: Detailed description of approximately 150 words
- Shot type: Full view, close-up, etc.
- Camera movement: panning, zooming, etc.
To give the videos high-quality scripts, the untrimmed videos (ranging in length from 5 seconds to 2.9 hours) are densely annotated from start to finish. In addition to the video frames, external information is fed to the model to aid annotation: the narration transcribed with Whisper and the video title. This external information greatly reduces hallucinations and improves captioning accuracy, allowing the model to understand the content of the video beyond the purely visual information.
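As a rough illustration of this annotation pipeline, the sketch below splits a video into scenes with PySceneDetect and transcribes the narration with Whisper, then pairs each scene with the narration segments that overlap it. The file name, Whisper model size, and pairing logic are assumptions for illustration; the paper's exact pipeline and GPT-4V prompting are not reproduced here.

```python
from scenedetect import detect, ContentDetector
import whisper

# Split the untrimmed video into scenes (content-based cut detection).
scenes = detect("video.mp4", ContentDetector())

# Transcribe the narration so it can be supplied as auxiliary context.
asr = whisper.load_model("base")          # model size is an assumption
transcript = asr.transcribe("video.mp4")

for i, (start, end) in enumerate(scenes, 1):
    # Collect narration segments that overlap this scene; together with the
    # video title, these would accompany the frames in the GPT-4V prompt that
    # asks for title, content, shot type, and camera movement.
    narration = " ".join(
        seg["text"].strip()
        for seg in transcript["segments"]
        if seg["start"] < end.get_seconds() and seg["end"] > start.get_seconds()
    )
    print(f"Scene-{i:03d} [{start.get_timecode()} -> {end.get_timecode()}]: {narration[:80]}")
```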
For example, as shown in the figure below, one cannot guess what ingredients are added to the bowl simply by looking at the Scene-010 frames. With the information from the narration, we know that the ingredients are mayonnaise and mustard, which improves the accuracy of the caption shown in the upper-right panel.
Vriptor
Generally, when mapping video to text, one caption is mapped to one video. However, existing video-text datasets (Panda-70M, WebVid-10M, etc.) have only simple captions and lack detailed information, so they are not sufficient for relating video and text. To solve this problem, this paper studies how to map videos to more text using the Vript dataset. The result is Vriptor, a powerful video captioning model that achieves SOTA performance among open-source video large language models.
If a video is annotated in detail, there are several possible ways to increase the amount of paired text; one is to concatenate the captions of several consecutive clips. However, because those captions are annotated separately, the context and meaning of the concatenated text may not be consistent. Therefore, consecutive captions are reconstructed as scenes of a video script, following the video-script format. This keeps the caption of each scene highly detailed while keeping the background and context consistent, ensuring uniformity of meaning. Vript thus allows "subscripts" to be created by sampling a number of consecutive clips: for example, 10 consecutive clips and their corresponding subscript contain about 1,500 words, roughly 100 times longer than conventional short captions.
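As a minimal sketch of how such a "subscript" might be assembled, the snippet below samples a run of consecutive scene annotations and concatenates them in video-script form. The field names follow the annotation schema described above but are otherwise assumptions; the paper's exact reconstruction step is not shown.

```python
import random

def build_subscript(scene_annotations, num_clips=10, seed=0):
    """Sample `num_clips` consecutive scenes and join their captions into one
    script-style long text (a 'subscript'). Field names are illustrative."""
    rng = random.Random(seed)
    start = rng.randint(0, max(0, len(scene_annotations) - num_clips))
    window = scene_annotations[start:start + num_clips]
    parts = []
    for idx, scene in enumerate(window, start=start + 1):
        parts.append(
            f"Scene-{idx:03d}: {scene['title']}\n"
            f"Shot type: {scene['shot_type']} | Camera movement: {scene['camera_movement']}\n"
            f"{scene['content']}"
        )
    return "\n\n".join(parts)
```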
Transcribed narration is also added: Vript is annotated by feeding the narration and the video frames together, so the captions contain information drawn from the narration.
Common video large language models sample multiple frames from the video input, but they are weakly time-aware: they know only the order of the frames, not how long each lasts. This paper therefore treats timestamps as important for video scripting and adds video timestamps to the input narration and the output captions. This makes it easier for the model to understand when each scene starts and ends.
These methods are integrated to train Vriptor. As shown in the figure below, four input-output combinations are used: one scene to one caption, one scene plus narration to one caption, multiple scenes to one script, and multiple scenes plus narration to one script. Timestamps are added to all of them.
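A hedged sketch of how timestamps might be woven into the input and output text for these combinations is shown below; the bracketed tag format and field names are assumptions, not the paper's exact template.

```python
def format_sample(scenes, narration_segments=None):
    """Build one (input, target) pair: optional timestamped narration as extra
    input context, and timestamped scene captions as the target script.
    The timestamp format is illustrative only."""
    input_lines = []
    if narration_segments:
        for seg in narration_segments:
            input_lines.append(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] ASR: {seg['text']}")
    target_lines = [
        f"[{sc['start']:.1f}s - {sc['end']:.1f}s] {sc['caption']}" for sc in scenes
    ]
    return {"input": "\n".join(input_lines), "target": "\n".join(target_lines)}
```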
Vriptor is trained in two stages on top of ST-LLM, and its captioning ability is evaluated on Vript-HAL and MSR-VTT. Vriptor supports two modes of captioning: describing the entire video and describing each scene. When describing the entire video, Vriptor provides a general description of 100-150 words; when describing scene by scene, it provides a detailed description of 100-150 words per scene. The table below shows that the scene-by-scene mode describes the video in more detail than the whole-video mode, resulting in improved recall.
The figure below illustrates Vriptor's ability to caption long videos with long text.
The last two rows of the table below (shown again) show that adding narration allows the model to provide a more detailed and accurate description. In addition, the percentage of proper nouns in the captions increases by 14%, suggesting that the model can infer the name of an object from the narration as well as from its appearance.
To further validate the effect of timestamps, another model was trained without them and the two models were compared. The improvement was slight for whole-video descriptions but significant for scene-by-scene descriptions. The model with timestamps is less likely to repeat descriptions from previous scenes because it understands the start and end of each scene and can identify which scene corresponds to which time period. In addition, the model with timestamps showed 12% higher recall on Vript-HAL, indicating that the model without timestamps is more likely to leave parts of the video undescribed.
Vript-Hard
As the performance of multimodal models improves, more sophisticated benchmarks are needed to evaluate their capabilities. Therefore, this paper proposes a new video comprehension benchmark called "Vript-Hard". This benchmark consists of three challenging tasks: HAL (Hallucination Evaluation), RR (Retrieval then Reasoning), and ERO (Event Re-ordering).
The paper evaluates image large language models such as BLIP2, InstructBLIP, Qwen-VL, and LLaVA 1.6 34B, and video large language models such as VideoChatGPT, VideoLLaMA, VideoChat, VideoChat2, and ST-LLM. Closed-source models such as Claude 3 Sonnet, Claude 3 Opus, and GPT-4V are also evaluated.
Vript-HAL: Benchmark for Hallucination Assessment
Past research has investigated methods for detecting and evaluating hallucinations in image large language models, but hallucination problems have been reported in video large language models as well. For example, when a video model is asked to describe a video, it may misinterpret objects and actions and generate descriptions containing hallucinations. Existing captioning benchmarks (e.g., MSR-VTT and MSVD) are inadequate for assessing hallucinations because they only have short captions of 10 words or less and lack detail.
To address this challenge, the paper builds a benchmark called Vript-HAL, in which each video is annotated with a caption of approximately 250 words, 25 times longer than in MSR-VTT. With this detailed ground-truth caption, it is possible to check whether a video large language model produces hallucinations in its captions.
Traditional evaluation metrics (BLEU, ROUGE, CIDEr, etc.) measure word similarity between the predicted text and the ground-truth text, but they are not suited to evaluating whether objects and actions are described correctly. This paper therefore uses a precision score to evaluate whether the nouns (objects) and verbs (actions) in a caption are correctly described.
Because caption length and level of detail vary from model to model, recall is also introduced to measure how many of the objects and actions in the ground truth are covered. An F1 score combines the two as an overall score for the hallucination evaluation.
Specifically, precision is computed from the number of correctly described objects and actions among those in the prediction, and recall from how many of the ground-truth objects and actions are covered. For this evaluation, SpaCy is used to extract nouns and verbs, and word embeddings are created with Sentence Transformer; if the cosine similarity between a predicted term and a ground-truth term exceeds a threshold, the prediction is considered correct.
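A minimal sketch of this metric is given below, assuming spaCy's small English model and an off-the-shelf Sentence Transformer; the similarity threshold and embedding model are assumptions, not the paper's reported settings.

```python
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model is an assumption

def extract_terms(text):
    """Nouns (objects) and verbs (actions) from a caption."""
    doc = nlp(text)
    return [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN", "VERB")]

def hal_scores(prediction, ground_truth, threshold=0.5):
    """Precision: share of predicted terms matched in the ground truth.
    Recall: share of ground-truth terms covered by the prediction.
    A term counts as matched if its best cosine similarity exceeds the threshold."""
    pred_terms, gt_terms = extract_terms(prediction), extract_terms(ground_truth)
    if not pred_terms or not gt_terms:
        return 0.0, 0.0, 0.0
    sim = util.cos_sim(encoder.encode(pred_terms), encoder.encode(gt_terms))
    precision = float((sim.max(dim=1).values > threshold).float().mean())
    recall = float((sim.max(dim=0).values > threshold).float().mean())
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```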
Vript-HAL is used to evaluate many image and video large language models. The figure below shows that some models (e.g., BLIP2 and VideoChat2) appear less prone to hallucination simply because they generate short captions with few details. Vriptor-W (whole video) provides a general description, while Vriptor-S (per scene) describes many details of the video and shows high recall. Both are comparable to GPT-4V in video captioning.
Vript-RR: Benchmark for Understanding Long-Form Video
Asking a question about the details of a long-form video can lead to ambiguity, for example when multiple answers fit the question at different timestamps or when the answer changes over time. This ambiguity is a common problem in long-form video comprehension benchmarks (e.g., EgoSchema). This paper therefore proposes a new benchmark, Vript-RR (Retrieval then Reasoning), to address it.
In this benchmark, the model is first given a hint to help it locate the scene in the video relevant to the question; the hint is a detailed description of that scene. The question is then asked about that scene, which eliminates the ambiguity. In practice, as shown in the figure below, the hint and the question are fed in together with the entire video, and the model outputs the answer directly, end to end.
The hints are carefully crafted so that the model cannot trivially read off the answer: each Vript-RR question is designed to require at least one reasoning step or additional processing (e.g., reading text in the frame, inspecting details) in order to probe the different abilities of video large language models.
Vript-RR consists of two subtasks that differ in the video input: one takes the entire video, the other takes the relevant scene directly. Both multiple-choice and open-ended questions are provided. For the open-ended answers, GPT-4 Turbo is used as a judge that compares the prediction with the ground truth to decide whether the answer is correct.
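As an illustration only, the judging step could look roughly like the sketch below; the prompt wording and decision format are assumptions rather than the paper's exact template, and it assumes an `OPENAI_API_KEY` is configured.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "You are grading an open-ended answer about a video.\n"
    "Question: {question}\n"
    "Ground-truth answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with exactly one word: correct or incorrect."
)  # wording is illustrative, not the paper's exact prompt

def judge_open_ended(question, reference, prediction):
    """Ask GPT-4 Turbo whether the predicted answer matches the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, reference=reference, prediction=prediction)}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```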
As shown in the table below, the "Scene" column reports results using the relevant scene as input, an easier task because the model does not have to search the entire video for the related scene. The "Whole" column uses the entire video as input and requires the model to use the hint to find the relevant scene, which demands the ability to understand full-length videos. Closed-source models such as GPT-4V and Claude 3 perform better than open-source models.
For each video in Vript-RR, questions are drawn from scenes located at 15%, 40%, 60%, and 85% of the way through the video. This makes it possible to investigate whether the temporal position of the scene affects the results. Here the model must find the "needle" (the relevant scene) through visual tokens rather than text tokens. In text-based needle-in-a-haystack tasks, model performance degrades significantly when the needle lies between 15% and 85% of a long context, especially once the text exceeds 16K tokens. As shown in the figure below, although the number of visual tokens is far smaller than 16K, most models still degrade when the scene is located in the middle of the visual tokens (at 40% and 60% of the video).
Vript-ERO: Benchmark for Temporal Understanding of Long-form Video
Several benchmarks test the temporal comprehension of models, but they focus on the temporal ordering of actions within short clips; few explore the temporal comprehension of events in long videos. To fill this gap, this paper proposes the Vript-ERO (Event Re-ordering) task. Three scenes (averaging 10 seconds) with different timestamps are sampled from a long video (ranging from 2 minutes to 2 hours) and their chronological order is shuffled. Given detailed descriptions of the long video and the three shuffled scenes, as shown in the figure below, the model must answer with the correct temporal order of the scenes based on its understanding of the entire video.
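A hedged sketch of how such an ERO question could be constructed from scene annotations is shown below; the sampling scheme and answer encoding are assumptions for illustration, not the paper's exact construction.

```python
import random

def make_ero_question(scene_annotations, k=3, seed=None):
    """Sample k scenes at distinct timestamps and shuffle their presentation order.
    The gold answer lists, in chronological order, the positions of those scenes
    in the shuffled presentation (e.g. [2, 0, 1])."""
    rng = random.Random(seed)
    chrono = sorted(rng.sample(range(len(scene_annotations)), k))   # time order
    presented = rng.sample(chrono, k)                               # shuffled order
    gold_order = [presented.index(i) for i in chrono]
    return {
        "shuffled_scenes": [scene_annotations[i] for i in presented],
        "gold_order": gold_order,
    }
```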
In the table below (shown again), "-" indicates that the model could not provide an answer. Unlike the previous tasks, Vript-ERO also includes long descriptions of the scenes, and the results suggest that the models are weak at processing long instructions. Even the models that score answer the correct order of all three scenes on only about 20% of the questions.
The figure below collects the incorrectly answered questions and analyzes the reasons for the errors. The models are easily misled by the provided descriptions: in 31.4% of cases the error stems from events missing from the input frames, due to the limited number of images a model like GPT-4V can take as input, and in another 25.1% of cases the model fails to recognize which scenes should be ordered based on the description.
Summary
Recent advances in multimodal learning have led to an increasing focus on models for understanding and generating video, and with it a surge in demand for high-quality video-text datasets with high-resolution video and detailed captions. However, because of the added temporal dimension, video-text pairs are harder to acquire and annotate than image-text pairs. For example, a travel vlog includes many events, each made up of different scenes such as preparing for the trip or visiting a destination, and captioning such a video requires a great deal of time and effort to review the entire video and annotate its details. As a result, conventional video-text datasets often contain only short, coarse descriptions.
To overcome these challenges, this paper proposes Vript, a high-quality video-text dataset with dense and detailed captions. On top of it, the authors build Vriptor, a top-performing open-source video captioning model, and Vript-Hard, a challenging benchmark for evaluating the hallucination and long-form video comprehension of video large language models. Vriptor is adept at generating dense captions for both short and long videos, achieving state-of-the-art performance among open-source models.
This study proposes a new way to deepen the correspondence between video and text, which not only improves the performance of video captioning models but also provides a new benchmark for evaluating their comprehension capabilities. It is expected to contribute to future research and practical applications.