
VideoPrism Opens Up The Possibilities Of Video Analytics


3 main points
✔️ VideoPrism achieves state-of-the-art performance on a wide variety of video understanding tasks
✔️ Excellent versatility confirmed in a wide range of evaluations, including scientific datasets
✔️ Practical application requires responsible algorithmic bias reduction and privacy protection

VideoPrism: A Foundational Visual Encoder for Video Understanding
written by Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong
(Submitted on 20 Feb 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Video is a vivid window into our world, documenting a wide range of experiences, from everyday moments to scientific inquiry. In this digital age, video foundation models (ViFMs) have the potential to analyze such vast amounts of information and extract new insights. While research to date has certainly made great strides in video understanding, building a truly foundational video model that skillfully handles both appearance and motion remains an open challenge.

Therefore, this paper proposes VideoPrism, an innovative general-purpose video encoder designed to tackle any task in video comprehension, from video classification to localization, retrieval, captioning, and question answering. Through extensive evaluation, including computer vision datasets and scientific disciplines such as neuroscience and ecology, VideoPrism has demonstrated state-of-the-art performance with minimal adaptation. The figure below shows an overview of VideoPrism.

In developing VideoPrism, the authors emphasize the importance of pre-training data. Ideally, pre-training data would be a representative sample of videos from all over the world, but in practice many videos lack text describing their content, or that text is very noisy. VideoPrism therefore makes the most of the available data by collecting 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.

The modeling begins with contrastive learning to align video and language semantics. Then, using video-only data, the model is further improved through masked video modeling augmented with global and local distillation and token shuffling. This unique two-stage approach is what makes VideoPrism exceptional at tasks that focus on both video appearance and motion.

The effectiveness of this approach has been demonstrated through extensive evaluation across four major comprehension task categories, spanning 33 diverse benchmarks that range from web video and scripted performances to scientific experiments. VideoPrism outperforms existing video foundation models (ViFMs) by a wide margin on 30 of these benchmarks, demonstrating its superior performance. The results are shown in the figure below.

This indicates that VideoPrism has very strong generalization capabilities.


VideoPrism employs an innovative approach to video understanding. At its core is a rich pre-training dataset containing 36M clips extracted from 36M videos with high-quality manual captions, plus 582M clips from 275M videos paired with noisy parallel text. This collection of pre-training data is unprecedented among video foundation models (ViFMs), but is still small compared to the data used for image foundation models. To fill this gap, the paper also collects additional text signals, including ASR transcripts, metadata, and noisy captions generated by large-scale multimodal models.

It is noteworthy that we do not use any training set of evaluation benchmarks in our pre-training or subsequent training. This prevents the model from being over-optimized for a particular evaluation benchmark. In addition, the pre-training corpus is de-duplicated from the evaluation benchmark videos to avoid data leakage.

In terms of model architecture, VideoPrism is based on the Vision Transformer (ViT), but with both spatial and temporal considerations. This ensures that spatial and temporal dimensions are preserved in the output token sequence to support downstream tasks that require fine-grained features. The paper experiments with two model configurations: VideoPrism-g, which employs a ViT-giant network with 1 billion parameters, and VideoPrism-B, which uses a smaller ViT-Base network.
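As a rough illustration of keeping separate time and space axes in the token sequence, the patchification step might look like the numpy sketch below. The function name, patch size, and shapes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def factorized_tokens(video, patch=4):
    """Patchify a video while keeping distinct time and space token axes.

    Downstream heads that need fine-grained features can then read off
    per-frame, per-location tokens instead of a single flat sequence.
    """
    t, h, w, c = video.shape
    gh, gw = h // patch, w // patch
    # split the height/width axes into (grid, patch) pairs
    patches = video.reshape(t, gh, patch, gw, patch, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)      # (t, gh, gw, p, p, c)
    tokens = patches.reshape(t, gh * gw, patch * patch * c)
    return tokens                                       # (time, space, features)
```

Each output row groups the tokens of one frame, so the (time, space) structure survives into the token sequence rather than being flattened away.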

VideoPrism uses a unique two-stage approach that leverages video-only data in addition to video-text pairs. Because the text paired with videos in large pre-training datasets is often noisy, VideoPrism focuses on video-only data to capture the deeper meaning of the video.

Stage 1: In this stage, contrastive learning is used to align the video encoder with the text encoder. This process helps the video encoder learn rich visual semantics, guided by language, by minimizing a symmetric cross-entropy loss over the similarity scores of video-text pairs. The model resulting from this stage provides semantic video embeddings for the next stage of learning.
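The symmetric cross-entropy objective can be sketched in a few lines of numpy. This is a minimal CLIP-style illustration under assumed shapes and a hypothetical temperature value, not the paper's implementation:

```python
import numpy as np

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over video-text similarity scores.

    Matching video-text pairs sit on the diagonal of the similarity matrix;
    the loss pulls them together and pushes mismatched pairs apart.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) cosine similarities
    n = logits.shape[0]

    def row_cross_entropy(z):
        # numerically stable log-softmax per row; true class = diagonal entry
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the video-to-text and text-to-video directions (symmetric loss)
    return 0.5 * (row_cross_entropy(logits) + row_cross_entropy(logits.T))
```

Minimizing this loss drives matched pairs to have higher similarity than any mismatched pair in the batch, which is what gives the encoder its language-grounded semantics.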

Stage 2: Learning based solely on the video-text data of Stage 1 faces the problem that textual descriptions are noisy and tend to capture appearance rather than motion. The second stage therefore focuses on learning both appearance and motion information from video-only data. Here the paper introduces a new token shuffling scheme along with global and per-token distillation losses as improvements to masked video modeling. The model learns to predict the Stage 1 embeddings from the masked video, preserving the semantic knowledge already acquired.
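The masking-plus-shuffling idea can be sketched as follows. Function names, the mask ratio, and the simple mean-pooled global target are illustrative assumptions for exposition, not the paper's exact scheme:

```python
import numpy as np

def mask_and_shuffle(tokens, mask_ratio=0.8, rng=None):
    """Keep a small random subset of tokens and shuffle their order.

    Shuffling the visible tokens prevents the student decoder from relying
    on positional shortcuts when predicting the teacher's embeddings.
    """
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_visible = max(1, int(round(n * (1 - mask_ratio))))
    visible = rng.permutation(n)[:n_visible]   # which tokens stay visible
    shuffled = rng.permutation(visible)        # and in which (shuffled) order
    return tokens[shuffled], shuffled

def distillation_targets(teacher_tokens):
    """Student targets: per-token (local) and mean-pooled (global) embeddings."""
    return teacher_tokens, teacher_tokens.mean(axis=0)
```

The student sees only the shuffled visible tokens but is trained to match the Stage 1 teacher's embeddings for every position, which forces it to infer motion and appearance from partial evidence.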

With this two-step approach, VideoPrism builds an underlying video encoder that can better understand video and capture the semantics of appearance and motion.


VideoPrism has been evaluated to demonstrate its performance and versatility in a wide range of video-centric comprehension tasks. These tasks fall into four categories: first, general video understanding, which includes classification and spatial and temporal localization; second, zero-shot video-text retrieval; third, zero-shot video captioning and question answering; and fourth, computer vision for scientific research.

In all experiments, VideoPrism was frozen as the video encoder and only the components required for a particular task were trained. This makes it possible to evaluate VideoPrism's versatility independently of any single task. In addition, because the cost of video encoding can be amortized across multiple tasks, the VideoPrism approach is especially useful in video analysis settings where expensive fine-tuning is impractical.
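The frozen-encoder protocol can be sketched as below: the encoder's weights never receive gradient updates, and only a small task head is trained. Everything here (the stand-in encoder, the linear probe, the learning rate) is a toy illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(video):
    """Stand-in for a pre-trained VideoPrism encoder; weights never updated."""
    w = np.full((video.shape[-1], 8), 0.1)    # fixed, untrained projection
    return np.tanh(video @ w).mean(axis=0)    # pooled clip embedding

class LinearProbe:
    """Only this small task head is trained; the encoder stays frozen."""
    def __init__(self, dim, n_classes):
        self.w = rng.normal(scale=0.01, size=(dim, n_classes))

    def logits(self, emb):
        return emb @ self.w

    def sgd_step(self, emb, label, lr=0.1):
        p = np.exp(self.logits(emb))
        p /= p.sum()
        p[label] -= 1.0                        # softmax cross-entropy gradient
        self.w -= lr * np.outer(emb, p)        # update the probe only
```

Because the (expensive) encoding pass is identical for every downstream task, its cost is paid once per clip and shared across all the heads trained on top of it.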

VideoPrism is first compared to state-of-the-art models on VideoGLUE, a benchmark for video understanding. This evaluation uses eight representative datasets spanning appearance-focused action recognition (VC(A)), motion-rich action recognition (VC(M)), multi-label video classification (VC(ML)), temporal action localization (TAL), and spatiotemporal action localization (STAL).

VideoPrism achieves noticeable performance gains as the model size increases from ViT-B to ViT-g. It has been shown that VideoPrism consistently achieves improvements across a wide range of video comprehension tasks. This means that VideoPrism combines appearance and motion cues, spatial and temporal information, and robustness to different video sources such as web video and scripted performance in a single encoder.

We then evaluate VideoPrism's zero-shot video-text retrieval performance using three key benchmarks: MSRVTT, VATEX, and ActivityNet. For the zero-shot video classification task, we also tackle Kinetics-400, Charades, SSv2-Temporal, SSv2-Events, and the ATP-Hard subset of NExT-QA.

As a key result, VideoPrism sets new best records in many benchmarks and achieves significant improvements, especially on challenging data sets; VideoPrism-B outperforms existing larger models. Furthermore, VideoPrism performs equally well or better when compared to models pre-trained with in-domain data and additional modalities. These results demonstrate that VideoPrism has strong generalization capabilities in zero-shot search and classification tasks.
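Zero-shot retrieval falls directly out of the shared embedding space: no task-specific training is needed, only a nearest-neighbor lookup. A minimal numpy sketch, with illustrative names and shapes:

```python
import numpy as np

def zero_shot_retrieve(text_query_emb, video_embs, top_k=5):
    """Rank gallery videos by cosine similarity to a text query embedding."""
    q = text_query_emb / np.linalg.norm(text_query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                         # cosine similarity per gallery video
    order = np.argsort(-scores)[:top_k]    # indices of the best matches first
    return order, scores[order]
```

Because both encoders were aligned contrastively during pre-training, a well-matched video scores higher than unrelated ones even though the retrieval step itself involves no learning.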

We also evaluate performance on video captioning and QA tasks in zero-shot settings, using standard video captioning datasets, including MSRVTT, VATEX, and YouCook2, and video QA benchmarks, including MSRVTT-QA, MSVD-QA, and NExT-QA. Note that the models are not specifically tuned for the captioning and QA tasks.

The results are shown in the table below. Despite its simple architecture and limited number of adapter parameters, VideoPrism is competitive and achieves top results in most evaluations except VATEX. This suggests that VideoPrism's encoder has extensive generalization capabilities for video-to-language generation tasks.

While existing video analytics benchmarks focus primarily on human-centric data, we explore VideoPrism's capabilities and its potential for scientific applications on a broad range of videos using scientific datasets. The analysis encompasses a wide range of disciplines, including behavioral science, behavioral neuroscience, cognitive science, and ecology. This study is the first application of ViFMs to scientific datasets, and shows that VideoPrism performs as well as or better than specialized models.

The analysis includes large video datasets annotated with expertise captured in scientific experiments, including flies, mice, chimpanzees, and Kenyan wildlife. All of these are annotated in detail for video classification of behavior or spatial and temporal action localization. In particular, the CRIM13 dataset analyzes videos from the lateral and upper viewpoints of the cage.

Results show that using a shared frozen encoder achieves performance equal to or better than domain-specific models dedicated to individual tasks. VideoPrism consistently performs well, matching or exceeding expert models even at the base scale. Furthermore, scaling the model to larger sizes improves performance across all datasets. These results indicate that ViFMs have the potential to significantly accelerate video analytics in a wide variety of areas.


VideoPrism, presented in this paper, is a foundational video encoder that enables state-of-the-art technology in the field of video understanding. It focuses on a data and modeling approach by building a uniquely large pre-training dataset and a pre-training strategy that effectively extracts appearance and motion information from video. It achieves the best performance in a wide range of benchmark tests and shows very high generalization capabilities compared to other models.

Technological advances in video understanding have the potential to accelerate developments in fields ranging from scientific research to education, robotics, healthcare, and content recommendation. These technologies are expected to facilitate scientific discovery, enrich the learning experience, increase security and safety, and enable more responsive interactive systems.

However, before using these models in the real world, it is also important to take steps to prevent potential bias and misuse. We must reduce algorithmic bias, protect privacy, and adhere to responsible research norms. The paper suggests that it will be important to continue to promote open discussion within the community about these new developments in order to reap the benefits that this technology can bring in a responsible manner.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us