A Multimodal Model Is Now Available That Enables Prediction Of Viewer Behavior From Video!
3 main points
✔️ Created The Content Behavior Corpus (CBC), a dataset consisting of content and corresponding receiver behaviors
✔️ Proposed Large Content and Behavior Models (LCBM), a large-scale multimodal model that learns with behavior tokens
✔️ LCBM performed as well as or better than GPT-3.5 and GPT-4 on a variety of tasks
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
written by Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy
(Submitted on 1 Sep 2023 (v1), last revised 8 Sep 2023 (this version, v2))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In 1949, a scholar named Shannon published a paper on information theory in which he stated that communication can be divided into three levels:
- Level A - Technical problem: How accurately can you convey the symbols of communication?
- Level B - Semantic problem: How accurately do the communicated symbols convey the desired meaning?
- Level C - Effectiveness problem: How effectively and desirably does the received meaning influence behavior?
While the development of telecommunications technology, such as the Internet, has driven major advances on Level A, and large language models (LLMs) have in recent years made significant progress toward Level B, Level C has remained largely untouched.
The Level C problem is to predict the desired recipient behavior and optimize communication accordingly; although LLMs demonstrate broad generalization capabilities across many tasks, this problem has proven more difficult to solve.
The authors point out that one reason for this is that LLM training does not include "behavior tokens", signals of recipient behavior in communication such as the number of shares, likes, clicks, purchases, and retweets.
In this article, we discuss a paper that proposes Large Content and Behavior Models (LCBM), a large-scale multimodal model that expands the inference range of LLMs from content⇨content to content⇨behavior by creating The Content Behavior Corpus (CBC), a dataset consisting of content and corresponding receiver behaviors, and training on it with behavior tokens.
The Content Behavior Corpus (CBC)
Since most publicly available corpora strip receiver behavior from content, the authors created The Content Behavior Corpus (CBC), a dataset consisting of content and the corresponding receiver behavior, so that content and behavior can be modeled in a text-to-text fashion.
The dataset is built from YouTube, a large public source of content and behavior data. Each example consists of (a) the channel name, channel description, and subscriber count, (b) the video together with its creator-provided title and description, and (c) receiver behavior in the form of like counts, view counts, user comments, and the replay graph.
With this structure, the dataset covers all five elements of communication: communicator, message, channel, receiver, and effect, as shown in the figure below.
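To make this structure concrete, one CBC example could be represented roughly as follows. This is a minimal sketch with illustrative field names of our own choosing, not the schema actually released with the dataset.

```python
from dataclasses import dataclass, field

@dataclass
class CBCRecord:
    """Hypothetical representation of one Content Behavior Corpus example.
    Field names are illustrative, not the paper's released schema."""
    # Communicator: the channel that published the video
    channel_name: str
    channel_description: str
    subscriber_count: int
    # Message / Channel: the video itself and its creator-provided metadata
    video_path: str
    title: str
    description: str
    upload_date: str
    # Receiver / Effect: observed audience behavior
    view_count: int
    like_count: int
    comments: list[str] = field(default_factory=list)
    # Per-scene replay values (audience retention) used as behavior signals
    replay_graph: list[float] = field(default_factory=list)
```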
Large Content Behavior Model (LCBM)
Next, we describe the Large Content Behavior Model (LCBM), a large-scale multimodal model proposed in this paper.
The overall picture of LCBM is shown in the figure below.
This paper takes an approach similar to recent models such as BLIP, LLaVA, and VideoLLaMA to understand both visual and textual content, using a visual encoder (EVA-CLIP) to encode images and an LLM (Llama) to encode text.
In addition, this method can handle video content by encoding video frames using EVA-CLIP, Uniformer, and GMHRA.
Next, to effectively exploit the LLM's rich linguistic representations, a linear layer is added after BLIP-2's Q-Former to convert the visual tokens into linguistic tokens, producing visual content embeddings.
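Conceptually, this bridging step can be sketched as follows: learned queries summarize the frozen visual encoder's features, and a linear layer projects the result into the LLM's embedding space so that it can be concatenated with the text token embeddings. The PyTorch sketch below is a simplified illustration under assumed dimensions (e.g., 1408 for EVA-CLIP features, 5120 for Vicuna-13B hidden states) and with a single cross-attention layer standing in for the full Q-Former; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualToLLMBridge(nn.Module):
    """Sketch of the Q-Former + linear projection bridge (dimensions are assumptions)."""
    def __init__(self, vis_dim=1408, qformer_dim=768, llm_dim=5120, num_queries=32):
        super().__init__()
        # Learned query tokens that summarize the visual features (stand-in for the Q-Former)
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, qformer_dim)
        # Linear layer that maps the summarized visual tokens into the LLM's embedding space
        self.to_llm = nn.Linear(qformer_dim, llm_dim)

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (B, num_patches, vis_dim) from the frozen visual encoder (e.g., EVA-CLIP)
        # text_embeds:  (B, seq_len, llm_dim) token embeddings from the LLM (e.g., Vicuna-13B)
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        kv = self.vis_proj(visual_feats)
        summarized, _ = self.cross_attn(q, kv, kv)        # (B, num_queries, qformer_dim)
        visual_content_embeds = self.to_llm(summarized)   # (B, num_queries, llm_dim)
        # Prepend the visual content embeddings to the text embeddings fed into the LLM
        return torch.cat([visual_content_embeds, text_embeds], dim=1)
```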
LCBM is built on the Llama-based Vicuna-13B LLM and, like previous studies, is trained with a two-stage learning paradigm.
In the first stage, datasets such as WebVid, COCO Captions, Visual Genome, CC3M, and CC12M are used to align the embeddings of the visual encoder with the LLM; in the second stage, the model is fine-tuned with Behavior Instruction Fine-Tuning (BFT).
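In BFT, the observed receiver behavior is verbalized as plain text so that the LLM can be instruction-tuned on behavior tokens just like ordinary language tokens. The function below sketches how such a training pair might be assembled from a record like the CBCRecord sketch above; the prompt and response wording is our own assumption, not the paper's exact template.

```python
def build_bft_sample(record) -> dict:
    """Assemble a hypothetical instruction-tuning pair from a CBC-style record.
    The prompt/response wording is illustrative, not the paper's template."""
    prompt = (
        f"Channel: {record.channel_name} ({record.subscriber_count} subscribers)\n"
        f"Title: {record.title}\n"
        f"Description: {record.description}\n"
        f"Upload date: {record.upload_date}\n"
        "Predict the viewer behavior for this video."
    )
    # Behavior tokens: the receiver behavior verbalized as plain text
    response = (
        f"Views: {record.view_count}, Likes: {record.like_count}, "
        f"Scene replay values: {record.replay_graph}"
    )
    return {"prompt": prompt, "response": response}
```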
Content Behavior Test Benchmark
To demonstrate the effectiveness of the proposed method, four types of tasks were designed in this paper, as shown in the figure below.
Each task is described below.
- Behavior Simulation: Predict viewer behavior given video content, title, scene-by-scene description, channel and subscriber count, and post date
- Content Simulation: Given scene-by-scene descriptions, channel information, and viewer behavior, predict the video content.
- Content Understanding: Based on existing research, perform tasks that verify understanding of the content, such as topic classification, emotion classification, and classification of reasons for actions.
- Behavior Understanding: Have the model explain the viewers' behavior toward the content.
For each task, five different models were compared: LCBM, GPT-3.5, GPT-4, Vicuna-13B, and VideoChat.
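As an illustration of how a behavior-simulation output could be scored, the sketch below compares predicted replay values and like counts with the ground truth using RMSE and a tolerance check. The metric choices and the tolerance value are our own assumptions for illustration, not necessarily the paper's exact evaluation protocol.

```python
import math

def score_behavior_prediction(pred_replay, true_replay, pred_likes, true_likes, tol=0.1):
    """Hypothetical scoring of a behavior-simulation output:
    RMSE over per-scene replay values and a tolerance-based check on the like count."""
    assert len(pred_replay) == len(true_replay)
    rmse = math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred_replay, true_replay)) / len(true_replay)
    )
    # Count the like prediction as correct if it falls within +/- tol of the true value
    likes_hit = abs(pred_likes - true_likes) <= tol * true_likes
    return {"replay_rmse": rmse, "likes_within_tolerance": likes_hit}
```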
Behavior Simulation
The results of the Behavior Simulation experiment are shown in the figure below. (Green = best score, Blue = second best score)
It is noteworthy that LCBM is less than one-tenth the size of the other models, yet it achieves the best score, demonstrating that it can predict viewer behavior appropriately.
Content Simulation
The results of the Content Simulation experiment are shown in the figure below.
LCBM achieved the best score on this task as well, and was found to perform better than existing models in content prediction.
Content Understanding
The results of the Content Understanding experiment are shown in the figure below.
In this task, GPT-3.5 performed best, while LCBM achieved the second-best performance on most evaluation metrics.
Behavior Understanding
The results of the Behavior Understanding experiment are shown in the figure below.
LCBM was found to perform best in this task as well.
The results of these experiments demonstrate that LCBM performs as well as or better on all tasks despite being roughly one-tenth the size of GPT-3.5 and GPT-4.
From these results, it can be inferred that the training corpora of large models such as GPT-3.5 and GPT-4 do not include behavior tokens, and the experiments demonstrate the effectiveness of training LLMs with behavior tokens.
In addition, the figure below shows several examples of LCBM's ability to understand and explain viewer behavior observed in this experiment.
Compared to existing models such as Vicuna and GPT-3.5, LCBM was able to understand viewer behavior appropriately, again confirming the effectiveness of this method.
Summary
How was it? In this article, we discussed a paper proposing Large Content and Behavior Models (LCBM), a large-scale multimodal model that expands the inference range of LLMs from content⇨content to content⇨behavior by creating The Content Behavior Corpus (CBC), a dataset consisting of content and corresponding receiver behaviors, and training on it with behavior tokens.
This paper is the first to show that using behavior tokens, that is, the receiver behaviors that were previously removed when preprocessing data for LLM training, is effective for tasks such as content simulation and behavior simulation.
In addition, the authors have created a dataset that can be used in future research, and we anticipate that this work will inspire a variety of applied studies, so we will be keeping a close eye on future developments.
Those interested in the details of the dataset and the model architecture are encouraged to refer to the original paper.