CelebV-Text: A Dataset for Generating Human Face Videos From Text
3 main points
✔️ Proposed CelebV-Text, the first large-scale facial text-video dataset
✔️ Constructed a new benchmark to facilitate standardization of the task of generating videos of faces from text
✔️ Conducted comprehensive statistical analysis to examine the quality and diversity of video and text, and text-video associations
CelebV-Text: A Large-Scale Facial Text-Video Dataset
written by Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu
(Submitted on 26 Mar 2023)
Comments: Accepted by CVPR2023. Project Page: this https URL.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In recent years, generative models have received a great deal of attention for generating and editing videos from text. However, generating videos of human faces remains challenging due to the lack of appropriate datasets. In particular, the generated video frames tend to be of low quality and only weakly relevant to the input text. This paper addresses these issues by developing CelebV-Text, a large-scale, high-quality dataset of text-video pairs for generating videos of human faces from text.
CelebV-Text is a dataset of 70,000 diverse facial video clips, each paired with 20 textual descriptions. These text descriptions were generated using semi-automatic text generation and contain detailed information about static and dynamic attributes. A comprehensive statistical analysis of the videos, the texts, and the text-video relevance has been performed in comparison with other datasets, and the usefulness of the dataset has been demonstrated through extensive experiments.
We have designed a comprehensive data building pipeline that includes data collection, data annotation, and semi-automatic text generation, and we also propose a new benchmark for text-video generation. In addition, we evaluate the results on a representative model, resulting in improved association between the generated facial videos and the text, and significant improvements in temporal consistency.
Data Set Construction
To build a large-scale, high-quality facial text-video dataset, we propose an efficient pipeline that includes data collection and processing, data annotation, and semi-automatic text generation.
First, data collection follows a method similar to CelebV-HQ. Queries are generated from person names, movie titles, vlogs, and so on, and videos containing dynamic state changes over time and rich facial attributes are retrieved. These videos are downloaded from online resources, and we exclude videos that are low resolution (<512x512), short (<5 seconds), or already included in CelebV-HQ.
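As a rough illustration of this filtering step, the sketch below checks resolution and duration of locally downloaded clips with OpenCV. The thresholds follow the description above, but the directory name and function are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of the resolution/duration filtering described above.
# Assumes locally downloaded .mp4 files and OpenCV; not the authors' real pipeline.
import os
import cv2

MIN_SIDE = 512        # reject clips smaller than 512x512
MIN_DURATION_S = 5.0  # reject clips shorter than 5 seconds

def keep_video(path: str) -> bool:
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        return False
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = cap.get(cv2.CAP_PROP_FPS) or 0.0
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    duration = frames / fps if fps > 0 else 0.0
    return min(width, height) >= MIN_SIDE and duration >= MIN_DURATION_S

candidates = [f for f in os.listdir("raw_videos") if f.endswith(".mp4")]
kept = [f for f in candidates if keep_video(os.path.join("raw_videos", f))]
print(f"kept {len(kept)} / {len(candidates)} clips")
```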
In addition, annotation is an important process that has a significant impact on the relevance between text and video in CelebV-Text, so it was designed with particular care.
Unlike images, videos contain changes over time, yet most face video datasets focus on static attributes that do not change over time. Therefore, this paper classifies face video attributes into static and dynamic attributes and annotates them in detail.
For static attributes, existing datasets consider only appearance, whereas CelebV-Text covers not only general appearance but also detailed appearance and light conditions. Detailed appearance includes five classes (scars, moles, freckles, dimples, and one eye), while light conditions include six classes, covering light color temperature and brightness.
In addition, three dynamic attributes are designed: motion, emotion, and light direction. The motion attribute extends that of CelebV-HQ, the emotion attribute adopts the eight emotion categories of AffectNet, and light direction has six classes. As in CelebV-HQ, each dynamic attribute is also given start and end timestamps.
Thus, CelebV-Text annotations are designed to capture the details of temporal changes in the video and to enhance the relevance of the text to the video.
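To make the static/dynamic split concrete, the following sketch shows one way a single clip's annotations could be organized, with timestamped dynamic events. The field names are illustrative assumptions, not the released annotation format.

```python
# Illustrative sketch of per-clip annotations following the static/dynamic split
# described above. Field names are assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DynamicEvent:
    label: str        # e.g. "smile", "turn head", "light moves from left to right"
    start_s: float    # start timestamp within the clip
    end_s: float      # end timestamp within the clip

@dataclass
class ClipAnnotation:
    clip_id: str
    general_appearance: List[str] = field(default_factory=list)   # e.g. "wavy hair"
    detailed_appearance: List[str] = field(default_factory=list)  # scars, moles, freckles, dimples, ...
    light_condition: List[str] = field(default_factory=list)      # color temperature, brightness, ...
    actions: List[DynamicEvent] = field(default_factory=list)     # motion attributes with timestamps
    emotions: List[DynamicEvent] = field(default_factory=list)    # one of the eight AffectNet emotions
    light_directions: List[DynamicEvent] = field(default_factory=list)

clip = ClipAnnotation(
    clip_id="000001",
    general_appearance=["young", "wavy hair"],
    emotions=[DynamicEvent("happiness", 0.0, 2.5)],
    actions=[DynamicEvent("turn head", 2.5, 4.0)],
)
```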
In addition, based on these attribute designs, CelebV-Text combines automatic and manual annotation to balance the quality and cost of the dataset.
For attributes that can be automatically annotated, candidate algorithms are first evaluated and those with an accuracy of 85% or better are selected. Light conditions, general appearance, and emotion labels are annotated automatically, and the automatic annotations are then corrected by humans to improve accuracy. Dynamic and detailed appearance attributes require manual annotation, for which annotation workers write natural and appropriate descriptions.
This effective combination of automatic and manual annotation efficiently builds a high-quality data set.
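A minimal sketch of the 85%-accuracy selection step might look like the following, where each candidate automatic annotator is scored on a small human-labeled subset; the function names and data layout are assumptions.

```python
# Sketch of the >=85% accuracy check described above: each candidate automatic
# annotator is scored on a human-labeled subset and only those above the
# threshold are kept. Illustrative only.
from typing import Callable, Dict, List, Tuple

ACCURACY_THRESHOLD = 0.85

def accuracy(predict: Callable[[str], str], labeled_subset: List[Tuple[str, str]]) -> float:
    correct = sum(1 for clip, gold in labeled_subset if predict(clip) == gold)
    return correct / len(labeled_subset)

def select_annotators(candidates: Dict[str, Callable[[str], str]],
                      labeled_subset: List[Tuple[str, str]]) -> Dict[str, Callable[[str], str]]:
    return {name: fn for name, fn in candidates.items()
            if accuracy(fn, labeled_subset) >= ACCURACY_THRESHOLD}
```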
In addition, common multimodal text-video datasets generate text using subtitles, manual text generation, or automatic text generation, but each of these has its own drawbacks: subtitles are easy to acquire but noisy and weakly relevant; manual generation is time-consuming, costly, and hard to scale; and automatic generation scales easily but struggles with the diversity, complexity, and naturalness of the generated text.
To solve these problems, this paper proposes semi-automatic template-based text generation, which combines the advantages of the manual and automatic approaches. In this method, annotators first write 10 different facial video descriptions for each attribute, and their grammatical structures are analyzed. A probabilistic context-free grammar (PCFG) is then used to design templates that increase the diversity of the generated text.
These methods allow for the efficient generation of natural and diverse text and the scalable construction of quality text-video data sets.
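The following toy example illustrates the idea of PCFG template expansion: nonterminal symbols are rewritten by randomly chosen weighted rules until only words remain. The grammar rules and vocabulary here are invented for illustration and are not the paper's actual templates.

```python
# Toy probabilistic context-free grammar (PCFG) sketch of template-based text
# generation, in the spirit of the semi-automatic method described above.
import random

GRAMMAR = {
    "S": [(["NP", "VP"], 1.0)],
    "NP": [(["a person with", "APPEARANCE"], 0.6),
           (["someone who has", "APPEARANCE"], 0.4)],
    "VP": [(["looks", "EMOTION", "and then", "ACTION"], 0.5),
           (["ACTION", "while looking", "EMOTION"], 0.5)],
    "APPEARANCE": [(["wavy hair"], 0.5), (["a double chin"], 0.5)],
    "EMOTION": [(["happy"], 0.5), (["surprised"], 0.5)],
    "ACTION": [(["turns the head"], 0.5), (["talks"], 0.5)],
}

def expand(symbol: str) -> str:
    if symbol not in GRAMMAR:                       # terminal word or phrase
        return symbol
    rules, weights = zip(*GRAMMAR[symbol])          # candidate rewrites and their weights
    rule = random.choices(rules, weights=weights, k=1)[0]
    return " ".join(expand(s) for s in rule)

print(expand("S"))  # e.g. "a person with wavy hair looks happy and then turns the head"
```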
Statistical Analysis of Data Sets
This paper compares CelebV-Text to other major facial video datasets and provides a comprehensive analysis of video, text, and text-video relationships.
CelebV-Text contains approximately 70,000 video clips with a total playback time of approximately 279 hours, and each clip is paired with 20 descriptive texts covering all six attributes. It is also larger in scale and higher in resolution than other datasets. For example, VoxCeleb2 has a large sample size but limited video diversity because it mainly consists of talking faces, whereas the samples in CelebV-HQ and CelebV-Text have a wider distribution because they were collected with a variety of queries. In particular, CelebV-Text has about twice as much video data as CelebV-HQ, more video attributes, and more relevant text descriptions. Compared to MM-Vox, the only existing facial text-video dataset, it is superior in both size and quality.
To show the distribution of attributes in CelebV-Text, general appearance, movement, and light direction attributes are divided into groups. Facial features (e.g., double chin, large nose, egg-shaped face) account for about 45%, the basic group about 25%, and the beard type about 12%. Hairstyle and accessory groups account for about 10% and 8%, respectively. In terms of movement attributes, head-related movements account for about 60% and eye-related movements about 20%. Interaction groups (e.g., eating), emotional groups (e.g., laughing), and daily groups (e.g., sleeping) account for about 9%, 7%, and 4%, respectively. In terms of light direction, most of the sample includes frontal lighting, with the remainder evenly distributed.
The quality of the collected videos is analyzed and compared to MM-Vox and CelebV-HQ to show the superiority of CelebV-Text, using BRISQUE for image quality and VSFA for video quality. In image quality, CelebV-Text and CelebV-HQ score significantly higher than MM-Vox. Video quality is likewise highest for CelebV-Text, possibly because its video splitting method reduces discontinuities caused by background switching.
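As a rough sketch of the frame-level quality check, the code below averages BRISQUE scores (lower is better) over sampled frames, assuming the third-party `piq` and OpenCV packages; VSFA would require its own pretrained model and is omitted. This is not the authors' exact evaluation code.

```python
# Frame-level BRISQUE scoring sketch (lower is better). Assumes the `piq` package
# and OpenCV are installed; illustrative, not the paper's evaluation code.
import cv2
import torch
import piq

def brisque_of_video(path: str, max_frames: int = 32) -> float:
    cap = cv2.VideoCapture(path)
    scores = []
    while len(scores) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).float() / 255.0
        scores.append(piq.brisque(x, data_range=1.0).item())
    cap.release()
    return sum(scores) / len(scores) if scores else float("nan")

print(brisque_of_video("sample_clip.mp4"))
```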
The texts in CelebV-Text are longer and more detailed than those in MM-Vox and CelebV-HQ: the average text length is 28.39 words for MM-Vox, 31.06 for CelebV-HQ, and 67.15 for CelebV-Text. The comprehensive annotation allows CelebV-Text video descriptions to contain more words.
Unique parts of speech (verbs, nouns, adjectives, and adverbs) are compared across the three datasets to examine linguistic diversity. With its comprehensively designed attribute lists and templates, CelebV-Text has a broader range of texts and covers a wide range of temporal static and dynamic facial attributes.
In addition, we examine the textual naturalness and complexity of CelebV-Text compared to MM-Vox. We find that the grammatical structure and synonym substitution significantly improve the linguistic naturalness and complexity of CelebV-Text.
In addition, a text-video retrieval task is performed on three datasets, MM-Vox, CelebV-HQ, and CelebV-Text, to quantitatively examine the relevance between text and video. Recall@K (R@K), median rank (MdR), and mean rank (MnR) are used as evaluation metrics; higher R@K and lower median and mean ranks indicate better performance.
First, performance is evaluated with texts containing general appearance descriptions; both CelebV-HQ and CelebV-Text results are better than MM-Vox, indicating that the designed templates generate text that is more relevant to the videos. Next, descriptions of dynamic emotion changes are added, with similar results for both datasets, indicating high annotation accuracy for static appearance attributes. Finally, adding action descriptions achieves the best performance on most metrics.
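These retrieval metrics can be illustrated with a small sketch that computes R@K, MdR, and MnR from a text-to-video similarity matrix, assuming text i is paired with video i; this is illustrative, not the paper's evaluation code.

```python
# Recall@K, median rank (MdR), and mean rank (MnR) from a text-to-video
# similarity matrix. Ground truth is assumed to be the diagonal (text i <-> video i).
import numpy as np

def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    order = np.argsort(-sim, axis=1)                      # videos sorted by similarity per text
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-based rank of the true video
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

sim = np.random.rand(100, 100)  # placeholder similarity scores for illustration
print(retrieval_metrics(sim))
```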
Validation of the Usefulness of the Dataset
Here, to validate the effectiveness of the CelebV-Text dataset, we generate facial videos from text and benchmark the task using representative methods.
To demonstrate the effectiveness of CelebV-Text's static and dynamic attribute descriptions, we conduct several experiments based on MMVID, a recent open-source method, and compare it against CogVideo.
To first validate the effectiveness of CelebV-Text on static attributes, we generate videos based on descriptions of general appearance, facial details, and light conditions. MMVID is trained from scratch on CelebV-Text, and three input texts are created, each containing an individual description for one static attribute. These texts are then fed to MMVID and CogVideo, and the generated videos are compared.
The visualization results for general appearance are shown in Figure (a) below. CogVideo generates facial videos based on the textual descriptions but shows low relevance between text and video for attributes such as "dark circles under eyes" and "wavy hair". In contrast, MMVID generates videos that include all the attributes described in the text, showing high relevance.
We also validate CelebV-Text on changes in dynamic attributes (e.g., emotion, motion, light direction). In Figure (b) above, CogVideo fails to reflect the temporal changes described in the input text (e.g., smile -> rotate), whereas MMVID trained on CelebV-Text accurately models changes in dynamic attributes, demonstrating the validity of the dataset.
Note that CogVideo has roughly 100 times more parameters than MMVID and is trained on a text-video dataset about 75 times larger than CelebV-Text. Nevertheless, as shown in the figure above, the video samples produced by CogVideo are of lower quality than those produced by MMVID trained on CelebV-Text alone, which demonstrates the effectiveness of the proposed dataset.
Text-to-video generation techniques are evolving rapidly, and MM-Vox has been the sole benchmark for facial video generation. This paper extends it and builds a new benchmark using three datasets: MM-Vox, CelebV-HQ, and CelebV-Text. This allows a comprehensive evaluation of the task of generating facial videos from text. Two methods, TFGAN and MMVID, are selected for performance evaluation with the following metrics:
- FVD: evaluates temporal consistency
- FID: evaluates the quality of each frame
- CLIPSIM: evaluates the relevance between the text and the generated video (a minimal sketch of this metric follows the list)
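As referenced above, here is a minimal CLIPSIM-style sketch that averages CLIP cosine similarity between the input text and sampled frames of a generated video. The Hugging Face CLIP checkpoint and the frame-sampling choices are assumptions; the paper's exact implementation may differ.

```python
# CLIPSIM-style score sketch: average CLIP cosine similarity between the input
# text and sampled frames of a generated video. Assumes transformers, OpenCV, PIL.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipsim(video_path: str, text: str, max_frames: int = 16) -> float:
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    inputs = processor(text=[text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).mean())

print(clipsim("generated_sample.mp4", "a young woman with wavy hair smiles and then turns her head"))
```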
Varied texts including static and dynamic attributes were used for the quantitative evaluation of the baseline methods. As shown in the table below, MMVID outperforms TFGAN. We also find that the quality of the videos generated by MMVID degrades when the input text contains temporal state changes.
The figure below shows a sample of videos by MMVID trained on different datasets. We can see that these video frames are 128 x 128 pixels, are temporally consistent, and are of high quality. However, we also see that MMVID sometimes fails to perfectly reproduce the attributes described in the input text.
Summary
This paper proposes CelebV-Text, a large-scale facial text-video dataset with static and dynamic attributes. The dataset contains 70,000 video clips, each paired with 20 texts describing static and dynamic elements. Extensive statistical analysis and experiments demonstrate the superiority and effectiveness of CelebV-Text.
The paper also states that future work will further expand the scale and diversity of CelebV-Text. CelebV-Text is also expected to enable new tasks, such as fine-grained control of facial video generation, adaptation of general pre-trained models to the face domain, and text-driven generation of 3D-aware face videos.