AI's Cambrian Explosion: The Key To The Era Of Finding And Utilizing Useful AI Creators

Video Generation 18/03/2024

3 main points
✔️ Examples of producing "dancing," "singing," and "talking" promotional videos with generative AI
✔️ A trial of using voice generation AI to make a dog's bark speak English, with verification of reproducibility
✔️ Comparison of view counts (PV) with videos made without AI, and an investigation of effectiveness

Prototype and discussion of singing and dancing videos using AI technology
written by Takahiro Yonemura
(Submitted on 5 Nov 2022)
Subjects: Motion & Dance

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

The author feels that the emergence of various AIs since the paper was published resembles the Cambrian explosion, when life diversified all at once. The paper defines these various AIs as "AI creators" and clearly describes examples of their use. The promotional video (A) created in cooperation with AI creators received approximately 19 times more views than the promotional video (B) made without such cooperation, and the paper discusses the pros and cons of using AI creators. Video A in the table below is a pure entertainment video that omits the product promotion part of Video C.

Comparison of views (PV)

This article explores the creativity of "AI creators" and real-life examples of their work.

Hand-drawn illustration converted to 3D by AI: 3DCG of 2D images by AI. Source: Journal of Art Science DiVA No. 55, p. 16

Just as the prehistoric Cambrian explosion enriched the diversity of life, in an age where AI creators flourish, users' knowledge and creativity are expanding and the barrier between dreams and reality is lowering. Below is a summary of cooperation with AI creators, their capabilities, and examples of their use.

AI called "generative," not "synthetic."

AI for each genre, such as sound, music, images, and text, is generally referred to as generative AI. The key point is that it is called generative AI, not synthetic AI. Generative AI can generate output from scratch based on learned resources (materials). The output is therefore not the same as a synthesized product, which raises the question of who holds the rights to it. Research into the "ethical, legal, and social issues (ELSI)" of commercial use of generative AI output has been mentioned, but these issues remain unresolved to this day.

Recent photo of the author converted to 3DCG with generative AI (created by the author using TRIPO3D, https://www.tripo3d.ai/)

The paper also presented the technical side of the issue.

In terms of creation, the paper asserts that teaming with humans was already possible. However, a human is still required to fine-tune the AI.

This issue will no doubt be resolved in the not-too-distant future, as the ever-evolving capabilities of generative AI relax or remove these conditions.

Supplemental Creative Resources and AI Creators

Some of the resources for video composition shown in the paper can now be prepared quickly by a human giving textual instructions (prompts) to a generative AI. The environment has evolved to the point where the necessary resources can be prepared simply by thinking of the prompts that generate them.

About creative (video composition) resources

Among the resources, music (1) was considered difficult to generate at a practical level because inspiration plays such a large part, but music generation AIs such as "Suno" have now reached a practical level. For avatars (4), 3DCG generation AIs, such as the one used above to turn the author's photo into 3DCG, are being released amid sharpening competition. Lyrics (2) can be prepared with a feature of the music generation AI or by having an interactive generative AI write them. The background images in (5) can be handled by many image generation AIs (e.g., DALL-E 3).

AI creators need to understand and use the resources prepared by (analog) humans. The following resources were created by the author.

(1) Music [5]: 44 kHz, WAV format file
(2) Japanese lyrics
(3) Barking voice of the author's dog: 44 kHz, 3-second WAV format file
(4) Avatar (VRM format)
(5) Photos the author holds the rights to (book cover, paper craft model [6], background)

[5] A-Rumenoy, Scobey's Song (Extended Version) [Songs], ISRC SE-6HN-22-84796, Provided by Amuseio AB., Apr 2022.
[6] T. Yonemura and K. Furukawa, Paper Craft Made with Software "Paper Dragon", NICOGRAPH 2012, pp. 115-118, 2012.
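
Audio resources like (1) and (3) are easy to check programmatically before handing them to an AI creator. The following is a minimal sketch using only Python's standard `wave` module. It assumes that "44 kHz" means the common 44.1 kHz rate, and it builds a synthetic 3-second tone in memory as a stand-in for the real recording, which is not reproduced here. The function names are illustrative.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # assumption: "44 kHz" in the list means 44.1 kHz


def make_test_wav(seconds: float, freq_hz: float = 440.0) -> bytes:
    """Build a mono 16-bit WAV in memory (stand-in for a real recording)."""
    n_frames = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return buf.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read back the properties that matter for a training or music resource."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "duration_s": w.getnframes() / w.getframerate(),
        }


info = describe_wav(make_test_wav(3.0))
print(info)
```

For a real file, `wave.open("path/to/file.wav", "rb")` can be used in place of the in-memory buffer.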

However, there is a disadvantage to keep in mind. In principle, resources prepared by generative AI cannot be used for commercial purposes and are limited to personal use, even if they are rearranged.

On the other hand, the spread of interactive generative AI has created an environment in which human creators can generate the necessary resources by explaining what they need in Japanese, even if they do not know specialized prompts.

About AI Creators (Production Team)

The following supplements the paper's description of the AI creators involved in production. Speech synthesis (2) is becoming more problematic: many higher-performance speech generation AIs are now available (e.g., VALL-E-X), and they can be used to create deepfakes. Some AI singing software (3) has been integrated as a feature of music generation AI. Similarly, for translation (4), interactive generative AIs such as ChatGPT are now capable of native-level translation. While this consolidation has the disadvantage of losing originality, it also simplifies the creative process by reducing the number of different generative AIs that human creators need to use.

The following AI creators, which were approved for commercial use at the time of production, were assigned to the production team and worked on the creative resources.

(1) Charamin Studio (AHS Corporation): software that analyzes musical compositions and uses AI to generate motions and camera work for avatars
(2) NarikiriVC (produced by NON906, 2018-2021): software that machine-learns audio material and synthesizes speech from text
(3) CeVIO Pro [7]: AI singing software
(4) DeepL (DeepL SE): translation using AI

Trial of making sounds speak with voice generation AI

Voice "synthesis" was realized in the 1980s. It was a mechanical method that combined the waveforms emitted by several operators so that the output resembled the waveform of a voice. However, this method cannot turn a dog's bark (a sound) into a speaking voice, because there is no waveform to imitate (dogs cannot speak).

Generative AI, on the other hand, speaks and sings by voice "generation," which is an intelligent method. It identifies the spoken voice, extracts the "voice" information, and repeatedly analyzes its frequency components and characteristics. This information is then referred to during generation as numerical voiceprint data. However, even with the latest voice generation AI, when a creator wants to output a non-speech sound as clear "speech" for their own purposes, as in this trial, the voice may be unstable or training may produce errors, so human adjustment is sometimes necessary.

Comparison of visualized voiceprints (Source: Figure 9 of the paper, voiceprint display after machine learning)

The software recommends several runs of machine learning, but it assumes a human voice as the learning material in the first place. With the dog's bark, the system had to be re-trained more than a hundred times, feeding back the results each time, until the output was at least recognizable as "words".

In addition, the training material is unsupervised data, so the machine learning process requires a large number of iterations. However, once the trained model is ready, the speech generation AI can generate arbitrary waveforms from scratch. As a result, the spoken sounds "A" and "I" could be output from a sound resource with no similar waveforms. The characteristic pattern of A appears in B, which confirms that the features of the voiceprints are almost identical.
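
One way to make "almost identical voiceprints" concrete is to compare frequency spectra numerically. The sketch below is not the method used by NarikiriVC or in the paper. It is a generic illustration that computes a magnitude spectrum with a naive discrete Fourier transform and compares two spectra by cosine similarity. The sample rate, frame length, and test tones are all assumptions chosen for the example.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz, illustrative value
FRAME = 256         # samples per analysis frame, illustrative value


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def cosine_similarity(a, b):
    """1.0 means the two spectra have the same shape, 0.0 means no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def tone(freqs, phase=0.0):
    """Synthetic stand-in for a voice frame: a sum of sine waves."""
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE + phase) for f in freqs)
        for t in range(FRAME)
    ]


spec_a = magnitude_spectrum(tone([250, 750]))             # reference "voice"
spec_b = magnitude_spectrum(tone([250, 750], phase=0.7))  # same components, shifted in time
spec_c = magnitude_spectrum(tone([1000, 2000]))           # different components

same = cosine_similarity(spec_a, spec_b)
different = cosine_similarity(spec_a, spec_c)
print(same, different)
```

Because the magnitude spectrum ignores phase, two frames with the same frequency components score close to 1 even when their waveforms look different, which is the sense in which two voiceprints can match.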

Implications for unsupervised (unlabeled data) machine learning

Machine learning uses data from a variety of resources, and two types of data are distinguished: supervised data, which carries labels representing the correct answers, and unsupervised data, to which no correct answers are attached. The purpose of machine learning with unsupervised data is to discover unknown patterns or features in the data during training and make them available in a usable model. This is an appropriate choice here, since the goal is to generate sounds as spoken voices. In general, however, supervised machine learning is more often used when the goal is to output the best answer to a well-defined question, such as solving a mathematical equation.

Through the trial and error and machine learning described above, the promotional video includes scenes in which the dog speaks English, as if to say "this is what my dog would sound like if he spoke."
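
The difference between the two kinds of data can be shown with a toy example. The sketch below is purely illustrative and unrelated to the software used in the paper. It takes one made-up numeric feature per sample, then fits class centers once with labels (supervised) and once without labels using a two-cluster 1-D k-means (unsupervised). The feature values and label names are hypothetical.

```python
def fit_supervised(samples, labels):
    """Supervised: labels are given, so each class center is the mean of its samples."""
    groups = {}
    for value, label in zip(samples, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def fit_unsupervised(samples, iterations=10):
    """Unsupervised: no labels, so two cluster centers are discovered (1-D k-means, k=2)."""
    centers = [min(samples), max(samples)]  # simple deterministic initialization
    for _ in range(iterations):
        clusters = [[], []]
        for value in samples:
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Hypothetical feature values and labels, for illustration only.
features = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["bark", "bark", "bark", "speech", "speech", "speech"]

supervised_centers = fit_supervised(features, labels)
unsupervised_centers = fit_unsupervised(features)
print(supervised_centers, unsupervised_centers)
```

On this clean data both approaches find the same two groups, but only the supervised version knows what the groups are called. The unsupervised version has to discover the structure by iterating.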

Using the dog's bark as narration (excerpt from the video)

About Virtual Singer (Song Generation AI)

Singing-enabled voice-generating AIs are referred to as virtual singers. Such singing-generating AIs are based on text-to-speech (TTS) technology, which generates speech from text. Unlike normal speech, singing involves elements such as melody, rhythm, pitch and intensity. Generative AI involves these elements in the voice generation process to produce a song-like voice.

To do this, the AI needs to learn singing styles and expressions. Generally, a large amount of singing data is analyzed and annotated (labeled as teacher data) as needed. For singing data, the labels cover lyrics, pitch, rhythm, and emotional expression. The pre-processed data is then used for supervised learning to build deep learning models. Specialized models, such as GANs and VAEs, may also be incorporated. Built on this training and these models, singing generation AI continues to evolve toward the practical stage.

The human creator's job is to adjust the parameters of the model and the generation method so that the singing is refined. When creating the promotional video, the author added breath sounds and specified slurs and ties, and also made adjustments to express the individuality of the singing. As in other fields, the accuracy of generative AI is increasing every day, and some singing generation AIs have been released that can produce realistic, natural singing voices without adjustment.
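
As a rough picture of what labeled singing data might look like, the sketch below defines a hypothetical record with the four label types mentioned above (lyrics, pitch, rhythm, emotional expression). It is not the data format of CeVIO Pro or any other product. The only established facts used are the standard MIDI note numbering (69 = A4 = 440 Hz in twelve-tone equal temperament) and the conversion from beats to seconds at a given tempo.

```python
from dataclasses import dataclass


@dataclass
class LabeledNote:
    """Hypothetical annotation for one sung note."""
    lyric: str       # syllable sung on this note
    midi_pitch: int  # MIDI note number, 69 = A4
    beats: float     # note length in beats (rhythm)
    expression: str  # free-form emotion label


def pitch_hz(midi_pitch: int) -> float:
    """Equal-temperament frequency for a MIDI note number."""
    return 440.0 * 2 ** ((midi_pitch - 69) / 12)


def duration_seconds(beats: float, bpm: float) -> float:
    """Length of a note in seconds at the given tempo."""
    return beats * 60.0 / bpm


phrase = [
    LabeledNote("la", 69, 1.0, "soft"),
    LabeledNote("la", 81, 2.0, "strong"),
]
for note in phrase:
    print(note.lyric, pitch_hz(note.midi_pitch), duration_seconds(note.beats, bpm=120))
```

A supervised singing model would be trained on many such records paired with recorded audio, so that it can later produce audio from the labels alone.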

VoiSona (formerly CeVIO Pro) song editing screen

Automatic motion generation synchronized with music

The "Charamin Studio" software used here analyzes the frequencies of the music and obtains the rhythm (beat) mainly from the drum and bass sounds. It then uses a somewhat mechanical method to "create" motions that make the avatars dance in sync with the rhythm.
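
How Charamin Studio extracts the beat internally is not described in the paper, so the following is only a generic sketch of one common approach: compute a short-time energy envelope and look for the lag at which the envelope best matches a shifted copy of itself (autocorrelation). The synthetic click track stands in for drum hits, and the sample rate, frame size, and tempo range are assumptions. A real implementation would typically filter for low frequencies first to focus on drum and bass sounds.

```python
import math


def click_track(bpm, seconds, sample_rate):
    """Synthetic stand-in for drum hits: a short decaying burst on every beat."""
    period = int(sample_rate * 60 / bpm)
    samples = [0.0] * int(seconds * sample_rate)
    for start in range(0, len(samples), period):
        for j in range(20):
            if start + j < len(samples):
                samples[start + j] = math.exp(-j / 5)
    return samples


def energy_envelope(samples, frame_size):
    """Sum of squared samples per non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def estimate_bpm(samples, sample_rate, frame_size=10, min_bpm=60, max_bpm=180):
    """Pick the beat period whose lag maximizes the envelope autocorrelation."""
    env = energy_envelope(samples, frame_size)
    frames_per_second = sample_rate / frame_size
    min_lag = math.ceil(frames_per_second * 60 / max_bpm)
    max_lag = math.floor(frames_per_second * 60 / min_bpm)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: sum(env[i] * env[i + lag] for i in range(len(env) - lag)),
    )
    return 60 * frames_per_second / best_lag


track = click_track(bpm=120, seconds=8, sample_rate=1000)
bpm = estimate_bpm(track, sample_rate=1000)
print(bpm)
```

Once the beat period is known, motion keyframes can be placed on beat boundaries, which is one plausible reading of the "somewhat mechanical" synchronization described above.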

Avatar motion auto-creation screen

This technology is also evolving quickly. New generative AIs have begun to be released, such as "Magic Animate," which transfers a performer's movements directly onto a 2D image, and these can be used together with music generation AI and 3DCG generation AI. Combined with human creators, they are producing an environment in which creation with a high degree of freedom is possible.

Image of the author's photo made to run with Magic Animate

A little trick with avatars and background images

Some software lets you set up avatars (3DCG) and background images (2D) one by one yourself. In that case, a skydome is a good way to reduce the man-hours involved. This is a method of UV-mapping an image onto a (hemi)sphere, like a planetarium, to create a 3D shape. The background image can be prepared with an image generation AI, but the two ends of the image must connect, otherwise the seam will be visible and will have to be hidden. If you can live with this point, the man-hours involved can be reduced.
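
The skydome idea can be sketched in a few lines. The code below is an illustration under assumed conventions (u wraps around the horizon, v runs from horizon to zenith, y is up). It is not taken from any particular 3D package. It maps texture coordinates onto a hemisphere and includes a crude seam check that compares the left and right edge columns of a grayscale image.

```python
import math


def uv_to_dome(u, v, radius=1.0):
    """Map texture coordinates in [0, 1] to a point on a hemisphere.

    u is the azimuth around the horizon, v goes from horizon (0) to zenith (1).
    """
    azimuth = 2 * math.pi * u
    elevation = (math.pi / 2) * v
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.sin(elevation)
    z = radius * math.cos(elevation) * math.sin(azimuth)
    return (x, y, z)


def seam_error(image):
    """Mean absolute difference between the left and right edge columns.

    image is a list of rows of grayscale values. A large value suggests a visible seam.
    """
    return sum(abs(row[0] - row[-1]) for row in image) / len(image)


# u = 0 and u = 1 land on the same 3D point, which is where the seam appears.
left_edge = uv_to_dome(0.0, 0.3)
right_edge = uv_to_dome(1.0, 0.3)

gradient = [[x for x in range(8)] for _ in range(4)]            # edges differ: seam shows
wrapped = [[min(x, 7 - x) for x in range(8)] for _ in range(4)]  # edges match: seam hidden
print(seam_error(gradient), seam_error(wrapped))
```

Since both image edges are drawn at the same place on the dome, any difference between them shows up as a vertical line in the background, which is why the two ends of a generated image need to connect.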

During the creation of the promotional video, human creators performed the tasks shown in the figure below.

Relationship between the avatar and the skydome: UV mapping of a 2D image onto a skydome (3D)

Widespread use of markerless motion capture

Motion capture once required large-scale systems that use physical markers on the body to convert motion into data. Thanks to the emergence of AI that can perform highly accurate image recognition, it can now be handled by software. Motion capture that uses software to process video input from a camera is commonly referred to as markerless motion capture. The role of AI in this process is to identify the contours and characteristic parts of the human body in each input frame and to convert the motion patterns into numerical data.

To do this, the AI needs a large dataset containing various human motions and poses. Such datasets are now available from research institutes and organizations, which makes markerless motion capture easier to implement in software.
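
To illustrate what "converting motion patterns into numerical data" can mean, the sketch below assumes that an image recognition step has already produced 2D keypoints for a frame, and turns three of them into a joint angle. The keypoint names and coordinates are made up for the example. Detecting the keypoints themselves is the AI's job and is not shown.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_angle = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against tiny floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


# Hypothetical keypoints for two frames (x, y in image coordinates).
frames = [
    {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (2.0, 0.0)},  # arm straight
    {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (1.0, 1.0)},  # arm bent
]

elbow_angles = [
    joint_angle(f["shoulder"], f["elbow"], f["wrist"]) for f in frames
]
print(elbow_angles)
```

A sequence of such angles per joint, one set per frame, is the kind of numerical data that can then drive an avatar's skeleton.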

Pattern recognition and machine learning "collaboration"

AI's image recognition accuracy has been improved by combining pattern recognition with machine learning. Machine learning compensates for complex input information that pattern recognition, which relies on symbols and mathematical formulas, cannot represent on its own. In machine learning, the algorithm uses supervised data as a reference, and the AI itself finds and learns the regularities. Since the learning method is similar to pattern recognition, pattern recognition is sometimes described as a part of machine learning.

When creating the promotional video, human creators acted as performers to accentuate the avatars' movements and add unique effects.

Example of markerless (software) motion capture

Comparison and effectiveness of prototype promotional videos

Excerpt images from the prototype promotional video

In cooperation with the AI creators, a promotional video with entertainment elements, such as the one above, was completed. For reference, figures for three videos are shown for comparison.

The author discusses the information obtained from one week of the videos being published on the author's YouTube channel. Because the sample is small, these figures are treated as reference values only.

Figure 10-1 is the Japanese version of the video described in this paper released at the time of the Tokyo Olympics (about 1 minute, defined as A), Figure 10-2 is simply a video showing a series of product promotions (15 seconds, defined as B), and Figure 10-3 is the improved Japanese version of this video (about 1 minute, defined as C).

Views (PV) of the videos in Figure 10

Evaluation as Entertainment

Dancing videos are a worldwide phenomenon. The reference values shown also indicate that promotional videos composed mainly of dance (entertainment), A and C, have a large number of views. In particular, video A, a Japanese-language video that was created in time for the Tokyo Olympics by combining the output of AI creators with almost no processing, received a particularly high number of views. In contrast, video B, which was created by an amateur video creator, did not attract viewers' interest. There is no significant difference in the number of repeat views.

Image showing Tables 1 and 2

Evaluation as a commercial promotion

The role of promotional videos is to direct viewers to product information and service providers. In other words, the number of clicks through to the website is the outcome. In the values shown (click rate), videos A and C are low, a result that could be attributed to the lack of coordination with the AI creators. But what would users think if they had to choose between products with the same content and features when making a purchase? The author surmised that the results are consistent with consumer psychology when choosing between a design that is too unique and one that is simple and calm.

Image showing Table 3

Summary

The key to making the most of AI creators for entertainment purposes is to give them unexpected instructions and training material and then make full use of what they output. Likewise, AI creators that can respond to unexpected instructions and training are useful and capable. This is one of the key points in identifying a generative AI that will cooperate with you.

When using AI for practical purposes, the key point is to leave the generation to the AI creator and use it without modifying the output content too much. Even if the originality of the creation is reduced, the result will be coherent. Some of the generative AIs that continue to evolve every day have already reached a practical level. And ......

The technical issues will be resolved as time goes on, but I cannot imagine what the best way to utilize or collaborate with them will be. I look forward to such future developments.

The discussion concludes that the creative style with AI creators is unimaginably fun. So we posed this question to the latest interactive generative AI ( ChatGPT ) itself. I will summarize the article with the response from the generative AI. The response was similar to the author's subjective view. Yes, it seems to me that the first contact with an AI capable of "exciting" interactions has long been underway.

米村貴裕 (Takahiro Yonemura): Takahiro Yonemura is a multi-creator and author from Tokyo, Japan. He founded Inazuma Corporation while in graduate school and earned a Doctor of Engineering degree from Kindai University. He also has a passion for dragons and is an enthusiastic plant grower, and he composes music under the name A-Rumenoy. Yonemura has authored over 67 published works, including technical books, science fiction, and articles. He has received recognition for his work, including the Wakayama City Mayor's Award for game design and selection as a recommended work for the 10th Cultural Media Arts Festival. In addition to his creative endeavors, Yonemura has focused on scholarly work, and his paper on AI was published and presented in 2022. He is also the author of "The Metallic Dragon and I" and the graphic novel "Beast Code," which was released in the United States on November 16, 2022.