![[Zero-shot Learning] AI Voice Cloning And Lip-syncing Verification And Explanation](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/October2024/arumenoy-tts.png)
[Zero-shot Learning] AI Voice Cloning And Lip-syncing Verification And Explanation
3 main points
✔️ How zero-shot learning deals with unknown concepts by reasoning from existing knowledge
✔️ Demonstration and discussion of cloning the voices of the author and her dog from a few seconds of audio data and having them speak other languages like natives
✔️ Visualization of the voice and image feature points involved in zero-shot learning
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
written by Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
[Submitted on 7 Jun 2024 (v1)]
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2024

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
written by Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang
[Submitted on 3 Jul 2024 (v1)]
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Zero-shot learning, which the papers covered here address, has become one of the techniques used in many fields where AI is applied, deriving appropriate answers to unknown situations (information) from existing learned knowledge.
In general, this technique shows that AI can reason from other knowledge it has already learned (its "experience") to produce plausible answers even for completely unknown things. To what extent, then, can it handle and respond to unknown situations?
In this article, we discuss AI voice cloning technology and lip-syncing technology, close relatives of deepfakes, while putting them to the test in practice.
Bilingual cloned audio generated from my dog's howls
Multilingual World of Voice Cloning Technology Using AI
AI voice cloning technology learns the "characteristics of a voice" from just a few seconds of sample audio and then reads out sentences in a voice that resembles the sample.
One of its most notable technical features is zero-shot learning, which, through repeated inference (estimation), allows natural reproduction of voices that are not present in the training data. Even my dog's howls were converted into a speaking voice.
And since XTTS and VALL-E X are multilingual, a voice cloned from Japanese can read English "like a native", and vice versa.
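As a concrete illustration, here is a minimal sketch of how such a cloned, cross-lingual readout can be produced with the open-source Coqui TTS implementation of XTTS (assuming the `TTS` Python package is installed; the file paths are placeholders):

```python
# A minimal sketch, assuming the Coqui TTS package (`pip install TTS`) and the
# publicly released XTTS v2 checkpoint; file paths are placeholders.
from TTS.api import TTS

# Load the multilingual XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice from a few seconds of reference audio and have it read a
# sentence in another language (here: English)
tts.tts_to_file(
    text="Hello, this sentence is read in my cloned voice.",
    speaker_wav="my_voice_sample.wav",  # a few seconds of the speaker's voice
    language="en",                      # target language code
    file_path="cloned_output.wav",
)
```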
Even the style of exchanges is about to be revolutionized.
In this section, we look at the basic principles of the techniques described in the papers and how they can be used.
Visualization of Feature Extraction Performed by AI
What is the "data feature extraction" that speech recognition AI and image recognition AI perform internally? We visualized one part of it using Librosa and OpenCV, a library in the programming language python.
Feature Extraction of Speech Signals
The mel spectrogram, displayed in different colors, helps the speech recognition AI to capture differences in emotion and speaking style.
The mel spectrogram is a transformation based on how the human ear perceives frequency. It emphasizes the characteristics of the voice, and using it as preprocessing makes it easier for AI to identify subtle differences in emotion, inflection, voice quality, and so on in speech data.
Onset strength, indicated by the dotted line and analogous to peak detection, is an indicator of how strongly the "beginning of a sound" appears in the speech data.
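As a rough sketch of how this kind of visualization can be produced with Librosa, the library mentioned above; `voice_sample.wav` is a placeholder path to a few seconds of recorded speech:

```python
# A minimal sketch, assuming librosa and matplotlib are installed;
# "voice_sample.wav" is a placeholder file name.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a few seconds of audio (mono, librosa's default sampling rate)
y, sr = librosa.load("voice_sample.wav", duration=5.0)

# Mel spectrogram: frequency axis warped to match human pitch perception
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Onset strength: how strongly the "beginning of a sound" appears over time
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(onset_env, sr=sr)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax1)
ax1.set(title="Mel spectrogram")
ax2.plot(times, onset_env, linestyle="--")
ax2.set(title="Onset strength", xlabel="Time (s)")
plt.tight_layout()
plt.show()
```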
Feature Extraction from My Dog's Howling
Feature extraction when howls are voice cloned to mimic words.
The top is a dog howling and the bottom is human speech imitated by a voice clone. A quick glance at both shows the difference.
What is noteworthy, however, is the horizontal stripe pattern (formant structure) the two share. The overlapping frequency bands visible in the low range are characteristic of the dog's howling, and they remain present in both, confirming that the voice cloning AI "does its best" to convert the pattern into one closer to human vocalization.
Technically, we were one step closer to reproducing a "dog that speaks", and we got a glimpse of the capabilities of voice cloning AI.
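For readers who want to try the same comparison, here is a minimal sketch of plotting the two mel spectrograms one above the other with Librosa; `dog_howl.wav` and `cloned_speech.wav` are placeholder file names:

```python
# A minimal sketch, assuming librosa and matplotlib; both file names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
for ax, path, title in zip(
    axes,
    ["dog_howl.wav", "cloned_speech.wav"],
    ["Dog howl", "Voice-cloned speech"],
):
    y, sr = librosa.load(path)
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), ref=np.max
    )
    librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    # Look for shared horizontal bands (the formant-like structure described above)
    ax.set(title=title)
plt.tight_layout()
plt.show()
```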
Image Feature Extraction
Feature extraction from image data by AI models is fundamentally different from conventional methods that extract contours, corners, colors, and so on with hand-crafted algorithms.
Generally, AI models learn the features themselves from large amounts of data. Based on what they have learned, they automatically determine and extract which features of an image are important and how they should be captured.
Part of this extraction is visualized below.
Visualized image of feature extraction from images of my dog and the author
LivePortrait captures implicit keypoints (put simply, hidden coordinates that control facial movement) from images and videos and turns the regions important for movement into natural animation.
In the visualization above, the eyes and mouth are marked, but the AI model also automatically captures "hidden" feature points, in a way that parallels zero-shot learning, which leads to more realistic movement.
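The implicit keypoints themselves live inside LivePortrait's network and are not exposed as a simple function call, but the following toy sketch with OpenCV conveys the general idea of detecting and marking feature points on a portrait; classic corner detection stands in for the learned keypoints, and `portrait.jpg` is a placeholder path:

```python
# A toy sketch with OpenCV: classic corner detection as a stand-in for the
# learned "implicit keypoints"; "portrait.jpg" is a placeholder file name.
import cv2

img = cv2.imread("portrait.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect up to 100 strong corner-like points (eye and mouth corners tend to show up)
corners = cv2.goodFeaturesToTrack(gray, 100, 0.01, 10)

if corners is not None:
    for c in corners:
        x, y = map(int, c.ravel())
        cv2.circle(img, (x, y), 3, (0, 255, 0), -1)  # mark each detected point

cv2.imwrite("portrait_keypoints.jpg", img)  # save the visualization
```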
Because animation-style portrait images are included in the training data, the dog's lip-motion (lip-sync) video could be given a slightly more animated, cuter look.
Basic Principles of Zero-Shot Learning: Judging What Something Is "Like"
Let me explain with a bold analogy.
Suppose an AI has learned images of "dogs" and "cats" and is then shown an image of an animal it has never learned (e.g., a fox). The AI cannot accurately identify the fox by name. Instead, it is as if it judges the fox to be a new animal by analogy with the features of dogs and cats.
The AI combines the information it has already learned like puzzle pieces, asks "Which of the things I know is this new thing most like?", and makes a decision.
If it shares more features with a dog or a cat, it is judged to be "an animal somewhere between a dog and a cat". If more of its features differ, it is judged to be a "new animal".
To determine that an animal is "new", its characteristics are compared with those of a dog or a cat, asking "Which one does it resemble?", and an analogy is drawn.
Processing things as "dog-like" or "cat-like" in this way matches the basic principle of zero-shot learning involved in each paper.
Visualized image showing examples of feature extraction, such as the "eye-like" and "mouth-like" regions of my dog
Visualized image showing examples of feature extraction, such as the "eye-like" and "mouth-like" regions of the author
To draw this "something-like" analogy, the more training data there is to compare against, the more accurate and capable the AI becomes.
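To make the "which does it resemble?" idea concrete, here is a toy sketch that compares feature vectors with cosine similarity; the vectors and the threshold are made-up illustrative values, not output from any real model:

```python
# A toy sketch of zero-shot-style judgment by feature similarity.
# The vectors and the 0.95 threshold are made-up illustrative values.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy feature vectors for the learned classes and for an unseen animal
known = {
    "dog": np.array([0.9, 0.2, 0.7]),
    "cat": np.array([0.3, 0.8, 0.6]),
}
unseen = np.array([0.7, 0.5, 0.7])  # e.g., a fox the model never learned

# Ask "which known class is this most like?"
scores = {name: cosine_similarity(vec, unseen) for name, vec in known.items()}
best = max(scores, key=scores.get)

if scores[best] < 0.95:
    print(f"Closest to '{best}', but different enough to be treated as a new animal")
else:
    print(f"Looks like a '{best}'")
```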
What If We Take It One Step Further?
The AI determines that "this is a new animal that combines information I have already learned", but it does not know its name. So it learns the animal's characteristics as if it were "solving a new puzzle for now".
Then, when it encounters an image of a similar animal again (e.g., another fox), it can determine that "this is similar to the new puzzle I learned before" and process it more efficiently.
It is not wrong to think that repeating this process leads to the evolution and improvement of the AI's capabilities.
What Exactly Is AI Learned Information?
Here we are talking about the AI's memory of features, built up from its training data.
The "learned features" used in image and speech recognition are generally stored as weights in a neural network. These weights can be called parameters for capturing ambiguous features while processing input data.
The learned weights themselves are not in a form that humans can intuitively "understand". They are converted into a mathematical representation.
Specifically, a huge number of matrices and vectors are stored, and the neural network uses them for feature extraction and other recognition tasks.
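As a small illustration, the following sketch (assuming PyTorch is installed) shows that a tiny network's "learned information" really is just matrices and vectors of numbers:

```python
# A minimal sketch, assuming PyTorch: the stored "knowledge" of a tiny network
# is nothing more than weight matrices and bias vectors.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),  # weight matrix of shape (64, 128) plus a bias vector of 64
    nn.ReLU(),
    nn.Linear(64, 10),   # weight matrix of shape (10, 64) plus a bias vector of 10
)

# Every stored parameter is a tensor; none of them is human-readable on its own
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# 0.weight (64, 128)
# 0.bias (64,)
# 2.weight (10, 64)
# 2.bias (10,)
```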
How Do You Capture Features with The Information You Learn?
What the papers present is a description of these techniques together with examples of how they behave in practice.
Currently, the neural networks that underpin recognition, generation, and synthesis are a black box with respect to the data and features they have learned and stored.
We can explain mathematically how weights and parameters work, but it is difficult to understand them intuitively. To use an analogy, it is like looking at the neural network of the human brain and not knowing the full picture of which nerves are doing what.
However, the field of "Explainable AI (XAI)" has recently begun to advance, and there is a movement toward clarifying the decision-making process of AI.
Even so, it will be some time before the black box is completely unraveled.
Author's audio clone (Japanese) and lip-synching verification results
Author's audio clone (English) and lip-synching verification results
Summary
In a sense, this "power to generalize" creates something from nothing: various AIs produce results for situations they have never seen or heard before by analogy with similar patterns.
It is different from mere imitation (copying), even though it is based on past data.
It may be something of a leap, but if the "generalization power" of these AIs continues to grow through zero-shot learning and similar techniques, it seems quite possible that we will eventually reach AGI and ASI, a prospect that brings both hope and a touch of anxiety.
That said, the dog has the following message, and if you watch the video the author made, its movements are comical. Doesn't the future look fun?
Message from my dog
The result of animating the dog with the author's motion