![[Zero-shot Learning] AI Voice Cloning And Lip-syncing Verification And Explanation](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/October2024/arumenoy-tts.png)
[Zero-shot Learning] AI Voice Cloning And Lip-syncing Verification And Explanation
3 main points
✔️ How zero-shot learning deals with unknown concepts by reasoning from existing knowledge
✔️ Demonstration and discussion of cloning the voices of the author and her dog from a few seconds of audio data and having them speak other languages like natives
✔️ Visualization of the voice and image feature points involved in zero-shot learning
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
written by Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
[Submitted on 7 Jun 2024 (v1)]
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2024

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
written by Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang
[Submitted on 3 Jul 2024 (v1)]
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Zero-shot learning, which the papers covered here address, has become one of the techniques used in many fields where AI is applied, deriving appropriate answers to unknown situations (information) from existing learned knowledge.
In general, this technique shows that AI can reason from other knowledge it has already learned (its "experience") to produce plausible answers even for completely unknown things. To what extent, then, can it handle and respond to unknown situations?
In this article, we discuss AI voice cloning technology and lip-syncing technology, close relatives of deepfakes, while putting them to the test in practice.
Bilingual cloned audio generated from my dog's howls
Multilingual World of Voice Cloning Technology Using AI
AI voice cloning technology learns the "characteristics of a voice" from just a few seconds of sample audio and then reads out sentences in a voice that resembles the sample.
One of its most notable technical features is zero-shot learning, which, through repeated inference (estimation), allows natural reproduction of voices that are not present in the training data. Even my dog's howls were converted into a speaking voice.
And since XTTS and VALL-E X are multilingual, a voice cloned from Japanese can read English "like a native", and vice versa.
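As a concrete illustration, here is a minimal sketch of how such a cloned, cross-lingual readout can be produced with the open-source Coqui TTS implementation of XTTS (assuming the `TTS` Python package is installed; the file paths are placeholders):

```python
# A minimal sketch, assuming the Coqui TTS package (`pip install TTS`) and the
# publicly released XTTS v2 checkpoint; file paths are placeholders.
from TTS.api import TTS

# Load the multilingual XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice from a few seconds of reference audio and have it read a
# sentence in another language (here: English)
tts.tts_to_file(
    text="Hello, this sentence is read in my cloned voice.",
    speaker_wav="my_voice_sample.wav",  # a few seconds of the speaker's voice
    language="en",                      # target language code
    file_path="cloned_output.wav",
)
```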
Even the style of exchanges is about to be revolutionized.
In this section, we look at the basic principles of the techniques described in the papers and how they can be used.
Visualization of Feature Extraction Performed by AI
What is the "data feature extraction" that speech recognition AI and image recognition AI perform internally? We visualized one part of it using Librosa and OpenCV, a library in the programming language python.
Feature Extraction of Speech Signals
The mel spectrogram, displayed in different colors, helps the speech recognition AI to capture differences in emotion and speaking style.
The mel spectrogram is a transformation based on how the human ear perceives frequency. It emphasizes the characteristics of the voice, and using it as preprocessing makes it easier for AI to identify subtle differences in emotion, inflection, voice quality, and so on in speech data.
Onset strength, indicated by the dotted line and analogous to peak detection, is an indicator of how strongly the "beginning of a sound" appears in the speech data.
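As a rough sketch of how this kind of visualization can be produced with Librosa, the library mentioned above; `voice_sample.wav` is a placeholder path to a few seconds of recorded speech:

```python
# A minimal sketch, assuming librosa and matplotlib are installed;
# "voice_sample.wav" is a placeholder file name.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a few seconds of audio (mono, librosa's default sampling rate)
y, sr = librosa.load("voice_sample.wav", duration=5.0)

# Mel spectrogram: frequency axis warped to match human pitch perception
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Onset strength: how strongly the "beginning of a sound" appears over time
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(onset_env, sr=sr)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax1)
ax1.set(title="Mel spectrogram")
ax2.plot(times, onset_env, linestyle="--")
ax2.set(title="Onset strength", xlabel="Time (s)")
plt.tight_layout()
plt.show()
```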
Feature Extraction from My Dog's Howling
Feature extraction when howls are voice cloned to mimic words.
The top is a dog howling and the bottom is human speech imitated by a voice clone. A quick glance at both shows the difference.
What is noteworthy, however, is the horizontal stripe pattern (formant structure) the two share. The overlapping frequency bands visible in the low range are characteristic of the dog's howling, and they remain present in both, confirming that the voice cloning AI "does its best" to convert the pattern into one closer to human vocalization.
Technically, we were one step closer to reproducing a "dog that speaks", and we got a glimpse of the capabilities of voice cloning AI.
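For readers who want to try the same comparison, here is a minimal sketch of plotting the two mel spectrograms one above the other with Librosa; `dog_howl.wav` and `cloned_speech.wav` are placeholder file names:

```python
# A minimal sketch, assuming librosa and matplotlib; both file names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
for ax, path, title in zip(
    axes,
    ["dog_howl.wav", "cloned_speech.wav"],
    ["Dog howl", "Voice-cloned speech"],
):
    y, sr = librosa.load(path)
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), ref=np.max
    )
    librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    # Look for shared horizontal bands (the formant-like structure described above)
    ax.set(title=title)
plt.tight_layout()
plt.show()
```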
Image Feature Extraction
Feature extraction from image data by AI models is fundamentally different from conventional methods that extract contours, corners, colors, and so on with hand-crafted algorithms.
Generally, AI models learn the features themselves from large amounts of data. Based on what they have learned, they automatically determine and extract which features of an image are important and how they should be captured.
Part of this extraction is visualized below.
Visualized image of feature extraction from images of my dog and the author
LivePortrait captures implicit keypoints (put simply, hidden coordinates that control facial movement) from images and videos and turns the regions important for movement into natural animation.
In the visualization above, the eyes and mouth are marked, but the AI model also automatically captures "hidden" feature points, in a way that parallels zero-shot learning, which leads to more realistic movement.
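The implicit keypoints themselves live inside LivePortrait's network and are not exposed as a simple function call, but the following toy sketch with OpenCV conveys the general idea of detecting and marking feature points on a portrait; classic corner detection stands in for the learned keypoints, and `portrait.jpg` is a placeholder path:

```python
# A toy sketch with OpenCV: classic corner detection as a stand-in for the
# learned "implicit keypoints"; "portrait.jpg" is a placeholder file name.
import cv2

img = cv2.imread("portrait.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect up to 100 strong corner-like points (eye and mouth corners tend to show up)
corners = cv2.goodFeaturesToTrack(gray, 100, 0.01, 10)

if corners is not None:
    for c in corners:
        x, y = map(int, c.ravel())
        cv2.circle(img, (x, y), 3, (0, 255, 0), -1)  # mark each detected point

cv2.imwrite("portrait_keypoints.jpg", img)  # save the visualization
```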
Because animation-style portrait images are included in the training data, the dog's lip-motion (lip-sync) video could be given a slightly more animated, cuter look.
Basic Principles of Zero-Shot Learning: Judging What Something Is "Like"
Let me explain with a bold analogy.
Suppose an AI has learned images of "dogs" and "cats" and is then shown an image of an animal it has never learned (e.g., a fox). The AI cannot accurately identify the fox by name. Instead, it is as if it judges the fox to be a new animal by analogy with the features of dogs and cats.
The AI combines the information it has already learned like puzzle pieces, asks "Which of the things I know is this new thing most like?", and makes a decision.
If it shares more features with a dog or a cat, it is judged to be "an animal somewhere between a dog and a cat". If more of its features differ, it is judged to be a "new animal".
To determine that an animal is "new", its characteristics are compared with those of a dog or a cat, asking "Which one does it resemble?", and an analogy is drawn.
Processing things as "dog-like" or "cat-like" in this way matches the basic principle of zero-shot learning involved in each paper.
Visualized image showing examples of feature extraction, such as the "eye-like" and "mouth-like" regions of my dog
Visualized image showing examples of feature extraction, such as the "eye-like" and "mouth-like" regions of the author
To draw this "something-like" analogy, the more training data there is to compare against, the more accurate and capable the AI becomes.
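To make the "which does it resemble?" idea concrete, here is a toy sketch that compares feature vectors with cosine similarity; the vectors and the threshold are made-up illustrative values, not output from any real model:

```python
# A toy sketch of zero-shot-style judgment by feature similarity.
# The vectors and the 0.95 threshold are made-up illustrative values.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy feature vectors for the learned classes and for an unseen animal
known = {
    "dog": np.array([0.9, 0.2, 0.7]),
    "cat": np.array([0.3, 0.8, 0.6]),
}
unseen = np.array([0.7, 0.5, 0.7])  # e.g., a fox the model never learned

# Ask "which known class is this most like?"
scores = {name: cosine_similarity(vec, unseen) for name, vec in known.items()}
best = max(scores, key=scores.get)

if scores[best] < 0.95:
    print(f"Closest to '{best}', but different enough to be treated as a new animal")
else:
    print(f"Looks like a '{best}'")
```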
What If We Take It One Step Further?
The AI determines that "this is a new animal that combines information I have already learned", but it does not know its name. So it learns the animal's characteristics as if it were "solving a new puzzle for now".
Then, when it encounters an image of a similar animal again (e.g., another fox), it can determine that "this is similar to the new puzzle I learned before" and process it more efficiently.
It is not wrong to think that repeating this process leads to the evolution and improvement of the AI's capabilities.
What Exactly Is AI Learned Information?
Here we are talking about the AI's memory of features, built up from its training data.
The "learned features" used in image and speech recognition are generally stored as weights in a neural network. These weights can be called parameters for capturing ambiguous features while processing input data.
The learned weights themselves are not in a form that humans can intuitively "understand". They are converted into a mathematical representation.
Specifically, a huge number of matrices and vectors are stored, and the neural network uses them for feature extraction and other recognition tasks.
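As a small illustration, the following sketch (assuming PyTorch is installed) shows that a tiny network's "learned information" really is just matrices and vectors of numbers:

```python
# A minimal sketch, assuming PyTorch: the stored "knowledge" of a tiny network
# is nothing more than weight matrices and bias vectors.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),  # weight matrix of shape (64, 128) plus a bias vector of 64
    nn.ReLU(),
    nn.Linear(64, 10),   # weight matrix of shape (10, 64) plus a bias vector of 10
)

# Every stored parameter is a tensor; none of them is human-readable on its own
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# 0.weight (64, 128)
# 0.bias (64,)
# 2.weight (10, 64)
# 2.bias (10,)
```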
How Do You Capture Features with The Information You Learn?
What the papers present is a description of these techniques together with examples of how they behave in practice.
Currently, the neural networks that underpin recognition, generation, and synthesis are a black box with respect to the data and features they have learned and stored.
We can explain mathematically how weights and parameters work, but it is difficult to understand them intuitively. To use an analogy, it is like looking at the neural network of the human brain and not knowing the full picture of which nerves are doing what.
However, the field of "Explainable AI (XAI)" has recently begun to advance, and there is a movement toward clarifying the decision-making process of AI.
Even so, it will be some time before the black box is completely unraveled.
Author's audio clone (Japanese) and lip-synching verification results
Author's audio clone (English) and lip-synching verification results
Summary
In a sense, this "power to generalize" creates something from nothing: various AIs produce results for situations they have never seen or heard before by analogy with similar patterns.
It is different from mere imitation (copying), even though it is based on past data.
It may be something of a leap, but if the "generalization power" of these AIs continues to grow through zero-shot learning and similar techniques, it seems quite possible that we will eventually reach AGI and ASI, a prospect that brings both hope and a touch of anxiety.
That said, the dog has the following message, and if you watch the video the author made, its movements are comical. Doesn't the future look fun?
Message from my dog
The result of animating the dog with the author's motion