[ChatAnything] A New Framework For Creating Digital Personas From Text

Large Language Models

3 main points
✔️ Introducing ChatAnything, a new framework: proposes a new framework for generating personas with personality and visual characteristics from text input.
✔️ Challenges in image generation and integration with talking head models: addresses the problem of generated images not fitting talking head models.
✔️ Prospects for future research: integrates the generative and talking head models using a zero-shot approach and presents potential improvements.

ChatAnything: Facetime Chat with LLM-Enhanced Personas
written by Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou
(Submitted on 12 Nov 2023)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Recent advances in large-scale language models have been the focus of much attention in the academic community due to their exceptional generalization and contextual learning capabilities. These models can facilitate dialogue across a wide range of topics and provide users with a human-like conversational experience.

This paper proposes ChatAnything, a novel framework for generating personas with customized personalities, voices, and appearances from text input, enhanced by large-scale language models. We harness the contextual learning capabilities of the large-scale language model by designing system prompts that generate a unique character from the user's textual description of the target. We also show how a text-to-speech API can be leveraged to create voice types and select the tone best suited to the user's input.
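As an illustration of how such a persona-initialization system prompt might be composed, here is a minimal sketch; the template wording and function name are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch of composing a persona-initialization system prompt.
# The template wording is an illustrative assumption, not the paper's
# actual prompt.

def build_persona_system_prompt(target: str) -> str:
    """Compose a system prompt asking the LLM to invent a persona
    (personality, speaking style) for a user-given target object."""
    return (
        "You are a scriptwriter. Create a character based on the "
        f"object '{target}'. Describe its personality, tone of voice, "
        "and typical way of speaking, then stay in character for the "
        "rest of the conversation."
    )

prompt = build_persona_system_prompt("apple")
```

The prompt is then set as the system message of the conversation, so that every subsequent reply is produced in character.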

Despite advances in image generation with diffusion models, experiments make clear that such images do not readily serve as sources for talking head models. To quantify this, 400 human-like samples were generated across a variety of categories, and faces were detected in only 30% of them by the face detector used in modern talking head models. The failure stems from a mismatch between the generated images and the face detector's training data; the paper also discusses possible improvements when using a pre-trained face detector.
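The 30% figure corresponds to a simple detection-rate measurement over the generated samples. A sketch of how such a rate could be computed is below; the detector is a stub standing in for a real pre-trained face detector, and the sample counts are only a toy illustration of the arithmetic.

```python
# Sketch of measuring the face-detection rate over generated images.
# `detect_face` stands in for a real pre-trained facial keypoint
# detector; here it is a stub used only for illustration.

def detection_rate(images, detect_face) -> float:
    """Fraction of images in which the detector finds a face."""
    if not images:
        return 0.0
    hits = sum(1 for img in images if detect_face(img))
    return hits / len(images)

# Toy example: 400 samples, a stub detector that succeeds on 120 of them.
samples = list(range(400))
stub_detector = lambda i: i < 120
rate = detection_rate(samples, stub_detector)  # 120 / 400 = 0.3
```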

We are also exploring new possibilities in image editing by taking advantage of the properties of the diffusion process. Specifically, we propose a pixel-level landmark injection method that enables the detection of landmark trajectories in a zero-shot fashion without affecting visual appearance. In addition, we address issues related to the balance between landmark retention and textual concept fusion, utilizing cross-attention blocks to enhance the overall structural information.

The main contributions of this paper are threefold: a new framework for generating personas from text input, augmented by large-scale language models; a zero-shot approach that harmonizes the distributions of pre-trained generative models and talking head models; and an evaluation dataset that quantifies the consistency between the two. Through these contributions, the work takes a step toward more realistic and individualized digital personas.


Here we describe the pipeline of ChatAnything, a new framework that generates personas from text input, augmented by a large-scale language model. The overview is shown in the figure below; the ChatAnything framework is composed of four main components.

The first is a large-scale language model-based control module that initializes the persona's personality from the user's text description; it also manages system operations and invokes applications in response to user interactions. The second is a portrait initialization module that generates a reference image of the persona from a mixture of fine-tuned diffusion models (MoD) with LoRA modules, each specializing in a particular image style; the best-fitting model is invoked automatically by the large-scale language model based on the user's textual persona description. The third is a mixture-of-voices text-to-speech module (MoV) that converts the persona's text output into a speech signal with a customized tone, again selected automatically from the user's textual description via the large-scale language model. The fourth is a motion generation module that takes the audio signal and animates the generated image.
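The four components above can be sketched as a pipeline skeleton. Every interface below is a hypothetical stand-in for the actual models (LLM controller, MoD, MoV, motion generator), shown only to make the data flow concrete.

```python
# Skeleton of the four-module ChatAnything pipeline.
# Every callable here is a hypothetical stub; in the real system these
# would be an LLM controller, a mixture of diffusion models (MoD),
# a mixture of TTS voices (MoV), and a talking-head motion generator.

class ChatAnythingPipeline:
    def __init__(self, llm, portrait_gen, tts, motion_gen):
        self.llm = llm                    # persona init + model selection
        self.portrait_gen = portrait_gen  # MoD: text -> reference image
        self.tts = tts                    # MoV: text -> audio signal
        self.motion_gen = motion_gen      # (image, audio) -> animation

    def create_persona(self, description: str):
        persona = self.llm(description)            # personality text
        portrait = self.portrait_gen(description)  # reference image
        return persona, portrait

    def respond(self, portrait, reply_text: str):
        audio = self.tts(reply_text)
        return self.motion_gen(portrait, audio)

# Toy run with trivial stubs in place of the real models.
pipe = ChatAnythingPipeline(
    llm=lambda d: f"persona for {d}",
    portrait_gen=lambda d: f"portrait of {d}",
    tts=lambda t: f"audio({t})",
    motion_gen=lambda img, aud: (img, aud),
)
persona, portrait = pipe.create_persona("apple")
frame = pipe.respond(portrait, "Hello!")
```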

System Architecture Overview

This section describes the system architecture of ChatAnything. The system consists of the following four key processes:

The first is a guided diffusion process. Image generation using the diffusion algorithm is an iterative process that removes noise step by step. We have found that by properly injecting facial landmarks in the initial stages, we can generate images that are free of visual defects. The process focuses on specific landmarks and customizes the initial steps of image generation based on data retrieved from a predefined external memory.
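The idea of injecting facial landmarks at the start of the diffusion process can be sketched at the pixel level as blending a landmark-derived guide into the initial noise, so that early denoising steps settle into a face-like layout. The blend rule, weight, and guide values below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of pixel-level landmark injection into the initial diffusion
# noise: values near facial landmarks are nudged toward a guide signal
# retrieved from a predefined external memory. The blend weight and
# guide values are illustrative assumptions.
import random

def inject_landmarks(noise, landmark_mask, guide, alpha=0.8):
    """Blend guide values into the noise wherever the mask is set."""
    return [
        alpha * g + (1 - alpha) * n if m else n
        for n, m, g in zip(noise, landmark_mask, guide)
    ]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
mask = [1, 0, 0, 1, 0, 0, 0, 1]  # positions of facial landmarks
guide = [1.0] * 8                # landmark guide signal from memory
x_init = inject_landmarks(noise, mask, guide)
```

In the actual framework the injection happens only in the initial steps of the iterative denoising, leaving the later steps free to refine visual appearance.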

The second is a structural control process, which utilizes state-of-the-art technology such as ControlNet to achieve more fine-grained control over the image generation process. This approach allows facial features to be injected more precisely into the image, so that the generated image has the desired artistic style while still being compatible with subsequent facial animation algorithms.

The third is the process of combining diffusion models with voice modification techniques. In order to enhance the performance of a specific style-specific model, a combination of diffusion-based generative models of various styles downloaded from Civitai is used. This allows for customization of images and voices based on user requests, providing a more personalized experience. Model selection is automatic, based on the description of the target object provided by the user.
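Automatic model selection from the user's description can be sketched as a routing step. In ChatAnything the routing is done by the large-scale language model; the keyword-matching rules and model names below are simplified, hypothetical stand-ins for that selection.

```python
# Sketch of routing a user description to one of several style-specific
# diffusion models. The keyword rules and model names are hypothetical
# stand-ins for the LLM-based selection used in ChatAnything.

STYLE_MODELS = {
    "anime": "anime-diffusion-lora",      # hypothetical model names
    "realistic": "photoreal-diffusion",
    "cartoon": "cartoon-diffusion-lora",
}
DEFAULT_MODEL = "general-diffusion"

def select_model(description: str) -> str:
    """Pick the style-specific model whose keyword appears in the
    description, falling back to a general-purpose model."""
    text = description.lower()
    for keyword, model in STYLE_MODELS.items():
        if keyword in text:
            return model
    return DEFAULT_MODEL

choice = select_model("a cute anime-style fox girl")
```

The same routing idea applies to voice selection in the MoV module: the description is matched against the available voice types and the closest one is instantiated.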

The fourth is the personality modeling process. Agent personalities are classified according to the keywords provided by the user to generate portraits. In this paper, a large-scale language model is used to characterize the various user-specified personalities. Specifically, the large-scale language model agent is customized in the role of a scriptwriter who follows the prompt template below.

Using this prompt template, the large-scale language model associates characteristics of user-entered objects and is free to build personalities based on these attributes. The following example shows a personality generated based on the user input "apple".

The "ChatAnything" framework is modular in design, making it easy to add new styles of diffusion-based generation models and voice modification techniques. This allows the project to be scalable and flexible to user needs in the future.


To identify the impact of the guided diffusion technique, this paper builds a validation dataset based on eight keywords selected from a variety of categories: realistic, animal, fruit, plant, office supplies, bag, clothing, and cartoon. Utilizing ChatGPT, 50 prompts were generated for each category and applied as conditions for the diffusion process.
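Constructing the validation set amounts to generating a fixed number of prompts per category. In the sketch below the ChatGPT call is stubbed out and the prompt wording is an illustrative assumption; only the dataset shape (8 categories × 50 prompts) comes from the paper.

```python
# Sketch of constructing the validation dataset: 8 categories x 50
# prompts each. The real prompts come from ChatGPT; `make_prompts`
# is a stub that fabricates placeholder prompts for illustration.

CATEGORIES = ["realistic", "animal", "fruit", "plant",
              "office supplies", "bag", "clothing", "cartoon"]
PROMPTS_PER_CATEGORY = 50

def make_prompts(category: str, n: int):
    """Stub for a ChatGPT call returning n prompts for a category."""
    return [f"{category} portrait, fine-grained face, variant {i}"
            for i in range(n)]

dataset = {c: make_prompts(c, PROMPTS_PER_CATEGORY) for c in CATEGORIES}
total = sum(len(v) for v in dataset.values())  # 8 * 50 = 400
```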

Facial landmark detection is performed using a pre-trained facial keypoint detector and is an important factor in improving the quality of facial motion animation. To increase the rate of facial landmark detection, the distribution of the pre-trained diffusion model is constrained by prompts of the form "{} portrait, fine-grained face". In this approach, specific concepts from the user are incorporated into the prompts.

However, this initial approach did not yield satisfactory results. As can be seen from the table below, the detection rate for certain concepts, especially cartoons, was only 4%, and the average detection rate was as low as 57%. In contrast, the newly proposed "ChatAnything" significantly improves the facial landmark detection rate, achieving an average of 92.5%.

The results demonstrate the limitations of simple prompting techniques and the effectiveness of the combined approach proposed by ChatAnything. The significant improvement in facial landmark detection rates opens up new possibilities in guided diffusion techniques and is expected to help drive further research.

Summary and Future Prospects

This paper presents a preliminary study that uses a zero-shot technique to merge a state-of-the-art generative model with a talking head model. The goal of this research is to combine these techniques while keeping the computational process efficient. The current methodology focuses on the use of pre-trained models built on significant prior research in talking head modeling and image generation.

However, it is conceivable that lightweight alternative technologies could provide even better performance. The research team states that this is ongoing work, and that the effort represents an important step toward the future integration of generative and talking head models. Continued progress is expected.

A project page is available; please visit it to see a demo.
