AnyGPT: The Next Generation Of Multimodal Large-Scale Language Models Integrating Image, Speech, And Text

3 main points
✔️ Development of the multimodal large-scale language model "AnyGPT": built on an existing large-scale language model, it uses discrete representations to process speech, text, images, and music in a unified and efficient manner.
✔️ Development of the multimodal instruction dataset "AnyInstruct-108k": a new large-scale instruction dataset of multi-turn conversations in which multiple modalities are interleaved, which the model needs in order to process and understand multimodal input effectively.
✔️ Cross-modal task results and applicability: AnyGPT achieved excellent zero-shot performance on a variety of cross-modal tasks, comparable to specialized models.

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
written by Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
(Submitted on 19 Feb 2024)
Comments: Under Review, Work in Progress
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Large-scale language models have an exceptional ability to understand and produce human language, but so far their capabilities have been limited primarily to text processing. The real world, however, is a multimodal environment in which information is exchanged through a variety of channels, including visual, auditory, and tactile ones. Incorporating this diversity is a major goal in the development of next-generation systems. One common approach is to attach multimodal encoders to a large-scale language model so that it can process various kinds of input and use its advanced text processing capabilities to generate coherent responses. However, this approach cannot produce multimodal output.

Pioneering efforts such as Emu (Sun et al., 2023b), SEED-LLaMA (Ge et al., 2023b), and SpeechGPT (Zhang et al., 2023a) have made significant progress in enabling multimodal understanding and generation within language models, but these models integrate only a single non-textual modality, such as image or audio. While it is relatively easy to align text with one additional modality, integrating multiple modalities (N ≥ 3) in a single framework and achieving bi-directional consistency among them presents a greater challenge.

To address this challenge, this paper develops "AnyGPT". This is a new type of multimodal large-scale language model with multimodal tokenizers that transform raw data such as images and speech into discrete semantic tokens. This approach allows the large-scale language model to perform recognition, comprehension, inference, and generation in a unified manner at the semantic level. Furthermore, the model is designed to handle any combination of multimodal inputs and outputs, and experimental results show zero-shot performance comparable to specialized models.

The paper also builds a new text-centric multimodal alignment dataset. Since natural language is the most refined modality for semantic representation and is present in most multimodal datasets, the paper uses text as a bridge to align all modalities with one another. Through this effort, AnyGPT supports multimodal dialogue and demonstrates the feasibility of unifying multiple modalities using discrete representations.

In summary, this paper proposes a new multimodal large-scale language model, AnyGPT, that can understand and generate a variety of modalities; develops a dataset, AnyInstruct-108k, of instructions in which multiple modalities are interleaved; and demonstrates that discrete representations can effectively unify multiple modalities. These developments open up new possibilities for the next generation of multimodal systems.

AnyGPT, a multimodal large-scale language model

In AnyGPT, a dedicated tokenizer is introduced for each of the image, speech, and music modalities. These tokenizers are the foundation for AnyGPT's diverse applications.

First, the SEED tokenizer is employed to tokenize images. It takes a 224 x 224 RGB image as input. A ViT encoder encodes the image into 16 x 16 patches, and a Causal Q-Former then converts the patch features into 32 causal embeddings. A VQ codebook with 8192 entries quantizes these embeddings into a sequence of discrete visual codes. For reconstruction, an MLP decodes the visual codes into a generation embedding that is aligned with the latent space of unCLIP Stable Diffusion, and a UNet decoder finally restores the image from this embedding.
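
The quantization step can be illustrated with a minimal nearest-neighbor sketch. It is not the SEED implementation: the embedding dimension, the random codebook, and the function name `quantize` are assumptions made for illustration, and only the 32 embeddings and the 8192 codebook entries come from the description above.

```python
import numpy as np

def quantize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each embedding."""
    # Squared Euclidean distance between every embedding and every entry.
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)
    return np.argmin(distances, axis=1)

rng = np.random.default_rng(0)
EMBED_DIM = 16  # hypothetical dimension, chosen only to keep the example small

# Random stand-ins for a learned codebook and for the Causal Q-Former output.
codebook = rng.normal(size=(8192, EMBED_DIM))
causal_embeddings = rng.normal(size=(32, EMBED_DIM))

# One image becomes a sequence of 32 discrete token IDs in [0, 8192).
image_tokens = quantize(causal_embeddings, codebook)
```

In a trained tokenizer the codebook is learned together with the encoder, so nearby embeddings share a code; the random codebook here only shows the lookup mechanics.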

Next, SpeechTokenizer, which has an encoder-decoder architecture, is used to tokenize speech. It uses a hierarchical quantizer to compress audio sequences into a matrix of discrete tokens that captures both semantic and paralinguistic information. Pre-trained on the Common Voice and LibriSpeech datasets, the tokenizer models both kinds of information effectively and works in conjunction with a voice cloning model to produce realistic speech.

Then, Encodec is used as a specialized tokenizer for music data. This convolutional autoencoder quantizes its latent representation of a music track using residual vector quantization. The variant used here is pre-trained on over 20,000 songs and uses four quantizers to capture the content of the music. The resulting tokens allow a language model to predict, and therefore generate, music clips.
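
Residual vector quantization can be sketched as follows. This is a generic illustration of the technique, not Encodec's code: the function names, the latent dimension, and the codebook size of 256 are assumptions, and only the use of four quantizers comes from the text.

```python
import numpy as np

def rvq_encode(vector, codebooks):
    """Quantize a vector in stages; each stage encodes what is still left over."""
    residual = np.array(vector, dtype=float)
    codes = []
    for codebook in codebooks:
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next quantizer only sees the error left by this one.
        residual = residual - codebook[index]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))

rng = np.random.default_rng(0)
LATENT_DIM = 8       # hypothetical
CODEBOOK_SIZE = 256  # hypothetical
NUM_QUANTIZERS = 4   # four quantizers, as described above

codebooks = [rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM)) for _ in range(NUM_QUANTIZERS)]
frame = rng.normal(size=LATENT_DIM)

frame_codes = rvq_encode(frame, codebooks)           # four token IDs for one frame
reconstruction = rvq_decode(frame_codes, codebooks)  # approximate latent vector
```

Each audio frame therefore yields one code per quantizer, which is why such tokenizers produce a matrix of tokens rather than a single sequence.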

To understand and generate multimodal data, that is, images and audio as well as text, AnyGPT significantly expands the vocabulary of its language model. Tokens specific to each modality are added, and the model's embedding and prediction layers are enlarged accordingly. The newly added parameters are initialized randomly. The tokens from all modalities then form a single new vocabulary, and training aligns them in a shared representation space. This allows the model to integrate knowledge and information across different types of data.
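
The vocabulary expansion can be sketched as below. This is an illustrative assumption about how such an expansion might be laid out, not AnyGPT's code: the helper names, the toy text vocabulary size, the embedding dimension, and the speech and music codebook sizes are placeholders. Only the 8192-entry image codebook is taken from the text.

```python
import numpy as np

# Toy sizes for illustration. Only the 8192-entry image codebook comes from
# the text above; the other numbers are placeholders.
TEXT_VOCAB_SIZE = 1000
CODEBOOK_SIZES = {"image": 8192, "speech": 512, "music": 512}
EMBED_DIM = 8

def build_offsets(text_vocab_size, codebook_sizes):
    """Assign each modality a contiguous ID range after the text vocabulary."""
    offsets = {}
    next_id = text_vocab_size
    for modality, size in codebook_sizes.items():
        offsets[modality] = next_id
        next_id += size
    return offsets, next_id

def to_shared_id(modality, code, offsets):
    """Map a tokenizer-local code to its ID in the expanded vocabulary."""
    return offsets[modality] + code

def expand_embeddings(text_embeddings, new_vocab_size, rng):
    """Keep the pre-trained rows and append randomly initialized rows."""
    old_vocab_size, dim = text_embeddings.shape
    new_rows = rng.normal(scale=0.02, size=(new_vocab_size - old_vocab_size, dim))
    return np.concatenate([text_embeddings, new_rows], axis=0)

rng = np.random.default_rng(0)
offsets, vocab_size = build_offsets(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)
text_embeddings = rng.normal(size=(TEXT_VOCAB_SIZE, EMBED_DIM))
embeddings = expand_embeddings(text_embeddings, vocab_size, rng)
```

The same enlargement would apply to the prediction layer, so that the model can output the new token IDs as well as read them.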

To effectively handle multimodal data, AnyGPT is equipped with a dedicated tokenizer for each modality, which is used to transform data into discrete token sequences. This transformed data is used to train the model through a loss function that predicts the next token. With this consistent learning approach, the model gains the ability to understand and generate all types of data, including text, images, and audio. The backbone is the LLaMA-2 7B model, which has been pre-trained on a large text token data set and fine-tuned for the new vocabulary.
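
The next-token objective is the standard language-modeling cross-entropy, and it is the same regardless of which modality a token came from. A minimal NumPy sketch follows; the function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from the output at position t."""
    predictions = np.asarray(logits, dtype=float)[:-1]  # last position has no target
    targets = np.asarray(token_ids)[1:]                 # first token is never predicted
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = predictions - predictions.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())

# A mixed sequence is just a list of IDs from the shared vocabulary.
toy_vocab_size = 10
toy_tokens = [1, 2, 3, 4]
uniform_logits = np.zeros((len(toy_tokens), toy_vocab_size))
uniform_loss = next_token_loss(uniform_logits, toy_tokens)
```

With uniform logits the loss equals the natural logarithm of the vocabulary size, which is a convenient sanity check for an untrained model.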

In addition, generating high-resolution images and high-quality audio data is particularly challenging because so much information must be processed. To efficiently handle long sequences, AnyGPT employs a two-stage framework. This approach first processes information at the semantic level and then uses that information to generate high-fidelity multimodal content.

Additionally, for visual content, a diffusion model is used to generate high-quality images from SEED tokens. For audio, a non-autoregressive SoundStorm model is employed to generate acoustic tokens from semantic tokens, which are then converted to raw audio data. This process is capable of reproducing the voice of any speaker from a mere 3-second audio prompt. Music generation uses Encodec tokens to filter out details beyond human perception and reconstruct them into high-quality audio data.

Thus, AnyGPT employs innovative techniques for handling complex multimodal data and generating high-quality content. These techniques enable deep understanding and generation across text, image, and audio modalities.

Pre-training dataset "AnyInstruct-108k"

The distribution of the dataset used for the AnyGPT pre-training data is shown in the figure below. It is segmented by the number of tokens. The inner section shows modalities, the middle section shows details of data types, and the outer section shows individual datasets.

To achieve generation from any modality to any other, the paper builds a text-centric bimodal dataset in which each of the other modalities is paired with text. With text as the bridge, the aim is to align the different modalities, such as image and audio, with one another through the language model.

To facilitate comparison of data from different modalities, a quantification method based on the number of tokens is applied. This approach makes it possible to compare data volumes on a consistent basis, regardless of the type of data.
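
Counting in tokens can be sketched as below. All counts are made-up toy numbers rather than statistics from the paper. The only value taken from the text is that one image corresponds to 32 tokens, and the function name `token_shares` is hypothetical.

```python
def token_shares(token_counts):
    """Return each modality's fraction of the total token count."""
    total = sum(token_counts.values())
    return {name: count / total for name, count in token_counts.items()}

IMAGE_TOKENS_PER_SAMPLE = 32  # 32 codes per image, per the tokenizer description

# Toy corpus: every entry is expressed in tokens, whatever the raw unit was.
toy_token_counts = {
    "image": 1000 * IMAGE_TOKENS_PER_SAMPLE,  # 1000 hypothetical images
    "speech": 20_000,                         # hypothetical token count
    "music": 12_000,                          # hypothetical token count
}

shares = token_shares(toy_token_counts)
```

Once every dataset is expressed in tokens, images counted in samples and audio counted in hours can be placed on one axis.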

For Image & Text, the paper uses image-text pairs collected from LAION-2B, LAION-COCO, LAION-Aesthetics, and JourneyDB. These pairs were filtered for image and text quality, resulting in a high-quality corpus. In addition, a subset of LAION-Aesthetics and a synthetic dataset from JourneyDB were added to improve the quality of image generation. Data in which images and text are interleaved was also incorporated so that the model can handle interleaved input.

For Speech & Text, several large English automatic speech recognition (ASR) datasets were collected, including Gigaspeech, Common Voice, and Multilingual LibriSpeech (MLS). Together they provide 57,000 hours of speech-text pairs gathered from online platforms, crowdsourcing, and audiobooks, covering a wide range of speakers, domains, and recording environments.

For Music & Text, over 1 million music videos were collected from the Internet, and their titles were matched to songs through the Spotify API. The collected metadata included video titles, descriptions, keywords, playlist names, and Spotify lyrics, which were fed into GPT-4 in JSON format. GPT-4 extracts the important information from this noisy metadata and summarizes it into concise sentences, producing high-quality text captions. This provides captions for a large amount of music audio while limiting misinformation in the dataset.
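
Packaging the metadata as JSON might look like the sketch below. The field names, the instruction wording, and the example record are hypothetical, and the actual request to GPT-4 is deliberately left out, since the prompt used by the authors is not given here.

```python
import json

# Hypothetical field names for the metadata listed above.
CAPTION_FIELDS = ("video_title", "description", "keywords", "playlist_names", "lyrics")

# Hypothetical instruction text, not the authors' prompt.
INSTRUCTION = (
    "Summarize the following noisy music metadata into one concise caption "
    "that describes the music itself."
)

def build_caption_prompt(metadata):
    """Serialize the non-empty metadata fields as JSON below the instruction."""
    payload = {key: metadata[key] for key in CAPTION_FIELDS if metadata.get(key)}
    return INSTRUCTION + "\n" + json.dumps(payload, ensure_ascii=False, indent=2)

# Made-up record, for illustration only.
example_metadata = {
    "video_title": "Example Track (Official Video)",
    "description": "",
    "keywords": ["piano", "calm"],
    "playlist_names": ["Study Music"],
    "lyrics": None,
}

prompt = build_caption_prompt(example_metadata)
```

Dropping empty fields keeps the prompt short and avoids asking the model to summarize placeholders.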

In addition, effective human-machine interaction should allow information to be exchanged in a variety of intersecting modalities. However, increasing the number of modalities in a conversation greatly complicates the data collection process. Currently, there are no large instructional data sets that include more than two modalities. This is a major limitation in the development of comprehensive models that can manage dialogues with multiple intertwined modalities.

To overcome this limitation, this paper takes inspiration from recent data synthesis research (Wang et al., 2022; Wu et al., 2023) and builds a dataset consisting of 108k multi-turn conversation samples using a generative model. Through careful curation, each synthetic conversation integrates multiple intersecting modalities: text, speech, images, and music. The data synthesis process is illustrated in the figure below.

In this way, the paper builds datasets that bridge different modalities and open up new possibilities for integrating diverse information.


This paper evaluates the basic capabilities of the pre-trained base AnyGPT on multimodal understanding and generation tasks across all modalities. The evaluation aims to test the alignment between modalities achieved during pre-training. Specifically, for each modality, the text-to-X and X-to-text tasks are tested, where X is image, music, or speech.

All evaluations are zero-shot in order to mimic real-world scenarios. This rigorous evaluation setting requires the model to generalize to unknown test distributions and demonstrates AnyGPT's versatile ability to work through different modalities. The evaluation results show that AnyGPT, as a general-purpose multimodal language model, achieves excellent performance on a variety of multimodal comprehension and generation tasks.

We evaluate AnyGPT's ability to understand images on an image captioning task. The table below shows the comparative results on the MS-COCO 2014 captioning benchmark (Lin et al., 2014), using the Karpathy split test set in line with existing studies (Li et al., 2023; Tang et al., 2023b).

The results of the text-to-image generation task are shown in the table below. To be consistent with existing studies (Koh et al., 2023; Ge et al., 2023b; Sun et al., 2023a), we randomly selected 30,000 images from the MS-COCO validation set and used CLIPscore as the evaluation metric. This metric computes a similarity score between the generated images and their corresponding ground-truth captions based on CLIP-ViT-L (Radford et al., 2021).
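
At its core, a CLIP-based score is a cosine similarity between an image embedding and a caption embedding, averaged over the test set. The sketch below assumes the embeddings have already been produced by a CLIP model and uses small hand-written vectors instead. The function name is hypothetical, and any scaling or clipping convention applied in the paper is not reproduced.

```python
import numpy as np

def mean_cosine_similarity(image_embeddings, text_embeddings):
    """Average cosine similarity between row-aligned image and text embeddings."""
    image_unit = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    text_unit = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return float(np.mean(np.sum(image_unit * text_unit, axis=1)))

# Hand-written stand-ins for CLIP embeddings of two (image, caption) pairs.
toy_image_embeddings = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_text_embeddings = np.array([[3.0, 0.0], [5.0, 0.0]])

# First pair points the same way (similarity 1), second is orthogonal (0).
toy_score = mean_cosine_similarity(toy_image_embeddings, toy_text_embeddings)
```

The same computation, with a different pair of encoders, underlies audio-text similarity scores such as CLAPscore.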

We also evaluate AnyGPT's performance on automatic speech recognition (ASR) by calculating the Word Error Rate (WER) on the test-clean subset of the LibriSpeech dataset (Panayotov et al., 2015). Wav2vec 2.0 and Whisper Large V2 are used as baselines. Note that the settings differ: Wav2vec 2.0 is pre-trained on 60,000 hours of speech and fine-tuned on LibriSpeech, while Whisper Large V2 is trained on 680,000 hours of speech and evaluated in a zero-shot setting. Results are shown in the table below.
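
WER is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and no text normalization, both of which real evaluations usually handle more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.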

In addition, a zero-shot Text-to-Speech (TTS) evaluation was performed on the VCTK dataset. Results are shown in the table below. TTS systems are evaluated by speaker similarity and Word Error Rate (WER), where WER serves as a measure of speech quality.

The MusicCaps benchmark (Agostinelli et al., 2023) is used to evaluate AnyGPT's performance on music comprehension and generation tasks. We use CLAPscore (Wu et al., 2022; Huang et al., 2023) as an objective measure. It measures the similarity between the generated music and the textual description.

In evaluating music captioning, we have found that existing objective measures are limited in their ability to represent performance on music captioning tasks. The diversity and subjectivity of music elicits different opinions from individual people. Only certain musical genres and instruments have unique characteristics that are readily recognizable. Recent research (Gardner et al., 2023) has explored this issue, but it remains a difficult problem to address. To ensure an objective assessment, we computed and compared CLAPscores for <music, actual caption> and <music, generated caption> pairs. These scores are averaged across the entire test set.


This paper introduces AnyGPT, a new multimodal large-scale language model that can process a variety of modalities, including speech, text, images, and music, in a unified way. Because AnyGPT uses discrete representations, new modalities can be incorporated without changing the architecture or training method of existing large-scale language models. For the model, adding a modality is comparable to learning a new language.

To enable AnyGPT to handle a variety of modalities, the paper also develops a multimodal instruction dataset, AnyInstruct-108k. This is a large-scale dataset of multi-turn conversations in which multiple modalities are interleaved.

Furthermore, experimental results show that AnyGPT achieves strong results on a variety of cross-modal tasks and that discrete representations can unify different modalities efficiently and conveniently within a single language model. The introduction of AnyGPT opens up new avenues for integrating diverse information sources such as speech, images, text, and music, and is expected to lead to richer multimodal applications.
