[OmniGen] All Image-related Tasks Can Be Performed With Only One Generation Model!
3 main points
✔️ Proposed a new image generation model called OmniGen, which enables diverse image generation tasks to be handled in a unified manner
✔️ Eliminated additional modules previously required, allowing multiple tasks to be handled with a simple structure
✔️ Complex image editing and conditional generation can now be performed more efficiently
OmniGen: Unified Image Generation
written by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu
(Submitted on 17 Sep 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Background
Traditional models often require task-specific structures and additional networks, making them complex to operate and limiting their utility. OmniGen was designed to solve this problem: by allowing a single model to handle a wide variety of tasks, it could become an important part of the future of AI research.
A specific use case is that complex tasks such as image editing and image restoration can be performed through simple instructions. Thus, OmniGen opens up new possibilities for image generation, and further research is expected.
Technique
OmniGen's architecture is very simple and consists of two main components: the VAE (variational autoencoder) and the Transformer model.
The VAE extracts continuous visual features from the image and the Transformer uses these features to generate the image. This allows for processing any combination of both text and image inputs without the need for additional encoders. For example, tasks such as image editing, pose estimation, and edge detection can all be processed consistently as image generation tasks.
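The key idea above is that text and image inputs are flattened into a single token sequence for one Transformer, rather than being routed through separate encoder branches. The following is a minimal illustrative sketch of that interleaving; the toy tokenizer and the 4-patch-per-image encoder are assumptions for demonstration, not the authors' actual code.

```python
# Toy sketch (hypothetical, not OmniGen's implementation) of feeding an
# interleaved text-and-image prompt to a single Transformer. In the real
# model a VAE turns each image into continuous latent patch tokens; here
# both tokenizers are faked with simple stand-in functions.

def embed_text(text):
    # Stand-in for a text tokenizer + embedding lookup.
    return [("txt", tok) for tok in text.split()]

def encode_image(image_id):
    # Stand-in for the VAE: each image becomes a short run of latent tokens.
    return [("img", f"{image_id}_patch{i}") for i in range(4)]

def build_sequence(parts):
    """Flatten alternating text/image parts into one token sequence,
    so the Transformer needs no separate image-encoder branch."""
    seq = []
    for kind, payload in parts:
        seq.extend(embed_text(payload) if kind == "text" else encode_image(payload))
    return seq

prompt = [("text", "replace the sky in"), ("image", "photo1"), ("text", "with a sunset")]
sequence = build_sequence(prompt)
print(len(sequence))  # 4 text tokens + 4 image patches + 3 text tokens = 11
```

Because every modality ends up in the same sequence, any mix of instructions and reference images can be handled by the same forward pass.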
In addition, OmniGen learns diverse tasks on a unified "X2I" dataset, allowing for knowledge sharing and transfer between different tasks.
This gives it the flexibility to adapt to unknown tasks and new domains, and it expresses new capabilities not found in traditional task-specific models. For example, generation based on visual conditions can generate new images while preserving specific objects and structures.
A major advantage of OmniGen is its ability to generate a wide variety of images without extra plugins or pre-processing steps. This makes it easy to apply in real-world settings and intuitive to operate. It is also more efficient than other models, performing as well or better with fewer parameters and less training data.
Experiment
The experiments in this paper evaluate OmniGen's performance on a variety of image generation tasks. OmniGen performed well on these tasks compared to other state-of-the-art models.
First, in the evaluation of text-to-image generation, OmniGen performed as well as or better than existing diffusion models. Evaluation metrics measured the quality of the generated images and their agreement with the text, with OmniGen achieving superior results with fewer parameters and data.
Second, experiments with image editing show that OmniGen can perform multiple operations, such as changing the background and adding or removing objects. In particular, tests on the EmuEdit dataset show that OmniGen performs well in both editing accuracy and faithfulness to the original image.
In addition, experiments evaluate the ability to generate new images conditioned on visual inputs such as edge maps and human poses.
Finally, conventional computer vision tasks such as low-light image enhancement, deblurring, and inpainting were also evaluated. This shows that OmniGen is not just a generative model but can also handle traditional computer vision tasks efficiently.
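One way to picture how such vision tasks fold into a single generation model is that each one becomes an instruction paired with an input image. The sketch below is an illustrative assumption: the prompt wordings and the `<img><|image_1|></img>` placeholder are hypothetical templates, not the exact ones from the paper.

```python
# Hypothetical sketch: classic vision tasks recast as instruction-driven
# image generation. The templates and the image placeholder token are
# illustrative assumptions, not OmniGen's verbatim prompt format.

TASK_TEMPLATES = {
    "deblur": "Remove the blur from this image: <img><|image_1|></img>",
    "low_light": "Brighten this low-light photo: <img><|image_1|></img>",
    "inpaint": "Fill in the masked region of <img><|image_1|></img>",
}

def make_request(task, image_path):
    """Pair an instruction with its input image; a unified model would
    treat every entry here as the same kind of image-generation call."""
    return {"prompt": TASK_TEMPLATES[task], "images": [image_path]}

req = make_request("deblur", "street.png")
print(req["prompt"])
```

The point of the sketch is that no task-specific head or network is added: switching tasks only changes the instruction text, not the model.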
Summary
The conclusion of this paper suggests that OmniGen shows excellent performance in a wide variety of image generation tasks and may significantly exceed the limits of existing diffusion models. OmniGen is the first model that can handle various tasks such as image generation from text, image editing, and visual conditional generation in a unified manner, and is characterized by its simple architecture and high flexibility.
Looking ahead, OmniGen is expected to see further performance improvements and application to new tasks. In particular, a unified approach to image generation could contribute to broader AI applications in the future. The research team aims to further develop OmniGen by open-sourcing it.