
InstructPix2Pix: A New Model For Image Editing At The User's Direction

Computer Vision

3 main points
✔️ InstructPix2Pix, a method for editing images based on human instructions, is proposed.
✔️ InstructPix2Pix lets anyone edit an image simply by describing the desired change in text.
✔️ A wide variety of edits can be performed, including replacing objects, changing seasons and weather, replacing backgrounds, changing material attributes, and artistic transformations.

InstructPix2Pix: Learning to Follow Image Editing Instructions
written by Tim Brooks, Aleksander Holynski, Alexei A. Efros
(Submitted on 17 Nov 2022 (v1), last revised 18 Jan 2023 (this version, v2))
Comments: Project page with code: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper proposes a method for editing images based on human instructions. Given an input image and a written instruction, the model modifies the image accordingly. Because training requires a large number of edited image examples, the authors generate the training data by combining two pre-trained models: a language model (GPT-3) and an image generation model (Stable Diffusion). A conditional diffusion model, InstructPix2Pix, is then trained on the generated data and generalizes to real images and user-written instructions. The model performs the edit directly, without per-example fine-tuning, and takes only a few seconds per image. Compelling editing results are achieved for a variety of input images and instructions.

This technology has the potential to change the traditional image editing process. Traditional methods require specialized knowledge and time-consuming manual work, but with InstructPix2Pix, anyone can edit an image by writing an instruction. This approach expands the scope of creativity and expression, allowing customized images to be created rapidly for a variety of needs. Combined with ongoing advances in machine learning and natural language processing, it may also open up new forms of human-computer collaboration.


Because large amounts of training data are difficult to acquire, this technique combines a large language model (GPT-3) with a text-to-image model to generate a dataset of instruction/image pairs. The two models contribute complementary knowledge about language and images. The generated paired data is used to train a conditional diffusion model that, given an input image and a text instruction on how to edit it, produces the edited image. This model performs image editing directly and requires no per-example adjustment. Although it is trained on generated data, it works effectively with both real images and human-written instructions. The model allows for intuitive image editing, object replacement, style changes, and much more. The following illustration shows examples of replacing objects, changing the style of an image, changing the setting, changing the artistic medium, and so on.

Related Research

Recent research has shown the potential to solve complex multimodal tasks by combining large pre-trained models, such as large language models (e.g., GPT-3) and text-to-image models. Methods for combining these models include joint fine-tuning, communication through prompts, and energy-based model composition. Similar to these approaches, this study uses pre-trained models, in this case to generate multimodal training data. Existing image editing models include those that focus on traditional editing tasks and those that use text to guide image editing. The proposed approach differs from earlier text-based image editing in that it edits from instructions, so users can state precisely what change they want in natural text. In addition, the study proposes using generative models to produce the large amount of training data required.


The proposed method treats image editing as a supervised learning problem. Below is an overview diagram.

First, a training dataset is generated, consisting of text editing instructions paired with images before and after the edit. Then, an image editing diffusion model is trained on this dataset to produce the edited image from the input image and the text editing instruction.

To generate the multimodal training dataset, a large language model is combined with a text-to-image model to produce text editing instructions together with pre- and post-edit images. The large language model takes an image caption as input and generates an editing instruction and the corresponding post-edit caption.
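The text stage described above can be sketched as follows. This is a minimal illustration, not the authors' code: `EditTriplet`, `build_triplets`, and `toy_propose_edit` are hypothetical names, and the stand-in function returns a hard-coded answer where the real pipeline would query the language model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class EditTriplet:
    """One text-only training record (hypothetical structure)."""
    input_caption: str    # caption describing the image before the edit
    instruction: str      # the editing instruction
    edited_caption: str   # caption describing the image after the edit


def build_triplets(
    captions: Iterable[str],
    propose_edit: Callable[[str], Tuple[str, str]],
) -> List[EditTriplet]:
    """Ask the language model for an instruction and an edited caption per input caption."""
    triplets = []
    for caption in captions:
        instruction, edited_caption = propose_edit(caption)
        triplets.append(EditTriplet(caption, instruction, edited_caption))
    return triplets


def toy_propose_edit(caption: str) -> Tuple[str, str]:
    """Assumed stand-in for the language model; a real system would call the model here."""
    return "have her ride a dragon", caption.replace("horse", "dragon")


triplets = build_triplets(["photograph of a girl riding a horse"], toy_propose_edit)
print(triplets[0])
```

Each record's two captions are what the text-to-image model later turns into a before/after image pair.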

In addition, the text-to-image model is used to convert each caption pair into an image pair. Generating the two images independently from the two captions tends to produce images that differ in much more than the intended edit, so the Prompt-to-Prompt method is used. Prompt-to-Prompt is a technique for text-to-image diffusion models rather than for language models: it generates images from two different prompts while sharing cross-attention weights between the two generations for some of the denoising steps. As a result, the two images remain consistent with each other except for the content that the prompts change. Below is a comparison of image pairs generated with and without Prompt-to-Prompt.

In the training stage, a conditional diffusion model is trained to edit images based on written instructions. A diffusion model learns to estimate the score of the data distribution by predicting the noise added to a sample, and it generates data samples by progressively removing that noise.
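As a sketch, the training objective of such a conditional latent diffusion model can be written as below. The notation is an assumption taken from the standard latent diffusion formulation and does not appear in this article: $\mathcal{E}$ is the image encoder, $x$ the edited target image, $c_I$ the input image, $c_T$ the text instruction, $z_t$ the noised latent at timestep $t$, and $\epsilon_\theta$ the noise-prediction network.

```latex
\mathcal{L} = \mathbb{E}_{\mathcal{E}(x),\, \mathcal{E}(c_I),\, c_T,\, \epsilon \sim \mathcal{N}(0, 1),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \mathcal{E}(c_I),\, c_T\right) \right\rVert_2^2 \right]
```

The network is penalized for the squared error between the noise that was actually added and the noise it predicts, given both the input image and the instruction.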

Finally, classifier-free guidance is used to trade off the quality and diversity of the generated samples. This improves the quality of conditional image generation and produces samples that correspond more closely to the conditioning. Because the model has two conditioning inputs, it uses two guidance scales. In the figure below, s_I controls similarity to the input image, while s_T controls consistency with the editing instruction.
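The way the two guidance scales combine can be sketched in a few lines. This is an illustrative sketch of classifier-free guidance with two conditionings, not the authors' code: `combine_guidance` and its argument names are hypothetical, and the three noise predictions are assumed to come from the same network evaluated with no conditioning, with the image only, and with both image and text.

```python
from typing import List, Sequence


def combine_guidance(
    eps_uncond: Sequence[float],  # prediction with neither image nor text
    eps_image: Sequence[float],   # prediction conditioned on the input image only
    eps_full: Sequence[float],    # prediction conditioned on image and instruction
    s_image: float,               # s_I: pull toward the input image
    s_text: float,                # s_T: pull toward the instruction
) -> List[float]:
    """Return eps_uncond + s_I*(eps_image - eps_uncond) + s_T*(eps_full - eps_image)."""
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy two-element "noise predictions" for illustration only.
uncond, image_only, full = [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]
neutral = combine_guidance(uncond, image_only, full, s_image=1.0, s_text=1.0)
guided = combine_guidance(uncond, image_only, full, s_image=1.5, s_text=7.5)
print(neutral, guided)
```

With both scales set to 1 the result equals the fully conditioned prediction; raising s_T amplifies the part of the prediction attributable to the instruction, and raising s_I amplifies the part attributable to the input image.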


The results show image editing for a variety of input images and instructions. The model in this study was able to perform a wide variety of edits, including object replacement, seasonal and weather changes, background replacement, material attribute changes, and artistic transformations.

The method is compared with SDEdit and Text2Live, two previous techniques. The proposed method follows editing instructions, whereas the previous methods expect a description of the image, so they are given the "after-edit" text caption instead of the instruction. SDEdit works well when the style changes and the content remains largely constant, but it can be problematic when larger changes are required. Text2Live, on the other hand, can produce compelling results, but its formulation limits the categories of edits it can handle.

Furthermore, the quantitative comparisons in the following figures show that the proposed method outperforms SDEdit in both similarity to the input image and edit quality. Blue indicates the method of this study.
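One way such a comparison can be scored is with CLIP embeddings: image similarity as the cosine similarity between the input and edited image embeddings, and edit quality as a directional similarity between the change in images and the change in captions. The sketch below assumes the embeddings have already been computed and are passed in as plain lists of floats; the function names are hypothetical and the toy vectors are not real CLIP outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def directional_similarity(
    image_before: Sequence[float],
    image_after: Sequence[float],
    caption_before: Sequence[float],
    caption_after: Sequence[float],
) -> float:
    """How well the change between images agrees with the change between captions."""
    image_delta = [after - before for before, after in zip(image_before, image_after)]
    caption_delta = [after - before for before, after in zip(caption_before, caption_after)]
    return cosine_similarity(image_delta, caption_delta)


# Toy 2-D embeddings for illustration only.
aligned = directional_similarity([1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [2.0, 3.0])
opposed = directional_similarity([1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [2.0, -3.0])
print(aligned, opposed)
```

A method that leaves the image unchanged scores high on image similarity but has no meaningful edit direction, which is why the two measures are read together as a trade-off.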

Ablation results for dataset size and guidance scale showed that as the dataset size was reduced, the model lost the ability to make large edits and made only subtle adjustments. The results also indicated that the strength of the edit and the consistency with the input image can be tuned by adjusting the two guidance scales.


This study shows how to combine a large language model with a text-to-image model to generate a dataset for training a diffusion model that follows editing instructions. The method allows for a variety of edits, but it still has many limitations, because it is bounded by the quality of the generated dataset and of the diffusion model used to produce it. Its ability to generalize to new edits and to make correct associations between visual changes and text instructions is likewise limited by that diffusion model and by the language model's ability to write instructions. In particular, the model may have difficulty counting objects and performing spatial reasoning. In addition, the method and the models it builds on contain biases, which may be reflected in the edited images. To overcome these limitations, research is needed on how to interpret instructions, how to combine them with other forms of conditioning, and how to evaluate instruction-based edits. Incorporating human feedback to improve the model is also important.

Looking ahead, the key directions are improving and extending the model, incorporating human feedback, combining instructions with other forms of conditioning, expanding the application domain, and addressing ethical concerns. With these prospects in mind, further development and application of instruction-based image editing technology is expected.

