You Can Tell GANs What Kind Of Image To Produce Using Text Input!

GAN (Generative Adversarial Network)

3 main points
✔️ Combining the generative power of StyleGANs with the rich vision-language representations of OpenAI's CLIP.
✔️ Three new methods for effective text-based image manipulation.
✔️ Provides much more control over text-based image manipulation than previous SOTA.

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
written by Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
(Submitted on 31 Mar 2021)
18 pages, 24 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)



Generative Adversarial Networks, commonly known as GANs, have taken the state of the art in image generation to new heights. Models like StyleGAN can generate high-resolution images that capture minute details of reality. Another important property of StyleGAN is that its latent space is disentangled, which allows images to be manipulated in various ways. Nevertheless, exploiting this property is laborious and can require a large amount of annotated data or strong pretrained classifiers. A model tuned for one manipulation works only in that specific direction, which further limits the capabilities of these approaches.

In this paper, we introduce a method to democratize image manipulation using GANs. Specifically, we combine the recently introduced Contrastive Language-Image Pre-training (CLIP) model with StyleGAN. CLIP has been trained on 400 million image-text pairs, and its use of natural language allows a wide range of visual concepts to be expressed. The method is easy to use, yet it produces manipulations that no previous StyleGAN manipulation work has demonstrated.



CLIP is a multi-modal model that learns to measure the semantic similarity between an image and a piece of text. It was trained by OpenAI on 400 million image-text pairs taken from the Internet. CLIP models are very powerful and achieved state-of-the-art zero-shot image classification performance on a variety of datasets.

StyleCLIP Text-Driven Manipulation

We explore three different image manipulation methods by combining the generative power of StyleGAN with the rich vision-language representations of CLIP. The intermediate latent space representations of StyleGAN have been shown to possess disentangled image properties that are useful in image manipulation. We use the W+ latent representation in two of our methods (latent optimization and the latent mapper) and the supposedly more disentangled S latent representation in the third (global directions).

Latent optimization

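This approach directly optimizes the latent code to produce the desired manipulation. Given a text prompt t, a source latent code w_s obtained by inverting the input image with e4e, and the manipulated latent code w, we solve:

$$\underset{w \in \mathcal{W}+}{\arg\min} \; D_{\text{CLIP}}(G(w), t) + \lambda_{L2} \lVert w - w_s \rVert_2 + \lambda_{ID} \mathcal{L}_{ID}(w)$$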

Here, G is the pretrained StyleGAN generator and D_CLIP is the cosine distance between the CLIP embeddings of the text prompt and of the image produced by the generator. The L2 norm controls the similarity to the input image.
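As a minimal sketch of the cosine distance used for D_CLIP, the snippet below computes it for two plain Python vectors. The vectors are hypothetical stand-ins for CLIP embeddings; a real pipeline would obtain them from CLIP's image and text encoders.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical stand-ins for the CLIP embeddings of a generated image and a prompt.
image_embedding = [0.6, 0.8, 0.0]
text_embedding = [0.0, 1.0, 0.0]
print(cosine_distance(image_embedding, text_embedding))
```

The distance is 0 when the two embeddings point in the same direction and grows as they diverge. It depends only on direction, not on vector length.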

The identity loss is defined as

$$\mathcal{L}_{ID}(w) = 1 - \langle R(G(w_s)), R(G(w)) \rangle$$

where R is a pretrained ArcFace network for face recognition and ⟨·,·⟩ is the cosine similarity between the embeddings that R produces for the input and the manipulated image.

λ_L2 and λ_ID control the weights of the L2 and identity terms. The optimization problem above is solved by gradient descent to obtain the manipulated latent code. The above diagram shows a few samples obtained after 200-300 iterations, along with their (λ_L2, λ_ID) values. The approach is very versatile, but it is time-consuming because every image and prompt pair needs its own optimization.
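The optimization loop can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration: `embed_image` is a hypothetical stand-in for "generate an image with StyleGAN, then encode it with CLIP", the identity term is omitted, and gradients are estimated by finite differences instead of backpropagation.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def embed_image(w):
    # Hypothetical stand-in for CLIP_image(G(w)): the latent plus a constant
    # coordinate, so the embedding never collapses to the zero vector.
    return list(w) + [1.0]

def objective(w, w_s, text_embedding, lam_l2):
    # CLIP-style distance plus the L2 penalty that keeps w close to w_s.
    l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(w, w_s)))
    return cosine_distance(embed_image(w), text_embedding) + lam_l2 * l2

def numerical_gradient(f, w, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    grad = []
    for k in range(len(w)):
        up = w[:k] + [w[k] + eps] + w[k + 1:]
        down = w[:k] + [w[k] - eps] + w[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def optimize_latent(w_s, text_embedding, lam_l2=0.1, lr=0.1, steps=200):
    w = list(w_s)  # start from the inverted source code
    for _ in range(steps):
        grad = numerical_gradient(
            lambda v: objective(v, w_s, text_embedding, lam_l2), w)
        w = [x - lr * g for x, g in zip(w, grad)]
    return w

w_s = [1.0, 0.0]
text_embedding = [0.0, 1.0, 1.0]
w_opt = optimize_latent(w_s, text_embedding)
print(cosine_distance(embed_image(w_s), text_embedding))
print(cosine_distance(embed_image(w_opt), text_embedding))
```

The second printed distance is smaller than the first: the latent moves toward the text target while the L2 term limits how far it drifts from the source code.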

Latent Mapper

Different StyleGAN layers have been shown to be responsible for different levels of detail in the image. Therefore, the layers are divided into three groups (coarse, medium, fine), and the corresponding parts of the latent code w are fed into three different fully-connected mapper networks. The outputs are concatenated into M_t(w), added to the initial latent code, and fed to StyleGAN. As before, in order to preserve image quality and identity, we minimize the following function:

$$\mathcal{L}(w) = D_{\text{CLIP}}(G(w + M_t(w)), t) + \lambda_{L2} \lVert M_t(w) \rVert_2 + \lambda_{ID} \mathcal{L}_{ID}(w)$$

The L2 norm and the identity loss prevent the image from changing too much, while the CLIP loss ensures that the mapper network makes the requested change. Almost all examples in the paper use λ_L2 = 0.8 and λ_ID = 0.1.
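The data flow through the three mappers can be sketched as follows. This is a structural illustration only: the split indices (4 and 8 out of 18 layers), the 2-dimensional layer vectors, and the `scale_mapper` helper are assumptions, whereas the real mappers are trained fully-connected networks.

```python
def split_latent(w_plus, coarse_end=4, medium_end=8):
    """Split a W+ code (a list of per-layer vectors) into three layer groups."""
    return (w_plus[:coarse_end],
            w_plus[coarse_end:medium_end],
            w_plus[medium_end:])

def apply_mappers(w_plus, mappers):
    """Return w + M_t(w), where M_t concatenates the three group outputs."""
    residual = []
    for group, mapper in zip(split_latent(w_plus), mappers):
        residual.extend(mapper(layer) for layer in group)
    return [[x + r for x, r in zip(layer, res)]
            for layer, res in zip(w_plus, residual)]

def scale_mapper(factor):
    # Hypothetical mapper that just rescales its input layer vector.
    return lambda layer: [factor * x for x in layer]

w_plus = [[1.0, 2.0] for _ in range(18)]  # toy 18-layer latent code
mappers = (scale_mapper(0.0), scale_mapper(0.1), scale_mapper(0.5))
edited = apply_mappers(w_plus, mappers)
print(edited[0], edited[4], edited[8])
```

Setting the coarse mapper to zero, as in this toy example, leaves the coarse layers untouched, so an edit can be confined to the medium and fine levels of detail.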

The above pictures show the results of hairstyle manipulations using this method. The identity and important visual features are preserved in almost all cases. The bottom row shows that the technique can handle several attributes at once, such as {straight, short} and {straight, long}. Such control has not been observed with previous models. We also found that the cosine similarity between the M_t(w) outputs for different images is high, which suggests that manipulations are applied in similar directions across different images.

Global Directions

Now, we wish to develop a more versatile image manipulator for more fine-grained disentangled manipulation using StyleGAN's style space 'S'. More precisely, for s ∈ S, we seek a manipulation direction ∆s  such that G(s + α∆s) produces an image with the manipulations specified by a text prompt 't'. Here, α controls the amount of manipulation.

We aim to use CLIP's joint language-image embedding to encode the text prompt as a vector ∆t, which is then mapped to a manipulation direction ∆s. Since an image may have several attributes and an attribute may correspond to several images, it is necessary to distinguish between the manifold of image embeddings (I) and the manifold of text embeddings (T) in CLIP's joint embedding space. CLIP normalizes the embeddings during training, so only the direction of an embedding carries information. In well-trained regions, corresponding directions in T and I are approximately collinear, with large cosine similarity.

In order to obtain a proper ∆t from natural language, we need to reduce the noise in the text embeddings and get a stable direction in T. For this, we use a technique called prompt engineering, where multiple sentences with the same meaning are fed to the text encoder and the resulting embeddings are averaged. For example, “a photo of a {car}”, “a cropped photo of the {car}”, “a clear photo of a {car}”, and “a picture of a {car}” all have similar meanings.
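The averaging step can be sketched as below. `toy_text_encoder` is a hypothetical stand-in for CLIP's text encoder (it only hashes characters into a small vector), and normalizing each embedding before and after averaging is an assumption of this sketch.

```python
import math

TEMPLATES = [
    "a photo of a {}",
    "a cropped photo of the {}",
    "a clear photo of a {}",
    "a picture of a {}",
]

def toy_text_encoder(sentence, dim=8):
    """Hypothetical stand-in for CLIP's text encoder."""
    vec = [0.0] * dim
    for pos, ch in enumerate(sentence):
        vec[(ord(ch) + pos) % dim] += 1.0
    return vec

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def averaged_embedding(word, encoder=toy_text_encoder):
    # Fill every template, embed it, and average the unit-length embeddings.
    embeddings = [normalize(encoder(tpl.format(word))) for tpl in TEMPLATES]
    mean = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    return normalize(mean)

car = averaged_embedding("car")
print(car)
```

Averaging over several phrasings reduces the influence of any single wording on the final embedding.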

Given a pair of images G(s) and G(s + α∆s), let their image (I) embeddings be denoted by i and i + ∆i. Our goal is to find a manipulation direction ∆s that yields a change ∆i collinear with ∆t. To do this, we compute the relevance of each channel c of s to ∆i, defined as the mean projection of ∆i_c onto ∆i over 100 image pairs. Here, ∆i_c is the difference between the CLIP embeddings of the image pair G(s ± α∆s_c), where ∆s_c is a zero vector except for its c-th coordinate, which is set to the standard deviation of channel c. If the relevance ∆i_c · ∆i of a channel is smaller than a threshold β, the corresponding entry of ∆s is set to 0. Larger values of β give more disentangled manipulations at the cost of a weaker overall effect; the effect of β can be seen in the example above (for the text "grey hair").
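The relevance computation and thresholding can be sketched on toy data as follows. The numbers are made up, only two sample pairs per channel are used instead of 100, and comparing the absolute value of the relevance against β is an assumption of this sketch.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def channel_relevance(delta_i, delta_ic_samples):
    """Mean projection of the per-pair changes delta_i_c onto delta_i."""
    total = sum(dot(d, delta_i) for d in delta_ic_samples)
    return total / len(delta_ic_samples)

def manipulation_direction(delta_i, per_channel_samples, beta):
    """Keep a channel's relevance if it passes the threshold, else zero it."""
    direction = []
    for samples in per_channel_samples:
        r = channel_relevance(delta_i, samples)
        direction.append(r if abs(r) >= beta else 0.0)
    return direction

delta_i = [1.0, 0.0]  # toy target direction in CLIP space
per_channel_samples = [
    [[0.5, 0.1], [0.7, -0.1]],     # channel 0: strongly aligned
    [[0.05, 0.9], [-0.01, 0.8]],   # channel 1: nearly orthogonal
    [[-0.4, 0.2], [-0.2, 0.0]],    # channel 2: aligned with opposite sign
]
delta_s = manipulation_direction(delta_i, per_channel_samples, beta=0.1)
print(delta_s)
```

Channel 1 barely affects the target direction, so it is zeroed out. Raising β removes more channels, which leaves fewer ways for the edit to change unrelated attributes.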

Other non-human image manipulations performed using this method can be seen in the picture below.


This paper introduced three methods that manipulate images based on text queries. The results produced by these methods have not been demonstrated by previous methods. One major limitation is that, since these methods rely on CLIP, they do not generalize well to regions of the embedding space where CLIP is poorly trained or not trained at all. Nevertheless, the work is a significant contribution to the important and growing field of text-driven image editing.

Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.
