
Introducing WIT From Google AI: The Largest Multimodal Image-text Dataset With Coverage Of Over 100+ Languages


3 main points
✔️The largest text-image dataset based on Wikipedia.
✔️Contains examples in 108 languages with a total of 37.6 million text-image pairs.
✔️Properly refined dataset validated by human annotators.


WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
written by
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork
(Submitted on 2 Mar 2021 (v1), last revised 3 Mar 2021 (this version, v2))
Comments: Accepted by arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)


Introduction

Deep learning models are data-hungry and tend to perform well when model size and dataset size are scaled together. A larger dataset almost always improves model performance: studies have shown that performance improves logarithmically with the size of the dataset. Large models like GPT, T5, BERT, and ResNet are able to learn effective representations in a supervised or self-supervised way using large, high-quality datasets such as ImageNet, COCO, and BooksCorpus. In addition, recent works like ViLBERT, UNITER, and UniT (Unified Transformer) have tried to combine NLP and vision capabilities in a single multimodal model. However, most multimodal datasets are limited to English, which acts as a bottleneck for multilingual multimodal learning.

In this paper, we introduce a highly refined multilingual text-image dataset based on Wikipedia. It contains 11.5 million unique images and 37.6 million text-image pairs. Each language has at least 12K instances, and 53 languages have more than 100K instances.

WIT: Wikipedia Image Text Dataset

Our objective is to create a highly curated dataset with high-quality image-text pairs, like COCO and Flickr30K. Creating such datasets is resource-intensive, especially at the scale of WIT, so we want to automate and scale the dataset creation process in the way the Conceptual Captions (CC) dataset did.

For this reason, we chose Wikipedia, which has a large amount of crowd-sourced information in many languages that has been curated by its editorial community. Nevertheless, the data still requires a lot of refinement, because low-information (redundant and generic) text-image associations make it difficult to train visio-linguistic (VL) models.

We use a FlumeJava pipeline to extract and process about 124M pages of content across 279 different languages. These pages yield 150M (image data, text data, contextual data) tuples, which are then refined further.

Text Used in WIT

There are three different types of textual information used in WIT:

1) The Reference Description (ref.) is the text right below the image. It is the most relevant to the image but is less common than the other descriptions.

2) The Attribution Description (attr.) is the text on the Wikimedia page of the image. This text is often multilingual (138M+), and although most of it is uninformative and noisy, some of it is semantically informative and desirable.

3) The Alt-text Description (alt) is generally hidden and is used for accessibility and screen readers. It was not found to be very useful, because it is usually just set to the filename.

Text-based Filtering Conditions

  1. Text length greater than 3.
  2. Exclude alt-text containing phrases like ".png", ".jpg", "icon", "stub", "refer to", "alt text", etc.
  3. For attribution and alt-text, only PNG and JPEG images were chosen.
  4. GIF images with reference descriptions were taken.
  5. Tuples without a reference description whose image is not found in the last sections(i.e., the bibliography, external links) were kept.
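
The text-based conditions above can be sketched as a small filter. This is only an illustration under stated assumptions, not the authors' FlumeJava pipeline: the Candidate record, the helper names, the reading of "length" as a character count, the section names, and the choice to apply the PNG/JPEG restriction only to attribution and alt-text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases named in the article; the real list is likely longer (assumption).
NOISY_ALT_PHRASES = (".png", ".jpg", "icon", "stub", "refer to", "alt text")
# Hypothetical names for the "last sections" of a page.
TRAILING_SECTIONS = {"bibliography", "external links"}


@dataclass
class Candidate:
    image_format: str                  # e.g. "png", "jpeg", "gif"
    section: str                       # page section that contains the image
    reference: Optional[str] = None    # text right below the image
    attribution: Optional[str] = None  # text from the image's Wikimedia page
    alt_text: Optional[str] = None     # accessibility text


def long_enough(text: Optional[str], min_len: int = 3) -> bool:
    # Assumption: "length greater than 3" is measured in characters.
    return text is not None and len(text.strip()) > min_len


def clean_alt(text: Optional[str]) -> Optional[str]:
    # Rule 2: drop alt-text that looks like a filename or placeholder.
    if not long_enough(text):
        return None
    lowered = text.lower()
    if any(phrase in lowered for phrase in NOISY_ALT_PHRASES):
        return None
    return text


def keep_texts(c: Candidate) -> dict:
    """Return the text fields that survive the filters (empty dict = drop)."""
    kept = {}
    if long_enough(c.reference):
        # Rule 4: a reference description is accepted even for GIF images.
        kept["reference"] = c.reference
    elif c.section.lower() in TRAILING_SECTIONS:
        # Rule 5: no reference description and image in a trailing section.
        return {}
    if c.image_format.lower() in {"png", "jpg", "jpeg"}:
        # Rule 3: attribution and alt-text only for PNG and JPEG images.
        if long_enough(c.attribution):
            kept["attribution"] = c.attribution
        alt = clean_alt(c.alt_text)
        if alt is not None:
            kept["alt_text"] = alt
    return kept
```

A simple substring match like this would also reject legitimate alt-text that happens to contain "icon" (for example "silicon"), so a production filter would need more careful matching.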

Image and Image-text based Filtering Conditions

  1. Images with a minimum height and width of 100 pixels were retained.
  2. Images with research-permissive licenses such as Creative Commons were retained.
  3. Images of flags, logos, and maps, which are highly redundant, were under-sampled to prevent modeling bias towards them.
  4. Generic images, tiny icons, and placeholder images were all removed.
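
As a rough sketch of how the image conditions above might be combined, consider the predicate below. The license prefixes, the category labels, and especially keep_rate are assumptions made for illustration; the article does not say how licenses were matched, how images were categorized, or what sampling rate was used.

```python
import random

MIN_SIDE_PX = 100
# Hypothetical prefixes standing in for "research-permissive" licenses.
PERMISSIVE_PREFIXES = ("cc-", "cc0", "public domain")
# Hypothetical category labels; assume an upstream classifier provides them.
REDUNDANT_CATEGORIES = {"flag", "logo", "map"}
REMOVED_CATEGORIES = {"generic", "icon", "placeholder"}


def keep_image(width, height, license_name, category, rng, keep_rate=0.1):
    """Return True if the image passes the size, license, and category checks."""
    # Rule 1: both sides must be at least 100 pixels.
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        return False
    # Rule 2: only research-permissive licenses.
    if not license_name.lower().startswith(PERMISSIVE_PREFIXES):
        return False
    # Rule 4: generic images, tiny icons, and placeholders are always removed.
    if category in REMOVED_CATEGORIES:
        return False
    # Rule 3: redundant categories are under-sampled (keep_rate is assumed).
    if category in REDUNDANT_CATEGORIES:
        return rng.random() < keep_rate
    return True
```

Passing the random generator in explicitly (for example rng = random.Random(0)) keeps the under-sampling reproducible across runs.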

Additional cleaning

Inappropriate content, such as pornographic or violent images and text, was removed using multilingual image and text understanding models. Only the 108 languages with at least 12K tuples were kept in the final dataset. The data was divided into training, validation (50K), and test (50K) sets, ensuring that each image appears in only one split.
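
One common way to guarantee that an image lands in exactly one split is to derive the split from a hash of the image identifier, so that every text paired with that image follows it. The sketch below illustrates that idea together with the per-language threshold; it is an assumption about how this could be done, not the authors' procedure. The function names and the fractions are hypothetical, and hashing yields approximate split sizes rather than exactly 50K examples.

```python
import hashlib
from collections import Counter


def assign_split(image_url, val_fraction=0.01, test_fraction=0.01):
    """Map an image identifier to a split, deterministically."""
    digest = hashlib.sha256(image_url.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") % 1_000_000) / 1_000_000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


def languages_to_keep(rows, min_tuples=12_000):
    """rows is an iterable of (language, image_url) pairs."""
    counts = Counter(language for language, _ in rows)
    # Assumption: "12K+" means 12,000 or more tuples.
    return {language for language, n in counts.items() if n >= min_tuples}
```

Because the split depends only on the image identifier, the same image used on an English page and a German page ends up in the same split.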

Human Inspection 

Using forms like the one shown in the above figure, we asked crowd-sourced human annotators to validate the reliability of our dataset. Since one image can have multiple text annotations, we asked them how well each text matches the image and how well the texts, taken together, describe the image. The possible answers were {yes, maybe, no}.

The tests were conducted on 4.4k randomly sampled examples across different languages: 3k examples in English, 300 examples in German, French, Spanish, Russian, Chinese, and 100 examples for Hindi. 

Evaluation experiment of WIT

To evaluate WIT, we trained a dual-encoder model as shown in the above diagram, with one encoder for text and one for images. For a batch of n image-text pairs, we compute the cosine similarity between every image and every text and train the model to minimize a softmax loss over these similarities. Only the diagonal entries of the n×n similarity matrix are treated as positive pairs; all other entries act as negatives. In other words, we encourage the encoders to produce similar embeddings for related image-text pairs.
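
The in-batch softmax objective can be illustrated in a few lines of plain Python. This is a minimal sketch of the general technique, not the paper's training code: it covers only the image-to-text direction, the temperature parameter is an assumption, and real training would use an autodiff framework over encoder outputs.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text in the batch."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def in_batch_softmax_loss(sim, temperature=1.0):
    """Mean cross-entropy where row i's correct 'class' is column i."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the diagonal (positive) pair.
        total += log_z - logits[i]
    return total / n
```

With images [[1, 0], [0, 1]], the loss is lower when the texts are given in matching order than when they are swapped, which is the behavior the training objective rewards.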

We also trained the model on the CC dataset and compared the results with the model trained on the WIT dataset. The above table shows the results on the image-text retrieval task without any fine-tuning (zero-shot). The model trained on WIT generalizes better and beats the CC model even on non-English sets.
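
Retrieval tasks of this kind are commonly scored with recall@K: the fraction of queries whose correct match appears among the top K results. The article does not state which metric the table reports, so the sketch below is an assumption about the evaluation, and recall_at_k is a hypothetical helper. It expects a similarity matrix in which entry (i, j) scores query i against candidate j and the correct candidate for query i is at index i.

```python
def recall_at_k(sim, k):
    """Fraction of rows whose diagonal entry ranks within the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)
```

In a zero-shot setting, the similarity matrix would come from the pretrained encoders applied directly to the evaluation set, with no further training on that set.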

The above diagram shows the zero-shot evaluation results on the MS-COCO, Flickr30K, and WIT-ALL datasets. In this case, however, the model trained on CC beats the model trained on WIT-ALL on the first two datasets.

We also evaluated the models on the Multi30K-R dataset to check the multilingual efficacy of the WIT dataset. Both models struggle on Multi30K-R, and the model trained on the CC dataset also performs poorly on the WIT test set.

The poor performance of the WIT-trained model on the Multi30K, COCO, and Flickr30K datasets can be attributed to the fact that Wikipedia is a very diverse content pool. As shown in the above table, 72.02% of the words have a word frequency of less than 3. The image data is also very diverse: among the 4.5M entities identified, more than 80% (3.68M) occur 3 times or less. Moreover, text in the WIT dataset tends to be descriptive, in contrast to the usually single-line annotations in the evaluation datasets. Text hypernymization (replacing proper nouns with common terms) was used in the CC dataset to bring it closer to the evaluation sets. However, this is an extremely difficult task for a large dataset like ours that spans over 100 languages.
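
The long-tail statistics quoted above are of the form "share of distinct items that occur rarely". A minimal sketch of that computation follows; the function name is hypothetical, the input is assumed to be already tokenized (tokenization across 100+ languages is a separate problem), and the inclusive flag covers the two phrasings used in the text ("less than 3" for words, "3 times or less" for entities).

```python
from collections import Counter


def rare_fraction(items, threshold=3, inclusive=False):
    """Share of distinct items whose count is below (or at most) threshold."""
    counts = Counter(items)
    if inclusive:
        rare = sum(1 for c in counts.values() if c <= threshold)
    else:
        rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)
```

For example, in the token list "the cat sat on the mat the end cat cat", four of the six distinct words occur fewer than three times, so the rare fraction is 4/6.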


WIT is a rich and diverse dataset with several applications, such as pretraining image models, language models, and text-vision models, and fine-tuning image-text models or cross-lingual representations. Models like UNITER, Unicoder-VL, VL-BERT, and the more recent UniT have shown promising results on a variety of text-vision tasks. A diverse dataset like WIT could help propel this domain. Since it is a multilingual dataset, it also makes information for research more equitably available across the globe.

Thapa Samrat
I am a second year international student from Nepal who is currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning. So I write articles about them in my spare time.
