Development And Application Of Let's Go Shopping (LGS), A New Large-scale Bimodal Data Set That Leverages E-commerce Data

Large Language Models 01/03/2024

3 main points
✔️ Building a new dataset: The paper builds a large dataset called "Let's Go Shopping (LGS)" from image/text pairs that are readily available on e-commerce websites. This approach also addresses the difficulty of obtaining high-quality annotated data.
✔️ Diversity and Scale of the LGS Dataset: The LGS dataset contains about 15 million image/text pairs, providing useful data for image recognition and bimodal applications and improving generalization through the diversity of its visual information.
✔️ Implications for Emerging Applications: The unique data distribution and bimodal (both image and text) properties of the LGS dataset demonstrate its effectiveness in a wide range of applications, including image classification, image reconstruction, bimodal representation learning, and text-to-image generation.

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
written by Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho
(Submitted on 9 Jan 2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In recent years, pre-training on large datasets has become critical to research advances in computer vision (CV) and natural language processing (NLP). These datasets give machine learning models the basis for understanding the complexities of the real world and applying that understanding to tasks such as image recognition and language understanding. However, creating these datasets requires an enormous amount of time and effort. This is especially true for bimodal applications that integrate both images and language, where preparing high-quality annotated data is even more difficult. As a result, the research community relies on a limited number of publicly available datasets, which restricts the diversity and progress of research.

To address this challenge, this paper proposes a novel approach to dataset construction that exploits image-text pairs readily available from e-commerce websites. Using this approach, the paper builds a large dataset called "Let's Go Shopping" (LGS), which contains about 15 million image-description pairs collected from approximately 10,000 e-commerce sites. By providing objective, accurate, and rich captions for its images, the LGS dataset aims to supply high-quality data for pre-training models on images and language. In addition, e-commerce data is well suited to image recognition tasks, because many images have clean backgrounds and focus on a single static object of interest.

The paper also demonstrates that the diverse visual information in e-commerce images can improve generalization to out-of-distribution (OOD) scenarios not covered by traditional datasets. Compared to traditional image-only datasets such as ImageNet, the LGS dataset can provide visual features that help models adapt to new environments and scenarios in image classification, reconstruction, captioning, and generation tasks.

This study suggests the importance of large, diverse data sets and the potential for leveraging new data sources.

What is the Let's Go Shopping (LGS) Data Set?

The Let's Go Shopping (LGS) dataset is a very large dataset that reflects the world of e-commerce. As can be seen from the table below, it contains over 14.84 million image/text pairs. This is larger than many existing bimodal datasets, making it a valuable resource for researchers and developers. To build the dataset, information was collected from approximately 10,000 e-commerce sites that offer a wide variety of products, ranging from infant products to sporting goods to bridal jewelry.

During data collection, heuristic rules are used to distinguish product pages from non-product pages, and automated tools collect the product title, description, and first listed image. The process is tested rigorously to avoid collecting information that sellers do not wish to share, and instances with suspected quality issues are ultimately filtered out. Unlike typical image caption datasets, images in the LGS dataset often depict a single inanimate item that occupies the foreground and has no connection to the background. The background is typically monochromatic, and this clean background makes it easier for a model to identify the patterns relevant to the task.
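The collection step described above can be illustrated with a minimal sketch. The article does not list the actual heuristic rules, so everything here is an assumption made for illustration: the page representation (a plain dict), the price pattern, the "add to cart" phrases, and the function names `looks_like_product_page` and `extract_pair` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical signals for a product page. The real LGS rules are not
# given in the article, so these are illustrative assumptions only.
PRICE_PATTERN = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
CART_PHRASES = ("add to cart", "add to bag", "buy now")


def looks_like_product_page(page: dict) -> bool:
    """Return True if the page shows a price, a purchase phrase, and an image."""
    text = page.get("text", "").lower()
    has_price = PRICE_PATTERN.search(text) is not None
    has_cart_phrase = any(phrase in text for phrase in CART_PHRASES)
    has_image = bool(page.get("images"))
    return has_price and has_cart_phrase and has_image


def extract_pair(page: dict) -> Optional[dict]:
    """Keep the title, description, and first listed image of a product page."""
    if not looks_like_product_page(page):
        return None
    title = page.get("title", "").strip()
    description = page.get("description", "").strip()
    if not title or not description:
        # Drop instances with missing text as a simple quality filter.
        return None
    return {"title": title, "description": description, "image": page["images"][0]}
```

A page that fails any of the three signals, or that lacks a title or description, yields `None` and is skipped. A real pipeline would also need to respect what sellers do not wish to share, which this sketch does not attempt.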

The LGS captions are about three times larger than those of the COCO dataset, and their word and phrase diversity is about 20 times greater. These captions contain a wealth of information from e-commerce sites, which allows clear structural information to be extracted for fine-tuning purposes. The spaCy library was used to analyze linguistic statistics, comparing common nouns, proper nouns, adjectives, and verbs. LGS captions are highly descriptive, especially for clothing and wearable items, and capture product-specific descriptions and behaviors.
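To make "word and phrase diversity" concrete, the following sketch counts unique unigrams and bigrams over a set of captions. It is not the analysis from the paper, which used spaCy for part-of-speech statistics. As a simplifying assumption it uses a regular-expression tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(caption):
    """Lowercase a caption and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def ngram_counts(captions, n):
    """Count every n-gram (as a tuple of tokens) across all captions."""
    counts = Counter()
    for caption in captions:
        tokens = tokenize(caption)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def diversity_summary(captions):
    """Report mean caption length and the number of distinct uni- and bigrams."""
    lengths = [len(tokenize(caption)) for caption in captions]
    return {
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "unique_unigrams": len(ngram_counts(captions, 1)),
        "unique_bigrams": len(ngram_counts(captions, 2)),
    }
```

Running `diversity_summary` on two caption sets, for example a sample of LGS captions and a sample of COCO captions, gives directly comparable numbers, although the exact figures depend on the tokenizer used.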

Beyond the image-caption pair format, the LGS dataset has also been applied to image classification. Three classification variants were built for this purpose: LGS-117, LGS-710, and LGS-Overlap. These variants also help generate product titles, brand names, and summarized captions describing specific product attributes. LGS-117 and LGS-710 are designed as pre-training datasets. The raw labels generated by the classification models contain synonyms and overlaps that need to be merged. After manually merging synonyms among the most popular classes, the authors find 117 classes that contain at least 10,000 images each. Selecting 10,000 images from each of these classes yields the balanced LGS-117 dataset, while LGS-710 is an unbalanced dataset that also includes rarer classes. LGS-Overlap is proposed as an out-of-distribution test set for models trained on ImageNet-1k, and it shows a marked difference in label distribution between the e-commerce application and the general pre-training dataset.
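The construction of a balanced subset such as LGS-117 can be sketched as follows: merge synonymous labels, keep only classes with enough images, and sample a fixed number of images from each. This is a minimal illustration under stated assumptions, not the authors' code. The function name, the `synonyms` mapping, and the use of uniform random sampling are hypothetical. The article only says that 10,000 images are selected per class.

```python
import random
from collections import defaultdict


def build_balanced_subset(labeled_images, synonyms, per_class, seed=0):
    """Merge synonym labels, then sample per_class images from each large class.

    labeled_images: iterable of (image_id, raw_label) pairs.
    synonyms: maps a raw label to its canonical label (hypothetical, hand-made).
    per_class: minimum class size and sample size (10,000 for LGS-117).
    """
    by_class = defaultdict(list)
    for image_id, raw_label in labeled_images:
        label = synonyms.get(raw_label, raw_label)  # unmapped labels stay as-is
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = {}
    for label, image_ids in sorted(by_class.items()):
        if len(image_ids) >= per_class:
            # Assumption: uniform sampling without replacement.
            balanced[label] = rng.sample(image_ids, per_class)
    return balanced
```

With `per_class=10_000`, classes below the threshold are dropped from the balanced set. In the article's description, such rarer classes are instead retained in the unbalanced LGS-710 variant.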

The LGS dataset provides an important resource for research and application development that captures the complexity and diversity of e-commerce.

Experiment

In this study, experiments on image classification and reconstruction are conducted using two different image datasets: the e-commerce dataset (LGS) and ImageNet. These experiments also reveal differences in the distribution of images and labels between the two datasets.

Well-known ImageNet classifiers have been observed to perform poorly when applied directly to e-commerce data. For example, in experiments with ResNet-50 and ConvNeXt-Base, the high accuracy obtained on ImageNet degrades significantly on the e-commerce dataset. This indicates that models trained on ImageNet are not suitable for direct application to a specific domain such as e-commerce, and it suggests that additional training on domain-specific data is needed to improve classification accuracy.

Using a Masked Autoencoder (MAE), the paper compares models trained on ImageNet alone with models trained on both ImageNet and the e-commerce dataset. The results show that including the e-commerce dataset significantly improves the quality of image reconstruction. This suggests that self-supervised learning can generalize across different domains.
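For readers unfamiliar with MAE, the core idea can be sketched in a few lines: hide a random subset of image patches, have the model reconstruct them, and score the reconstruction only on the hidden patches. The sketch below covers just the masking and the loss, using plain Python lists in place of tensors, and omits the encoder and decoder entirely. The 75% mask ratio used in the usage note is the default from the original MAE work, not a setting reported in this article, and the function names are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick the indices of the patches to hide from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(original, reconstructed, masked_indices):
    """Mean squared error computed over the masked patches only.

    original, reconstructed: lists of patches, each a list of pixel values.
    """
    total, count = 0.0, 0
    for i in masked_indices:
        for target, prediction in zip(original[i], reconstructed[i]):
            total += (target - prediction) ** 2
            count += 1
    return total / count if count else 0.0
```

For example, `random_mask(196, 0.75, random.Random(0))` selects 147 of the 196 patches of a 14 x 14 patch grid. A lower `masked_mse` on held-out images corresponds to the better reconstruction quality that the article reports when e-commerce data is added to training.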

These results highlight the limitations of models trained on a general dataset such as ImageNet when they are applied directly to a specific domain such as e-commerce. They also show that these limitations can be overcome, and models with stronger generalization developed, by using other approaches such as self-supervised learning. This points to a new direction for improving the applicability of models across different domains.

Summary

The Let's Go Shopping (LGS) dataset is an innovative dataset from the world of e-commerce. The dataset contains approximately 15 million pairs of images and their corresponding descriptions, all collected in a publicly accessible form from e-commerce sites. Unique semi-automatic collection and annotation methods ensure efficient collection of large and diverse data.

The characteristics of the LGS dataset reveal that, although the e-commerce-specific categories do not match those of general datasets directly, the techniques for extracting visual features can be shared. This suggests that learning algorithms can be applied across datasets from different domains.

Furthermore, the unique data distribution and bimodal (handling both images and text) properties offered by LGS also suggest potential in new application areas. Specifically, LGS has shown its effectiveness in a wide range of applications, including image classification, image reconstruction, bimodal representation learning, and text-to-image generation.

The LGS dataset is paving the way for the development of new technologies that leverage e-commerce data and its potential in a wide variety of applications. The dataset is expected to play an important role in future research and application development in the areas of AI and machine learning.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
