Systematic Investigation Of Gen-RecSys, A Recommender System Evolving With Generative And Large-scale Language Models

Large Language Models 28/10/2024

3 main points
✔️ Advances in generative models enable new tasks by learning and leveraging complex user and item data with performance beyond traditional recommender systems
✔️ Introduction of large-scale language models has led to tremendous performance in reasoning, learning, and leveraging open world information to improve personalization and conversational interfaces
✔️ Key issues in evaluating Gen-RecSys are raised, considering performance, fairness, privacy, and social impact

A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)
written by Yashar Deldjoo, Zhankui He, Julian McAuley, Anton Korikov, Scott Sanner, Arnau Ramisa, René Vidal, Maheswaran Sathiamoorthy, Atoosa Kasirzadeh, Silvia Milano
(Submitted on 31 Mar 2024)
Comments: This survey accompanies a tutorial presented at ACM KDD'24
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Advances in generative models have had a significant impact on the evolution of recommendation systems. Traditional recommendation systems, which are "narrow experts" that capture user preferences and item characteristics within a specific domain, are now being enhanced by generative models, and the enhanced systems have been reported to outperform traditional methods. These models are bringing innovative methods to the concept and implementation of recommendations.

Current generative models can learn and sample complex data distributions that include not only user and item interaction history, but also text and image content. This allows these data modalities to be leveraged for new interactive recommendation tasks.

In addition, with the introduction of large-scale language models such as ChatGPT and Gemini, natural language processing has shown remarkable performance in reasoning, in-context few-shot learning, and access to a wide range of open-world information. Because of these extensive capabilities, pre-trained generative models present new research possibilities for a wide variety of recommendation applications, including enhanced personalization, improved conversational interfaces, and richer explanation generation.

The core of the generative model lies in its ability to model and sample the learned data distribution. Because of this nature, there are two primary applications for recommender systems.

One is a directly trained model (e.g., VAE-CF, a variational autoencoder for collaborative filtering). It learns directly from user-item interaction data to predict user preferences, and it does not use large and diverse pre-training data sets. The other is a pre-trained model. It uses models pre-trained on diverse data such as text, images, and video to understand complex patterns, relationships, and contexts.

This paper covers the application of pre-trained generative models in the following settings:

Zero-shot and few-shot prompting
- Broad understanding without additional training, using in-context learning (ICL).
Fine-tuning
- Tailor the model with specific data sets to provide customized recommendations.
Retrieval-augmented generation (RAG)
- Integrate information retrieval with generative modeling to generate context-relevant output.
Embedding for downstream training
- Generate embeddings and token sequences for complex content representation.
Multimodal approaches
- Use a variety of data types to improve the accuracy and relevance of model recommendations.

The generative model is expected to open up new possibilities for recommendation systems and provide an unprecedented interactive and personalized user experience.

In recent years, several surveys have been published that show important progress in this area. Deldjoo et al. explore GAN-based recommender systems in four different scenarios: graph-based, collaborative, hybrid, and context-aware. Li et al. study training strategies and learning objectives of large-scale language models for recommender systems. Wu et al. also discuss the use of large-scale language models to generate input tokens or embeddings for recommender systems. Other active research includes Wang et al., who introduce GeneRec, a next-generation recommendation system that personalizes content through an AI generator and interprets user instructions to collect user preferences.

While these studies provide important insights, their scope is limited to large-scale language models or to specific model sets (such as GANs). GeneRec, on the other hand, offers comprehensive research focused on personalized content generation.

The figure below provides an overview of this paper's research on Gen-RecSys. It is categorized by data source, recommendation model, scenario, etc., and delves into the evaluation of each system and its challenges. This paper investigates Gen-RecSys based on this system.

The paper covers a broad spectrum of generative models and data modalities, providing systematic information for the future of recommender systems.

Interaction-Driven Recommendations Based on Generative Models

Interaction-driven recommendation is the most common setting, in which the system relies solely on user-item interactions (e.g., user A clicks on item B). Here the input is the interaction data rather than other modalities such as text or visual information, and the output is a recommendation list or grid. Deep generative models (DGMs) can be useful for such systems.

For example, deep generative models can augment user-item interactions, apply denoising for recommendations, and learn the distribution of recommendation layouts. This section summarizes the survey of deep generative models for recommendation tasks using user and item interaction data, including autoencoding models, autoregressive models, generative adversarial networks (GANs), and diffusion models.

Autoencoding models learn to reconstruct inputs, and their ability to do so allows them to be used for noise removal, representation learning, and generative tasks. In this context, the denoising autoencoding model learns to reconstruct the original input from corrupted input. For example, AutoRec reconstructs partially observed input vectors. Models such as BERT are also considered denoising autoencoding models, where BERT4Rec is trained to predict items masked in the user's previous interaction sequences.
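The masked-item objective described above can be illustrated with a minimal sketch of how one training example might be built. This is only an illustration of the masking idea, not BERT4Rec's actual implementation; the `MASK` token, the `mask_items` helper, and the item IDs are hypothetical names chosen for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_items(sequence, mask_prob, rng):
    """Return (masked_input, targets) for one interaction sequence.

    targets[i] holds the original item where position i was masked and
    None elsewhere, so a model would only be scored on masked positions.
    """
    masked, targets = [], []
    for item in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(item)
        else:
            masked.append(item)
            targets.append(None)
    return masked, targets


# Toy interaction history (made-up item IDs).
history = ["item_12", "item_7", "item_33", "item_5", "item_19"]
masked, targets = mask_items(history, mask_prob=0.4, rng=random.Random(0))
print(masked)
print(targets)
```

A model trained this way sees `masked` as input and is asked to recover the non-None entries of `targets`, which is the "reconstruct the original from a corrupted input" idea in sequence form.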

Variational autoencoders (VAEs) learn to map inputs from a complex probability distribution to a simple probability distribution. They have been widely applied in collaborative filtering, sequential recommendation, and related tasks, and show excellent performance. In addition, conditional VAEs (CVAEs) learn the distribution of recommendation lists for a particular user; models such as ListCVAE and PivotCVAE generate entire recommendation lists rather than only ranking individual items.

Autoregressive models learn a conditional probability distribution at each step, given the input sequence so far. These models are used for sequence modeling and are widely applied to session-based and sequential recommendations, model attacks, and bundle recommendations. Among them, recurrent neural networks (RNNs) such as GRU4Rec and its derivatives are used to predict the next item in session-based and sequential recommendations, and recurrent models are also used to predict the next set of items in basket and bundle recommendations.
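To make "a conditional distribution at each step" concrete, here is a minimal sketch that swaps the neural network for the simplest possible autoregressive model: a first-order transition table estimated by counting. It is not GRU4Rec; an RNN would condition on the whole prefix instead of only the last item. The function names and session data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count how often each item is directly followed by each other item."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev_item, next_item in zip(session, session[1:]):
            counts[prev_item][next_item] += 1
    return counts


def next_item_distribution(counts, last_item):
    """Estimate P(next item | last item) from the transition counts."""
    followers = counts.get(last_item, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # unseen item: no estimate available
    return {item: n / total for item, n in followers.items()}


# Toy sessions (made-up data).
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case", "screen_protector"],
    ["phone", "charger"],
]
counts = fit_transitions(sessions)
probs = next_item_distribution(counts, "phone")
print(sorted(probs.items(), key=lambda kv: -kv[1]))
```

Recommending the highest-probability item after each step, then feeding it back as the new last item, is the autoregressive generation loop in miniature.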

Self-attentive autoregressive models are based on transformers and replace recurrent units with self-attention and related modules. These models are used for session-based and sequential recommendations, predicting the next basket or bundle, and model attacks. Self-attentive models have the advantage of effectively handling long-term dependencies and allowing parallel training.

A generative adversarial network (GAN) consists of two main components: a generator network and a discriminator network. These networks are trained competitively, which improves the performance of both. In an interaction-driven setting, GANs are used to select informative training samples. For example, in IRGAN, the generative retrieval model samples negative items. GANs are also used to augment training data by synthesizing user preferences and interactions, and are effective in generating recommendation lists and page-wide recommendations.
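The negative-sampling role of the generator can be sketched as follows. This shows only the sampling step under the assumption that the generator assigns a relevance score to every item and samples unobserved items from a softmax over those scores; the adversarial training loop (discriminator updates and the generator's reward signal) is deliberately left out. `sample_negatives`, the scores, and the temperature are hypothetical.

```python
import math
import random


def sample_negatives(scores, observed, num_samples, rng, temperature=1.0):
    """Sample unobserved items with probability softmax(score / temperature)."""
    candidates = [item for item in scores if item not in observed]
    max_score = max(scores[item] for item in candidates)
    # Subtract the max before exponentiating for numerical stability.
    weights = [
        math.exp((scores[item] - max_score) / temperature) for item in candidates
    ]
    return rng.choices(candidates, weights=weights, k=num_samples)


# Made-up generator scores for one user and that user's observed items.
scores = {"item_a": 2.0, "item_b": 0.5, "item_c": 1.5, "item_d": -1.0}
observed = {"item_a"}
negatives = sample_negatives(scores, observed, num_samples=5, rng=random.Random(0))
print(negatives)
```

Items the generator scores highly but the user has not interacted with are drawn more often, which is what makes them "informative" negatives for the discriminator.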

In addition, diffusion models produce output through a two-step process. First, a forward process gradually corrupts the input into noise; then the model learns to recover the original input from the noise in a reverse process. In recommendation, the model learns the user's future interaction probabilities and shows promising results for mitigating data scarcity and the problem of long-tail users.
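The forward (noising) half of this process can be sketched in a few lines. The sketch assumes the common Gaussian formulation, in which a noised sample is x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with eps drawn from a standard normal; the article does not say which formulation the surveyed recommenders use, and the noise schedule and interaction vector below are made up. The learned reverse process is not shown.

```python
import math
import random


def cumulative_alphas(betas):
    """Return abar_t = prod_{s<=t} (1 - beta_s) for a noise schedule."""
    alphas_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar.append(running)
    return alphas_bar


def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, 1)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


# Hypothetical linear noise schedule with 10 steps.
betas = [0.05 * (t + 1) for t in range(10)]
alphas_bar = cumulative_alphas(betas)

# Toy binary interaction vector for one user (1 = interacted).
x0 = [1.0, 0.0, 1.0, 0.0]
rng = random.Random(0)
for t in (0, 4, 9):
    print(t, [round(v, 2) for v in forward_noise(x0, alphas_bar[t], rng)])
```

As t grows, abar_t shrinks toward zero, so the original interaction signal fades and the sample approaches pure noise; the reverse model is trained to undo this step by step.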

Large-scale Language Models in Recommendation

Content-based recommender systems have been leveraging language for over 30 years, but have entered a new phase with the advent of pre-trained large-scale language models (LLMs). The generalized, multitask natural language reasoning capabilities of large-scale language models allow textual content to be used to represent item characteristics, user preferences, interactions, recommendation tasks, and even external knowledge in a unified and interpretable form.

Textual content is tied to item titles, descriptions, and reviews, and user preferences can also be expressed in natural language. Pre-trained large-scale language models offer new ways to leverage these textual data and are able to make recommendations based on user preferences and their descriptions in many domains. This section summarizes the survey of the major approaches to using large-scale language models in recommender systems.

For example, dense retrieval treats the textual content of an item as a document and concatenates the descriptions of the user's most recently preferred items to compose a query. Encoder language models such as BERT, TAS-B, and Condenser can be used to embed queries and documents and produce a ranked list of items, and approximate nearest-neighbor search libraries such as FAISS can be used to build highly scalable systems.
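The ranking step of dense retrieval can be sketched with toy vectors. Two simplifications are assumed here: the embeddings are hand-written rather than produced by an encoder such as BERT, and the query is the mean of the liked items' embeddings instead of an encoding of their concatenated descriptions. Exhaustive cosine scoring stands in for an approximate index such as FAISS. All names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean, used here as a stand-in query encoder."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]


def rank_items(query_vec, item_vecs, exclude=()):
    """Return item IDs sorted by decreasing similarity to the query."""
    scored = [
        (cosine(query_vec, vec), item)
        for item, vec in item_vecs.items()
        if item not in exclude
    ]
    return [item for _, item in sorted(scored, reverse=True)]


# Made-up 3-dimensional item embeddings.
item_vecs = {
    "space_opera": [0.9, 0.1, 0.0],
    "alien_thriller": [0.8, 0.2, 0.1],
    "rom_com": [0.0, 0.9, 0.3],
    "cooking_show": [0.1, 0.2, 0.9],
}
liked = ["space_opera"]
query = mean_vector([item_vecs[item] for item in liked])
ranking = rank_items(query, item_vecs, exclude=liked)
print(ranking)
```

In a real system the brute-force loop in `rank_items` is the part an approximate nearest-neighbor index would replace once the catalog grows large.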

Zero-shot and few-shot generative recommendation uses large commercial language models: prompts are built that describe user preferences in natural language, and the model predicts the next item title or rating to recommend. Zero-shot prompting is competitive in settings without sufficient data.
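A prompt of this kind might be assembled as in the sketch below. The wording of the instruction, the `build_prompt` helper, and the placeholder titles are all assumptions for illustration; the article does not specify a prompt template, and no model API is called here.

```python
def build_prompt(liked_titles, examples=()):
    """Assemble a zero-shot (no examples) or few-shot recommendation prompt.

    `examples` is a sequence of (liked_titles, recommended_title) pairs that
    serve as in-context demonstrations.
    """
    lines = [
        "You are a movie recommender. "
        "Suggest one title the user is likely to enjoy."
    ]
    for example_likes, example_answer in examples:
        lines.append("User liked: " + ", ".join(example_likes))
        lines.append("Recommendation: " + example_answer)
    lines.append("User liked: " + ", ".join(liked_titles))
    lines.append("Recommendation:")
    return "\n".join(lines)


# Placeholder titles, not real data.
zero_shot = build_prompt(["Title A", "Title B"])
few_shot = build_prompt(
    ["Title A", "Title B"],
    examples=[
        (["Title C", "Title D"], "Title E"),
        (["Title F"], "Title G"),
    ],
)
print(few_shot)
```

The only difference between the two settings is whether demonstrations precede the final query; the model's weights are untouched in both cases.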

Retrieval-augmented generation (RAG) is another method, in which the output of a large-scale language model is conditioned on information obtained from external knowledge sources. This facilitates online updating and reduces hallucinations (erroneous generation). In recommendation, retrieval-augmented generation first builds a candidate item set using a retriever or recommender system and then re-ranks the candidate set by prompting an encoder-decoder large-scale language model.

In addition, advances in large-scale language models have made natural language interaction between users and systems feasible, opening up the possibility of conversational recommendation (ConvRec), which integrates a variety of conversational elements. Some studies use monolithic large-scale language models such as GPT-4 to facilitate natural language interaction and generate item recommendations based on dialogue and interaction history.

By harnessing the power of large-scale language models, more sophisticated and personalized recommendation systems are expected to be realized.

Generative Multimodal Recommendation System

In recent years, users have come to expect richer interactions than mere text and image searches. Specific examples include combining photos of desired products with natural language instructions such as "the red version of the dress in this photo," or visualizing how a garment would look on you or a piece of furniture would look if placed in a room to confirm recommendations. These advanced interactions require new recommendation systems that can discover unique attributes hidden in each modality (text, images, etc.).

Why do we need multimodal recommendations? Retailers have a wide variety of information, including product descriptions, images and videos, customer reviews, and purchase histories, but traditional recommendation systems use an approach that processes each source of information independently and fuses the results. This approach often fails to adequately meet customer needs.

For example, in the cold start problem, when a new customer or product cannot be recommended due to lack of user behavior data, diverse information must be leveraged to make appropriate recommendations for the new product or customer. Also, in order to respond to the request "I am looking for a black metal and glass coffee table under $300 for my living room," the product's look and shape must be related to other objects in the room. Such a request cannot be addressed by either text or images alone.

In addition, multimodal understanding is also important for requests that combine user-provided product images or audio (e.g., a song similar to the sound clip) with textual modification instructions, as well as complementary related products (e.g., the kickstand for the bicycle in the photo). Multimodal understanding is also necessary for recommendation systems with complex outputs, such as virtual try-on features or intelligent interactive shopping assistants.

However, there are several challenges in developing a multimodal recommendation system. First, it is more difficult to collect multimodal data than unimodal data, and annotations may be incomplete. It is also difficult to effectively combine different data modalities. For example, existing contrastive learning approaches map each data modality to a common latent space, but may miss complementary information.

Furthermore, training multimodal models requires large amounts of data. Despite these challenges, recent research has made progress toward effective multimodal generative models. Specifically, this includes synthetic data generation using large-scale language models and diffusion models, high-quality unimodal encoders and decoders, techniques for aligning the latent spaces of individual modalities into a shared space, efficient reparameterization and learning algorithms, and techniques for injecting structure into learned latent spaces.

Training a multimodal generative model requires learning a latent representation of each modality and ensuring that the representations are consistent with one another. One way to address this challenge is to first learn the alignment among multiple modalities and then train the generative model on the "well-aligned" representations.

Typical contrastive learning approaches are CLIP and ALBEF. CLIP projects images and their associated text into the same embedding space using parallel encoders. ALBEF extends this idea and uses a multimodal encoder that fuses text and image embeddings. ALBEF shows excellent results on zero-shot and fine-tuned multimodal benchmarks while pre-training with fewer images.
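The contrastive objective behind this kind of alignment can be sketched as a symmetric cross-entropy over a similarity matrix, where row i of the image embeddings should match row i of the text embeddings. This is a plain-Python illustration in the style of CLIP's loss, not the library's code; the toy two-dimensional embeddings are made up, and the temperature of 0.07 is an assumed value for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in row)
        )
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings: pair i of images and texts is meant to match.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.1], [0.1, 1.0]]
shuffled_texts = [[0.1, 1.0], [1.0, 0.1]]

aligned_loss = contrastive_loss(image_embs, aligned_texts)
shuffled_loss = contrastive_loss(image_embs, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

Minimizing this loss pulls matching image-text pairs together and pushes mismatched pairs apart, which is what produces the shared embedding space.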

Contrastive alignment has shown impressive zero-shot classification and retrieval results, and has been successful for many tasks including object detection, segmentation, and action recognition. The same alignment objectives have been used across other modalities and in multiple modalities simultaneously.

Recommendation systems that leverage multimodal data provide richer and more accurate recommendations to users. This paper presents typical approaches to multimodal recommendation systems using generative models. The first is the multimodal VAE. While variational autoencoders (VAEs) can be applied directly to multimodal data, it is more effective to use modality-specific encoders and decoders that have been trained on large data sets. A common approach is to process both image and textual input and partition the latent space by modality. For example, ContrastVAE adds a contrastive loss between the latent representations of each modality and is robust to perturbations in the latent space while addressing data uncertainty and scarcity.

The second is the diffusion model. This is the state of the art in image generation and can also be used for text generation. For example, DALL-E generates new images based on CLIP's embedding space, while Stable Diffusion uses UNet autoencoders trained with perceptual loss and patch-based adversarial objectives. This improves the controllability and consistency of the generated results and has been used in applications such as virtual try-on.

The third is the multimodal large-scale language model (MLLM). This provides a natural language interface in which users express queries in multiple modalities and receive responses in different modalities. It connects discriminatively pre-trained encoders and decoders and uses adaptation layers to ensure that the unimodal representations are aligned. For example, LLaVA accepts both text and image input and generates useful text responses. Research on multimodal large-scale language models is just beginning, but they are already being used in recommendation applications.

Generative multimodal recommendation systems have the potential to significantly improve the user experience. These technologies are expected to play an increasingly important role in the future.

Impact and Harm Assessment

Evaluating recommender systems is multifaceted and complex. These systems are composed of numerous recommendation models and other machine learning and non-machine-learning components, making it difficult to evaluate the performance of individual models. In addition, quantifying the impact of recommendations is a difficult challenge, as they have a broad impact on user experience and behavior. In particular, the introduction of Gen-RecSys (generative recommendation systems) has further complicated the evaluation process. In addition to the performance and capabilities of the system, it is important to evaluate its safety and potential for social harm. Here we review the key points of the evaluation and survey the evaluation metrics, outstanding issues, and future research directions.

The first step is offline impact assessment. A common approach to model evaluation is to measure accuracy and efficiency in an offline environment, followed by live experiments.

Common metrics used for discriminative tasks include recall@k, precision@k, NDCG@k, AUC, ROC, RMSE, and MAE. For generative tasks, metrics from natural language processing are useful. For example, BLEU scores are used for description and review generation, and ROUGE scores are used for summary evaluation. Perplexity is also useful for assessing the quality of language modeling. It is also important to evaluate the training and inference efficiency of generative recommendation models. This is considered an area for future research.
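Three of the ranking metrics named above can be computed as in the following sketch. It assumes binary relevance (an item is either relevant or not) and the common log2 discount for NDCG; other conventions exist, and the ranked list and relevant set are made-up data.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy example: a ranked list from a recommender and the held-out relevant items.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
k = 3
print(precision_at_k(ranked, relevant, k))
print(recall_at_k(ranked, relevant, k))
print(ndcg_at_k(ranked, relevant, k))
```

Precision and recall only count hits in the top k, whereas NDCG also rewards placing those hits nearer the top of the list.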

Benchmarking is also important. Popular benchmark datasets for discriminative recommendation models (MovieLens, Amazon Reviews, Yelp Challenge, etc.) are also useful for generative recommendation models. However, more recent datasets such as ReDial and INSPIRED are specialized for conversational recommendations. New benchmarks must be developed to address new tasks.

The next step is online and long-term evaluation. A/B testing is necessary because offline experiments cannot fully account for interdependencies among models and other factors. Simulations with agents are also useful. It is important to understand not only the short-term impact, but also the long-term impact using business metrics such as revenue and engagement.

Other useful metrics for evaluating conversational recommendations are BLEU and perplexity. They should be complemented by task-specific and goal-specific metrics. Powerful large-scale language models can serve as judges, but human evaluation is ultimately important; toolkits such as CRSLab can assist in this process.

Milano et al. classify the harms associated with recommender systems into six categories (content, privacy violations, threats to human autonomy, transparency, filter bubbles, and fairness). Generative models present new challenges. These include biases in large-scale language models, environmental impacts, and the replacement of human workers.

Evaluating recommendation systems on offline indicators, online performance, and harm is challenging. Further research and tool development is required due to different evaluation approaches among different stakeholders; a comprehensive evaluation framework should be designed with reference to the HELM benchmarks.

Thus, the evaluation of the impact and harm of recommender systems requires a multifaceted perspective. The evaluation must take into account not only accuracy and efficiency, but also safety and social impact. Future research and the development of new benchmarks will contribute to the evolution of recommender systems.

Summary

This paper is a survey conducted to explore the diversity and potential of generative models in recommender systems. As we have discussed, the application of recommender systems and their evaluation are becoming increasingly complex, and it is hoped that this survey will contribute to the development of this field.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
