Catch up on the latest AI articles

[GenAI-Arena] New Platform To Evaluate Generative Models By User Votes

Large Language Models

3 main points
✔️ Proposes GenAI-Arena, the first open platform to rank generative models based on user preferences
✔️ Uses user voting to rate generative models and supports three tasks: image generation, image editing, and video generation
✔️ Voting data published as "GenAI-Bench" to promote the development of the research community

GenAI Arena: An Open Evaluation Platform for Generative Models
written by Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
(Submitted on 6 Jun 2024)
Comments: 
9 pages, 7 figures
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Image generation and image editing techniques are evolving rapidly and are being used in a variety of fields, including artwork creation and medical imaging support. Despite this evolution, mastering these models and evaluating their performance remain challenging tasks. Traditional evaluation metrics such as PSNR, SSIM, LPIPS, and FID are useful for evaluating specific aspects, but fall short of providing a comprehensive evaluation. In particular, they struggle with subjective aspects such as aesthetics and user satisfaction.

To address these challenges, this paper proposes a new platform called GenAI-Arena. It is an interactive platform that allows users to generate images, compare them, and vote for their favorite models. GenAI-Arena simplifies the process of comparing different models and ranks them according to user preferences, allowing for a more comprehensive evaluation of model capabilities. The platform supports a wide range of tasks, including Text-to-Image Generation, Text-Guided Image Editing, and Text-to-Video Generation. It also provides a public voting process to ensure transparency.

Since February 11, 2024, the paper has collected over 6,000 votes across the three multimodal generation tasks. These votes are used to build a leaderboard for each task. Analysis of the voting data also reveals that while the Elo rating is generally valid, it can be biased by an imbalance between "easy" and "difficult" games. The authors also present case studies showing that users can judge outputs from multiple evaluation perspectives, identify subtle differences, and provide accurate votes for the Elo rating calculation.

Furthermore, automatically evaluating the quality of generated image and video content remains a challenging problem. Images and videos have many sensitive evaluation aspects, such as visual quality, consistency, integrity, and artifacts, and this multifaceted nature makes evaluation difficult. In addition, there is a lack of labeled training data on the Web. Therefore, this paper aims to promote further development in this area by making the user voting data publicly available as GenAI-Bench.

The authors compute correlations between various automatic evaluation models (e.g., multimodal large language models such as GPT-4o and Gemini) and human preferences to assess their evaluation capabilities. They show that even the best multimodal large language model, GPT-4o, achieves a Pearson correlation coefficient with human preferences of only about 0.22.

GenAI-Arena consists of three components. The first is the Arena for text-based image generation (T2I), image editing (Editing), and text-based video generation (T2V), where the community votes to produce preference pairs. The second is the Leaderboard, which uses these preference pairs to compute the Elo rating of all evaluated models. The third is GenAI-Bench, for evaluating various multimodal large language models as automatic evaluators.

GenAI-Arena: Design and Implementation

GenAI-Arena is designed as an intuitive and comprehensive platform for evaluating generative models. It focuses on three main tasks: text-based image generation (T2I), image editing (Editing), and text-based video generation (T2V). Each task includes a voting system, a playground, and a leaderboard, as shown in the figure below, making the platform easily accessible to both casual users and researchers and allowing model performance to be evaluated casually yet accurately.

To ensure a fair comparison of the various models, the authors standardize on existing code bases. During inference, hyperparameters and prompt format are fixed, preventing instance-specific tuning of prompts and hyperparameters. This makes inference across models fair and reproducible. Following ImagenHub, the authors also built a new library, VideoGenHub, to standardize inference procedures for text-to-video and image-to-video models. This allows the hyperparameters that bring out the best performance of each model to be found and then fixed.
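As a rough illustration of this standardized setup, the sketch below mimics how a wrapper library in the spirit of ImagenHub / VideoGenHub might fix per-model hyperparameters. The class, dictionary, and parameter values are hypothetical assumptions for illustration, and a diffusers-style pipeline object is assumed; this is not the actual library API.

```python
# Hypothetical sketch of a standardized inference wrapper: hyperparameters are
# fixed per model so that every arena comparison runs under the same,
# reproducible settings (values below are illustrative, not the paper's).
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceConfig:
    num_inference_steps: int
    guidance_scale: float
    seed: int = 42  # fixed seed for reproducibility

# One frozen config per model; users cannot tweak these per prompt.
FIXED_CONFIGS = {
    "SDXL":        InferenceConfig(num_inference_steps=50, guidance_scale=7.5),
    "SDXL-Turbo":  InferenceConfig(num_inference_steps=1,  guidance_scale=0.0),
    "PixArt-alpha": InferenceConfig(num_inference_steps=20, guidance_scale=4.5),
}

def generate(model_name: str, pipeline, prompt: str):
    """Run a diffusers-style text-to-image pipeline with its fixed settings."""
    cfg = FIXED_CONFIGS[model_name]
    return pipeline(
        prompt,
        num_inference_steps=cfg.num_inference_steps,
        guidance_scale=cfg.guidance_scale,
    ).images[0]
```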

Voting is designed to ensure unbiased votes and accurate evaluation of the generative models. (1) When the user enters a prompt, output is generated from two anonymous models within the same task. (2) The outputs of the two anonymous models are displayed side by side and compared. (3) The user votes from four options according to their own preference: "left is better," "right is better," "both are better," or "both are worse." These four options are used to calculate the Elo rating. Finally, (4) once the user has made a decision, they click the Vote button to submit the vote. If the model identities are revealed during this process, the vote is invalidated. In other words, the system is built to evaluate model preferences based purely on the outputs alone.
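The article does not spell out the Elo calculation itself; the snippet below is a minimal sketch of how such pairwise votes could be turned into online Elo updates, where the two tie-like options are treated as a 0.5 score. The K-factor, starting rating, and tie handling are illustrative assumptions, not the paper's exact settings.

```python
# Minimal Elo-update sketch for arena-style pairwise votes.
# Score for the left model: 1.0 = "left is better", 0.0 = "right is better",
# 0.5 = "both are better" / "both are worse" (treated as a tie here).
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str, score_a: float, k: float = 32.0):
    ra, rb = ratings.get(model_a, 1000.0), ratings.get(model_b, 1000.0)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))

ratings = {}
votes = [("PlayGround V2.5", "SDXL", 1.0), ("SDXL", "PixArt-alpha", 0.5)]
for a, b, s in votes:
    update_elo(ratings, a, b, s)
print(ratings)
```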

GenAI-Arena incorporates state-of-the-art generative models that cover a wide range of generative tasks, including text-based image generation (T2I), image editing (Editing), and text-based video generation (T2V). For comprehensive evaluation, the platform integrates models that employ a variety of underlying technologies, including different architectures, learning paradigms, training data, and acceleration techniques. This provides insight for a rigorous understanding of each factor.

The table below shows all the text-based image generation (T2I) models used. For example, SDXL, SDXL-Turbo, and SDXL-Lightning are all based on SDXL, but SDXL-Turbo and SDXL-Lightning use different distillation methods. Diffusion transformer models such as PixArt-α and PixArt-σ are also included. Playground V2 and Playground V2.5 are based on the SDXL architecture but are trained from scratch on an internal dataset by Playground.ai.

The table below shows all "image editing (Editing) models" and approaches. For example, plug-and-play approaches such as Pix2PixZero, InfEdit, and SDEdit do not require training and are applicable to a wide range of diffusion models. On the other hand, some models, such as PnP and Prompt2Prompt, require DDIM inversion, and these take longer than other approaches. Also included are professionally trained image editing models such as InstructP2P, MagicBrush, and CosXLEdit.

The table below also shows all of the text-to-video (T2V) models. For example, AnimateDiff, ModelScope, and LaVie were initialized from SD-1.5 and further trained with injected motion layers to capture temporal relationships between frames. In contrast, StableVideoDiffusion and VideoCrafter2 were initialized from SD-2.1.

GenAI-Bench

Because the prompts entered by users come from a wide range of people, an NSFW filter (Llama Guard) is applied to protect users from potentially harmful or offensive content.
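As a rough illustration, prompt filtering with Llama Guard might look like the sketch below. It follows the public usage pattern of the gated meta-llama/LlamaGuard-7b checkpoint on Hugging Face, but it is an assumption about how such a filter could be wired up, not the arena's actual pipeline code.

```python
# Hedged sketch: classify a user prompt as safe/unsafe with Llama Guard.
# Assumes access to the gated meta-llama/LlamaGuard-7b checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def is_safe(prompt: str) -> bool:
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    # Llama Guard answers "safe" or "unsafe" followed by the violated category codes.
    return verdict.strip().startswith("safe")

# Only prompts judged safe would be kept when releasing GenAI-Bench.
```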

In total, the text-based image generation (T2I) task collected 4,300 anonymous votes, but only 1,700 remained as safe content after filtering. A large number of prompts were filtered due to sexual content, which accounted for 85.6% of the discarded data. The image editing (Editing) task collected 1,100 votes before filtering, and 900 votes remained after applying Llama Guard. In this task, 87.5% of the inappropriate inputs involved violent crime, while the remaining 12.5% were filtered for content related to sexual crime. Finally, the text-based video generation (T2V) task collected 1,200 votes before filtering, and 1,100 votes were released after NSFW filtering. All inappropriate data discarded in this task were attributed to sexual content.

Note that the current version of GenAI-Bench is available on the HuggingFace Dataset website under the MIT license.

To analyze the collected user votes, the authors compute correlations with several existing metrics: CLIPScore, GPT-4o, Gemini-1.5-Pro, Idefics2, and Mantis are used as evaluators. Image generation tasks are evaluated using VIEScore prompts for these multimodal large language models, which include assessments of semantics, quality, and overall performance. Since VIEScore does not include prompts for video evaluation, for the text-based video generation (T2V) task the authors design a multimodal large language model prompt template to evaluate output quality; videos are decomposed into image frames and input as image sequences. The voting results are encoded, and correlations are calculated against the score differences of the existing metrics. As shown in the table below, the correlations are generally low; the correlations between this preference-based voting approach and the multimodal large language models are nearly random.
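As a sketch of how such a correlation could be computed, the snippet below uses SciPy's Pearson correlation between per-example score differences from an automatic evaluator and encoded human votes. The +1 / 0 / -1 vote encoding and the sample numbers are illustrative assumptions, not the paper's actual data.

```python
# Sketch: correlate an automatic metric's score differences with human votes.
from scipy.stats import pearsonr

# Score difference (model A minus model B) from an automatic evaluator,
# e.g. GPT-4o prompted with a VIEScore-style template (illustrative values).
score_diff = [1.5, -0.5, 0.0, 2.0, -1.0]
# Human votes encoded as +1 (A better), -1 (B better), 0 (tie) -- an assumed encoding.
human_vote = [1, -1, 0, 1, 0]

r, p_value = pearsonr(score_diff, human_vote)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```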

Experimental Results

The leaderboard at the time this paper was written (2024/06/06) is shown in the table below. The image generation task has collected a total of 4,443 votes. The currently top-ranked models are Playground V2.5 and Playground V2, both released by Playground.ai. These models use the same architecture as SDXL, but are trained on private datasets. SDXL, on the other hand, is ranked 7th, well behind. This result shows the importance of the training dataset.

Following the Playground models is StableCascade, which uses a highly efficient cascade architecture to reduce training costs. According to Würstchen, StableCascade's training cost is only 10% of SD-2.1's, yet it significantly outperforms SDXL on the leaderboard. This shows the importance of the diffusion architecture.

The image editing task collected a total of 1,083 votes, with MagicBrush, InFEdit, CosXLEdit, and InstructPix2Pix coming out on top. These models are considered good at local editing of images. PNP, on the other hand, preserves structure by injecting features, which limits the diversity of its edits. The older methods, Prompt-to-Prompt, CycleDiffusion, SDEdit, and Pix2PixZero, produce high-quality images but often produce images that differ greatly from the original during editing, which the authors cite as the reason for these models' low ranking.

In the text-based video generation task, which collected a total of 1,568 votes, T2VTurbo takes the top spot with the highest Elo score. StableVideoDiffusion comes in second, followed by VideoCrafter2 and AnimateDiff, which have very close Elo ratings and show nearly equal capabilities. LaVie, OpenSora, ModelScope, and AnimateDiff-Turbo follow, with progressively lower scores.

The figure below visualizes a heat map of win rates. Each cell shows the percentage of wins for Model A against Model B. The models in the heat map are ordered by Elo rating. Along each row, Model A's win rate increases as Model B's Elo rating decreases, indicating the effectiveness of the Elo rating.
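A win-rate heat map of this kind can be derived directly from the pairwise votes. The sketch below counts, for every ordered model pair, the fraction of battles the first model won; the sample battle records and the half-credit handling of ties are illustrative assumptions.

```python
# Sketch: build a win-rate table from pairwise battle records.
from collections import defaultdict

# Each record: (model_a, model_b, winner) where winner is "A", "B", or "tie".
battles = [
    ("PlayGround V2.5", "PixArt-sigma", "B"),
    ("PlayGround V2.5", "SDXL", "A"),
    ("PixArt-sigma", "SDXL", "A"),
]

wins = defaultdict(float)
games = defaultdict(int)
for a, b, winner in battles:
    games[(a, b)] += 1
    games[(b, a)] += 1
    if winner == "A":
        wins[(a, b)] += 1.0
    elif winner == "B":
        wins[(b, a)] += 1.0
    else:  # count ties as half a win for each side (assumption)
        wins[(a, b)] += 0.5
        wins[(b, a)] += 0.5

win_rate = {pair: wins[pair] / games[pair] for pair in games}
print(win_rate)
```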

PlayGround V2.5 achieves the state-of-the-art Elo rating for the text-based image generation task, but its win rate against PixArt-σ is only 0.48, less than 50%. Similarly, T2V-Turbo, the state-of-the-art model for the text-based video generation task, has a low win rate against StableVideoDiffusion. T2V-Turbo's high Elo rating may be due to it having more votes in "easy games" and fewer votes in "difficult games." For example, T2V-Turbo and AnimateDiff-Turbo have a high number of games against each other (30), compared to about 10 against the other models (see figure below). These anomalies point to potential shortcomings of the Elo rating: reliable Elo ratings require large amounts of voting data, and the estimated Elo rating may be biased by an imbalance between "easy" and "difficult" games.

The figure below presents case studies showing the votes collected in the three generation tasks. These cases demonstrate that GenAI-Arena users can provide high quality votes even for advanced models.

For example, in the text-based image generation task, the image generated by PlayGround V2.5 was preferred over the image generated by SDXL-Lightning for the prompt "cute dog playing with a ball," likely because the latter depicted two dogs. Users can clearly distinguish and vote based on the quality of the output even when both models completed the task. Similarly, in the image editing task, users voted for Prompt2Prompt's edited image because it looked more natural than InfEdit's edited image. Reliable votes are also collected in the text-based video generation task.

Summary

In this paper, the authors propose an open platform called GenAI-Arena. The platform aims to rank generative models for three main tasks, text-to-image generation, image editing, and video generation, based on user preferences. Unlike other platforms, GenAI-Arena is run by community voting, which allows for transparent and sustainable operation.

Since February 11, 2024, more than 6,000 votes have been collected through the voting system to rate the models. Based on these votes, an Elo rating leaderboard has been created, showing that PlayGround V2.5, MagicBrush, and T2V-Turbo are the most advanced models in their respective tasks (as of June 4, 2024). An analysis of the collected votes shows that while the Elo rating works overall, it can be biased by an imbalance between "easy" and "difficult" games. Several case studies also show that the collected votes are of high quality.

In addition, the voting data based on human preferences is available as GenAI-Bench. The authors evaluate the generated images and videos in GenAI-Bench using existing multimodal large language models and calculate their correlation with the human votes. Experimental results show that existing multimodal large language models exhibit very low correlation, with even the best model, GPT-4o, achieving a Pearson correlation coefficient of only about 0.22 on quality and results comparable to random guessing in other respects.

The authors will continue to collect votes to update the leaderboard and help the community track research progress. They also plan to develop better multimodal large language models to more accurately approximate human ratings on GenAI-Bench. Further research is expected in the future.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
