Persona Hub, A Large Dataset Built From 1 Billion Personas, Is Now Available!

Persona-driven Data Synthesis 19/12/2024

3 main points
✔️ Proposed a persona-driven data synthesis methodology, a new method for creating diverse synthetic data
✔️ Constructed Persona Hub, a large dataset of 1 billion personas, from vast amounts of web data
✔️ Introduced various use cases to demonstrate the versatility of Persona Hub

Scaling Synthetic Data Creation with 1,000,000,000 Personas
written by Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
(Submitted on 28 Jun 2024)
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Synthetic data is data generated by models and algorithms rather than by humans. It has been attracting increasing interest in recent years because it can be used as training data for Large Language Models (LLMs).

However, while it is easy to scale up the amount of synthetic data, it is difficult to scale up its diversity: creating diverse synthetic data requires a wide variety of prompts.

This article describes a paper that proposes a persona-driven data synthesis methodology, a new method for creating diverse synthetic data. The paper demonstrates the method's versatility by building Persona Hub, a large-scale dataset of 1 billion personas, from a vast amount of web data and by introducing various examples of its use.

Persona-Driven Data Synthesis Methodology

The paper proposes a persona-driven data synthesis methodology for the large-scale creation of diverse synthetic data.

As shown in the figure below, this approach rests on the idea that simply adding a persona to a data synthesis prompt steers the LLM toward that persona's perspective, so that it creates distinctive synthetic data.

In addition, since almost all LLM use cases can be associated with a specific persona, comprehensive synthetic data can be created on a large scale as long as a comprehensive collection of personas is available. This characteristic was used to build Persona Hub, which is described below.
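The idea of adding a persona to a data synthesis prompt can be sketched as follows. This is only an illustration: the prompt wording, the task, and the personas are assumptions made for this example and are not taken from the paper.

```python
def persona_prompt(task: str, persona: str) -> str:
    """Attach a persona to a data synthesis prompt (illustrative wording)."""
    return f"{task} Write it from the perspective of the following persona:\n{persona}"


# Hypothetical personas; real ones would be drawn from Persona Hub.
personas = [
    "a pediatric nurse",
    "a cab driver who checks traffic conditions before each shift",
    "a researcher studying superconductivity",
]

# Hypothetical task; any data synthesis instruction could be used here.
task = "Create a challenging math problem."
prompts = [persona_prompt(task, p) for p in personas]

for prompt in prompts:
    print(prompt, end="\n\n")
```

The task stays fixed while the persona changes, so the number of distinct prompts grows with the size of the persona collection.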

Persona Hub

The paper constructs Persona Hub, a large dataset containing 1 billion diverse personas (about 13% of the world's population), from a vast amount of web data.

To build Persona Hub from vast amounts of web data, the paper proposes two approaches: Text-to-Persona and Persona-to-Persona.

Text-to-Persona

This approach is based on the idea that specific personas can be inferred from texts in view of the fact that people with certain professional experiences and cultural backgrounds have unique characteristics in reading and writing texts.

Based on this idea, as shown in the figure below, it is possible to ask the LLM "Who is likely to [read/write/like/dislike/...] the text?" to obtain the persona corresponding to any given text.

In addition, the granularity of the acquired personas depends on the input text. As shown in the figure below, if the input text contains detailed information (e.g., a math textbook or an academic paper on superconductivity), the resulting personas will also be more specific.

Thus, by applying Text-to-Persona to vast amounts of web text data, it is possible to obtain billions of diverse personas across different granularities.
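A minimal sketch of Text-to-Persona over a batch of texts might look like the following. The question string comes from the text above; everything else (the `LLM` type, `fake_llm`, the prompt layout, and the sample texts) is an assumption made so the example runs offline, not the paper's implementation.

```python
from typing import Callable, Iterable

# Stand-in type for any text-in, text-out LLM client (an assumption).
LLM = Callable[[str], str]


def text_to_persona_prompt(text: str, action: str = "read") -> str:
    """Build the Text-to-Persona question for one text and one action."""
    return f"Who is likely to {action} the text?\n\nText:\n{text}"


def text_to_persona(texts: Iterable[str], llm: LLM, action: str = "read") -> list[str]:
    """Ask the LLM for one persona per input text."""
    return [llm(text_to_persona_prompt(t, action)).strip() for t in texts]


def fake_llm(prompt: str) -> str:
    """Toy replacement for a real model so the sketch runs without network access."""
    if "superconductivity" in prompt:
        return "a physicist studying superconductivity"
    return "a general reader"


# Hypothetical web texts.
web_texts = [
    "We report a new measurement related to superconductivity in thin films.",
    "Ten easy dinner ideas for busy weeknights.",
]
personas = text_to_persona(web_texts, fake_llm)
print(personas)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM API; a real client would replace `fake_llm`.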

Persona-to-Persona

While Text-to-Persona, described above, is a scalable method that covers almost all types of personas, some personas have little presence on the web and are therefore unlikely to be obtained through Text-to-Persona.

Therefore, to supplement the personas that are difficult to acquire with Text-to-Persona, the paper proposes a method called Persona-to-Persona, which derives personas with interpersonal relationships from those obtained with Text-to-Persona.

Persona-to-Persona is a method for acquiring various personas through interpersonal relationships. As shown in the figure below, by asking the LLM "Who is in a close relationship with the given persona?", it is possible to start from a given persona (e.g., a pediatric nurse) and generate personas of patients (Patient) and colleagues (Colleague).

In the paper, Persona-to-Persona expansion was applied six times to each persona acquired by Text-to-Persona, extending Persona Hub into a larger and richer dataset.
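The repeated expansion can be sketched as a loop that keeps asking for related personas. This is one possible reading of the procedure, not the paper's code: the `related` callable stands in for an LLM call, the toy relationship table is invented for illustration, and the exact-match duplicate filter is a simplification added here.

```python
from typing import Callable

# Stand-in type: maps a persona to closely related personas (an assumption;
# a real version would send the relationship question to an LLM and parse the reply).
RelatedFn = Callable[[str], list[str]]


def expand_personas(seeds: list[str], related: RelatedFn, rounds: int = 6) -> list[str]:
    """Grow a persona collection by repeatedly asking for related personas."""
    seen = set(seeds)
    collected = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for persona in frontier:
            for new_persona in related(persona):
                if new_persona not in seen:  # exact-match filter only
                    seen.add(new_persona)
                    collected.append(new_persona)
                    next_frontier.append(new_persona)
        if not next_frontier:  # nothing new was found, stop early
            break
        frontier = next_frontier
    return collected


# Toy relationship table in place of LLM calls.
TOY_RELATIONS = {
    "a pediatric nurse": ["a child patient", "a hospital colleague"],
    "a child patient": ["a parent of a child patient"],
}


def toy_related(persona: str) -> list[str]:
    return TOY_RELATIONS.get(persona, [])


expanded = expand_personas(["a pediatric nurse"], toy_related)
print(expanded)
```

Each round only expands the personas discovered in the previous round, so personas several relationships away from the seed appear in later rounds.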

Use Cases

This paper aims to demonstrate the versatility of Persona Hub by presenting various examples of its use in the real world.

Knowledge-rich Texts

Persona Hub can be easily applied to create knowledge-rich plain text to help with LLM pre-training and post-training.

As shown in the figure below, personas extracted from Persona Hub can be used to encourage LLMs to write articles of a specialized nature.

Extending this process to Persona Hub's one billion personas makes it easy to obtain a vast volume of knowledge-rich texts covering topics of varying granularity.
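As a rough sketch of how this could be scaled, the generator below streams personas and yields one article-writing prompt per persona. The prompt wording and the sample lines are assumptions for illustration; the paper's actual prompt is not reproduced here.

```python
from typing import Iterable, Iterator


def article_prompts(personas: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Lazily yield (persona, prompt) pairs so a very large persona file can be streamed."""
    for persona in personas:
        persona = persona.strip()
        if not persona:
            continue  # skip blank lines
        prompt = (
            "Assume you are the persona described below and write an informative "
            "article on a topic you know well.\n"
            f"Persona: {persona}"
        )
        yield persona, prompt


# Hypothetical lines as they might be read from a persona file.
lines = ["a researcher studying superconductivity\n", "\n", "a pediatric nurse\n"]
pairs = list(article_prompts(lines))
for persona, prompt in pairs:
    print(prompt, end="\n\n")
```

Because the function is a generator, it never holds the full persona collection in memory, which matters at the scale described above.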

Game NPCs

A straightforward and practical application of Persona Hub is to create a variety of non-player characters (NPCs) to match the scale of the game.

As long as you provide LLMs with information about the background and world of the game, you can encourage them to project Persona Hub personas onto the characters in the game world.

This allows you to use Persona Hub personas to create NPCs for your game (World of Warcraft, for example), as shown in the figure below, greatly reducing the effort of creating NPCs in the game design process.
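A prompt that combines game background with a persona could be assembled along these lines. The `GameWorld` class, the field names, the prompt wording, and the sample persona are all hypothetical; the background string is deliberately left as a placeholder.

```python
from dataclasses import dataclass


@dataclass
class GameWorld:
    name: str
    background: str  # short description supplied by the game designer


def npc_prompt(world: GameWorld, persona: str) -> str:
    """Ask for an NPC that projects a persona into the given game world."""
    return (
        f"Game: {world.name}\n"
        f"World background: {world.background}\n"
        f"Persona: {persona}\n"
        "Create a non-player character for this game who reflects the persona. "
        "Give the character's name, role, and a short backstory."
    )


# Placeholder background text; a real prompt would carry the designer's own material.
world = GameWorld(name="World of Warcraft", background="<background supplied by the designer>")
prompt = npc_prompt(world, "a cab driver who knows every shortcut in the city")
print(prompt)
```

Keeping the world description separate from the persona means the same background can be reused for every persona in the collection.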

Tool (Function) Development

Persona Hub allows us to simulate a variety of real users and create tools that they may need.

The figure below shows an example: a tool that helps cab drivers check traffic conditions.

Although these are just interface definitions, they can be easily converted into code implementations as shown in the figure below.

Preparing tools in advance in this way is expected to remove the need to build a tool from scratch each time one is required.
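The step from an interface definition to a code implementation can be illustrated with a small sketch. The `ToolSpec` structure, the `check_traffic` name, its parameters, and the toy congestion data are all hypothetical and do not reproduce the paper's figures.

```python
import inspect
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Interface definition only: name, purpose, and parameters."""
    name: str
    description: str
    parameters: dict[str, str] = field(default_factory=dict)  # parameter name -> meaning


# Hypothetical interface for the traffic-checking example.
traffic_tool = ToolSpec(
    name="check_traffic",
    description="Report traffic conditions on a route for a cab driver.",
    parameters={"origin": "start of the route", "destination": "end of the route"},
)

# Toy data in place of a real traffic service.
TOY_CONGESTION = {("airport", "downtown"): "heavy"}


def check_traffic(origin: str, destination: str) -> str:
    """One possible implementation matching the interface above."""
    level = TOY_CONGESTION.get((origin, destination), "unknown")
    return f"Traffic from {origin} to {destination}: {level}"


# Check that the implementation accepts exactly the declared parameters.
declared = set(traffic_tool.parameters)
implemented = set(inspect.signature(check_traffic).parameters)
print(declared == implemented)
print(check_traffic("airport", "downtown"))
```

Comparing the declared parameters with the function signature is a cheap consistency check when implementations are generated from interface definitions.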

Summary

How was it? In this article, we described a paper that proposes a persona-driven data synthesis methodology, a new method for creating diverse synthetic data, and constructs Persona Hub, a large-scale dataset of 1 billion personas, from a vast amount of web data. The paper demonstrated its versatility by presenting various application examples.

Persona Hub already includes 1 billion personas, but the challenge remains that these personas focus only on key aspects and do not take into account detailed information (family background, historical background, life experiences, etc.).

With this information, each persona would become more unique. We are very much looking forward to future work, not only because this would allow Persona Hub to scale further, but also because it opens up the possibility of practical applications such as personalized conversations.

Details of the Persona Hub and its use cases introduced in this article can be found in this paper, and those interested should refer to it.

田中侑李