
Profiling The Relationship Between Websites And Their Audiences Enables The Detection Of Fake News And Political Bias!


GNN

3 main points
✔️ Proposed a graph learning model that predicts article factuality and political bias by modeling audience overlap across websites
✔️ Built a large graph representing the relationships between websites and their audiences using Alexa
✔️ Achieved significant accuracy improvements over existing models on two standard datasets

GREENER: Graph Neural Networks for News Media Profiling
written by Panayot Panayotov, Utsav Shukla, Husrev Taha Sencar, Mohamed Nabeel, Preslav Nakov
(Submitted on 10 Nov 2022)
Comments: Accepted by ACL 2022
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

code:   

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

With the rapid spread of social networking services, the social impact of fake news has become enormous, and the detection of such malicious content has attracted increasing attention in recent years.

However, while many studies have focused on text, few have profiled news media as a whole rather than individual texts or articles.

This paper proposes a coarser-grained approach to fake news detection: GREENER (Graph Neural Networks for News Media Profiling), a method that models relationships among news media based on audience overlap and profiles them using three different graph learning models.

History of Fake News Research

As mentioned earlier, existing fake news detection tasks focused primarily on analyzing textual content using natural language processing techniques.

While these text-based methods are useful for contextual analysis of articles, they struggle to assess the credibility of the claims an article makes: depending on the dataset, even state-of-the-art models achieve only 65-71% accuracy in detecting factuality (whether an article is correct) and 70-80% in detecting political bias (slant toward a particular political position).

Against this backdrop, several approaches have been proposed to detect fake news on social media platforms by capturing and comparing information about the followers of news media and profiling how those followers respond to the media's content through their comments and posts.

These studies are based on the idea that if a group of people shares a common interest in several websites, those websites should be similar in some respect, and that using features derived from network data about the target websites, in addition to textual and visual features, enables a more comprehensive analysis for detecting less factual websites.

In this paper, the authors extend these methods, modeling audience similarity with a large-scale graph built using the Alexa site info tool and three different graph representation learning methods.

GREENER - Graph Neural Networks for News Media Profiling

The Alexa site info tool used to build the graphs in this paper returns, given the address of a target website, a list of four to five sites that are highly similar to it in terms of audience overlap.

For example, entering the address of the website wsj.com returns similar sites and their similarity scores, such as { marketwatch.com 39.4, cnbc.com 39.4, bloomberg.com 35.9, reuters.com 34.5 }.

Using these website pairs and overlap scores, the paper constructed the graph shown in the figure below, representing websites as nodes and the audience overlap between two websites (weighted by its degree) as edges.
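The graph-construction step above can be sketched as follows. The site pairs and scores mirror the wsj.com example quoted earlier, and the dictionary-based adjacency is an illustrative stand-in for whatever graph library the authors actually used:

```python
from collections import defaultdict

# Hypothetical audience-overlap pairs, in the style of the Alexa
# "similar sites" output quoted above (scores from the wsj.com example).
overlap_pairs = [
    ("wsj.com", "marketwatch.com", 39.4),
    ("wsj.com", "cnbc.com", 39.4),
    ("wsj.com", "bloomberg.com", 35.9),
    ("wsj.com", "reuters.com", 34.5),
]

# Websites become nodes; audience overlap becomes a weighted edge.
graph = defaultdict(dict)
for site_a, site_b, score in overlap_pairs:
    graph[site_a][site_b] = score  # undirected: store both directions
    graph[site_b][site_a] = score

print(len(graph))                       # 5 nodes
print(graph["wsj.com"]["reuters.com"])  # 34.5
```

Querying each newly added site in the same way and merging the results is what expands the initial seed graph into the large-scale graph described below.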

The seed websites for these graphs came from lists manually labeled by a service that verifies the factuality of the information on each site. To identify relationships between websites in more detail, the initial graph was then expanded by repeating the steps above, adding new nodes and edges.

The result is the large-scale graph shown in the figure below, representing the relationship between each website and its audience. (Red: websites with low factuality, green: websites with high factuality, white: websites whose factuality is ambiguous or unknown.)

The large-scale graph above shows a distribution in which sites with high factuality are clearly separated from those with low factuality.

Representation learning on graphs

In this paper, we experimented with the following three models for the purpose of learning the representation of nodes and edges in the large-scale graphs described above.

  1. Node2Vec: one of the earliest graph learning frameworks, a model that generates sequences for graphs by sampling random walks of fixed maximum length for each node
  2. Graph Convolutional Networks (GCN): A graph neural network model. While Node2Vec embeds nodes based only on the graph structure, GCN performs convolution operations over all neighboring nodes, allowing it to embed both the graph structure and node/edge features.
  3. GraphSAGE: A graph neural network model that, unlike GCN, performs convolution operations only on a subset of sampled neighboring nodes.
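The key difference between the aggregation schemes in points 2 and 3 can be sketched as follows. This is a toy illustration only: real GCN and GraphSAGE layers also apply learned weight matrices and nonlinearities, which are omitted here, and the graph and features are made up:

```python
import random

# Toy graph: adjacency lists and 2-d feature vectors per node.
adj = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 1.0], "d": [4.0, 3.0]}

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def aggregate_gcn(node):
    # GCN-style: average over ALL neighbors plus the node itself.
    return mean([feats[n] for n in adj[node] + [node]])

def aggregate_sage(node, sample_size, rng):
    # GraphSAGE-style: average over a fixed-size SAMPLE of neighbors,
    # which keeps the cost bounded on high-degree nodes.
    sampled = rng.sample(adj[node], min(sample_size, len(adj[node])))
    return mean([feats[n] for n in sampled + [node]])

print(aggregate_gcn("a"))  # [1.75, 1.25]
```

When a node has no more neighbors than the sample size, the two schemes coincide; the sampling only matters for hub nodes with many neighbors, which are common in audience-overlap graphs.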

Using these three graph representation learning algorithms, we were able to obtain a low-dimensional vector representation (512 for Node2Vec and 128 for GCN and GraphSAGE) of each node (website) in the graph.

Experiments and Evaluations

In this paper, two datasets used in prior work, EMNLP-2018 (Baly et al., 2018) and ACL-2020 (Baly et al., 2020), were used to compare the proposed models with existing ones.

Both datasets are labeled for factuality and political bias: factuality is classified as high, mixed, or low based on the legitimacy of the articles, and political bias as left, center, or right.

Five-fold cross-validation was employed to evaluate the prediction accuracy of the three models described above, individually and in combination, using the node embeddings and the factuality and political-bias labels.
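The evaluation setup can be sketched as follows, assuming that "combining" models means concatenating their per-node embeddings before classification (an illustrative assumption; the embedding values, labels, and fold logic below are synthetic stand-ins, not the paper's data):

```python
import random

rng = random.Random(0)
n_sites = 100

# Synthetic stand-ins for the learned embeddings (512-d for Node2Vec,
# 128-d for the GNNs, matching the dimensions reported in the article).
emb_node2vec = [[rng.random() for _ in range(512)] for _ in range(n_sites)]
emb_gcn = [[rng.random() for _ in range(128)] for _ in range(n_sites)]
labels = [rng.choice(["high", "mixed", "low"]) for _ in range(n_sites)]

# Assumed combination scheme: concatenate each site's embeddings.
combined = [a + b for a, b in zip(emb_node2vec, emb_gcn)]

def kfold_indices(n, k=5):
    # Split indices 0..n-1 into k contiguous folds of equal size.
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n) if j not in test]
        yield train, test

folds = list(kfold_indices(n_sites))
print(len(folds))          # 5 folds
print(len(combined[0]))    # 512 + 128 = 640 features per site
```

In each fold, a classifier would be trained on the 80% train split and scored on the held-out 20%, with accuracy averaged over the five folds.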

Experimental results for the task of predicting factuality on EMNLP-2018 are shown in the figure below.

All three models achieved better accuracy than the existing models, and combining the three yielded the highest prediction accuracy.

Next, the figure below shows the experimental results for the task of predicting political bias on EMNLP-2018.

Here, Node2Vec achieved better accuracy than the two GNN models (possibly due to node sparsity), but as with factuality prediction, combining the three models gave the highest prediction accuracy, demonstrating the effectiveness of the method.

Summary

In this article, we introduced a paper proposing a graph learning model that predicts article factuality and political bias by building a large graph of the relationships between websites and their audiences and modeling audience overlap across news media.

This experiment has some limitations: it considered only the top five websites most similar to each site, and errors are more likely for websites with small audiences. Improvements in these areas may lead to better results.

Although this experiment focused only on websites, the approach is expected to be effective for other media such as Twitter, Facebook, YouTube, and Wikipedia, and the authors are considering building a larger-scale graph that integrates these media. Future developments will be worth watching.

For those interested, details of the model architecture and experimental results can be found in the paper.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
