SCENT – Predicting domain credibility on the basis of backlink networks
Project overview
Websites that share low-credibility information tend to link to one another. By contrast, websites with more stringent editorial standards tend to point to other websites that publish similarly credible information. We therefore explored whether reliable domain-level credibility ratings could be derived automatically from the network of links pointing to and from a domain.
As a first step, expert input on the credibility of over 24,000 web domains was gathered and aggregated into the Consensus Credibility Scores. These domains served as a kernel around which to build a network of backlinks (identifying which websites link to which other websites).
A graph machine learning model was trained on this network to learn to predict the credibility of websites, building on the assumption that distinctive patterns of link-building can correlate with domain credibility. The Mean Absolute Error of the predictions made by the best-performing model was 0.11 on a credibility scale ranging from 0 to 1.
The resulting database, containing credibility scores both for the 24,000 expert-rated websites and for an extended set of 740,000 machine-rated domains, provides what is, to our knowledge, the most comprehensive publicly available database of domain credibility to date. Public interest projects that wish to make use of this database are encouraged to express their interest here.
Data sources and preparation
A- Seeding the network
To create a kernel of domains around which to build a network of backlinks, we curated a list of over 24,000 websites whose credibility had been rated by experts in the information field. Aggregating pre-existing, publicly-available datasets was motivated by two objectives:
- Identifying well-connected domains. Domains that have received scrutiny from credibility raters tend to be influential in the information space, and therefore likely to have dense networks of backlinks.
- Maximizing the size of the training dataset. As with any machine learning task, the larger the training dataset, the more accurate the model’s outputs. Leveraging pre-existing domain credibility ratings allowed for a list larger than any that could have been built from scratch. In addition, building on existing lists also permitted broad geographic coverage in the domains rated.
Full details on the sources and methods used to devise this domain credibility list are available here.
B- Collecting links between domains with Buzzsumo
For each domain in the Consensus Credibility Score list, we set out to collect backlinks data, so as to draw connections between different domains.
Uncurated backlinks data usually comes with significant noise, such as links from storage service providers or ad servers to a website, which we deemed a priori not a useful signal for establishing similarity between domains.
We therefore opted to use Buzzsumo data (specifically, the https://api.buzzsumo.com/search/backlinks endpoint). For any given domain, Buzzsumo returns the URLs that link to the given domain and that have been shared on social media (ranked by descending number of interactions).
As a result of this double constraint (a URL must link to the domain and have been shared on social media), the noise that usually comes with backlinks data is filtered out: unsurprisingly, social media users post and react to links containing human-readable content.
We capped the number of URLs returned for each domain at 1,000.
A directed edge between two domains was created when a URL belonging to one of the domains linked to the other domain.
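A minimal sketch of this edge-construction step. The helper names and the naive domain parsing below are ours, not the project's actual pipeline; a production system would use a public-suffix-aware library to extract registrable domains.

```python
from urllib.parse import urlparse

def domain_of(url):
    # Naive registrable-domain extraction (illustrative only).
    return urlparse(url).netloc.lower().removeprefix("www.")

def build_directed_edges(backlinks):
    # backlinks maps each target domain to the linking URLs returned
    # by the backlinks API (capped at 1,000 per domain).
    edges = set()
    for target, urls in backlinks.items():
        for url in urls[:1000]:
            source = domain_of(url)
            if source and source != target:
                edges.add((source, target))  # directed: source links to target
    return edges

edges = build_directed_edges({
    "example.org": [
        "https://www.blog.example/post-1",
        "https://news.example/story",
        "https://example.org/self-link",  # intra-domain link, dropped
    ],
})
```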
C- Collecting backlink profile similarities with Semrush
To further enrich the graph, we also collected ‘Backlink Competitors’ data from Semrush, a provider of digital marketing data. For any domain, Semrush returns up to 100 other domains which receive referrals from the same domains as the domain in question (specifically, Semrush’s measure is based on “the number of referring domains pointing to each competitor and the number of common referring domains between the two competitors”).
Adding this data was driven by the hypothesis that domains whose referring networks significantly overlap will have similar credibility. As an illustration, domains publishing anti-vaccination content frequently link to other domains that disseminate anti-vaccination content, a behavior that authoritative sources rarely exhibit.
Undirected edges were created between the queried domain and up to 100 ‘Competitor domains’ identified by Semrush.
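This step can be sketched as follows (the function name and input shape are our illustration, not Semrush's API):

```python
def undirected_edges(competitors):
    # competitors maps each queried domain to the (up to 100)
    # competitor domains returned for it.
    edges = set()
    for dom, comps in competitors.items():
        for comp in comps[:100]:
            if comp != dom:
                # Store each undirected edge once, in canonical order.
                edges.add(tuple(sorted((dom, comp))))
    return edges

edges = undirected_edges({"siteA.example": ["siteB.example", "siteA.example"]})
```

Storing each pair in sorted order deduplicates the edge regardless of which of the two domains was queried.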
Training & evaluating the prediction model
The data collection and processing steps above resulted in a graph with the following characteristics:
- 24,118 ‘seed’ nodes from the Consensus Credibility Score list, all of them with a ground-truth credibility rating between 0 and 1
- 716,857 nodes added to the network on the basis of Buzzsumo and Semrush data, all of them without a credibility rating
- 2.4 million directed edges derived from Buzzsumo backlinks data
- 2.2 million undirected edges derived from Semrush domain competitors data
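The two edge sets can be combined into a single multi-relational structure, which is the input shape a relational graph model consumes. The relation names and domains below are illustrative, not the project's actual identifiers:

```python
# Illustrative multi-relational edge store: one list of (src, dst)
# pairs per relation type.
edges_by_rel = {
    "backlink":   [("blog.example", "news.example")],    # directed (Buzzsumo)
    "competitor": [("siteA.example", "siteB.example")],  # undirected (Semrush)
}
# Undirected competitor edges are materialized in both directions so a
# message-passing layer treats them symmetrically.
edges_by_rel["competitor"] += [(dst, src) for src, dst in edges_by_rel["competitor"]]
```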
We aimed to build a model capable of predicting the credibility of a domain on the basis of its network of backlinks. To do so, we chose a Relational Graph Convolutional Network (R-GCN), as it natively handles multiple types of edges.
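The core idea of an R-GCN layer is a separate aggregation per relation type. A deliberately simplified sketch with scalar node features and scalar per-relation weights (the actual model uses learned weight matrices and nonlinearities):

```python
def rgcn_layer(h, edges_by_rel, w_self, w_rel):
    # h: node -> scalar feature; edges_by_rel: relation -> [(src, dst), ...]
    # One propagation step:
    #   h'_v = w_self * h_v + sum_r w_rel[r] * mean({h_u : u -> v via r})
    out = {v: w_self * x for v, x in h.items()}
    for rel, edges in edges_by_rel.items():
        incoming = {}
        for src, dst in edges:
            incoming.setdefault(dst, []).append(h[src])
        for v, feats in incoming.items():
            out[v] += w_rel[rel] * sum(feats) / len(feats)
    return out

h = {"a": 1.0, "b": 2.0, "c": 3.0}
out = rgcn_layer(h, {"backlink": [("a", "c"), ("b", "c")]}, 1.0, {"backlink": 0.5})
# Node "c" aggregates the mean of its in-neighbors, weighted by the relation.
```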
The model was trained to minimize prediction error on 80% of the seed nodes, with 10% used for validation and the remaining 10% held out for testing. Following graph machine learning best practice, we kept the test and validation nodes in the graph (so that the model could learn from the complete graph structure) but masked their credibility ratings (so as to prevent label leakage).
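The split itself is straightforward; a sketch of an 80/10/10 partition over the rated seed nodes (the function name and seeding are our illustration):

```python
import random

def split_seed_nodes(seed_ids, seed=0):
    # 80/10/10 train/validation/test split over the rated seed nodes.
    ids = list(seed_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train, val, test = split_seed_nodes(range(100))
```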
Various combinations of network architectures and hyperparameters were tested to select the best-performing model.
Results
The best-performing model obtained a Mean Absolute Error of 0.11 (as a reminder, the credibility scale ranges from 0 to 1) on the test set.
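For reference, Mean Absolute Error is simply the average absolute gap between predicted and ground-truth scores (the values below are made up for illustration):

```python
def mean_absolute_error(y_true, y_pred):
    # Average of |truth - prediction| over all test-set domains.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = mean_absolute_error([0.9, 0.2, 0.5], [0.8, 0.35, 0.5])
```

On a 0-to-1 credibility scale, an MAE of 0.11 means predictions deviate from the expert rating by 0.11 on average.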
Individual domain-level predictions can be accessed here. If the domain is already in the database, the API returns the precomputed score almost instantaneously. If the domain is not yet in the database, the system collects the data and outputs a prediction, which takes around 30 seconds.
If you are interested in gaining access to the full dataset (the 24k domains covered by the Consensus Credibility Scores and/or the 740k domains with a predicted credibility score), please get in touch here. We encourage all projects advancing the public interest (research, civil society, government…) to reach out.
Acknowledgements
This project was supported by the EMIF.
The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund Partners, the Calouste Gulbenkian Foundation and the European University Institute.