G9 - Wikipedia Graph and Related Entity Recommendation Dataset, version 1.0 (18.5 GB) (Hosted on AWS)

This dataset was developed to train and evaluate models that recommend related entities on Wikipedia. It has three parts: (1) a large, normalized entity graph generated in May 2020 by aggregating hyperlinks between Wikipedia pages across languages (10 million vertices and 998 million edges, each with some extra features); (2) the corresponding entity embeddings, trained from the graph using the lg2vec method (10 million vectors of dimension 200); and (3) a labeled dataset of 45k query entities, each with a list of recommended related entities, that can serve as ground truth for training and evaluating related-entity recommendation systems. We are making it available via our Webscope data-sharing program to further advance research in graph mining and entity recommendation.
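
As a rough illustration of how entity embeddings and labeled query entities of this kind might be used together, the sketch below ranks candidate entities by cosine similarity to a query entity and scores the ranking with precision at k. Everything in it is an assumption made for illustration: the entity names, the 3-dimensional toy vectors (the dataset's vectors have dimension 200), the in-memory dictionaries, and the helper names `recommend` and `precision_at_k`. It does not reflect the dataset's actual file format or any official evaluation protocol.

```python
import math

# Hypothetical toy embeddings standing in for the dataset's 200-dimensional vectors.
embeddings = {
    "Paris": [0.9, 0.1, 0.0],
    "France": [0.8, 0.2, 0.1],
    "Eiffel_Tower": [0.85, 0.05, 0.1],
    "Tokyo": [0.1, 0.9, 0.0],
    "Sushi": [0.0, 0.8, 0.3],
}

# Hypothetical labeled data: query entity -> set of related entities (ground truth).
ground_truth = {
    "Paris": {"France", "Eiffel_Tower"},
    "Tokyo": {"Sushi"},
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(query, k):
    """Return the k entities most similar to the query, excluding the query itself."""
    scored = [
        (cosine(embeddings[query], vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


def precision_at_k(query, k):
    """Fraction of the top-k recommendations that appear in the ground truth."""
    recs = recommend(query, k)
    hits = sum(1 for name in recs if name in ground_truth[query])
    return hits / k


if __name__ == "__main__":
    for q in ground_truth:
        print(q, recommend(q, 2), precision_at_k(q, 1))
```

At the scale described here (10 million vectors), a brute-force scan like this would be slow per query; an approximate nearest-neighbor index would be the usual substitute, with the evaluation step unchanged.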

