Graph and Social Data

G4 - Yahoo! Network Flows Data, version 1.0 (multi part) (Hosted on AWS)

Yahoo! network flows data contains communication patterns between end users on the wider Internet and Yahoo! servers. A netflow record includes timestamp, source IP address, destination IP address, source port, destination port, protocol, number of packets, and number of bytes transferred from the source to the destination. The record does not include the content of the data communication. Each netflow data file consists of sampled netflow records exported from routers in 15-minute intervals. The dataset includes netflow data files collected from three border routers on October 11, 2007. All IP addresses in the dataset are anonymized using a random permutation algorithm. There are 6 files in this dataset with sizes 7.8 Gbyte, 7.3 Gbyte, 7.8 Gbyte, 7.5 Gbyte, 7.4 Gbyte and 3.6 Gbyte.
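As a rough illustration of the record structure described above, the eight listed fields can be modeled as a small Python type. The on-disk file format is not documented here, so the comma-separated layout, the field order, and the names NetflowRecord, parse_record, and bytes_per_pair are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class NetflowRecord:
    """One sampled flow, holding the eight fields listed in the description."""
    timestamp: int   # assumed to be an integer epoch value
    src_ip: str      # anonymized source address
    dst_ip: str      # anonymized destination address
    src_port: int
    dst_port: int
    protocol: int    # assumed to be an IP protocol number (e.g. 6 for TCP)
    packets: int
    num_bytes: int


def parse_record(line: str) -> NetflowRecord:
    """Parse one line in a HYPOTHETICAL comma-separated layout."""
    fields = line.strip().split(",")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ts, src, dst, sport, dport, proto, pkts, nbytes = fields
    return NetflowRecord(int(ts), src, dst, int(sport), int(dport),
                         int(proto), int(pkts), int(nbytes))


def bytes_per_pair(records: Iterable[NetflowRecord]) -> Dict[Tuple[str, str], int]:
    """Sum the bytes transferred for each (source, destination) address pair."""
    totals: Dict[Tuple[str, str], int] = {}
    for rec in records:
        key = (rec.src_ip, rec.dst_ip)
        totals[key] = totals.get(key, 0) + rec.num_bytes
    return totals
```

Because the records are sampled, totals computed this way reflect only the sampled traffic, not the full volume that crossed the routers.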

G6 - Yahoo! Instant Messenger Friends Connectivity Graph (28 MB)

Millions use Yahoo! Messenger every day to communicate by text or by voice between PCs or from PCs to phones. This dataset contains a sample of the Yahoo! Messenger "friends graph", where users are represented as meaningless anonymous numbers so that no identifying information is revealed. Users are nodes in the graph, with edges indicating that a user is a friend of another user. The dataset consists only of the anonymous friends graph, and does not contain any information about users or discussions. The Yahoo! Messenger friends graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The total size for this dataset is 28 MB.

G5 - Yahoo! Messenger User Communication Pattern (32 MB)

This dataset contains a small sample of the Yahoo! Messenger community's communication (IM) log at a high level for a period of 4 weeks. Specifically, this dataset only records the first communication event from one user to another on a particular day, and generates such records for a period of 28 days. This dataset does not contain any specific IM content, or user identification. It is only intended to help identify the social graph based on communication patterns between a small subset of the total number of Yahoo! Messenger users. This dataset also has some information regarding the locale of users in an anonymous form. The dataset may be used by researchers to validate claims on social networking theory and corroborate their assumptions/analysis against a real time social network graph consisting of a small subset of Yahoo! Messenger users. The total size for this dataset is 32 MB.
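The recording rule described above (only the first communication event from one user to another on a given day) can be sketched as a simple de-duplication step. The (day, sender, receiver) tuple layout and the function names below are assumptions for illustration; the actual file format of the dataset is not described here.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, int, int]  # hypothetical layout: (day, sender_id, receiver_id)


def first_contacts(events: Iterable[Event]) -> List[Event]:
    """Keep only the first event per (day, sender, receiver), in input order."""
    seen = set()
    kept = []
    for event in events:
        if event not in seen:
            seen.add(event)
            kept.append(event)
    return kept


def active_days(records: Iterable[Event]) -> Counter:
    """Count, for each directed (sender, receiver) pair, the days with contact."""
    return Counter((sender, receiver) for _, sender, receiver in records)
```

Under this rule a directed pair can appear at most once per day, so over the 28-day window the per-pair count is at most 28, which gives a natural edge weight when building a social graph from the records.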

G3 - Yahoo! Groups User-Group Membership Bipartite Graph, version 1.0 (93 MB)

Millions of communities and groups use Yahoo! Groups as a meeting place and forum to discuss mutual interests on nearly any topic. This dataset contains a sample of the "membership graph" of Yahoo! Groups, where both users and groups are represented as meaningless anonymous numbers so that no identifying information is revealed. Users and groups are nodes in the membership graph, with edges indicating that a user is a member of a group. The dataset consists only of the anonymous bipartite membership graph, and does not contain any information about users, groups, or discussions. The Yahoo! Groups membership graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The size of this dataset is 93 MB.

G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS)

This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002. The dataset encodes the graph or map of links among web pages, where nodes in the graph are URLs. The Yahoo! AltaVista web graph is an example of a large real-world graph. The dataset may serve as a testbed for matrix, graph, clustering, data mining, and machine learning algorithms. There are 3 files in this dataset with sizes 3.2 Gbyte, 5.0 Gbyte and 3.4 Gbyte.

G1 - Yahoo! Search Marketing Advertiser-Phrase Bipartite Graph, Version 1.0 (14 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel." An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic: advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This is a small, completely anonymized graph reflecting the pattern of connectivity between some Yahoo! Search Marketing advertisers and some of the search keyword phrases that they bid on. Both advertisers and keyword phrases are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset contains 459,678 anonymous phrase ids, 193,582 anonymous advertiser ids, and 2,278,448 edges, each representing the act of an advertiser bidding on a phrase. The size of this dataset is 14 MB.
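The three counts quoted above are enough to derive a few summary statistics of the bipartite graph. The short sketch below only does arithmetic on those published counts; the function name bipartite_stats is an assumption, and the density figure assumes each edge is a distinct advertiser-phrase pair.

```python
def bipartite_stats(num_left: int, num_right: int, num_edges: int) -> dict:
    """Mean degree on each side and edge density of a bipartite graph."""
    return {
        "mean_degree_left": num_edges / num_left,
        "mean_degree_right": num_edges / num_right,
        # Fraction of all possible left-right pairs that are actual edges.
        "density": num_edges / (num_left * num_right),
    }


# Counts taken from the dataset description: advertisers, phrases, edges.
stats = bipartite_stats(num_left=193_582, num_right=459_678, num_edges=2_278_448)
```

With these counts, an advertiser bids on roughly 11.8 phrases on average and a phrase attracts roughly 5 advertisers, so the graph is very sparse relative to the number of possible advertiser-phrase pairs.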

G7 - Yahoo! Property and Instant Messenger Data use for a Sample of Users, v.1.0 (4.3 GB) (Hosted on AWS)

This dataset was generated starting with a sample of Yahoo! users as a seed list, then following their IM communication links out two ste[...] age, gender), Yahoo! property use, and characteristics of users' mobile devices. The size of this dataset is 4.3 Gbyte.

G8 - Yahoo! Messenger Client to Server Protocol-Level Events, version 1.0 (477 MB)

This dataset contains Yahoo! Messenger client-to-server protocol-level events, covering around 200 different opcodes. It comprises two sets of data related to Yahoo! Messenger. EVENT_DATA: a small sample of clients' protocol-level event message streams to the Yahoo! Messenger servers, collected over a period of 24 hours in June 2010. VALIDATION_DATA: client-to-client message events for the same users as in the previous set, calculated over the same 24-hour period in June 2010. The size of this dataset is 477 MB.

G9 - Wikipedia Graph and Related Entity Recommendation Dataset, version 1.0 (18.5 GB) (Hosted on AWS)

This dataset was developed to train and evaluate models for recommending related entities on Wikipedia. It consists of a large, normalized, entity graph generated in May 2020 from Wikipedia by aggregating hyperlinks between Wikipedia pages across languages (10 million vertices and 998 million edges, each with some extra features), the corresponding entity embeddings trained from the graph using the lg2vec method (10 million vectors of dimension 200), and a labeled dataset consisting of 45k query entities and their list of recommended related entities that can be used as ground truth for training and evaluating related-entity recommendation systems. We are making it available via our Webscope data-sharing program to further advance research in graph mining and entity recommendation.
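To give a sense of scale for the embeddings described above, and of how such vectors are commonly used, the sketch below estimates their in-memory size and ranks related entities by cosine similarity. Both parts rest on assumptions: 4-byte floating-point storage is not stated in the description, and cosine ranking is a common baseline rather than the method used to build the labeled dataset. The names embedding_bytes, cosine, and top_related are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding table (assumes 4-byte floats by default)."""
    return num_vectors * dim * bytes_per_value


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_related(query: str, embeddings: Dict[str, Sequence[float]], k: int = 5) -> List[str]:
    """Rank all other entities by cosine similarity to the query entity."""
    q = embeddings[query]
    scored = [(cosine(q, vec), name) for name, vec in embeddings.items() if name != query]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

Under the 4-byte assumption, 10 million vectors of dimension 200 occupy about 8 GB before any index overhead, so a brute-force scan like top_related is only practical on small subsets; a full-scale system would need an approximate nearest-neighbor index.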
