Datasets

G4 - Yahoo! Network Flows Data, version 1.0 (multi part) (Hosted on AWS)

Yahoo! network flows data contains communication patterns between end users on the Internet and Yahoo! servers. A netflow record includes timestamp, source IP address, destination IP address, source port, destination port, protocol, number of packets, and number of bytes transferred from the source to the destination. The record does not include the content of the data communication. Each netflow data file consists of sampled netflow records exported from routers in 15-minute intervals. The dataset includes netflow data files collected from three border routers on October 11, 2007. All IP addresses in the dataset are anonymized using a random permutation algorithm. There are 6 files in this dataset with sizes 7.8 Gbyte, 7.3 Gbyte, 7.8 Gbyte, 7.5 Gbyte, 7.4 Gbyte and 3.6 Gbyte.
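
The sketch below only illustrates how such flow records might be consumed; the column order, delimiter, and file name are assumptions to be checked against the dataset documentation, not the documented format.

```python
# Minimal sketch for reading sampled netflow records. The column order,
# delimiter, and file name are assumptions; the fields listed in the dataset
# description are: timestamp, src IP, dst IP, src port, dst port, protocol,
# packets, and bytes.
from collections import namedtuple, defaultdict

FlowRecord = namedtuple(
    "FlowRecord",
    ["timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
     "protocol", "packets", "bytes"],
)

def read_flows(path):
    """Yield FlowRecord objects from a whitespace-delimited flow file (assumed layout)."""
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 8:
                continue  # skip malformed lines
            yield FlowRecord(
                timestamp=float(parts[0]),
                src_ip=parts[1], dst_ip=parts[2],
                src_port=int(parts[3]), dst_port=int(parts[4]),
                protocol=parts[5],
                packets=int(parts[6]), bytes=int(parts[7]),
            )

# Example aggregation: total bytes sent to each (anonymized) destination IP.
bytes_per_dst = defaultdict(int)
for flow in read_flows("netflow_part1.txt"):  # hypothetical file name
    bytes_per_dst[flow.dst_ip] += flow.bytes
```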

R1 - Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. The dataset contains over ten million ratings of musical artists given by Yahoo! Music users over the course of a one month period sometime prior to March 2004. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms. The dataset may serve as a testbed for matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 423 MB.
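
As a rough sketch of the collaborative-filtering use case, the snippet below builds a sparse user-artist matrix from (user, artist, rating) triples and factorizes it with a truncated SVD; the tab-separated triple layout and the file name are assumptions, not the documented format.

```python
# Minimal sketch: build a sparse user-artist matrix from (user, artist, rating)
# triples and compute a low-rank factorization as a PCA-style baseline.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

users, artists, ratings = [], [], []
with open("ydata-ymusic-user-artist-ratings.txt") as fh:  # hypothetical file name
    for line in fh:
        u, a, r = line.split("\t")                        # assumed column layout
        users.append(int(u)); artists.append(int(a)); ratings.append(float(r))

n_users, n_artists = max(users) + 1, max(artists) + 1
R = csr_matrix((ratings, (users, artists)), shape=(n_users, n_artists))

# Rank-50 truncated SVD as a simple latent-factor model.
U, s, Vt = svds(R.asfptype(), k=50)
predict = lambda u, a: U[u] @ np.diag(s) @ Vt[:, a]       # predicted affinity score
```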

C14 - Yahoo! Learning to Rank Challenge (421 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query, url) pairs along with relevance judgments. The queries, urls, and feature descriptions are not given; only the feature values are. The size of this dataset is 421 MB.

L13 - Yahoo! Search Query Tiny Sample (41 K)

This dataset contains a random sample of 4496 queries posted to Yahoo's US search engine in January 2009. For privacy reasons, the query set contains only queries that have been asked by at least three different users and that contain only letters of the English alphabet, sequences of no more than four digits, and punctuation characters. The query set does not contain user information, nor does it preserve temporal aspects of the query log. Total size for this dataset is 41K.

L12 - Yahoo! Search Popularity by Location for Websites on Politician and Athletes (14 M)

This dataset contains lists of popular web sources clicked by Yahoo! users when they search for entities within two popular domains, Athletes and Politicians. In addition to the global popular list, this dataset also includes location-specific and location/entity-specific popular lists. This dataset can be used to study the location bias in entities and web sources, and therefore allows researchers to study location-specific information extraction and entity portal generation. Total size for this dataset is 14MB.

G6 - Yahoo! Instant Messenger Friends Connectivity Graph (28 MB)

Millions use Yahoo! Messenger every day to communicate by text or by voice between PCs or from PCs to phones. This dataset contains a sample of the Yahoo! Messenger "friends graph", where users are represented as meaningless anonymous numbers so that no identifying information is revealed. Users are nodes in the graph, with edges indicating that a user is a friend of another user. The dataset consists only of the anonymous friends graph, and does not contain any information about users or discussions. The Yahoo! Messenger friends graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The total size for this dataset is 28 MB.
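
A quick way to see the power-law structure mentioned above is to plot the degree distribution on log-log axes. The sketch below assumes a simple two-column edge list and a hypothetical file name.

```python
# Minimal sketch: load the anonymized friends graph as an edge list and inspect
# its degree distribution, the usual first check for a power-law graph.
from collections import Counter
import matplotlib.pyplot as plt

degree = Counter()
with open("ymessenger-friends-graph.txt") as fh:  # hypothetical file name
    for line in fh:
        u, v = line.split()                       # assumed "user user" edge format
        degree[u] += 1
        degree[v] += 1

# Count how many nodes have each degree and plot on log-log axes.
dist = Counter(degree.values())
ks, counts = zip(*sorted(dist.items()))
plt.loglog(ks, counts, marker=".", linestyle="none")
plt.xlabel("degree k"); plt.ylabel("number of nodes with degree k")
plt.show()
```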

G5 - Yahoo! Messenger User Communication Pattern (32 MB)

This dataset contains a small sample of the Yahoo! Messenger community's communication (IM) log at a high level for a period of 4 weeks. Specifically, this dataset only records the first communication event from one user to another on a particular day, and generates such records for a period of 28 days. This dataset does not contain any specific IM content, or user identification. It is only intended to help identify the social graph based on communication patterns between a small subset of the total number of Yahoo! Messenger users. This dataset also has some information regarding the locale of users in an anonymous form. The dataset may be used by researchers to validate claims on social networking theory and corroborate their assumptions/analysis against a real time social network graph consisting of a small subset of Yahoo! Messenger users. The total size for this dataset is 32 MB.

S2 - Yahoo! Statistical Information Regarding Files and Access Pattern to Files in one of Yahoo's Clusters (2.3 K)

This dataset contains the total number of files, total file size, number of file accesses, number of days between the first access and the most recent access, the file distribution, and the deletion and creation rates of files and directories in a dilithium-gold cluster. The size of this dataset is 2.3K.

S1 - Yahoo! Sherpa database platform system measurements, version 1.0 (33 K)

This dataset contains a series of traces of hardware resource usage during operation of the PNUTS/Sherpa database. The measurements include CPU utilization, memory utilization, disk utilization, network traffic, and so on. Additionally, metrics specific to particular components of the system, such as the Apache and MySQL servers, are also included. The traces were collected at 1-minute granularity during various database workloads, including read-heavy, write-heavy and scan-oriented workloads. The data can be used to analyze and simulate the bottlenecks experienced in a real cloud database system under load. The size of this dataset is 33 K.

L15 - Yahoo! Search queries that share clicked URLs with TREC queries, version 1.0 (33 K)

This dataset consists of Yahoo! search queries that share clicked URLs with TREC queries. Queries that share clicked URLs are often referred to as co-clicked queries. The TREC queries cover a subset of TREC topics 451-550, 701-850, and wt09-01 through wt09-50, which are widely used within the information retrieval community for various Web search-related experiments. The size of this dataset is 33K.

A3 - Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords, version 1.0 (845 MB)

This dataset contains a small sample of advertisers' bid and revenue information over a period of 4 months. Bid and revenue information is aggregated with a granularity of a day over advertiser account id, keyphrase and rank. Apart from bid and revenue, impressions and clicks information is also included. A keyphrase is a sequence of keywords, and a keyphrase category is itself a keyword. A keyphrase can belong to one or more keyphrase categories. Keyphrase categories used for this dataset have been listed in a separate file. Advertiser account id is represented as a meaningless string. A keyphrase is represented as a sequence of meaningless strings, where each string represents a keyword or keyphrase category. The size of this dataset is 845 MB.

A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 (25 MB)

The Yahoo!/O'Reilly Tech Buzz Game tests the theory that a free electronic market can predict trends in technology. In the game, users buy and sell fantasy "stocks" in various technologies. The prices of the stocks fluctuate according to supply and demand using a market mechanism invented at Yahoo! called a dynamic pari-mutuel market. Weekly dividends are paid out to stockholders based on the current "buzz score," or percentage of Yahoo! searches associated with each technology. This dataset shows all the transactions over the course of the Tech Buzz Game, divided into two periods. The first period spans April 1, 2005 to July 31, 2005, after which all markets were closed and cashed out according to final buzz scores at that time. The second period spans August 22, 2005 to the present, at the beginning of which new markets were established and the market reset to a starting point. Traders are represented as meaningless anonymous numbers so that no identifying information is revealed. In addition, the dataset contains buzz score data showing the daily percentages of search volume for each stock. Researchers may use the data to test the predictive value of the market or to test market behavioral theories. The size of this dataset is 25 MB.

A1 - Yahoo! Search Marketing Advertiser Bidding Data, version 1.0 (81 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel". An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic. Advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This dataset contains the bids over time of all advertisers participating in Yahoo! Search Marketing auctions for the top 1000 search queries during the period from June 15, 2002, to June 14, 2003. Advertisers' identities and query phrases are represented as meaningless anonymous numbers so that no identifying information about advertisers is revealed. The data may be used by economists or other researchers to investigate the behavior of bidders in this unique real-time auction format, responsible for roughly two billion dollars in revenue in 2005 and growing. The size of this dataset is 81 MB.

G3 - Yahoo! Groups User-Group Membership Bipartite Graph, version 1.0 (93 MB)

Millions of communities and groups use Yahoo! Groups as a meeting place and forum to discuss mutual interests on nearly any topic. This dataset contains a sample of the "membership graph" of Yahoo! Groups, where both users and groups are represented as meaningless anonymous numbers so that no identifying information is revealed. Users and groups are nodes in the membership graph, with edges indicating that a user is a member of a group. The dataset consists only of the anonymous bipartite membership graph, and does not contain any information about users, groups, or discussions. The Yahoo! Groups membership graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The size of this dataset is 93 MB.

L2 - Metadata Extracted from Publicly Available Web Pages, version 1.0 (1.5 GB & 700 MB)

The dataset contains about 100 million triples of RDF data obtained by extracting metadata from publicly available webpages. Three forms of embedded metadata are extracted: microformats (hCard, hCalendar and hReview), RDFa metadata and RDF documents linked to webpages. All metadata extracted from a webpage is converted to RDF. The data is made available in the WARC format, version 0.9. The dataset may serve as a testbed for research in scalability in the Semantic Web area and also for developing methods to deal with metadata that is incomplete, erroneous or biased in some way. The size of this dataset is 2.3 Gbyte in two parts.

R2 - Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0 (1.4 Gbyte & 1.1 Gbyte)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various songs. The dataset contains over 717 million ratings of 136 thousand songs given by 1.8 million users of Yahoo! Music services. The data was collected between 2002 and 2006. Each song in the dataset is accompanied by artist, album, and genre attributes. The users, songs, artists, and albums are represented by randomly assigned numeric id's so that no identifying information is revealed. The mapping from genre id's to genre, as well as the genre hierarchy, is given. There are 2 sets in this dataset. Part one is 1.4 Gbytes and part 2 is 1.1 Gbytes.

R3 - Yahoo! Music ratings for User Selected and Randomly Selected songs, version 1.0 (1.2 MB)

This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services. The second source consists of ratings for randomly selected songs collected during an online survey conducted by Yahoo! Research. The rating data includes 15,400 users, and 1000 songs. The data contains at least ten ratings collected during normal use of Yahoo! Music services for each user, and exactly ten ratings for randomly selected songs for each of the first 5400 users in the dataset. The dataset includes approximately 300,000 user-supplied ratings, and exactly 54,000 ratings for randomly selected songs. All users and items are represented by randomly assigned numeric identification numbers. In addition, the dataset includes responses to seven multiple-choice survey questions regarding rating-behavior for each of the first 5400 users. The survey data and ratings for randomly selected songs were collected between August 22, 2006 and September 7, 2006. The normal-interaction data was collected between 2002 and 2006. The size of this dataset is 1.2 MB.

R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 (23 MB)

This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies, rated on a scale from A+ to F. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, awards, etc. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, including hybrid content and collaborative filtering algorithms. The dataset may serve as a testbed for relational learning and data mining algorithms as well as matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 23 MB.

R5 - Yahoo! Delicious Popular URLs and Tags, version 1.0 (4.5MB)

This dataset represents 100,000 URLs that were bookmarked on Delicious by users of the service. Each URL has been saved at least 100 times. For each URL, the date that it was first bookmarked by a Delicious user is indicated, along with the total number of saves. Also indicated are the ten most commonly used tags for each URL, along with the number of times each tag was used. This dataset provides a view into the nature of popular content in the Delicious social bookmarking system, including how users apply tags to individual items. The size of this dataset is 4.5 MB.

L1 - Yahoo! N-Grams, version 2.0 (multi part) (Hosted on AWS)

This dataset contains n-grams (contiguous sets of words of size n), n = 1 to 5, extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12000 news-oriented sites. The documents were published on these sites between February 2006 and December 2006. The dataset does not contain the documents themselves, but only the n-grams that occur at least twice. It provides statistics such as frequency of occurrence, number and entropy of different left (right) single-token contexts of each n-gram. This dataset may be used by researchers to build statistical language models for speech or handwriting recognition or machine translation. There are 3 files in this dataset. They are 3.5 Gbyte, 4.3 Gbyte and 4.4 Gbyte.
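
The left-context entropy statistic mentioned above can be illustrated with a small worked example: take the frequencies of the distinct single tokens that appear immediately to the left of an n-gram, normalize them, and compute the Shannon entropy. The counts below are made up for illustration.

```python
# Worked example of the left-context entropy statistic: entropy (in bits) of
# the distribution over distinct single-token left contexts of an n-gram.
import math
from collections import Counter

def left_context_entropy(left_context_counts: Counter) -> float:
    total = sum(left_context_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in left_context_counts.values())

# e.g. a bigram preceded by "in" 70 times, "of" 20 times, and "to" 10 times:
print(left_context_entropy(Counter({"in": 70, "of": 20, "to": 10})))  # ~1.16 bits
```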

G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS)

This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002. The dataset encodes the graph or map of links among web pages, where nodes in the graph are URLs. The Yahoo! AltaVista web graph is an example of a large real-world graph. The dataset may serve as a testbed for matrix, graph, clustering, data mining, and machine learning algorithms. There are 3 files in this dataset with sizes 3.2 Gbyte, 5.0 Gbyte and 3.4 Gbyte.

L3 - Yahoo! Semantically Annotated Snapshot of the English Wikipedia, version 1.0 (multi part)

This SW1 dataset contains a snapshot of the English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. In order to build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam. This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text is extracted from each XML entry and split into sentences using simple heuristics. Then several syntactic and semantic NLP taggers were run on it and their output collected. Raw data (multitag format): the multitag format contains all the Wikipedia text plus all the semantic tags, and all other data files can be reconstructed from it. A multitag file contains several Wikipedia entries; the Wikipedia snapshot was cut into 3000 multitag files, each containing roughly 500 entries. There are 4 files in this dataset ranging in size from 1.3 Gbyte to 1.8 Gbyte.

L4 - Yahoo! Answers Manner Questions, version 1.0 (102 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 102 MB.
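
As an illustration of the quality filter described above (at least four words, at least one noun, and at least one verb), the sketch below applies the same rule with NLTK's default tokenizer and POS tagger; this is an assumption-laden approximation, not the exact pipeline used to build the corpus.

```python
# Minimal sketch of the described quality filter using NLTK's default tools.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def passes_quality_filter(text: str) -> bool:
    tokens = nltk.word_tokenize(text)
    if len(tokens) < 4:
        return False
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    has_noun = any(tag.startswith("NN") for tag in tags)
    has_verb = any(tag.startswith("VB") for tag in tags)
    return has_noun and has_verb

print(passes_quality_filter("How do I replace a bicycle chain?"))  # True
```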

L5 - Yahoo! Answers Manner Questions, version 2.0 (104 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 104 MB.

L6 - Yahoo! Answers Comprehensive Questions and Answers version 1.0 (multi part)

Yahoo! Answers is a web site where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is the Yahoo! Answers corpus as of 10/25/2007. It includes all the questions and their corresponding answers. The corpus distributed here contains 4,483,032 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs and all user ids were anonymized so that no identifying information is revealed. This dataset may be used by researchers to learn and validate answer extraction models. An example of such work was published by Surdeanu et al. (2008). There are 2 files in this dataset. Part 1 is 1.7 Gbyte and part 2 is 1.9 Gbyte.

L8 - Yahoo! Search Query Logs for Nine Languages, version 1.0 (45 K)

This dataset contains the 1000 most frequent web search queries issued to Yahoo! Search for nine different languages. The languages covered are Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The dataset may be useful for various information retrieval and data mining research investigations, especially those involving cross- or multi-lingual search tasks. The size of this dataset is 45 K.

L9 - Yahoo! Answers Question Types Sample of 1000, version 1.0 (14 K)

This dataset contains URLs of questions posted to Yahoo! Answers, along with the question types assigned to these questions by human judges. The question types are "informational", "advice", "opinion", and "polling". The size of this dataset is 14 K.

G1 - Yahoo! Search Marketing Advertiser-Phrase Bipartite Graph, Version 1.0 (14 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel." An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic: advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This is a small, completely anonymized graph reflecting the pattern of connectivity between some Yahoo! Search Marketing advertisers and some of the search keyword phrases that they bid on. Both advertisers and keyword phrases are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset contains 459,678 anonymous phrase ids, 193,582 anonymous advertiser ids, and 2,278,448 edges, representing the act of an advertiser bidding on a phrase. The size of this dataset is 14 MB.

G7 - Yahoo! Property and Instant Messenger Data use for a Sample of Users, v.1.0 (4.3 Gb) (Hosted on AWS)

This dataset was generated starting with a sample of Yahoo! users as a seed list, then following their IM communication links out two steps. The dataset contains user characteristics (e.g., age, gender), Yahoo! property use, and characteristics of users' mobile devices. The size of this dataset is 4.3 Gbyte.

G8 - Yahoo! Messenger Client to Server Protocol-Level Events, version 1.0 (477 MB)

Yahoo! Messenger client-to-server protocol-level events, including around 200 different opcodes. This dataset contains 2 sets of data related to Yahoo! Messenger. EVENT_DATA: a small sample of clients' protocol-level event message streams to the Yahoo! Messenger servers, collected over a period of 24 hours in June 2010. VALIDATION_DATA: client-to-client message events for the same users as in the previous set, calculated over the same 24-hour period in June 2010. The size of this dataset is 477 MB.

L16 - Yahoo! Answers Query to Questions (1.5 MB)

This dataset contains a small sample of Yahoo! Answers question/answers pages visited following search engine queries in August 2010. The dataset also contains user ratings of query clarity, query-question match, and query-answer satisfaction collected using Amazon Mechanical Turk. The dataset may be used by researchers to validate algorithms to predict searcher satisfaction with existing community-based answers. It may also enable researchers to validate algorithms to predict query clarity and query-question match. The size of this dataset is 1.5 MB.

C15 - Yahoo! Music user ratings of musical tracks, albums, artists and genres, v 1.0 (1.5 Gbyte)

Yahoo! Music offers a wealth of information and services related to many aspects of music. This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical items. A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly we have artist and genre annotation for the albums. We provide four different versions of the dataset, differing by their number of ratings, so that researchers can select the version best catering to their needs and hardware availability. In addition, one of the datasets is oriented towards a unique learning-to-rank goal, unlike the predictive / error-minimization criteria common in recommender systems. The dataset contains ratings provided by true Y! Music customers during 1999-2009. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed. The important and unique features of the dataset are fourfold: it is of very large scale compared to other datasets in the field (e.g., 13X larger than the Netflix Prize dataset); it has a very large set of items, much larger than in similar datasets, where usually only the number of users is large; there are four different categories of items, all linked together within a defined hierarchy; and it allows session analysis of user activities. We expect that the novel features of the dataset will make it a subject of active research and a standard in the field of recommender systems. In particular, the dataset is expected to ignite research into algorithms that utilize the hierarchical structure annotating the item set. This dataset does not require department head approval. The size of this dataset is 1.5 Gbyte.

R6A - Yahoo! Front Page Today Module User Click Log Dataset, version 1.0 (1.1 GB)

Online content recommendation represents an important example of interactive machine learning problems that require an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have received extensive research in the machine learning and statistics literature. Due to their inherent interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, whose objective is often prediction. Our dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days in May 2009. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. To the best of our knowledge, this is the first real-world benchmark for evaluating bandit algorithms reliably. The dataset contains 45,811,883 user visits to the Today Module. For each visit, both the user and each of the candidate articles are associated with a feature vector of dimension 6 (including a constant feature), constructed using a conjoint analysis with a bilinear model; see Chu et al. [KDD 2009] for more details. The size of this dataset is 1.1 GB.
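
The unbiased offline evaluation enabled by the uniformly random logging policy can be sketched as a simple replay loop: keep only the logged events where the candidate policy would have shown the same article that was actually displayed, and average the logged clicks over those matches. The event-tuple layout below is an assumption about the parsed log, not the released file format.

```python
# Minimal sketch of replay-style offline evaluation of a bandit policy,
# valid because the logged articles were chosen uniformly at random.
def replay_evaluate(policy, events):
    """
    policy(user_features, candidate_articles) -> chosen article id
    events: iterable of (user_features, candidate_articles, shown_article, clicked)
    Returns the estimated click-through rate of `policy`.
    """
    clicks, matches = 0, 0
    for user_features, candidates, shown, clicked in events:
        if policy(user_features, candidates) == shown:
            matches += 1
            clicks += clicked
    return clicks / matches if matches else float("nan")

# A trivial baseline policy: always pick the first candidate article.
first_article = lambda user_features, candidates: candidates[0]
```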

L18 - Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 Gbyte)

Anonymized Yahoo! Search Logs with Relevance Judgments, version 1.0. The size of this dataset is 1.3 Gbyte.

R6B - Yahoo! Front Page Today Module User Click Log Dataset, version 2.0 (300 MB)

Online content recommendation represents an important example of interactive machine learning problems that require an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have received extensive research in the machine learning and statistics literature. Due to their inherent interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, whose objective is often prediction. Similar to the previous version, this dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo!'s front page. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. Compared to the previous version, this data is larger (containing 15 days of data from October 2 to 16, 2011), and contains raw features (so that researchers can try out different feature generation methods in multi-armed bandits). The dataset contains 28,041,015 user visits to the Today Module on Yahoo!'s front page. For each visit, the user is associated with a binary feature vector of dimension 136 (including a constant feature with ID 1) that contains information about the user like age, gender, behavior targeting features, etc. For sensitivity and privacy reasons, feature definitions are not revealed, and browser cookies (bcookies) of the users are replaced with a constant string 'user'.

L11 - HTML Forms Extracted from Publicly Available Webpages, version 1.0 (50Gb+) (Hosted on AWS)

This dataset contains a small sample of pages that contain complex HTML forms. Complex forms are HTML forms that have three or more form controls such as input tags (type: text|checkbox|radio|image|button) and select tags (dropdown boxes). The dataset contains 2.67 million complex forms. Such data may be useful for form classification and in uncovering hidden web data. This dataset is very large, over 50 Gbyte in total.

L19 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains a large sample of noun phrases and their context, extracted from Yahoo! News data. The data can be used for AI and NLP studies.

L20 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains browsing behavior data for a collection of users on Yahoo! Answers from several months in 2006. Users interact socially and are rewarded by a point system based on their question-and-answer activity. The data includes questions, answers, and browsing behavior for users on the site. There is no textual or NLP information. The data may be used with machine learning techniques to discern the different types of users on Yahoo! Answers and how they interact with the site by asking and answering questions or browsing, or to test models of different classes of users based on how they move around the site and interact with the point system and with each other.

R9 - Yahoo! Music Internet Radio Playlist, version 1.0 (273 MB) (Hosted on AWS)

This dataset contains a snapshot of metadata collected during a period of 15 days between September 22nd and October 6th, 2011. The metadata was collected across more than 4K Internet radio stations. For each sampled track play, we extracted the metadata, the station's local time of play, and the corresponding system time. The monitored stations are all associated with the ShoutCast directory and therefore tend to provide metadata in a similar manner.

L21 - Yahoo! Answers Query To Questions, version 2.0 (24K)

This dataset contains a sample of Yahoo! Search queries issued in June 2011 by searchers who posted a relevant question on Yahoo! Answers shortly after searching. It also contains the category information of the posted question. The dataset may be used by researchers to better understand the transition from searchers to askers, especially the information needs causing the transition.

L22 - Yahoo! News Sessions Content, version 1.0 (16 MB)

This dataset contains a small sample of user sessions that contained a click in the Yahoo! News domain, along with the contents of a number of news articles present in those user sessions. Users and textual content are represented as meaningless anonymous numbers so that no identifying information is revealed. The textual content includes the tokens (words) found in the news articles, the article publication date, time expressions found in the news articles, and entities (locations, persons and organizations). The dataset includes a ground truth file that contains, for a given article, what articles have been clicked next in a user trail. The dataset may be used by researchers to validate content-based recommender systems or ranking algorithms. The dataset may serve as a testbed for user-trail and content-based mining and recommendation algorithms.

C14B - Yahoo! Learn to Rank Challenge version 2.0 (616 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query, url) pairs along with relevance judgments. The queries, urls, and feature descriptions are not given; only the feature values are. There are two datasets in this distribution: a large one and a small one. Each dataset is divided into 3 sets: training, validation, and test. Statistics are as follows:

              Set 1                          Set 2
              Train     Val       Test       Train     Val       Test
# queries     19,944    2,994     6,983      1,266     1,266     3,798
# urls        473,134   71,083    165,660    34,815    34,881    103,174
# features              519                            596

Number of features in the union of the two sets: 700; in the intersection: 415. Each feature has been normalized to be in the [0,1] range.

Each url is given a relevance judgment with respect to the query. There are 5 levels of relevance from 0 (least relevant) to 4 (most relevant).
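
For reference, a common way to score a ranking against such graded judgments is (N)DCG with exponential gain; the snippet below is a minimal sketch of that metric, not part of the challenge distribution itself.

```python
# Minimal sketch of DCG/NDCG over the 0-4 relevance grades described above,
# using the exponential gain (2^rel - 1) that is standard for graded judgments.
import math

def dcg(relevances, k=10):
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the urls for one query, in the order a ranker returned them:
print(ndcg([4, 2, 0, 3, 1], k=5))
```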



L23 - Yahoo Answers Synthetic Questions, version 1.0

Yahoo Answers Synthetic Questions, version 1.0

L24 - Yahoo Search Query Log To Entities, version 1.0 (1.7 MB)

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a "span"), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.

L25 - Yahoo N-Gram Representations, version 2.0 (2.6Gb) (Hosted on AWS)

This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and sentence similarity tasks, which are common in NLP research. We would like researchers to be able to produce query rewrites based on these representations and test them against other state-of-the-art techniques.

S5 - A Labeled Anomaly Detection Dataset, version 1.0 (16 MB)

Automatic anomaly detection is critical in today's world where the sheer volume of data makes it impossible to tag outliers manually. The goal of this dataset is to benchmark your anomaly detection algorithm. The dataset consists of real and synthetic time-series with tagged anomaly points. The dataset tests the detection accuracy of various anomaly-types including outliers and change-points. The synthetic dataset consists of time-series with varying trend, noise and seasonality. The real dataset consists of time-series representing the metrics of various Yahoo services.
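
A minimal baseline for this benchmark is a rolling z-score detector scored against the tagged anomalies; the sketch below assumes hypothetical column names ("value", "is_anomaly") and a hypothetical file name, so adapt them to the actual CSV layout.

```python
# Minimal sketch: flag points whose rolling z-score exceeds a threshold and
# compare the flags against the labeled anomalies.
import pandas as pd

def rolling_zscore_flags(values: pd.Series, window: int = 50, threshold: float = 4.0) -> pd.Series:
    mean = values.rolling(window, min_periods=window).mean()
    std = values.rolling(window, min_periods=window).std()
    z = (values - mean) / std
    return z.abs() > threshold

df = pd.read_csv("real_1.csv")                       # hypothetical file name
predicted = rolling_zscore_flags(df["value"])
labels = df["is_anomaly"].astype(bool)
precision = (predicted & labels).sum() / max(predicted.sum(), 1)
recall = (predicted & labels).sum() / max(labels.sum(), 1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```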

L26 - Yahoo! Answers consisting of questions asked in French, version 1.0 (3.8Gb) (Hosted on AWS)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. We only include questions which have been resolved, that is, questions which have received one or more answers. The dataset may serve as a testbed for multilingual question answering systems as well as research into user behavior on community question answering sites in other languages.

L27 - Yahoo Answers Factoids Queries, version 1.0 (3.5MB)

The dataset includes English queries that were input to a search engine in 2012-2014 and identified as "factoid" queries, i.e., referring to a short fact (filtered by the answer being no longer than 3 words). These queries were identified based on questions in English on Yahoo Answers that have a short best answer and a link to English Wikipedia. The dataset includes the query, its corresponding question title, the best answer, a number indicating the occurrence frequency of the query, the link(s) to English Wikipedia, and the URL of the Yahoo Answers page.

A4 - Yahoo Data Targeting User Modeling, Version 1.0 (3.7 Gb) (Hosted on AWS)

This data set contains a small sample of user profiles and their interests generated from several months of user activities at Yahoo webpages. Each user is represented as one feature vector and its associated labels, where all user identifiers were removed. Feature vectors are derived from user activities during a training period of 90 days, and labels from a test period of 2 weeks that immediately followed the training period. Each dimension of the feature vector quantifies a user activity with a certain interest category from an internal Yahoo taxonomy (e.g., "Sports/Baseball", "Travel/Europe"), calculated from user interactions with pages, ads, and search results, all of which are internally classified into these interest categories. The labels are derived in a similar way, based on user interactions with classified pages, ads, and search results during the test period. It is important to note that there exists a hierarchical structure among the labels, which is also provided in the data set.

All feature and label names are replaced with meaningless anonymous numbers so that no identifying information is revealed. The dataset is of particular interest to Machine Learning and Data Mining communities, as it may serve as a testbed for classification and multi-label algorithms, as well as for classifiers that account for structure among labels.
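
As a sketch of the plain multi-label baseline (one independent binary classifier per label, ignoring the label hierarchy), the snippet below uses scikit-learn's one-vs-rest wrapper on placeholder data standing in for the parsed feature vectors and label matrix.

```python
# Minimal multi-label baseline sketch; X and Y are random placeholders for the
# parsed user feature vectors and the binary label indicator matrix.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 50))                      # placeholder user feature vectors
Y = (rng.random((1000, 8)) > 0.9).astype(int)   # placeholder multi-label targets

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
scores = clf.predict_proba(X[:5])               # per-label probabilities for 5 users
```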

L28 - Yahoo Answers Query Treebank, version 1.0 (456KB)

User queries that were issued to the Yahoo Web search engine and for which the user ended up clicking a Yahoo Answers result. The queries are tagged by linguists for syntactic segmentation and for the dependency parse tree within each segment.

L29 - Yahoo Answers Novelty Based Answer Ranking, version 1.0 (331KB)

CQA healthcare-related answers annotated with mappings between textual propositions within each answer and the aspects relevant to the target question.

L30 - Model files for Fast Entity Linker, version 1.0 (5.2G) (Hosted on AWS)

Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. A major challenge is to be able to accurately detect entities, in new languages, at scale, with limited labeled data available, and while consuming a limited amount of resources (memory and processing power).

In this dataset we release datapacks for English, Spanish, and Chinese built for our unsupervised, accurate, and extensible multilingual named entity recognition and linking system Fast Entity Linker.

In our system, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data-structures and aggressive hashing functions. The models released in this dataset include "entity embeddings" and "wikipedia click-log data".

Entity Embeddings are vector-based representations that capture how entities are referred to in language contexts. We train entity embeddings using Wikipedia articles and use hyperlinks in the articles to their canonical forms for their associated entities.

Wikipedia click-logs give very useful signals to disambiguate partial or ambiguous entity mentions, such as Obama (Michelle or Barack), Liverpool (City or Football team), or Fox (Person or Organization). We extract the in-wiki links from Wikipedia and create pairs (alias, entity), where the alias is the text in the anchor and the entity is the id of the Wikipedia page pointed to by the outgoing link.
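
A minimal sketch of how such (alias, entity) pairs support disambiguation: estimate P(entity | alias) from the pair counts and link each mention to its most probable entity. The toy pairs below are illustrative, not taken from the released data.

```python
# Minimal sketch: most-probable-entity linking from (alias, entity) pair counts.
from collections import Counter, defaultdict

alias_entity_counts = defaultdict(Counter)
pairs = [("obama", "Barack_Obama"), ("obama", "Barack_Obama"),
         ("obama", "Michelle_Obama"), ("liverpool", "Liverpool_F.C."),
         ("liverpool", "Liverpool")]                       # toy data
for alias, entity in pairs:
    alias_entity_counts[alias][entity] += 1

def link(alias: str) -> str:
    counts = alias_entity_counts[alias]
    total = sum(counts.values())
    entity, count = counts.most_common(1)[0]
    print(f"P({entity} | {alias}) = {count / total:.2f}")
    return entity

print(link("obama"))   # Barack_Obama, with estimated probability 0.67
```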

L31 - Questions on Yahoo Answers labeled as either informational or conversational, version 1.0 (766KB)

The dataset includes non-deleted English questions from Yahoo Answers, posted between the years 2006 and 2016, sampled uniformly at random. Each question includes a URL to its Yahoo Answers page, its title, description, high-level category (one of 26), direct category, and a label marking it as informational ('0') or conversational ('1'). A small subset of the questions is marked as borderline ('2').

L32 - The Yahoo News Annotated Comments Corpus, version 1.0 (47MB)

The dataset contains comment threads posted in response to online news articles. We annotated the dataset at the comment-level and the thread-level. The annotations include 6 dimensions of individual comments and 3 dimensions of threads on the whole. The coding was done by professional, trained editors and untrained crowdsourced workers. The corpus contains annotations for a novel corpus of 2.4k threads and 9.2k comments from Yahoo News and 1k threads from Internet Argument Corpus.

L33 - Yahoo News Ranked Multi-label Corpus, version 1.0 (59 MB)

Tagging textual documents/articles with relevant tags is an important problem for many applications including Yahoo News, Newsroom, Tumblr, and other textual media platforms. Multilabel learning is at the core of this problem and has recently seen renewed interest. There are many standard datasets available for this task, but all of them provide features rather than the actual text of the documents. This corpus provides the actual text so that researchers can derive their own features that work best for their algorithms. Apart from that, this corpus is, to the best of our knowledge, the only one that provides a ranking of labels for each document in terms of their importance. Related publications to be cited: 1. "RIPML: A Restricted Isometry based Approach to Multilabel Learning", Akshay Soni and Yashar Mehdad, FLAIRS 2017. 2. "DocTag2Vec: An Embedding Based Multilabel Learning Approach for Document Tagging", Sheng Chen, Aasish Pappu, Akshay Soni and Yashar Mehdad, 2017.

R11 - Yahoo News Video dataset, version 1.0 (645MB)

The dataset is a collection of 964 hours (22K videos) of news broadcast videos that appeared on Yahoo news website's properties, e.g., World News, US News, Sports, Finance, and a mobile application during August 2017. The videos were either part of an article or displayed standalone in a news property. Many of the videos served in this platform lack important metadata, such as an exhaustive list of topics associated with the video. We label each of the videos in the dataset using a collection of 336 tags based on a news taxonomy designed by in-house editors. In the taxonomy, the closer the tag is to the root, the more generic (topically) it is.

G9 - Wikipedia Graph and Related Entity Recommendation Dataset, version 1.0 (18.5 GB) (Hosted on AWS)

This dataset was developed to train and evaluate models for recommending related entities on Wikipedia. It consists of a large, normalized, entity graph generated in May 2020 from Wikipedia by aggregating hyperlinks between Wikipedia pages across languages (10 million vertices and 998 million edges, each with some extra features), the corresponding entity embeddings trained from the graph using the lg2vec method (10 million vectors of dimension 200), and a labeled dataset consisting of 45k query entities and their list of recommended related entities that can be used as ground truth for training and evaluating related-entity recommendation systems. We are making it available via our Webscope data-sharing program to further advance research in graph mining and entity recommendation.
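
A minimal sketch of the recommendation use case: load the entity embeddings, unit-normalize them, and return the nearest neighbors of a query entity by cosine similarity. The one-vector-per-line file layout and the file name are assumptions about the release format.

```python
# Minimal sketch: related-entity recommendation by cosine similarity over the
# released entity embeddings (assumed layout: "entity_id dim1 ... dim200" per line).
import numpy as np

def load_embeddings(path):
    ids, vecs = [], []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            ids.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])
    return ids, np.array(vecs)

ids, E = load_embeddings("entity_embeddings.txt")    # hypothetical file name
E = E / np.linalg.norm(E, axis=1, keepdims=True)     # unit-normalize rows

def related(entity_id, k=10):
    q = E[ids.index(entity_id)]
    scores = E @ q                                   # cosine similarity to all entities
    top = np.argsort(-scores)[1:k + 1]               # skip the query entity itself
    return [ids[i] for i in top]
```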

A5 - Yahoo! User prospective conversion prediction dataset, version 2.0 (56GB)

We share user activity trail datasets: timestamp-annotated sequences of activities collected from users online, derived from various sources, e.g., Yahoo Search; commercial email receipts; reading news and other content on publishers' webpages associated with Verizon Media, such as the Yahoo and AOL homepages, Yahoo Finance, Sports and News, HuffPost, and TechCrunch; and advertising data from Yahoo Gemini and the Verizon Media DSP, including ad activity and advertiser data (e.g., ad impressions, clicks, conversions, and site visits). These sequences precede events of particular interest to advertisers. Two types of events of interest are considered: conversions, in both retargeting and prospecting setups, and retargeting events. These three setups create three datasets for each of two major advertisers, from the retail and communications categories, running conversion campaigns with Verizon Media, collected over 100 days ending in May 2019, for a total of 6 distinct datasets.

- The first dataset type is a conversion prediction dataset with a sequence of all user activities preceding the conversion with an advertiser.
- The second dataset type is a prospective conversion prediction dataset with a sequence of eligible (non-retargeting) user activities preceding the conversion with an advertiser.
- The third dataset contains a sequence of all user activities preceding the first recorded retargeting event of a user.

For dataset labels, Advertiser 1 defined three unique conversions, while Advertiser 2 defined one; both advertisers defined a single retargeting rule.

These datasets contain no user information; activities are anonymized and timestamp information is misaligned between different user trails while preserving all necessary information. Researchers may validate conversion prediction and other systems and algorithms run on users' historical activity data. The dataset may serve as a testbed for modeling sequences of users' activities and modeling temporal information through Deep Learning or other Machine Learning and Statistics techniques for a variety of tasks in supervised and unsupervised learning.

A6 - Yahoo! Auction State for a Sample of Real-Time Bid Video Ads Version 1.0 (8.4 MB)

The dataset contains a small sample of the video ads running on the Verizon SSP during 2018, and the auctions in which those ads participated. For each date, hour, and ad, a series of binned bid prices from 0 to 50 in increments of 0.10 is shown, along with the number of impressions that could have been bought at each price.
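
Assuming the per-bin impression counts are incremental rather than already cumulative, a bid landscape for one date, hour, and ad can be sketched as the sum of impressions available at or below a given bid; the dict-of-bins representation is an assumption about the parsed file.

```python
# Minimal sketch: impressions winnable at a given bid, assuming incremental bins.
def impressions_at_bid(price_bins: dict, bid: float) -> int:
    """price_bins maps a bin price (0.0, 0.1, ..., 50.0) to impressions available at that price."""
    return sum(imps for price, imps in price_bins.items() if price <= bid)

# Toy bins: 120 impressions clear at 0.10, 80 more at 0.20, 40 more at 0.30.
bins = {0.10: 120, 0.20: 80, 0.30: 40}
print(impressions_at_bid(bins, 0.20))   # 200
```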

R13 - Yahoo Knowledge Broadcasted Sports-Clocks’ Text Detection and Recognition Dataset, version 1.0 (3.7 GB) (Hosted on AWS)

This dataset contains text detection and recognition annotations for sports clocks images collected from sports broadcast videos of NBA and Soccer. These sports clocks are overlaid at different positions on the screen during the game broadcast and are very rich in shapes, sizes, colors and text styles. They contain various semantic text strings like time, team, quarter and scores which can be used to map video segments with the corresponding record in the play-by-play commentary. It has 20 styles of clocks from the NBA and 10 styles of clocks from Soccer.

A7 - Challenging Fashion Queries dataset, version 1.0 (199 KB)

This dataset contains annotator judgments of the accuracy and reasonableness of a set of apparel products as responses to queries, where each query consists of a product together with a short caption describing how the response product should be different. There are three categories of products: 134 “dresses”, 100 “gowns” and 100 “sundresses”, each represented by an image to which the dataset provides a URL. Each of 40 queries (20 dresses, 10 gowns, 10 sundresses) consists of one of the products plus a short caption.