Competition Data

C14 - Yahoo! Learning to Rank Challenge (421 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query,url) pairs along with relevance judgments. The queries, urls and feature descriptions are not given; only the feature values are. The size of this dataset is 421 MB.

C15 - Yahoo! Music user ratings of musical tracks, albums, artists and genres, v 1.0 (1.5 Gbyte)

Yahoo! Music offers a wealth of information and services related to many aspects of music. This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical items. A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly, we have artist and genre annotations for the albums.

We provide four different versions of the dataset, differing in their number of ratings, so that researchers can select the version best suited to their needs and hardware availability. In addition, one of the datasets is oriented towards a unique learning-to-rank goal, unlike the predictive / error-minimization criteria common in recommender systems. The dataset contains ratings provided by real Yahoo! Music customers between 1999 and 2009. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.

The important and unique features of the dataset are fourfold:

1. It is of very large scale compared to other datasets in the field, e.g., 13 times larger than the Netflix Prize dataset.
2. It has a very large set of items, much larger than any similar dataset, where usually only the number of users is large.
3. There are four different categories of items, which are all linked together within a defined hierarchy.
4. It allows performing session analysis of user activities.

We expect that the novel features of the dataset will make it a subject of active research and a standard in the field of recommender systems. In particular, the dataset is expected to ignite research into algorithms that utilize the hierarchical structure annotating the item set. This dataset does not require department head approval. The size of this dataset is 1.5 Gbyte.
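To illustrate how the item hierarchy might be used, here is a minimal Python sketch. It is not part of the dataset distribution: the `Track` class, its field names, the `fallback_rating` helper and the rating values are all hypothetical, and the sketch assumes that tracks, albums, artists and genres share a single space of anonymous numeric ids. When a user has not rated a track, it backs off to the user's rating of the track's album, then its artist, then the mean rating of its genres.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Track:
    """Hypothetical record linking a track to its parents in the hierarchy."""
    track_id: int
    album_id: int
    artist_id: int
    genre_ids: Tuple[int, ...] = ()


def fallback_rating(track: Track, ratings: Dict[int, float]) -> Optional[float]:
    """Return one user's rating for the track, backing off up the hierarchy.

    `ratings` maps item id -> rating for a single user (assumed shared id space).
    """
    # Prefer the most specific item the user has rated: track, album, artist.
    for item_id in (track.track_id, track.album_id, track.artist_id):
        if item_id in ratings:
            return ratings[item_id]
    # Otherwise average the ratings of whichever genres the user has rated.
    genre_scores = [ratings[g] for g in track.genre_ids if g in ratings]
    if genre_scores:
        return sum(genre_scores) / len(genre_scores)
    return None  # The user has rated nothing related to this track.


# Illustrative ids and rating values only; they do not come from the dataset.
example_track = Track(track_id=1, album_id=10, artist_id=100, genre_ids=(1000, 1001))
print(fallback_rating(example_track, {10: 80.0}))
```

A back-off of this kind is only one simple way to exploit the hierarchy; the dataset description does not prescribe any particular method.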

C14B - Yahoo! Learn to Rank Challenge version 2.0 (616 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query,url) pairs along with relevance judgments. The queries, urls and feature descriptions are not given; only the feature values are. There are two datasets in this distribution: a large one and a small one. Each dataset is divided into 3 sets: training, validation, and test. Statistics are as follows:

                         Set 1                      Set 2
                Train      Val     Test    Train      Val     Test
# queries      19,944    2,994    6,983    1,266    1,266    3,798
# urls        473,134   71,083  165,660   34,815   34,881  103,174
# features                 519                        596

Number of features in the union of the two sets: 700; in the intersection: 415. Each feature has been normalized to be in the [0,1] range.
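As a quick consistency check on these counts, the union size follows from inclusion-exclusion: 519 + 596 - 415 = 700. The short Python sketch below redoes that arithmetic and also derives how many features are exclusive to each set; the variable names are illustrative only.

```python
# Feature counts taken from the statistics above.
set1_features = 519    # features in Set 1
set2_features = 596    # features in Set 2
shared_features = 415  # features present in both sets (the intersection)

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
union_features = set1_features + set2_features - shared_features

# Features that appear in only one of the two sets.
only_set1 = set1_features - shared_features
only_set2 = set2_features - shared_features

print(union_features, only_set1, only_set2)  # prints: 700 104 181
```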

Each url is given a relevance judgment with respect to the query. There are 5 levels of relevance from 0 (least relevant) to 4 (most relevant).
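The description above does not specify the file layout. As a sketch only, the Python snippet below assumes an SVMlight-style line of the form `<relevance> qid:<query id> <feature index>:<value> ...`, which is a common layout for ranking data; the `Judgment` type and `parse_line` function are hypothetical names. It also checks the two constraints stated here: relevance in 0 to 4 and feature values in [0,1].

```python
from typing import Dict, NamedTuple


class Judgment(NamedTuple):
    """One (query, url) pair: relevance label, query id, sparse features."""
    relevance: int
    query_id: str
    features: Dict[int, float]


def parse_line(line: str) -> Judgment:
    """Parse one line in the assumed SVMlight-style layout."""
    tokens = line.split()
    relevance = int(tokens[0])
    if not 0 <= relevance <= 4:
        raise ValueError(f"relevance must be in 0..4, got {relevance}")
    if not tokens[1].startswith("qid:"):
        raise ValueError("expected a 'qid:<id>' token after the relevance label")
    query_id = tokens[1][len("qid:"):]

    features: Dict[int, float] = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        number = float(value)
        if not 0.0 <= number <= 1.0:
            raise ValueError(f"feature {index} is outside [0,1]: {number}")
        features[int(index)] = number
    return Judgment(relevance, query_id, features)


# Made-up example line, not taken from the dataset.
print(parse_line("3 qid:17 2:0.5 10:1.0"))
```

If the actual files use a different layout, only the tokenizing logic needs to change; the range checks follow directly from the description.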