Ratings and Classification Data

R1 - Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. It contains over ten million ratings of musical artists given by Yahoo! Music users over a one-month period sometime prior to March 2004. Users are represented by anonymous numbers so that no identifying information is revealed. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, and may serve as a testbed for matrix and graph algorithms, including PCA and clustering algorithms. The size of this dataset is 423 MB.

R2 - Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0 (1.4 GB & 1.1 GB)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various songs. The dataset contains over 717 million ratings of 136 thousand songs given by 1.8 million users of Yahoo! Music services. The data was collected between 2002 and 2006. Each song in the dataset is accompanied by artist, album, and genre attributes. The users, songs, artists, and albums are represented by randomly assigned numeric IDs so that no identifying information is revealed. The mapping from genre IDs to genres, as well as the genre hierarchy, is given. The dataset comes in two parts: part one is 1.4 GB and part two is 1.1 GB.

R3 - Yahoo! Music ratings for User Selected and Randomly Selected songs, version 1.0 (1.2 MB)

This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services. The second source consists of ratings for randomly selected songs collected during an online survey conducted by Yahoo! Research. The rating data includes 15,400 users and 1,000 songs. The data contains at least ten ratings collected during normal use of Yahoo! Music services for each user, and exactly ten ratings for randomly selected songs for each of the first 5,400 users in the dataset. The dataset includes approximately 300,000 user-supplied ratings and exactly 54,000 ratings for randomly selected songs. All users and items are represented by randomly assigned numeric identification numbers. In addition, the dataset includes responses to seven multiple-choice survey questions regarding rating behavior for each of the first 5,400 users. The survey data and ratings for randomly selected songs were collected between August 22, 2006 and September 7, 2006. The normal-interaction data was collected between 2002 and 2006. The size of this dataset is 1.2 MB.

R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 (23 MB)

This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies, rated on a scale from A+ to F. Users are represented by anonymous numbers so that no identifying information is revealed. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, and awards. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, including hybrid content and collaborative filtering algorithms. The dataset may also serve as a testbed for relational learning and data mining algorithms, as well as matrix and graph algorithms, including PCA and clustering algorithms. The size of this dataset is 23 MB.

R5 - Yahoo! Delicious Popular URLs and Tags, version 1.0 (4.5 MB)

This dataset represents 100,000 URLs that were bookmarked on Delicious by users of the service. Each URL has been saved at least 100 times. For each URL, the date that it was first bookmarked by a Delicious user is indicated, along with the total number of saves. Also indicated are the ten most commonly used tags for each URL, along with the number of times each tag was used. This dataset provides a view into the nature of popular content in the Delicious social bookmarking system, including how users apply tags to individual items. The size of this dataset is 4.5 MB.

R6A - Yahoo! Front Page Today Module User Click Log Dataset, version 1.0 (1.1 GB)

Online content recommendation is an important example of an interactive machine learning problem that requires an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have been studied extensively in the machine learning and statistics literature. Because of their inherently interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, where the objective is often prediction. Our dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days of May 2009. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. To the best of our knowledge, this is the first real-world benchmark for evaluating bandit algorithms reliably. The dataset contains 45,811,883 user visits to the Today Module. For each visit, both the user and each of the candidate articles are associated with a feature vector of dimension 6 (including a constant feature), constructed using a conjoint analysis with a bilinear model; see Chu et al. [KDD 2009] for more details. The size of this dataset is 1.1 GB.
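
Because the displayed articles were chosen uniformly at random, a bandit policy can be scored offline by replaying the log: only visits where the policy picks the same article that was actually shown are counted, and the rest are skipped. The following is a minimal sketch of that replay idea, not the reference implementation of Li et al. The event tuple layout, the function name `replay_evaluate`, and the `EpsilonGreedy` policy are illustrative assumptions and do not reflect the dataset's actual file format.

```python
import random


def replay_evaluate(events, policy):
    """Replay-style offline evaluation on uniformly random logged data.

    `events` is an iterable of (context, candidate_arms, shown_arm, reward)
    tuples. This layout is an assumption made for illustration only.
    Returns (estimated click-through rate, number of matched events).
    """
    matched = 0
    total_reward = 0.0
    for context, arms, shown_arm, reward in events:
        chosen = policy.select(context, arms)
        if chosen != shown_arm:
            # The log has no feedback for the arm the policy wanted: skip.
            continue
        matched += 1
        total_reward += reward
        # The policy only learns from events it would have produced itself.
        policy.update(context, chosen, reward)
    ctr = total_reward / matched if matched else 0.0
    return ctr, matched


class EpsilonGreedy:
    """Hypothetical context-free policy used only to exercise the evaluator."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.clicks = {}
        self.shows = {}

    def select(self, context, arms):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)
        # Exploit: pick the arm with the best empirical click rate so far.
        return max(arms, key=lambda a: self.clicks.get(a, 0) / self.shows.get(a, 1))

    def update(self, context, arm, reward):
        self.shows[arm] = self.shows.get(arm, 0) + 1
        self.clicks[arm] = self.clicks.get(arm, 0) + reward


if __name__ == "__main__":
    # Tiny synthetic log: the shown arm is uniform over the candidates.
    rng = random.Random(42)
    arms = ["a", "b", "c"]
    click_prob = {"a": 0.02, "b": 0.05, "c": 0.10}  # made-up rates
    log = []
    for _ in range(20000):
        shown = rng.choice(arms)
        log.append((None, arms, shown, int(rng.random() < click_prob[shown])))
    print(replay_evaluate(log, EpsilonGreedy()))
```

With K candidate articles per visit, roughly one in K logged events matches the policy's choice, so the estimate rests on a fraction of the log; this is one reason a large log is useful for this kind of evaluation.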

R6B - Yahoo! Front Page Today Module User Click Log Dataset, version 2.0 (300 MB)

Online content recommendation is an important example of an interactive machine learning problem that requires an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have been studied extensively in the machine learning and statistics literature. Because of their inherently interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, where the objective is often prediction. Similar to the previous version, this dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo!'s front page. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. Compared to the previous version, this data covers a longer period (15 days, from October 2 to 16, 2011) and contains raw features, so that researchers can try out different feature generation methods in multi-armed bandits. The dataset contains 28,041,015 user visits to the Today Module on Yahoo!'s front page. For each visit, the user is associated with a binary feature vector of dimension 136 (including a constant feature with ID 1) that contains information about the user, such as age, gender, and behavior targeting features. For sensitivity and privacy reasons, feature definitions are not revealed, and browser cookies (bcookies) of the users are replaced with the constant string 'user'.
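
Since the user features are binary, a visit can be stored compactly as the list of feature IDs that are set and expanded into a dense vector only when an algorithm needs one. The sketch below illustrates that expansion. It assumes feature IDs run from 1 to 136 and that a record lists only the active IDs; neither detail is confirmed by the description above, and the function name `to_dense` is hypothetical.

```python
NUM_USER_FEATURES = 136  # dimension stated in the dataset description
CONSTANT_FEATURE_ID = 1  # the constant feature has ID 1


def to_dense(active_ids, dim=NUM_USER_FEATURES):
    """Expand a list of active 1-based feature IDs into a 0/1 vector.

    Assumes IDs lie in the range 1..dim (an assumption, see lead-in).
    """
    vec = [0] * dim
    for fid in active_ids:
        if not 1 <= fid <= dim:
            raise ValueError(f"feature ID {fid} is outside 1..{dim}")
        vec[fid - 1] = 1  # shift to 0-based indexing
    return vec


if __name__ == "__main__":
    # Made-up example: constant feature plus two other active features.
    user_vector = to_dense([CONSTANT_FEATURE_ID, 5, 136])
    print(len(user_vector), sum(user_vector))
```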

R9 - Yahoo! Music Internet Radio Playlist, version 1.0 (273 MB) (Hosted on AWS)

This dataset contains a snapshot of metadata collected over a period of 15 days, between September 22 and October 6, 2011. The metadata was collected across more than 4,000 Internet radio stations. For each sampled track play, we extracted the metadata, the station's local time of play, and the corresponding system time. The monitored stations are all associated with the ShoutCast directory and therefore tend to provide metadata in a similar manner.

R11 - Yahoo News Video dataset, version 1.0 (645 MB)

The dataset is a collection of 964 hours (22K videos) of news broadcast videos that appeared on Yahoo News properties (e.g., World News, US News, Sports, and Finance) and a mobile application during August 2017. The videos were either part of an article or displayed standalone in a news property. Many of the videos served on this platform lack important metadata, such as an exhaustive list of topics associated with the video. We label each of the videos in the dataset using a collection of 336 tags based on a news taxonomy designed by in-house editors. In the taxonomy, the closer a tag is to the root, the more topically generic it is.

R13 - Yahoo Knowledge Broadcasted Sports-Clocks’ Text Detection and Recognition Dataset, version 1.0 (3.7 GB) (Hosted on AWS)

This dataset contains text detection and recognition annotations for sports clock images collected from broadcast videos of NBA and soccer games. These sports clocks are overlaid at different positions on the screen during the game broadcast and vary widely in shape, size, color, and text style. They contain various semantic text strings, such as time, team, quarter, and score, which can be used to map video segments to the corresponding records in the play-by-play commentary. The dataset covers 20 styles of clocks from the NBA and 10 styles of clocks from soccer.