Ratings and Classification Data

All datasets have been reviewed to conform to Yahoo's data protection standards, including strict controls on privacy. We have a number of datasets that we are excited to share with you. Learn how to get involved.

R1 - Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. The dataset contains over ten million ratings of musical artists given by Yahoo! Music users over the course of a one-month period sometime prior to March 2004. Users are represented by meaningless anonymous numbers so that no identifying information is revealed. Researchers may use the dataset to validate recommender systems or collaborative filtering algorithms, and it may also serve as a testbed for matrix and graph algorithms, including PCA and clustering algorithms. The size of this dataset is 423 MB.

R2 - Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0 (1.4 Gbyte & 1.1 Gbyte)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various songs. The dataset contains over 717 million ratings of 136 thousand songs given by 1.8 million users of Yahoo! Music services. The data was collected between 2002 and 2006. Each song in the dataset is accompanied by artist, album, and genre attributes. The users, songs, artists, and albums are represented by randomly assigned numeric IDs so that no identifying information is revealed. The mapping from genre IDs to genres, as well as the genre hierarchy, is given. The dataset comes in two parts: part one is 1.4 Gbytes and part two is 1.1 Gbytes.

R3 - Yahoo! Music ratings for User Selected and Randomly Selected songs, version 1.0 (1.2 MB)

This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services. The second source consists of ratings for randomly selected songs collected during an online survey conducted by Yahoo! Research. The rating data covers 15,400 users and 1,000 songs. For each user, the data contains at least ten ratings collected during normal use of Yahoo! Music services; for each of the first 5,400 users in the dataset, it also contains exactly ten ratings for randomly selected songs. The dataset includes approximately 300,000 user-supplied ratings and exactly 54,000 ratings for randomly selected songs. All users and items are represented by randomly assigned numeric identification numbers. In addition, the dataset includes responses to seven multiple-choice survey questions about rating behavior for each of the first 5,400 users. The survey data and ratings for randomly selected songs were collected between August 22, 2006 and September 7, 2006. The normal-interaction data was collected between 2002 and 2006. The size of this dataset is 1.2 MB.

R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 (23 MB)

This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies, rated on a scale from A+ to F. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, awards, etc. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, including hybrid content and collaborative filtering algorithms. The dataset may serve as a testbed for relational learning and data mining algorithms as well as matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 23 MB.

R5 - Yahoo! Delicious Popular URLs and Tags, version 1.0 (4.5MB)

This dataset represents 100,000 URLs that were bookmarked on Delicious by users of the service. Each URL has been saved at least 100 times. For each URL, the date that it was first bookmarked by a Delicious user is indicated, along with the total number of saves. Also indicated are the ten most commonly used tags for each URL, along with the number of times each tag was used. This dataset provides a view into the nature of popular content in the Delicious social bookmarking system, including how users apply tags to individual items. The size of this dataset is 4.5 MB.

R6A - Yahoo! Front Page Today Module User Click Log Dataset, version 1.0 (1.1 GB)

Online content recommendation is an important example of an interactive machine learning problem that requires an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have been studied extensively in the machine learning and statistics literature. Because of their inherently interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other areas of machine learning or recommendation, where the objective is often prediction. Our dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days of May 2009. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. To the best of our knowledge, this is the first real-world benchmark for evaluating bandit algorithms reliably. The dataset contains 45,811,883 user visits to the Today Module. For each visit, both the user and each of the candidate articles are associated with a feature vector of dimension 6 (including a constant feature), constructed using a conjoint analysis with a bilinear model; see Chu et al. [KDD 2009] for more details. The size of this dataset is 1.1 GB.
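Because the logged articles were chosen uniformly at random, a candidate bandit policy can be scored offline by replaying the log and keeping only the visits where the policy picks the article that was actually displayed. The Python sketch below illustrates that replay idea in the spirit of Li et al. [WSDM 2011]; it is not the authors' code. The event layout (context, candidate IDs, displayed ID, click) and the toy `GreedyCTR` policy are assumptions made for illustration and do not reflect the dataset's actual file format.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory event layout (not the dataset's file format):
# (user context features, candidate article ids, displayed article id, click 0/1)
Event = Tuple[List[float], List[int], int, int]


class GreedyCTR:
    """Toy context-free policy: pick the candidate with the best observed CTR."""

    def __init__(self) -> None:
        self.shows: Dict[int, int] = {}
        self.clicks: Dict[int, int] = {}

    def choose(self, context: List[float], candidates: List[int]) -> int:
        # Ties are broken by candidate order (max returns the first maximum).
        return max(
            candidates,
            key=lambda a: self.clicks.get(a, 0) / max(self.shows.get(a, 0), 1),
        )

    def update(self, context: List[float], article: int, click: int) -> None:
        self.shows[article] = self.shows.get(article, 0) + 1
        self.clicks[article] = self.clicks.get(article, 0) + click


def replay_evaluate(policy, events: Iterable[Event]) -> Tuple[float, int]:
    """Replay a uniformly-random log and estimate the policy's click-through rate.

    Only events where the policy's choice matches the logged article are
    counted; all other events are skipped without updating the policy.
    Returns (estimated CTR, number of matched events).
    """
    matched = 0
    total_clicks = 0
    for context, candidates, displayed, click in events:
        if policy.choose(context, candidates) != displayed:
            continue  # the log cannot tell us the reward of an article not shown
        matched += 1
        total_clicks += click
        policy.update(context, displayed, click)
    ctr = total_clicks / matched if matched else 0.0
    return ctr, matched


if __name__ == "__main__":
    toy_log: List[Event] = [
        ([1.0], [1, 2], 1, 1),
        ([1.0], [1, 2], 2, 0),
        ([1.0], [1, 2], 1, 0),
    ]
    print(replay_evaluate(GreedyCTR(), toy_log))
```

Skipped events are what make the estimate trustworthy only under uniform random logging: with any other logging policy, the matched subset would no longer be a representative sample of visits.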

R6B - Yahoo! Front Page Today Module User Click Log Dataset, version 2.0 (300 MB)

Online content recommendation is an important example of an interactive machine learning problem that requires an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have been studied extensively in the machine learning and statistics literature. Because of their inherently interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other areas of machine learning or recommendation, where the objective is often prediction. Like the previous version, this dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo!'s front page. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. Compared to the previous version, this data covers a longer period (15 days, from October 2 to 16, 2011) and contains raw features, so that researchers can try out different feature generation methods in multi-armed bandits. The dataset contains 28,041,015 user visits to the Today Module on Yahoo!'s front page. For each visit, the user is associated with a binary feature vector of dimension 136 (including a constant feature with ID 1) that contains information about the user such as age, gender, and behavior targeting features. For sensitivity and privacy reasons, feature definitions are not revealed, and browser cookies (bcookies) of the users are replaced with the constant string 'user'.
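As a small illustration of the user representation described above, the following Python sketch expands a list of active feature IDs into a dense binary vector of dimension 136, with the constant feature (ID 1) always set. It assumes that each visit lists only the IDs of its active features and that IDs run from 1 to 136; neither detail is confirmed by the description, and `to_binary_vector` is a hypothetical helper, not part of any released tooling.

```python
from typing import Iterable, List

NUM_FEATURES = 136       # dimension stated in the dataset description
CONSTANT_FEATURE_ID = 1  # the constant feature's ID, per the description


def to_binary_vector(active_ids: Iterable[int], dim: int = NUM_FEATURES) -> List[int]:
    """Expand 1-based active feature IDs into a dense 0/1 vector of length dim.

    Assumption: IDs are 1-based and contiguous up to dim.
    """
    vec = [0] * dim
    for fid in active_ids:
        if not 1 <= fid <= dim:
            raise ValueError(f"feature id {fid} is outside 1..{dim}")
        vec[fid - 1] = 1
    # The constant feature is set for every user, whether or not it was listed.
    vec[CONSTANT_FEATURE_ID - 1] = 1
    return vec


if __name__ == "__main__":
    user = to_binary_vector([5, 136])
    print(len(user), sum(user))
```

A dense list is used here only for readability; with 136 binary features and tens of millions of visits, a sparse or bit-packed representation would likely be preferable in practice.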

R7 - Yahoo! Flickr Images Features EC1M version 1.0 (83 Gbyte) (Hosted on AWS)

These are image features and labels to accompany the EC1M dataset. The EC1M database is a set of Flickr images taken in European cities and selected by the National Technical University of Athens. The database provides only a list of URLs, and each user is responsible for downloading the needed images under the normal Flickr TOS. More details on this dataset are available at http://image.ntua.gr/iva/datasets/ec1m/ We are releasing two kinds of data: image features and similarity ground truth. First, we are releasing the SURF and DBN features that describe these images. Both SURF and DBN are algorithms for image analysis that were invented and popularized outside of Yahoo. Others could recompute the features, but the codes are not very stable, so releasing our feature data is the best way to guarantee that all collaborators are working with exactly the same bits. This is important for the image-classification and image-similarity work that we are doing. The second set of data is a collection of image-similarity labels. We have manually annotated more than 26,000 images as relevant or not relevant for 25 different queries. This is the ground truth for image-similarity studies. This dataset is very large at 83 Gbyte.

R8 - Yahoo! Celebrity Faces Images, version 1.0 (49MB)

The goal of releasing this dataset is to introduce a more realistic, uncontrolled dataset for the task of human face recognition (often known as face recognition in the wild). The dataset contains a total of 2025 low-resolution gray-scale faces of 45 celebrities. There is no metadata associated with the dataset. The faces were extracted from 2025 Getty images using an open-source face detector (OpenCV), then converted to gray-scale and sub-sampled to 128×128 pixels. The main reason for releasing this dataset is to provide academic researchers with a more realistic face recognition benchmark, with the aim of developing new face recognition algorithms that work under wild (not fully controlled) pose and illumination conditions. Most existing benchmark datasets are collected under controlled pose and illumination conditions. As a result, most face recognition algorithms work well under controlled conditions, but their performance is significantly lower on wild (uncontrolled) datasets. A wild face recognition dataset is one in which there is large variation in the pose, alignment, and illumination of the face images to be recognized. Real-world face recognition, despite its importance in many applications, is a wild face recognition problem, whereas most existing face recognition benchmarks are controlled. The problem of wild face recognition is still not completely solved. The celebrity face dataset provides a new benchmark for it, in which the faces are subject to large variation in pose, illumination, and occlusion. We hope that releasing this dataset helps academic machine learning and computer vision researchers develop more accurate face recognition algorithms.

R9 - Yahoo! Music Internet Radio Playlist, version 1.0 (273 MB) (Hosted on AWS)

This dataset contains a snapshot of metadata collected over a period of 15 days between September 22 and October 6, 2011. The metadata was collected across more than 4,000 Internet radio stations. For each sampled track play, we extracted the metadata, the station's local time of play, and the corresponding system time. The monitored stations are all associated with the ShoutCast directory and therefore tend to provide metadata in a similar manner.
