Datasets

R1 - Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. The dataset contains over ten million ratings of musical artists given by Yahoo! Music users over the course of a one month period sometime prior to March 2004. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms. The dataset may serve as a testbed for matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 423 MB.
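As an illustration of the kind of collaborative-filtering experiment this data supports, the sketch below loads anonymous (user, artist, rating) triples into a sparse matrix and computes a PCA-like truncated SVD. The file name and tab-separated column layout are assumptions for illustration, not the documented file format.

```python
# Minimal sketch: load anonymous (user, artist, rating) triples into a sparse
# matrix and compute a truncated SVD as a PCA-like decomposition.
# The file name and tab-separated column layout are assumptions.
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def load_ratings(path="ydata-ymusic-user-artist-ratings.txt"):
    users, artists, ratings = [], [], []
    with open(path) as f:
        for line in f:
            u, a, r = line.rstrip("\n").split("\t")   # assumed layout
            users.append(int(u))
            artists.append(int(a))
            ratings.append(float(r))
    n_users = max(users) + 1
    n_artists = max(artists) + 1
    return csr_matrix((ratings, (users, artists)), shape=(n_users, n_artists))

if __name__ == "__main__":
    R = load_ratings()
    svd = TruncatedSVD(n_components=50, random_state=0)
    user_factors = svd.fit_transform(R)   # latent user representation
    print(user_factors.shape, svd.explained_variance_ratio_.sum())
```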

R2 - Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0 (1.4 Gbyte & 1.1 Gbyte)

This dataset represents a snapshot of the Yahoo! Music community's preferences for various songs. The dataset contains over 717 million ratings of 136 thousand songs given by 1.8 million users of Yahoo! Music services. The data was collected between 2002 and 2006. Each song in the dataset is accompanied by artist, album, and genre attributes. The users, songs, artists, and albums are represented by randomly assigned numeric id's so that no identifying information is revealed. The mapping from genre id's to genre, as well as the genre hierarchy, is given. The dataset is distributed in two parts: part 1 is 1.4 Gbyte and part 2 is 1.1 Gbyte.

R3 - Yahoo! Music ratings for User Selected and Randomly Selected songs, version 1.0 (1.2 MB)

This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services. The second source consists of ratings for randomly selected songs collected during an online survey conducted by Yahoo! Research. The rating data includes 15,400 users, and 1000 songs. The data contains at least ten ratings collected during normal use of Yahoo! Music services for each user, and exactly ten ratings for randomly selected songs for each of the first 5400 users in the dataset. The dataset includes approximately 300,000 user-supplied ratings, and exactly 54,000 ratings for randomly selected songs. All users and items are represented by randomly assigned numeric identification numbers. In addition, the dataset includes responses to seven multiple-choice survey questions regarding rating-behavior for each of the first 5400 users. The survey data and ratings for randomly selected songs were collected between August 22, 2006 and September 7, 2006. The normal-interaction data was collected between 2002 and 2006. The size of this dataset is 1.2 MB.

R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 (23 MB)

This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies, rated on a scale from A+ to F. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, awards, etc. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, including hybrid content and collaborative filtering algorithms. The dataset may serve as a testbed for relational learning and data mining algorithms as well as matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 23 MB.

R5 - Yahoo! Delicious Popular URLs and Tags, version 1.0 (4.5MB)

This dataset represents 100,000 URLs that were bookmarked on Delicious by users of the service. Each URL has been saved at least 100 times. For each URL, the date that it was first bookmarked by a Delicious user is indicated, along with the total number of saves. Also indicated are the ten most commonly used tags for each URL, along with the number of times each tag was used. This dataset provides a view into the nature of popular content in the Delicious social bookmarking system, including how users apply tags to individual items. The size of this dataset is 4.5 MB.

L1 - Yahoo! N-Grams, version 2.0 (multi part) (Hosted on AWS)

This dataset contains n-grams (contiguous sets of words of size n), n = 1 to 5, extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12000 news-oriented sites. The documents were published on these sites between February 2006 and December 2006. The dataset does not contain the documents themselves, but only the n-grams that occur at least twice. It provides statistics such as frequency of occurrence, number and entropy of different left (right) single-token contexts of each n-gram. This dataset may be used by researchers to build statistical language models for speech or handwriting recognition or machine translation. There are 3 files in this dataset. They are 3.5 Gbyte, 4.3 Gbyte and 4.4 Gbyte.
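To make the reported statistics concrete, the sketch below computes, for each bigram in a toy corpus, its frequency and the entropy of its left single-token contexts. It only illustrates the definitions above and does not parse the released files.

```python
# Sketch: frequency and left-context entropy of bigrams over a toy corpus.
# Illustrates the statistics described above; the released files are not parsed here.
import math
from collections import Counter, defaultdict

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

freq = Counter()
left_contexts = defaultdict(Counter)   # bigram -> counts of preceding token

for toks in sentences:
    for i in range(len(toks) - 1):
        bigram = (toks[i], toks[i + 1])
        freq[bigram] += 1
        if i > 0:
            left_contexts[bigram][toks[i - 1]] += 1

def entropy(counter):
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values()) if total else 0.0

for bigram, count in freq.items():
    ctx = left_contexts[bigram]
    print(bigram, count, len(ctx), round(entropy(ctx), 3))
```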

L2 - Metadata Extracted from Publicly Available Web Pages, version 1.0 (1.5 GB & 700 MB)

The dataset contains about 100 million triples of RDF data obtained by extracting metadata from publicly available webpages. Three forms of embedded metadata are extracted: microformats (hCard, hCalendar and hReview), RDFa metadata and RDF documents linked to webpages. All metadata extracted from a webpage is converted to RDF. The data is made available in the WARC format, version 0.9. The dataset may serve as a testbed for research in scalability in the Semantic Web area and also for developing methods to deal with metadata that is incomplete, erroneous or biased in some way. The size of this dataset is 2.3 Gbyte in two parts.

L3 - Yahoo! Semantically Annotated Snapshot of the English Wikipedia, version 1.0 (multi part)

This SW1 dataset contains a snapshot of the English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. To build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam. This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text is extracted from each XML entry and split into sentences using simple heuristics. We then ran several syntactic and semantic NLP taggers on the text and collected their output. The raw data is distributed in a "multitag" format that contains all the Wikipedia text plus all the semantic tags; all other data files can be reconstructed from it. A multitag file contains several Wikipedia entries: the snapshot was cut into 3000 multitag files, each containing roughly 500 entries. There are 4 files in this dataset, ranging in size from 1.3 Gbyte to 1.8 Gbyte.

L4 - Yahoo! Answers Manner Questions, version 1.0 (102 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 102 MB.
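A minimal sketch of the kind of selection heuristic described above (a "how to/do/..." prefix test plus a requirement of at least four words including a noun and a verb), using NLTK's off-the-shelf tokenizer and tagger. The exact tools and thresholds used to build the corpus are not specified here, so treat this as an illustration only.

```python
# Sketch of the selection heuristic described above, using NLTK.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import re
import nltk

MANNER_PREFIX = re.compile(r"^how\s+(to|do|did|does|can|would|could|should)\b", re.IGNORECASE)

def is_manner_question(text: str) -> bool:
    # "how {to|do|did|does|can|would|could|should}" prefix test
    return bool(MANNER_PREFIX.match(text.strip()))

def passes_quality_filter(text: str) -> bool:
    # At least four words, including at least one noun and one verb.
    tokens = nltk.word_tokenize(text)
    if len(tokens) < 4:
        return False
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    has_noun = any(tag.startswith("NN") for tag in tags)
    has_verb = any(tag.startswith("VB") for tag in tags)
    return has_noun and has_verb

print(is_manner_question("How do I fix a flat tire?"))      # True
print(passes_quality_filter("How do I fix a flat tire?"))   # True
```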

L5 - Yahoo! Answers Manner Questions, version 2.0 (104 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 104 MB.

L6 - Yahoo! Answers Comprehensive Questions and Answers version 1.0 (multi part)

Yahoo! Answers is a web site where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is the Yahoo! Answers corpus as of 10/25/2007. It includes all the questions and their corresponding answers. The corpus distributed here contains 4,483,032 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs and all user ids were anonymized so that no identifying information is revealed. This dataset may be used by researchers to learn and validate answer extraction models. An example of such work was published by Surdeanu et al. (2008). There are 2 files in this dataset. Part 1 is 1.7 Gbyte and part 2 is 1.9 Gbyte.

L8 - Yahoo! Search Query Logs for Nine Languages, version 1.0 (45 K)

This dataset contains the 1000 most frequent web search queries issued to Yahoo! Search for nine different languages. The languages covered are Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The dataset may be useful for various information retrieval and data mining research investigations, especially those involving cross- or multi-lingual search tasks. The size of this dataset is 45 K.

L9 - Yahoo! Answers Question Types Sample of 1000, version 1.0 (14 K)

This dataset contains URLs of questions posted to Yahoo! Answers, along with the question types assigned to these questions by human judges. The question types are "informational", "advice", "opinion", and "polling". The size of this dataset is 14 K.

G1 - Yahoo! Search Marketing Advertiser-Phrase Bipartite Graph, Version 1.0 (14 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel." An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic: advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This is a small, completely anonymized graph reflecting the pattern of connectivity between some Yahoo! Search Marketing advertisers and some of the search keyword phrases that they bid on. Both advertisers and keyword phrases are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset contains 459,678 anonymous phrase ids, 193,582 anonymous advertiser ids, and 2,278,448 edges, representing the act of an advertiser bidding on a phrase. The size of this dataset is 14 MB.

G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS)

This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002. The dataset encodes the graph or map of links among web pages, where nodes in the graph are URLs. The Yahoo! AltaVista web graph is an example of a large real-world graph. The dataset may serve as a testbed for matrix, graph, clustering, data mining, and machine learning algorithms. There are 3 files in this dataset with sizes 3.2 Gbyte, 5.0 Gbyte and 3.4 Gbyte.

G3 - Yahoo! Groups User-Group Membership Bipartite Graph, version 1.0 (93 MB)

Millions of communities and groups use Yahoo! Groups as a meeting place and forum to discuss mutual interests on nearly any topic. This dataset contains a sample of the "membership graph" of Yahoo! Groups, where both users and groups are represented as meaningless anonymous numbers so that no identifying information is revealed. Users and groups are nodes in the membership graph, with edges indicating that a user is a member of a group. The dataset consists only of the anonymous bipartite membership graph, and does not contain any information about users, groups, or discussions. The Yahoo! Groups membership graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The size of this dataset is 93 MB.

G4 - Yahoo! Network Flows Data, version 1.0 (multi part) (Hosted on AWS)

Yahoo! network flows data contains communication patterns between end users across the Internet and Yahoo! servers. A netflow record includes timestamp, source IP address, destination IP address, source port, destination port, protocol, number of packets, and number of bytes transferred from the source to the destination. The record does not include the content of the data communication. Each netflow data file consists of sampled netflow records exported from routers in 15-minute intervals. The dataset includes netflow data files collected from three border routers on October 11, 2007. All IP addresses in the dataset are anonymized using a random permutation algorithm. There are 6 files in this dataset with sizes 7.8 Gbyte, 7.3 Gbyte, 7.8 Gbyte, 7.5 Gbyte, 7.4 Gbyte and 3.6 Gbyte.
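The record fields listed above map naturally onto a small data structure. Below is a sketch of a record type and a parser for a hypothetical whitespace-separated line layout; the actual file format of the release may differ.

```python
# Sketch: a record type mirroring the netflow fields listed above, plus a parser
# for a hypothetical whitespace-separated line layout (the real file format may differ).
from dataclasses import dataclass

@dataclass
class NetflowRecord:
    timestamp: int
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    packets: int
    bytes_transferred: int

def parse_line(line: str) -> NetflowRecord:
    ts, sip, dip, sport, dport, proto, pkts, nbytes = line.split()
    return NetflowRecord(int(ts), sip, dip, int(sport), int(dport),
                         proto, int(pkts), int(nbytes))

example = "1192060800 10.0.0.1 10.0.0.2 443 51234 TCP 12 5800"
print(parse_line(example))
```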

A1 - Yahoo! Search Marketing Advertiser Bidding Data, version 1.0 (81 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel". An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic. Advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This dataset contains the bids over time of all advertisers participating in Yahoo! Search Marketing auctions for the top 1000 search queries during the period from June 15, 2002, to June 14, 2003. Advertisers' identities and query phrases are represented as meaningless anonymous numbers so that no identifying information about advertisers is revealed. The data may be used by economists or other researchers to investigate the behavior of bidders in this unique real-time auction format, responsible for roughly two billion dollars in revenue in 2005 and growing. The size of this dataset is 81 MB.

A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 (25 MB)

The Yahoo!/O'Reilly Tech Buzz Game tests the theory that a free electronic market can predict trends in technology. In the game, users buy and sell fantasy "stocks" in various technologies. The prices of the stocks fluctuate according to supply and demand using a market mechanism invented at Yahoo! called a dynamic pari-mutuel market. Weekly dividends are paid out to stockholders based on the current "buzz score," or percentage of Yahoo! searches associated with each technology. This dataset shows all the transactions over the course of the Tech Buzz Game, divided into two periods. The first period spans April 1, 2005 to July 31, 2005, after which all markets were closed and cashed out according to final buzz scores at that time. The second period spans August 22, 2005 to the present, at the beginning of which new markets were established and the market reset to a starting point. Traders are represented as meaningless anonymous numbers so that no identifying information is revealed. In addition, the dataset contains buzz score data showing the daily percentages of search volume for each stock. Researchers may use the data to test the predictive value of the market or to test market behavioral theories. The size of this dataset is 25 MB.

A3 - Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords, version 1.0 (845 MB)

This dataset contains a small sample of advertisers' bid and revenue information over a period of 4 months. Bid and revenue information is aggregated at a daily granularity over advertiser account id, keyphrase, and rank. Apart from bid and revenue, impression and click information is also included. A keyphrase is a sequence of keywords; a keyphrase category is itself a keyword. A keyphrase can belong to one or more keyphrase categories. The keyphrase categories used for this dataset are listed in a separate file. The advertiser account id is represented as a meaningless string. A keyphrase is represented as a sequence of meaningless strings, where each string represents a keyword or keyphrase category. The size of this dataset is 845 MB.

L15 - Yahoo! Search queries that share clicked URLs with TREC queries, version 1.0 (33 K)

This dataset consists of Yahoo! search queries that share clicked URLs with TREC queries. Queries that share clicked URLs are often referred to as co-clicked queries. The TREC queries cover a subset of TREC topics 451-550, 701-850, and wt09-01 through wt09-50, which are widely used within the information retrieval community for various Web search-related experiments. The size of this dataset is 33K.

S1 - Yahoo! Sherpa database platform system measurements, version 1.0 (33 K)

This dataset contains a series of traces of hardware resource usage during operation of the PNUTS/Sherpa database. The measurements include CPU utilization, memory utilization, disk utilization, network traffic, and so on. Additionally, metrics specific to particular components of the system, such as the Apache and MySQL servers, are also included. The traces represent measurements of system resource usage at 1-minute granularity during various database workloads, including read-heavy, write-heavy and scan-oriented workloads. The data can be used to analyze and simulate the bottlenecks experienced in a real cloud database system under load. The size of this dataset is 33 K.

S2 - Yahoo! Statistical Information Regarding Files and Access Pattern to Files in one of Yahoo's Clusters (2.3 K)

This dataset contains the total number of files, total file size, number of file accesses, number of days between the first access and the most recent access, file distribution, deletion rate of files and directories, and creation rate of files and directories in a dilithium-gold cluster. The size of this dataset is 2.3 K.

G5 - Yahoo! Messenger User Communication Pattern (32 MB)

This dataset contains a small sample of the Yahoo! Messenger community's communication (IM) log at a high level for a period of 4 weeks. Specifically, this dataset only records the first communication event from one user to another on a particular day, and generates such records for a period of 28 days. This dataset does not contain any specific IM content, or user identification. It is only intended to help identify the social graph based on communication patterns between a small subset of the total number of Yahoo! Messenger users. This dataset also has some information regarding the locale of users in an anonymous form. The dataset may be used by researchers to validate claims on social networking theory and corroborate their assumptions/analysis against a real time social network graph consisting of a small subset of Yahoo! Messenger users. The total size for this dataset is 32 MB.

G6 - Yahoo! Instant Messenger Friends Connectivity Graph (28 MB)

Millions use Yahoo! Messenger every day to communicate by text or by voice between PCs or from PCs to phones. This dataset contains a sample of the Yahoo! Messenger "friends graph", where users are represented as meaningless anonymous numbers so that no identifying information is revealed. Users are nodes in the graph, with edges indicating that a user is a friend of another user. The dataset consists only of the anonymous friends graph, and does not contain any information about users or discussions. The Yahoo! Messenger friends graph is an example of a large real-world power-law graph. The dataset may serve as a testbed for matrix and graph algorithms including PCA and graph clustering algorithms, as well as machine learning algorithms. The total size for this dataset is 28 MB.

L12 - Yahoo! Search Popularity by Location for Websites on Politicians and Athletes (14 MB)

This dataset contains lists of popular web sources clicked by Yahoo! users when they search for entities within two popular domains, Athletes and Politicians. In addition to the global popular list, this dataset also includes location-specific and location/entity-specific popular lists. This dataset can be used to study location bias in entities and web sources, and therefore allows researchers to study location-specific information extraction and entity portal generation. The total size for this dataset is 14 MB.

L13 - Yahoo! Search Query Tiny Sample (41 K)

This dataset contains a random sample of 4496 queries posted to Yahoo's US search engine in January 2009. For privacy reasons, the query set contains only queries that have been asked by at least three different users and that contain only letters of the English alphabet, sequences of digits no longer than four digits, and punctuation characters. The query set does not contain user information nor does it preserve temporal aspects of the query log. The total size for this dataset is 41 K.

C14 - Yahoo! Learning to Rank Challenge (421 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query, url) pairs along with relevance judgments. The queries, urls and feature descriptions are not given; only the feature values are. The size of this dataset is 421 MB.

G7 - Yahoo! Property and Instant Messenger Data use for a Sample of Users, v.1.0 (4.3 Gb) (Hosted on AWS)

This dataset was generated starting with a sample of Yahoo! users as a seed list, then following their IM communication links out two steps. The dataset includes user demographics (age, gender), Yahoo! property use, and characteristics of users' mobile devices. The size of this dataset is 4.3 Gbyte.

G8 - Yahoo! Messenger Client to Server Protocol-Level Events, version 1.0 (477 MB)

This dataset contains Yahoo! Messenger client-to-server protocol-level events, covering around 200 different opcodes. It includes 2 sets of data related to Yahoo! Messenger: EVENT_DATA, a small sample of clients' protocol-level event message streams to the Yahoo! Messenger servers, collected over a period of 24 hrs in June 2010; and VALIDATION_DATA, client-to-client message events for the same users, calculated over the same 24 hrs period in June 2010. The size of this dataset is 477 MB.

L16 - Yahoo! Answers Query to Questions (1.5 MB)

This dataset contains a small sample of Yahoo! Answers question/answers pages visited following search engine queries in August 2010. The dataset also contains user ratings of query clarity, query-question match, and query-answer satisfaction collected using Amazon Mechanical Turk. The dataset may be used by researchers to validate algorithms to predict searcher satisfaction with existing community-based answers. It may also enable researchers to validate algorithms to predict query clarity and query-question match. The size of this dataset is 1.5 MB.

C15 - Yahoo! Music user ratings of musical tracks, albums, artists and genres, v 1.0 (1.5 Gbyte)

Yahoo! Music offers a wealth of information and services related to many aspects of music. This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical items. A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy: for a track we know the identity of its album, performing artist, and associated genres, and similarly we have artist and genre annotation for the albums. We provide four different versions of the dataset, differing in their number of ratings, so that researchers can select the version best catering to their needs and hardware availability. In addition, one of the versions is oriented towards a unique learning-to-rank goal, unlike the predictive/error-minimization criteria common in recommender systems. The dataset contains ratings provided by true Yahoo! Music customers during 1999-2009. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed. The important and unique features of the dataset are fourfold: it is of very large scale compared to other datasets in the field, e.g., 13x larger than the Netflix Prize dataset; it has a very large set of items, much larger than in similar datasets, where usually only the number of users is large; there are four different categories of items, all linked together within a defined hierarchy; and it allows session analysis of user activities. We expect that the novel features of the dataset will make it a subject of active research and a standard in the field of recommender systems. In particular, the dataset is expected to ignite research into algorithms that utilize the hierarchical structure annotating the item set. The size of this dataset is 1.5 Gbyte.

R6A - Yahoo! Front Page Today Module User Click Log Dataset, version 1.0 (1.1 GB)

Online content recommendation represents an important example of interactive machine learning problems that require an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have received extensive research in the machine learning and statistics literature. Due to the inherent interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, whose objective is often prediction. Our dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days in May 2009. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. To the best of our knowledge, this is the first real-world benchmark for evaluating bandit algorithms reliably. The dataset contains 45,811,883 user visits to the Today Module. For each visit, both the user and each of the candidate articles are associated with a feature vector of dimension 6 (including a constant feature), constructed using a conjoint analysis with a bilinear model; see Chu et al. [KDD 2009] for more details. The size of this dataset is 1.1 GB.
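As a sketch of how the uniformly random article selection enables unbiased offline evaluation (the replay idea of Li et al. [WSDM 2011]): only logged events whose displayed article matches the policy's choice are kept, and the click-through rate is estimated on those events. The event field names below are illustrative and are not the actual log schema.

```python
# Sketch of replay-style offline evaluation for a bandit policy, as enabled by
# the uniformly random article selection described above. Event field names
# ("candidates", "shown", "clicked", "user_features") are illustrative only.
import random

def replay_evaluate(policy, events):
    """Estimate the click-through rate a policy would have achieved."""
    matched, clicks = 0, 0
    for ev in events:
        chosen = policy(ev["user_features"], ev["candidates"])
        if chosen == ev["shown"]:          # keep only matching events
            matched += 1
            clicks += ev["clicked"]
    return clicks / matched if matched else float("nan")

# Toy usage with a uniformly random policy on synthetic events.
def random_policy(user_features, candidates):
    return random.choice(candidates)

toy_events = [
    {"user_features": [1.0], "candidates": [1, 2, 3], "shown": 2, "clicked": 1},
    {"user_features": [0.5], "candidates": [1, 2, 3], "shown": 1, "clicked": 0},
]
print(replay_evaluate(random_policy, toy_events))
```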

L18 - Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 Gbyte)

Anonymized Yahoo! Search Logs with Relevance Judgments, version 1.0. The size of this dataset is 1.3 Gbyte.

R6B - Yahoo! Front Page Today Module User Click Log Dataset, version 2.0 (300 MB)

Online content recommendation represents an important example of interactive machine learning problems that require an efficient tradeoff between exploration and exploitation. Such problems, often formulated as various types of multi-armed bandits, have received extensive research in the machine learning and statistics literature. Due to the inherent interactive nature, creating a benchmark dataset for reliable algorithm evaluation is not as straightforward as in other fields of machine learning or recommendation, whose objective is often prediction. Similar to the previous version, this dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo!'s front page. The articles were chosen uniformly at random, which allows one to use a recently developed method of Li et al. [WSDM 2011] to obtain an unbiased evaluation of a bandit algorithm. Compared to the previous version, this data is larger (containing 15 days of data from October 2 to 16, 2011) and contains raw features (so that researchers can try out different feature generation methods in multi-armed bandits). The dataset contains 28,041,015 user visits to the Today Module on Yahoo!'s front page. For each visit, the user is associated with a binary feature vector of dimension 136 (including a constant feature with ID 1) that contains information about the user, such as age, gender, behavior targeting features, etc. For sensitivity and privacy reasons, feature definitions are not revealed, and browser cookies (bcookies) of the users are replaced with the constant string 'user'.

R7 - Yahoo! Flickr Images Features EC1M version 1.0 (83 Gb)

These are image features and labels to accompany the EC1M dataset. The EC1M database is a set of Flickr images taken in European cities, selected by the National Technical University of Athens. The database just provides a list of URLs, and each user is responsible for downloading the needed images under the normal Flickr TOS. More details on this dataset are available at http://image.ntua.gr/iva/datasets/ec1m/. We are releasing two different kinds of data: image features and similarity ground truth. First, we wish to release the SURF and DBN features that describe these images. Both SURF and DBN are algorithms for image analysis that were invented and popularized outside of Yahoo. We wish to release our image feature data so that our collaborators are all working on exactly the same bits. Others could recompute the features, but the codes are not very stable, so this is the best way to guarantee that all of us are working with the same bits. This is important for the image-classification and image-similarity work that we are doing. The second set of data we wish to release is a set of image-similarity labels. We have manually annotated 26,000+ images as to whether they are relevant or not relevant for 25 different queries. This is the ground truth for image-similarity studies. This dataset is very large at 83 Gbyte.

L11 - HTML Forms Extracted from Publicly Available Webpages, version 1.0 (50Gb+) (Hosted on AWS)

This dataset contains a small sample of pages that contain complex HTML forms. Complex forms are HTML forms that have 3 or more form controls, such as input tags (type: text|checkbox|radio|image|button) and select tags (dropdown boxes). The dataset contains 2.67 million complex forms. Such data may be useful for form classification and in uncovering hidden web data. This dataset is very large, at over 50 Gbyte.
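A hedged sketch of how one might flag a complex form as defined above (an HTML form with 3 or more controls of the listed types) using Python's standard html.parser; the crawl pipeline that actually produced the dataset is not described here.

```python
# Sketch: count qualifying form controls per <form> and flag forms with >= 3,
# matching the definition of a "complex form" above. Uses only the standard library.
from html.parser import HTMLParser

QUALIFYING_INPUT_TYPES = {"text", "checkbox", "radio", "image", "button"}

class ComplexFormDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.controls = 0
        self.complex_forms = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form, self.controls = True, 0
        elif self.in_form and tag == "select":
            self.controls += 1
        elif self.in_form and tag == "input" and attrs.get("type", "text") in QUALIFYING_INPUT_TYPES:
            self.controls += 1

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            if self.controls >= 3:
                self.complex_forms += 1
            self.in_form = False

sample_html = '<form><input type="text"><input type="radio"><select></select></form>'
detector = ComplexFormDetector()
detector.feed(sample_html)
print(detector.complex_forms)   # 1
```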

L19 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains a large sample of noun phrases and their context, extracted from Yahoo! News data. The data can be used for AI and NLP studies.

R8 - Yahoo! Celebrity Faces Images, version 1.0 (49MB)

The goal of releasing this dataset is to introduce a more realistic and uncontrolled dataset for the task of human face recognition (often known as face recognition in the wild). This dataset contains a total of 2025 low-resolution gray-scale faces of 45 celebrities. There is no metadata associated with the dataset. The faces are extracted from 2025 Getty images using an open-source face detector (OpenCV), then converted to gray-scale and sub-sampled to 128x128 pixels. The main reason for releasing this dataset is to provide academic researchers with a more realistic face recognition benchmark, with the aim of developing new face recognition algorithms that work under wild (not fully controlled) pose and illumination conditions. Most existing benchmark datasets are collected under controlled pose and illumination conditions; as a result, most face recognition algorithms work well under controlled conditions, but their performance is significantly lower on wild (uncontrolled) datasets. A wild face recognition dataset is one in which there is large variation in the pose, alignment, and illumination of the face images to recognize. Despite its significant importance in many applications, the problem of real-world, wild face recognition is still not completely solved, whereas most existing face recognition benchmarks are controlled. This celebrity face dataset provides a new benchmark in which the provided faces are subject to large variation in pose, illumination, and occlusion. We hope that releasing this dataset helps academic machine learning and computer vision researchers to come up with more accurate face recognition algorithms.

S3 - Yahoo Hadoop grid logs, version 1.0 (8.8G) (Hosted on AWS)

This dataset contains HDFS audit logs, which record information about HDFS file access.

L20 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains browsing behavior data for a collection of users on Yahoo! Answers from several months in 2006. Users interact socially and are rewarded by a point system based on their question-and-answer activity. The data includes questions, answers, and browsing behavior for users on the site. There is no textual or NLP information. The data may be used with machine learning techniques to discern the different types of users on Yahoo! Answers and how they interact with the site by asking and answering questions or browsing, or to test models of different classes of users based on how they move around the site and interact with the point system and with each other.

R9 - Yahoo! Music Internet Radio Playlist, version 1.0 (273 MB) (Hosted on AWS)

This dataset contains a snapshot of metadata collected during a period of 15 days between September 22nd and October 6th, 2011. The metadata was collected across more than 4K Internet radio stations. For each sampled track play, we extracted the metadata, the station local time of play, and its corresponding system time. The monitored stations are all associated with the ShoutCast directory and therefore tend to provide metadata in a similar manner.

L21 - Yahoo! Answers Query To Questions, version 2.0 (24K)

This dataset contains a sample of Yahoo! Search queries issued in June 2011 by searchers who posted a relevant question on Yahoo! Answers shortly after searching. It also contains the category information of the posted question. The dataset may be used by researchers to better understand the transition from searchers to askers, especially the information needs causing the transition.

I1 - Yahoo! Flickr Creative Common Images tagged with ten concepts, version 1.0

This dataset contains a list of Creative Commons licensed images from Flickr and features computed on the images.

L22 - Yahoo! News Sessions Content, version 1.0 (16 MB)

This dataset contains a small sample of user sessions that contained a click in the Yahoo! News domain, along with the contents of a number of news articles present in those user sessions. Users and textual content are represented as meaningless anonymous numbers so that no identifying information is revealed. The textual content includes the tokens (words) found in the news articles, the article publication date, time expressions found in the news articles, and entities (locations, persons and organizations). The dataset includes a ground truth file that contains, for a given article, what articles have been clicked next in a user trail. The dataset may be used by researchers to validate content-based recommender systems or ranking algorithms. The dataset may serve as a testbed for user-trail and content-based mining and recommendation algorithms.

I2 - Yahoo! Shopping Shoes Image Content, version 1.0 (131 MB)

The main purpose of releasing this dataset is to provide a new benchmark for the problem of fine-grained object recognition, using shoes as an example. Most of the existing datasets in the research community can be used to develop algorithms that classify objects at a coarse level, e.g., is this a dog or a cat? But in reality we sometimes need fine-grained object recognition, e.g., is this a German Shepherd or a Chihuahua? Our shoe dataset provides a new benchmark containing a diverse collection of shoe photos. Object recognition algorithms aim to identify automatically whether a pair of shoes appears in a photo and what type of shoes (e.g., clogs or high heels) they are. Yahoo! Shopping is the best place to read user reviews, explore great products and buy online. We collected a small subset of products from Yahoo! Shopping to reflect the interesting real-world problem of fine-grained object recognition. We hope that releasing this dataset helps academic machine learning and computer vision researchers to come up with more accurate object recognition algorithms. This dataset contains a small sample of the Yahoo! Shopping shoe photos, organized into 107 folders, each corresponding to a type and brand of shoe. Also included is a .mat file (shoe_annos.mat), which contains a bounding box for each shoe image. The dataset may be used by researchers to validate image classification systems for research purposes.

C14B - Yahoo! Learning to Rank Challenge, version 2.0 (616 MB)

Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query, url) pairs along with relevance judgments. The queries, urls and feature descriptions are not given; only the feature values are. There are two datasets in this distribution: a large one and a small one. Each dataset is divided into 3 sets: training, validation, and test. Statistics are as follows:

              Set 1                          Set 2
              Train      Val      Test       Train     Val      Test
# queries     19,944     2,994    6,983      1,266     1,266    3,798
# urls        473,134    71,083   165,660    34,815    34,881   103,174
# features    519                            596

Number of features in the union of the two sets: 700; in the intersection: 415. Each feature has been normalized to be in the [0,1] range.

Each url is given a relevance judgment with respect to the query. There are 5 levels of relevance from 0 (least relevant) to 4 (most relevant).
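Given graded judgments from 0 to 4, ranking quality is commonly scored with a graded metric such as DCG/NDCG. The sketch below shows one standard formulation; it is a common choice for this kind of data rather than necessarily the official challenge metric.

```python
# Sketch: DCG/NDCG over graded relevance labels in {0,...,4}, a standard way to
# score rankings against judgments like those described above.
import math

def dcg(relevances):
    # Gain (2^rel - 1) discounted by log2 of the 1-based rank plus one.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    rels = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    denom = dcg(ideal)
    return dcg(rels) / denom if denom > 0 else 0.0

# Toy ranking whose documents carry judgments 4, 2, 0, 1 in ranked order.
print(round(ndcg([4, 2, 0, 1], k=3), 4))
```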



L23 - Yahoo Answers Synthetic Questions, version 1.0

Yahoo Answers Synthetic Questions, version 1.0

L24 - Yahoo Search Query Log To Entities, version 1.0 (1.7 MB)

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a "span"), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.

I3 - Yahoo Flickr Creative Commons 100M (14G) (Hosted on AWS)

This dataset contains a list of photos and videos. This list is compiled from data available on Yahoo! Flickr. All the photos and videos provided in the list are licensed under one of the Creative Commons copyright licenses, and as such they can be used for benchmarking purposes as long as the photographer/videographer is credited for the original creation.

If you decide to use the YFCC100M dataset in your work, please cite the following paper: B. Thomee, D.A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L. Li, "YFCC100M: The New Data in Multimedia Research", Communications of the ACM, 59(2), pp. 64-73, 2016.

This dataset is hosted on the Amazon Web Services platform, which requires a free Amazon Web Services login for access.

L25 - Yahoo N-Gram Representations, version 2.0 (2.6Gb) (Hosted on AWS)

This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and sentence similarity tasks, which are common in NLP research. We would like researchers to be able to produce query rewrites based on these representations and test them against other state-of-the-art techniques.

S5 - A Labeled Anomaly Detection Dataset, version 1.0 (16 MB)

Automatic anomaly detection is critical in today's world where the sheer volume of data makes it impossible to tag outliers manually. The goal of this dataset is to benchmark your anomaly detection algorithm. The dataset consists of real and synthetic time-series with tagged anomaly points. The dataset tests the detection accuracy of various anomaly-types including outliers and change-points. The synthetic dataset consists of time-series with varying trend, noise and seasonality. The real dataset consists of time-series representing the metrics of various Yahoo services.
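As a sketch of how detection accuracy against tagged anomaly points might be scored, the snippet below computes point-wise precision, recall, and F1 over predicted versus labeled anomaly timestamps. Any tolerance window or weighting used by the official benchmark protocol is not reflected here.

```python
# Sketch: point-wise precision/recall/F1 of predicted anomaly timestamps against
# tagged anomaly points. The official scoring protocol of the benchmark may differ.
def point_f1(predicted, labeled):
    predicted, labeled = set(predicted), set(labeled)
    tp = len(predicted & labeled)                      # exact timestamp matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(labeled) if labeled else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(point_f1(predicted=[10, 42, 99], labeled=[42, 99, 150]))
```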

I4 - Title-based Video Summarization dataset, version 1.1 (644 MB)

The TVSum50 dataset contains 50 videos and their shot-level importance scores obtained via crowdsourcing. The 50 videos, collected from YouTube, come with the Creative Commons CC-BY (v3.0) license. We release both the video files and their URLs. The shot-level importance scores are annotated by crowd-workers and contain 20 annotations per video. This dataset may serve as a benchmark to validate video summarization techniques.

L26 - Yahoo! Answers consisting of questions asked in French, version 1.0 (3.8Gb) (Hosted on AWS)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from 2006 to 2015, consisting of 1.7 million questions posed in French and their corresponding answers. We only include questions which have been resolved, that is, questions which have received one or more answers. The dataset may serve as a testbed for multilingual question answering systems as well as for research into user behavior on community question answering sites in other languages.

R10 - Yahoo News Feed dataset, version 1.0 (1.5TB)

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user's local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.

The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.

The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.

This Dataset is no longer available

L27 - Yahoo Answers Factoids Queries, version 1.0 (3.5MB)

The dataset includes English queries that were input to a search engine in 2012-2014 and identified as "factoid" queries, i.e., queries referring to a short fact (filtered by the answer being no longer than 3 words). These queries were identified based on questions in English on Yahoo Answers that have a short best answer and a link to English Wikipedia. The dataset includes the query, its corresponding question title, the best answer, a number indicating the occurrence frequency of the query, the link(s) to English Wikipedia, and the URL of the Yahoo Answers page.

I5 - Yahoo Flickr mobile photo filters, vision tags and engagement metrics, Version 1.0

This dataset contains a small sample of Flickr mobile photo metadata. These photos were uploaded through the Flickr mobile app or via Instagram, and some of them were filtered by the user prior to upload. The dataset contains information on whether the photo was filtered, the vision tags associated with the photo, and engagement metrics on the photo. The dataset may be used by researchers to validate the impact of filters and vision tags on engagement.

A4 - Yahoo Data Targeting User Modeling, Version 1.0 (Hosted on AWS) (3.7 Gb)

This data set contains a small sample of user profiles and their interests generated from several months of user activities at Yahoo webpages. Each user is represented as one feature vector and its associated labels, where all user identifiers were removed. Feature vectors are derived from user activities during a training period of 90 days, and labels from a test period of 2 weeks that immediately followed the training period. Each dimension of the feature vector quantifies a user activity with a certain interest category from an internal Yahoo taxonomy (e.g., "Sports/Baseball", "Travel/Europe"), calculated from user interactions with pages, ads, and search results, all of which are internally classified into these interest categories. The labels are derived in a similar way, based on user interactions with classified pages, ads, and search results during the test period. It is important to note that there exists a hierarchical structure among the labels, which is also provided in the data set.

All feature and label names are replaced with meaningless anonymous numbers so that no identifying information is revealed. The dataset is of particular interest to Machine Learning and Data Mining communities, as it may serve as a testbed for classification and multi-label algorithms, as well as for classifiers that account for structure among labels.

L28 - Yahoo Answers Query Treebank, version 1.0

User queries that were issued to the Yahoo Web search engine and for which the user ended up clicking a Yahoo Answers result.
The queries are tagged by linguists for syntactic segmentation and for the dependency parse tree within each segment.

L29 - Yahoo Answers Novelty Based Answer Ranking, version 1.0

CQA healthcare-related answers annotated with a mapping between textual propositions within each answer and aspects relevant to the target question.

L30 - Model files for Fast Entity Linker, version 1.0 (5.2G) (Hosted on AWS)

Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. A major challenge is to be able to accurately detect entities, in new languages, at scale, with limited labeled data available, and while consuming a limited amount of resources (memory and processing power).

In this dataset we release datapacks for English, Spanish, and Chinese built for our unsupervised, accurate, and extensible multilingual named entity recognition and linking system Fast Entity Linker.

In our system, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data-structures and aggressive hashing functions. The models released in this dataset include "entity embeddings" and "wikipedia click-log data".

Entity Embeddings are vector-based representations that capture how entities are referred to in language contexts. We train entity embeddings using Wikipedia articles and use hyperlinks in the articles to their canonical forms for their associated entities.

Wikipedia click-logs give very useful signals to disambiguate partial or ambiguous entity mentions, such as Obama (Michelle or Barack), Liverpool (city or football team), or Fox (person or organization). We extract the in-wiki links from Wikipedia and create pairs (alias, entity), where the alias is the text in the anchor and the entity is the id of the Wikipedia page pointed to by the outgoing link.
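To illustrate the (alias, entity) pairs described above, the sketch below extracts anchor-text/target pairs from wikitext-style links of the form [[Target]] or [[Target|anchor]]. Real processing of full Wikipedia dumps, including normalization to canonical titles, is more involved.

```python
# Sketch: extract (alias, entity) pairs from wikitext links of the form
# [[Target]] or [[Target|anchor text]]. Real dump processing and title
# normalization are more involved than shown here.
import re
from collections import Counter

WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def alias_entity_pairs(wikitext):
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        alias = (match.group(2) or target).strip()   # anchor text defaults to the target
        yield alias, target

sample = "[[Barack Obama|Obama]] met leaders in [[Liverpool]]."
print(Counter(alias_entity_pairs(sample)))
```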

L31 - Questions on Yahoo Answers labeled as either informational or conversational, version 1.0

The dataset includes non-deleted English questions from Yahoo Answers, posted between the years 2006 and 2016 and sampled uniformly at random. Each question includes a URL to its Yahoo Answers page, its title, description, high-level category (one of 26), direct category, and a label marking it as informational ('0') or conversational ('1'). A small subset of the questions is marked as borderline ('2').

L32 - The Yahoo News Annotated Comments Corpus, version 1.0

The dataset contains comment threads posted in response to online news articles. We annotated the dataset at the comment level and the thread level. The annotations include 6 dimensions of individual comments and 3 dimensions of threads as a whole. The coding was done by professional, trained editors and by untrained crowdsourced workers. The corpus contains annotations for a novel corpus of 2.4k threads and 9.2k comments from Yahoo News and 1k threads from the Internet Argument Corpus.