Advertising and Market Data

A1 - Yahoo! Search Marketing Advertiser Bidding Data, version 1.0 (81 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel". An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic. Advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This dataset contains the bids over time of all advertisers participating in Yahoo! Search Marketing auctions for the top 1000 search queries during the period from June 15, 2002, to June 14, 2003. Advertisers' identities and query phrases are represented as meaningless anonymous numbers so that no identifying information about advertisers is revealed. The data may be used by economists or other researchers to investigate the behavior of bidders in this unique real-time auction format, responsible for roughly two billion dollars in revenue in 2005 and growing. The size of this dataset is 81 MB.
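Because bids can change at any time, a common first step when working with bid histories is to recover the bid that was standing at a given moment. The following is a minimal sketch, assuming each (advertiser, query) pair has been loaded as a time-sorted list of (timestamp, bid) tuples. That layout, the function name standing_bid, and the sample values are hypothetical and are not taken from the dataset documentation.

```python
from bisect import bisect_right

def standing_bid(bid_history, t):
    """Return the most recent bid placed at or before time t, or None.

    bid_history is assumed (hypothetically) to be a list of
    (timestamp, bid) tuples sorted by timestamp.
    """
    times = [ts for ts, _ in bid_history]
    # Index of the first entry strictly later than t.
    i = bisect_right(times, t)
    return bid_history[i - 1][1] if i else None

# Made-up history for one anonymized (advertiser, query) pair.
history = [(10, 0.25), (20, 0.40), (35, 0.30)]

before_first = standing_bid(history, 5)    # no bid has been placed yet
at_change = standing_bid(history, 20)      # a bid placed exactly at t counts
between = standing_bid(history, 30)        # the bid from t=20 is still standing
after_last = standing_bid(history, 100)    # the final bid remains in force
```

The binary search keeps each lookup logarithmic in the number of bid changes, which helps when the same history is queried at many time points.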

A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 (25 MB)

The Yahoo!/O'Reilly Tech Buzz Game tests the theory that a free electronic market can predict trends in technology. In the game, users buy and sell fantasy "stocks" in various technologies. The prices of the stocks fluctuate according to supply and demand using a market mechanism invented at Yahoo! called a dynamic pari-mutuel market. Weekly dividends are paid out to stockholders based on the current "buzz score," or percentage of Yahoo! searches associated with each technology. This dataset shows all the transactions over the course of the Tech Buzz Game, divided into two periods. The first period spans April 1, 2005 to July 31, 2005, after which all markets were closed and cashed out according to final buzz scores at that time. The second period spans August 22, 2005 to the present, at the beginning of which new markets were established and the market reset to a starting point. Traders are represented as meaningless anonymous numbers so that no identifying information is revealed. In addition, the dataset contains buzz score data showing the daily percentages of search volume for each stock. Researchers may use the data to test the predictive value of the market or to test market behavioral theories. The size of this dataset is 25 MB.

A3 - Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords, version 1.0 (845 MB)

This dataset contains a small sample of advertisers' bid and revenue information over a period of 4 months. Bid and revenue information is aggregated at daily granularity by advertiser account id, keyphrase, and rank. In addition to bid and revenue, impression and click information is included. A keyphrase is a sequence of keywords, and a keyphrase category is itself a keyword. A keyphrase can belong to one or more keyphrase categories. The keyphrase categories used for this dataset are listed in a separate file. Each advertiser account id is represented as a meaningless string. Each keyphrase is represented as a sequence of meaningless strings, where each string represents a keyword or keyphrase category. The size of this dataset is 845 MB.
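As an illustration of what the daily aggregates support, the sketch below computes click-through rate per ad rank from rows keyed by day, account id, keyphrase, and rank. The DailyRow field layout, the function name ctr_by_rank, and the sample values are assumptions made for this example; the actual file format is not described here.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple

class DailyRow(NamedTuple):
    # Hypothetical field layout, not the documented file format.
    day: str
    account_id: str              # anonymized advertiser account id
    keyphrase: Tuple[str, ...]   # sequence of anonymized keyword strings
    rank: int
    avg_bid: float
    impressions: int
    clicks: int

def ctr_by_rank(rows):
    """Return click-through rate (clicks / impressions) for each ad rank."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for row in rows:
        impressions[row.rank] += row.impressions
        clicks[row.rank] += row.clicks
    # Skip ranks with no impressions to avoid dividing by zero.
    return {rank: clicks[rank] / impressions[rank]
            for rank in impressions if impressions[rank] > 0}

# Made-up rows for illustration only.
rows = [
    DailyRow("day1", "a1", ("k12", "k7"), 1, 0.50, 1000, 50),
    DailyRow("day1", "a2", ("k12", "k7"), 2, 0.40, 1000, 20),
    DailyRow("day2", "a1", ("k12", "k7"), 1, 0.55, 500, 25),
]
ctr = ctr_by_rank(rows)
```

Summing clicks and impressions before dividing weights each day by its traffic, rather than averaging per-day rates.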

A4 - Yahoo Data Targeting User Modeling, Version 1.0 (Hosted on AWS) (3.7 GB)

This dataset contains a small sample of user profiles and their interests generated from several months of user activities on Yahoo webpages. Each user is represented as one feature vector and its associated labels, with all user identifiers removed. Feature vectors are derived from user activities during a training period of 90 days, and labels from a test period of 2 weeks that immediately followed the training period. Each dimension of the feature vector quantifies a user activity with a certain interest category from an internal Yahoo taxonomy (e.g., "Sports/Baseball", "Travel/Europe"), calculated from user interactions with pages, ads, and search results, all of which are internally classified into these interest categories. The labels are derived in a similar way, based on user interactions with classified pages, ads, and search results during the test period. It is important to note that there exists a hierarchical structure among the labels, which is also provided in the dataset.

All feature and label names are replaced with meaningless anonymous numbers so that no identifying information is revealed. The dataset is of particular interest to Machine Learning and Data Mining communities, as it may serve as a testbed for classification and multi-label algorithms, as well as for classifiers that account for structure among labels.
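One simple way to account for the label hierarchy is to expand each user's label set so that it also contains every ancestor label. The sketch below assumes the hierarchy has been loaded as a child-to-parent mapping over the anonymized label numbers. The mapping format, the function name with_ancestors, and the sample hierarchy are hypothetical.

```python
def with_ancestors(labels, parent):
    """Expand a label set so that every ancestor of each label is included.

    parent maps a label to its parent label; root labels have no entry.
    """
    expanded = set()
    for label in labels:
        node = label
        # Walk upward until the root, or until reaching a label whose
        # ancestors were already added by an earlier walk.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded

# Hypothetical hierarchy: 1 is the root, 2 and 3 are children of 1,
# and 4 is a child of 2.
parent = {2: 1, 3: 1, 4: 2}

leaf_only = with_ancestors({4}, parent)
two_branches = with_ancestors({3, 4}, parent)
```

Expanding labels this way lets a flat multi-label classifier be trained and evaluated consistently with the hierarchy, although it is only one of several possible treatments.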