Advertising and Market Data

A3 - Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords, version 1.0 (845 MB)

This dataset contains a small sample of advertisers' bid and revenue information over a period of 4 months. Bid and revenue information is aggregated at daily granularity by advertiser account id, keyphrase, and rank. Impression and click information is also included. A keyphrase is a sequence of keywords, and a keyphrase category is itself a keyword. A keyphrase can belong to one or more keyphrase categories. The keyphrase categories used for this dataset are listed in a separate file. Each advertiser account id is represented as a meaningless string. Each keyphrase is represented as a sequence of meaningless strings, where each string represents a keyword or keyphrase category. The size of this dataset is 845 MB.
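To make the aggregation granularity concrete, here is a minimal Python sketch that rolls daily, per-rank rows up to one click-through rate per advertiser account and keyphrase. The field names (`day`, `account_id`, `keyphrase`, `rank`, `impressions`, `clicks`) and the sample values are assumptions for illustration only; the actual file layout is not described here.

```python
from collections import defaultdict

# Hypothetical rows: one per (day, account_id, keyphrase, rank).
# The real column names and order are not given in the description.
sample_rows = [
    {"day": 1, "account_id": "a1", "keyphrase": "k7 k9", "rank": 1,
     "impressions": 200, "clicks": 10},
    {"day": 2, "account_id": "a1", "keyphrase": "k7 k9", "rank": 2,
     "impressions": 300, "clicks": 15},
    {"day": 1, "account_id": "a2", "keyphrase": "k7 k9", "rank": 2,
     "impressions": 100, "clicks": 0},
]


def click_through_rates(rows):
    """Sum impressions and clicks over days and ranks, then return
    clicks / impressions for each (account_id, keyphrase) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [impressions, clicks]
    for row in rows:
        key = (row["account_id"], row["keyphrase"])
        totals[key][0] += row["impressions"]
        totals[key][1] += row["clicks"]
    # Skip pairs with no impressions to avoid dividing by zero.
    return {key: clicks / imps
            for key, (imps, clicks) in totals.items() if imps > 0}


rates = click_through_rates(sample_rows)
```

Grouping on a different subset of the key, for example `(day, rank)`, would give click-through rate by ad position instead.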

A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 (25 MB)

The Yahoo!/O'Reilly Tech Buzz Game tests the theory that a free electronic market can predict trends in technology. In the game, users buy and sell fantasy "stocks" in various technologies. The prices of the stocks fluctuate according to supply and demand using a market mechanism invented at Yahoo! called a dynamic pari-mutuel market. Weekly dividends are paid out to stockholders based on the current "buzz score," or percentage of Yahoo! searches associated with each technology. This dataset shows all the transactions over the course of the Tech Buzz Game, divided into two periods. The first period spans April 1, 2005 to July 31, 2005, after which all markets were closed and cashed out according to final buzz scores at that time. The second period spans August 22, 2005 to the present, at the beginning of which new markets were established and the market reset to a starting point. Traders are represented as meaningless anonymous numbers so that no identifying information is revealed. In addition, the dataset contains buzz score data showing the daily percentages of search volume for each stock. Researchers may use the data to test the predictive value of the market or to test market behavioral theories. The size of this dataset is 25 MB.

A1 - Yahoo! Search Marketing Advertiser Bidding Data, version 1.0 (81 MB)

Yahoo! Search Marketing operates Yahoo!'s auction-based platform for selling advertising space next to Yahoo! Search results. Advertisers bid for the right to appear alongside the results of particular search queries. For example, a travel vendor might bid for the right to appear alongside the results of the search query "Las Vegas travel". An advertiser's bid is the price the advertiser is willing to pay whenever a user actually clicks on their ad. Yahoo! Search Marketing auctions are continuous and dynamic. Advertisers may alter their bids at any time, for example raising their bid for the query "buy flowers" during the week before Valentine's Day. This dataset contains the bids over time of all advertisers participating in Yahoo! Search Marketing auctions for the top 1000 search queries during the period from June 15, 2002, to June 14, 2003. Advertisers' identities and query phrases are represented as meaningless anonymous numbers so that no identifying information about advertisers is revealed. The data may be used by economists or other researchers to investigate the behavior of bidders in this unique real-time auction format, which was responsible for roughly two billion dollars in revenue in 2005 and was still growing. The size of this dataset is 81 MB.

A4 - Yahoo Data Targeting User Modeling, Version 1.0 (Hosted on AWS) (3.7 GB)

This data set contains a small sample of user profiles and their interests generated from several months of user activities at Yahoo webpages. Each user is represented as one feature vector and its associated labels, where all user identifiers were removed. Feature vectors are derived from user activities during a training period of 90 days, and labels from a test period of 2 weeks that immediately followed the training period. Each dimension of the feature vector quantifies a user activity with a certain interest category from an internal Yahoo taxonomy (e.g., "Sports/Baseball", "Travel/Europe"), calculated from user interactions with pages, ads, and search results, all of which are internally classified into these interest categories. The labels are derived in a similar way, based on user interactions with classified pages, ads, and search results during the test period. It is important to note that there exists a hierarchical structure among the labels, which is also provided in the data set.

All feature and label names are replaced with meaningless anonymous numbers so that no identifying information is revealed. The dataset is of particular interest to Machine Learning and Data Mining communities, as it may serve as a testbed for classification and multi-label algorithms, as well as for classifiers that account for structure among labels.
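As one illustration of accounting for structure among labels, the sketch below expands a user's label set with every ancestor label, a common preprocessing step for hierarchical multi-label classification. The child-to-parent dictionary and the numeric label ids are assumptions for illustration; the format in which the data set provides its hierarchy is not described here.

```python
def with_ancestors(labels, parent):
    """Return the given labels plus all of their ancestors.

    `parent` maps a label id to its parent id; root labels are absent
    from the mapping. This representation is hypothetical.
    """
    expanded = set()
    for label in labels:
        node = label
        # Walk up until reaching a root or a label already collected.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded


# Hypothetical hierarchy: 0 is a root, 1 is its child, 3 and 4 are children of 1.
example_parent = {1: 0, 3: 1, 4: 1}
expanded_labels = with_ancestors({3}, example_parent)
```

With this expansion a flat multi-label classifier sees consistent targets: a user labeled with a leaf category is also labeled with each of its parent categories.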

A5 - Yahoo! User prospective conversion prediction dataset, version 2.0 (56 GB)

We share user activity trail datasets: timestamp-annotated sequences of activities collected from users online. The activities are derived from various sources, e.g., Yahoo Search, commercial email receipts, and reading news and other content on publisher webpages associated with Verizon Media, such as the Yahoo and AOL homepages, Yahoo Finance, Sports and News, HuffPost, and TechCrunch. They also include advertising data from Yahoo Gemini and Verizon Media DSP, covering ad activity and advertiser data (e.g., ad impressions, clicks, conversions, and site visits). These sequences precede events of particular interest to advertisers. Two types of events of interest are considered: conversion events, in a retargeting setup and in a prospecting setup, and retargeting events. These three setups yield three datasets for each of two major advertisers, from the retail and communications categories, running conversion campaigns with Verizon Media. The data was collected over 100 days ending in May 2019, for a total of 6 distinct datasets.

- The first dataset type is a conversion prediction dataset with a sequence of all user activities preceding the conversion with an advertiser.
- The second dataset type is a prospective conversion prediction dataset with a sequence of eligible (non-retargeting) user activities preceding the conversion with an advertiser.
- The third dataset contains a sequence of all user activities preceding the first recorded retargeting event of a user.

For dataset labels, Advertiser 1 defined three unique conversions, while Advertiser 2 defined one. Both advertisers defined a single retargeting target rule.

These datasets contain no user information. Activities are anonymized, and timestamps are misaligned between different user trails while preserving all necessary information. Researchers may validate conversion prediction and other systems and algorithms that run on users’ historical activity data. The dataset may serve as a testbed for modeling sequences of users’ activities and modeling temporal information through Deep Learning or other Machine Learning and Statistics techniques for a variety of tasks in supervised and unsupervised learning.

A6 - Yahoo! Auction State for a Sample of Real-Time Bid Video Ads Version 1.0 (8.4 MB)

The dataset contains a small sample of the video ads running on Verizon SSP during 2018, and the auctions in which those ads participated. For each date, hour, and ad, the dataset shows a series of binned bid prices from 0 to 50 in increments of 0.10, along with the number of impressions that could have been bought at each price.
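The price grid described above can be reproduced with a short Python sketch. It assumes both endpoints (0 and 50) are included, which the description does not state explicitly, and it uses `Decimal` so that repeated steps of 0.10 do not accumulate floating-point error. The function name and parameters are hypothetical.

```python
from decimal import Decimal


def bid_price_bins(low="0", high="50", step="0.10"):
    """Return evenly spaced bid prices from low to high.

    Assumption: both endpoints are part of the grid.
    """
    low, high, step = Decimal(low), Decimal(high), Decimal(step)
    count = int((high - low) / step)  # number of steps between endpoints
    return [low + i * step for i in range(count + 1)]


bins = bid_price_bins()
```

Under the inclusive assumption the grid has 501 prices per date, hour, and ad.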

A7 - Challenging Fashion Queries dataset, version 1.0 (199 KB)

This dataset contains annotator judgments of the accuracy and reasonableness of a set of apparel products as responses to queries, where each query consists of a product together with a short caption describing how the response product should be different. There are three categories of products: 134 “dresses”, 100 “gowns” and 100 “sundresses”, each represented by an image to which the dataset provides a URL. Each of 40 queries (20 dresses, 10 gowns, 10 sundresses) consists of one of the products plus a short caption.