Advertising and Market Data

A5 - Yahoo! User prospective conversion prediction dataset, version 2.0 (56GB)

We share user activity trail datasets: timestamp-annotated sequences of activities collected from users online. The activities are derived from various sources, e.g., Yahoo Search, commercial email receipts, and reading news and other content on publisher webpages associated with Verizon Media (such as the Yahoo and AOL homepages, Yahoo Finance, Sports and News, HuffPost, and TechCrunch), as well as advertising data from Yahoo Gemini and Verizon Media DSP, including ad activity and advertiser data (e.g., ad impressions, clicks, conversions, and site visits). These sequences precede events of particular interest to advertisers. Two types of events of interest are considered: conversion events, in both a retargeting and a prospecting setup, and retargeting events. These three setups yield three datasets for each of two major advertisers, from the retail and communications categories, running conversion campaigns with Verizon Media. The data was collected over 100 days ending in May 2019, for a total of 6 distinct datasets.

- The first dataset type is a conversion prediction dataset with a sequence of all user activities preceding the conversion with an advertiser.
- The second dataset type is a prospective conversion prediction dataset with a sequence of eligible (non-retargeting) user activities preceding the conversion with an advertiser.
- The third dataset type is a retargeting dataset with a sequence of all user activities preceding the first recorded retargeting event of a user.

For dataset labels, Advertiser 1 defined three unique conversions, while Advertiser 2 defined one. Both advertisers defined a single retargeting rule.

These datasets contain no user information. Activities are anonymized, and timestamps are misaligned between different user trails, while all information necessary for modeling is preserved. Researchers may use the data to validate conversion prediction and other systems and algorithms that run on users' historical activity data. The dataset may also serve as a testbed for modeling sequences of user activities and for modeling temporal information with Deep Learning or other Machine Learning and Statistics techniques, for a variety of supervised and unsupervised learning tasks.
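
Because timestamps are misaligned between user trails, absolute times are presumably not comparable across users, while time gaps within a single trail may still be meaningful. The sketch below illustrates one way to work with that, under two stated assumptions: that a trail can be represented as a list of (timestamp, activity) pairs, and that the misalignment behaves like a constant offset per trail. Neither the record layout nor the activity names come from the dataset documentation; they are hypothetical.

```python
from typing import List, Tuple

# Hypothetical trail representation: (timestamp_in_seconds, activity_id) pairs.
# The real file format is not described here, so this layout is an assumption.
Trail = List[Tuple[int, str]]


def to_relative_gaps(trail: Trail) -> List[Tuple[int, str]]:
    """Replace each absolute timestamp with the gap since the previous activity.

    The first activity in a trail gets a gap of 0. Gaps do not change if every
    timestamp in the trail is shifted by the same constant.
    """
    ordered = sorted(trail, key=lambda event: event[0])
    gaps: List[Tuple[int, str]] = []
    previous_time = None
    for timestamp, activity in ordered:
        gap = 0 if previous_time is None else timestamp - previous_time
        gaps.append((gap, activity))
        previous_time = timestamp
    return gaps


# Two made-up trails with identical behavior but different time offsets.
trail_a: Trail = [(1000, "search"), (1060, "ad_click"), (1300, "site_visit")]
trail_b: Trail = [(t + 86400, activity) for t, activity in trail_a]

gaps_a = to_relative_gaps(trail_a)
gaps_b = to_relative_gaps(trail_b)
```

Under the constant-offset assumption, both trails map to the same gap sequence, so a sequence model fed with gaps instead of raw timestamps would treat them identically.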