Language Data

L1 - Yahoo! N-Grams, version 2.0 (multi part) (Hosted on AWS)

This dataset contains n-grams (contiguous sequences of n words, for n = 1 to 5) extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites. The documents were published on these sites between February 2006 and December 2006. The dataset does not contain the documents themselves, only the n-grams that occur at least twice. For each n-gram it provides statistics such as the frequency of occurrence and the number and entropy of its distinct left (right) single-token contexts. This dataset may be used by researchers to build statistical language models for speech or handwriting recognition or machine translation. There are 3 files in this dataset; they are 3.5 Gbyte, 4.3 Gbyte and 4.4 Gbyte.
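
The context statistics can be illustrated with a short sketch. The counts below are made up and the field layout is hypothetical (the actual file format is documented in the dataset's README); the sketch only shows how the entropy of an n-gram's single-token contexts would be computed.

```python
import math
from collections import Counter

def context_entropy(context_counts):
    """Shannon entropy (in bits) of the distribution of single-token
    contexts of an n-gram, given a mapping token -> count."""
    total = sum(context_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in context_counts.values())

# Made-up left contexts observed for the bigram "york times":
left = Counter({"new": 80, "the": 15, "a": 5})
print(len(left))                          # 3 distinct left contexts
print(round(context_entropy(left), 3))    # 0.884 bits
```

A high left-context entropy means the n-gram is preceded by many different tokens, a useful signal for tasks such as phrase-boundary detection.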

L2 - Metadata Extracted from Publicly Available Web Pages, version 1.0 (1.5 GB & 700 MB)

The dataset contains about 100 million triples of RDF data obtained by extracting metadata from publicly available webpages. Three forms of embedded metadata are extracted: microformats (hCard, hCalendar and hReview), RDFa metadata, and RDF documents linked to webpages. All metadata extracted from a webpage is converted to RDF. The data is made available in the WARC format, version 0.9. The dataset may serve as a testbed for research on scalability in the Semantic Web area, as well as for developing methods to deal with metadata that is incomplete, erroneous or biased in some way. The size of this dataset is 2.3 Gbyte in two parts.
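
As a rough illustration of what such a triple looks like, here is a minimal matcher for the common N-Triples serialization; whether the dataset uses this exact serialization inside its WARC records is an assumption, and a real pipeline should use a proper RDF library such as rdflib.

```python
import re

# Minimal N-Triples matcher for the common case of two URI terms and a
# URI or plain-literal object; real pipelines should use rdflib instead.
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.')

def parse_triple(line):
    m = TRIPLE.match(line.strip())
    if m is None:
        return None
    subj, pred, obj = m.groups()
    return subj, pred, obj.strip('<>"')

line = '<http://example.org/card#me> <http://xmlns.com/foaf/0.1/name> "Alice" .'
print(parse_triple(line))
# ('http://example.org/card#me', 'http://xmlns.com/foaf/0.1/name', 'Alice')
```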

L3 - Yahoo! Semantically Annotated Snapshot of the English Wikipedia, version 1.0 (multi part)

This SW1 dataset contains a snapshot of the English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. To build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam. This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text is extracted from the XML entry and split into sentences using simple heuristics. Then we ran several syntactic and semantic NLP taggers on it and collected their output.

Raw Data (Multitag format): the multitag format contains all the Wikipedia text plus all the semantic tags; all other data files can be reconstructed from it. A multitag file contains several Wikipedia entries. The Wikipedia snapshot was cut into 3000 multitag files, each containing roughly 500 entries. There are 4 files in this dataset, ranging in size from 1.3 Gbyte to 1.8 Gbyte.


L4 - Yahoo! Answers Manner Questions, version 1.0 (102 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 102 MB.


L5 - Yahoo! Answers Manner Questions, version 2.0 (104 MB)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008). The size of this dataset is 104 MB.

L6 - Yahoo! Answers Comprehensive Questions and Answers version 1.0 (multi part)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is the Yahoo! Answers corpus as of 10/25/2007. It includes all the questions and their corresponding answers. The corpus distributed here contains 4,483,032 questions and their answers. In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs and all user ids were anonymized so that no identifying information is revealed. This dataset may be used by researchers to learn and validate answer extraction models. An example of such work was published by Surdeanu et al. (2008). There are 2 files in this dataset. Part 1 is 1.7 Gbyte and part 2 is 1.9 Gbyte.


L8 - Yahoo! Search Query Logs for Nine Languages, version 1.0 (45 K)

This dataset contains the 1000 most frequent web search queries issued to Yahoo! Search for nine different languages. The languages covered are Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The dataset may be useful for various information retrieval and data mining research investigations, especially those involving cross- or multi-lingual search tasks. The size of this dataset is 45 K.


L9 - Yahoo! Answers Question Types Sample of 1000, version 1.0 (14 K)

This dataset contains URLs of questions posted to Yahoo! Answers, along with the question types assigned to these questions by human judges. The question types are "informational", "advice", "opinion", and "polling". The size of this dataset is 14 K.

L15 - Yahoo! Search queries that share clicked URLs with TREC queries, version 1.0 (33 K)

This dataset consists of Yahoo! search queries that share clicked URLs with TREC queries. Queries that share clicked URLs are often referred to as co-clicked queries. The TREC queries cover a subset of TREC topics 451-550, 701-850, and wt09-01 through wt09-50, which are widely used within the information retrieval community for various Web search-related experiments. The size of this dataset is 33K.

L12 - Yahoo! Search Popularity by Location for Websites on Politicians and Athletes (14 M)

This dataset contains lists of popular web sources clicked by Yahoo! users when they search for entities within two popular domains, Athletes and Politicians. In addition to the global popularity list, this dataset also includes location-specific and location/entity-specific popularity lists. This dataset can be used to study location bias in entities and web sources, allowing researchers to study location-specific information extraction and entity portal generation. Total size for this dataset is 14 MB.

L13 - Yahoo! Search Query Tiny Sample (41 K)

This dataset contains a random sample of 4496 queries posted to Yahoo's US search engine in January 2009. For privacy reasons, the query set contains only queries that were asked by at least three different users and that contain only letters of the English alphabet, sequences of digits no longer than four, and punctuation characters. The query set does not contain user information, nor does it preserve temporal aspects of the query log. Total size for this dataset is 41K.
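
The privacy filter described above can be sketched as a regular expression. The exact rules Yahoo! applied are not published, so the pattern below is only an approximation of the stated criteria (English letters, punctuation, and digit runs of at most four characters; allowing spaces between words is an additional assumption).

```python
import re
import string

PUNCT = re.escape(string.punctuation)
# Keep a query only if every character is an English letter, a space,
# a punctuation mark, or part of a digit run of length at most four.
ALLOWED = re.compile(rf"^(?:[A-Za-z {PUNCT}]|(?<!\d)\d{{1,4}}(?!\d))+$")

def passes_filter(query):
    return ALLOWED.match(query) is not None

print(passes_filter("best pizza in nyc"))   # True
print(passes_filter("call 2024"))           # True  (digit run of four)
print(passes_filter("ssn 123456789"))       # False (digit run too long)
```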

L16 - Yahoo! Answers Query to Questions (1.5 MB)

This dataset contains a small sample of Yahoo! Answers question/answers pages visited following search engine queries in August 2010. The dataset also contains user ratings of query clarity, query-question match, and query-answer satisfaction collected using Amazon Mechanical Turk. The dataset may be used by researchers to validate algorithms to predict searcher satisfaction with existing community-based answers. It may also enable researchers to validate algorithms to predict query clarity and query-question match. The size of this dataset is 1.5 MB.

L18 - Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 Gbyte)

Anonymized Yahoo! Search Logs with Relevance Judgments, version 1.0. The size of this dataset is 1.3 Gbyte.


L11 - HTML Forms Extracted from Publicly Available Webpages, version 1.0 (50Gb+) (Hosted on AWS)

This dataset contains a small sample of pages that contain complex HTML forms. Complex forms are HTML forms that have three or more form controls, such as input tags (type: text|checkbox|radio|image|button) and select tags (drop-down boxes). The dataset contains 2.67 million complex forms. Such data may be useful for form classification and for uncovering hidden web data. This dataset is very large, over 50 Gbyte.
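
The "complex form" criterion can be sketched with Python's standard html.parser. The actual extraction pipeline behind the dataset is not described, so this only illustrates the three-or-more-controls threshold on a made-up form.

```python
from html.parser import HTMLParser

COUNTED_INPUT_TYPES = {"text", "checkbox", "radio", "image", "button"}

class FormControlCounter(HTMLParser):
    """Counts the form controls named in the dataset description;
    a form with three or more of them counts as 'complex'."""
    def __init__(self):
        super().__init__()
        self.controls = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" or (tag == "input"
                               and attrs.get("type") in COUNTED_INPUT_TYPES):
            self.controls += 1

counter = FormControlCounter()
counter.feed("""<form action="/search">
  <input type="text" name="q">
  <input type="checkbox" name="exact">
  <select name="lang"><option>en</option></select>
  <input type="submit" value="Go">
</form>""")
print(counter.controls)   # 3 -> this form qualifies as complex
```

Note that the submit button is not among the counted control types, so it does not contribute to the threshold.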

L19 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains a large sample of noun phrases and their context, extracted from Yahoo! News data. The data can be used for AI and NLP studies.

L20 - Yahoo! Answers browsing behavior, version 1.0 (166Gb) (Hosted on AWS)

This dataset contains browsing behavior data for a collection of users on Yahoo! Answers over several months in 2006. Users interact socially and are rewarded through a point system based on their Q&A activity. The data includes questions, answers, and browsing behavior for users on the site. There is no textual or NLP information. The data may be used with machine learning techniques to discern the different types of users on Yahoo! Answers and how they interact with the site by asking and answering questions or browsing, or to test models of different classes of users by how they move around the site and interact with the point system and with each other.

L21 - Yahoo! Answers Query To Questions, version 2.0 (24K)

This dataset contains a sample of Yahoo! Search queries issued in June 2011 by searchers who posted a relevant question on Yahoo! Answers shortly after searching. It also contains the category information of the posted question. The dataset may be used by researchers to better understand the transition from searchers to askers, especially the information needs causing the transition.

L22 - Yahoo! News Sessions Content, version 1.0 (16 MB)

This dataset contains a small sample of user sessions that contained a click in the Yahoo! News domain, along with the contents of a number of news articles present in those user sessions. Users and textual content are represented as meaningless anonymous numbers so that no identifying information is revealed. The textual content includes the tokens (words) found in the news articles, the article publication date, time expressions found in the news articles, and entities (locations, persons and organizations). The dataset includes a ground truth file that contains, for a given article, which articles have been clicked next in a user trail. The dataset may be used by researchers to validate content-based recommender systems or ranking algorithms, and may serve as a testbed for user-trail and content-based mining and recommendation algorithms.

L23 - Yahoo Answers Synthetic Questions, version 1.0


L24 - Yahoo Search Query Log To Entities, version 1.0 (1.7 MB)

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a "span"), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.

L25 - Yahoo N-Gram Representations, version 2.0 (2.6Gb) (Hosted on AWS)

This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and sentence similarity tasks, which are common in NLP research. We would like researchers to be able to produce query rewrites based on these representations and test them against other state-of-the-art techniques.

L26 - Yahoo! Answers consisting of questions asked in French, version 1.0 (3.8Gb) (Hosted on AWS)

Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from 2006 to 2015, consisting of 1.7 million questions posed in French and their corresponding answers. We only include questions which have been resolved, that is, questions which have received one or more answers. The dataset may serve as a testbed for multilingual question answering systems, as well as for research into user behavior on community question answering sites in other languages.

L27 - Yahoo Answers Factoids Queries, version 1.0 (3.5MB)

The dataset includes English queries that were input to a search engine in 2012-2014 and identified as "factoid" queries, i.e., queries referring to a short fact (filtered by requiring the answer to be no longer than 3 words). These queries were identified based on questions in English on Yahoo Answers that have a short best answer and a link to English Wikipedia. The dataset includes the query, its corresponding question title, the best answer, a number indicating the occurrence frequency of the query, the link(s) to English Wikipedia, and the URL of the Yahoo Answers page.
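
The filtering heuristic can be sketched as follows. The exact procedure used to build the dataset is not published, so the function below is an illustrative approximation; the name is_factoid and the treatment of links as separate from the word count are assumptions.

```python
import re

WIKI_LINK = re.compile(r"https?://en\.wikipedia\.org/wiki/\S+")

def is_factoid(best_answer):
    # A short fact: the best answer links to English Wikipedia and,
    # with the link(s) removed, is at most three words long.
    links = WIKI_LINK.findall(best_answer)
    words = WIKI_LINK.sub("", best_answer).split()
    return bool(links) and len(words) <= 3

print(is_factoid("Paris https://en.wikipedia.org/wiki/Paris"))      # True
print(is_factoid("It depends on many different factors, really."))  # False
```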

L28 - Yahoo Answers Query Treebank, version 1.0

User queries that were issued to the Yahoo Web search engine and for which the user ended up clicking a Yahoo Answers result.
The queries are tagged by linguists for syntactic segmentation and dependency parse tree within each segment.

L29 - Yahoo Answers Novelty Based Answer Ranking, version 1.0

CQA healthcare-related answers annotated with a mapping between the textual propositions within each answer and the aspects relevant to the target question.

L30 - Model files for Fast Entity Linker, version 1.0 (5.2G) (Hosted on AWS)

Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. A major challenge is to be able to accurately detect entities, in new languages, at scale, with limited labeled data available, and while consuming a limited amount of resources (memory and processing power).

In this dataset we release datapacks for English, Spanish, and Chinese built for our unsupervised, accurate, and extensible multilingual named entity recognition and linking system, Fast Entity Linker.

In our system, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data structures and aggressive hashing functions. The models released in this dataset include "entity embeddings" and "Wikipedia click-log data".

Entity embeddings are vector-based representations that capture how entities are referred to in language contexts. We train entity embeddings using Wikipedia articles, using the hyperlinks in the articles to map surface forms to the canonical forms of their associated entities.

Wikipedia click-logs give very useful signals for disambiguating partial or ambiguous entity mentions, such as Obama (Michelle or Barack), Liverpool (city or football team), or Fox (person or organization). We extract the in-wiki links from Wikipedia and create pairs (alias, entity), where the alias is the anchor text and the entity is the id of the Wikipedia page pointed to by the outgoing link.
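
The (alias, entity) pair extraction can be sketched on raw wiki markup. The regular expression and helper below are illustrative, not the pipeline actually used to build the datapacks.

```python
import re

# [[Target]] or [[Target|anchor text]] in wiki markup; the anchor
# (or the target itself when no anchor is given) becomes the alias.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def alias_entity_pairs(wikitext):
    pairs = []
    for target, anchor in WIKILINK.findall(wikitext):
        alias = anchor if anchor else target
        pairs.append((alias.strip(), target.strip()))
    return pairs

text = "[[Barack Obama|Obama]] met officials in [[Liverpool]]."
print(alias_entity_pairs(text))
# [('Obama', 'Barack Obama'), ('Liverpool', 'Liverpool')]
```

Aggregating such pairs over all of Wikipedia yields, for each alias, a frequency distribution over candidate entities that can drive disambiguation.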

L31 - Questions on Yahoo Answers labeled as either informational or conversational, version 1.0

The dataset includes non-deleted English questions from Yahoo Answers, posted between the years 2006 and 2016 and sampled uniformly at random. Each question includes a URL to its Yahoo Answers page, its title, description, high-level category (one of 26), direct category, and a label marking it as informational ('0') or conversational ('1'). A small subset of the questions is marked as borderline ('2').

L32 - The Yahoo News Annotated Comments Corpus, version 1.0

The dataset contains comment threads posted in response to online news articles. We annotated the dataset at the comment level and the thread level. The annotations include 6 dimensions of individual comments and 3 dimensions of threads as a whole. The coding was done by professional, trained editors and by untrained crowdsourced workers. The corpus contains annotations for a novel corpus of 2.4k threads and 9.2k comments from Yahoo News, plus 1k threads from the Internet Argument Corpus.