This dataset contains n-grams (contiguous sequences of n words), for n = 1 to 5, extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites. The documents were published on these sites between February 2006 and December 2006. The dataset does not contain the documents themselves, only the n-grams that occur at least twice. For each n-gram it provides statistics such as frequency of occurrence and the number and entropy of distinct left and right single-token contexts. Researchers may use this dataset to build statistical language models for speech recognition, handwriting recognition, or machine translation.
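As a rough illustration of the statistics described above, the following Python sketch counts n-grams and computes the number and entropy of their left and right single-token contexts. It is not the pipeline used to build the dataset: whitespace tokenization, the `<S>` and `</S>` boundary markers, and the function name `ngram_stats` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(sentences, n, min_count=2):
    """Count n-grams and summarize their left/right single-token contexts."""
    counts = Counter()
    left = defaultdict(Counter)   # n-gram -> counts of the token just before it
    right = defaultdict(Counter)  # n-gram -> counts of the token just after it

    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            # Assumption: sentence boundaries count as contexts, marked <S> and </S>.
            left[gram][tokens[i - 1] if i > 0 else "<S>"] += 1
            right[gram][tokens[i + n] if i + n < len(tokens) else "</S>"] += 1

    def entropy(context_counts):
        # Shannon entropy in bits of the context distribution.
        total = sum(context_counts.values())
        return sum(c / total * math.log2(total / c) for c in context_counts.values())

    # Keep only n-grams seen at least min_count times (twice, as in the dataset).
    return {
        gram: {
            "freq": freq,
            "left_types": len(left[gram]),
            "left_entropy": entropy(left[gram]),
            "right_types": len(right[gram]),
            "right_entropy": entropy(right[gram]),
        }
        for gram, freq in counts.items()
        if freq >= min_count
    }


stats = ngram_stats(["the cat sat", "the cat ran", "a dog sat"], n=2)
print(stats[("the", "cat")])
```

In this toy input the bigram "the cat" occurs twice, always at the start of a sentence but followed by two different words, so its left context has a single type while its right context has two.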
There are 3 files in this dataset. They are 3.5 Gbyte, 4.3 Gbyte and 4.4 Gbyte. All datasets have been reviewed to conform to Yahoo!'s data protection standards, including strict controls on privacy. We have a number of datasets that we are excited to share with you. Learn how to get involved. Department Head approval is not required.
The dataset contains about 100 million triples of RDF data obtained by extracting metadata from publicly available webpages. Three forms of embedded metadata are extracted: microformats (hCard, hCalendar and hReview), RDFa metadata and RDF documents linked to webpages. All metadata extracted from a webpage is converted to RDF. The data is made available in the WARC format, version 0.9.
The dataset may serve as a testbed for research in scalability in the Semantic Web area and also for developing methods to deal with metadata that is incomplete, erroneous or biased in some way.
The size of this dataset is 2.4 Gbyte.
This SW1 dataset contains a snapshot of the English Wikipedia dated from 2006-11-04 processed with a number of publicly-available NLP tools. In order to build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam. This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text is extracted from the XML entry and split into sentences using simple heuristics. Then we ran several syntactic and semantic NLP taggers on it and collected their output.
Raw Data (Multitag format)
The multitag format contains all the Wikipedia text plus all the semantic tags. All other data files can be reconstructed from this. A multitag file contains several Wikipedia entries. The Wikipedia snapshot was cut into 3000 multitag files each containing roughly 500 entries.
There are 4 files in this dataset ranging in size from 1.3 Gbyte to 1.8 Gbyte. Here are all the papers published on this Webscope Dataset:
- Inferring the Most Important Types of a Query: a Semantic Approach
- Ranking Very Many Typed Entities on Wikipedia
- Learning to Tag and Tagging to Learn: A Case Study on Wikipedia
Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers.
In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008).
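The selection rule for manner questions described above can be approximated with a short Python sketch. It covers only the "how ..." prefix and the four-word minimum. The noun and verb requirement needs a part-of-speech tagger and is left out here. The function name `is_manner_question` is hypothetical and this is not the code used to build the corpus.

```python
import re

# Questions must start with "how" followed by one of the listed words.
MANNER_PREFIX = re.compile(
    r"^how\s+(to|do|did|does|can|would|could|should)\b", re.IGNORECASE
)


def is_manner_question(text, min_words=4):
    """Return True if text looks like a manner question of sufficient length."""
    text = text.strip()
    # Assumption: words are whitespace-separated tokens.
    if len(text.split()) < min_words:
        return False
    return MANNER_PREFIX.match(text) is not None


for question in ["How do I bake sourdough bread?", "How much is it?", "how to swim"]:
    print(question, "->", is_manner_question(question))
```

A fuller filter would pass each surviving question and answer through a tagger and keep it only if the tags include at least one noun and at least one verb.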
The size of this dataset is 102 MB. Here are all the papers published on this Webscope Dataset:
- Learning to Rank Answers on Large Online QA Collections
Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump. It is a small subset of the questions, selected for their linguistic properties (for example they all start with "how {to|do|did|does|can|would|could|should}"). Additionally, we removed questions and answers of obvious low quality, i.e., we kept only questions and answers that have at least four words, out of which at least one is a noun and at least one is a verb. The final subset contains 142,627 questions and their answers.
In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs were replaced with locally-generated identifiers. This dataset may be used by researchers to learn and validate answer-extraction models for manner questions. An example of such work was published by Surdeanu et al. (2008).
The size of this dataset is 104 MB.
Yahoo! Answers is a web site where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is the Yahoo! Answers corpus as of 10/25/2007. It includes all the questions and their corresponding answers. The corpus distributed here contains 4,483,032 questions and their answers.
In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that was assigned to this question. No personal information is included in the corpus. The question URIs and all user ids were anonymized so that no identifying information is revealed. This dataset may be used by researchers to learn and validate answer extraction models. An example of such work was published by Surdeanu et al. (2008).
There are 2 files in this dataset. Part 1 is 1.7 Gbyte and part 2 is 1.9 Gbyte. Here are all the papers published on this Webscope Dataset:
- Evaluating and Predicting Answer Quality in Community QA
This dataset contains URLs of questions posted to Yahoo! Answers, along with the question types assigned to these questions by human judges. The question types are "informational", "advice", "opinion", and "polling".
The size of this dataset is 14 K.
This dataset contains the 1000 most frequent web search queries issued to Yahoo! Search for nine different languages. The languages covered are Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The dataset may be useful for various information retrieval and data mining research investigations, especially those involving cross- or multi-lingual search tasks.
The size of this dataset is 45 K.
This dataset contains URLs of questions posted to Yahoo! Answers, along with the question types assigned to these questions by human judges. The question types are "informational", "advice", "opinion", and "polling".
The size of this dataset is 14 K.
This dataset consists of Yahoo! search queries that share clicked URLs with TREC queries. Queries that share clicked URLs are often referred to as co-clicked queries. The TREC queries cover a subset of TREC topics 451-550, 701-850, and wt09-01 through wt09-50, which are widely used within the information retrieval community for various Web search-related experiments.
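To make the notion of co-clicked queries concrete, here is a small Python sketch that groups queries by clicked URL and collects, for each seed query, the other queries that share at least one clicked URL with it. The input layout (pairs of query and clicked URL) and the name `co_clicked_queries` are assumptions for illustration. The actual format of the dataset files is not described here.

```python
from collections import defaultdict


def co_clicked_queries(clicks, seed_queries):
    """Map each seed query to the other queries that share a clicked URL with it.

    clicks: iterable of (query, clicked_url) pairs (hypothetical log layout).
    """
    queries_by_url = defaultdict(set)
    for query, url in clicks:
        queries_by_url[url].add(query)

    seeds = set(seed_queries)
    result = {seed: set() for seed in seeds}
    for queries in queries_by_url.values():
        # Every seed that led to this URL is co-clicked with the URL's other queries.
        for seed in queries & seeds:
            result[seed] |= queries - {seed}
    return result


clicks = [
    ("seed topic query", "http://example.com/a"),
    ("related user query", "http://example.com/a"),
    ("unrelated query", "http://example.com/b"),
]
print(co_clicked_queries(clicks, ["seed topic query"]))
```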
The size of this dataset is 33K.
This dataset contains lists of popular web sources clicked by Yahoo! users when they search for entities within two popular domains, Athletes and Politicians. In addition to the global popular list, the dataset also includes location-specific and location/entity-specific popular lists. This dataset can be used to study location bias in entities and web sources, and therefore allows researchers to study location-specific information extraction and entity portal generation.
Total size for this dataset is 14MB.
This dataset contains a random sample of 4496 queries posted to Yahoo's US search engine in January 2009. For privacy reasons, the query set contains only queries that have been asked by at least three different users and that contain only letters of the English alphabet, sequences of digits no longer than four digits, and punctuation characters. The query set does not contain user information, nor does it preserve temporal aspects of the query log.
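The privacy filter described above can be sketched in Python as follows. This is an illustrative approximation, not the procedure Yahoo! applied: the log layout (pairs of user id and query), the decision to allow whitespace, the use of ASCII punctuation from `string.punctuation`, and the name `filter_queries` are all assumptions.

```python
import re
import string
from collections import defaultdict

# Assumption: allowed characters are ASCII letters, digits, whitespace, and ASCII punctuation.
ALLOWED = re.compile(r"^[A-Za-z0-9\s" + re.escape(string.punctuation) + r"]+$")


def filter_queries(log, min_users=3, max_digit_run=4):
    """Keep queries issued by enough distinct users that pass the character rules.

    log: iterable of (user_id, query) pairs (hypothetical layout).
    """
    users_by_query = defaultdict(set)
    for user_id, query in log:
        users_by_query[query].add(user_id)

    # Matches any run of digits longer than max_digit_run.
    too_long = re.compile(r"[0-9]{%d,}" % (max_digit_run + 1))
    return sorted(
        query
        for query, users in users_by_query.items()
        if len(users) >= min_users
        and ALLOWED.match(query)
        and not too_long.search(query)
    )


log = [(user, "weather 2009") for user in ("u1", "u2", "u3")]
log += [(user, "zip 902101") for user in ("u1", "u2", "u3")]
log += [("u1", "rare query"), ("u2", "rare query")]
print(filter_queries(log))
```

Counting distinct users rather than raw occurrences matters here: a single user repeating a query many times should not make that query eligible.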
Total size for this dataset is 41K.
This dataset contains a small sample of Yahoo! Answers question/answer pages visited following search engine queries in August 2010. The dataset also contains user ratings of query clarity, query-question match, and query-answer satisfaction collected using Amazon Mechanical Turk. The dataset may be used by researchers to validate algorithms to predict searcher satisfaction with existing community-based answers. It may also enable researchers to validate algorithms to predict query clarity and query-question match.
The size of this dataset is 1.5 MB.
Anonymized Yahoo! Search Logs with Relevance Judgments version 1.0
The size of this dataset is 1.3 Gbyte. Here are all the papers published on this Webscope Dataset:
- Ranked accuracy and unstructured distributed search
This dataset contains a small sample of pages that contain complex HTML forms. Complex forms are HTML forms that have 3 or more form controls such as input tags (type: text|checkbox|radio|image|button) and select tags (dropdown boxes). The dataset contains 2.67 million complex forms. Such data may be useful for form classification and in uncovering hidden web data.
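The definition of a complex form can be illustrated with Python's standard `html.parser` module. The sketch below counts, per form, the input tags of the listed types plus select tags, and reports the forms with at least three such controls. The class and function names are hypothetical, treating a missing type attribute as "text" follows the HTML default, and the original extraction may have counted other controls as well.

```python
from html.parser import HTMLParser

# Input types named in the dataset description.
CONTROL_INPUT_TYPES = {"text", "checkbox", "radio", "image", "button"}


class FormControlCounter(HTMLParser):
    """Collect the number of counted controls for each <form> in a page."""

    def __init__(self):
        super().__init__()
        self.counts = []       # one entry per form, in document order
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self.counts.append(0)
        elif self._in_form and tag == "select":
            self.counts[-1] += 1
        elif self._in_form and tag == "input":
            # An input without a type attribute defaults to a text field.
            input_type = (dict(attrs).get("type") or "text").lower()
            if input_type in CONTROL_INPUT_TYPES:
                self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def complex_forms(html, min_controls=3):
    """Return the control counts of forms with at least min_controls controls."""
    parser = FormControlCounter()
    parser.feed(html)
    parser.close()
    return [count for count in parser.counts if count >= min_controls]


page = (
    '<form><input type="text"><input type="checkbox">'
    "<select><option>a</option></select></form>"
    '<form><input type="text"><input type="hidden"></form>'
)
print(complex_forms(page))
```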
This dataset is very large, over 50 Gbyte.
This dataset contains a large sample of noun phrases and their context, extracted from Yahoo! News data. The data can be used for AI and NLP studies. Due to the size of this dataset, it can only be shipped at this time.
This dataset contains browsing behavior data for a collection of users on Yahoo! Answers from several months in 2006. Users interact socially and are rewarded by a point system based on the Q&A system. The data includes questions, answers, and browsing behavior for users on the site. There is no textual or NLP information. The data may be used with machine learning techniques to discern the different types of users on Yahoo! Answers and how they interact with the site by asking and answering questions or browsing, or to test models of different classes of users by how they move around the site and interact with the point system and with each other.
This dataset contains a sample of Yahoo! Search queries issued in June 2011 by searchers who posted a relevant question on Yahoo! Answers shortly after searching.
It also contains the category information of the posted question. The dataset may be used by researchers to better understand the transition from searchers to askers, especially the information needs causing the transition.