L6 - Yahoo! Answers Comprehensive Questions and Answers version 1.0 (multi part)
Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The data we have collected is the Yahoo! Answers corpus as of October 25, 2007. It includes all the questions and their corresponding answers. The corpus distributed here contains 4,483,032 questions and their answers.
In addition to question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category that were assigned to each question. No personal information is included in the corpus. The question URIs and all user ids were anonymized so that no identifying information is revealed. This dataset may be used by researchers to learn and validate answer extraction models. An example of such work was published by Surdeanu et al. (2008).
There are 2 files in this dataset. Part 1 is 1.7 Gbyte and part 2 is 1.9 Gbyte.