Named entity recognition and linking systems use statistical models trained over large amounts of labeled text. A major challenge is to detect entities accurately in new languages, at scale, with little labeled data available, and while consuming limited resources (memory and processing power).
In this dataset we release datapacks for English, Spanish, and Chinese built for Fast Entity Linker, our unsupervised, accurate, and extensible multilingual named entity recognition and linking system.
In our system, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times through compressed data structures and aggressive hashing. The models released in this dataset include entity embeddings and Wikipedia click-log data.
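One way to see how hashing trades memory for a small collision risk is to key the alias table by fixed-width integer hashes instead of full strings. The sketch below is a minimal illustration of that idea, not the system's actual data structures or hash functions:

```python
import hashlib

def alias_key(alias, bits=64):
    """Hash an alias string to a fixed-width integer key.

    Storing 64-bit keys instead of full alias strings shrinks the table,
    at the cost of a small collision probability per alias pair.
    """
    digest = hashlib.sha1(alias.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")

# A toy alias -> entity table keyed by hashes rather than strings.
# The aliases and entity ids here are illustrative examples only.
table = {alias_key(a): e for a, e in [
    ("obama", "Barack_Obama"),
    ("liverpool", "Liverpool_F.C."),
]}

print(table[alias_key("Obama")])  # Barack_Obama
```

Because lookups hash the query alias the same way, the original strings never need to be stored at all.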
Entity embeddings are vector-based representations that capture how entities are referred to in language contexts. We train entity embeddings on Wikipedia articles, using the hyperlinks in each article as canonical mentions of their associated entities.
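A common preprocessing step for this kind of training is to replace each hyperlink with a single token for its target entity, so that a word2vec-style model then learns entity vectors from the surrounding words. The following is a minimal sketch of that step under the assumption of MediaWiki-style markup; the `ENT/` prefix is a hypothetical token convention, not the released format:

```python
import re

# Matches MediaWiki links: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def to_entity_tokens(text):
    """Replace each hyperlink with one token naming its target entity,
    so an embedding model sees the entity in its textual context."""
    def repl(m):
        target = m.group(1).strip().replace(" ", "_")
        return f"ENT/{target}"
    return WIKILINK.sub(repl, text)

sent = "[[Barack Obama|Obama]] met leaders in [[Berlin]]."
print(to_entity_tokens(sent))
# ENT/Barack_Obama met leaders in ENT/Berlin.
```

After this substitution, entity tokens and ordinary words share one vocabulary, so the resulting vectors for entities and context words live in the same space.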
Wikipedia click-logs give very useful signals for disambiguating partial or ambiguous entity mentions, such as Obama (Michelle or Barack), Liverpool (the city or the football club), or Fox (a person or an organization). We extract the internal links from Wikipedia and create (alias, entity) pairs, where the alias is the anchor text and the entity is the id of the Wikipedia page the link points to.
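The pair-extraction step described above can be sketched as follows, again assuming MediaWiki-style markup; counting how often each (alias, entity) pair occurs is one simple way to turn the pairs into a disambiguation signal (the example corpus and counts are illustrative, not drawn from the released data):

```python
import re
from collections import Counter

# Matches MediaWiki links: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def alias_entity_pairs(text):
    """Yield (alias, entity) pairs: the alias is the anchor text
    (or the target itself when no anchor text is given)."""
    for m in WIKILINK.finditer(text):
        entity = m.group(1).strip().replace(" ", "_")
        alias = (m.group(2) or m.group(1)).strip()
        yield alias, entity

corpus = ("[[Barack Obama|Obama]] spoke. [[Michelle Obama|Obama]] wrote. "
          "[[Barack Obama|Obama]] left [[Liverpool]].")
counts = Counter(alias_entity_pairs(corpus))
print(counts[("Obama", "Barack_Obama")])  # 2
```

Normalizing each alias's counts over its candidate entities gives a prior such as P(entity | alias), which favors the most frequently linked interpretation of an ambiguous mention.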