
AGDISTIS - Agnostic Named Entity Disambiguation Framework

14.09.2017

A contribution from Diego Moussallem

More than one exabyte of data is added to the Web every day₁. Automatic extraction of knowledge from this data demands the use of efficient Natural Language Processing (NLP) techniques such as text aggregation, text summarization and knowledge extraction. One of the most important NLP tasks is Entity Linking (EL), also known as Named Entity Disambiguation (NED). Informally, the goal here is as follows: Given a piece of text, a reference knowledge base K and a set of entity mentions in that text, map each entity mention to the corresponding resource in K.

Several challenges have to be addressed when dealing with EL. First, an entity can have a large number of Surface Forms (SF), also known as labels, due to synonymy, acronyms and typos. For example, New York City, NY and Big Apple are all labels for the same entity. Second, multiple entities can share the same name due to homonymy and ambiguity. For example, both the state and the city of New York are called New York.
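Both challenges can be made concrete with a toy surface-form index. The sketch below is only an illustration: the "ex:" resource identifiers and the tiny knowledge base are made up, and the exact-match lookup is not how AGDISTIS finds candidates.

```python
from collections import defaultdict

# Toy knowledge base: resource identifier -> known surface forms (labels).
# The "ex:" identifiers are hypothetical and used for illustration only.
KB_LABELS = {
    "ex:New_York_City": ["New York City", "NY", "Big Apple", "New York"],
    "ex:New_York_State": ["New York"],
}


def build_surface_form_index(kb_labels):
    """Invert the KB: lowercased surface form -> set of candidate resources."""
    index = defaultdict(set)
    for resource, labels in kb_labels.items():
        for label in labels:
            index[label.lower()].add(resource)
    return index


def candidates(mention, index):
    """Return the candidate resources for a mention, sorted for stable output."""
    return sorted(index.get(mention.lower(), set()))


INDEX = build_surface_form_index(KB_LABELS)
```

Here candidates("Big Apple", INDEX) yields a single resource (synonymy is resolved by the index), while candidates("New York", INDEX) yields two, which is exactly the ambiguity a disambiguation step has to resolve.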

Project page: aksw.org/Projects/AGDISTIS.html
Source code: github.com/AKSW/AGDISTIS
Manual: github.com/AKSW/AGDISTIS/wiki

Therefore, EL has recently been the subject of a significant body of research. Although there is a multitude of EL approaches, most of them focus only on the English language. We present AGDISTIS to address this gap for other languages.

What is AGDISTIS?
AGDISTIS is an open-source Named Entity Disambiguation framework, published in 2014 at ISWC₂, that can link entities to any Linked Data knowledge base. AGDISTIS combines graph-based algorithms with label expansion strategies and string similarity measures. Based on this combination, it can efficiently detect the correct URIs for a given set of named entities within an input text.
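To give an idea of what a string similarity measure contributes, here is a minimal sketch that scores a mention against candidate labels. This post does not say which measure AGDISTIS uses; Jaccard similarity over character trigrams is swapped in here as a common choice, and the function names are hypothetical.

```python
def trigrams(s):
    """Return the set of lowercased character trigrams of s (empty if len(s) < 3)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the trigram sets of a and b, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        # Both strings are too short for trigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(ta & tb) / len(union)


def best_match(mention, labels):
    """Return the label most similar to the mention."""
    return max(labels, key=lambda label: trigram_similarity(mention, label))
```

A measure like this tolerates small spelling variations: a misspelled mention such as "Univercity of Leipzig" still shares most of its trigrams with the label "University of Leipzig" and none with "Barack Obama".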

The latest version of AGDISTIS includes a novel algorithm called MAG₃, which adds normalization patterns for dealing with abbreviations, user typos and unseen surface forms of a given entity. These patterns are based on multilingual methods, which makes AGDISTIS capable of working on multiple languages. Additionally, MAG performs an extra search that takes the document's context information into account. This use of context improves the quality of MAG's results by favoring entities that are related to each other in the given KB. Moreover, MAG can also disambiguate common nouns, such as medicine, as opposed to only named entities such as Aspirin. Finally, MAG relies on two graph-based algorithms, HITS and PageRank, for disambiguation.
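As a rough intuition for the graph-based step, the following is a textbook HITS iteration on a toy candidate graph. It is a sketch under stated assumptions, not MAG's implementation: the graph, the "ex:" identifiers and the rule of picking the candidate with the highest authority score are all assumptions made for illustration.

```python
def hits(graph, iterations=20):
    """Textbook HITS. graph maps each node to the list of nodes it links to.

    Returns (authority, hub) dictionaries with L2-normalized scores.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all nodes linking to a node.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub update: sum the authority scores of all nodes a node links to.
        new_hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub


def best_candidate(candidates, auth):
    """Pick the candidate with the highest authority score (assumed selection rule)."""
    return max(candidates, key=lambda c: auth.get(c, 0.0))


# Hypothetical candidate graph for a text mentioning "Dresden" and "Sachsen".
TOY_GRAPH = {
    "ex:Dresden": ["ex:Saxony"],
    "ex:Saxony": ["ex:Dresden", "ex:Leipzig"],
    "ex:Leipzig": ["ex:Saxony"],
    "ex:Dresden_Ohio": ["ex:Ohio"],
}
```

In this toy graph the German city is linked from the candidate for "Sachsen", so it receives a positive authority score, whereas the unrelated "ex:Dresden_Ohio" has no incoming links and scores zero. This mirrors the idea that candidates connected to the other entities in the text are preferred.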

How to run AGDISTIS?

The current version of AGDISTIS should be run by following the instructions on the wiki (github.com/AKSW/AGDISTIS/wiki). In the future, we will provide a Docker container and instructions for how to use it.

You can also query our multilingual endpoints from the command line via curl.

English:
curl --data-urlencode "text='The <entity>University of Leipzig</entity> in <entity>Barack Obama</entity>.'" -d type='agdistis' titan.informatik.uni-leipzig.de/AGDISTIS

German:

curl --data-urlencode "text='Die Stadt <entity>Dresden</entity> liegt in <entity>Sachsen</entity>.'" -d type='agdistis' 139.18.2.164/AGDISTIS_DE

Chinese:
curl --data-urlencode "text='The <entity>shanghai</entity> in <entity>北京市</entity>.'" -d type='agdistis' 139.18.2.164/AGDISTIS_ZH
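
The same requests can be sent from a script. Below is a minimal Python sketch that posts the two form fields used by the curl commands (text and type) using only the standard library. The function names are hypothetical, the "http://" scheme is assumed because curl defaults to it, and the response is returned as raw text since this post does not describe the response format.

```python
import urllib.parse
import urllib.request


def build_agdistis_request(text, endpoint, request_type="agdistis"):
    """Build a POST request carrying the same form fields as the curl commands."""
    body = urllib.parse.urlencode({"text": text, "type": request_type})
    return urllib.request.Request(endpoint, data=body.encode("utf-8"), method="POST")


def disambiguate(text, endpoint, request_type="agdistis", timeout=30):
    """Send the request and return the raw response body (format not assumed)."""
    request = build_agdistis_request(text, endpoint, request_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (requires the endpoint to be reachable). The curl commands wrap
# the text in literal single quotes, which this example mirrors:
#
#   text = "'Die Stadt <entity>Dresden</entity> liegt in <entity>Sachsen</entity>.'"
#   print(disambiguate(text, "http://139.18.2.164/AGDISTIS_DE"))
```

Keeping request construction separate from sending makes it possible to inspect the encoded body without contacting a server.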

With the MAG extension, it is possible to select one of the two disambiguation algorithms and to send and retrieve data in NIF format₄. In addition, MAG returns the score per candidate if requested. The commands below currently do not work with our web services; using them requires a local installation of AGDISTIS:

curl --data-urlencode "text='The <entity>University of Leipzig</entity> in <entity>Barack Obama</entity>.'" -d type='candidates' localhost/AGDISTIS

Or, if you want to send a larger text or a NIF file:

curl --data-urlencode "text@test.txt" -d type=agdistis localhost/AGDISTIS
curl --data-urlencode "text@nif.ttl" -d type=agdistis localhost/AGDISTIS

Please note that every entity to be disambiguated must be recognized beforehand by a NER tool such as FOX, since AGDISTIS is only an entity linking tool.
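
In practice this means the output of the NER step has to be turned into the <entity>...</entity> markup shown in the curl examples. The sketch below assumes the NER tool reports mentions as character offsets (start, end), which is a hypothetical format and not necessarily what FOX returns; the function name is also made up.

```python
def mark_entities(text, spans):
    """Wrap each (start, end) character span of text in <entity> tags.

    Spans use end-exclusive offsets and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or start >= end or end > len(text):
            raise ValueError(f"invalid or overlapping span: ({start}, {end})")
        pieces.append(text[cursor:start])  # text before the mention
        pieces.append("<entity>" + text[start:end] + "</entity>")
        cursor = end
    pieces.append(text[cursor:])  # text after the last mention
    return "".join(pieces)
```

For the sentence "Die Stadt Dresden liegt in Sachsen." with spans covering "Dresden" and "Sachsen", the result is the annotated text used in the German example above.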

Footnotes

₁www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day/
₂dl.acm.org/citation.cfm
₃arxiv.org/abs/1707.05288
₄persistence.uni-leipzig.org/nlp2rdf/