Cross-lingual semantic annotation links linguistic resources in one language to resources in knowledge bases in any other language, or to language-independent representations. XLike later uses this semantic representation for document mining, for example to enable cross-lingual services for publishers, to support media monitoring, or to develop new business intelligence applications.
The goal is to map word phrases in different languages into the same semantic interlingua, which consists of resources specified in knowledge bases such as Wikipedia and Linked Open Data (LOD) sources. Cross-lingual semantic annotation is performed in two stages: (1) candidate concepts in the knowledge base are linked to the linguistic resources using a newly developed cross-lingual linked data lexicon, called xLiD-Lexica; (2) the candidate concepts are then disambiguated with the personalized PageRank algorithm, which exploits the structure of the information contained in the knowledge base.
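The two stages can be illustrated with a minimal sketch in Python. It is not the XLike implementation: the toy lexicon, the toy link graph, the `dbr:` identifiers, the choice of all candidates as the teleport (seed) set, and the function names `personalized_pagerank` and `disambiguate` are assumptions made for illustration only.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank whose teleport step jumps only to `seeds`."""
    seeds = set(seeds)
    nodes = set(graph) | seeds | {t for targets in graph.values() for t in targets}
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = graph.get(n, [])
            if targets:
                # Spread this node's rank evenly over its outgoing links.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its rank back to the seed set.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank


def disambiguate(mentions, lexicon, graph):
    """Stage 1: look up candidates per mention. Stage 2: keep the best-ranked one."""
    candidates = {m: lexicon.get(m, []) for m in mentions}
    seeds = {c for cands in candidates.values() for c in cands}
    rank = personalized_pagerank(graph, seeds)
    return {m: max(cands, key=lambda c: rank[c])
            for m, cands in candidates.items() if cands}


# Toy data (hypothetical): German surface forms mapped to candidate resources.
lexicon = {
    "Paris": ["dbr:Paris", "dbr:Paris_Hilton"],
    "Frankreich": ["dbr:France"],
}
# Toy link structure standing in for the knowledge base.
graph = {
    "dbr:Paris": ["dbr:France"],
    "dbr:France": ["dbr:Paris"],
    "dbr:Paris_Hilton": ["dbr:Hollywood"],
    "dbr:Hollywood": ["dbr:Paris_Hilton"],
}

result = disambiguate(["Paris", "Frankreich"], lexicon, graph)
```

In this toy graph the candidates for "Paris" and "Frankreich" link to each other, so `dbr:Paris` accumulates more rank than `dbr:Paris_Hilton` and is selected. A real system would run over a far larger graph and would likely weight seeds and edges differently.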
The xLiD-Lexica is stored in RDF format and contains about 300 million triples of cross-lingual groundings. It is extracted from Wikipedia dumps of July 2013 in English, German, Spanish, Catalan, Slovenian and Chinese, and based on the canonicalized datasets of DBpedia 3.8. More details can be found in .
The xLiD-Lexica SPARQL Endpoint and cross-lingual semantic annotation services are described as follows:
- xLiD-Lexica: The cross-lingual groundings in xLiD-Lexica are translated into RDF data and are accessible through a SPARQL endpoint, which uses OpenLink Virtuoso as the back-end database engine.
- Semantic Annotation: The cross-lingual semantic annotation service uses the xLiD-Lexica for entity mention recognition and the Java Universal Network/Graph Framework (JUNG) for graph-based disambiguation. An example of the service for annotating the XLike website using DBpedia in German is accessible under the URL .
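As a rough illustration of how a client might look up candidate resources for a surface form through a SPARQL endpoint, the following Python sketch builds a standard SPARQL Protocol GET request. The endpoint URL `http://example.org/sparql`, the `ex:` namespace, and the predicates `ex:surfaceForm` and `ex:resource` are hypothetical placeholders; the actual xLiD-Lexica endpoint address and vocabulary are not given here and may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder, not the real xLiD-Lexica endpoint.
ENDPOINT = "http://example.org/sparql"


def build_lookup_request(surface_form, lang, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL query request for one surface form."""
    if not lang.isalpha():
        raise ValueError("language tag must be alphabetic, e.g. 'de'")
    # Escape backslashes and quotes so the literal stays well-formed.
    escaped = surface_form.replace("\\", "\\\\").replace('"', '\\"')
    # The ex: vocabulary below is an assumed schema for illustration only.
    query = (
        "PREFIX ex: <http://example.org/xlid#>\n"
        "SELECT ?resource WHERE {\n"
        f'  ?grounding ex:surfaceForm "{escaped}"@{lang} ;\n'
        "             ex:resource ?resource .\n"
        "} LIMIT 10"
    )
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_lookup_request("Paris", "de")
```

Passing the returned object to `urllib.request.urlopen` would send the query. That step is omitted because the endpoint above is only a placeholder.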