
Cross-lingual Document Linking

Jan 21, 2013 | Published under Project News, XLike Technology

Measuring similarity between documents written in different languages is useful for several tasks, for example when building a cross-lingual content-based recommendation system, or when tracking how a news story spreads across languages.

A cross-lingual similarity function and a common, language-independent representation allow us to reduce cross-lingual text mining problems (CL-classification, CL-information retrieval, CL-clustering) to problems that standard machine learning techniques can solve.
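For illustration, here is a minimal Python sketch under the assumption that documents from both languages have already been mapped into a shared topic space (random toy vectors stand in for real projections): cross-lingual retrieval then becomes ordinary nearest-neighbour search with cosine similarity.

    import numpy as np

    def cosine_similarity(A, B):
        """Pairwise cosine similarity between rows of A (n x k) and B (m x k)."""
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return A @ B.T

    # Toy data: pretend 3 English and 4 Spanish documents have already been
    # mapped into a shared 100-dimensional topic space (random here).
    rng = np.random.default_rng(0)
    english_docs = rng.normal(size=(3, 100))
    spanish_docs = rng.normal(size=(4, 100))

    # Cross-lingual retrieval is now plain nearest-neighbour search:
    # for each English document, the index of the most similar Spanish one.
    sims = cosine_similarity(english_docs, spanish_docs)
    print(sims.argmax(axis=1))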

Below we illustrate how to construct the language-independent document representations and the cross-lingual similarity function from a multilingual document collection (training data).

[Figure: CCA]

The current technology is based on the LSI (latent semantic indexing) and CCA (canonical correlation analysis) approaches described in:

  • (LSI) Primoz Skraba, Jan Rupnik, and Andrej Muhic. Low-rank approximations for large, multi-lingual data. Low Rank Approximation and Sparse Representation, NIPS 2011 Workshop, 2011. [link]
  • (CCA) Cross-lingual document retrieval through hub languages. In: 2012 Workshop book, NIPS 2012, Neural Information Processing Systems Workshop, December 7-8, 2012, Lake Tahoe, Nevada, US. Neural Information Processing Systems Foundation, 2012, 5 pp. [link]

Both approaches use Wikipedia alignment information to produce compressed, aligned topics, which enables mapping documents into a language-independent space. In the LSI case, data compression and multilingual topic computation are done with SVD (singular value decomposition), which reduces noise and the cost of similarity computation. In the CCA case, we first compress the covariance matrices using SVD and then refine the topics using a generalized version of CCA.
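The LSI variant can be sketched roughly as follows, assuming a small toy set of aligned English/Spanish articles, TF-IDF features, and a truncated SVD; the corpus, variable names, and library choices are illustrative only, not the project's actual implementation.

    import numpy as np
    from scipy.sparse import vstack
    from scipy.sparse.linalg import svds
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy aligned corpus: aligned_en[i] and aligned_es[i] describe the same concept.
    aligned_en = [
        "dog domestic animal loyal pet",
        "computer machine that processes data",
        "music art of organized sound",
    ]
    aligned_es = [
        "perro animal domestico mascota leal",
        "computadora maquina que procesa datos",
        "musica arte del sonido organizado",
    ]

    # Per-language term-document matrices (documents as columns).
    vec_en = TfidfVectorizer().fit(aligned_en)
    vec_es = TfidfVectorizer().fit(aligned_es)
    X_en = vec_en.transform(aligned_en).T          # terms_en x docs
    X_es = vec_es.transform(aligned_es).T          # terms_es x docs

    # Stack the aligned matrices so that column j describes the same concept in
    # both languages, then take a truncated SVD to get k multilingual "topics".
    k = 2                                          # toy dimensionality
    X = vstack([X_en, X_es]).tocsc()
    U, S, Vt = svds(X, k=k)

    # The per-language blocks of U project monolingual documents into the
    # shared, language-independent space.
    U_en, U_es = U[:X_en.shape[0], :], U[X_en.shape[0]:, :]

    def project(tfidf_doc, U_lang):
        """Map a monolingual TF-IDF row vector into the common topic space."""
        return tfidf_doc @ U_lang

    # Cross-lingual similarity between an English query and a Spanish document.
    q = project(vec_en.transform(["dog pet"]), U_en)
    d = project(vec_es.transform(["perro mascota"]), U_es)
    sim = (q @ d.T) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
    print(sim[0, 0])

The CCA-based variant follows the same overall pattern, but instead of a single SVD on the stacked matrix it compresses the per-language covariance matrices with SVD and then refines the topics with a generalized CCA, as described in the referenced workshop paper.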


Web demo of cross-lingual similarity search