{"id":453,"date":"2015-01-05T12:48:25","date_gmt":"2015-01-05T11:48:25","guid":{"rendered":"http:\/\/xlike.ijs.si\/?p=453"},"modified":"2019-04-02T09:06:08","modified_gmt":"2019-04-02T08:06:08","slug":"language-processing-pipeline","status":"publish","type":"post","link":"http:\/\/xlike.ijs.si\/language-processing-pipeline\/","title":{"rendered":"Language Processing Pipeline"},"content":{"rendered":"

XLike requires the linguistic processing of large numbers of documents in a variety of languages.

DEMO

Thus, WP2 is devoted to building the language analysis pipelines that extract from texts the core knowledge on which the project is built.

The different language functionalities are implemented following the service-oriented architecture (SOA) approach defined in the XLike project and presented in Figure 1.

\"xlike-nlp\"<\/a>Figure 1: Xlike Language Processing Architecture.<\/em><\/p>\n

All the pipelines (one per language) have therefore been implemented as web services and may be requested to produce different levels of analysis (e.g., tokenization, lemmatization, NERC, parsing, relation extraction). This approach is appealing because it treats every language independently and allows the whole language analysis process to be executed in different threads or on different computers, making parallelization easier (e.g., using external high-performance platforms such as Amazon Elastic Compute Cloud (EC2) as needed). Furthermore, it provides independent development life-cycles for each language, which is crucial in this type of research project. Note that these web services can be deployed locally or remotely, maintaining the option of using them in a stand-alone configuration.
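To make the interface concrete, here is a minimal client sketch. The endpoint URL, the level parameter, and its values are hypothetical stand-ins; the actual XLike services expose their own interface.

```python
# Minimal client sketch for one of the per-language analysis services.
# The endpoint URL and the "level" parameter are hypothetical; the real
# XLike services define their own interface and analysis levels.
import requests

def analyze(text, lang="en", level="shallow"):
    """Request an analysis ("shallow" = up to NERC, "deep" = adds
    parsing, SRL, and WSD) from the service for the given language."""
    resp = requests.post(
        f"http://example.org/xlike/{lang}/analyze",  # hypothetical endpoint
        data={"text": text, "level": level},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text  # the services return the analysis as XML

print(analyze("Acme, based in New York, now plans to "
              "make computer and electronic products."))
```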

Figure 1 also shows, with large boxes, the technology used to implement each module. White square modules indicate functionalities that run locally inside a web service and cannot be accessed directly; shaded round modules indicate private web services that can be called remotely to access the specified functionality.

Each language analysis service is able to process thousands of words per second when performing shallow analysis (up to NE recognition), and hundreds of words per second when producing the semantic representation based on full analysis.

For instance, the average speed for analyzing an English document with shallow analysis (tokenizer, splitter, morphological analyzer, PoS tagger, lemmatization, and NE detection and classification) is about 1,300 tokens/sec on an i7 3.4 GHz processor (including communication overhead, XML parsing, etc.). This means that an average document (e.g., a news item of around 400 tokens) is analyzed in about 0.3 seconds.

When using deep analysis (i.e., adding WSD, dependency parsing, and SRL to the previous steps), the speed drops to about 70 tokens/sec, so an average document takes about 5.8 seconds to analyze. The parsing and SRL models are still at a prototype stage, and we expect to greatly reduce the difference between shallow and deep analysis times.
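These per-document times follow directly from document length divided by throughput, as a quick back-of-the-envelope check shows:

```python
# Per-document analysis time = document length / throughput.
doc_tokens = 400        # average news item
shallow_rate = 1300     # tokens/sec with shallow analysis
deep_rate = 70          # tokens/sec with deep analysis

print(f"shallow: {doc_tokens / shallow_rate:.1f} s")  # ~0.3 s
print(f"deep:    {doc_tokens / deep_rate:.1f} s")     # ~5.7 s
```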

However, it is worth noting that the web-service architecture enables the same server to run a separate thread for each client without using much extra memory. This exploitation of multiprocessor capabilities allows as many parallel request streams as there are available cores, yielding a much higher effective speed when large collections must be processed.
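From the client side, exploiting this is straightforward; a sketch reusing the hypothetical analyze function above, with one request stream per core:

```python
# One request stream per available core; the server handles each client
# in its own thread, so the streams are processed in parallel.
import os
from concurrent.futures import ThreadPoolExecutor

def analyze_collection(documents):
    workers = os.cpu_count() or 4
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))
```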

Semantic Representation

Apart from the basic state-of-the-art tokenizers, lemmatizers, PoS/MSD taggers, and NE recognizers, each pipeline requires deeper processors able to build the target language-independent semantic representation. For that, we rely on three steps: dependency parsing, semantic role labeling, and word sense disambiguation. These three processes, combined with multilingual ontological resources such as different WordNets, are the key to the construction of our semantic representation.

Dependency Parsing

In XLike, we use so-called graph-based methods for dependency parsing. In particular, we use MSTParser for Chinese and Croatian, and Treeler (a library developed by the UPC team that implements several methods for dependency parsing, among other statistical methods for tagging and parsing) for the other languages.
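The gist of the graph-based approach fits in a few lines: score every possible head-modifier arc, then take the maximum spanning arborescence of the resulting digraph. The sketch below uses toy hand-set scores and networkx in place of a trained model; it illustrates the decoding idea, not MSTParser or Treeler themselves.

```python
# Graph-based dependency parsing in miniature: arcs get model scores,
# and the parse is the maximum spanning arborescence of the digraph.
import networkx as nx

words = ["ROOT", "Acme", "plans", "to", "make", "products"]
# Toy arc scores (head, modifier) -> score; a real parser learns these.
scores = {(0, 2): 10, (2, 1): 9, (2, 3): 8, (3, 4): 9, (4, 5): 9}

G = nx.DiGraph()
for h in range(len(words)):
    for m in range(1, len(words)):      # ROOT never gets a head
        if h != m:
            G.add_edge(h, m, weight=scores.get((h, m), 1))

tree = nx.maximum_spanning_arborescence(G)
for h, m in sorted(tree.edges(), key=lambda e: e[1]):
    print(f"{words[m]:8s} <- {words[h]}")
```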

Semantic Role Labeling

As with syntactic parsing, we use the Treeler library to develop machine-learning-based SRL methods. To train models for this task, we use the treebanks made available by the CoNLL-2009 shared task, which provided data annotated with predicate-argument relations for English, Spanish, Catalan, German, and Chinese. No treebank annotated with semantic roles exists yet for Slovene or Croatian; thus, no SRL module is available for these languages in the XLike pipelines.

Word Sense Disambiguation

The word sense disambiguation engine is the UKB implementation provided by FreeLing. UKB is an unsupervised algorithm based on PageRank over a semantic graph such as WordNet. Word sense disambiguation is performed for all languages for which a WordNet is publicly available; this includes all languages in the project except Chinese.
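The idea behind UKB can be illustrated with a personalized PageRank over a toy graph. The synset identifiers and edges below are made up for the example; they do not come from the real WordNet.

```python
# UKB in miniature: personalized PageRank over a semantic graph, seeded
# at the context, then pick the highest-ranked candidate synset.
import networkx as nx

# Toy "WordNet": nodes are synsets, edges are semantic relations.
G = nx.Graph([
    ("plan.v.01", "make.v.01"),
    ("make.v.01", "product.n.02"),
    ("product.n.02", "merchandise.n.01"),
    ("plan.n.01", "drawing.n.02"),
])

candidates = ["plan.v.01", "plan.n.01"]            # senses of "plan"
context = {"make.v.01": 1.0, "product.n.02": 1.0}  # seeds from the sentence

rank = nx.pagerank(G, personalization=context)
print(max(candidates, key=lambda s: rank.get(s, 0.0)))  # -> plan.v.01
```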

The goal of WSD is to map words in each specific language to a common semantic space, in this case WordNet synsets. Thanks to existing connections between WordNet and other resources, SUMO and OpenCYC sense codes are also output when available. Finally, we use PredicateMatrix (a lexical-semantics resource combining WordNet, FrameNet, PropBank, and VerbNet) to project the obtained concepts onto PropBank predicates and FrameNet diathesis structures, normalizing the semantic roles produced by the SRL (which are treebank-dependent and thus not the same for all languages).
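Conceptually, this last projection is a table lookup. The sketch below uses a toy two-row table; the FrameNet frame names are illustrative placeholders, not entries copied from the real PredicateMatrix.

```python
# PredicateMatrix-style projection as a lookup: WordNet sense ->
# (PropBank roleset, FrameNet frame). Rows here are illustrative only.
PREDICATE_MATRIX = {
    "00704690-v": ("plan.01", "Purpose"),        # plan
    "01617192-v": ("make.01", "Manufacturing"),  # make
}

def normalize_predicate(synset):
    """Return (PropBank, FrameNet) labels for a disambiguated sense."""
    return PREDICATE_MATRIX.get(synset, (None, None))

print(normalize_predicate("00704690-v"))  # ('plan.01', 'Purpose')
```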

Frame Extraction

The final step is to convert all the gathered linguistic information into a semantic representation. Our method is based on the notion of frame: a semantic frame is a schematic representation of a situation involving various participants. In a frame, each participant plays a role. There is a direct correspondence between roles in a frame and semantic roles; namely, frames correspond to predicates, and participants correspond to the arguments of the predicate. We distinguish three types of participants: entities, words, and frames.
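A minimal data structure capturing this description (with hypothetical names, not the project's actual classes) might look as follows:

```python
# A frame is a predicate with role-labelled participants; a participant
# is an entity, a plain word, or another (nested) frame.
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Entity:
    text: str
    ne_class: str                  # e.g. PER, LOC, ORG

@dataclass
class Word:
    lemma: str
    synset: str = "_"              # WordNet sense, when disambiguated

@dataclass
class Frame:
    predicate: str                                   # e.g. "plan.01"
    roles: Dict[str, "Participant"] = field(default_factory=dict)

Participant = Union[Entity, Word, Frame]

# The sentence of Figure 2, with the make.01 frame nested inside plan.01:
acme = Entity("Acme", "PER")
make = Frame("make.01", {"A0": acme, "A1": Word("product", "04007894-n")})
plan = Frame("plan.01", {"A0": acme, "AM-TMP": Word("now"), "A1": make})
```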


```
 1  Acme        acme        NP   B-PER  8   SBJ    _           _        A1      A0      A0
 2  ,           ,           Fc   O      1   P      _           _        _       _       _
 3  based       base        VBN  O      1   APPO   00636888-v  base.01  _       _       _
 4  in          in          IN   O      3   LOC    _           _        AM-LOC  _       _
 5  New_York    new_york    NP   B-LOC  4   PMOD   09119277-n  _        _       _       _
 6  ,           ,           Fc   O      1   P      _           _        _       _       _
 7  now         now         RB   O      8   TMP    09119277-n  _        _       AM-TMP  _
 8  plans       plan        VBZ  O      0   ROOT   00704690-v  plan.01  _       _       _
 9  to          to          TO   O      8   OPRD   _           _        _       A1      _
10  make        make        VB   O      9   IM     01617192-v  make.01  _       _       _
11  computer    computer    NN   O      10  OBJ    03082979-n  _        _       _       A1
12  and         and         CC   O      11  COORD  _           _        _       _       _
13  electronic  electronic  JJ   O      14  NMOD   02718497-a  _        _       _       _
14  products    product     NNS  O      12  CONJ   04007894-n  _        _       _       _
15  .           .           Fp   O      8   P      _           _        _       _       _
```

Figure 2: Output of the analyzers for the sentence "Acme, based in New York, now plans to make computer and electronic products."
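A small helper, assuming exactly the column layout shown in Figure 2 (id, form, lemma, PoS, NE, head, deprel, synset, predicate, then one argument column per predicate), can recover the predicate-argument structure. Note that this sketch keeps only the single token filling each role, not the full argument span.

```python
# Pair each predicate in a Figure 2-style analysis with its arguments.
# Columns: id form lemma pos ne head deprel synset pred apred_1..apred_k,
# where the k argument columns follow the predicates in sentence order.
def extract_frames(lines):
    rows = [ln.split() for ln in lines]
    preds = [r[8] for r in rows if r[8] != "_"]
    return {
        pred: {r[9 + k]: r[1] for r in rows if r[9 + k] != "_"}
        for k, pred in enumerate(preds)
    }

rows = ["1 Acme acme NP B-PER 8 SBJ _ _ A1 A0 A0",
        "3 based base VBN O 1 APPO 00636888-v base.01 _ _ _",
        "8 plans plan VBZ O 0 ROOT 00704690-v plan.01 _ _ _"]
print(extract_frames(rows))
# {'base.01': {'A1': 'Acme'}, 'plan.01': {'A0': 'Acme'}}
```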

For example, in the sentence in Figure 2, we can find three frames: a base.01 frame with Acme as A1 and New York as AM-LOC; a plan.01 frame with Acme as A0, now as AM-TMP, and the make.01 frame as A1; and a make.01 frame with Acme as A0 and the coordinated phrase computer and electronic products as A1.