Der DocBlog

Promoting my own doctorate

Category: Lucene

(German) Language Processing for Lucene

This paper introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although it was originally developed for German texts, it is largely language independent. Continue reading

POS filtering for Lucene 4.6

If you are working with Lucene, you will probably use a stop word list. Most likely you will not search for pronouns (he/she/it), for example. Instead of using a predefined stop word list, you can also look at the word categories, i.e. the parts of speech (POS), of the words and exclude from the index those words you are not interested in.
The approach described here is language independent. All you need is the right model for the language you are working with; you pass the path to that model to the software. Then you create a list of the POS you don't need. Models can be found here.
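
The filtering idea above can be sketched in plain Java, without any Lucene dependency. This is only an illustration under assumptions: the class names `TaggedToken` and `PosStopFilter` are hypothetical, the tags in the demo are hand-assigned STTS-style tags (PPER for personal pronouns, ART for articles) rather than the output of a real tagger model, and the post's actual implementation may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** A token paired with the part-of-speech tag a tagger assigned to it (hypothetical type). */
final class TaggedToken {
    final String text;
    final String pos;

    TaggedToken(String text, String pos) {
        this.text = text;
        this.pos = pos;
    }
}

/** Drops tokens by POS tag instead of by a fixed stop word list (hypothetical sketch). */
final class PosStopFilter {
    // The POS tags that should not reach the index.
    private final Set<String> excludedTags;

    PosStopFilter(String... excludedTags) {
        this.excludedTags = new HashSet<>(Arrays.asList(excludedTags));
    }

    /** Returns the texts of all tokens whose tag is not excluded, in their original order. */
    List<String> filter(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (!excludedTags.contains(token.pos)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumption: tags are written by hand here; a real setup would get them from a tagger model.
        List<TaggedToken> sentence = Arrays.asList(
                new TaggedToken("Er", "PPER"),
                new TaggedToken("kauft", "VVFIN"),
                new TaggedToken("das", "ART"),
                new TaggedToken("Haus", "NN"));
        PosStopFilter filter = new PosStopFilter("PPER", "ART");
        System.out.println(filter.filter(sentence)); // prints [kauft, Haus]
    }
}
```

Because the exclusion list holds tags rather than words, switching languages only means swapping the tagger model and the tag names, which is the language independence the post refers to.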
Continue reading

German Lemmatization in Lucene 4.6.0

Synthetic languages, such as German, use morphemes rather than, for example, prepositions to mark the function of a word or its relations within a sentence. Either stemming or lemmatization can be used to overcome the information-retrieval problems that arise from declension (of nouns, adjectives and pronouns, for example die schönen Häuser (the beautiful houses)) and conjugation (of verbs, for example sprechen (to speak): ich spreche (I speak), du sprichst (you speak), er spricht (he speaks)).
Stemming shortens a word to its stem. For the three forms of sprechen above, this leads to the common stem sprech. Lemmatization maps all three forms to the common lemma sprechen. Lemmatization has two advantages over stemming: Continue reading
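
The mapping of inflected forms to a common lemma can be sketched as a simple lookup table in plain Java. This is a minimal illustration, not the method used in the post: the class `LemmaLookup` is hypothetical, and the table is filled by hand, whereas a real lemmatizer would draw on a lexicon or a trained model.

```java
import java.util.HashMap;
import java.util.Map;

/** Maps inflected word forms to their lemma via a hand-filled table (hypothetical sketch). */
final class LemmaLookup {
    private final Map<String, String> lemmaByForm = new HashMap<>();

    /** Registers a lemma together with the inflected forms that should map to it. */
    void add(String lemma, String... inflectedForms) {
        lemmaByForm.put(lemma, lemma);
        for (String form : inflectedForms) {
            lemmaByForm.put(form, lemma);
        }
    }

    /** Returns the lemma for a known form, or the token unchanged if it is unknown. */
    String lemmatize(String token) {
        return lemmaByForm.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        LemmaLookup lookup = new LemmaLookup();
        lookup.add("sprechen", "spreche", "sprichst", "spricht");
        for (String form : new String[] {"spreche", "sprichst", "spricht"}) {
            System.out.println(form + " -> " + lookup.lemmatize(form));
        }
    }
}
```

Indexing the lemma instead of the surface form means that a query for any of the three verb forms can match documents containing any of the others.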

Using GermaNet to insert synonyms into Lucene 4.6.0 index

The goal is to add all possible synonyms during indexing (rather than finding them at search time), so that the synonyms of a term are not computed repeatedly.

This tutorial is based on the approach described in "Lucene in Action" (source code can be found here). It was altered in a few points:
it supports Lucene 4.6.0
it looks up synonyms from GermaNet
it stores synonyms in the index
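
The index-time expansion described above can be sketched in plain Java, without Lucene or GermaNet on the classpath. Several things here are assumptions: `IndexedTerm` and `SynonymExpander` are hypothetical names, a hand-filled map stands in for the GermaNet lookup, and synonyms are given a position increment of 0 so that they share the position of the original term, which is the convention assumed for this kind of synonym filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A term as it would be written to the index, with its position increment (hypothetical type). */
final class IndexedTerm {
    final String term;
    final int positionIncrement;

    IndexedTerm(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
        return term + "/" + positionIncrement;
    }
}

/** Expands each token with its synonyms once, at indexing time (hypothetical sketch). */
final class SynonymExpander {
    // Assumption: a plain map stands in for the real synonym source.
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    List<IndexedTerm> expand(List<String> tokens) {
        List<IndexedTerm> out = new ArrayList<>();
        for (String token : tokens) {
            // The original token advances the position by one.
            out.add(new IndexedTerm(token, 1));
            // Each synonym is stacked on the same position (increment 0).
            for (String synonym : synonyms.getOrDefault(token, Collections.<String>emptyList())) {
                out.add(new IndexedTerm(synonym, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("Auto", Arrays.asList("Wagen"));
        SynonymExpander expander = new SynonymExpander(synonyms);
        System.out.println(expander.expand(Arrays.asList("Auto", "kaufen")));
        // prints [Auto/1, Wagen/0, kaufen/1]
    }
}
```

Stacking synonyms on the original position keeps phrase and proximity queries meaningful, and the lookup cost is paid once per document at indexing time instead of at every search.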

Continue reading

© 2017 Der DocBlog
