Der DocBlog

Promotion in eigener Sache

Graph-Based, Supervised Machine Learning Approach to (Irregular) Polysemy in WordNet

This paper presents a supervised machine learning approach that aims at annotating those homograph word forms in WordNet that share some common meaning and can hence be thought of as belonging to a polysemous word. Continue reading

(German) Language Processing for Lucene

This paper introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although it was originally developed to work with German texts, it is to a large degree language independent. Continue reading

Encfs „GUI“ for Linux


The encfsApp icon on the Desktop

Having previously created a little Automator script to facilitate mounting encfs volumes on the Mac (see here), I did the same for my Ubuntu machine. It works very similarly: you drag and drop the encfs-encrypted folder onto the icon, the script asks for the password and mounts the encfs volume. Dropping the folder onto the icon a second time unmounts it.
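
The mount/unmount toggle described above can be sketched roughly as follows. This is only an illustration in Python, not the script behind the icon: the function name `toggle_command` is made up, the password dialog is left out, and it assumes the standard `encfs` and `fusermount` command-line tools are installed.

```python
import os


def toggle_command(encrypted_dir, mount_point, is_mounted):
    """Return the command that flips the mount state (hypothetical helper).

    Mounted volumes are released with `fusermount -u`; otherwise encfs is
    called with absolute paths for the encrypted folder and the mount point.
    """
    if is_mounted:
        return ["fusermount", "-u", mount_point]
    return ["encfs", os.path.abspath(encrypted_dir), os.path.abspath(mount_point)]


if __name__ == "__main__":
    # Example paths are placeholders; nothing is executed here.
    print(toggle_command("/home/me/.secret", "/home/me/secret", False))
    print(toggle_command("/home/me/.secret", "/home/me/secret", True))
```

To actually run it, the returned list could be passed to `subprocess.run(..., check=True)`, with `os.path.ismount(mount_point)` deciding which branch applies.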

Continue reading

MacOS: BoxCryptor alternative encfs and

I used BoxCryptor Classic (the free edition) for well over a year. While it usually does what it is designed for (encrypting a folder and mounting it by double-clicking), the free version comes with some restrictions. For one, it does not encrypt the file names. Furthermore, you can only mount one device at a time; actually, you can only manage one device at a time, which is even more annoying. To get rid of these restrictions, you can pay € 34.99 for the private full version.

Since I’m constantly switching between a Mac and Linux PCs, I was using encfs on the Linux machines. Encfs supports more than one mounted volume as well as file-name encryption. Encfs is compatible with BoxCryptor (well, actually it’s the other way around), but it is not as comfortable to use. Usually you have to mount the folder from the command line (although there are some tools that try to facilitate this). Continue reading

Language or Hoax: Some statistical features of the Voynich manuscript

It has been claimed that the Voynich manuscript is an encrypted text, an encrypted (natural?) human language. Others take it for a fraud, an educated hoax.

Recently I have been wondering about the statistical features of the Voynich manuscript, and about statistical features of human language texts that distinguish them from gibberish. In the following I will look at the word distribution and the entropy of the Voynich manuscript compared to Latin, modern German, and English.
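
To make the two measures concrete, here is a minimal sketch of how a word distribution and its entropy could be computed. It assumes word-level (unigram) Shannon entropy over whitespace-separated tokens, which may differ from the exact setup used in the post; the sample sentence and function names are purely illustrative.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Count lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def shannon_entropy(counts):
    """Shannon entropy in bits per token of a frequency table."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Illustrative sample only; a real comparison would load whole corpora.
    sample = "the cat saw the dog and the dog saw the cat"
    freqs = word_frequencies(sample)
    for word, n in freqs.most_common(3):
        print(word, n)
    print(round(shannon_entropy(freqs), 3))
```

Running the same two functions over the Voynich transcription and over Latin, German, and English texts would give comparable frequency tables and entropy values.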

Continue reading

German Language Processing 4 Lucene: Implementation

In the following, the implementation of German Language Processing 4 Lucene (glp4lucene) is described. An example of how to use the package can be found under the link above, and is described in more detail HERE.
Continue reading

POS filtering for Lucene 4.6

If you are working with Lucene, you will probably use a stop word list. Most likely, you will not search for pronouns (he/she/it), for example. Instead of using a predefined stop word list, you can also look at the word categories, i.e. the parts of speech (POS), of the words and exclude from the index those words you are not interested in.
The approach described here is language independent. All you need is the right model for the language you are working with; you pass the path to that model to the software. Then you create a list of those POS you don’t need. Models can be found here.
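
The filtering idea itself can be sketched independently of Lucene. The following Python snippet is only an illustration: the tokens are tagged by hand with Penn Treebank tags, whereas a real setup would obtain the tags from a tagger loaded with the language model, and `filter_by_pos` is a made-up name.

```python
def filter_by_pos(tagged_tokens, unwanted_tags):
    """Keep only the tokens whose POS tag is not in unwanted_tags."""
    unwanted = set(unwanted_tags)
    return [token for token, tag in tagged_tokens if tag not in unwanted]


if __name__ == "__main__":
    # Hand-tagged sample; PRP = personal pronoun, DT = determiner.
    tagged = [("She", "PRP"), ("reads", "VBZ"), ("the", "DT"), ("paper", "NN")]
    print(filter_by_pos(tagged, {"PRP", "DT"}))  # ['reads', 'paper']
```

Only the list of unwanted tags has to change when switching languages, provided the tagger model uses a matching tag set.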
Continue reading

German Lemmatization in Lucene 4.6.0

Synthetic languages, such as German, are marked by the use of morphemes, rather than prepositions for example, to mark the function or the relations of a word in a sentence. Either stemming or lemmatization can be used to overcome problems in information retrieval arising from declension (of nouns, adjectives, pronouns: for example die schönen Häuser) and conjugation (of verbs: sprechen (speak): Ich spreche (I speak), du sprichst (you speak), er spricht (he speaks)).
Stemming shortens a word down to its stem. For the three forms of sprechen above this leads to the common stem sprech. Lemmatization leads for all three forms to the common lemma sprechen. Lemmatization has two advantages over stemming: Continue reading

Using GermaNet to insert synonyms into Lucene 4.6.0 index

The goal is to add all possible synonyms during indexing (in contrast to finding synonyms during search) to avoid multiple computations of the possible synonyms.

This tutorial is based on the approach described in „Lucene in Action“ (source code can be found here). It was altered in a few points:
it supports Lucene 4.6.0
it looks up synonyms from GermaNet
it stores synonyms in the index
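
As a rough, Lucene-free illustration of index-time expansion: each synonym is emitted at the same token position as the original term, so the lookup happens once per document instead of once per query. The tiny dictionary below is a hypothetical stand-in for GermaNet, and `expand_with_synonyms` is an illustrative name, not part of any library.

```python
def expand_with_synonyms(tokens, synonyms):
    """Return (position, term) pairs; synonyms share the original position."""
    postings = []
    for position, token in enumerate(tokens):
        postings.append((position, token))
        for synonym in synonyms.get(token, ()):
            postings.append((position, synonym))
    return postings


if __name__ == "__main__":
    # Hypothetical synonym table standing in for a GermaNet lookup.
    synonyms = {"haus": ["gebäude"], "auto": ["wagen", "pkw"]}
    for position, term in expand_with_synonyms(["das", "auto", "am", "haus"], synonyms):
        print(position, term)
```

The trade-off is a larger index in exchange for cheaper queries, since no synonym lookup is needed at search time.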

Continue reading

Server Messages: Send mail when job has finished

If you ever used a commercial cloud computing service, you might know their nice monitoring services, which inform you when jobs are done.

If you run long-running programs, scripts, or services on your own machine, you don’t want to have to look at them every few hours, days, or weeks to see if they’re done. I set up my server to send me an e-mail when certain jobs have finished. Continue reading
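
One way to get such a notification is a small wrapper that runs the job, waits for it, and mails the result. The sketch below uses only the Python standard library and is not necessarily how the server in this post is set up; the addresses, the SMTP host, and the function names are placeholders, and a mail server is assumed to be listening on localhost.

```python
import smtplib
import subprocess
import sys
from email.message import EmailMessage


def run_job(command):
    """Run the job, wait for it to end, and capture its output."""
    return subprocess.run(command, capture_output=True, text=True)


def build_report(command, result, sender, recipient):
    """Build a mail summarising the exit status and the tail of the output."""
    if result.returncode == 0:
        status = "finished"
    else:
        status = f"failed (exit code {result.returncode})"
    msg = EmailMessage()
    msg["Subject"] = f"Job {status}: {' '.join(command)}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(result.stdout[-2000:] or "(no output)")
    return msg


def send_report(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (assumed to be reachable)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Demo job: a one-line Python program; the report is built but not sent.
    command = [sys.executable, "-c", "print('done')"]
    report = build_report(command, run_job(command), "me@example.org", "me@example.org")
    print(report["Subject"])
```

Calling `send_report(report)` after building the message would deliver it, provided the host accepts mail from the machine.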


© 2016 Der DocBlog
