After three years of writing and more than a year of reviewing, defending, and editing, my dissertation "On Link Predictions on Complex Networks with an Application to Ontologies and Semantics" is now finished and available through several channels.
The original version is available at http://geb.uni-giessen.de/geb/volltexte/2017/13007 (open access).
The book version is available on Amazon and at other bookstores.
For the Kindle version, click here.
For the print paperback, click here.
This paper presents a supervised machine learning approach that aims at annotating those homograph word forms in WordNet that share some common meaning and can hence be thought of as belonging to a polysemous word.
This paper introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although it was originally developed to work with German texts, it is to a large degree language-independent.
The encfsApp icon on the Desktop
After creating a little Automator script to facilitate mounting encfs volumes on a Mac (see here), I did the same for my Ubuntu machine. It works very similarly: you drag and drop the encfs-encrypted folder onto the icon, and the script asks for the password and mounts the encfs volume. Dropping the folder onto the icon a second time unmounts it.
I have been using BoxCryptor Classic (the free edition) for well over a year, maybe longer. While it usually does what it is designed for (encrypting a folder and mounting it with a double click), the free version comes with some restrictions. For one, it does not encrypt the file names. Furthermore, you can only mount one device at a time; actually, you can only manage one device at a time, which is even more annoying. To get rid of these restrictions, you can pay €34.99 for the full private version.
Since I'm constantly switching between a Mac and Linux PCs, I was using encfs on the Linux machines. Encfs supports more than one mounted volume as well as file-name encryption. Encfs is compatible with BoxCryptor (well, actually it's the other way around), but it is not as comfortable to use: usually you have to mount the folder using the command line (although there are some tools that try to facilitate this).
It has been claimed that the Voynich manuscript is an encrypted text, an encryption of (natural?) human language. Others take it for a fraud, an elaborate hoax.
Recently I have been wondering about the statistical features of the Voynich manuscript, and about statistical features of human language texts that distinguish them from gibberish. In the following I will look at the word distribution and the entropy of the Voynich manuscript compared to Latin, modern German, and English.
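The kind of comparison sketched above can be made concrete with a few lines of Python. The snippet computes the Shannon entropy of a text's word-frequency distribution, once for a "real" sentence and once for gibberish in which every word occurs exactly once; the sample strings are illustrative stand-ins, not actual Voynich or Latin data.

```python
from collections import Counter
from math import log2

def word_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Natural language repeats frequent words ("the"), lowering the entropy.
natural = "the cat sat on the mat and the cat saw the dog"
# Uniform gibberish of the same length: every "word" occurs exactly once,
# so the entropy reaches its maximum, log2(number of words).
gibberish = "qz wx vb nm kj hg fd sa po iu yt re"

print(round(word_entropy(natural), 3))    # → 2.752
print(round(word_entropy(gibberish), 3))  # → 3.585
```

The gap between the two values is one simple statistical signal that distinguishes repetitive natural-language text from uniformly random strings.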
In the following, the implementation of German Language Processing 4 Lucene (glp4lucene: http://sourceforge.net/projects/glpforlucene/files/) is described. An example of how to use the package can be found under the link above and is described in more detail here.
If you are working with Lucene, you will probably use a stop word list. Most likely, you will not search for pronouns (he/she/it), for example. Instead of using a predefined stop word list, you can also look at the word categories, i.e. the parts of speech, and exclude from the index those words you are not interested in.
The approach described here is language-independent. All you need is the right model for the language you are working with; you pass the path to that model to the software and then create a list of those POS you don't need. Models can be found here.
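The idea can be sketched library-free in a few lines (the package itself is Java and uses trained POS models loaded from a path; the hard-coded tag table below is only a stand-in for such a model):

```python
# Toy POS tagger: in the real setup these tags come from a trained model.
TOY_TAGS = {
    "he": "PRON", "she": "PRON", "it": "PRON",
    "reads": "VERB", "book": "NOUN", "the": "DET",
}

# The POS categories we do not want to end up in the index.
EXCLUDED_POS = {"PRON", "DET"}

def pos_filter(tokens, tags=TOY_TAGS, excluded=EXCLUDED_POS):
    """Keep only tokens whose POS tag is not in the excluded set."""
    return [t for t in tokens if tags.get(t, "UNKNOWN") not in excluded]

print(pos_filter(["she", "reads", "the", "book"]))  # → ['reads', 'book']
```

Swapping the tag table for a different language's model changes nothing in the filtering logic, which is what makes the approach language-independent.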
Synthetic languages, such as German, are marked by the use of morphemes, rather than, for example, prepositions, to mark the function or relations of a word in a sentence. Either stemming or lemmatization can be used to overcome problems in information retrieval arising from declension (of nouns, adjectives, and pronouns: for example die schönen Häuser) and conjugation (of verbs: sprechen (speak): ich spreche (I speak), du sprichst (you speak), er spricht (he speaks)).
Stemming shortens a word down to its stem. For the three forms of sprechen above, this leads to the common stem sprech. Lemmatization maps all three forms to the common lemma sprechen. Lemmatization has two advantages over stemming.
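The sprechen example can be made concrete with a hard-coded sketch. The mapping tables below are illustrative stand-ins only; real stemmers derive stems from rule sets and real lemmatizers consult a lexicon.

```python
# Stemming: inflected forms map to a truncated stem (not itself a word).
TOY_STEM = {"spreche": "sprech", "sprichst": "sprech", "spricht": "sprech"}
# Lemmatization: inflected forms map to the dictionary headword.
TOY_LEMMA = {"spreche": "sprechen", "sprichst": "sprechen", "spricht": "sprechen"}

forms = ["spreche", "sprichst", "spricht"]
print([TOY_STEM[f] for f in forms])   # → ['sprech', 'sprech', 'sprech']
print([TOY_LEMMA[f] for f in forms])  # → ['sprechen', 'sprechen', 'sprechen']
# Unlike the stem "sprech", the lemma "sprechen" is a valid dictionary
# entry, which is one reason lemmatization can be preferable.
```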
The goal is to add all possible synonyms during indexing (in contrast to finding synonyms during search) to avoid multiple computations of the possible synonyms.
This tutorial is based on the approach described in "Lucene in Action" (source code can be found here: http://www.manning.com/hatcher3/). It was altered in a few points:
it supports Lucene 4.6.0
it looks up synonyms from GermaNet
it stores the synonyms in the index
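The index-time expansion described above can be sketched library-free as follows. The synonym table is a stand-in for a GermaNet lookup; in Lucene itself, the expansions would be emitted by a token filter at the same position as the original token (position increment 0), which is the behavior the position numbers below mimic.

```python
# Toy synonym table standing in for a GermaNet lookup.
TOY_SYNONYMS = {"auto": ["wagen", "fahrzeug"], "haus": ["gebäude"]}

def expand(tokens, synonyms=TOY_SYNONYMS):
    """Return (token, position) pairs; synonyms share their token's position."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((tok, pos))
        for syn in synonyms.get(tok, []):
            # Same position as the original token, so phrase queries
            # still match regardless of which synonym the user types.
            out.append((syn, pos))
    return out

print(expand(["das", "auto"]))
# → [('das', 0), ('auto', 1), ('wagen', 1), ('fahrzeug', 1)]
```

Doing this once at indexing time means the synonym lookup is not repeated for every query, at the cost of a larger index.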