In the following, the implementation of the German Language Processing 4 Lucene package (glp4lucene: http://sourceforge.net/projects/glpforlucene/files/) is described. An example of how to use the package can be found under the link above, and is described in more detail here.

The glp4lucene package consists of three parts: a lemmatizer, a part-of-speech (POS) filter, and a synonym analyzer.

The Lemmatizer

Synthetic languages such as German are marked by the use of morphemes, rather than, for example, prepositions, to mark the function or the relations of a word in a sentence. Either stemming or lemmatization can be used to overcome problems in information retrieval arising from declension (of nouns, adjectives, and pronouns: for example die schönen Häuser, "the beautiful houses") and conjugation (of verbs: sprechen (speak): ich spreche (I speak), du sprichst (you speak), er spricht (he speaks)).
Stemming shortens a word down to its stem. For the three forms of sprechen above this leads to the common stem sprech. Lemmatization maps all three forms to the common lemma sprechen. Lemmatization has two advantages over stemming: When people conduct a web search, they are likely to look for a lemma. If the index is already lemmatized, there is often no need for further treatment of the search term (although this is not always true, so one should lemmatize both index and search terms!). Moreover, a lemmatized word can be used to look up a word in a dictionary and possibly find synonyms or a definition.
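The practical upshot: use one and the same analyzer for indexing and for query parsing. A minimal sketch (the field name "content" and the class are illustrative only; the lemmatizing Analyzer itself is built from the filters described below):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public final class SharedAnalyzerExample {
  // open an IndexWriter with the lemmatizing analyzer...
  static IndexWriter openWriter(Directory dir, Analyzer analyzer) throws IOException {
    return new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_46, analyzer));
  }

  // ...and parse queries with the very same analyzer, so that a search
  // for "sprichst" is reduced to the same lemma as the indexed text
  static Query parse(String userInput, Analyzer analyzer) throws ParseException {
    QueryParser parser = new QueryParser(Version.LUCENE_46, "content", analyzer);
    return parser.parse(userInput);
  }
}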
In the following I will describe how to include lemmatization in the Lucene indexing procedure. While there are some implementations for English, there is no handy solution for German. I will use the mate-tools lemmatizer.

The package for lemmatization consists of three classes that will be examined in the following:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class GermanLemmatizerFilter extends TokenFilter {
  private final GermanLemmatizerProgram glp;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);
  private final GermanLemmatizer gl;

  public GermanLemmatizerFilter(TokenStream input, GermanLemmatizer modelIn, GermanLemmatizerProgram glpIn) throws Exception {
    super(input);
    this.glp = glpIn;
    this.gl = modelIn;
  }

  /** Returns the next input token, after being lemmatized */
  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      if (!keywordAttr.isKeyword()) {
        // copy the current token into a fresh array
        char termBuffer[] = termAtt.buffer();
        char term[] = new char[termAtt.length()];
        for (int x = 0; x < termAtt.length(); x++) {
          term[x] = termAtt.charAt(x);
        }
        this.gl.lemmatize(term, this.glp);
        // after lemmatization, the term and its length have changed!
        final char finalTerm[] = this.glp.getCurrentBuffer();
        final int newLength = this.glp.getCurrentBufferLength();
        if (finalTerm != termBuffer)
          termAtt.copyBuffer(finalTerm, 0, newLength);
        else
          termAtt.setLength(newLength);
      }
      return true;
    } else {
      return false;
    }
  }
}

The constructor

public GermanLemmatizerFilter(TokenStream input,GermanLemmatizer modelIn,GermanLemmatizerProgram glpIn)

takes the TokenStream (see usage), the lemmatizer model (mate-tools), and the GermanLemmatizerProgram (see below) as parameters.
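As a usage sketch, the filter could be wired into an Analyzer as follows; the class name GermanLemmaAnalyzer and the choice of StandardTokenizer are assumptions for illustration, not part of the package:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class GermanLemmaAnalyzer extends Analyzer {
  private final GermanLemmatizer gl;

  public GermanLemmaAnalyzer(GermanLemmatizer gl) {
    this.gl = gl;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    try {
      // one GermanLemmatizerProgram per token stream, as it carries state
      TokenStream result = new GermanLemmatizerFilter(source, gl, new GermanLemmatizerProgram());
      return new TokenStreamComponents(source, result);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}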

In incrementToken, the filter copies the characters of the current token into a new array, hands it to the lemmatizer, and afterwards corrects the token's buffer and length. The lemma lookup happens in this line:

this.gl.lemmatize(term,this.glp);

Here the GermanLemmatizer instance is called. The class looks like the following:

import is2.data.SentenceData09;
import is2.lemmatizer.Lemmatizer;

public class GermanLemmatizer {
	private final Lemmatizer l;

	public GermanLemmatizer(String modelPath) {
		// loads the mate-tools lemmatizer model from the given path
		l = new Lemmatizer(modelPath);
	}

	/**
	 * Looks up the lemma for the given token and stores the result
	 * in the given GermanLemmatizerProgram.
	 */
	public void lemmatize(char termBuffer[], GermanLemmatizerProgram glp) {
		String term = new String(termBuffer);
		SentenceData09 i = new SentenceData09();
		i.init(new String[] { term });
		// lemmatize a "sentence" consisting of the single token;
		// the result is stored in the SentenceData09 instance i
		i = l.apply(i);
		char finalTerm[] = null;
		for (String s : i.plemmas) {
			finalTerm = s.toCharArray();
		}
		glp.setCurrent(finalTerm, finalTerm.length);
	}
}

When the class is instantiated, the model is loaded from the given path. The function lemmatize looks up the correct lemma for the given token, using the GermanLemmatizerProgram class, which is implemented as follows:

public class GermanLemmatizerProgram {

    public GermanLemmatizerProgram() {
      current = new char[8];
      setCurrent("");
    }

    /**
     * Set the current string.
     */
    public void setCurrent(String value)
    {
      current = value.toCharArray();
      cursor = 0;
      limit = value.length();
      limit_backward = 0;
      bra = cursor;
      ket = limit;
    }

    /**
     * Get the current string.
     */
    public String getCurrent()
    {
      return new String(current, 0, limit);
    }

    /**
     * Set the current string.
     * @param text character array containing input
     * @param length valid length of text.
     */
    public void setCurrent(char text[], int length) {
      current = text;
      cursor = 0;
      limit = length;
      limit_backward = 0;
      bra = cursor;
      ket = limit;
    }

    /**
     * Get the current buffer containing the stem.
     *
     * NOTE: this may be a reference to a different character array than the
     * one originally provided with setCurrent, in the exceptional case that
     * stemming produced a longer intermediate or result string.
     *
     * It is necessary to use {@link #getCurrentBufferLength()} to determine
     * the valid length of the returned buffer. For example, many words are
     * stemmed simply by subtracting from the length to remove suffixes.
     *
     * @see #getCurrentBufferLength()
     */
    public char[] getCurrentBuffer() {
      return current;
    }

    /**
     * Get the valid length of the character array in
     * {@link #getCurrentBuffer()}.
     * @return valid length of the array.
     */
    public int getCurrentBufferLength() {
      return limit;
    }

    // current string
    private char current[];

    protected int cursor;
    protected int limit;
    protected int limit_backward;
    protected int bra;
    protected int ket;
}

This class follows the template of Lucene's Snowball stemmer (SnowballProgram) quite strictly.
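To try the two classes in isolation, a small driver like the following can help. The model file name is an assumption; substitute the German lemmatizer model you downloaded with mate-tools:

public class LemmatizerDemo {
  public static void main(String[] args) {
    // hypothetical model path; point this at your mate-tools lemmatizer model
    GermanLemmatizer gl = new GermanLemmatizer("models/lemma-ger.model");
    GermanLemmatizerProgram glp = new GermanLemmatizerProgram();
    gl.lemmatize("Häuser".toCharArray(), glp);
    // the lemma is read back from the program's buffer, e.g. "Haus"
    System.out.println(new String(glp.getCurrentBuffer(), 0, glp.getCurrentBufferLength()));
  }
}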

Synonymy extension

The goal is to add all possible synonyms during indexing (in contrast to finding synonyms during search), so that the possible synonyms do not have to be computed again for every query.

This tutorial is based on the approach described in "Lucene in Action" (source code can be found here: http://www.manning.com/hatcher3/). It was altered in a few points:

  • it supports Lucene 4.6 and higher
  • it looks up synonyms from GermaNet (version 6 and higher)
  • it stores synonyms in the index

Jar-Files needed:

  • GermaNet license (http://www.sfs.uni-tuebingen.de/lsd/)
  • GermaNet API matching your GermaNet Version (here: 6) (http://www.sfs.uni-tuebingen.de/lsd/tools.shtml#APIs)
  • Lucene (4.6.0 - http://archive.apache.org/dist/lucene/java/4.6.0/):
    • lucene-analyzers-common-4.6.0.jar
    • lucene-core-4.6.0.jar

You can also use the Maven pom file found in the package here, to let Maven take care of most dependencies. Only GermaNet and the GermaNet API need to be downloaded separately (see here).

The jar file consists of the following classes:

[Figure: overview of the Java classes in the package]

To use the SynonymFilter, one first has to instantiate the class TestSynonymEngine, passing the path to GermaNet as a parameter.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import javax.xml.stream.XMLStreamException;

import de.tuebingen.uni.sfs.germanet.api.GermaNet;
import de.tuebingen.uni.sfs.germanet.api.Synset;

public class TestSynonymEngine implements SynonymEngine {
	private static final HashMap<String, String[]> map = new HashMap<String, String[]>();

	public SynonymEngine start(String path) {
		GermaNet gnet = null;
		try {
			// path to GermaNet!
			gnet = new GermaNet(path);
		} catch (IOException e) {
			e.printStackTrace();
		} catch (XMLStreamException e) {
			e.printStackTrace();
		}
		List<Synset> allSynsets = gnet.getSynsets();
		for (Synset s : allSynsets) {
			List<String> allWords = s.getAllOrthForms();
			processSynonyms(allWords);
		}
		return this;
	}

	public void processSynonyms(List<String> allWords) {
		for (String w1 : allWords) {
			w1 = w1.toLowerCase();
			for (String w2 : allWords) {
				ArrayList<String> synonyms = new ArrayList<String>();
				// do not list a word as a synonym of itself
				if (!w1.equals(w2.toLowerCase())) {
					synonyms.add(w2.toLowerCase());
				}
				if (!map.containsKey(w1)) {
					map.put(w1, synonyms.toArray(new String[synonyms.size()]));
				} else {
					// merge the new synonyms with the ones already stored
					ArrayList<String> merged = new ArrayList<String>();
					for (String w3 : map.get(w1)) {
						merged.add(w3);
					}
					merged.addAll(synonyms);
					map.put(w1, merged.toArray(new String[merged.size()]));
				}
			}
		}
	}

	@Override
	public String[] getSynonyms(String s) {
		if (map.get(s) != null) {
			return map.get(s);
		} else {
			// unknown words are mapped to themselves
			return new String[] { s };
		}
	}

	public HashMap<String, String[]> getMap() {
		return map;
	}
}

This class implements the interface SynonymEngine, with its necessary function getSynonyms.

What the TestSynonymEngine class does, basically, is load the GermaNet synsets. It then iterates over all synsets and builds a map whose keys are the orthographic forms found in the synsets and whose values are the other members of the respective synset. These values are hence lists of synonymous words. The system can later add these alternatives to the index at the corresponding position.
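For example, once start() has run, looking up a lowercased orthographic form returns the other members of its synsets. A minimal sketch with a hypothetical GermaNet path (the synonyms in the comment are only what one would expect, not guaranteed output):

public class SynonymEngineDemo {
  public static void main(String[] args) {
    TestSynonymEngine engine = new TestSynonymEngine();
    engine.start("/path/to/GermaNet/XML"); // hypothetical path to the GermaNet data
    for (String syn : engine.getSynonyms("auto")) {
      System.out.println(syn); // e.g. "wagen", "automobil", ...
    }
  }
}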

In processing the tokens found in the texts (see here), the class SynonymFilter is called. It takes as parameters the instance of the SynonymEngine described above and a TokenStream.

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class SynonymFilter extends TokenFilter {
  public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

  private Stack<String> synonymStack;
  private SynonymEngine engine;
  private AttributeSource.State current;
  // CharTermAttribute substitutes the older TermAttribute
  private final CharTermAttribute termAtt;
  private final PositionIncrementAttribute posIncrAtt;

  public SynonymFilter(TokenStream in, SynonymEngine engine) {
    super(in);
    synonymStack = new Stack<String>();                   //#1
    this.engine = engine;

    this.termAtt = addAttribute(CharTermAttribute.class);
    this.posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (synonymStack.size() > 0) {                        //#2
      String syn = synonymStack.pop();                    //#2
      restoreState(current);                              //#2
      // the following two lines are new, due to the usage of CharTermAttribute
      termAtt.setEmpty();                                 //#2.1
      termAtt.append(syn);                                //#2.2
      posIncrAtt.setPositionIncrement(0);                 //#3
      return true;
    }

    if (!input.incrementToken())                          //#4
      return false;

    if (addAliasesToStack()) {                            //#5
      current = captureState();                           //#6
    }

    return true;                                          //#7
  }

  private boolean addAliasesToStack() throws IOException {
    String[] synonyms = engine.getSynonyms(termAtt.toString()); //#8
    if (synonyms == null) {
      return false;
    }
    for (String synonym : synonyms) {                     //#9
      synonymStack.push(synonym);
    }
    return true;
  }
}

This class follows the implementation found in "Lucene in Action" but is adapted to the newer versions of Lucene (4.6 and higher). For more information on how it works, see "Lucene in Action".
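A possible way to wire the engine and the filter into an analyzer is sketched below; the class name and chain order are assumptions. Lowercasing before the synonym filter matters because TestSynonymEngine stores its keys in lower case:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class SynonymAnalyzer extends Analyzer {
  private final SynonymEngine engine;

  public SynonymAnalyzer(SynonymEngine engine) {
    this.engine = engine;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    // synonyms are injected at the same position as the original token
    result = new SynonymFilter(result, engine);
    return new TokenStreamComponents(source, result);
  }
}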

POS-Filter

If you are working with Lucene, you will probably use a stop word list. Most likely, you will not search for pronouns (he/she/it), for example. Instead of using a predefined stop word list, you can also look at the word categories, i.e. the parts of speech, of the words and exclude from the index those words you are not interested in.
The approach described here is language independent. All you need is the right model for the language you are working with; pass the path to that model to the software. Then you create a list of those POS you don't need. Models can be found here.
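For German, such an exclusion list might look like the following sketch. The tag names assume a model that uses the STTS tagset (as the Stanford German models do); adjust them to whatever tagset your model emits:

import java.util.ArrayList;

public class StopPOSExample {
  static ArrayList<String> germanStopPOS() {
    ArrayList<String> stopPOS = new ArrayList<String>();
    stopPOS.add("ART");  // articles (der, die, das)
    stopPOS.add("PPER"); // personal pronouns (ich, du, er, ...)
    stopPOS.add("APPR"); // prepositions
    stopPOS.add("KON");  // coordinating conjunctions
    return stopPOS;
  }
}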

Admittedly, the POS tagging used in this package is quite basic and does not take context into account. Still, it is good enough to filter out more unwanted words than the available stop word lists.
I strongly recommend using Maven to deal with the dependencies. The jar file also contains an example pom.xml file which you can use in your own project. Maven will take care of most dependencies (see here).

We are adding a new kind of Attribute, a POS attribute, that can be assigned to a token. The interface is quite simple. PartOfSpeechAttribute.java looks like this:

import org.apache.lucene.util.Attribute;

public interface PartOfSpeechAttribute extends Attribute {

  public void setPartOfSpeech(String pos);

  public String getPartOfSpeech();
}

The actual implementation looks as follows:

import org.apache.lucene.util.AttributeImpl;

public final class PartOfSpeechAttributeImpl extends AttributeImpl
    implements PartOfSpeechAttribute {

  private String pos = "";

  public void setPartOfSpeech(String pos) {
    this.pos = pos;
  }

  public String getPartOfSpeech() {
    return pos;
  }

  @Override
  public void clear() {
    pos = "";
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }
}

It implements the functionality one needs to add, read, and alter the attribute.
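To see the attribute in action, a consuming loop like this sketch prints each token together with its tag; it assumes the stream's filter chain includes the tagging filter described next:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PosDump {
  static void dump(TokenStream stream) throws IOException {
    CharTermAttribute term = stream.getAttribute(CharTermAttribute.class);
    PartOfSpeechAttribute pos = stream.getAttribute(PartOfSpeechAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString() + " -> " + pos.getPartOfSpeech());
    }
    stream.end();
    stream.close();
  }
}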

The next class implements the incrementToken method.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PartOfSpeechTaggingFilter extends TokenFilter {
    private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final MaxentTagger tagger;

    public PartOfSpeechTaggingFilter(TokenStream in, MaxentTagger taggerIn) throws IOException {
      super(in);
      tagger = taggerIn;
    }

    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      // tag the single token; the tagger appends the POS after an underscore,
      // so the tag is extracted from the end of the tagged string
      String posTagged = tagger.tagString(termAtt.toString().toLowerCase());
      String pos = posTagged.replaceAll(".*_(.*)?\\s?$", "$1");
      pos = pos.replaceAll("\\(|\\)", "");
      pos = pos.trim();
      posAtt.setPartOfSpeech(pos);
      return true;
    }

    // passes tokens through without tagging (e.g. for evaluation purposes)
    public boolean incrementTokenSpecial() throws IOException {
      return input.incrementToken();
    }
}

and is used to process the token stream and add the attribute, i.e. the tagged POS. Needed parameters are a MaxentTagger, instantiated with the right POS model, and the token stream. On how to call and use the class, see here.
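Setting this up might look like the following sketch; the model path is an assumption (use whichever German .tagger model you downloaded from the Stanford site):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TaggerSetup {
  static TokenStream addPosTagging(TokenStream in) throws IOException {
    // hypothetical model location; any Stanford German POS model works
    MaxentTagger tagger = new MaxentTagger("models/german-fast.tagger");
    return new PartOfSpeechTaggingFilter(in, tagger);
  }
}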

And finally we need another class to filter out the tokens whose POS we don't need. This is done as follows:

import java.util.ArrayList;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

// builds on Lucene's FilteringTokenFilter: accept() decides which tokens are kept
public final class POSFilterOut extends FilteringTokenFilter {

  private final ArrayList<String> stopPOS;
  //private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  /**
   * Constructs a filter which removes words from the input TokenStream that are
   * of a given POS.
   *
   * @param matchVersion
   *          Lucene version to enable correct Unicode 4.0 behavior in the stop
   *          set if Version > 3.0.  See above for details.
   * @param in
   *          Input stream
   * @param stopPOS
   *          A list of the POS to be filtered.
   */
  public POSFilterOut(Version matchVersion, TokenStream in, ArrayList<String> stopPOS) {
    super(matchVersion, in);
    this.stopPOS = stopPOS;
    System.err.println(this.stopPOS.size() + " POSs excluded");
    for (String s : this.stopPOS)
      System.err.println("Will ignore POS \"" + s + "\" for indexing.");
  }

  /**
   * Returns the next input token whose POS is not in the exclusion list.
   */
  @Override
  protected boolean accept() {
    // a plain stopPOS.contains(...) would not support regular expressions in
    // the POS list; the loop below uses matches(), so entries may be regexes
    //return !stopPOS.contains(posAtt.getPartOfSpeech());
    boolean b = true;
    for (String pos1 : stopPOS) {
      if (posAtt.getPartOfSpeech().matches(pos1)) {
        b = false;
        //uncomment for evaluation purposes
        //System.err.println(posAtt.getPartOfSpeech()+" for "+termAtt.toString());
      }
    }
    return b;
  }
}
    

This class is instantiated using the Lucene Version constant, the TokenStream that contains the tokens, as well as a list of POS to be filtered out.
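Putting it all together, one possible chain combining all three parts of the package could look like this sketch; the order (tag, filter by POS, lemmatize, expand synonyms) is an assumption based on this tutorial, not a prescribed setup:

import java.io.Reader;
import java.util.ArrayList;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class GlpChainExample {
  static TokenStream buildChain(Reader reader, MaxentTagger tagger,
                                ArrayList<String> stopPOS,
                                GermanLemmatizer lemmatizer,
                                SynonymEngine engine) throws Exception {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream stream = new PartOfSpeechTaggingFilter(source, tagger); // tag each token
    stream = new POSFilterOut(Version.LUCENE_46, stream, stopPOS);      // drop unwanted POS
    stream = new GermanLemmatizerFilter(stream, lemmatizer, new GermanLemmatizerProgram());
    return new SynonymFilter(stream, engine);                           // expand synonyms
  }
}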

All these classes are used when