Synthetic languages, such as German, are marked by the use of morphemes, rather than, for example, prepositions, to mark the function or relations of a word in a sentence. Either stemming or lemmatization can be used to overcome problems in information retrieval arising from declension (of nouns, adjectives, and pronouns: for example die schönen Häuser) and conjugation (of verbs: sprechen (speak): ich spreche (I speak), du sprichst (you speak), er spricht (he speaks)).
Stemming shortens a word down to its stem. For the three forms of sprechen above this leads to the common stem sprech. Lemmatization maps all three forms to the common lemma sprechen. Lemmatization has two advantages over stemming: First, when people conduct a web search, they are likely to look for a lemma. If the index is already lemmatized, there is no need for further treatment of the search term (although this is not always true, so one should therefore lemmatize both the index and the search terms!). Second, a lemma can be used to look a word up in a dictionary and possibly find synonyms or a definition.
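The difference can be illustrated with a toy example. This is only a sketch: the suffix list and the lemma dictionary below are made up for illustration; real stemmers and lemmatizers are far more sophisticated.

```java
import java.util.HashMap;
import java.util.Map;

public class StemVsLemma {
	//toy stemmer: strip a common German inflection ending (illustration only)
	static String stem(String word) {
		for (String suffix : new String[] {"en", "st", "e", "t"}) {
			if (word.endsWith(suffix)) {
				return word.substring(0, word.length() - suffix.length());
			}
		}
		return word;
	}

	//toy lemmatizer: look the full form up in a (here hard-coded) dictionary
	static final Map<String, String> LEMMAS = new HashMap<String, String>();
	static {
		LEMMAS.put("spreche", "sprechen");
		LEMMAS.put("sprichst", "sprechen");
		LEMMAS.put("spricht", "sprechen");
	}
	static String lemma(String word) {
		return LEMMAS.containsKey(word) ? LEMMAS.get(word) : word;
	}

	public static void main(String[] args) {
		System.out.println(stem("spreche"));  //sprech
		System.out.println(stem("spricht"));  //sprich - the vowel change is not undone
		System.out.println(lemma("spricht")); //sprechen
	}
}
```

Note how the suffix-stripping stemmer cannot undo the vowel change in spricht, while the dictionary-based lemmatizer maps all three forms to sprechen.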
In the following I will describe how to include lemmatization in the Lucene indexing procedure. While there are some implementations for English, there is no handy solution for German. I will use the Mate-tools lemmatizer.
Let’s start by looking at the Main routine of the attached package.
You can also use the Maven pom file found in the package here to let Maven take care of most dependencies. The Mate-tool cannot (yet) be found in its newest version on Maven (see here).
The jar file therefore has to be added to the project manually.

public class Main {
	private static LemmatizeAnalyzer lemmAnalyzer =
			new LemmatizeAnalyzer();

	public static void main(String args[]) throws IOException, InstantiationException, IllegalAccessException, ClassNotFoundException, SQLException, XMLStreamException, ParseException {
		Version matchVersion = Version.LUCENE_46;
		Directory index = new SimpleFSDirectory(new File("index"));
		//configure the index writer to use the lemmAnalyzer!
		IndexWriterConfig config = new IndexWriterConfig(matchVersion,
				lemmAnalyzer);
		IndexWriter w = new IndexWriter(index, config);
		//new document
		Document doc = new Document();
		//add field to the document
		doc.add(new Field("content", "Ein schönes, altes Haus. Es sprach zu ihm.", Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));
		//add document to the writer
		w.addDocument(doc);
		//save changes to the writer
		w.commit();
		//close writer
		w.close();
		System.err.println("done analyzing and writing index.");
	}
}

The lines

private static LemmatizeAnalyzer lemmAnalyzer =
			new LemmatizeAnalyzer();

initialize the Analyzer we need to add the lemmas to our index. The index writer is created using this analyzer:

IndexWriterConfig config = new IndexWriterConfig(matchVersion,
		lemmAnalyzer);

When we add a new document to our index, it will be analyzed according to the lemma analyzer:

Document doc = new Document();
doc.add(new Field("content", "Ein schönes, altes Haus. Es sprach zu ihm.", Field.Store.YES,Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));

But what exactly happens when the LemmaAnalyzer is used?

public class LemmatizeAnalyzer extends Analyzer {
	private Version version = Version.LUCENE_46;

	public LemmatizeAnalyzer() {
	}

	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		Tokenizer source = new StandardTokenizer(version, reader);
		TokenStream filter = new StandardFilter(version, source);
		filter = new StopFilter(version, filter, GermanAnalyzer.getDefaultStopSet());
		filter = new LowerCaseFilter(version, filter);
		try {
			filter = new GermanLemmatizerFilter(filter, "/path/to/ger-tagger+lemmatizer+morphology+graph-based-3.6/lemma-ger-3.6.model");
		} catch (Exception e) {
			e.printStackTrace();
		}
		return new TokenStreamComponents(source, filter);
	}
}

First, the StandardTokenizer is used to divide the string of words into tokens. Then we apply some more filters: the StandardFilter, a StopFilter (to filter stopwords), and the LowerCaseFilter (to transform all words to lower case; this improves the results of the lemmatizer!).
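Conceptually (and ignoring the many details the real Lucene filters handle), this part of the chain behaves like the following sketch. The stopword set here is a tiny hand-picked stand-in for GermanAnalyzer.getDefaultStopSet(), and for simplicity the stop check is done after lowercasing:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterChainSketch {
	//tiny stand-in for GermanAnalyzer.getDefaultStopSet()
	static final Set<String> STOPWORDS = new HashSet<String>(
			Arrays.asList("ein", "es", "zu", "ihm"));

	static List<String> analyze(String text) {
		List<String> tokens = new ArrayList<String>();
		//tokenizer step: split on anything that is not a letter
		for (String token : text.split("[^\\p{L}]+")) {
			if (token.isEmpty()) continue;
			//LowerCaseFilter step
			String lower = token.toLowerCase(Locale.GERMAN);
			//StopFilter step
			if (STOPWORDS.contains(lower)) continue;
			tokens.add(lower);
		}
		return tokens;
	}

	public static void main(String[] args) {
		System.out.println(analyze("Ein schönes, altes Haus. Es sprach zu ihm."));
		//[schönes, altes, haus, sprach]
	}
}
```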

filter = new GermanLemmatizerFilter(filter,"/path/to/ger-tagger+lemmatizer+morphology+graph-based-3.6/lemma-ger-3.6.model");

creates the new GermanLemmatizer using the given model (download here).

The code of the GermanLemmatizerFilter is only slightly more difficult. Its main routine is the incrementToken() method, a standard method that filters in Lucene implement.

public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      if (!keywordAttr.isKeyword()) {
        char termBuffer[] = termAtt.buffer();
        termAtt.copyBuffer(termBuffer, 0, termAtt.length());
        char term[] = new char[termAtt.length()];
        for (int x = 0; x < termAtt.length(); x++) {
          term[x] = termAtt.charAt(x);
        }
        lemmatize(term, this.glp);
        //after the lemmatization, term and length changed!
        final char finalTerm[] = this.glp.getCurrentBuffer();
        final int newLength = this.glp.getCurrentBufferLength();
        if (finalTerm != termBuffer)
          termAtt.copyBuffer(finalTerm, 0, newLength);
      }
      return true;
    } else {
      return false;
    }
}
It copies the token out of the term buffer into a fresh char array and looks up the lemma for it by calling lemmatize(term, this.glp). Afterwards, the (possibly changed) lemma and its new length are copied back into the term attribute.
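The copy-out/copy-back around the lemmatizer call can be isolated into a small self-contained sketch. replaceTerm and the hard-coded lemma are of course only stand-ins for the real attribute handling and the Mate-tools call:

```java
public class BufferCopySketch {
	//mirror of what incrementToken does around the lemmatize() call:
	//copy the token out of its buffer, then copy the new term back in
	static String replaceTerm(String original, String lemma) {
		char[] termBuffer = original.toCharArray();
		//copy the token into a fresh char array
		char[] term = new char[termBuffer.length];
		for (int x = 0; x < termBuffer.length; x++) {
			term[x] = termBuffer[x];
		}
		//pretend lemmatize() produced this result
		char[] finalTerm = lemma.toCharArray();
		int newLength = finalTerm.length;
		//after the lemmatization, term and length changed!
		if (finalTerm != termBuffer) {
			termBuffer = new char[newLength];
			System.arraycopy(finalTerm, 0, termBuffer, 0, newLength);
		}
		return new String(termBuffer, 0, newLength);
	}

	public static void main(String[] args) {
		System.out.println(replaceTerm("spricht", "sprechen")); //sprechen
	}
}
```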

The method lemmatize works as follows:

public void lemmatize(char termBuffer[], GermanLemmatizerProgram glp) {
    String term = "";
    for (char c : termBuffer) {
        term += c;
    }
    SentenceData09 i = new SentenceData09();
    i.init(new String[] {term});
    //lemmatize the one-word sentence; the result is stored in the SentenceData09 i
    //(getLemmatizer() is assumed here to return the Mate-tools Lemmatizer held by glp)
    i = glp.getLemmatizer().apply(i);
    char finalTerm[] = null;
    for (String s : i.plemmas) {
        finalTerm = s.toCharArray();
    }
    glp.setCurrent(finalTerm, finalTerm.length);
}

It uses the Mate-tool lemmatizer:

SentenceData09 i = new SentenceData09();
    	i.init(new String[] {term});

init accepts an array of strings. The only string in our case is the term itself. The corresponding lemma is saved in the SentenceData09 object. The result is transformed to a char array and set as current token.
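The string building and lemma extraction inside lemmatize can be reproduced without Mate-tools by stubbing the plemmas array; the array below is hard-coded, whereas in the real code it is filled by the lemmatizer:

```java
public class LemmaExtractionSketch {
	//build a String from the char buffer, as lemmatize() does
	static String buildTerm(char[] termBuffer) {
		String term = "";
		for (char c : termBuffer) {
			term += c;
		}
		return term;
	}

	//take the lemma out of a plemmas-style array, as lemmatize() does;
	//in the real code the array is filled by the Mate-tools lemmatizer
	static String extractLemma(String[] plemmas) {
		char[] finalTerm = null;
		for (String s : plemmas) {
			finalTerm = s.toCharArray();
		}
		return new String(finalTerm);
	}

	public static void main(String[] args) {
		String term = buildTerm("schönes".toCharArray());
		//stubbed lemmatizer result for the one-word "sentence" term
		String[] plemmas = new String[] {"schön"};
		System.out.println(term + " -> " + extractLemma(plemmas)); //schönes -> schön
	}
}
```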
Now the index contains the original text and the corresponding lemmas.
To test the results we use the searcher:

public static void main(String args[]) throws Exception {
		//version used throughout this example
		Version LUCENE_VERSION = Version.LUCENE_46;
		//path to the index
		String indexDirectory = "index";
		File indexFiles = new File(indexDirectory);
		IndexReader indexReader = SimpleFSDirectory(indexFiles));
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);
		//search term
		String query = "schön";
		StandardAnalyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
		//if necessary, different fields can be searched
		String[] felder = new String[] {"content"};
		MultiFieldQueryParser queryParser = new MultiFieldQueryParser(LUCENE_VERSION, felder, analyzer);
		Query luceneQuery = queryParser.parse(query);
		//collect results
		TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(100, true);, topScoreDocCollector);

		System.out.println("Suchanfrage: " + query + " über die Felder " + Arrays.toString(felder));

		for (ScoreDoc scoreDoc : topScoreDocCollector.topDocs().scoreDocs) {
			//getting the document
			Document document = indexSearcher.doc(scoreDoc.doc);
			//getting the content
			String content = document.get("content");
			//printing the content
			System.out.println(content);
		}
}

Even though the search term is

String query = "schön";

the result should look like this:

Suchanfrage: schön über die Felder [content]
Ein schönes, altes Haus. Es sprach zu ihm.

The whole code (and all dependencies) can be downloaded in the form of an Eclipse project.