The goal is to add all possible synonyms during indexing (in contrast to looking up synonyms at search time), so that the synonyms only have to be computed once.

This tutorial is based on the approach described in "Lucene in Action" (source code can be found here: http://www.manning.com/hatcher3/). The approach was altered in a few points:

  • it supports Lucene 4.6.0
  • it looks up synonyms from GermaNet
  • it stores the synonyms in the index

JAR files needed:

  • GermaNet license (http://www.sfs.uni-tuebingen.de/lsd/)
  • GermaNet API matching your GermaNet version (here: 6) (http://www.sfs.uni-tuebingen.de/lsd/tools.shtml#APIs)
  • Lucene 4.6.0 (http://archive.apache.org/dist/lucene/java/4.6.0/):
    • lucene-analyzers-common-4.6.0.jar
    • lucene-core-4.6.0.jar

You can also use the Maven pom file found in the package to let Maven take care of most dependencies; only GermaNet and the GermaNet API need to be downloaded separately (see here).
The jar file contains the following classes:

[Image: diagram of the Java classes in the package]

The Main class constructs the Lucene index, and the LuceneSearcher class searches it. The other classes are more interesting, since they are used to find and store the synonyms.

First, one has to declare the Analyzer that will be used to analyze the text added to the index. This is done in Main:

    private static SynonymAnalyzerExample synonymAnalyzer =
            new SynonymAnalyzerExample(new TestSynonymEngine().start("/path/to/GermaNet/GermaNet6.0/"));

In the next step, the IndexWriter is configured to use the synonymAnalyzer:

    Version matchVersion = Version.LUCENE_46;
    Directory index = new SimpleFSDirectory(new File("index"));
    //configure the IndexWriter to use the synonymAnalyzer!
    IndexWriterConfig config = new IndexWriterConfig(matchVersion, synonymAnalyzer);
    IndexWriter w = new IndexWriter(index, config);

Now we can add documents to the index; they are analyzed on the fly by the synonymAnalyzer:

    w.deleteAll();
    //new document
    Document doc = new Document();
    //add a field to the document; the content is the German Monopoly "Go to jail" card text
    doc.add(new Field("content", "Gehe in das Gefängnis, begib dich direkt dort hin, ziehe nicht über Los und ziehe keine 4000 ein", Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));
    //add document to the writer
    w.addDocument(doc);
    //save changes to the writer
    w.commit();
    //close writer
    w.close();
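
If you want to convince yourself that the synonyms really ended up in the index, a minimal sketch like the following (assuming the index Directory and the "content" field from the snippet above; IndexReader, DirectoryReader, Terms, TermsEnum and BytesRef come from the Lucene core packages) prints all terms stored in the term vector of the first document, including the injected synonyms:

    IndexReader reader = DirectoryReader.open(index);
    //term vector of the first document (docID 0); available because the field
    //was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS
    Terms terms = reader.getTermVector(0, "content");
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString());
    }
    reader.close();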

The really interesting part is what the analyzer does. Let's look at the definition of the SynonymAnalyzer class:

    public class SynonymAnalyzer extends Analyzer {
        //Lucene version used throughout this tutorial
        private static final Version version = Version.LUCENE_46;
        private SynonymEngine engine;

        public SynonymAnalyzer(SynonymEngine engine) {
            this.engine = engine;
        }

        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(version, reader);
            TokenStream filter = new StandardFilter(version, source);
            filter = new StopFilter(version, filter, GermanAnalyzer.getDefaultStopSet());
            //first run the lower-case filter, then the lemmatizer; this improves the results
            filter = new LowerCaseFilter(version, filter);
            try {
                filter = new SynonymFilter(
                        new GermanLemmatizerFilter(filter, "/path/to//lemma-ger-3.6.model"), engine);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return new TokenStreamComponents(source, filter);
        }
    }

When the content added to a field of the Lucene index is analyzed, the method createComponents(String fieldName, Reader reader) is called (this is also the point at which the original source code was not compatible with newer versions of Lucene). Before one can look up synonyms, one has to find the lemma of a word, because only lemmas can be looked up in GermaNet. The filter of special interest here is the SynonymFilter, which adds the synonyms to the tokens it receives. Some adjustments were needed to make the original source code compatible with Lucene 4.6.0; you can find these adjustments in the source code.
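
To illustrate the idea, here is a minimal sketch of such a synonym-injecting TokenFilter written against the Lucene 4.x attribute API. It follows the design from LIA; the SynonymEngine interface with its getSynonyms(String) lookup is assumed as in the book, and the actual adjusted source code may differ in details:

    import java.io.IOException;
    import java.util.Stack;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    public class SynonymFilter extends TokenFilter {
        private final SynonymEngine engine;
        private final Stack<String> synonymStack = new Stack<String>();
        private AttributeSource.State current;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

        public SynonymFilter(TokenStream in, SynonymEngine engine) {
            super(in);
            this.engine = engine;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!synonymStack.isEmpty()) {
                //emit a buffered synonym at the same position as the original token
                restoreState(current);
                termAtt.setEmpty().append(synonymStack.pop());
                posIncrAtt.setPositionIncrement(0);
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            //look up synonyms for the current token and buffer them
            String[] synonyms = engine.getSynonyms(termAtt.toString());
            if (synonyms != null && synonyms.length > 0) {
                for (String synonym : synonyms) {
                    synonymStack.push(synonym);
                }
                current = captureState();
            }
            return true;
        }
    }

The important detail is the position increment of 0: a synonym is stored at the same position as the original token, so phrase and proximity queries keep working.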

The TestSynonymEngine from LIA was extended so that it does not assign synonyms manually, but uses the synonyms defined in GermaNet.

First, GermaNet is loaded:

    public SynonymEngine start(String path) {
        GermaNet gnet = null;
        try {
            //path to GermaNet!
            gnet = new GermaNet(path);
        } catch (FileNotFoundException | XMLStreamException e) {
            e.printStackTrace();
        }
        ...
    }

Then a HashMap is initialized that will hold the target words and their synonyms:

    private static HashMap<String, String[]> map = new HashMap<String, String[]>();

To fill this HashMap, we iterate over all synsets in GermaNet and add the words contained in them as synonyms of each other:

    List<Synset> allSynsets = gnet.getSynsets();
    for (Synset s : allSynsets) {
        List<String> allWords = s.getAllOrthForms();
        for (String w1 : allWords) {
            //collect all other orthographic forms of this synset as synonyms of w1
            ArrayList<String> synonyms = new ArrayList<String>();
            for (String w2 : allWords) {
                //compare by content, not by object identity
                if (!w1.equals(w2)) {
                    synonyms.add(w2);
                }
            }
            if (!map.containsKey(w1)) {
                map.put(w1, synonyms.toArray(new String[synonyms.size()]));
            } else {
                //w1 already occurred in another synset: merge the synonym lists
                ArrayList<String> merged = new ArrayList<String>(Arrays.asList(map.get(w1)));
                merged.addAll(synonyms);
                map.put(w1, merged.toArray(new String[merged.size()]));
            }
        }
    }
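
With the map filled, the lookup that the SynonymFilter performs is trivial. A sketch of the engine's getSynonyms method (the name follows the LIA SynonymEngine interface assumed above):

    public String[] getSynonyms(String s) {
        //returns null if the word is not contained in GermaNet
        return map.get(s);
    }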

This is all one has to do. If you run the indexing, every input is analyzed, and for every word that is contained in GermaNet its synonyms are added to the index. If you now run a fuzzy search, not only the search terms themselves are found, but also their possible synonyms.

The class LuceneSearcher does exactly this. First, we load the index and run the search; then the results are printed.

    class LuceneSearcher {
        private static Version LUCENE_VERSION = Version.LUCENE_46;

        public static void main(String args[]) throws Exception {
            //path to the index
            String indexDirectory = "index";
            File indexFiles = new File(indexDirectory);
            IndexReader indexReader = DirectoryReader.open(FSDirectory.open(indexFiles));
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            //search term: a synonym of one of the words in the content
            //the ~ indicates a fuzzy search and is needed to find synonyms, not only the original term!
            String query = "Justizanstalt~";
            GermanAnalyzer de_an = new GermanAnalyzer(LUCENE_VERSION);
            //if necessary, different fields can be searched
            String[] felder = new String[] {"content"};
            MultiFieldQueryParser queryParser = new MultiFieldQueryParser(LUCENE_VERSION, felder, de_an);
            Query luceneQuery = queryParser.parse(query);
            //collect results
            TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(100, true);
            indexSearcher.search(luceneQuery, topScoreDocCollector);
            System.out.println("Query: " + query + " over the fields " + Arrays.toString(felder));
            System.out.println("Found:");
            for (ScoreDoc scoreDoc : topScoreDocCollector.topDocs().scoreDocs) {
                //get the document
                Document document = indexSearcher.doc(scoreDoc.doc);
                //get the content
                String content = document.get("content");
                //print the document id
                System.out.println("scoreDoc.doc: " + scoreDoc.doc);
                //print the content
                System.out.println(content);
            }
            indexReader.close();
        }
    }