If you are working with Lucene, you will probably use a stop word list. Most likely you will not want to search for pronouns (he/she/it), for example. Instead of using a predefined stop word list, you can also look at the word categories, i.e. the parts of speech (POS), of the words and exclude from the index those words you are not interested in.
The approach described here is language independent. All you need is the right model for the language you are working with; you pass the path to that model to the software. Then you create a list of those POS you don’t need. Models can be found here.

Admittedly, the POS-tagging used in this package is quite basic and does not take context into account. Still, it is good enough to filter more unwanted words than the available stop word list.
I strongly recommend using Maven to deal with the dependencies. The .jar file also contains an exemplary pom.xml file which you can use in your own project. Maven will take care of most dependencies (see here).
We are adding a new kind of attribute, the POS attribute. The implementation is quite simple. PartOfSpeechAttribute.java looks like this:

import org.apache.lucene.util.Attribute;

public interface PartOfSpeechAttribute extends Attribute {
    public void setPartOfSpeech(String pos);
    public String getPartOfSpeech();
}

This interface defines the attribute. Furthermore, one needs the implementation of the attribute:

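import org.apache.lucene.util.AttributeImpl;

public final class PartOfSpeechAttributeImpl extends AttributeImpl
        implements PartOfSpeechAttribute {

    private String pos = "";

    public void setPartOfSpeech(String pos) {
        this.pos = pos;
    }

    public String getPartOfSpeech() {
        return pos;
    }

    // Reset the tag so that state does not leak from one token to the next.
    @Override
    public void clear() {
        pos = "";
    }

    // Copy the tag into another attribute instance of the same kind.
    @Override
    public void copyTo(AttributeImpl target) {
        ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
    }
}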

The next class implements the incrementToken method:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PartOfSpeechTaggingFilter extends TokenFilter {
    PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    MaxentTagger tagger;

    public PartOfSpeechTaggingFilter(TokenStream in, MaxentTagger taggerIn) throws IOException {
        super(in);
        //TaggerConfig config = new TaggerConfig("-model", modelFile);
        // The final argument suppresses a "loading" message on stderr.
        this.tagger = taggerIn;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Tag the lower-cased term on its own; the tagger returns "word_TAG".
        String posTagged = tagger.tagString(termAtt.toString().toLowerCase());
        String pos = posTagged.replaceAll(".*_(.*)?\\s?$", "$1");
        pos = pos.replaceAll("\\(|\\)", "");
        // Store the tag on the token; trim() drops whitespace the greedy group may capture.
        posAtt.setPartOfSpeech(pos.trim());
        return true;
    }

    public boolean incrementTokenSpecial() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        return true;
    }
}
This filter processes the token stream and adds the attribute, i.e. the tagged POS, to each token.
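
The tag extraction step can be tried out in isolation, without Lucene or a tagger model. The following is a minimal sketch that applies the same two regular expressions to hand-written strings; the class name `TagExtractor` and the sample inputs are assumptions made for illustration, and the inputs merely imitate the `word_TAG` shape with an optional trailing space. It also shows why a final `trim()` is useful: the greedy group `(.*)` swallows a trailing space before `\\s?` gets a chance to match it.

```java
final class TagExtractor {

    /** Extracts the tag from a single "word_TAG" string (hypothetical helper). */
    static String extractTag(String tagged) {
        // Keep everything after the last underscore.
        String pos = tagged.replaceAll(".*_(.*)?\\s?$", "$1");
        // Remove parentheses, which occur in some punctuation tags.
        pos = pos.replaceAll("\\(|\\)", "");
        // The greedy group above also captures trailing whitespace, so trim it.
        return pos.trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + extractTag("haus_NN ") + "]"); // prints "[NN]"
        System.out.println("[" + extractTag("(_$( ") + "]");    // prints "[$]"
    }
}
```

A string without an underscore is returned unchanged (apart from trimming), because the first pattern does not match at all.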
Finally, we need another class to filter out the tokens whose POS we don’t need. This is done as follows:

import java.util.ArrayList;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class POSFilterOut extends FilteringTokenFilter {

  private final ArrayList<String> stopPOS;
  //private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  public POSFilterOut(Version matchVersion, TokenStream in, ArrayList<String> stopPOS) {
    super(matchVersion, in);
    this.stopPOS = stopPOS;
    System.err.println(this.stopPOS.size() + " POSs excluded");
    for (String s : this.stopPOS)
      System.err.println("Will ignore POS \"" + s + "\" for indexing.");
  }

  /**
   * Accepts the current token only if its POS matches none of the excluded patterns.
   */
  @Override
  protected boolean accept() {
    //this line does not accept regular expressions in the POS, which is desirable
    //return !stopPOS.contains(posAtt.getPartOfSpeech());
    boolean b = true;
    for (String pos1 : stopPOS) {
      if (posAtt.getPartOfSpeech().matches(pos1)) {
        //uncomment for evaluation purposes
        //System.err.println(posAtt.getPartOfSpeech()+" for "+termAtt.toString());
        b = false;
      }
    }
    return b;
  }
}
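
Because the exclusion list is matched with regular expressions, it helps to see how whole-string matching behaves on actual tags. The sketch below is a plain-Java stand-in for the `accept()` logic, not the filter itself; the class name `PosMatcher` is hypothetical, and the tags (`ART`, `PPER`, `VVFIN`, `ADV`, `NN`) assume an STTS-style tag set, so they have to be adapted to the tag set of the model in use. As a small variation, the patterns are compiled once in the constructor instead of on every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PosMatcher {

    private final List<Pattern> excluded = new ArrayList<>();

    PosMatcher(List<String> patterns) {
        // Compile each exclusion pattern once, up front.
        for (String p : patterns) {
            excluded.add(Pattern.compile(p));
        }
    }

    /** Returns false if the whole tag matches any exclusion pattern. */
    boolean accept(String pos) {
        for (Pattern p : excluded) {
            if (p.matcher(pos).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PosMatcher matcher = new PosMatcher(Arrays.asList("ART", "PPER", "V.*"));
        System.out.println(matcher.accept("VVFIN")); // false: "V.*" covers every tag starting with V
        System.out.println(matcher.accept("ADV"));   // true: the pattern must match the whole tag
        System.out.println(matcher.accept("NN"));    // true: not on the list
    }
}
```

The second call is the one to keep in mind: `matches` tests the entire tag, so `V.*` does not exclude `ADV` even though that tag contains a V.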

Only those words whose POS is not in the list are accepted and will be used in indexing. To use the filter, one writes a new Analyzer:

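import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;

class SynonymAnalyzerExample extends Analyzer {
	private SynonymEngine engine;
	private MaxentTagger tagger;
	private static GermanLemmatizer gl;
	private Version version = Version.LUCENE_48;
	private final GermanLemmatizerProgram glp;
	private ArrayList<String> excludePOS;

	public SynonymAnalyzerExample(SynonymEngine engine, String POSmodelFile,
			String lemmatizerModelString) {
		this.engine = engine;
		// defining these variables here and reusing them saves some resources
		// when using an Analyzer wrapper
		this.tagger = new MaxentTagger(POSmodelFile, new TaggerConfig("-model",
				POSmodelFile), false);
		SynonymAnalyzerExample.gl = new GermanLemmatizer(lemmatizerModelString);
		this.glp = new GermanLemmatizerProgram();
		this.excludePOS = new ArrayList<String>();
		// Add the names of the POSs to be ignored, corresponding to the POS tag set
		// used! Regular expressions can be used
	}
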
	protected TokenStreamComponents createComponents(String fieldName,
			Reader reader) {
		Tokenizer source = new StandardTokenizer(version, reader);
		TokenStream filter = new StandardFilter(version, source);
		filter = new StopFilter(version, filter,

		// first run lower case filter, then run lemmatizer. improves the results
		filter = new LowerCaseFilter(version, filter);
		try {
			filter = new SynonymFilter(new GermanLemmatizerFilter(filter, gl,
					glp), engine);
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
		try {
			filter = new PartOfSpeechTaggingFilter(filter, tagger);
		} catch (IOException e1) {
			// TODO Auto-generated catch block
			e1.printStackTrace();
		}
		// drop all tokens whose POS matches an entry of excludePOS
		filter = new POSFilterOut(version, filter, excludePOS);
		return new TokenStreamComponents(source, filter);
	}
}
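
To see the order of the last steps of the chain (lower-casing, tagging, POS filtering) without setting up Lucene, the Stanford tagger or a model file, a toy version in plain Java may help. This is only a sketch under loud assumptions: `ToyPosPipeline` and its lookup table are hypothetical stand-ins for the real tagger, the tags follow an STTS-style tag set, unknown words simply get the tag `XY`, and tokenization is reduced to splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ToyPosPipeline {

    // Hypothetical lookup table standing in for a real tagger model.
    private static final Map<String, String> TOY_TAGS = new HashMap<>();
    static {
        TOY_TAGS.put("sie", "PPER");
        TOY_TAGS.put("liest", "VVFIN");
        TOY_TAGS.put("das", "ART");
        TOY_TAGS.put("buch", "NN");
    }

    /** Lower-cases each token, tags it in isolation and drops excluded POS. */
    static List<String> analyze(String text, List<String> excludePOS) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = raw.toLowerCase(Locale.ROOT);       // lower-case step
            String pos = TOY_TAGS.getOrDefault(term, "XY");   // tagging step (toy)
            boolean excluded = false;
            for (String pattern : excludePOS) {               // POS filter step
                if (pos.matches(pattern)) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> result = analyze("Sie liest das Buch", Arrays.asList("PPER", "ART"));
        System.out.println(result); // prints "[liest, buch]"
    }
}
```

As in the real chain, each token is tagged on its own, which is why the tagging cannot take context into account.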