comirva.web.indexing
Class IncludeTermsFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by comirva.web.indexing.IncludeTermsFilter
All Implemented Interfaces:
Closeable

public final class IncludeTermsFilter
extends org.apache.lucene.analysis.TokenFilter

Passes through only those tokens of a token stream whose terms occur in a given word list.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
IncludeTermsFilter(boolean enablePositionIncrements, org.apache.lucene.analysis.TokenStream in, Set<?> includeWords)
          Constructs a filter which includes only words in a list.
IncludeTermsFilter(boolean enablePositionIncrements, org.apache.lucene.analysis.TokenStream input, Set<?> includeWords, boolean ignoreCase)
          Construct a token stream filtering the given input.
 
Method Summary
 boolean getEnablePositionIncrements()
          Returns whether position increments are preserved for omitted tokens.
static boolean getEnablePositionIncrementsVersionDefault(org.apache.lucene.util.Version matchVersion)
          Returns version-dependent default for enablePositionIncrements.
 boolean incrementToken()
          Returns the next input Token whose term() is in the include word list.
static Set<Object> makeStopSet(List<?> stopWords)
          Builds a Set from a list of words, appropriate for passing into the IncludeTermsFilter constructor.
static Set<Object> makeStopSet(List<?> stopWords, boolean ignoreCase)
           
static Set<Object> makeStopSet(String... stopWords)
          Builds a Set from an array of words, appropriate for passing into the IncludeTermsFilter constructor.
static Set<Object> makeStopSet(String[] stopWords, boolean ignoreCase)
           
 void setEnablePositionIncrements(boolean enable)
          If true, this filter will preserve positions of the incoming tokens (i.e., accumulate and set position increments of the omitted tokens).
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

IncludeTermsFilter

public IncludeTermsFilter(boolean enablePositionIncrements,
                          org.apache.lucene.analysis.TokenStream input,
                          Set<?> includeWords,
                          boolean ignoreCase)
Construct a token stream filtering the given input. If includeWords is an instance of CharArraySet (true if makeStopSet() was used to construct the set) it will be directly used and ignoreCase will be ignored since CharArraySet directly controls case sensitivity.

If includeWords is not an instance of CharArraySet, a new CharArraySet will be constructed and ignoreCase will be used to specify the case sensitivity of that set.

Parameters:
enablePositionIncrements - true if token positions should record the omitted tokens
input - Input TokenStream
includeWords - A Set of Strings or char[] or any other toString()-able set representing the words to include
ignoreCase - if true, all words are lower cased first
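
The interaction between includeWords and ignoreCase described above can be modelled without Lucene. The following is a minimal sketch using only the Java standard library. CaseAwareSet is a hypothetical stand-in for CharArraySet, and resolveWordSet is an illustrative helper; neither is part of this class, and the real constructor may differ in detail.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for CharArraySet: a set that owns its case sensitivity. */
final class CaseAwareSet extends AbstractSet<Object> {
    private final Set<String> words = new HashSet<String>();
    private final boolean ignoreCase;

    CaseAwareSet(Iterable<?> source, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        for (Object word : source) {
            words.add(normalize(word));
        }
    }

    /** Converts Strings, char[] and other toString()-able values to one form. */
    private String normalize(Object word) {
        String text = (word instanceof char[])
                ? new String((char[]) word)
                : String.valueOf(word);
        return ignoreCase ? text.toLowerCase(Locale.ROOT) : text;
    }

    @Override public boolean contains(Object word) {
        return words.contains(normalize(word));
    }

    @Override public Iterator<Object> iterator() {
        return new HashSet<Object>(words).iterator();
    }

    @Override public int size() {
        return words.size();
    }
}

final class WordSetResolution {
    /**
     * Models the documented rule: a set that already controls its own case
     * sensitivity is used as-is and the ignoreCase argument has no effect;
     * any other set is copied into a new case-aware set using ignoreCase.
     */
    static CaseAwareSet resolveWordSet(Set<?> includeWords, boolean ignoreCase) {
        if (includeWords instanceof CaseAwareSet) {
            return (CaseAwareSet) includeWords; // ignoreCase is ignored here
        }
        return new CaseAwareSet(includeWords, ignoreCase);
    }
}
```

In this model, a case-sensitive set passed together with ignoreCase = true stays case-sensitive, which mirrors the note that the flag is ignored when the set was built with makeStopSet().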

IncludeTermsFilter

public IncludeTermsFilter(boolean enablePositionIncrements,
                          org.apache.lucene.analysis.TokenStream in,
                          Set<?> includeWords)
Constructs a filter which passes through only those tokens of the input TokenStream that are named in the Set.

Parameters:
enablePositionIncrements - true if token positions should record the omitted tokens
in - Input TokenStream
includeWords - A Set of Strings or char[] or any other toString()-able set representing the words to include
See Also:
makeStopSet(java.lang.String[])

Method Detail

makeStopSet

public static final Set<Object> makeStopSet(String... stopWords)
Builds a Set from an array of words, appropriate for passing into the IncludeTermsFilter constructor. This permits the word set to be built once and cached when an Analyzer is constructed.

See Also:
makeStopSet(java.lang.String[], boolean), passing false to ignoreCase

makeStopSet

public static final Set<Object> makeStopSet(List<?> stopWords)
Builds a Set from a list of words, appropriate for passing into the IncludeTermsFilter constructor. This permits the word set to be built once and cached when an Analyzer is constructed.

Parameters:
stopWords - A List of Strings or char[] or any other toString()-able list representing the words to include
Returns:
A Set (CharArraySet) containing the words
See Also:
makeStopSet(List, boolean), passing false to ignoreCase

makeStopSet

public static final Set<Object> makeStopSet(String[] stopWords,
                                            boolean ignoreCase)
Parameters:
stopWords - An array of words to include
ignoreCase - If true, all words are lower cased first.
Returns:
a Set containing the words

makeStopSet

public static final Set<Object> makeStopSet(List<?> stopWords,
                                            boolean ignoreCase)
Parameters:
stopWords - A List of Strings or char[] or any other toString()-able list representing the words to include
ignoreCase - if true, all words are lower cased first
Returns:
A Set (CharArraySet) containing the words

incrementToken

public final boolean incrementToken()
                             throws IOException
Returns the next input Token whose term() is in the include word list.

Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

getEnablePositionIncrementsVersionDefault

public static boolean getEnablePositionIncrementsVersionDefault(org.apache.lucene.util.Version matchVersion)
Returns the version-dependent default for enablePositionIncrements. Analyzers that embed this filter can use this method when creating it. For a matchVersion prior to 2.9, this returns false. For 2.9 or later, it returns true.


getEnablePositionIncrements

public boolean getEnablePositionIncrements()
See Also:
setEnablePositionIncrements(boolean)

setEnablePositionIncrements

public void setEnablePositionIncrements(boolean enable)
If true, this filter will preserve positions of the incoming tokens (i.e., accumulate and set position increments of the omitted tokens). Generally, true is best as it does not lose information (positions of the original tokens) during indexing.

When enabled, each time a token is omitted, the position increment of the next emitted token is increased to account for it.

NOTE: be sure to also set QueryParser.setEnablePositionIncrements(boolean) if you use QueryParser to create queries.
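
The effect of enablePositionIncrements on an include filter can be illustrated with a small model. This is a sketch under stated assumptions, not the class's actual code: Tok and filter are hypothetical names, tokens are plain objects rather than Lucene attributes, and the accumulation rule is the one described above (increments of omitted tokens are added to the next emitted token).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PositionIncrementSketch {

    /** Hypothetical token: a term plus its position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Keeps only tokens whose term is in includeWords. When
     * enablePositionIncrements is true, the increments of omitted tokens
     * are accumulated and added to the next kept token.
     */
    static List<Tok> filter(List<Tok> input,
                            Set<String> includeWords,
                            boolean enablePositionIncrements) {
        List<Tok> kept = new ArrayList<Tok>();
        int skippedPositions = 0;
        for (Tok tok : input) {
            if (includeWords.contains(tok.term)) {
                int increment = enablePositionIncrements
                        ? tok.positionIncrement + skippedPositions
                        : tok.positionIncrement;
                kept.add(new Tok(tok.term, increment));
                skippedPositions = 0; // gap has been recorded (or dropped)
            } else {
                skippedPositions += tok.positionIncrement;
            }
        }
        return kept;
    }
}
```

For the input the(+1) jazz(+1) and(+1) blues(+1) rock(+1) with include words {jazz, rock}, this model yields jazz(+2) rock(+3) when enabled and jazz(+1) rock(+1) when disabled. In the first case the original positions 2 and 5 can still be recovered, which is why true is generally the better choice for phrase and proximity queries.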