comirva.web.indexing
Class HTMLAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
comirva.web.indexing.HTMLAnalyzer
- All Implemented Interfaces:
- Closeable
public class HTMLAnalyzer
- extends org.apache.lucene.analysis.Analyzer
A Lucene word analyzer tailored to HTML files. When a dictionary file is supplied, only terms in that dictionary are considered; otherwise all words are included.
Fields inherited from class org.apache.lucene.analysis.Analyzer
overridesTokenStreamMethod
Constructor Summary
HTMLAnalyzer()
Builds an analyzer with all words included.
HTMLAnalyzer(String dictionaryFile)
Builds an analyzer that only includes terms in dictionaryFile.
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, reusableTokenStream, setOverridesTokenStreamMethod, setPreviousTokenStream
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
- Default maximum allowed token length
- See Also:
- Constant Field Values
includeWords
public static Set<String> includeWords
HTMLAnalyzer
public HTMLAnalyzer()
- Builds an analyzer with all words included.
HTMLAnalyzer
public HTMLAnalyzer(String dictionaryFile)
- Builds an analyzer that only includes terms in dictionaryFile.
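The difference between the two constructors can be illustrated with a small, self-contained sketch. This is not the HTMLAnalyzer source and does not use Lucene: the class name DictionaryFilterSketch, the method terms, the tag-stripping regular expression, and the lowercasing step are all assumptions made for illustration. A null dictionary stands in for the no-argument constructor (all words included), and a non-null set stands in for the terms loaded from dictionaryFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not the HTMLAnalyzer implementation.
class DictionaryFilterSketch {

    // Returns the terms found in html. If dictionary is null every term is kept;
    // otherwise only terms contained in dictionary are kept.
    static List<String> terms(String html, Set<String> dictionary) {
        // Assumption: markup is dropped and text is lowercased before matching.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element after a leading separator
            }
            if (dictionary == null || dictionary.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<p>Jazz and <b>Blues</b> guitar</p>";
        System.out.println(terms(html, null));                      // all words included
        System.out.println(terms(html, Set.of("jazz", "guitar")));  // dictionary terms only
    }
}
```

With the real class, the corresponding choice would be made once at construction time, by calling either HTMLAnalyzer() or HTMLAnalyzer(String dictionaryFile).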
tokenStream
public org.apache.lucene.analysis.TokenStream tokenStream(String fieldName, Reader reader)
- Specified by:
- tokenStream in class org.apache.lucene.analysis.Analyzer
setMaxTokenLength
public void setMaxTokenLength(int length)
- Sets the maximum allowed token length. Any token longer than this length is discarded. The setting takes effect only the next time tokenStream or reusableTokenStream is called.
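The length rule described for setMaxTokenLength can be sketched in plain Java. This is a hypothetical stand-in, not the Lucene or HTMLAnalyzer code: the class MaxTokenLengthSketch, its tokens method, and the whitespace tokenization are assumptions. It shows only that tokens exceeding the limit are dropped, that a token exactly at the limit is kept, and that a changed limit applies to the next tokenization call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the HTMLAnalyzer implementation.
class MaxTokenLengthSketch {
    private int maxTokenLength;

    MaxTokenLengthSketch(int maxTokenLength) {
        this.maxTokenLength = maxTokenLength;
    }

    void setMaxTokenLength(int length) {
        this.maxTokenLength = length;
    }

    int getMaxTokenLength() {
        return maxTokenLength;
    }

    // Splits text on whitespace and discards tokens longer than the current limit.
    List<String> tokens(String text) {
        int limit = maxTokenLength; // read once per call, so a later change affects the next call
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && token.length() <= limit) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        MaxTokenLengthSketch sketch = new MaxTokenLengthSketch(5);
        System.out.println(sketch.tokens("rock progressive metal"));
        sketch.setMaxTokenLength(11);
        System.out.println(sketch.tokens("rock progressive metal"));
    }
}
```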
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
- setMaxTokenLength(int)