Class ICUTokenizer
- java.lang.Object
- org.apache.lucene.util.AttributeSource
- org.apache.lucene.analysis.TokenStream
- org.apache.lucene.analysis.Tokenizer
- org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
public final class ICUTokenizer extends org.apache.lucene.analysis.Tokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the ICUTokenizerConfig.
- See Also:
ICUTokenizerConfig
- WARNING: This API is experimental and might change in incompatible ways in the next release.
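The two-stage behavior described above (split at script boundaries, then find word breaks inside each run) can be illustrated with JDK classes only. This is a minimal sketch, not ICUTokenizer's implementation: it assumes `java.text.BreakIterator` as a stand-in for the ICU BreakIterator, uses `Character.UnicodeScript` for script detection, and the class and method names (`SegmentationSketch`, `scriptRuns`, `words`, `tokenize`) are hypothetical. The JDK word iterator only approximates UAX #29, so results may differ from ICU for some scripts.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or ICU.
public class SegmentationSketch {

    /** Splits text into runs that each contain a single Unicode script. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        Character.UnicodeScript current = null;
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // COMMON and INHERITED characters (spaces, punctuation, combining
            // marks) stay in the run they appear in instead of starting a new one.
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i));
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    /** Finds word breaks in one run, keeping only pieces with a letter or digit. */
    static List<String> words(String run) {
        List<String> out = new ArrayList<>();
        // Assumption: the JDK word iterator stands in for the ICU BreakIterator.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(run);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = run.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    /** Script runs first, then word breaks inside each run. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            tokens.addAll(words(run));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world"));
        System.out.println(tokenize("abc\u0430\u0431\u0432")); // Latin followed by Cyrillic
    }
}
```

In this sketch the script split is what separates the adjacent Latin and Cyrillic letters in the second call; the word iterator alone would treat them as one unbroken run of letters.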
Constructor Summary
Constructors
Constructor and Description
ICUTokenizer(Reader input)
Construct a new ICUTokenizer that breaks text into words from the given Reader.
ICUTokenizer(Reader input, ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
Method Summary
Modifier and Type, Method
void end()
boolean incrementToken()
void reset()
void reset(Reader input)
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Constructor Detail
ICUTokenizer
public ICUTokenizer(Reader input)
Construct a new ICUTokenizer that breaks text into words from the given Reader. The default script-specific handling is used.
- Parameters:
input - Reader containing text to tokenize.
- See Also:
DefaultICUTokenizerConfig
ICUTokenizer
public ICUTokenizer(Reader input, ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
- Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.
Method Detail
incrementToken
public boolean incrementToken() throws IOException
- Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset() throws IOException
- Overrides:
reset in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset(Reader input) throws IOException
- Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
- Throws:
IOException
end
public void end() throws IOException
- Overrides:
end in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException