Class CompoundWordTokenFilterBase

  • All Implemented Interfaces:
    Closeable, AutoCloseable
    Direct Known Subclasses:
    DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter

    public abstract class CompoundWordTokenFilterBase
    extends TokenFilter
    Base class for decomposition token filters.

    You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    If you pass in a CharArraySet as the dictionary, it should be case-insensitive unless it contains only lowercased entries and you have a LowerCaseFilter before this filter in your analysis chain. For optimal performance, use the latter setup (a case-sensitive CharArraySet of lowercased entries behind a LowerCaseFilter), because this filter performs many dictionary lookups. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive sets.
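
The dictionary advice above can be illustrated with a minimal, library-free sketch. It does not use the Lucene API: the class `DecompoundSketch`, the method `subwords`, and its parameters are hypothetical names chosen for illustration, and a plain `HashSet` stands in for a case-sensitive CharArraySet. The idea is that when the dictionary holds only lowercased entries and each token is lowercased once up front (the role a LowerCaseFilter would play earlier in the chain), every substring lookup can be an exact, case-sensitive match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DecompoundSketch {

    /**
     * Hypothetical helper, not part of Lucene: returns every dictionary entry
     * that occurs as a substring of the token, in order of start position.
     * Assumes the dictionary contains only lowercased entries.
     */
    public static List<String> subwords(String token,
                                        Set<String> lowercasedDictionary,
                                        int minSubwordSize) {
        // Lowercase once, up front, instead of doing case-insensitive lookups.
        String lowered = token.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= lowered.length(); start++) {
            for (int end = start + minSubwordSize; end <= lowered.length(); end++) {
                String candidate = lowered.substring(start, end);
                // Exact (case-sensitive) lookup; many of these happen per token.
                if (lowercasedDictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
            new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [donau, dampf, schiff]
        System.out.println(subwords("Donaudampfschiff", dictionary, 3));
    }
}
```

The nested loops make the number of lookups grow quadratically with token length, which is why the cost of each individual lookup matters. In this sketch the case folding is paid once per token rather than once per lookup.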