Class CJKTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    @Deprecated
    public final class CJKTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    Deprecated.
    Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.
    CJKTokenizer is designed for Chinese, Japanese, and Korean languages.

    The tokens returned are overlapping pairs of adjacent characters (bigrams).

    Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
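
    The overlapping-bigram rule in the example above can be illustrated with a small standalone sketch. It does not use Lucene and is not the CJKTokenizer implementation. The class CjkBigramSketch and its methods are hypothetical names, and two points are assumptions made here: that a CJK character is one whose script is Han, Hiragana, Katakana, or Hangul, and that a lone CJK character is emitted as a single-character token.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Lucene.
class CjkBigramSketch {

    // Assumption: "CJK" means the Han, Hiragana, Katakana, and Hangul scripts.
    static boolean isCjk(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
            || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    // ASCII letters and digits plus '+', '#', and '_' form Latin tokens.
    static boolean isLatinTokenChar(int cp) {
        return cp < 0x80
            && (Character.isLetterOrDigit(cp) || cp == '+' || cp == '#' || cp == '_');
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            int start = i;
            if (isCjk(cps[i])) {
                // Consume the whole CJK run, then emit overlapping pairs.
                while (i < cps.length && isCjk(cps[i])) i++;
                if (i - start == 1) {
                    // Assumption: a lone CJK character becomes its own token.
                    tokens.add(new String(cps, start, 1));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(new String(cps, j, 2));
                    }
                }
            } else if (isLatinTokenChar(cps[i])) {
                // Consume the whole Latin run and emit it lowercased.
                while (i < cps.length && isLatinTokenChar(cps[i])) i++;
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                i++; // anything else acts as a separator
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Java" followed by four Han characters (U+4E2D U+6587 U+5206 U+8BCD).
        System.out.println(tokenize("Java \u4E2D\u6587\u5206\u8BCD"));
    }
}
```

    For the four-character CJK run in main, the sketch yields one Latin token followed by three overlapping bigrams, matching the "C1C2" "C2C3" "C3C4" pattern described above.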

    Additionally, the following is applied to Latin text (such as English):
    • Text is converted to lowercase.
    • Numeric digits, '+', '#', and '_' are tokenized as letters.
    • Full-width forms are converted to half-width forms.
    For more information on Asian-language (Chinese, Japanese, and Korean) text segmentation, search the web.
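
    The three Latin-text rules listed above (lowercasing, treating digits, '+', '#', and '_' as letters, and folding full-width forms to half-width) can be sketched in plain Java as follows. This is an illustrative approximation, not the class's actual code. LatinNormalizeSketch and its methods are hypothetical names. The width folding assumes the standard Unicode layout in which the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class LatinNormalizeSketch {

    // Map a full-width ASCII variant (U+FF01..U+FF5E) to its half-width form.
    static char foldWidth(char c) {
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        return c;
    }

    // ASCII letters and digits plus '+', '#', and '_' count as token characters.
    static boolean isTokenChar(char c) {
        return c < 0x80
            && (Character.isLetterOrDigit(c) || c == '+' || c == '#' || c == '_');
    }

    // Fold width, lowercase, and split on every non-token character.
    static List<String> latinTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(foldWidth(text.charAt(i)));
            if (isTokenChar(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(latinTokens("C++ and c_1 #Tag"));
        // Full-width "JAVA" written with escapes: U+FF2A U+FF21 U+FF36 U+FF21.
        System.out.println(latinTokens("\uFF2A\uFF21\uFF36\uFF21"));
    }
}
```

    Because '+', '#', and '_' are kept inside tokens, inputs such as "C++" and "c_1" survive as single tokens instead of being split at the punctuation.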
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
    • Field Summary

      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
    • Constructor Summary

      Constructors 
      Constructor Description
      CJKTokenizer(Reader in)
      Deprecated.
      Construct a token stream processing the given input.
      CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory, Reader in)
      Deprecated.
       
      CJKTokenizer(org.apache.lucene.util.AttributeSource source, Reader in)
      Deprecated.
       
    • Method Summary

      Modifier and Type Method Description
      void end()
      Deprecated.
       
      boolean incrementToken()
      Deprecated.
      Returns true for the next token in the stream, or false at EOS.
      void reset()
      Deprecated.
       
      void reset(Reader reader)
      Deprecated.
       
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        close, correctOffset
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
    • Constructor Detail

      • CJKTokenizer

        public CJKTokenizer(Reader in)
        Deprecated.
        Construct a token stream processing the given input.
        Parameters:
        in - I/O reader
      • CJKTokenizer

        public CJKTokenizer(org.apache.lucene.util.AttributeSource source,
                            Reader in)
        Deprecated.
      • CJKTokenizer

        public CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory,
                            Reader in)
        Deprecated.
    • Method Detail

      • incrementToken

        public boolean incrementToken()
                               throws IOException
        Deprecated.
        Returns true for the next token in the stream, or false at end of stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Unicode blocks.
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Returns:
        false for end of stream, true otherwise
        Throws:
        IOException - if a read error occurs while reading the input
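
        The return contract above (true while a token is available, false at end of stream) implies a simple consumer loop. The sketch below shows that loop against a hypothetical stand-in class, WhitespaceStream, which is not a Lucene class and only mimics the incrementToken() contract by splitting on whitespace. A real consumer would read term text through the stream's attributes, which this sketch replaces with a plain term() accessor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class TokenLoopSketch {

    // Stand-in for a tokenizer: yields whitespace-separated terms.
    static class WhitespaceStream {
        private final Reader input;
        private final StringBuilder term = new StringBuilder();

        WhitespaceStream(Reader input) {
            this.input = input;
        }

        // Text of the token produced by the last successful incrementToken().
        String term() {
            return term.toString();
        }

        // Returns true if a token was produced, false at end of stream.
        boolean incrementToken() throws IOException {
            term.setLength(0);
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (term.length() > 0) {
                        return true; // token finished at this separator
                    }
                } else {
                    term.append((char) c);
                }
            }
            return term.length() > 0; // last token, or nothing left
        }
    }

    // Consumer loop: keep calling incrementToken() until it returns false.
    static List<String> drain(String text) {
        List<String> terms = new ArrayList<>();
        WhitespaceStream stream = new WhitespaceStream(new StringReader(text));
        try {
            while (stream.incrementToken()) {
                terms.add(stream.term());
            }
        } catch (IOException e) {
            // A read error on the underlying Reader surfaces here.
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(drain("java  tokens here"));
    }
}
```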
      • end

        public final void end()
        Deprecated.
        Overrides:
        end in class org.apache.lucene.analysis.TokenStream
      • reset

        public void reset()
                   throws IOException
        Deprecated.
        Overrides:
        reset in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • reset

        public void reset(Reader reader)
                   throws IOException
        Deprecated.
        Overrides:
        reset in class org.apache.lucene.analysis.Tokenizer
        Throws:
        IOException