public class WordSeparatorAnalyzer
extends Analyzer
Attempts to detect whether the text is CJK and, if it is, uses CJKTokenizer
to tokenize it. CJKTokenizer tokenizes based on bigrams, so a string like
"ABCD" is tokenized to ["A", "AB", "BC", "CD", "D"]. If the string is not CJK,
it is assumed to use standard Latin word separators. For Latin text, this
analyzer uses a slightly customized LetterTokenizer and passes the tokens
through StandardFilter and LowerCaseFilter. The LetterTokenizer is customized
to use the same word separators as ST-BTI.

Constructor and Description |
---|
WordSeparatorAnalyzer(): Create a new WordSeparatorAnalyzer that always tries to detect CJK. |
WordSeparatorAnalyzer(boolean detectCjk): Create a new WordSeparatorAnalyzer. |
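
The bigram behavior described in the class overview can be illustrated with a small, self-contained sketch. This is not the CJKTokenizer implementation; `BigramSketch` and `bigrams` are hypothetical names, and the code only reproduces the token sequence quoted above (a leading single character, every overlapping pair, and a trailing single character). It assumes the input contains no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper: emits the first character, every overlapping
    // two-character window, and the last character, matching the example
    // in the class description. Works on UTF-16 chars, so it assumes the
    // input has no surrogate pairs.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.isEmpty()) {
            return tokens;
        }
        // Leading single character.
        tokens.add(text.substring(0, 1));
        // Overlapping pairs: positions (0,1), (1,2), ...
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        // Trailing single character, only when it differs from the leading one.
        if (text.length() > 1) {
            tokens.add(text.substring(text.length() - 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // prints [A, AB, BC, CD, D]
    }
}
```

Overlapping pairs mean a query for any two adjacent characters can match without a dictionary-based word segmenter, at the cost of a larger index.
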
Modifier and Type | Method and Description |
---|---|
static java.lang.String | normalize(java.lang.String tokenizeString): Transforms to lowercase and replaces all word separators with spaces. |
static java.lang.String | removeDiacriticals(java.lang.String input): Removes all diacritical marks from the input. |
static java.util.List<java.lang.String> | tokenList(java.lang.String tokenizeString): Returns a list of tokens for a string. |
TokenStream | tokenStream(java.lang.String fieldName, java.io.Reader reader): Constructs a tokenizer that can tokenize CJK or Latin text. |
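
To make the CJK-versus-Latin split more concrete, here is a rough sketch of how such a decision and the Latin path might look. Everything in it is an assumption for illustration: `LatinTokenSketch`, `looksCjk`, and `latinTokens` are hypothetical names, the script check is one plausible detection heuristic (the page does not say how detection works), and `Character.isLetter` stands in for the ST-BTI separator set, which is not listed here.

```java
import java.util.ArrayList;
import java.util.List;

public class LatinTokenSketch {
    // Hypothetical detector: reports CJK if any code point belongs to a
    // Han, Hiragana, Katakana, or Hangul script. The real analyzer's
    // detection rule is not documented on this page.
    public static boolean looksCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA
                    || script == Character.UnicodeScript.HANGUL;
        });
    }

    // Hypothetical Latin path: tokens are maximal runs of letters,
    // lowercased. Assumes every non-letter character is a word separator.
    public static List<String> latinTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, World-Wide Web";
        if (!looksCjk(text)) {
            System.out.println(latinTokens(text)); // prints [hello, world, wide, web]
        }
    }
}
```
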
public WordSeparatorAnalyzer(boolean detectCjk)
detectCjk - If true, will attempt to detect and segment CJK. If false, assumes all text can be segmented using word separators.

public WordSeparatorAnalyzer()

public TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
fieldName - Ignored.
reader - A stream to tokenize. mark() and reset() support is not needed.
Returns a TokenStream that represents the tokenization of the data in reader.

public static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
public static java.lang.String normalize(java.lang.String tokenizeString)
public static java.lang.String removeDiacriticals(java.lang.String input)
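
The two string helpers are documented only by one-line summaries, so the following is a minimal sketch of one common way to get that behavior with the JDK alone, not the library's actual code. `NormalizeSketch` is a hypothetical class; diacritic removal is assumed to mean Unicode NFD decomposition followed by stripping combining marks, and a "word separator" is assumed to be any character that is neither a letter nor a digit, each replaced by a single space.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NormalizeSketch {
    // Assumed approach: decompose to NFD so accents become separate
    // combining marks, then delete every combining mark (\p{M}).
    public static String removeDiacriticals(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Assumed approach: lowercase, then replace each character that is
    // not a letter or digit with one space. The real separator set
    // (the one shared with ST-BTI) is not listed on this page.
    public static String normalize(String tokenizeString) {
        return tokenizeString
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeDiacriticals("caf\u00e9")); // prints cafe
        System.out.println(normalize("Hello,World"));        // prints hello world
    }
}
```

Note that characters without a canonical decomposition, such as "ø", pass through this sketch unchanged; whether the documented method handles them differently is not stated.
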