public final class Tokenizers extends Object
The created tokenizers are immutable and thread-safe provided all their components are also immutable and thread-safe.
| Modifier and Type | Method and Description |
|---|---|
| static Tokenizer | chain(List<Tokenizer> tokenizers) Chains tokenizers together. |
| static Tokenizer | chain(Tokenizer tokenizer, Tokenizer... tokenizers) Chains tokenizers together. |
| static Tokenizer | filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate) Constructs a new filtering tokenizer. |
| static Tokenizer | pattern(Pattern pattern) Returns a tokenizer that splits a string into tokens around the pattern as if calling pattern.split(input,-1). |
| static Tokenizer | pattern(String regex) Returns a tokenizer that splits a string into tokens around the pattern as if calling Pattern.compile(regex).split(input,-1). |
| static Tokenizer | qGram(int q) Returns a q-gram tokenizer for a variable q. |
| static Tokenizer | qGramWithFilter(int q) Returns a q-gram tokenizer for a variable q. The tokenizer will return an empty collection if the input is empty or shorter than q. |
| static Tokenizer | qGramWithPadding(int q) Returns a q-gram tokenizer for a variable q. |
| static Tokenizer | qGramWithPadding(int q, String padding) Returns a q-gram tokenizer for a variable q. |
| static Tokenizer | qGramWithPadding(int q, String startPadding, String endPadding) Returns a q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding. |
| static Tokenizer | transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function) Constructs a new transforming tokenizer. |
| static Tokenizer | whitespace() Returns a tokenizer that splits a string into tokens around whitespace. |

public static Tokenizer pattern(Pattern pattern)
Returns a tokenizer that splits a string into tokens around the pattern as if calling pattern.split(input,-1).
Parameters:
pattern - to split the string around

public static Tokenizer pattern(String regex)
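Returns a tokenizer that splits a string into tokens around the pattern as if calling Pattern.compile(regex).split(input,-1).
Parameters:
regex - to split the string around

public static Tokenizer qGram(int q)
Returns a q-gram tokenizer for a variable q. The tokenizer will return an empty collection if the input is empty. If the input is shorter than q, a collection containing the original input is returned.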
The tokenizer takes care to split the string on Unicode code points, not separating valid surrogate pairs.
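The behaviour described above can be illustrated with a minimal sketch that uses only the JDK. It is not the library's implementation; `QGramSketch` and `qGrams` are hypothetical names, and the rejection of `q < 1` is an assumption of the sketch. The window is advanced with `String.offsetByCodePoints`, so a surrogate pair is never cut in half.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class QGramSketch {

    // Hypothetical helper, not part of the library: splits the input into
    // overlapping q-grams, counting in code points rather than chars.
    static List<String> qGrams(String input, int q) {
        if (q < 1) {
            // Assumption of this sketch: q must be positive.
            throw new IllegalArgumentException("q must be at least 1");
        }
        if (input.isEmpty()) {
            return Collections.emptyList();
        }
        int length = input.codePointCount(0, input.length());
        if (length < q) {
            // Input shorter than q: return it unchanged as the only token.
            return Collections.singletonList(input);
        }
        List<String> tokens = new ArrayList<>(length - q + 1);
        int start = 0;
        for (int i = 0; i <= length - q; i++) {
            // The window ends q code points after its start.
            int end = input.offsetByCodePoints(start, q);
            tokens.add(input.substring(start, end));
            // Slide the window forward by one code point.
            start = input.offsetByCodePoints(start, 1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(qGrams("hello", 2)); // [he, el, ll, lo]
        System.out.println(qGrams("a", 2));     // [a]
        System.out.println(qGrams("", 2));      // []
    }
}
```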
Parameters:
q - size of the tokens

public static Tokenizer qGramWithFilter(int q)
Returns a q-gram tokenizer for a variable q. The tokenizer will return an empty collection if the input is empty or shorter than q.
The tokenizer takes care to split the string on Unicode code points, not separating valid surrogate pairs.
Parameters:
q - size of the tokens

public static Tokenizer qGramWithPadding(int q)
Returns a q-gram tokenizer for a variable q. The input is padded with q-1 special characters before being tokenized. Uses # as the default padding.
The tokenizer takes care to split the string on Unicode code points, not separating valid surrogate pairs.
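As a rough illustration of padding, the sketch below pads the input and then slides a window of q code points over the result. It is a JDK-only approximation, not the library's code. `PaddedQGramSketch` and `paddedQGrams` are hypothetical names, and the sketch assumes the q-1 padding characters are added at both the start and the end. How the library treats an empty input is not stated here, so the sketch makes no attempt to match it.

```java
import java.util.ArrayList;
import java.util.List;

final class PaddedQGramSketch {

    // Hypothetical helper: pads both ends with q-1 copies of the padding
    // string, then slides a window of q code points over the padded text.
    static List<String> paddedQGrams(String input, int q, String padding) {
        if (q < 1) {
            // Assumption of this sketch: q must be positive.
            throw new IllegalArgumentException("q must be at least 1");
        }
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < q - 1; i++) {
            builder.append(padding);
        }
        String pad = builder.toString();
        String padded = pad + input + pad;

        int length = padded.codePointCount(0, padded.length());
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i + q <= length; i++) {
            int end = padded.offsetByCodePoints(start, q);
            tokens.add(padded.substring(start, end));
            start = padded.offsetByCodePoints(start, 1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(paddedQGrams("ab", 2, "#")); // [#a, ab, b#]
        System.out.println(paddedQGrams("ab", 3, "#")); // [##a, #ab, ab#, b##]
    }
}
```

With padding, the first and last characters of the input appear in as many q-grams as the characters in the middle.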
Parameters:
q - size of the tokens

public static Tokenizer qGramWithPadding(int q, String padding)
Returns a q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding.
The tokenizer takes care to split the string on Unicode code points, not separating valid surrogate pairs.
Parameters:
q - size of the tokens
padding - padding to pad start and end of string with

public static Tokenizer qGramWithPadding(int q, String startPadding, String endPadding)
Returns a q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding.
The tokenizer takes care to split the string on Unicode code points, not separating valid surrogate pairs.
Parameters:
q - size of the tokens
startPadding - padding to pad start of string with
endPadding - padding to pad end of string with

public static Tokenizer whitespace()
Returns a tokenizer that splits a string into tokens around whitespace. To create a tokenizer that returns leading and trailing empty tokens, use Tokenizers.pattern("\\s+").
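The difference between the two splitting styles can be seen with plain `java.util.regex.Pattern`. This is a sketch only: `WhitespaceSplitSketch` and its methods are hypothetical names. The first method shows what `split(input, -1)` does; the second assumes, for illustration, that a whitespace-style tokenizer simply drops the empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class WhitespaceSplitSketch {

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Keeps leading and trailing empty tokens, as split(input, -1) does.
    static List<String> splitKeepingEmpty(String input) {
        return Arrays.asList(WHITESPACE.split(input, -1));
    }

    // Assumed whitespace-style behaviour: same split, empty tokens dropped.
    static List<String> splitDroppingEmpty(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : WHITESPACE.split(input, -1)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingEmpty(" a b "));  // [, a, b, ]
        System.out.println(splitDroppingEmpty(" a b ")); // [a, b]
    }
}
```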
public static Tokenizer transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function)
Constructs a new transforming tokenizer.
Parameters:
tokenizer - delegate tokenizer
function - to transform tokens

public static Tokenizer chain(List<Tokenizer> tokenizers)
Chains tokenizers together. If only a single tokenizer is provided, that tokenizer is returned.
Parameters:
tokenizers - a non-empty list of tokenizers

public static Tokenizer chain(Tokenizer tokenizer, Tokenizer... tokenizers)
Chains tokenizers together. If only a single tokenizer is provided, that tokenizer is returned.
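One plausible reading of chaining is that every token produced by one tokenizer is fed to the next one in order. The sketch below assumes that reading and models a tokenizer as a `Function<String, List<String>>` rather than the library's `Tokenizer` type. `ChainSketch` is a hypothetical name, and the sketch should not be taken as the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

final class ChainSketch {

    // Hypothetical stand-in for chaining: each tokenizer is applied to every
    // token produced by the previous one.
    static Function<String, List<String>> chain(
            List<Function<String, List<String>>> tokenizers) {
        if (tokenizers.isEmpty()) {
            throw new IllegalArgumentException("tokenizers must not be empty");
        }
        if (tokenizers.size() == 1) {
            // A single tokenizer is returned as-is.
            return tokenizers.get(0);
        }
        // Defensive copy so later changes to the caller's list have no effect.
        List<Function<String, List<String>>> copy = new ArrayList<>(tokenizers);
        return input -> {
            List<String> tokens = Collections.singletonList(input);
            for (Function<String, List<String>> tokenizer : copy) {
                List<String> next = new ArrayList<>();
                for (String token : tokens) {
                    next.addAll(tokenizer.apply(token));
                }
                tokens = next;
            }
            return tokens;
        };
    }

    public static void main(String[] args) {
        Function<String, List<String>> words = s -> Arrays.asList(s.split(" "));
        Function<String, List<String>> parts = s -> Arrays.asList(s.split("-"));
        Function<String, List<String>> chained = chain(Arrays.asList(words, parts));
        System.out.println(chained.apply("a-b c-d")); // [a, b, c, d]
    }
}
```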
Parameters:
tokenizer - the first tokenizer
tokenizers - the other tokenizers

public static Tokenizer filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate)
Constructs a new filtering tokenizer. Tokens that do not match the predicate are removed.
Parameters:
tokenizer - delegate tokenizer
predicate - for tokens to keep

Copyright © 2014–2016. All rights reserved.