T - type of the token
public final class SimonWhite<T> extends Object implements MultisetMetric<T>, MultisetDistance<T>
similarity(a,b) = 2 * ∣a ∩ b∣ / (∣a∣ + ∣b∣)
distance(a,b) = 1 - similarity(a,b)
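The two formulas can be illustrated with a minimal plain-JDK sketch. It is not the library's implementation: the class name `SimonWhiteSketch`, the use of lists as stand-ins for multisets, the rule that the intersection keeps the minimum count of each token, and the value returned for two empty inputs are all assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the documented library.
final class SimonWhiteSketch {

    // Count how often each token occurs, i.e. build a simple multiset.
    static <T> Map<T, Integer> counts(List<T> tokens) {
        Map<T, Integer> counts = new HashMap<>();
        for (T token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // similarity(a,b) = 2 * |a ∩ b| / (|a| + |b|)
    static <T> float similarity(List<T> a, List<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f; // assumption: two empty inputs are treated as identical
        }
        Map<T, Integer> countsA = counts(a);
        Map<T, Integer> countsB = counts(b);
        int intersection = 0;
        for (Map.Entry<T, Integer> entry : countsA.entrySet()) {
            // Assumption: the multiset intersection keeps the minimum count per token.
            intersection += Math.min(entry.getValue(),
                    countsB.getOrDefault(entry.getKey(), 0));
        }
        return 2.0f * intersection / (a.size() + b.size());
    }

    // distance(a,b) = 1 - similarity(a,b)
    static <T> float distance(List<T> a, List<T> b) {
        return 1.0f - similarity(a, b);
    }

    public static void main(String[] args) {
        // Character pairs of "FRANCE" and "FRENCH"; they share FR and NC.
        List<String> france = Arrays.asList("FR", "RA", "AN", "NC", "CE");
        List<String> french = Arrays.asList("FR", "RE", "EN", "NC", "CH");
        // 2 * 2 / (5 + 5) = 0.4
        System.out.println(similarity(france, french));
        System.out.println(distance(france, french));
    }
}
```

With five pairs on each side and two in common, the similarity works out to 2 * 2 / 10 = 0.4 and the distance to about 0.6.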
Implementation based on the ideas as outlined in How to Strike a Match by Simon White. To create the described metric use:
import static org.simmetrics.StringMetricBuilder.with;
...
with(new SimonWhite<String>())
.tokenize(Tokenizers.qGram(2))
.build();
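To show what tokenizing into q-grams of size 2 roughly means, here is a small plain-JDK sketch that splits a string into overlapping character pairs. The class `QGramSketch` and its `qGrams` method are hypothetical names for this illustration and do not reproduce `Tokenizers.qGram`; in particular, returning an empty list for inputs shorter than `q` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the documented library.
final class QGramSketch {

    // Split the input into overlapping substrings of length q.
    static List<String> qGrams(String input, int q) {
        if (q <= 0) {
            throw new IllegalArgumentException("q must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Assumption: inputs shorter than q yield no tokens at all.
        for (int i = 0; i + q <= input.length(); i++) {
            grams.add(input.substring(i, i + q));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "hello" yields the pairs he, el, ll, lo.
        System.out.println(qGrams("hello", 2));
    }
}
```

The resulting token list is what a multiset metric such as this one would then compare.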
The Dice similarity coefficient uses the same formula as Simon White, but
it does not take the number of occurrences (cardinality) of an entry into account.
E.g. [hello, world] and [hello, world, hello, world] would be
identical when compared with Dice but are dissimilar when Simon White is
used.
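The difference can be checked by hand with a small plain-JDK sketch. Both methods below are hypothetical stand-ins written for this illustration (they are not the library's `Dice` or `SimonWhite` classes), and the handling of two empty inputs is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of the documented library.
final class DiceVersusSimonWhite {

    // Set-based coefficient: duplicates are discarded before comparing.
    static float diceLike(List<String> a, List<String> b) {
        Set<String> setA = new HashSet<>(a);
        Set<String> setB = new HashSet<>(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0f; // assumption: two empty inputs are treated as identical
        }
        Set<String> common = new HashSet<>(setA);
        common.retainAll(setB);
        return 2.0f * common.size() / (setA.size() + setB.size());
    }

    // Multiset-based coefficient: every occurrence of a token counts.
    static float simonWhiteLike(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f; // assumption: two empty inputs are treated as identical
        }
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : b) {
            remaining.merge(token, 1, Integer::sum);
        }
        int common = 0;
        for (String token : a) {
            Integer left = remaining.get(token);
            if (left != null && left > 0) {
                remaining.put(token, left - 1); // each occurrence in b matches once
                common++;
            }
        }
        return 2.0f * common / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("hello", "world");
        List<String> b = Arrays.asList("hello", "world", "hello", "world");
        System.out.println(diceLike(a, b));       // sets are equal: 2 * 2 / (2 + 2)
        System.out.println(simonWhiteLike(a, b)); // 2 * 2 / (2 + 4)
    }
}
```

For the example lists, the set-based variant gives 2 * 2 / 4 = 1.0, while the multiset-based variant gives 2 * 2 / 6, roughly 0.67.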
This class is immutable and thread-safe.
See also: Dice, Wikipedia - Sørensen–Dice coefficient

| Constructor and Description |
|---|
| SimonWhite() |
| Modifier and Type | Method and Description |
|---|---|
| float | compare(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b) Measures the similarity between multisets a and b. |
| float | distance(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b) Measures the distance between multisets a and b. |
| String | toString() |
public float compare(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b)
Description copied from interface: MultisetMetric
Results are undefined if a and b are based on different
equivalence relations (as HashMultiset and TreeMultiset
are).
public float distance(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b)
Description copied from interface: MultisetDistance
A value of 0.0 indicates that a and b are similar.
Results are undefined if a and b are based on different
equivalence relations (as HashMultiset and TreeMultiset
are).
Copyright © 2014–2016. All rights reserved.