Class MatchService


  • public class MatchService
    extends Object
    Entry point for fuzzy matching. This class provides different ways to accept Documents, covering three main use cases:

    1. De-duplication of data: for a given list of documents, it finds the duplicates.
    2. Checking a new record for duplicates: for a new Document, it checks whether a duplicate is present in the existing list.
    3. Checking bulk inserts for duplicates: similar to 2, but a list of new Documents is checked against the existing list.

    Equivalent methods aggregate the results in different formats: keyed by Document, keyed by Document Id, or (for de-duplication) as groups of related matches.
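
    The three use cases differ mainly in which records are compared with which. The sketch below illustrates only those call shapes. It is a toy stand-in, not MatchService: the class UseCaseShapes, its methods, and the trim-and-ignore-case rule are hypothetical, and plain strings stand in for Documents. Whether the library reports a match in both directions, as this sketch does, is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for illustration only; it does not use or reproduce MatchService.
class UseCaseShapes {

    // Hypothetical match rule: equal after trimming and ignoring case.
    static boolean similar(String a, String b) {
        return a.trim().equalsIgnoreCase(b.trim());
    }

    // Use case 3: each new record is compared with every existing record.
    static Map<String, List<String>> checkBulk(List<String> incoming, List<String> existing) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (String candidate : incoming) {
            for (String known : existing) {
                if (similar(candidate, known)) {
                    matches.computeIfAbsent(candidate, k -> new ArrayList<>()).add(known);
                }
            }
        }
        return matches;
    }

    // Use case 2: a single new record is the one-element form of use case 3.
    static Map<String, List<String>> checkOne(String candidate, List<String> existing) {
        return checkBulk(Collections.singletonList(candidate), existing);
    }

    // Use case 1: every record is compared with every other record in the same list.
    static Map<String, List<String>> deduplicate(List<String> records) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = 0; j < records.size(); j++) {
                if (i != j && similar(records.get(i), records.get(j))) {
                    matches.computeIfAbsent(records.get(i), k -> new ArrayList<>())
                           .add(records.get(j));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("Steven Wilson", "steven wilson ", "John Doe");
        System.out.println(deduplicate(existing));
        System.out.println(checkOne(" JOHN DOE", existing));
        System.out.println(checkBulk(Arrays.asList("Jane Roe", "john doe"), existing));
    }
}
```

    In this sketch, records without a match do not appear as keys in the result.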

    • Constructor Detail

      • MatchService

        public MatchService()
    • Method Detail

      • applyMatch

        public Map<Document,​List<Match<Document>>> applyMatch​(List<Document> documents)
        Use this for de-duplication of data: for a given list of documents, it finds the duplicates. Results are aggregated by Document.
        Parameters:
        documents - the list of documents to match against each other
        Returns:
        a map from each document to the list of its matches
      • applyMatch

        public Map<Document,​List<Match<Document>>> applyMatch​(List<Document> documents,
                                                                    List<Document> matchWith)
        Use this to check bulk inserts for duplicates: a list of new Documents is checked against the existing list. Results are aggregated by Document.
        Parameters:
        documents - the list of new documents to check
        matchWith - the list of existing documents to match against
        Returns:
        a map from each document to the list of its matches
      • applyMatch

        public Map<Document,​List<Match<Document>>> applyMatch​(Document document,
                                                                    List<Document> matchWith)
        Use this to check a new record for duplicates: it checks whether a new Document duplicates any entry in the existing list. Results are aggregated by Document.
        Parameters:
        document - the new document to check
        matchWith - the list of existing documents to match against
        Returns:
        a map from each document to the list of its matches
      • applyMatchByDocId

        public Map<String,​List<Match<Document>>> applyMatchByDocId​(Document document,
                                                                         List<Document> matchWith)
        Use this to check a new record for duplicates: it checks whether a new Document duplicates any entry in the existing list. Results are aggregated by Document Id.
        Parameters:
        document - the new document to check
        matchWith - the list of existing documents to match against
        Returns:
        a map from each document id to the list of its matches
      • applyMatchByDocId

        public Map<String,​List<Match<Document>>> applyMatchByDocId​(List<Document> documents)
        Use this for de-duplication of data: for a given list of documents, it finds the duplicates. Results are aggregated by Document Id.
        Parameters:
        documents - the list of documents to match against each other
        Returns:
        a map from each document id to the list of its matches
      • applyMatchByDocId

        public Map<String,​List<Match<Document>>> applyMatchByDocId​(List<Document> documents,
                                                                         List<Document> matchWith)
        Use this to check bulk inserts for duplicates: a list of new Documents is checked against the existing list. Results are aggregated by Document Id.
        Parameters:
        documents - the list of new documents to check
        matchWith - the list of existing documents to match against
        Returns:
        a map from each document id to the list of its matches
      • applyMatchByGroups

        public Set<Set<Match<Document>>> applyMatchByGroups​(List<Document> documents)
        Use this for de-duplication of data: for a given list of documents, it finds the duplicates. Results are aggregated into groups of related matches.
        Parameters:
        documents - the list of documents to match against each other
        Returns:
        a set containing the groups of all related matches. If A matches B and B matches C, then A, B, and C are grouped together.
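
        The grouping rule described above (A matches B and B matches C, so all three share a group) is a transitive closure over pairwise matches. The sketch below shows one common way to compute such groups, a union-find structure over document ids. It is an assumption for illustration, not the library's implementation: the class TransitiveGroups and its methods are hypothetical, and it returns sets of ids rather than Set<Set<Match<Document>>>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of MatchService: groups ids by the
// transitive closure of pairwise matches using union-find.
class TransitiveGroups {

    // Returns the representative of id's group and compresses the path to it.
    private static String find(Map<String, String> parent, String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Each pair is a two-element array {idA, idB} meaning "A matches B".
    static Set<Set<String>> group(List<String[]> matchedPairs) {
        Map<String, String> parent = new HashMap<>();
        for (String[] pair : matchedPairs) {
            String rootA = find(parent, pair[0]);
            String rootB = find(parent, pair[1]);
            parent.put(rootA, rootB); // merge the two groups
        }
        // Collect every id under its final representative.
        Map<String, Set<String>> byRoot = new HashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            String root = find(parent, id);
            byRoot.computeIfAbsent(root, k -> new HashSet<>()).add(id);
        }
        return new HashSet<>(byRoot.values());
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"A", "B"},
            new String[] {"B", "C"},
            new String[] {"D", "E"});
        // Two groups result: one with A, B, C and one with D, E.
        System.out.println(group(pairs));
    }
}
```

        A never matches C directly in this input, yet both land in the same group because each matches B.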