You might think matching and duplicate data detection is a straightforward exercise, and many times, it is. Want to see if a data file contains duplicate account numbers? That’s easy. A middle-schooler could write a routine that accomplishes this task. It involves no variables other than simple issues like formatting or leading zero suppression.
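As a rough illustration of how small that task is, here is a minimal sketch in Python. The function names and the sample values are made up for this example, and the normalization rules (dropping spaces, hyphens, and leading zeros) are an assumption about what "formatting" means for a given file.

```python
from collections import Counter

def normalize_account(raw: str) -> str:
    """Drop spaces, hyphens, and leading zeros so '00123' and '123' compare equal."""
    cleaned = raw.strip().replace("-", "").replace(" ", "")
    # An account number made only of zeros collapses to a single '0'.
    return cleaned.lstrip("0") or "0"

def find_duplicate_accounts(accounts):
    """Return the normalized account numbers that appear more than once."""
    counts = Counter(normalize_account(a) for a in accounts)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample data
sample = ["00123", "123", "45-67", "4567", "890"]
duplicates = find_duplicate_accounts(sample)
print(duplicates)
```

With the sample above, "00123" and "123" collapse to the same key, as do "45-67" and "4567".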
But what if you’re analyzing data files of customer contact information constructed in different time periods, by different organizations, or with different rules? This data may have names, customer behavior, history, and other contact information in various formats. The data values are probably inconsistent from file to file. A standard match routine will not recognize that James Arnold, Jim Arnold, JR Arnold, Ross Arnold, Junior Arnold, and Arnold James could be the same person.
For more sophisticated matching, you’ll need software developed by data scientists. Its routines may use deterministic matching, probabilistic matching, or both.
Deterministic matching seeks equal values for data fields from one data record to another. This may sound like the simple account number matching example we mentioned above, but sophisticated deterministic matching uses scoring to decide how strong a match it has made. The software will also account for the presence or absence of data values. This is more sophisticated than a simple byte-by-byte comparison.
One hundred percent positive matches occur when the values of all inspected data fields are the same in both data records. When data fields exist in both compared records but the values are different, the software will decide the records do not match exactly and will assign a weighted score reflecting the strength of the match.
The combination of matching and non-matching data fields ultimately controls the score for a pair of data records. Field-to-field scoring may use word or phrase similarity, noise word removal, and cross-field comparisons. Each field’s score is then weighted according to how much that field contributes to the overall record score.
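The field-level scoring described above might be sketched as follows. This is an illustration only, not any product's actual algorithm: the noise word list, the field names, and the weights are hypothetical, and the similarity measure is Python's standard `difflib.SequenceMatcher`, swapped in for whatever a commercial tool uses. Fields missing from either record are skipped and the remaining weights are renormalized, which is one possible way to account for absent values.

```python
from difflib import SequenceMatcher

# Hypothetical noise words and field weights, chosen only for illustration.
NOISE_WORDS = {"mr", "mrs", "ms", "dr", "inc", "llc", "the"}
FIELD_WEIGHTS = {"name": 0.5, "street": 0.3, "city": 0.2}

def clean(value: str) -> str:
    """Lowercase, strip basic punctuation, and drop noise words."""
    tokens = value.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(t for t in tokens if t not in NOISE_WORDS)

def field_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 for two cleaned field values."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio()

def record_score(rec_a: dict, rec_b: dict, weights=FIELD_WEIGHTS) -> float:
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    used_weight = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # value absent on one side: the field is left out of the score
        total += weight * field_similarity(a, b)
        used_weight += weight
    return total / used_weight if used_weight else 0.0

rec_1 = {"name": "Dr. James Arnold", "street": "12 Oak St", "city": "Austin"}
rec_2 = {"name": "James Arnold", "street": "12 Oak St", "city": "Austin"}
print(record_score(rec_1, rec_2))
```

Because "Dr." is removed as a noise word, the two sample records score as a full match even though the name strings differ byte for byte.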
Users decide the thresholds for taking action. If the score falls below the low threshold, the matching software will not merge the data. Scores above the high threshold may be considered positive matches and cause the data to be combined. Scores in between the high and low thresholds may be tagged for manual review.
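The three outcomes can be expressed in a few lines. The threshold values of 0.60 and 0.90 below are placeholders, not recommendations; in practice users tune them to their own data and tolerance for error.

```python
def classify(score: float, low: float = 0.60, high: float = 0.90) -> str:
    """Map a match score to an action using two user-chosen thresholds."""
    if score >= high:
        return "merge"       # treated as a positive match
    if score < low:
        return "no_match"    # records are left separate
    return "review"          # ambiguous: tag for manual review

for s in (0.95, 0.75, 0.30):
    print(s, classify(s))
```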
With probabilistic matching, the software computes a matching score that estimates the probability of a match. To use our example from above, matching “James Arnold” with “Jim Arnold” would yield a higher score than matching “James Arnold” with “Junior Arnold”. “Jim” is a common nickname for “James” but “Junior” is not. We can’t rule out the possibility without additional data, however. If social security numbers for James and Junior are different, the software won’t make the match. Conversely, if supporting information such as a matching birthdate, spouse name, or street address exists, the match score for “Junior Arnold” could rise.
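A toy version of this reasoning is sketched below. The nickname table, the field names, and the point values are all invented for illustration. Real probabilistic matchers derive their weights statistically from the data rather than using hand-picked numbers like these, so treat this as a picture of the logic, not of the math.

```python
# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"jim": "james", "jimmy": "james", "bill": "william"}

def canonical_first(name: str) -> str:
    """Map a known nickname to its formal name; otherwise return the name lowercased."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)

def match_score(a: dict, b: dict) -> float:
    """Hand-weighted score: conflicting SSNs veto, supporting fields add evidence."""
    if a.get("ssn") and b.get("ssn") and a["ssn"] != b["ssn"]:
        return 0.0  # different social security numbers: no match
    score = 0.0
    if a["last"].lower() == b["last"].lower():
        score += 0.4
    if canonical_first(a["first"]) == canonical_first(b["first"]):
        score += 0.3
    for field in ("birthdate", "spouse", "street"):
        if a.get(field) and a.get(field) == b.get(field):
            score += 0.1  # each agreeing supporting field raises the score
    return round(score, 2)

james = {"first": "James", "last": "Arnold", "birthdate": "1980-04-02"}
jim = {"first": "Jim", "last": "Arnold"}
junior = {"first": "Junior", "last": "Arnold"}
junior_same_birthdate = {"first": "Junior", "last": "Arnold", "birthdate": "1980-04-02"}

print(match_score(james, jim), match_score(james, junior),
      match_score(james, junior_same_birthdate))
```

In this sketch "Jim" outscores "Junior" on the name alone, while a shared birthdate lifts the "Junior" score above where it started.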
Great probabilistic matching software can even unscramble names. If a data record listed a misspelled customer name as “James Anrold” the software would recognize the likely character transposition and score the match to “James Arnold” appropriately. Probabilistic approaches can also remove noise words that can detract from the matching process, allowing the software to identify all the possible matches.
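One well-known way to make a transposition like "Anrold" count as a single error is the optimal string alignment distance, a restricted form of Damerau-Levenshtein distance. The sketch below is a plain implementation of that technique, offered as an example of the idea rather than a claim about what any particular product does. The `name_similarity` helper and its scaling are assumptions made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions,
    and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or exact match
            )
            # Adjacent transposition, e.g. "nr" versus "rn"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def name_similarity(a: str, b: str) -> float:
    """Scale the distance to a 0.0-1.0 similarity, ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - osa_distance(a, b) / longest

print(osa_distance("James Anrold", "James Arnold"))
print(round(name_similarity("James Anrold", "James Arnold"), 3))
```

A plain edit distance without the transposition step would count the swapped "n" and "r" as two separate errors and score the pair lower.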
To be most effective, the probabilistic method considers many data fields. The more pieces of data the software compares, the more accurate the results tend to be. Probabilistic matching is sometimes referred to as “fuzzy matching” because it relies on educated guesses rather than exact matches. A scoring system helps software avoid matching records where the ambiguity is too high.
Great Matching Takes Both
In most complex matching scenarios, data scientists combine deterministic and probabilistic matching to make data merging decisions. The two methods complement one another.
Laypersons may believe they should rely only on deterministic matching because it’s more of a sure thing, but probabilistic methods can add value to a deterministic-based task. Adding probabilistic methods expands the scope of the matching or consolidation project.
Consider a case where the primary match criterion is a data field well-suited to deterministic matching, such as an email address. If some data sources do not include email addresses for all records, deterministic-only routines might skip valuable information from that data source. Consequently, an organization might lose data such as internet browsing patterns or customer buying history, simply because the data records containing this information lacked an email address matching the master record.
By adding probabilistic matching, the software can compare several data elements, even if the data values vary, and match the records with an acceptable level of certainty. Important customer data will be retained, allowing the organization to use this information to enhance future customer experiences and run more effective marketing campaigns.
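The email scenario can be sketched as a two-stage check: try the deterministic key first, and fall back to a fuzzy comparison of other fields only when an email address is missing. Everything specific here is an assumption for illustration: the field names, the 0.85 threshold, the simple averaging of scores, and the use of Python's `difflib.SequenceMatcher` as the similarity measure.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_record(candidate: dict, master: dict, threshold: float = 0.85):
    """Return (method, matched). Email decides when both records have one;
    otherwise the other shared fields are compared fuzzily."""
    cand_email = (candidate.get("email") or "").strip().lower()
    mast_email = (master.get("email") or "").strip().lower()
    if cand_email and mast_email:
        return ("deterministic", cand_email == mast_email)

    # Fallback: average similarity over fields present in both records.
    fields = ("name", "street", "birthdate")  # hypothetical field names
    scores = [
        similarity(candidate[f], master[f])
        for f in fields
        if candidate.get(f) and master.get(f)
    ]
    if not scores:
        return ("insufficient_data", False)
    return ("probabilistic", sum(scores) / len(scores) >= threshold)

master = {"email": "jarnold@example.com", "name": "James Arnold",
          "street": "12 Oak Street", "birthdate": "1980-04-02"}
no_email = {"name": "James Anrold", "street": "12 Oak Street",
            "birthdate": "1980-04-02"}
print(match_record(no_email, master))
```

A deterministic-only routine would have skipped the second record for lack of an email address. Here it is matched on name, street, and birthdate despite the misspelled surname.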
Understanding how matching works is important in evaluating your data quality requirements and selecting the right tools for the job. Employing both deterministic and probabilistic matching methods, the best software generates results consistent with your requirements for data matching, duplicate recognition, and data consolidation.