Netrics Matching Engine
by David M. Raab
DM News
June, 2007
Most direct marketers think of data matching in terms of
merge/purge: a way to identify and remove duplicate names across multiple lists. But merge/purge is rarely a concern in the
larger world of data processing.  There, matching
is a component of customer data integration (identifying data in different
systems that belong to the same customer) and master data management (consolidating
data relating to all kinds of entities).
Matching is also part of search applications that help users find
people, products, documents, locations and other entities even when they don’t
have complete or fully accurate information.
These are complex applications with many moving parts:
multi-table data structures, relationship hierarchies, data acquisition,
indexing, ranking and display. But matching
remains a critical core function.
The specific purpose of matching is to find records that
refer to the same entity, even though the records themselves are different. In a strict sense, matching involves direct
comparisons of data strings. But in the
real world, this is often supplemented by external reference data such as a
list of all known products or all the names used by a business. This external data often allows connections
that could never be inferred from strings alone, such as the fact that the John
Jones who used to live in
In practice, parsing and standardization based on
external knowledge are critical to successful name and address matching. But even the most sophisticated
knowledge-based processing cannot remove all errors in a set of data. In fact, standardization and parsing can
introduce errors of their own. To make
matters worse, external knowledge may not be available once you move beyond
well-understood structures like mailing addresses. So, in the end, there is always a need to
compare two strings and decide whether they are similar enough to call them a
match.
What differentiates matching engines is how they make
this comparison. Simple matching systems
often create a “match key” by extracting a few significant characters (say, first
name initial, first three consonants in the last name, house number, city and
state) and allowing a match if these are the same. Other systems use phonetic standardization
such as Soundex to compensate for spelling errors. Some allow a match if strings have no more
than a specified number or percentage of differences among the characters. Still others apply statistical techniques
that take into account not only the similarity of the strings, but how common they
are, so a common name like David Jones would not be considered a likely match for David
James, while an unusual name like Zydrunas Ilgauskas might match with Sid Iglakis. Often the systems assign separate match
scores for different elements and then use weights or rules to assign a match score
for the record as a whole.
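Two of the simpler techniques above can be sketched in a few lines of Python. This is an illustration only, not any vendor's implementation: the function names, the key layout, the choice to treat "y" as a consonant, and the 20 percent tolerance are all assumptions made for the example.

```python
import re

def match_key(first, last, house_number, city, state):
    """Build a simple match key: first-name initial, first three
    consonants of the last name, house number, city and state.
    (Hypothetical layout; 'y' is treated as a consonant here.)"""
    consonants = re.sub(r"[^a-z]|[aeiou]", "", last.lower())[:3]
    return "|".join([
        first.strip().lower()[:1],
        consonants,
        str(house_number).strip(),
        city.strip().lower(),
        state.strip().lower(),
    ])

def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def within_tolerance(a, b, max_fraction=0.2):
    """Allow a match if the share of differing characters is small.
    The 0.2 default is an arbitrary value chosen for illustration."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a.lower(), b.lower()) / longest <= max_fraction
```

With these definitions, "David Jones" and "Dave Jonas" at the same house number, city and state produce the same key, which shows both the appeal of match keys and their tendency to over-match. The character-difference test, by contrast, rejects "Jones" against "James" because two of five characters differ.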
Netrics Matching Engine (Netrics, 609-683-4002, www.netrics.com) applies a mathematical
technique called “bipartite graph matching” to measure the similarity of
strings. The general idea is to mimic
human decisions by finding similar sequences of letters, even if they occur at
different locations within two strings.
This can compensate for data entry errors and deal with information that
has not been parsed into separate fields.
It also means the method can be applied to problems other than name and
address matching. Netrics says its
approach is more accurate than simpler methods such as matchkeys and Soundex,
and more efficient than character-difference comparisons.
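The Netrics algorithm itself is proprietary and is not described in detail here, so the following Python sketch is only a toy illustration of the general bipartite idea, not the vendor's method. It assumes the two sides of the graph are the words of each string (rather than letter sequences), uses the standard library's difflib ratio as the edge weight, and finds the best pairing by brute force, which is practical only for short strings. The function names are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import permutations

def token_similarity(a, b):
    """Edge weight between two words: difflib's ratio, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

def bipartite_similarity(left, right):
    """Score two strings by pairing each word on one side with a
    distinct word on the other so that total similarity is maximized.
    Word order is ignored, so transposed elements still line up."""
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    if not left_tokens or not right_tokens:
        return 0.0
    # Assign from the shorter list into the longer one.
    if len(left_tokens) > len(right_tokens):
        left_tokens, right_tokens = right_tokens, left_tokens
    best = 0.0
    # Brute force over every possible pairing (factorial cost).
    for candidate in permutations(right_tokens, len(left_tokens)):
        total = sum(token_similarity(a, b)
                    for a, b in zip(left_tokens, candidate))
        best = max(best, total)
    # Divide by the longer word count so unmatched words lower the score.
    return best / len(right_tokens)
```

Because the pairing ignores position, "John Jones" and "Jones John" score as identical, while "John Jones" against "Jon Jones" scores far higher than against "Mary Smith". A production system would replace the brute-force search with an efficient assignment algorithm.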
Like other matching engines, the Netrics engine returns a
score that shows the similarity of the strings it compares. The system can also highlight matching blocks
of text, making it easier for people to review why the system found a
similarity.
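To give a sense of what highlighting matched blocks involves, here is a minimal Python sketch built on the standard library's difflib. It is not the Netrics display logic; the function name, the square-bracket markers and the two-character minimum block length are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def highlight_matches(a, b, min_length=2):
    """Return string a with the blocks it shares with b wrapped in
    square brackets. Blocks shorter than min_length are left unmarked."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pieces = []
    cursor = 0
    # get_matching_blocks() yields (a, b, size) triples in order of
    # position in a, ending with a zero-length sentinel block.
    for block in matcher.get_matching_blocks():
        if block.size < min_length:
            continue
        pieces.append(a[cursor:block.a])  # unmatched text before the block
        pieces.append("[" + a[block.a:block.a + block.size] + "]")
        cursor = block.a + block.size
    pieces.append(a[cursor:])  # unmatched tail
    return "".join(pieces)
```

A reviewer looking at the bracketed output can see at a glance which parts of a record drove the similarity score and which parts were ignored.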
Netrics also provides a Decision Engine that can use similarity
scores to decide whether a pair of records is considered a match. The Decision Engine starts with examples of
known matches and non-matches. With
name and address records, these would typically be parsed into separate
elements, although they could also be unparsed text blocks. The sample records are run through the
Matching Engine and then the Decision Engine, which infers the decision rules
(basically, weights and cut-off ranges for element similarity scores) that
distinguish matches from non-matches. The
system automatically adjusts its rules until its own decisions are acceptably
consistent with the “correct” answers provided as part of the input. Users can provide additional examples of
particular types of matches if the system performs poorly at identifying them. A couple thousand sample pairs are typically
required for training. The Netrics
approach is considerably easier than having users specify the rules explicitly.
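The training loop can be pictured with a small Python sketch. Netrics does not disclose how its Decision Engine learns, so this example substitutes a textbook perceptron as a stand-in: it adjusts element weights and a single cut-off until its decisions agree with the labeled sample pairs. The function names, the learning rate and the sample scores are all invented for illustration.

```python
def train_decision_rule(samples, labels, max_epochs=1000, rate=0.1):
    """Learn one weight per element score plus a cut-off.
    samples: list of tuples of element similarity scores (0.0 to 1.0).
    labels:  list of booleans, True for a known match.
    Stops early once every sample is classified correctly."""
    weights = [0.0] * len(samples[0])
    cutoff = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for scores, label in zip(samples, labels):
            total = sum(w * s for w, s in zip(weights, scores))
            if (total > cutoff) != label:
                # Nudge weights toward the correct answer and move
                # the cut-off in the opposite direction.
                step = rate if label else -rate
                weights = [w + step * s for w, s in zip(weights, scores)]
                cutoff -= step
                mistakes += 1
        if mistakes == 0:
            break
    return weights, cutoff

def is_match(scores, weights, cutoff):
    """Apply the learned rule to one pair's element scores."""
    return sum(w * s for w, s in zip(weights, scores)) > cutoff

# Hypothetical (name similarity, address similarity) pairs.
SAMPLES = [(0.95, 0.90), (0.90, 0.85), (0.85, 0.95), (1.00, 0.80),
           (0.90, 0.20), (0.30, 0.90), (0.20, 0.10), (0.50, 0.40)]
LABELS = [True, True, True, True, False, False, False, False]

WEIGHTS, CUTOFF = train_decision_rule(SAMPLES, LABELS)
```

In this made-up sample, a pair counts as a match only when both the name and the address are similar, and the learned rule reproduces all eight labels. Real training sets are far larger and are not always cleanly separable, which is why the target is decisions that are acceptably, rather than perfectly, consistent with the examples.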
Netrics is used both to search for individual records in
a reference file and to run batch deduplication such as merge/purge.  It loads the data into system memory, which
allows quick performance. The system has
been tested on databases with hundreds of millions of records, returning as
many as 25 matches per second. The
product was released in 2000 and has more than 100 installations, mostly in
healthcare and government agencies. About
half the installations involve name and address matching, while the balance
involve other types of data. The
software is usually purchased through business partners, such as applications
providers and systems integrators, who incorporate it into products they
deliver to their clients. Pricing is
based on the number of processors in the host computer, starting at $50,000 for
a two-processor server.
* * *
David M. Raab is president of Client X Client, a
consulting and software firm specializing in customer value optimization. He can be reached at draab@clientxclient.com
and blogs at http://customerexperiencematrix.blogspot.com.