data matching multiple columns with fuzzy matching criteria

I'm doing a similar project to this: Data matching algorithm

Where, I have one data frame (dataset1) with customer details, without a special unique ID, which is then matched against data frame 2 (dataset2) which has the same fields along with a special unique ID.

Example: Dataset1 columns include:

Title, Last name, First name, Middle/other name, address, DOB, sex, health care number

dataset2 columns include:

Unique ID, Title, Last name, First name, Middle/other name, address, DOB, sex, health care number

My plan:

My data set 1 is ~500,000 rows w/ 11 or so columns - the row count can be split if required. Data set 2 is ~2,000,000

Based on matching criteria from the business eg. >5 columns must match, and potentially a fuzzy component eg names that are hyphenated (inc. other special characters) in one data set but not in the other.

Based on the approximate match, this ID, and the fields from data set 2 is copied into data set 1 in adjacent columns, along with an overall match %.

Then overall match % ranges can be extracted and inspected by the business.

I plan on using Panda with Fuzzywuzzy. I'm having trouble planning my technical approach on the logic of my code.

SHould I look through the first field for matches, then filter down on the next column, and so on? I think so, but how does the loop operate moving across the columns in the two comparative data sets? do I need to write each operation of comparison back into the data set for storage or can it be kept else where?

the alternative that might work, is that I concatenate all the columns into a single column, and those are matched, and returned with the matched ID, other fields and the matching %.

I'm looking for the best direction to head into, along with general logic on the best way to do it.

1 answer

  • answered 2018-03-21 03:36 emican

    In general: try, observe, adjust and repeat.

    It may be worth exploring any small adjustments to standardize and make data as consistent as possible before using Fuzzywuzzy.

    • Some data sets could use 555-555-5555, 0, or - for an empty phone number or Jan 1, 1970 for blank DOB. Dataframe.replace({column:{to_replace:replace_val}) can help clean those up.

    • Python's dateutil.parser is helpful with standardizing a variety of date formats.

    • Try wrapping the fuzz ratio and removing punctuation from last name before analyzing.

    • Addresses: Pandas.column.str.upper() can be used to apply consistent case. Try without punctuation, too.

    • Concatenating all fields may be a good first pass for exact matches. 2nd pass could evaluate less important fields like Sex and Title for missing data.

    In general, you may want to approach in layers, remove direct matches followed by the most certain matches, reevaluate until the amount of effort required for the next wave approaches what you consider diminishing returns for this project.

    If you approach by column, you may want to remove the certain matches then start with last name and then create a list of choices of the next most deterministic column and use fuzzywuzzy's process.extract to assist with the decision tree. There may also be a better approach that I'm not aware of!