I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib's).

Say one DataFrame has the following data:

``````df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

letter
one        a
too        b
three      c
fours      d
five       e
``````

Then I want to get the resulting DataFrame

``````       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
``````

## Solution 1

Similar to @locojay suggestion, you can apply `difflib`'s `get_close_matches` to `df2`'s index and then apply a `join`:

``````In [23]: import difflib

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]:
number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
``````

.

If these were columns, in the same vein you could apply to the column then `merge`:

``````df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
``````

# Using `fuzzywuzzy`

Since there are no examples with the `fuzzywuzzy` package, here's a function I wrote which will return all matches based on a threshold you can set as a user:

Example datframe

``````df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry
``````

Function for fuzzy matching

``````def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
:param df_1: the left table to join
:param df_2: the right table to join
:param key1: key column of the left table
:param key2: key column of the right table
:param threshold: how close the matches should be to return a match, based on Levenshtein distance
:param limit: the amount of matches that will get returned, these are sorted high to low
:return: dataframe with boths keys and matches
"""
s = df_2[key2].tolist()

m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m

m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2

return df_1
``````

Using our function on the dataframes: #1

``````from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry
``````

Using our function on the dataframes: #2

``````df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

Col1  matches
0  Microsoft  Mcrsoft
2     Amazon   Amason
3        IBM
``````

# Installation:

Pip

``````pip install fuzzywuzzy
``````

Anaconda

``````conda install -c conda-forge fuzzywuzzy
``````

## Solution 3

I have written a Python package which aims to solve this problem:

`pip install fuzzymatcher`

You can find the repo here and docs here.

Basic usage:

Given two dataframes `df_left` and `df_right`, which you want to fuzzy join, you can write the following:

``````from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
``````

Or if you just want to link on the closest match:

``````fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
``````

## Solution 4

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package:

``````def get_closest_match(x, list_strings):

best_match = None
highest_jw = 0

for current_string in list_strings:
current_score = jellyfish.jaro_winkler(x, current_string)

if(current_score > highest_jw):
highest_jw = current_score
best_match = current_string

return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)
``````

Output:

``````    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e
``````

## For a general approach: `fuzzy_merge `

For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses `difflib.get_close_matches` along with `merge` in order to mimic the functionality of pandas' `merge` but with fuzzy matching:

``````import difflib

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
df_other= df2.copy()
df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
for x in df_other[right_on]]
return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
matches = difflib.get_close_matches(x, other, cutoff=cutoff)
return matches[0] if matches else None
``````

Here are some use cases with two sample dataframes:

``````print(df1)

key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e
``````

With the above example, we'd get:

``````fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
``````

And we could do a left join with:

``````fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN
``````

For a right join, we'd have all non-matching keys in the left dataframe to `None`:

``````fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e
``````

Also note that `difflib.get_close_matches` will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in `df2` to say:

``````print(df2)

letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e
``````

We'd get an `index out of range` error:

``````df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
``````

IndexError: list index out of range

In order to solve this the above function `get_closest_match` will return the closest match by indexing the list returned by `difflib.get_close_matches` only if it actually contains any matches.

## Solution 6

http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...

I would just do a separate step and use difflib getclosest_matches to create a new column in one of the 2 dataframes and the merge/join on the fuzzy matched column

## Solution 7

I used Fuzzymatcher package and this worked well for me. Visit this link for more details on this.

use the below command to install

``````pip install fuzzymatcher
``````

Below is the sample Code (already submitted by RobinL above)

``````from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
``````

Errors you may get

1. ZeroDivisionError: float division by zero---> Refer to this link to resolve it
2. OperationalError: No Such Module:fts4 --> downlaod the sqlite3.dll from here and replace the DLL file in your python or anaconda DLLs folder.

Pros :

1. Works faster. In my case, I compared one dataframe with 3000 rows with anohter dataframe with 170,000 records . This also uses SQLite3 search across text. So faster than many
2. Can check across multiple columns and 2 dataframes. In my case, I was looking for closest match based on address and company name. Sometimes, company name might be same but address is the good thing to check too.
3. Gives you score for all the closest matches for the same record. you choose whats the cutoff score.

cons:

1. Original package installation is buggy
2. Required C++ and visual studios installed too
3. Wont work for 64 bit anaconda/Python

## Solution 8

There is a package called `fuzzy_pandas` that can use `levenshtein`, `jaro`, `metaphone` and `bilenco` methods. With some great examples here

``````import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
left_on='Key',
right_on='Key',
method='levenshtein',
threshold=0.6)

``````
``````  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag
``````

## Solution 9

As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying `get_close_matches`, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.

``````def fuzzy_match(a, b):
left = '1' if pd.isnull(a) else a
right = b.fillna('2')
out = difflib.get_close_matches(left, right)
return out[0] if out else np.NaN
``````

## Solution 10

You can use d6tjoin for that

``````import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']
``````

``` index number index_right letter 0 one 1 one a 1 two 2 too b 2 three 3 three c 3 four 4 fours d 4 five 5 five e ```

It has a variety of additional features such as:

• check join quality, pre and post join
• customize similarity function, eg edit distance vs hamming distance
• specify max distance
• multi-core compute

For details see

# Using `thefuzz`

Using SeatGeek's great package `thefuzz`, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.

Sample data

``````df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})

col_a   col_b
0   one     1
1   two     2
2   three   3
3   four    4
4   five    5

df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})

col_a   col_b
0   one     a
1   too     b
2   three   c
3   fours   d
4   five    e
``````

Function used to do the matching

``````def fuzzy_match(
df_left, df_right, column_left, column_right, threshold=90, limit=1
):
# Create a series
series_matches = df_left[column_left].apply(
lambda x: process.extract(x, df_right[column_right], limit=limit)            # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
)

# Convert matches to a tidy dataframe
df_matches = series_matches.to_frame()
df_matches = df_matches.explode(column_left)     # Convert list of matches to rows
df_matches[
['match_string', 'match_score', 'df_right_id']
] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)       # Convert match tuple to columns
df_matches.drop(column_left, axis=1, inplace=True)      # Drop column of match tuples

# Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
if df_matches.index.name:
index_name = df_matches.index.name     # Stash index name
else:
index_name = 'index'        # Default used by pandas
df_matches.reset_index(inplace=True)
df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)       # The previous index has now become a column: rename for ease of reference

# Drop matches below threshold
df_matches.drop(
df_matches.loc[df_matches['match_score'] < threshold].index,
inplace=True
)

return df_matches
``````

Use function and merge data

``````import pandas as pd
from thefuzz import process

df_matches = fuzzy_match(
df1,
df2,
'col_a',
'col_a',
threshold=60,
limit=1
)

df_output = df1.merge(
df_matches,
how='left',
left_index=True,
right_on='df_left_id'
).merge(
df2,
how='left',
left_on='df_right_id',
right_index=True,
suffixes=['_df1', '_df2']
)

df_output.set_index('df_left_id', inplace=True)       # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table

df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]      # Drop columns used in the matching
df_output.index.name = 'id'

id  col_a_df1   col_b_df1   col_b_df2
0   one         1           a
1   two         2           b
2   three       3           c
3   four        4           d
4   five        5           e
``````

Tip: Fuzzy matching using `thefuzz` is much quicker if you optionally install the `python-Levenshtein` package too.

## Solution 12

I have used `fuzzywuzz` in a very minimal way whilst matching the existing behaviour and keywords of `merge` in `pandas`.

Just specify your accepted `threshold` for matching (between `0` and `100`):

``````from fuzzywuzzy import process

def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):

def fuzzy_apply(x, df, column, threshold=threshold):
if type(x)!=str:
return None

match, score, *_ = process.extract(x, df[column], limit=1)[0]

if score >= threshold:
return match

else:
return None

if on is not None:
left_on = on
right_on = on

# create temp column as the best fuzzy match (or None!)
df2['tmp'] = df2[right_on].apply(
fuzzy_apply,
df=df,
column=left_on,
threshold=threshold
)

merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')

del merged_df['tmp']

return merged_df

``````

Try it out using the example data:

``````df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})

df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

fuzzy_merge(df, df2, on='Key', threshold=80)
``````

## Solution 13

For more complex use cases to match rows with many columns you can use `recordlinkage` package. `recordlinkage` provides all the tools to fuzzy match rows between `pandas` data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here

## Solution 14

if the join axis is numeric this could also be used to match indexes with a specified tolerance:

``````def fuzzy_left_join(df1, df2, tol=None):
index1 = df1.index.values
index2 = df2.index.values

diff = np.abs(index1.reshape((-1, 1)) - index2)
mask_j = np.argmin(diff, axis=1)  # min. of each column