Three methods of batch fuzzy matching

Author: Xiao Xiaoming

Sometimes two datasets are related, but the linking field is missing, so matching records have to be identified manually to establish the relationship. Here I present several approaches to fuzzy matching suited to data of different magnitudes.

Of course, sort-based approximate matching (like the approximate-match mode of Excel's VLOOKUP function) is also a form of fuzzy matching, but it is too simple to warrant discussion in this article.

This article mainly discusses fuzzy matching of strings such as company names and addresses.

Using the edit distance algorithm for fuzzy matching

The basic idea of fuzzy matching is to compute the similarity between each candidate string and the target string, and take the candidate with the highest similarity as the match for the target.

The most common way to measure similarity between strings is the edit distance algorithm.

Below, we have 28 names and want to find the most similar name for each one in a database of 390 records:

import pandas as pd

excel = pd.ExcelFile("所有客户.xlsx")
data = excel.parse(0)   # sheet 0: the customer-name database (390 records)
find = excel.parse(1)   # sheet 1: the 28 names to look up
display(data.head())
print(data.shape)
display(find.head())
print(find.shape)

The edit distance between two strings is the minimum number of editing operations required to turn one into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character.
Generally, the smaller the edit distance, the fewer operations are needed and the more similar the two strings are.

Create a function to calculate the edit distance:

def minDistance(word1: str, word2: str):
    'Compute the edit distance between two strings'
    n = len(word1)
    m = len(word2)
    # one of the strings is empty
    if n * m == 0:
        return n + m
    # DP table
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # initialize the boundary states
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # fill in all DP values
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            left = D[i - 1][j] + 1
            down = D[i][j - 1] + 1
            left_down = D[i - 1][j - 1]
            if word1[i - 1] != word2[j - 1]:
                left_down += 1
            D[i][j] = min(left, down, left_down)
    return D[n][m]
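
A quick sanity check of the function, using the examples from the LeetCode problem (horse -> ros takes 3 edits, intention -> execution takes 5):

print(minDistance("horse", "ros"))            # 3
print(minDistance("intention", "execution"))  # 5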

For an analysis of the above code, see the LeetCode solution: https://leetcode-cn.com/problems/edit-distance/solution/bian-ji-ju-chi-by-leetcode-solution/

For each name being searched, compute its edit distance to every customer name in the database, and take the customer name with the smallest edit distance:

result = []
for name in find.name.values:
    # edit distance from this name to every customer name in the database
    a = data.user.apply(lambda user: minDistance(user, name))
    # customer name with the smallest edit distance
    user = data.user[a.argmin()]
    result.append(user)
find["result"] = result
find

Testing showed that some addresses did not match well.

Taking two of the poorly matched queries as examples (the Huaihe Road branch in Xi County, Xinyang, and the Huangchuan No. 4 Middle School branch), let's look at the ten database names with the smallest edit distances, together with those distances:

# ten closest names for the Xi County Huaihe Road branch query
a = data.user.apply(lambda user: minDistance(user, '河南美锐信阳息县淮河路分店'))
a = a.nsmallest(10).reset_index()
a.columns = ["名称", "编辑距离"]
a.名称 = data.user[a.名称].values
a

# ten closest names for the Huangchuan No. 4 Middle School branch query
a = data.user.apply(lambda user: minDistance(user, '河南美锐信阳潢川四中分店'))
a = a.nsmallest(10).reset_index()
a.columns = ["名称", "编辑距离"]
a.名称 = data.user[a.名称].values
a

As you can see, the results we actually want do appear among the ten names with the smallest edit distance.

Using fuzzywuzzy for batch fuzzy matching

The code above illustrates the basic principle of batch fuzzy matching with the edit distance algorithm. However, writing the edit distance algorithm yourself is cumbersome, and converting distances into similarity scores for analysis is extra work. When a ready-made wheel exists, there is no need to write your own.

The fuzzywuzzy library is built on the edit distance algorithm, but it quantifies the result as a similarity score and performs much better than our unoptimized hand-written version. It can be installed with pip install FuzzyWuzzy.

fuzzywuzzy mainly consists of the fuzz module and the process module. The fuzz module computes the similarity between two strings, essentially an encapsulated and optimized version of the code above, while the process module directly extracts the desired results from a list of candidates.

fuzz module

from fuzzywuzzy import fuzz

Simple matching (Ratio):

a = data.user.apply(lambda user: fuzz.ratio(user, '河南美锐信阳潢川四中分店'))
a = a.nlargest(10).reset_index()
a.columns = ["名称", "相似度"]
a.名称 = data.user[a.名称].values
a

Partial matching (Partial Ratio):

a = data.user.apply(lambda user: fuzz.partial_ratio(user, '河南美锐信阳潢川四中分店'))
a = a.nlargest(10).reset_index()
a.columns = ["名称", "相似度"]
a.名称 = data.user[a.名称].values
a

Clearly, fuzzywuzzy's ratio() function ranks candidates much more accurately than the edit distance function written above.
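
To clarify what the two scorers measure, here is a minimal sketch: ratio compares the two whole strings, while partial_ratio scores the best-aligned substring, so it rewards one string being contained in the other (the exact numbers depend on the library version):

from fuzzywuzzy import fuzz

short_name = '信阳息县淮河路店'
full_name = '河南美锐大药房连锁有限公司息县淮河路分店'
print(fuzz.ratio(short_name, full_name))          # whole-string comparison
print(fuzz.partial_ratio(short_name, full_name))  # best substring alignment, higher here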

process module

The process module is a further layer of encapsulation that directly returns the most similar candidates along with their similarity scores:

from fuzzywuzzy import process

extract returns multiple candidates:

users = data.user.to_list()
a = process.extract('河南美锐信阳潢川四中分店', users, limit=10)
a = pd.DataFrame(a, columns=["名称", "相似度"])
a

Judging from the results, the process module appears to combine the fuzz module's simple matching (ratio) and partial matching (partial_ratio) at the same time.
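
process.extract also accepts a scorer argument, so you can pin down the scoring function explicitly instead of relying on the default (fuzz.WRatio). A small sketch, reusing the users list from above; whether a different scorer actually helps depends on the data:

from fuzzywuzzy import fuzz, process

# force partial_ratio as the scorer instead of the default WRatio
a = process.extract('河南美锐信阳潢川四中分店', users, scorer=fuzz.partial_ratio, limit=10)
pd.DataFrame(a, columns=["名称", "相似度"])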

When we only need a single result, extractOne is more convenient:

users = data.user.to_list()
find["result"] = find.name.apply(lambda x: process.extractOne(x, users)[0])
find

The accuracy is much higher than with the hand-written edit distance approach, but a few names are still matched poorly.

Let's look at those two poorly matched addresses:

process.extract('许湾乡许湾村焦艳芳卫生室', users)
[('小寨沟村卫生室', 51),
 ('周口城乡一体化焦艳芳一体化卫生室', 50),
 ('西华县皮营乡楼陈村卫生室', 42),
 ('叶县邓李乡杜杨村第二卫生室', 40),
 ('汤阴县瓦岗乡龙虎村东卫生室', 40)]

process.extract('河南美锐信阳息县淮河路分店', users)
[('信阳息县淮河路店', 79),
 ('河南美锐大药房连锁有限公司息县淮河路分店', 67),
 ('河南美锐大药房连锁有限公司息县大河文锦分店', 53),
 ('河南美锐大药房连锁有限公司息县千佛庵东路分店', 51),
 ('河南美锐大药房连锁有限公司息县包信分店', 50)]

I have no perfect solution for cases like this. My suggestion is to keep the n most similar names in the result and filter them manually afterwards:

# keep only the names (dropping the scores) from the top-3 candidates of each query
result = find.name.apply(lambda x: next(zip(*process.extract(x, users, limit=3)))).apply(pd.Series)
result.rename(columns=lambda i: f"匹配{i+1}", inplace=True)
result = pd.concat([find.drop(columns="result"), result], axis=1)
result

Although for a few names the correct match may still not appear among these candidates, overall this saves a great deal of manual screening time.
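
If you would rather have the program flag low-confidence matches instead of always filling in a doubtful name, extractOne also accepts a score_cutoff argument and returns None when no candidate reaches it. A sketch along those lines (safe_match and the threshold of 60 are assumptions of mine, not part of the original workflow):

from fuzzywuzzy import process

def safe_match(name, choices, cutoff=60):
    # best match, or None when even the best candidate scores below the cutoff
    best = process.extractOne(name, choices, score_cutoff=cutoff)
    return best[0] if best else None

find["result"] = find.name.apply(lambda x: safe_match(x, users))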

Complete code

from fuzzywuzzy import process
import pandas as pd

excel = pd.ExcelFile("所有客户.xlsx")
data = excel.parse(0)
find = excel.parse(1)
users = data.user.to_list()
result = find.name.apply(lambda x: next(
    zip(*process.extract(x, users, limit=3)))).apply(pd.Series)
result.rename(columns=lambda i: f"匹配{i+1}", inplace=True)
result = pd.concat([find, result], axis=1)
result

Using Gensim for batch fuzzy matching

Introduction to Gensim

Gensim supports several topic-model algorithms, including TF-IDF, LSA, LDA, and word2vec; it supports streaming training and provides APIs for common tasks such as similarity computation and information retrieval.

Basic concepts:

  • Corpus: a collection of raw texts used to train the latent topic structure of the text without supervision; no extra manual labeling of the corpus is needed. In Gensim, a corpus is usually an iterable object (such as a list) whose every iteration yields a sparse vector representing one text.
  • Vector: a list of text features; the internal representation of a text in Gensim.
  • Sparse vector (SparseVector): a vector in which the zero elements are omitted, so each remaining element is a (key, value) tuple. A small toy sketch follows this list.
  • Model: an abstract term for a transformation between two vector spaces, i.e. from one vector representation of a text to another.
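
To make these concepts concrete, here is a minimal toy sketch (the tokens and the resulting ids are purely illustrative):

from gensim import corpora

# a toy corpus: three already-tokenized "documents"
texts = [["广州市", "天河区", "药店"],
         ["广州市", "白云区", "药房"],
         ["深圳市", "宝安区", "药店"]]
dictionary = corpora.Dictionary(texts)            # maps each token to an integer id
corpus = [dictionary.doc2bow(t) for t in texts]   # each document becomes a sparse vector
print(dictionary.token2id)   # e.g. {'广州市': 0, '天河区': 1, '药店': 2, ...}
print(corpus[0])             # e.g. [(0, 1), (1, 1), (2, 1)]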

Installation: pip install gensim

Official website: https://radimrehurek.com/gensim/

Under what circumstances should NLP techniques be used for batch fuzzy matching? When the database becomes large, for example tens of thousands of records:

import pandas as pd

data = pd.read_csv("所有客户.csv", encoding="gbk")
find = pd.read_csv("被查找的客户.csv", encoding="gbk")
display(data.head())
print(data.shape)
display(find.head())
print(find.shape)

At this scale, brute-force traversal with edit distance or fuzzywuzzy would likely not finish within an hour, whereas the NLP workhorse Gensim can compute the result in a few seconds.

Using the bag-of-words model directly for batch similarity matching

First, segment the original text to obtain the feature (token) list of each name:

import jieba

data_split_word = data.user.apply(jieba.lcut)
data_split_word.head(10)
0        [珠海, 广药, 康鸣, 医药, 有限公司]
1              [深圳市, 宝安区, 中心医院]
2         [中山, 火炬, 开发区, 伴康, 药店]
3           [中山市, 同方, 医药, 有限公司]
4    [广州市, 天河区, 元岗金, 健民, 医药, 店]
5       [广州市, 天河区, 元岗居, 健堂, 药房]
6          [广州市, 天河区, 元岗润佰, 药店]
7        [广州市, 天河区, 元岗, 协心, 药房]
8        [广州市, 天河区, 元岗, 心怡, 药店]
9         [广州市, 天河区, 元岗永亨堂, 药店]
Name: user, dtype: object

Next, build a dictionary that indexes the corpus features, and convert each token list into its bag-of-words sparse-vector representation:

from gensim import corpora

dictionary = corpora.Dictionary(data_split_word.values)
data_corpus = data_split_word.apply(dictionary.doc2bow)
data_corpus.head()
0             [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
1                             [(5, 1), (6, 1), (7, 1)]
2          [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
3                   [(0, 1), (3, 1), (13, 1), (14, 1)]
4    [(0, 1), (15, 1), (16, 1), (17, 1), (18, 1), (...
Name: user, dtype: object

This gives the sparse vector (here, a bag-of-words vector) for each name; each element records how many times a given token appears in that name.
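
To see which tokens those integer ids stand for, the dictionary can map a bag-of-words vector back to readable (token, count) pairs; a small illustrative sketch (the exact id-to-token assignment may differ on your machine):

# map the first name's bag-of-words vector back to (token, count) pairs
bow = data_corpus.iloc[0]
print([(dictionary[token_id], count) for token_id, count in bow])
# e.g. [('珠海', 1), ('广药', 1), ('康鸣', 1), ('医药', 1), ('有限公司', 1)]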

With this we can build the similarity index:

from gensim import similarities

index = similarities.SparseMatrixSimilarity(data_corpus.values, num_features=len(dictionary))

Then apply the same processing to the names being searched and perform batch similarity matching:

find_corpus = find.name.apply(jieba.lcut).apply(dictionary.doc2bow)
sim = index[find_corpus]  # similarity of every query name against every database name
find["result"] = data.user[sim.argmax(axis=1)].values  # best match for each query
find.head(30)

This model computes very quickly, and its overall accuracy appears higher than fuzzywuzzy's, although fuzzywuzzy did match the 308 factory branch of Henan Meirui Pharmacy Chain Co., Ltd. correctly.

Batch similarity matching after a TF-IDF topic-vector transformation

The corpus used so far consisted of raw term-frequency vectors. Now we transform it with a TF-IDF model and then build the similarity index:

from gensim import models

tfidf = models.TfidfModel(data_corpus.to_list())
index = similarities.SparseMatrixSimilarity(
    tfidf[data_corpus], num_features=len(dictionary))

The names being searched are treated the same way:

sim = index[tfidf[find_corpus]]
find["result"] = data.user[sim.argmax(axis=1)].values
find.head(30)

Now the Jiao Yanfang clinic in Xuwan Village, Xuwan Township is matched correctly, but the Huaihe Road branch of Henan Meirui in Xi County, Xinyang is matched incorrectly. This is because, under the TF-IDF model, the weight of '美锐' (Meirui) is reduced, since it appears in many database entries.
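
You can verify this downweighting by looking at the TF-IDF weights the model assigns to the query's tokens (an illustrative sketch; the exact weights depend on the corpus):

# inspect the TF-IDF weight of each token in the query
query_bow = dictionary.doc2bow(jieba.lcut('河南美锐信阳息县淮河路分店'))
for token_id, weight in tfidf[query_bow]:
    print(dictionary[token_id], round(weight, 3))
# tokens that appear in many database entries (such as '美锐')
# receive small weights, so rarer tokens dominate the similarity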

What if the TF-IDF transformation is applied only to the database, while the searched names keep their raw term-frequency vectors?

from gensim import models

tfidf = models.TfidfModel(data_corpus.to_list())
index = similarities.SparseMatrixSimilarity(
    tfidf[data_corpus], num_features=len(dictionary))
sim = index[find_corpus]
find["result"] = data.user[sim.argmax(axis=1)].values
find.head(30)

Apart from 爱联宝芝林大药房 (whose correct name simply does not exist in the database), the 308 factory branch of Henan Meirui Pharmacy Chain Co., Ltd. is still matched incorrectly. This is because the model cannot recognize that '308' means the same as '三零八'. If there are many cases like this, we can first convert the Arabic numerals in the names being searched into Chinese numerals (to stay consistent with the database) and then segment:

trantab = str.maketrans("0123456789", "零一二三四五六七八九")
find_corpus = find.name.apply(lambda x: dictionary.doc2bow(jieba.lcut(x.translate(trantab))))

sim = index[find_corpus]
find["result"] = data.user[sim.argmax(axis=1)].values
find.head(30)

After this treatment, the 308 factory branch is matched correctly as well; other, similar mismatches can be handled with the same idea, as shown in the sketch below.
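
The general pattern is to normalize the query text before segmentation so it follows the same conventions as the database. A hypothetical sketch (the normalize helper and its rules are my own assumptions; which rules make sense depends on your data), combining full-width/half-width normalization and whitespace removal with the digit conversion above:

import unicodedata

trantab = str.maketrans("0123456789", "零一二三四五六七八九")

def normalize(text: str) -> str:
    # hypothetical normalization: full-width -> half-width, drop spaces,
    # then convert Arabic digits to Chinese numerals to match the database
    text = unicodedata.normalize("NFKC", text).replace(" ", "")
    return text.translate(trantab)

find_corpus = find.name.apply(lambda x: dictionary.doc2bow(jieba.lcut(normalize(x))))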

Although the accuracy after the above processing is close to 100%, that does not mean other kinds of data will match equally well; the data must be analyzed and converted case by case. There is no universally perfect approach to batch fuzzy matching, so the program's results should not be trusted blindly: a manual second check is still needed unless a certain error rate is acceptable.

To make manual screening easier, we can keep the top N most similar candidates in the result. Here we use three as an example.

Getting the top 3 results at once

Below we add the 3 most similar candidates to the results:

result = []
for corpus in find_corpus.values:
    sim = pd.Series(index[corpus])
    result.append(data.user[sim.nlargest(3).index].values)
result = pd.DataFrame(result)
result.rename(columns=lambda i: f"匹配{i+1}", inplace=True)
result = pd.concat([find.drop(columns="result"), result], axis=1)
result.head(30)

Complete code

from gensim import corpora, similarities, models
import jieba
import pandas as pd

data = pd.read_csv("所有客户.csv", encoding="gbk")
find = pd.read_csv("被查找的客户.csv", encoding="gbk")

data_split_word = data.user.apply(jieba.lcut)
dictionary = corpora.Dictionary(data_split_word.values)
data_corpus = data_split_word.apply(dictionary.doc2bow)
trantab = str.maketrans("0123456789", "零一二三四五六七八九")
find_corpus = find.name.apply(
    lambda x: dictionary.doc2bow(jieba.lcut(x.translate(trantab))))

tfidf = models.TfidfModel(data_corpus.to_list())
index = similarities.SparseMatrixSimilarity(
    tfidf[data_corpus], num_features=len(dictionary))

result = []
for corpus in find_corpus.values:
    sim = pd.Series(index[corpus])
    result.append(data.user[sim.nlargest(3).index].values)
result = pd.DataFrame(result)
result.rename(columns=lambda i: f"匹配{i+1}", inplace=True)
result = pd.concat([find, result], axis=1)
result.head(30)

Summary

This article first presented the concept of edit distance and how to use it for similarity-based fuzzy matching, then introduced fuzzywuzzy, a wheel built on the same algorithm that is better packaged and very convenient to use. However, once the database exceeds roughly 10,000 entries, efficiency drops sharply, and with more than 100,000 entries the computation may not finish for a very long time. Gensim was therefore used to compute TF-IDF vectors after word segmentation and calculate similarities from them, cutting the computation time from hours to seconds while also greatly improving accuracy; this approach can handle most batch similarity fuzzy matching problems.

I'm Xiaoming. Feel free to leave a comment below and share your thoughts.