Finding Similar Type of Names | Fuzzy Search in SQL
Fuzzy Search :
A technique of finding the strings that match a pattern approximately (rather than exactly). Users / Reviewers often capture names inaccurately. Typing mistake is a very common mistake that reviewer does while capturing the names which leads to inconsistency in Data. But sometimes, we need to search or match this inaccurate data anyway. For example, users should match existing customer records rather than creating unwanted duplicates.
Problem Statement :
Finding all kind of error types that described in the below table e.g. Spelling Mistake, Incomplete Name, Formatting Issue, Suffix Missing.
FirstName | LastName | FirstName | LastName | Error Type |
---|---|---|---|---|
Benjamin | Sheriff | Benjamin | Sherrif | Spelling Mistake |
Cocoa | Fairchil | Cocoa | Fairchild | Spelling Mistake |
Danuzio | Pinheiro | Danuzio | Pinherio | Spelling Mistake |
Jeff | Voight | Jeff | Voigt | Spelling Mistake |
Marie | Prendergast | Marie | Prendergrast | Spelling Mistake |
Michael | Heinrich | Michael | Heinricy | Spelling Mistake |
Nghia | Le | Nghia | Lee | Spelling Mistake |
Leonardo | Madrigal | Leonardo | Madrigal Del Valle | Incomplete Name |
Yolanda | Balcazar | Yolanda | Balcazar-Rodriguez | Incomplete Name |
J | Garcia Martinez | J | Garcia-Martinez | Formatting Issue |
Lauren | Lord – Tuley | Lauren | Lord-Tuley | Formatting Issue |
Virginia | Pulley Alberts | Virginia | Pulley-Alberts | Formatting Issue |
Connie | Marks | Connie | Marks Iv | Suffix Missing |
Daniel | Ladd | Daniel | Ladd Jr | Suffix Missing |
Rodrigo | Delgado | Rodrigo | Delgado Jr | Suffix Missing |
Thomas | Griffin | Thomas | Griffin Iii | Suffix Missing |
Wayne | Nance | Wayne | Nance Ii | Suffix Missing |
Approach to solve :
SOUNDEX() Function can find the inconsistency in names. SOUNDEX() can evaluate the similarity of two names. SOUNDEX() only works well when we do have 1 or 2 tokens. Names generally contain 1-2 tokens, so it works fine with Name. But if we want to find similar names for the company then it’s not that useful as it contains multiple tokens.
Data :
Benjamin Sheriff and Benjamin Sherrif where FirstName is same but any of the LastName has a spelling error. Likewise, there are many errors in this data like formatting issues, incomplete names.
FirstName | LastName |
---|---|
Benjamin | Sheriff |
Benjamin | Sherrif |
Virginia | Pulley Alberts |
Virginia | Pulley-Alberts |
Geoffery | Bown |
Geoffrey | Bown |
Anthoni | Castoldi |
Anthony | Castoldi |
Rodrigo | Delgado |
Rodrigo | Delgado Jr |
Leonardo | Madrigal |
Leonardo | Madrigal Del Valle |
Mitchel | Myers |
Mitchell | Myers |
Code :
Case-1: FirstName is exact the same but LastName is a similar type.
Case-2: LastName is exact the same but FirstName is a similar type.
Let’s discuss it one by one as follows.
Case-1 :
FirstName is exact the same but LastName is similar type :
select distinct ss.firstname,ss.lastname,sd.firstname,sd.lastname from load as ss, load as sd where ss.firstName=sd.firstName and SOUNDEX(ss.lastName)=SOUNDEX(sd.lastname) and left(ss.lastname,2)=left(sd.lastname,2) and ss.lastName<>sd.lastName
Output :
FirstName | LastName | FirstName | LastName |
---|---|---|---|
Benjamin | Sheriff | Benjamin | Sherrif |
Benjamin | Sherrif | Benjamin | Sheriff |
Leonardo | Madrigal | Leonardo | Madrigal Del Valle |
Leonardo | Madrigal Del Valle | Leonardo | Madrigal |
Rodrigo | Delgado | Rodrigo | Delgado Jr |
Rodrigo | Delgado Jr | Rodrigo | Delgado |
Virginia | Pulley Alberts | Virginia | Pulley-Alberts |
Virginia | Pulley-Alberts | Virginia | Pulley Alberts |
Case-2 :
LastName is exact same but FirstName is similar type :
select distinct ss.lastname,ss.firstname,sd.lastname,sd.firstname from load as ss, load as sd where ss.lastname=sd.lastname and SOUNDEX(ss.firstname)=SOUNDEX(sd.firstname) and left(ss.firstname,2)=left(sd.firstname,2) and ss.firstname<>sd.firstname
Output :
S.No. | FirstName | LastName | FirstName | LastName |
---|---|---|---|---|
1. | Anthoni | Castoldi | Anthony | Castoldi |
2. | Anthony | Castoldi | Anthoni | Castoldi |
3. | Geoffery | Bown | Geoffrey | Bown |
4. | Geoffrey | Bown | Geoffery | Bown |
5. | Mitchel | Myers | Mitchell | Myers |
6. | Mitchell | Myers | Mitchel | Myers |
Reference :
Github Link – https://github.com/SuryaSD/Finding-Similar-Types-of-Names-Fuzzy-Search
Please Login to comment...