Finding Similar Type of Names | Fuzzy Search in SQL
Fuzzy Search :
A technique of finding the strings that match a pattern approximately (rather than exactly). Users / Reviewers often capture names inaccurately. Typing mistake is a very common mistake that reviewer does while capturing the names which leads to inconsistency in Data. But sometimes, we need to search or match this inaccurate data anyway. For example, users should match existing customer records rather than creating unwanted duplicates.
Problem Statement :
Finding all kind of error types that described in the below table e.g. Spelling Mistake, Incomplete Name, Formatting Issue, Suffix Missing.
FirstName LastName FirstName LastName Error Type Benjamin Sheriff Benjamin Sherrif Spelling Mistake Cocoa Fairchil Cocoa Fairchild Spelling Mistake Danuzio Pinheiro Danuzio Pinherio Spelling Mistake Jeff Voight Jeff Voigt Spelling Mistake Marie Prendergast Marie Prendergrast Spelling Mistake Michael Heinrich Michael Heinricy Spelling Mistake Nghia Le Nghia Lee Spelling Mistake Leonardo Madrigal Leonardo Madrigal Del Valle Incomplete Name Yolanda Balcazar Yolanda Balcazar-Rodriguez Incomplete Name J Garcia Martinez J Garcia-Martinez Formatting Issue Lauren Lord – Tuley Lauren Lord-Tuley Formatting Issue Virginia Pulley Alberts Virginia Pulley-Alberts Formatting Issue Connie Marks Connie Marks Iv Suffix Missing Daniel Ladd Daniel Ladd Jr Suffix Missing Rodrigo Delgado Rodrigo Delgado Jr Suffix Missing Thomas Griffin Thomas Griffin Iii Suffix Missing Wayne Nance Wayne Nance Ii Suffix Missing
Approach to solve :
SOUNDEX() Function can find the inconsistency in names. SOUNDEX() can evaluate the similarity of two names. SOUNDEX() only works well when we do have 1 or 2 tokens. Names generally contain 1-2 tokens, so it works fine with Name. But if we want to find similar names for the company then it’s not that useful as it contains multiple tokens.
Benjamin Sheriff and Benjamin Sherrif where FirstName is same but any of the LastName has a spelling error. Likewise, there are many errors in this data like formatting issues, incomplete names.
FirstName LastName Benjamin Sheriff Benjamin Sherrif Virginia Pulley Alberts Virginia Pulley-Alberts Geoffery Bown Geoffrey Bown Anthoni Castoldi Anthony Castoldi Rodrigo Delgado Rodrigo Delgado Jr Leonardo Madrigal Leonardo Madrigal Del Valle Mitchel Myers Mitchell Myers
Case-1: FirstName is exact the same but LastName is a similar type.
Case-2: LastName is exact the same but FirstName is a similar type.
Let’s discuss it one by one as follows.
FirstName is exact the same but LastName is similar type :
select distinct ss.firstname,ss.lastname,sd.firstname,sd.lastname from load as ss, load as sd where ss.firstName=sd.firstName and SOUNDEX(ss.lastName)=SOUNDEX(sd.lastname) and left(ss.lastname,2)=left(sd.lastname,2) and ss.lastName<>sd.lastName
FirstName LastName FirstName LastName Benjamin Sheriff Benjamin Sherrif Benjamin Sherrif Benjamin Sheriff Leonardo Madrigal Leonardo Madrigal Del Valle Leonardo Madrigal Del Valle Leonardo Madrigal Rodrigo Delgado Rodrigo Delgado Jr Rodrigo Delgado Jr Rodrigo Delgado Virginia Pulley Alberts Virginia Pulley-Alberts Virginia Pulley-Alberts Virginia Pulley Alberts
LastName is exact same but FirstName is similar type :
select distinct ss.lastname,ss.firstname,sd.lastname,sd.firstname from load as ss, load as sd where ss.lastname=sd.lastname and SOUNDEX(ss.firstname)=SOUNDEX(sd.firstname) and left(ss.firstname,2)=left(sd.firstname,2) and ss.firstname<>sd.firstname
S.No. FirstName LastName FirstName LastName 1. Anthoni Castoldi Anthony Castoldi 2. Anthony Castoldi Anthoni Castoldi 3. Geoffery Bown Geoffrey Bown 4. Geoffrey Bown Geoffery Bown 5. Mitchel Myers Mitchell Myers 6. Mitchell Myers Mitchel Myers
Github Link – https://github.com/SuryaSD/Finding-Similar-Types-of-Names-Fuzzy-Search
Attention reader! Don’t stop learning now. Learn SQL for interviews using SQL Course by GeeksforGeeks.