Python | Grouping similar substrings in list
Sometimes an application needs to group strings that share a common prefix, so that further processing can be applied group by group. This kind of grouping comes up in areas such as machine learning and web development. Let's discuss a few ways in which this can be done.
Method #1 : Using lambda + itertools.groupby() + split()
The combination of these three functions achieves the task. The split() method is the key piece, as it defines the separator that determines the prefix used for grouping. The groupby() function then does the grouping; because it only merges consecutive elements with equal keys, the list has to be sorted first.
Python3
# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + split()
from itertools import groupby

# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# sort list
# essential for grouping
test_list.sort()

# printing the original list
print("The original list is : " + str(test_list))

# using lambda + itertools.groupby() + split()
# group similar substrings
res = [list(i) for j, i in groupby(test_list, lambda a: a.split('_')[0])]

# printing result
print("The grouped list is : " + str(res))
The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
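The sort matters because groupby() starts a new group every time the key changes. The following is a minimal sketch, using the same sample data, of what happens without sorting and of sorting with the same key function that is used for grouping. The helper name `prefix` and the dictionary result are illustrative choices, not part of the method above.

```python
from itertools import groupby

def prefix(s):
    # Text before the first underscore (illustrative helper).
    return s.split('_')[0]

items = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# Without sorting, equal keys that are not adjacent land in separate groups.
unsorted_groups = [list(g) for _, g in groupby(items, key=prefix)]
print(unsorted_groups)
# [['geek_1'], ['coder_2'], ['geek_4'], ['coder_3'], ['pro_3']]

# Sorting with the same key makes equal prefixes adjacent before grouping.
grouped = {k: list(g) for k, g in groupby(sorted(items, key=prefix), key=prefix)}
print(grouped)
# {'coder': ['coder_2', 'coder_3'], 'geek': ['geek_1', 'geek_4'], 'pro': ['pro_3']}
```

Sorting by the grouping key rather than by the whole string is the safer habit, since it guarantees that items with equal keys end up next to each other even if some items lack the separator.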
Method #2 : Using lambda + itertools.groupby() + partition()
The same task can be performed by replacing split() with partition(). partition() splits only at the first occurrence of the separator and returns a 3-tuple (head, separator, tail), so it avoids building a list of every piece and can be slightly faster when only the prefix is needed.
Python3
# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + partition()
from itertools import groupby

# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# sort list
# essential for grouping
test_list.sort()

# printing the original list
print("The original list is : " + str(test_list))

# using lambda + itertools.groupby() + partition()
# group similar substrings
res = [list(i) for j, i in groupby(test_list, lambda a: a.partition('_')[0])]

# printing result
print("The grouped list is : " + str(res))
The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
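To see how the two string methods differ, here is a small sketch with two hypothetical strings (they are not part of the sample list): one with two separators and one with none. In both cases index 0 holds the same prefix, which is why either method works as the grouping key.

```python
s = 'geek_1_beta'   # hypothetical item with two separators

split_parts = s.split('_')           # every piece, as a list
partition_parts = s.partition('_')   # split once, as a 3-tuple
print(split_parts)       # ['geek', '1', 'beta']
print(partition_parts)   # ('geek', '_', '1_beta')

t = 'solo'          # hypothetical item with no separator

# With no separator present, both return the whole string at index 0.
print(t.split('_'))       # ['solo']
print(t.partition('_'))   # ('solo', '', '')
```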
Method #3 : Using index() and find() methods
Python3
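# Python3 code to demonstrate
# group similar substrings
# using index() and find()

# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
print("The original List is : " + str(test_list))

# collect the prefix of every element (text before the first "_")
x = []
for i in test_list:
    x.append(i[:i.index("_")])

# keep each prefix once; a set does not preserve order
x = list(set(x))

res = []
for i in x:
    a = []
    for j in test_list:
        # the prefix followed by "_" must start at position 0,
        # so that 'pro' does not match an item such as 'improve_1'
        if j.find(i + "_") == 0:
            a.append(j)
    res.append(a)

# printing result
print("The grouped list is : " + str(res))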
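The original List is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]
Note: the order of the groups may differ between runs, because the prefixes are collected in a set, which is unordered.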
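One thing to keep in mind with find() is that it reports a match anywhere in the string, so a test of the form find(prefix) != -1 also accepts items that merely contain the prefix. The sketch below uses hypothetical data (the item 'improve_1' is not in the sample list) to contrast that substring test with a test anchored at position 0.

```python
items = ['pro_3', 'improve_1']   # hypothetical data
prefix = 'pro'

# Substring test: also accepts 'improve_1', because 'pro' occurs inside it.
loose = [s for s in items if s.find(prefix) != -1]
print(loose)    # ['pro_3', 'improve_1']

# Anchored test: the prefix plus separator must start at index 0.
strict = [s for s in items if s.find(prefix + '_') == 0]
print(strict)   # ['pro_3']
```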
Method #4 : Using startswith()
Python3
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# printing the original list
print("The original list is : " + str(test_list))

# using startswith in a list comprehension
# the separator is appended so that 'geek' does not also match 'geeks_1'
res = [[item for item in test_list if item.startswith(prefix + "_")]
       for prefix in set([item[:item.index("_")] for item in test_list])]

# printing result
print("The grouped list is : " + str(res))
# This code is contributed by Edula Vinay Kumar Reddy
The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]
This approach first creates a list of all the unique prefixes in the original list using a list comprehension and the set function. It then uses another list comprehension to create a list of lists, where each inner list contains all the elements in the original list that start with the corresponding prefix.
This approach is more concise and readable than the third method using index and find. It is not necessarily faster than the first and second methods, however: it needs no sort, but it rescans the whole list once for every distinct prefix.
This approach has a time complexity of O(n * k), where n is the length of test_list and k is the number of distinct prefixes: one pass collects the unique prefixes, and the list is then scanned once per prefix to build each group. It has a space complexity of O(n), as it creates two additional collections, one containing the unique prefixes and one containing the grouped list.
This means that the space used is linear in the size of the input list test_list, while the running time is linear only when the number of distinct prefixes is small and bounded; in the worst case, where every element has its own prefix, it is O(n^2).
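For comparison, a grouping that visits each item exactly once can be sketched with collections.defaultdict. This is not one of the four methods above; it is offered only as a hedged illustration of what a single pass over the list looks like, using the same sample data.

```python
from collections import defaultdict

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

groups = defaultdict(list)
for item in test_list:
    # partition() returns the text before the first '_' at index 0
    groups[item.partition('_')[0]].append(item)

res = list(groups.values())
print("The grouped list is : " + str(res))
# The grouped list is : [['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]
```

Because dictionaries keep insertion order, the groups appear in the order in which each prefix is first seen, and no sorting is required.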