Python | Grouping similar substrings in list

Last Updated : 11 Apr, 2023

Sometimes an application needs to group strings that share a common prefix so that further processing can be done per group. This kind of grouping comes up in areas such as machine learning and web development. Let’s discuss several ways in which this can be done.

Method #1 : Using lambda + itertools.groupby() + split()
The combination of these three functions achieves the task. The split() method is key because it defines the separator on which the grouping is performed, and groupby() groups consecutive elements that share the same key.

Step-by-step approach:

Import the groupby function from the itertools module.
Initialize a list of strings test_list with some elements.
Sort the test_list in ascending order using the sort() method. This is necessary because groupby() only groups adjacent elements that have equal keys.
Print the original test_list.
Use a list comprehension to iterate over the groups of elements in test_list grouped by the first substring before the _ character.
In the groupby() function, test_list is the iterable, and the lambda function lambda a: a.split('_')[0] returns the first substring before the _ character in each element of the list. This is used to group the elements.
Convert each group into a list and append it to the result list res.
Print the result list res.

Below is the implementation of the above approach:

Python3

# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + split()
from itertools import groupby
 
# initializing list 
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
 
# sort list 
# essential for grouping
test_list.sort()
 
# printing the original list 
print ("The original list is : " + str(test_list))
 
# using lambda + itertools.groupby() + split()
# group similar substrings
res = [list(i) for j, i in groupby(test_list,
                  lambda a: a.split('_')[0])]
 
# printing result
print ("The grouped list is : " + str(res))

Output

The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]

Time complexity: O(nlogn), where n is the length of the input list.
Auxiliary space: O(n), where n is the length of the input list.
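
To illustrate why the sort step matters, here is a minimal sketch that runs groupby() on the same list with and without sorting. The helper name prefix and the variable names are illustrative and do not come from the code above.

```python
from itertools import groupby

def prefix(s):
    # Key function: the text before the first underscore
    return s.split('_')[0]

items = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# Without sorting, equal keys that are not adjacent end up in separate groups
unsorted_groups = [list(g) for _, g in groupby(items, key=prefix)]

# Sorting by the same key makes equal keys adjacent before grouping
sorted_groups = [list(g) for _, g in groupby(sorted(items, key=prefix), key=prefix)]

print(unsorted_groups)
# [['geek_1'], ['coder_2'], ['geek_4'], ['coder_3'], ['pro_3']]
print(sorted_groups)
# [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
```

Sorting with the grouping key itself, rather than the whole string, is a design choice here: it guarantees that the order used for sorting and the key used for grouping agree.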

Method #2 : Using lambda + itertools.groupby() + partition()
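The same task can be performed by replacing split() with partition(). partition() splits only at the first occurrence of the separator and always returns a 3-tuple (head, separator, tail), so it avoids building a list of every piece and can be slightly faster when strings contain several separators.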

Python3

# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + partition()
from itertools import groupby
 
# initializing list 
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
 
# sort list 
# essential for grouping
test_list.sort()
 
# printing the original list 
print ("The original list is : " + str(test_list))
 
# using lambda + itertools.groupby() + partition()
# group similar substrings
res = [list(i) for j, i in groupby(test_list,
              lambda a: a.partition('_')[0])]
 
# printing result
print ("The grouped list is : " + str(res))

Output

The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]

Time complexity: O(n log n) (due to sorting the list).
Auxiliary space: O(n) (for creating the result list “res”).
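
As a rough comparison of the two ways of extracting the prefix, the sketch below applies split() and partition() to a few sample strings. The sample values are made up for illustration and include a string with two separators and one with none.

```python
samples = ['geek_1', 'coder_2_beta', 'plain']

# For each sample, take the prefix three ways:
#   split('_')      builds a list of every piece
#   split('_', 1)   stops after the first separator
#   partition('_')  always returns (head, sep, tail)
heads = [
    (s.split('_')[0], s.split('_', 1)[0], s.partition('_')[0])
    for s in samples
]
print(heads)
# [('geek', 'geek', 'geek'), ('coder', 'coder', 'coder'), ('plain', 'plain', 'plain')]

# partition() also reports whether the separator was found at all
head, sep, tail = 'plain'.partition('_')
print(repr(head), repr(sep), repr(tail))
# 'plain' '' ''
```

All three give the same prefix, so the choice mostly affects how much extra work is done and how a missing separator can be detected.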

Method #3 : Using index() and find() methods

Python3

# Python3 code to demonstrate
# group similar substrings
 
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
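print("The original List is : " + str(test_list))

# collect the prefix before the first underscore of every string
x = []
for i in test_list:
    x.append(i[:i.index("_")])

# keep each prefix once (a set does not preserve order)
x = list(set(x))

# gather the strings that start with each prefix followed by "_"
res = []
for i in x:
    a = []
    for j in test_list:
        # find() returns 0 only when the match is at the start of the string
        if j.find(i + "_") == 0:
            a.append(j)
    res.append(a)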
 
# printing result
print ("The grouped list is : " + str(res))

Output

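The original List is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]

The order of the groups may differ between runs because the prefixes pass through a set.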

Time complexity: O(n^2), where ‘n’ is the length of the input list ‘test_list’.
Auxiliary space: O(n), where ‘n’ is the length of the input list ‘test_list’.
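
Method #3 builds its prefix list through a set, so the group order is not fixed. If a stable order is wanted, one possible variant (a sketch, not the method above) deduplicates with dict.fromkeys, which keeps the first occurrence of each key in insertion order. It still assumes every string contains an underscore, since index() raises ValueError otherwise.

```python
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# dict.fromkeys keeps each prefix once, in order of first appearance
prefixes = list(dict.fromkeys(s[:s.index('_')] for s in test_list))

# find(...) == 0 means the match starts at position 0, i.e. it is a prefix
res = [[s for s in test_list if s.find(p + '_') == 0] for p in prefixes]

print(prefixes)
# ['geek', 'coder', 'pro']
print(res)
# [['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]
```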

Method #4 : Using startswith()

Python3

# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
 
# printing the original list
print("The original list is : " + str(test_list))
 
# using startswith in a list comprehension
# prefix + "_" keeps 'pro' from also matching strings such as 'program_1'
res = [[item for item in test_list if item.startswith(prefix + "_")]
       for prefix in {item[:item.index("_")] for item in test_list}]
 
# printing result
print("The grouped list is : " + str(res))
#This code is contributed by Edula Vinay Kumar Reddy

Output

The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]

Time complexity: O(n*u), where n is the length of test_list and u is the number of unique prefixes, because the list is scanned once for every prefix. In the worst case, where every string has its own prefix, this is O(n^2).
Auxiliary space: O(n), for the set of unique prefixes and the grouped list.
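
When strings are selected by prefix, it matters whether the test includes the delimiter. The sketch below uses a made-up list (words) to compare a substring test, a bare startswith() test, and a startswith() test that includes the underscore.

```python
words = ['pro_3', 'program_1', 'improve_2']  # illustrative data
prefix = 'pro'

# Substring test: matches 'pro' anywhere in the string
anywhere = [w for w in words if w.find(prefix) != -1]

# Bare prefix test: matches any string that begins with 'pro'
bare_prefix = [w for w in words if w.startswith(prefix)]

# Prefix plus delimiter: matches only strings whose prefix is exactly 'pro'
with_delimiter = [w for w in words if w.startswith(prefix + '_')]

print(anywhere)        # ['pro_3', 'program_1', 'improve_2']
print(bare_prefix)     # ['pro_3', 'program_1']
print(with_delimiter)  # ['pro_3']
```

With the sample list used in this article all three tests happen to agree, which is why the difference is easy to miss.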

Method #5: Using a dictionary to group similar substrings

Use a dictionary to group the substrings that have the same prefix. The keys of the dictionary will be the prefixes, and the values will be lists containing the substrings with that prefix. Here’s an example implementation:

Python3

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
grouped = {}
for s in test_list:
    prefix = s.split('_')[0]
    if prefix not in grouped:
        grouped[prefix] = []
    grouped[prefix].append(s)
 
res = list(grouped.values())
print(res)

Output

[['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]

Time complexity: O(n*k), where n is the length of the input list and k is the maximum length of the prefix.
Auxiliary space: O(n*k) in the worst case: the value lists together hold one reference to each of the n input strings, and the keys hold up to n distinct prefixes of length at most k.
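
The "create the list if the key is missing" step can also be delegated to collections.defaultdict. This is a sketch of an equivalent variant, not a replacement for the code above. It uses partition() for the prefix, which is a choice made here for illustration.

```python
from collections import defaultdict

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# defaultdict(list) creates an empty list the first time a key is accessed
grouped = defaultdict(list)
for s in test_list:
    grouped[s.partition('_')[0]].append(s)

res = list(grouped.values())
print(res)
# [['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]
```

Because dictionaries preserve insertion order (Python 3.7+), the groups appear in the order in which each prefix is first seen, and no sorting is required.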

Method #6: Using a loop and a dictionary

Step-by-step approach:

Initialize the list of strings.
Create an empty dictionary to store the groups.
Iterate over each string in the list.
Extract the substring before the underscore using the split() method.
Check if the key exists in the dictionary. If it does, append the string to the list under the key. If it doesn’t, create a new list with the string under the key.
Convert the dictionary to a list of lists using the values() method.
Print the original list and the grouped list.

Python3

# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
 
# creating an empty dictionary
d = {}
 
# iterating over each string in the list
for s in test_list:
    # extracting the substring before the underscore
    key = s.split('_')[0]
    # adding the string to the dictionary under the key
    if key in d:
        d[key].append(s)
    else:
        d[key] = [s]
 
# converting the dictionary to a list of lists
res = list(d.values())
 
# printing the original list
print("The original list is : " + str(test_list))
 
# printing the result
print("The grouped list is : " + str(res))

Output

The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]

Time complexity: This approach has a time complexity of O(n), where n is the number of strings in the list. The loop iterates over each string in the list once, and the time complexity of dictionary operations is usually considered to be constant time.
Auxiliary space: O(n). The dictionary holds one list entry per input string, so the groups together store n references regardless of how the strings are distributed, plus one key per distinct prefix.
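
The steps above can be wrapped in a small reusable function. The sketch below is one way to do that. The function name group_by_prefix and its sep parameter are hypothetical, and the extra 'solo' entry is added only to show what happens when a string has no separator.

```python
def group_by_prefix(items, sep='_'):
    """Group strings by the text before the first `sep` (hypothetical helper)."""
    groups = {}
    for item in items:
        # partition() returns the whole string as the head if sep is absent
        key = item.partition(sep)[0]
        # setdefault() inserts an empty list for a new key, then returns it
        groups.setdefault(key, []).append(item)
    return groups

groups = group_by_prefix(['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3', 'solo'])
print(groups)
# {'geek': ['geek_1', 'geek_4'], 'coder': ['coder_2', 'coder_3'], 'pro': ['pro_3'], 'solo': ['solo']}
print(list(groups.values()))
```

Returning the dictionary rather than only its values keeps the prefixes available to the caller. list(groups.values()) gives the same list-of-lists shape used elsewhere in this article.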

Method #7: Using numpy

Algorithm:

Initialize the input list test_list.
Get the unique prefixes from the input list using np.unique.
Group the elements in test_list by prefix using a list comprehension.
Print the grouped list res.

Python3

import numpy as np
 
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# printing the original list
print("The original list is : " + str(test_list))
# Get unique prefixes
prefixes = np.unique([item.split('_')[0] for item in test_list])
 
# Group elements by prefix
# prefix + '_' keeps 'pro' from also matching strings such as 'program_1'
res = [[item for item in test_list if item.startswith(prefix + '_')]
       for prefix in prefixes]
 
# printing result
print("The grouped list is : " + str(res))
 
#This code is contributed by Jyothi pinjala.

Output:

The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]

Time complexity: O(n^2) in the worst case, where n is the length of the input list. np.unique typically sorts its input, which costs O(n log n), and the list comprehension scans test_list once per unique prefix, which is O(n^2) when every prefix is distinct.
Auxiliary space: O(n), for the prefixes and the grouped list.
