Python | Tokenizing strings in list of strings
Last Updated : 02 Jan, 2023
Sometimes, while working with data, we need to tokenize each string in a list of strings that we receive as input. This has use cases in many Machine Learning applications. Let's discuss certain ways in which this can be done.
Method #1 : Using list comprehension + split()
We can achieve this task by using a list comprehension to traverse each string in the list, while the split() function performs the tokenization. Called with no arguments, split() breaks a string on runs of whitespace.
# Initialize the list of strings
test_list = ['Geeks for Geeks', 'is', 'best computer science portal']
print("The original list : " + str(test_list))
# Tokenize each string on whitespace with a list comprehension
res = [sub.split() for sub in test_list]
print("The list after split of strings is : " + str(res))
Output :
The original list : ['Geeks for Geeks', 'is', 'best computer science portal']
The list after split of strings is : [['Geeks', 'for', 'Geeks'], ['is'], ['best', 'computer', 'science', 'portal']]
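The sample strings above are cleanly spaced, so it may help to see how the same comprehension behaves on messier input. The sketch below is illustrative only: the helper name tokenize_all, its sep parameter, and the messy list are assumptions that do not appear in the original example.

```python
def tokenize_all(strings, sep=None):
    """Split every string in `strings` on `sep`.

    With sep=None, str.split() splits on runs of whitespace and
    ignores leading/trailing whitespace.
    """
    return [s.split(sep) for s in strings]


# Hypothetical input with extra spaces and an empty string
messy = ['  Geeks  for Geeks ', 'is', '']

# Whitespace runs collapse; the empty string yields an empty token list
print(tokenize_all(messy))
# [['Geeks', 'for', 'Geeks'], ['is'], []]

# With an explicit ' ' separator every single space is a boundary,
# so empty tokens appear around repeated or edge spaces
print(tokenize_all(messy, ' '))
# [['', '', 'Geeks', '', 'for', 'Geeks', ''], ['is'], ['']]
```

For whitespace-separated text, calling split() with no arguments is usually the safer default, since it avoids the empty tokens shown in the second call.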
Method #2 : Using map() + split()
This is another way to solve the task. It performs the same work as above, but uses the map() function to apply the split logic to every string in the list. Since map() returns an iterator in Python 3, list() is used to collect the results.
# Initialize the list of strings
test_list = ['Geeks for Geeks', 'is', 'best computer science portal']
print("The original list : " + str(test_list))
# Apply str.split to every string, then collect the results into a list
res = list(map(str.split, test_list))
print("The list after split of strings is : " + str(res))
Output :
The original list : ['Geeks for Geeks', 'is', 'best computer science portal']
The list after split of strings is : [['Geeks', 'for', 'Geeks'], ['is'], ['best', 'computer', 'science', 'portal']]
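Two details of the map() approach are easy to miss: the result is evaluated lazily, and str.split on its own cannot carry a custom separator. The following sketch shows one way to handle both. The comma-separated records list and the variable names are assumptions made for illustration.

```python
# Hypothetical comma-separated input (not part of the original example)
records = ['Geeks,for,Geeks', 'is', 'best,computer']

# A lambda lets map() pass a separator to split()
lazy_tokens = map(lambda s: s.split(','), records)

# Nothing has been split yet; each item is tokenized only when requested
first = next(lazy_tokens)
print(first)   # ['Geeks', 'for', 'Geeks']

# list() consumes whatever is left in the iterator
rest = list(lazy_tokens)
print(rest)    # [['is'], ['best', 'computer']]

# The iterator is now exhausted, so a second pass yields nothing
print(list(lazy_tokens))   # []
```

Because a map object can be traversed only once, wrapping it in list() right away, as the method above does, is the simplest choice when the tokens are needed more than once.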
Method #3 : Using re
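The re module can also tokenize the strings in a list: re.split() splits each string wherever the given pattern matches. Here the pattern is a single space, so unlike split() with no arguments, consecutive spaces in the input would produce empty tokens.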
import re
# Initialize the list of strings
test_list = ['Geeks for Geeks', 'is', 'best computer science portal']
print("Original list:", test_list)
# Split each string at every single space using re.split()
res = [re.split(' ', s) for s in test_list]
print("Tokenized list:", res)
Output :
Original list: ['Geeks for Geeks', 'is', 'best computer science portal']
Tokenized list: [['Geeks', 'for', 'Geeks'], ['is'], ['best', 'computer', 'science', 'portal']]
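The regex approach becomes more useful once the pattern does something split() cannot. As a rough sketch, the code below compares three patterns on input that contains punctuation and a double space. The sample list and the variable names are assumptions chosen for illustration, and \w+ is only one possible definition of a word.

```python
import re

# Hypothetical input with punctuation and a double space
sample = ['Geeks, for  Geeks!', 'is']

# Pattern ' ' treats every single space as a boundary (empty token appears)
single_space = [re.split(' ', s) for s in sample]
print(single_space)     # [['Geeks,', 'for', '', 'Geeks!'], ['is']]

# Pattern r'\s+' treats each run of whitespace as one boundary
whitespace_runs = [re.split(r'\s+', s.strip()) for s in sample]
print(whitespace_runs)  # [['Geeks,', 'for', 'Geeks!'], ['is']]

# re.findall(r'\w+') keeps only runs of word characters, dropping punctuation
word_pattern = re.compile(r'\w+')
words_only = [word_pattern.findall(s) for s in sample]
print(words_only)       # [['Geeks', 'for', 'Geeks'], ['is']]
```

Which pattern fits depends on whether punctuation should stay attached to the tokens. For plain whitespace tokenization, the split() methods above are simpler.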