
Python – Remove Strings Containing Non-English Characters from a List

Given a list of strings, remove every string that contains non-English characters.

Input : test_list = ['Good| ????', '??Geeks???']
Output : []
Explanation : Both contain non-English characters

Input : test_list = ["Gfg", "Best"]
Output : ["Gfg", "Best"]
Explanation : Both are valid English words.

Method #1 : Using regex + findall() + list comprehension



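In this, we build a regex character class that matches any character outside the allowed Unicode ranges (U+0000 to U+05C0 and U+2100 to U+214F). A list comprehension then keeps only the strings in which findall() finds no such character. Note that the first range covers more than English letters: it also admits digits, punctuation, and other scripts such as Greek and Cyrillic.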

Below is the implementation of the above approach:




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
# Using regex + findall() + list comprehension
import re
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# keep only strings in which findall() finds no character outside the allowed ranges
res = [idx for idx in test_list if not re.findall("[^\u0000-\u05C0\u2100-\u214F]+", idx)]
 
# printing result
print("The extracted list : " + str(res))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good| ????', 'for', '??Geeks???']

Time complexity: O(n*k), where n is the length of the input list and k is the average length of the strings in the list.
Auxiliary space: O(m), where m is the length of the output list.
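
Because the sample strings above contain only `?` characters, which fall inside the allowed range, nothing is filtered out in that run. The following is a minimal sketch of the same character class applied to strings that do contain out-of-range characters. The helper name `keep_allowed` and the sample data are assumptions made for illustration, not part of the original example.

```python
import re

# Same character class as Method #1: matches any character outside
# U+0000-U+05C0 and U+2100-U+214F.
OUT_OF_RANGE = re.compile("[^\u0000-\u05C0\u2100-\u214F]")


def keep_allowed(strings):
    """Keep only the strings with no character outside the allowed ranges."""
    return [s for s in strings if OUT_OF_RANGE.search(s) is None]


# Hypothetical sample data: two entries contain CJK characters (written as escapes).
sample = ['Gfg', 'Good| \u4f60\u597d', 'for', '\u65e5\u672cGeeks\u8a9e']
print(keep_allowed(sample))  # ['Gfg', 'for']
```

A Cyrillic word would still pass this filter, since Cyrillic lies inside U+0000 to U+05C0; a stricter range is needed if only English letters should be accepted.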

Method #2 : Using regex + search() + filter() + lambda

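In this, we use re.search() to look for a run of English letters or whitespace in each string and keep the strings where a match is found; filter() with a lambda applies that test to every element. Note that search() succeeds if the pattern occurs anywhere in the string, so a string that mixes English letters with other characters is still kept, as the output below shows.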




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
# Using regex + search() + filter() + lambda
import re
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# using search() to get only those strings with alphabets
res = list(filter(lambda ele: re.search(r"[a-zA-Z\s]+", ele) is not None, test_list))
 
# printing result
print("The extracted list : " + str(res))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good| ????', 'for', '??Geeks???']

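Time Complexity: O(n*k), where n is the number of strings in the list and k is the average string length, since search() may scan each string.
Auxiliary Space: O(n), for the result list.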
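
Since re.search() accepts a string as soon as one letter or space appears anywhere in it, a stricter variant may be closer to the stated goal. The sketch below swaps in re.fullmatch(), which requires the whole string to consist of English letters and whitespace. The function name `only_english` and the sample data are hypothetical.

```python
import re

# The whole string must consist of English letters and whitespace.
ENGLISH_ONLY = re.compile(r"[a-zA-Z\s]+")


def only_english(strings):
    """Keep strings made up entirely of English letters and whitespace."""
    return [s for s in strings if ENGLISH_ONLY.fullmatch(s)]


# Hypothetical sample data; the second entry contains CJK characters and '|'.
sample = ['Gfg', 'Good| \u4f60\u597d', 'for', 'Best day']
print(only_english(sample))  # ['Gfg', 'for', 'Best day']
```

With this variant, any punctuation or digit also disqualifies a string; widen the character class if those should be allowed.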

Method #3: Using for loop

Unlike the two regex methods, this approach does not drop whole strings. It keeps every string and strips out the characters that are not English letters.




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
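# all English letters, lowercase and uppercase
loweralphabets = "abcdefghijklmnopqrstuvwxyz"
upperalphabets = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
x = loweralphabets + upperalphabets
res = []
for i in test_list:
    # rebuild each string from its English letters only
    a = ""
    for j in i:
        if j in x:
            a += j
    res.append(a)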
             
# printing result
print("The extracted list : " + str(res))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time complexity: O(n*m), where n is the length of the input list and m is the maximum length of a string in the list.
Auxiliary space: O(n*m), as we are creating a new list to store the filtered strings.
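
The same loop can be written more compactly. This is a sketch, assuming string.ascii_letters as the set of English letters and a hypothetical helper name `strip_non_english`; it builds each result with str.join() instead of repeated concatenation.

```python
import string

# Set lookup is O(1) per character; ascii_letters is a-z followed by A-Z.
ENGLISH_LETTERS = set(string.ascii_letters)


def strip_non_english(strings):
    """Return each string with every non-English-letter character removed."""
    return ["".join(ch for ch in s if ch in ENGLISH_LETTERS) for s in strings]


test_list = ['Gfg', 'Good| ????', 'for', '??Geeks???']
print(strip_non_english(test_list))  # ['Gfg', 'Good', 'for', 'Geeks']
```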

Method #4: Using the unicodedata library

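Step-by-step approach:

  1. Import the unicodedata module.
  2. Define a function is_english(c) that returns True when the character is alphabetic and its Unicode name, as returned by unicodedata.name(), identifies it as a Latin letter.
  3. Define a function remove_non_english(lst) that filters each string with is_english() and joins the remaining characters into a new string.
  4. Call remove_non_english() on test_list and print the original and extracted lists.

Below is the implementation of the above approach: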
 




import unicodedata
 
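def is_english(c):
    # Latin letters only; name() gets a default so unnamed characters do not raise.
    # Note: accented letters are also named LATIN ..., so they are kept.
    return c.isalpha() and unicodedata.name(c, '').startswith('LATIN')
 
def remove_non_english(lst):
    output = []
    for s in lst:
        # keep only the characters that pass is_english()
        english_str = ''.join(filter(is_english, s))
        output.append(english_str)
    return output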
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# printing result
print("The extracted list : " + str(remove_non_english(test_list)))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time complexity: O(nk) where n is the length of the input list and k is the length of the longest string in the input list. 
Auxiliary space: O(nk) since we are storing the filtered strings in the output list.
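
A check based on the Unicode name accepts every Latin letter, including accented ones, which may or may not count as "English" for a given use case. The sketch below contrasts that check with a stricter ASCII-only one; both function names are hypothetical and chosen for illustration.

```python
import unicodedata


def is_latin_letter(c):
    """True for any alphabetic character whose Unicode name starts with LATIN."""
    return c.isalpha() and unicodedata.name(c, "").startswith("LATIN")


def is_ascii_letter(c):
    """True only for the 52 unaccented letters a-z and A-Z."""
    return c.isascii() and c.isalpha()


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(is_latin_letter(e_acute))  # True
print(is_ascii_letter(e_acute))  # False
print(is_latin_letter("G"), is_ascii_letter("G"))  # True True
```

Note that str.isascii() requires Python 3.7 or later.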

Method #5: Using the ord() function

Use the ord() function to determine whether a character is an English letter. English letters have ASCII codes from 65 to 90 for uppercase and from 97 to 122 for lowercase.

Here’s the step-by-step approach:

  1. Define a function is_english(c) that takes a character as input and returns True if the character is an English letter and False otherwise. We use the ord() function to get the character's code and compare it with the ASCII ranges of the English letters.
  2. Define a function remove_non_english(lst) that takes a list of strings as input and returns a list of strings with non-English characters removed. We can iterate through each string in the input list and iterate through each character in the string. If a character is English, we add it to a new string. If not, we skip it. We append the new string to an output list.
  3. Initialize a list test_list with some sample input strings.
  4. Call the remove_non_english() function with the test_list as input.
  5. Print the original and extracted lists.




def is_english(c):
    ascii_value = ord(c)
    # 65-90 are 'A'-'Z', 97-122 are 'a'-'z'
    return 65 <= ascii_value <= 90 or 97 <= ascii_value <= 122
 
def remove_non_english(lst):
    output = []
    for s in lst:
        english_str = ""
        for c in s:
            if is_english(c):
                english_str += c
        output.append(english_str)
    return output
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# printing result
print("The extracted list : " + str(remove_non_english(test_list)))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time Complexity: O(n*m), where n is the number of strings in the input list and m is the length of the longest string in the list.
Auxiliary Space: O(n*m), where n is the number of strings in the input list and m is the length of the longest string in the list. 
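
As a quick sanity check on the two ranges, the sketch below compares the characters accepted by an ord()-based test against string.ascii_letters. It assumes the same is_english() logic as above and is meant only as a worked verification.

```python
import string


def is_english(c):
    """True if c is an ASCII letter, judged by its code point."""
    code = ord(c)
    return 65 <= code <= 90 or 97 <= code <= 122


# Collect every ASCII character the test accepts and compare with the
# standard library's list of English letters.
accepted = {chr(i) for i in range(128) if is_english(chr(i))}
print(accepted == set(string.ascii_letters))  # True
```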

Method #6: Using the translate() method

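Step-by-step approach:

  1. Initialize test_list with the sample strings.
  2. Build a translation table with str.maketrans("", "", chars), where the third argument lists the characters to delete: digits, ASCII punctuation, and the Latin-1 symbols and accented letters.
  3. Apply translate() with that table to each string and append the result to a new list.
  4. Print the original and extracted lists.

Because the table only lists characters to delete, anything not listed is left in place, for example the space in 'Good ' or characters from other scripts.

Below is the implementation of the above approach: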




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# create a translation table to remove non-English characters
non_english = str.maketrans("", "", "0123456789!@#$%^&*()_+-=[]{}\\|;:'\",./<>?`~¡¢£¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ")
 
# initialize an empty list to store modified strings
result = []
 
# iterate over each string in the test_list
for string in test_list:
    # apply the translation table to remove non-English characters
    modified_string = string.translate(non_english)
    # append the modified string to the result list
    result.append(modified_string)
 
# print the resulting list
print("The extracted list : " + str(result))

Output
The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good ', 'for', 'Geeks']

Time complexity: O(n*m), where n is the length of the test_list and m is the maximum length of a string in the list. 
Auxiliary space: O(n*m), where n is the length of the test_list and m is the maximum length of a string in the list. 
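
A hand-written list of characters to delete cannot anticipate every non-English character. One possible alternative, sketched here under the assumption that only the letters in string.ascii_letters should survive, is to build the deletion table from the characters actually present in the input. The function name `strip_with_translate` and the sample data are hypothetical.

```python
import string


def strip_with_translate(strings):
    """Delete every character that is not an ASCII letter, using translate()."""
    allowed = set(string.ascii_letters)
    # Collect each distinct character in the input that is not allowed.
    to_delete = {ch for s in strings for ch in s if ch not in allowed}
    # The third argument of str.maketrans() lists the characters to remove.
    table = str.maketrans("", "", "".join(to_delete))
    return [s.translate(table) for s in strings]


# Hypothetical sample data; the second entry contains CJK characters.
sample = ['Gfg', 'Good| \u4f60\u597d', 'for', '??Geeks???']
print(strip_with_translate(sample))  # ['Gfg', 'Good', 'for', 'Geeks']
```

This version also removes spaces and digits, since they are not in the allowed set; add them to `allowed` if they should be kept.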

