Pandas is a data analysis library for Python that has grown rapidly in popularity in recent years. In technical terms, pandas provides in-memory, table-like data structures with SQL-like constructs, basic statistical and analytical support, and graphing capability. One common task in data analysis is searching for substrings within a dataset, and pandas offers several tools to accomplish this.
In this article, we will explore the ways by which we can check for a substring in a Pandas DataFrame column.
Check for a Substring in a DataFrame Column
Below are some of the ways to check for a substring in a Pandas DataFrame column in Python:
- Using str.contains() method
- Using Regular Expressions
- Using apply() function
- Using List Comprehension with ‘in’ Operator
Check For A Substring In A Pandas Dataframe Using str.contains() Method
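In this example, a pandas DataFrame is created with employee information. A new column, ‘NameContainsSubstring’, is added, indicating whether the substring ‘an’ is present in each ‘Name’ entry using the str.contains() method. By default, str.contains() is case-sensitive and treats the pattern as a regular expression; a plain string such as ‘an’ matches literally because it contains no special characters.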
import pandas as pd

data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Aman', 'Bhavna', 'Madhav', 'Rohan'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 90000, 65000]
}
df = pd.DataFrame(data)

# Check for the substring 'an' in the 'Name' column
substring = 'an'
df['NameContainsSubstring'] = df['Name'].str.contains(substring)
# Keep only the rows where the substring was found
filtered_df = df[df['NameContainsSubstring']]
print(filtered_df)
Output:
   EmployeeID   Name  Department  Salary  NameContainsSubstring
0         101   Aman          HR   60000                   True
3         104  Rohan   Marketing   65000                   True
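The str.contains() method also accepts optional arguments that change how the match is performed. The following is a minimal sketch, assuming a small hypothetical Series (the names in it, including the missing value and the entry with a dot, are invented for illustration). It shows the case, na, and regex parameters of str.contains().

```python
import pandas as pd

# Hypothetical sample data: one missing value and one name containing a dot
names = pd.Series(['Aman', 'ANITA', None, 'R.ohan'], name='Name')

# case=False ignores letter case, na=False treats missing values as "no match",
# regex=False forces a literal (non-regex) comparison
mask_ci = names.str.contains('an', case=False, na=False, regex=False)

# With regex=False, '.' is a literal dot rather than "any character"
mask_dot = names.str.contains('.', regex=False, na=False)

print(mask_ci.tolist())   # [True, True, False, True]
print(mask_dot.tolist())  # [False, False, False, True]
```

Passing na=False keeps the result a clean Boolean mask, which is convenient when it is used directly to filter a DataFrame.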
Check For A Substring In A Pandas Dataframe Using Regular Expressions
In this example, a pandas DataFrame is created with employee information. A new column, ‘NameContainsPattern’, is added, indicating whether the regular expression pattern ‘ma(?!$)’ matches anywhere in each ‘Name’ entry. The str.contains() method is called with regex=True (the default, stated here explicitly) so that the pattern is interpreted as a regular expression. The negative lookahead (?!$) ensures that ‘ma’ is not immediately followed by the end of the string, so a name ending in ‘ma’ would not match.
import pandas as pd

data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['aman', 'bhavna', 'madhav', 'rohan'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 90000, 65000]
}
df = pd.DataFrame(data)

# Regular expression pattern with a negative lookahead:
# 'ma' must not be the last two characters of the string
pattern = r'ma(?!$)'
df['NameContainsPattern'] = df['Name'].str.contains(pattern, regex=True)
# Keep only the rows where the pattern matched
filtered_df = df[df['NameContainsPattern']]
print(filtered_df)
Output:
   EmployeeID    Name  Department  Salary  NameContainsPattern
0         101    aman          HR   60000                 True
2         103  madhav     Finance   90000                 True
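Regular expressions allow more than a lookahead. As a rough sketch, assuming a hypothetical Series of names and a hypothetical user_input string, the example below shows an anchored pattern, the flags argument of str.contains(), and re.escape() from the standard library for cases where the search text should be taken literally.

```python
import re
import pandas as pd

# Hypothetical sample data
names = pd.Series(['aman', 'bhavna', 'madhav', 'rohan', 'Mahesh'])

# '^ma' matches only names that start with 'ma' (case-sensitive)
starts_with_ma = names.str.contains(r'^ma', regex=True)

# The same pattern, ignoring case through the flags argument
starts_with_ma_ci = names.str.contains(r'^ma', flags=re.IGNORECASE, regex=True)

# Hypothetical user-supplied text containing a regex metacharacter
user_input = 'a.n'
# Unescaped, '.' matches any character, so 'avn' in 'bhavna' is a match
unescaped = names.str.contains(user_input, regex=True)
# re.escape() makes the dot literal, so nothing matches here
escaped = names.str.contains(re.escape(user_input), regex=True)

print(starts_with_ma.tolist())     # [False, False, True, False, False]
print(starts_with_ma_ci.tolist())  # [False, False, True, False, True]
print(unescaped.tolist())          # [False, True, False, False, False]
print(escaped.tolist())            # [False, False, False, False, False]
```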
Check For A Substring In A Pandas Dataframe Using apply() function
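In this example, a pandas DataFrame is created with employee information, including ‘EmployeeID’, ‘Name’, ‘Department’, and ‘Salary’. A new column, ‘NameContainsSubstring’, is added, indicating whether the substring ‘av’ is present in each ‘Name’ entry using the apply() method with a lambda function. Unlike str.contains(), the ‘in’ operator always performs a literal, case-sensitive match, and it raises a TypeError if a value in the column is missing.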
import pandas as pd

# Create a 4-column DataFrame
data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Aman', 'Bhavna', 'Madhav', 'Rohan'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 90000, 65000]
}
df = pd.DataFrame(data)

# Check for the substring 'av' in the 'Name' column and add a new column
substring = 'av'
df['NameContainsSubstring'] = df['Name'].apply(lambda x: substring in x)
# Keep only the rows where the substring was found
filtered_df = df[df['NameContainsSubstring']]
print(filtered_df)
Output:
   EmployeeID    Name  Department  Salary  NameContainsSubstring
1         102  Bhavna          IT   75000                   True
2         103  Madhav     Finance   90000                   True
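When a column may contain missing values, the function passed to apply() has to guard against them, since the ‘in’ operator does not work on non-string values. One possible approach, sketched below with a hypothetical helper named contains_substring and invented sample data, is to check the type before testing for the substring.

```python
import pandas as pd

# Hypothetical sample data with a missing name
names = pd.Series(['Aman', None, 'Madhav'])

def contains_substring(value, substring='av'):
    # Hypothetical helper: anything that is not a string counts as "no match"
    return isinstance(value, str) and substring in value

mask = names.apply(contains_substring)
print(mask.tolist())  # [False, False, True]
```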
Check For A Substring In A Pandas Dataframe Using List Comprehension with ‘in’ Operator
In this example, a list comprehension with the ‘in’ operator checks whether the substring ‘Finance’ is present in each ‘Department’ entry.
import pandas as pd

data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Aman', 'Bhavna', 'Madhav', 'Rohan'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 90000, 65000]
}
df = pd.DataFrame(data)

# Check for the substring 'Finance' in the 'Department' column
substring = 'Finance'
df['DepartmentContainsSubstring'] = [substring in department for department in df['Department']]
# Keep only the rows where the substring was found
filtered_df = df[df['DepartmentContainsSubstring']]
print(filtered_df)
Output:
   EmployeeID    Name  Department  Salary  DepartmentContainsSubstring
2         103  Madhav     Finance   90000                         True
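The list comprehension approach extends naturally to several search terms. The following sketch assumes a hypothetical list of substrings and a standalone Series of department names; it combines the ‘in’ operator with the built-in any() to flag entries that contain at least one of the terms.

```python
import pandas as pd

# Hypothetical sample data and search terms
departments = pd.Series(['HR', 'IT', 'Finance', 'Marketing'])
substrings = ['Fin', 'Mark']

# True where the department contains at least one of the substrings
mask = [any(s in dept for s in substrings) for dept in departments]
filtered = departments[mask]

print(mask)               # [False, False, True, True]
print(filtered.tolist())  # ['Finance', 'Marketing']
```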