Open In App

Filter pandas DataFrame by substring criteria

Last Updated : 05 Feb, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

Introduction

Pandas is a popular Python library for data analysis and manipulation. The DataFrame is one of the key data structures in Pandas, providing a way to store and work with structured data in a tabular format. DataFrames are useful for organizing and storing data in a consistent format, allowing you to perform operations on the data such as filtering, grouping, and aggregation.

DataFrames can be created from various sources, such as a list of lists, a dictionary, or an existing DataFrame. Here is an example of how to create a DataFrame from a list of dictionaries:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28,
        "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33,
        "Address": "245 Pine Rd,West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25,
        "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
print(df)


Output:

  Name Age Address
0 John Smith 35 123 Main St, New York, NY 10001
1  Jane Doe 28 456 Park Ave, Newark, NJ 70004
2 Joe Schmo  55 789 Broad Way, Jersey City, NJ 07306
3  Sally Smith 42 321 Maple St, Hoboken, NJ 07030
4  Bob Johnson 28 654 Cedar Blvd, Union City, NJ 07087
5 Sue Johnson  29 912 Oak St, Weehawken, NJ 07086
6   Bill Williams  33 245 Pine Rd, West New York, NJ 07093
7  Mary Johnson 25 369 Birch Ave, Guttenberg, NJ 07093
8  Tom Williams 44 159 Willow St, Hoboken, NJ 07030

                                            
Filter the Dataframe

Now suppose you need to find the list of users with the age of 28. You can do the same by writing the following code:

Python3




df = df.loc[df["Age"] == 28]
  
print(df)


Output:

This would filter the DataFrame to only include rows where the “Age” is 28 and produce the following output:

  Name Age Address
1  Jane Doe  28 456 Park Ave, Newark, NJ 70004
4 Bob Johnson 28 654 Cedar Blvd, Union City, NJ 07087

 

This seems pretty easy. So what if we need to find the users with addresses in New York? Since the address column contains information like street, city, zip code, etc. we need to use the substring filter operations for the same

Method 1:  Using loc with str.contains                      

To filter the DataFrame using a substring in the “Address” column, you can use the .loc[] method and specify the desired substring in the filter criteria using the .str.contains() method:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York"
df = df.loc[df["Address"].str.contains("New York")]
  
print(df)


Output:

This would filter the DataFrame to only include rows where the “Address” column contains the substring “New York” and produce the output as follows:

  Name Age Address
0  John Smith 35 123 Main St, New York, NY 10001
6 Bill Williams 33  245 Pine Rd, West New York, NJ 07093

Method 2:  Using `query` method               

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .query() method and specify the desired substring in the filter criteria:       

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28,
        "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33,
        "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25,
        "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York"
df = df.query("Address.str.contains('New York', case=False)")
  
print(df)


Output:

  Name Age Address
0 John Smith 35 123 Main St, New York, NY 10001
6  Bill Williams  33 245 Pine Rd, West New York, NJ 07093

Method 3:  Using `apply` method                      

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .apply() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Define a custom function to check if the Address contains the substring "New York"
def contains_new_york(address):
    return "New York" in address
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York"
df = df[df["Address"].apply(contains_new_york)]
  
print(df)


Output:

  Name Age Address
0 John Smith 35 123 Main St, New York, NY 10001
6  Bill Williams  33 245 Pine Rd, West New York, NJ 07093

Method 4:  Using map method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .map() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Define a custom function to check if the Address contains the substring "New York"
def contains_new_york(address):
    return "New York" in address
  
# Map the custom function to the Address column to create a new column containing the result of the function
df = df[df["Address"].map(contains_new_york)]
  
print(df)


Output:

  Name Age Address
0 John Smith 35 123 Main St, New York, NY 10001
6  Bill Williams  33 245 Pine Rd, West New York, NJ 07093

Method 5:  Using `filter` method                   

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .filter() method and specify the desired substring in the filter criteria using the like operator:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Filter the DataFrame to only include the mentioned columns
df = df.filter(like="Address")
  
print(df)


Output:

Address
123 Main St, New York, NY 10001
456 Park Ave, Newark, NJ 70004
789 Broad Way, Jersey City, NJ 07306
321 Maple St, Hoboken, NJ 07030
654 Cedar Blvd, Union City, NJ 07087
912 Oak St, Weehawken, NJ 07086
245 Pine Rd, West New York, NJ 07093
369 Birch Ave, Guttenberg, NJ 07093
159 Willow St, Hoboken, NJ 07030

 The .filter() method is similar to the .query() method in that it allows you to specify filter criteria. However, the .filter() method only filters columns, whereas the .query() method can filter both columns and rows.

Method 6:  Using `isin` method                 

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .isin() method and specify the desired substring in the filter criteria:

Python3




import pandas as pd
  
# List of dictionaries containing data
data = [
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"},
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"},
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"},
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"},
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"},
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"},
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"},
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"},
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"},
]
  
# Create the DataFrame
df = pd.DataFrame(data)
  
# Filter the DataFrame to only include rows where the Address contains the mentioned string
df = df[df["Address"].isin(["123 Main St, New York, NY 10001"])]
  
print(df)


Output:

  Name Age Address
0 John Smith 35 123 Main St, New York, NY 10001

The .isin() method allows you to specify a list of values that the “Address” column should contain in order for the row to be included in the resulting DataFrame. In this case, the list contains only the value “123 Main St, New York, NY 10001”, so only rows where the “Address” column contains that exact value would be included in the resulting DataFrame.     

To filter a Pandas DataFrame using a substring in any specific column data, you can use one of several methods, including the .loc[], .query(), .filter(), .isin(), .apply(), and .map() methods. The specific method you choose will depend on your personal preference and the specific requirements of your project.



Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads