Cleaning Strings in Pandas


One of the most helpful things about pandas is its collection of methods for working with text data, which is especially handy in Natural Language Processing (NLP).
These string methods are available through the .str accessor and apply to an entire column of data at once.
In this tutorial, we are going to learn how to trim whitespace, replace text in strings, and split strings into columns.

Let's have a look at our DataFrame that contains some string data to work with:

import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Shivansh, Gupta', 'Sonia, Abhel', 'Soumya, Gupta', 'Vasu, Tiwari', 'Aravind, Mishra'],
    'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
    'Location': ['Mumbai', 'New Delhi', 'Hyderabad', 'Bangalore', 'Mumbai'],
    'Favorite Color': ['   green  ', 'red', '  yellow', 'blue', 'purple  ']
})

print(df)

# Returns 

#               Name    Region   Location Favorite Color
# 0  Shivansh, Gupta  Region A     Mumbai        green  
# 1     Sonia, Abhel  Region A  New Delhi            red
# 2    Soumya, Gupta  Region B  Hyderabad         yellow
# 3     Vasu, Tiwari  Region C  Bangalore           blue
# 4  Aravind, Mishra  Region D     Mumbai       purple

We can see that our DataFrame has some messy string data! For example, the Name column holds two pieces of data (first and last name) in a single string, every value in the Region column carries an unnecessary 'Region' prefix, and the Favorite Color column has extra whitespace around the color names.
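Whitespace is hard to spot in printed output, so it can help to look at the raw values directly. The following is a minimal sketch, not part of the original walkthrough: it rebuilds the 'Favorite Color' values as a standalone Series, and the names colors, raw_values, and needs_strip are illustrative only.

```python
import pandas as pd

# Same 'Favorite Color' values as in the DataFrame above (rebuilt here so the
# snippet runs on its own; the variable names are illustrative)
colors = pd.Series(['   green  ', 'red', '  yellow', 'blue', 'purple  '],
                   name='Favorite Color')

# tolist() returns plain Python strings, so the quotes reveal the padding
raw_values = colors.tolist()
print(raw_values)
# ['   green  ', 'red', '  yellow', 'blue', 'purple  ']

# A value has surrounding whitespace if stripping it changes the value
needs_strip = colors != colors.str.strip()
print(needs_strip.tolist())
# [True, False, True, False, True]
```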

Removing White Space in Pandas

We can remove this extra whitespace in pandas. Pandas comes with the .str.lstrip() and .str.rstrip() methods to remove whitespace from the front and back of strings, but because the 'Favorite Color' column has whitespace on both sides, we will use the .str.strip() method, which removes it from both sides at once.

df['Favorite Color'] = df['Favorite Color'].str.strip()

print(df)

# Returns

#               Name    Region   Location Favorite Color
# 0  Shivansh, Gupta  Region A     Mumbai          green
# 1     Sonia, Abhel  Region A  New Delhi            red
# 2    Soumya, Gupta  Region B  Hyderabad         yellow
# 3     Vasu, Tiwari  Region C  Bangalore           blue
# 4  Aravind, Mishra  Region D     Mumbai         purple

As you can see, we have successfully removed the whitespace from Favorite Color.
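For comparison, here is a short sketch of the front-only and back-only methods next to .str.strip(). It uses a standalone Series with the original, untrimmed color values; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Untrimmed color values, rebuilt here so the snippet is self-contained
colors = pd.Series(['   green  ', 'red', '  yellow', 'blue', 'purple  '])

left_trimmed = colors.str.lstrip()   # removes leading whitespace only
right_trimmed = colors.str.rstrip()  # removes trailing whitespace only
both_trimmed = colors.str.strip()    # removes whitespace on both sides

print(left_trimmed.tolist())
# ['green  ', 'red', 'yellow', 'blue', 'purple  ']
print(right_trimmed.tolist())
# ['   green', 'red', '  yellow', 'blue', 'purple']
print(both_trimmed.tolist())
# ['green', 'red', 'yellow', 'blue', 'purple']
```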

Replacing Text using Pandas

In our DataFrame, every value in the Region column starts with the word 'Region', which is unnecessary.

We are going to use the .str.replace() method to replace that prefix with an empty string:

df['Region'] = df['Region'].str.replace('Region ', '')

print(df)

# Returns

#               Name Region   Location Favorite Color
# 0  Shivansh, Gupta      A     Mumbai          green
# 1     Sonia, Abhel      A  New Delhi            red
# 2    Soumya, Gupta      B  Hyderabad         yellow
# 3     Vasu, Tiwari      C  Bangalore           blue
# 4  Aravind, Mishra      D     Mumbai         purple

We have successfully replaced the word Region.
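The default for the regex parameter of .str.replace() has differed between pandas versions, so it can be safer to state the intent explicitly. The sketch below is an illustration under that assumption: it shows a literal replacement with regex=False and an equivalent pattern-based one with regex=True. The Series and variable names are hypothetical.

```python
import pandas as pd

# Region values rebuilt as a standalone Series for illustration
regions = pd.Series(['Region A', 'Region A', 'Region B', 'Region C', 'Region D'])

# Literal replacement: 'Region ' is treated as plain text, not as a pattern
literal = regions.str.replace('Region ', '', regex=False)

# Pattern replacement: remove 'Region' plus any whitespace at the start only
by_pattern = regions.str.replace(r'^Region\s+', '', regex=True)

print(literal.tolist())
# ['A', 'A', 'B', 'C', 'D']
print(by_pattern.tolist())
# ['A', 'A', 'B', 'C', 'D']
```

The literal form is enough for this data; the pattern form may be worth considering when the amount of whitespace after the prefix varies.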

Splitting Strings into Columns in Pandas

We can see in our DataFrame that the 'Name' column holds both the first and last name, separated by a ',' character.

Now we are going to create two new columns for the first and last name and remove the original 'Name' column. Here we use the .str.split() method, which splits each string on the separator we pass as an argument. We also need to pass expand=True to instruct pandas to return the pieces as separate columns instead of a list. From there, we can assign the values to two columns:


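# Repeat the earlier cleaning steps so this snippet also works on the original DataFrame
df['Favorite Color'] = df['Favorite Color'].str.strip()
# Split on ', ' (comma plus space) so the last names do not keep a leading space
df[['First Name', 'Last Name']] = df['Name'].str.split(', ', expand=True)
df['Region'] = df['Region'].str.replace('Region ', '')
df = df.drop(['Name'], axis=1)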
print(df)

# Returns

#   Region   Location Favorite Color First Name Last Name
# 0      A     Mumbai          green   Shivansh     Gupta
# 1      A  New Delhi            red      Sonia     Abhel
# 2      B  Hyderabad         yellow     Soumya     Gupta
# 3      C  Bangalore           blue       Vasu    Tiwari
# 4      D     Mumbai         purple    Aravind    Mishra
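One detail worth checking when the separator is a bare comma: the space that follows the comma stays attached to the second piece, even though right-aligned printed output hides it. The sketch below demonstrates this on a two-row standalone Series and strips the pieces afterwards; the names raw_parts and clean_parts are illustrative.

```python
import pandas as pd

# Two of the names from the tutorial, rebuilt as a standalone Series
names = pd.Series(['Shivansh, Gupta', 'Sonia, Abhel'])

# Splitting on ',' alone leaves the space in front of each last name
raw_parts = names.str.split(',', expand=True)
print(raw_parts[1].tolist())
# [' Gupta', ' Abhel']

# Strip every resulting column, then give the columns readable names
clean_parts = raw_parts.apply(lambda col: col.str.strip())
clean_parts.columns = ['First Name', 'Last Name']
print(clean_parts['Last Name'].tolist())
# ['Gupta', 'Abhel']
```

Stripping after the split also copes with data where the spacing after the comma is inconsistent.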

Along with these string methods, pandas also provides string case methods through the .str accessor:

.str.upper() converts each string to all upper case
.str.lower() converts each string to all lower case
.str.title() converts each string to title case
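A brief sketch of the three case methods follows. The mixed-case city values are made up for illustration and are not part of the tutorial's DataFrame.

```python
import pandas as pd

# Hypothetical mixed-case values, chosen only to show the effect of each method
cities = pd.Series(['new delhi', 'MUMBAI', 'hyderabad'])

upper_case = cities.str.upper()
lower_case = cities.str.lower()
title_case = cities.str.title()

print(upper_case.tolist())
# ['NEW DELHI', 'MUMBAI', 'HYDERABAD']
print(lower_case.tolist())
# ['new delhi', 'mumbai', 'hyderabad']
print(title_case.tolist())
# ['New Delhi', 'Mumbai', 'Hyderabad']
```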
