Cleaning Strings in Pandas
- One of the most helpful things about pandas is its set of methods for working with text data, which is especially useful in Natural Language Processing (NLP).
- These methods are available through the .str accessor, which applies a string method to every value in a column at once.
- In this tutorial, we are going to learn how to trim whitespace, replace text in strings, and split strings into columns.
- Let's have a look at our DataFrame that contains some string data to work with:
    import pandas as pd

    df = pd.DataFrame.from_dict({
        'Name': ['Shivansh, Gupta', 'Sonia, Abhel', 'Soumya, Gupta', 'Vasu, Tiwari', 'Aravind, Mishra'],
        'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
        'Location': ['Mumbai', 'New Delhi', 'Hyderabad', 'Bangalore', 'Mumbai'],
        'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
    })

    print(df)

    # Returns
    #               Name    Region   Location  Favorite Color
    # 0  Shivansh, Gupta  Region A     Mumbai          green 
    # 1     Sonia, Abhel  Region A  New Delhi             red
    # 2    Soumya, Gupta  Region B  Hyderabad          yellow
    # 3     Vasu, Tiwari  Region C  Bangalore            blue
    # 4  Aravind, Mishra  Region D     Mumbai         purple 
We can see that our DataFrame has some messy string data! For example, the Name column holds the first and last name together, separated by a comma; the Region column repeats the unnecessary word 'Region' as a prefix; and the Favorite Color column has extra whitespace around some of the colors.
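Stray whitespace is easy to miss in printed output. One way to make it visible, sketched here on a small standalone Series (the names `colors`, `has_padding`, and `lengths` are illustrative and not part of the tutorial's DataFrame), is to compare each value with its stripped version or to look at the character counts:

```python
import pandas as pd

# Illustrative copy of the 'Favorite Color' values used in this tutorial
colors = pd.Series([' green ', 'red', ' yellow', 'blue', 'purple '])

# True wherever a value has leading or trailing whitespace
has_padding = colors != colors.str.strip()

# Character counts also reveal hidden spaces (' green ' has 7 characters, not 5)
lengths = colors.str.len()

print(has_padding.tolist())
print(lengths.tolist())
```

A check like this can be handy before and after cleaning, to confirm that the padding is really gone.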
Removing White Space in Pandas
- We can remove the extra whitespace from text in pandas. Pandas provides the .str.lstrip() and .str.rstrip() methods to remove whitespace from the left and right side of a string. Because the 'Favorite Color' column has whitespace on both sides, we will use the .str.strip() method, which removes it from both ends:
    df['Favorite Color'] = df['Favorite Color'].str.strip()

    print(df)

    # Returns
    #               Name    Region   Location  Favorite Color
    # 0  Shivansh, Gupta  Region A     Mumbai           green
    # 1     Sonia, Abhel  Region A  New Delhi             red
    # 2    Soumya, Gupta  Region B  Hyderabad          yellow
    # 3     Vasu, Tiwari  Region C  Bangalore            blue
    # 4  Aravind, Mishra  Region D     Mumbai          purple
As you can see, we have successfully removed the whitespace from the Favorite Color column.
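To compare the one-sided and two-sided variants, here is a minimal sketch on a standalone Series (the variable names are illustrative). It assumes the default behavior of each method, which is to remove whitespace when no characters are passed:

```python
import pandas as pd

# Illustrative copy of the untrimmed 'Favorite Color' values
colors = pd.Series([' green ', 'red', ' yellow', 'blue', 'purple '])

# Remove whitespace from the left side only
left_trimmed = colors.str.lstrip()

# Remove whitespace from the right side only
right_trimmed = colors.str.rstrip()

# Remove whitespace from both sides
both_trimmed = colors.str.strip()

print(left_trimmed.tolist())
print(right_trimmed.tolist())
print(both_trimmed.tolist())
```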
Replacing Text using Pandas
- In our DataFrame, every value in the Region column starts with the word 'Region', which is unnecessary.
- We are going to use the .str.replace() method to replace 'Region ' (including the trailing space) with an empty string:
    df['Region'] = df['Region'].str.replace('Region ', '')

    print(df)

    # Returns
    #               Name  Region   Location  Favorite Color
    # 0  Shivansh, Gupta       A     Mumbai           green
    # 1     Sonia, Abhel       A  New Delhi             red
    # 2    Soumya, Gupta       B  Hyderabad          yellow
    # 3     Vasu, Tiwari       C  Bangalore            blue
    # 4  Aravind, Mishra       D     Mumbai          purple
We have successfully removed the word 'Region' from the Region column.
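Whether .str.replace() treats the pattern as a regular expression by default has differed between pandas versions, so it can be safer to pass the regex argument explicitly. The sketch below, on a standalone Series with illustrative names, shows a literal replacement and a regular-expression variant that only matches the word at the start of the value:

```python
import pandas as pd

# Illustrative copy of the 'Region' values used in this tutorial
regions = pd.Series(['Region A', 'Region A', 'Region B', 'Region C', 'Region D'])

# Literal replacement: the pattern is matched character for character
literal = regions.str.replace('Region ', '', regex=False)

# Regex replacement: '^' anchors the match to the start of each string,
# and '\s+' matches one or more whitespace characters after the word
anchored = regions.str.replace(r'^Region\s+', '', regex=True)

print(literal.tolist())
print(anchored.tolist())
```

For this data both calls give the same result; the anchored version would only matter if 'Region ' could also appear later in a value.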
Splitting Strings into Columns in Pandas
- We can see in our DataFrame that the 'Name' column holds both the first and last name, separated by a comma and a space.
- Now, we are going to create two new columns for the first and last name and remove the original 'Name' column. Here we use the .str.split() method, which splits each string on the separator we pass as an argument. We split on ', ' (comma followed by a space) so that the last name does not keep a leading space, and we pass expand=True to instruct pandas to return the pieces as separate columns. From there, we can assign the values to two columns:
    # The first and third lines repeat the earlier cleaning steps,
    # so this block also works on the original DataFrame
    df['Favorite Color'] = df['Favorite Color'].str.strip()
    df[['First Name', 'Last Name']] = df['Name'].str.split(', ', expand=True)
    df['Region'] = df['Region'].str.replace('Region ', '')
    df = df.drop(['Name'], axis=1)

    print(df)

    # Returns
    #    Region   Location  Favorite Color  First Name  Last Name
    # 0       A     Mumbai           green    Shivansh      Gupta
    # 1       A  New Delhi             red       Sonia      Abhel
    # 2       B  Hyderabad          yellow      Soumya      Gupta
    # 3       C  Bangalore            blue        Vasu     Tiwari
    # 4       D     Mumbai          purple     Aravind     Mishra
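If the spacing after the comma is not consistent, one possible approach is to split on the comma alone and then strip each resulting column. The following is a minimal sketch under that assumption; the sample names (including the one without a space) and the variable names are illustrative:

```python
import pandas as pd

# Illustrative names; the last one has no space after the comma
names = pd.Series(['Shivansh, Gupta', 'Sonia, Abhel', 'Soumya,Gupta'])

# n=1 splits on the first comma only; expand=True returns a DataFrame
parts = names.str.split(',', n=1, expand=True)
parts.columns = ['First Name', 'Last Name']

# Strip leftover whitespace from every column
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Limiting the split with n=1 keeps the result at two columns even if a value happens to contain more than one comma.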
Along with these string methods, we can also use string case methods in pandas through the same .str accessor, such as:
- .upper() will convert a string to all upper case
- .lower() will convert a string to all lower case
- .title() will convert a string to title case
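As a short sketch of the case methods, here they are applied to a standalone Series with deliberately inconsistent casing (the sample values and variable names are illustrative):

```python
import pandas as pd

# Illustrative city names with inconsistent casing
cities = pd.Series(['mumbai', 'New Delhi', 'HYDERABAD'])

upper = cities.str.upper()   # every letter upper case
lower = cities.str.lower()   # every letter lower case
title = cities.str.title()   # first letter of each word upper case

print(upper.tolist())
print(lower.tolist())
print(title.tolist())
```

Normalizing case like this is often useful before comparing or grouping text values.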