Cleaning Strings in Pandas

  • One of the most helpful things about pandas is its set of methods for working with text data, which is useful in tasks such as Natural Language Processing (NLP).
  • These methods are available through the .str accessor, which applies a string method to every value in a column at once.
  • In this tutorial, we are going to learn how to trim whitespace, replace text in strings, and split strings into columns.
  • Let's have a look at a DataFrame that contains some string data to work with:
    import pandas as pd
    
    df = pd.DataFrame.from_dict({
        'Name': ['Shivansh, Gupta', 'Sonia, Abhel', 'Soumya, Gupta', 'Vasu, Tiwari', 'Aravind, Mishra'],
        'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
        'Location': ['Mumbai', 'New Delhi', 'Hyderabad', 'Bangalore', 'Mumbai'],
        'Favorite Color': ['   green  ', 'red', '  yellow', 'blue', 'purple  ']
    })
    
    print(df)
    
    # Returns 
    
    #               Name    Region   Location Favorite Color
    # 0  Shivansh, Gupta  Region A     Mumbai        green  
    # 1     Sonia, Abhel  Region A  New Delhi            red
    # 2    Soumya, Gupta  Region B  Hyderabad         yellow
    # 3     Vasu, Tiwari  Region C  Bangalore           blue
    # 4  Aravind, Mishra  Region D     Mumbai       purple
    

    We can see that our DataFrame has some messy string data! For example, the Name column holds the first and last name in a single value separated by a comma, every value in the Region column carries an unnecessary 'Region' prefix, and the Favorite Color column has extra whitespace around the color names.


Removing Whitespace in Pandas

  • We can remove the extra whitespace from text in pandas. Pandas comes with the .str.lstrip() and .str.rstrip() methods to remove whitespace from the front and back of strings, but since the 'Favorite Color' column has whitespace on both sides, we will use the .str.strip() method, which removes it from both ends.
    df['Favorite Color'] = df['Favorite Color'].str.strip()
    
    print(df)
    
    # Returns
    
    #               Name    Region   Location Favorite Color
    # 0  Shivansh, Gupta  Region A     Mumbai          green
    # 1     Sonia, Abhel  Region A  New Delhi            red
    # 2    Soumya, Gupta  Region B  Hyderabad         yellow
    # 3     Vasu, Tiwari  Region C  Bangalore           blue
    # 4  Aravind, Mishra  Region D     Mumbai         purple
    

    As you can see, we have successfully removed the whitespace from the Favorite Color column.
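
When only one side of the string should be trimmed, the one-sided variants can be used instead. The following is a minimal sketch on a standalone Series; the names `colors`, `left_trimmed`, `right_trimmed`, and `both_trimmed` are illustrative and not part of the DataFrame above.

```python
import pandas as pd

# Illustrative Series with the same messy values as the 'Favorite Color' column
colors = pd.Series(['   green  ', 'red', '  yellow', 'blue', 'purple  '])

# .str.lstrip() removes leading whitespace only
left_trimmed = colors.str.lstrip()

# .str.rstrip() removes trailing whitespace only
right_trimmed = colors.str.rstrip()

# .str.strip() removes whitespace from both ends
both_trimmed = colors.str.strip()

print(left_trimmed.tolist())
print(right_trimmed.tolist())
print(both_trimmed.tolist())
```

Printing the values as a list makes leading and trailing spaces visible, which the default DataFrame display tends to hide.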


Replacing Text using Pandas

  • In our DataFrame, every value in the Region column starts with the word 'Region', which is unnecessary.
  • We are going to use the .str.replace() method to remove it:
    df['Region'] = df['Region'].str.replace('Region ', '')
    
    print(df)
    
    # Returns
    
    #               Name Region   Location Favorite Color
    # 0  Shivansh, Gupta      A     Mumbai          green
    # 1     Sonia, Abhel      A  New Delhi            red
    # 2    Soumya, Gupta      B  Hyderabad         yellow
    # 3     Vasu, Tiwari      C  Bangalore           blue
    # 4  Aravind, Mishra      D     Mumbai         purple
    

    We have successfully removed the word Region from every value in the column.
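
The .str.replace() method accepts a regex argument that controls whether the pattern is treated as literal text or as a regular expression, and its default has differed between pandas versions. A small sketch that states the intent explicitly, using an illustrative Series named `regions` (an assumption, not the DataFrame above):

```python
import pandas as pd

# Illustrative Series with the same shape of values as the 'Region' column
regions = pd.Series(['Region A', 'Region B', 'Region C'])

# regex=False treats the pattern as plain text
literal = regions.str.replace('Region ', '', regex=False)

# regex=True allows a pattern; here the prefix is removed only at the
# start of the string, followed by one or more whitespace characters
anchored = regions.str.replace(r'^Region\s+', '', regex=True)

print(literal.tolist())
print(anchored.tolist())
```

For a fixed prefix like this one, both calls give the same result; the regular expression form is mainly useful when the amount of whitespace varies.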


Splitting Strings into Columns in Pandas

  • The 'Name' column in our DataFrame holds both the first and last name, separated by a comma and a space.
  • Now we are going to create two new columns for the first and last name and remove the original 'Name' column. We use the .str.split() method, which splits each string on the separator passed as an argument, and pass expand=True to instruct pandas to return the pieces as separate columns. From there, we can assign the values to two columns:
    
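    # Repeat the earlier cleaning steps so this snippet works on the original DataFrame
    df['Favorite Color'] = df['Favorite Color'].str.strip()
    # Split on ', ' (comma plus space) so the last name has no leading space
    df[['First Name', 'Last Name']] = df['Name'].str.split(', ', expand=True)
    df['Region'] = df['Region'].str.replace('Region ', '')
    # Drop the original combined column
    df = df.drop(['Name'], axis=1)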
    print(df)
    
    # Returns
    
    #   Region   Location Favorite Color First Name Last Name
    # 0      A     Mumbai          green   Shivansh     Gupta
    # 1      A  New Delhi            red      Sonia     Abhel
    # 2      B  Hyderabad         yellow     Soumya     Gupta
    # 3      C  Bangalore           blue       Vasu    Tiwari
    # 4      D     Mumbai         purple    Aravind    Mishra
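
Real data often has inconsistent spacing around the separator. One possible approach, sketched below, is to split on the comma alone, limit the split to one cut with n=1, and then strip each resulting column. The `names` Series is made up for illustration; its second and third values are not from the DataFrame above.

```python
import pandas as pd

# Hypothetical names with inconsistent spacing around the comma
names = pd.Series(['Shivansh, Gupta', 'Sonia,Abhel', 'Soumya ,  Gupta'])

# n=1 splits at the first comma only, so the result always has two columns
parts = names.str.split(',', n=1, expand=True)

# Remove any leftover whitespace from each piece
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ['First Name', 'Last Name']

print(parts)
```

Limiting the split with n=1 keeps any additional commas inside the second column rather than producing a third one.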

Along with these string methods, pandas also provides string case methods through the .str accessor, such as:

  • .str.upper() converts each string to all upper case
  • .str.lower() converts each string to all lower case
  • .str.title() converts each string to title case
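
A short sketch of the case methods, applied to an illustrative Series named `colors` with deliberately mixed casing (the values are made up for this example):

```python
import pandas as pd

# Illustrative Series with mixed casing
colors = pd.Series(['green', 'RED', 'yElLoW'])

upper = colors.str.upper()   # every character in upper case
lower = colors.str.lower()   # every character in lower case
title = colors.str.title()   # first letter of each word capitalized

print(upper.tolist())
print(lower.tolist())
print(title.tolist())
```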