
Working with Duplicates

Duplicate Data

  • Duplicates can enter a data set for a variety of reasons.
  • Sometimes a repeated record is valid, but duplicates can also cause errors in your data analysis.
  • This is why it is essential to detect and deal with duplicates in the data.
  • Let's understand this situation with a code demonstration:
    import pandas as pd
    df = pd.DataFrame.from_dict({
        'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
        'Age': [33, 32, 40, 57, 33, 32],
        'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
        'Date Modified': ['08-06-2022', '01-02-2022', '08-12-2022', '09-12-2022', '01-01-2022', '12-09-2022']
    })
    print(df)
    # Returns
    #       Name  Age  Location Date Modified
    # 0   Nikita   33    Mumbai    08-06-2022
    # 1  Katrina   32    London    01-02-2022
    # 2     Evan   40  New York    08-12-2022
    # 3     Kygo   57   Atlanta    09-12-2022
    # 4    Kavya   33    Mumbai    01-01-2022
    # 5     Anne   32     Paris    12-09-2022

Identifying Duplicates:

  • Pandas provides a helpful method, duplicated(), which allows you to identify duplicate records in a dataset.
  • This method returns a boolean Series that marks each row as True if it duplicates an earlier row and False otherwise.
    # Identifying Duplicate Records in a Pandas DataFrame
    print(df.duplicated())
    # Returns
    # 0    False
    # 1    False
    # 2    False
    # 3    False
    # 4    False
    # 5    False
    # dtype: bool

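    On its own, this result may not seem useful: every value is False because duplicated() compares entire rows by default, and no two rows in our data match in every column.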
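
    To make the check more informative, duplicated() can be restricted to particular columns. The following is a minimal sketch rather than part of the original demonstration: it assumes we only care about repeated 'Location' values, rebuilds the sample data so it runs on its own, and uses illustrative variable names (later_repeats, all_repeats, n_repeats).

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
})

# Flag rows whose 'Location' already appeared in an earlier row
later_repeats = df.duplicated(subset=['Location'])

# keep=False flags every row that shares its 'Location' with another row
all_repeats = df.duplicated(subset=['Location'], keep=False)

# Booleans sum as 0/1, so this counts the flagged rows
n_repeats = int(later_repeats.sum())

print(df[all_repeats])   # the rows involved in a repeated 'Location'
print(n_repeats)
```

    With the default keep='first', only the later 'Mumbai' row is flagged; keep=False is handy when you want to inspect every row involved before deciding what to drop.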

Removing Duplicates in Pandas DataFrame

  • Pandas comes with an easy way to remove duplicate records: the .drop_duplicates() method.
  • Let's check this method on our data:
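    # The .drop_duplicates() method with its default arguments
    df.drop_duplicates(
        subset=None,            # Which columns to consider
        keep='first',           # Which duplicate record to keep
        inplace=False,          # Whether to drop in place
        ignore_index=False      # Whether to relabel the index
    )
    # Drop rows whose 'Location' repeats an earlier row
    df = df.drop_duplicates('Location')
    print(df)
    # Returns
    #       Name  Age  Location Date Modified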
    # 0   Nikita   33    Mumbai    08-06-2022
    # 1  Katrina   32    London    01-02-2022
    # 2     Evan   40  New York    08-12-2022
    # 3     Kygo   57   Atlanta    09-12-2022
    # 5     Anne   32     Paris    12-09-2022

    We can see that this returned a DataFrame in which every 'Location' value is unique: the row for Kavya (index 4) was dropped because 'Mumbai' had already appeared in row 0.
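
    The keep and ignore_index parameters change which rows survive and how the result is indexed. The following is a minimal sketch under the same assumption (duplicates are judged by 'Location' alone); it rebuilds the sample data so it runs on its own, and the names keep_last and drop_all are illustrative.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
})

# keep='last' retains the final occurrence of each 'Location';
# ignore_index=True relabels the surviving rows 0..n-1
keep_last = df.drop_duplicates(subset='Location', keep='last', ignore_index=True)

# keep=False drops every row whose 'Location' occurs more than once
drop_all = df.drop_duplicates(subset='Location', keep=False)

print(keep_last)
print(drop_all)
```

    Here keep='last' retains Kavya instead of Nikita for 'Mumbai', while keep=False removes both of them. Neither call modifies df, because inplace defaults to False.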