Working with Duplicates
Duplicate Data
- Duplicates can enter a data set for a variety of reasons.
- Sometimes a duplicate is valid, but it can also cause errors in your data analysis.
- This is why it is essential to detect duplicates in the data and deal with them.
- Let's understand this situation with a code demonstration:
    import pandas as pd

    df = pd.DataFrame.from_dict({
        'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
        'Age': [33, 32, 40, 57, 33, 32],
        'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
        'Date Modified': ['08-06-2022', '01-02-2022', '08-12-2022', '09-12-2022', '01-01-2022', '12-09-2022']
    })

    print(df)

    # Returns
    #       Name  Age  Location Date Modified
    # 0   Nikita   33    Mumbai    08-06-2022
    # 1  Katrina   32    London    01-02-2022
    # 2     Evan   40  New York    08-12-2022
    # 3     Kygo   57   Atlanta    09-12-2022
    # 4    Kavya   33    Mumbai    01-01-2022
    # 5     Anne   32     Paris    12-09-2022
Identifying Duplicates:
- Pandas has a helpful method, duplicated(), which allows you to identify duplicate records in a dataset.
- This method returns a boolean Series with one value per row: True if the row repeats an earlier row, and False otherwise.
    # Identifying Duplicate Records in a Pandas DataFrame
    print(df.duplicated())

    # 0    False
    # 1    False
    # 2    False
    # 3    False
    # 4    False
    # 5    False
    # dtype: bool
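However, this result may not seem useful: by default, duplicated() compares every column, and no two rows in our data match across all of them, so every value is False.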
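To make the check more informative, duplicated() can be restricted to particular columns. The following is a minimal sketch, assuming the same sample data as above; the variable names dup_location and dup_all are only illustrative. It uses the subset and keep parameters of duplicated().

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
    'Date Modified': ['08-06-2022', '01-02-2022', '08-12-2022',
                      '09-12-2022', '01-01-2022', '12-09-2022']
})

# Flag rows whose Location already appeared in an earlier row
dup_location = df.duplicated(subset='Location')
print(dup_location.tolist())
# [False, False, False, False, True, False]

# keep=False flags every member of a duplicate group, not only the repeats
dup_all = df.duplicated(subset=['Age', 'Location'], keep=False)
print(df[dup_all])  # rows 0 and 4 share both Age and Location

# Summing the boolean Series counts the flagged rows
print(dup_location.sum())  # 1
```

Filtering with the boolean Series, as in df[dup_all], is a convenient way to inspect suspected duplicates before deciding whether to remove them.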
Removing Duplicates in Pandas DataFrame
- Pandas comes with an easy method for removing duplicate records: .drop_duplicates().
- Let's look at its parameters and then try the method on our data:
    # Parameters of .drop_duplicates() with their default values
    df.drop_duplicates(
        subset=None,         # Which columns to consider
        keep='first',        # Which duplicate record to keep
        inplace=False,       # Whether to drop in place
        ignore_index=False   # Whether to relabel the index
    )
    # Drop rows whose 'Location' repeats an earlier row
    df = df.drop_duplicates('Location')
    print(df)

    # Returns
    #       Name  Age  Location Date Modified
    # 0   Nikita   33    Mumbai    08-06-2022
    # 1  Katrina   32    London    01-02-2022
    # 2     Evan   40  New York    08-12-2022
    # 3     Kygo   57   Atlanta    09-12-2022
    # 5     Anne   32     Paris    12-09-2022
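We can see that this returned a DataFrame in which every value in the 'Location' column is unique. Row 4 (Kavya) was dropped because its location, Mumbai, had already appeared in row 0, and the default keep='first' retains the first occurrence.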
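The other parameters change which record survives and how the result is indexed. Here is a small sketch, assuming the original six-row sample data is rebuilt first; the names latest and unique_only are hypothetical and chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
    'Date Modified': ['08-06-2022', '01-02-2022', '08-12-2022',
                      '09-12-2022', '01-01-2022', '12-09-2022']
})

# keep='last' retains the last row of each duplicate group;
# ignore_index=True relabels the result 0..n-1
latest = df.drop_duplicates(subset='Location', keep='last', ignore_index=True)
print(latest['Name'].tolist())
# ['Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne']

# keep=False drops every row that has a duplicate, keeping only
# locations that appear exactly once
unique_only = df.drop_duplicates(subset='Location', keep=False)
print(unique_only['Name'].tolist())
# ['Katrina', 'Evan', 'Kygo', 'Anne']
```

Note that "last" here refers to row order, not to the 'Date Modified' column; to keep the most recently modified record, the data would need to be sorted by that column first.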