Pandas Drop Duplicates Based on Condition
- Author: Brent
In data analysis and manipulation, it is common to have duplicate data that needs to be cleaned.
In such cases, the Pandas library in Python provides a convenient way to drop duplicate values.
However, sometimes, we might not want to drop all the duplicate values, but only those that meet certain conditions.
In this blog post, we will discuss how to drop duplicates based on condition in Pandas.
How To Drop Duplicates Based On Condition in Pandas
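Pandas provides a DataFrame method named drop_duplicates() to drop duplicate rows.
By default, it compares all columns and keeps the first occurrence of each duplicated row while dropping the later ones; the subset and keep arguments let us control which columns are compared and which occurrence is kept.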
For instance, let's say we have a DataFrame with the following data:
Name Age Grade
0 John 25 A
1 Jane 27 B
2 John 25 C
3 Mark 28 A
4 Jane 27 B
Here, rows 0 and 2 share the same name and age but have different grades, while rows 1 and 4 are identical in every column.
If we want to drop duplicates based on the name and age columns, we can use the subset argument of the drop_duplicates() function:
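To follow along, the table above can be rebuilt as a DataFrame. The sketch below is a minimal reconstruction that assumes the index is the default integer range; it also shows what a plain drop_duplicates() call with no arguments does on this data.

```python
import pandas as pd

# Rebuild the example table (default integer index 0..4 is assumed)
df = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Mark", "Jane"],
    "Age": [25, 27, 25, 28, 27],
    "Grade": ["A", "B", "C", "A", "B"],
})

# With no arguments, all columns are compared and the first occurrence is kept,
# so only row 4 (an exact copy of row 1) is removed.
default_result = df.drop_duplicates()
print(default_result.index.tolist())  # [0, 1, 2, 3]
```

Row 2 survives because its grade differs from row 0, which is why a subset of columns is needed to treat the two John rows as duplicates.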
df.drop_duplicates(subset=['Name', 'Age'], keep='first')
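This keeps the first row for each Name and Age pair and drops the rest. Note that drop_duplicates() returns a new DataFrame and leaves df unchanged unless the result is assigned back. The result is the following DataFrame: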
Name Age Grade
0 John 25 A
1 Jane 27 B
3 Mark 28 A
Note that the keep argument takes the values 'first', 'last' or False. If keep='first' (the default), the first occurrence of each set of duplicates is kept, and if keep='last', the last occurrence is kept.
If keep=False, every row that has a duplicate is dropped, including the first occurrence, so only Mark's row would remain in this example.
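The three keep options can be compared side by side. This is a small sketch on the example data; the variable names are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Mark", "Jane"],
    "Age": [25, 27, 25, 28, 27],
    "Grade": ["A", "B", "C", "A", "B"],
})

cols = ["Name", "Age"]

# Keep the first, the last, or none of the rows in each duplicated group
keep_first = df.drop_duplicates(subset=cols, keep="first")
keep_last = df.drop_duplicates(subset=cols, keep="last")
keep_none = df.drop_duplicates(subset=cols, keep=False)

print(keep_first.index.tolist())  # [0, 1, 3]
print(keep_last.index.tolist())   # [2, 3, 4]
print(keep_none.index.tolist())   # [3]
```

The original row labels are preserved in each result; calling reset_index(drop=True) afterwards renumbers them if a contiguous index is needed.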
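Drop Duplicates Based On A Condition Using Boolean Indexing
Another way to drop duplicates based on conditions is to use boolean indexing.
The duplicated() method returns a boolean mask that is True for every repeat of a row after its first occurrence (it accepts the same subset and keep arguments as drop_duplicates()), and we can negate that mask to index our DataFrame.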
Here's an example:
mask = df.duplicated(subset=['Name', 'Age'])
df = df[~mask]
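This produces the same DataFrame as the drop_duplicates() call above. The advantage of a mask is that it can be combined with other boolean conditions using & and | before indexing, so that only the duplicates meeting a condition are removed.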
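As an illustration of adding a condition to the mask, the sketch below removes repeated Name and Age pairs only for people older than 26. The age threshold is a hypothetical condition chosen for this example, not something prescribed by Pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Mark", "Jane"],
    "Age": [25, 27, 25, 28, 27],
    "Grade": ["A", "B", "C", "A", "B"],
})

# True for every repeat of a (Name, Age) pair after its first occurrence
is_repeat = df.duplicated(subset=["Name", "Age"])

# Hypothetical extra condition: only remove repeats for people older than 26
over_26 = df["Age"] > 26

# Drop a row only when it is a repeat AND meets the condition
result = df[~(is_repeat & over_26)]
print(result.index.tolist())  # [0, 1, 2, 3]
```

Row 4 (the second Jane, age 27) is removed, while row 2 (the second John, age 25) is kept because it does not satisfy the age condition.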
Summary
In conclusion, dropping duplicates based on a condition in Pandas can be achieved using the drop_duplicates() function or boolean indexing.
This helps us to keep only the desired records and remove the rest, ensuring that our data is clean and accurate.