Pandas Detecting Duplicates

How to detect duplicate rows in a pandas DataFrame with DataFrame.duplicated().

DataFrame.duplicated(subset=None, keep='first')

Parameters:

subset: Column label or list of column labels to check for duplicates. By default, all columns are checked.
keep: One of three options. 'first' marks every duplicate except the first occurrence; 'last' marks every duplicate except the last occurrence; False marks all duplicates. The default is 'first'.

Returns:

A pandas Series of boolean values, with True for every row marked as a duplicate.
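
To make the return value concrete, here is a minimal sketch on a tiny made-up frame. The `colors` data and the variable names are illustrative assumptions, not part of the example that follows. It prints the boolean mask produced by each `keep` option.

```python
import pandas as pd

# Hypothetical sample data, chosen so that rows 0 and 2 are identical
colors = pd.DataFrame({'name': ['red', 'blue', 'red'], 'code': [1, 2, 1]})

mask_first = colors.duplicated()             # keep='first' is the default
mask_last = colors.duplicated(keep='last')   # the last occurrence is not marked
mask_all = colors.duplicated(keep=False)     # every occurrence is marked

print(mask_first.tolist())  # [False, False, True]
print(mask_last.tolist())   # [True, False, False]
print(mask_all.tolist())    # [True, False, True]
```

Indexing the frame with one of these masks, as in `colors[mask_all]`, returns only the rows marked True.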

import pandas as pd

# Sample data: (first_name, last_name, age)
users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])

Selecting duplicates on all columns

All duplicates except the first entry

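# Rows that repeat an earlier row in every column (keep='first' is the default).
# In the sample data no two rows match in all three columns, so this is empty.
duplicates_all_columns = df[df.duplicated()]
print(duplicates_all_columns)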

All duplicates except the last entry

duplicates_last = df[df.duplicated(keep='last')]
print(duplicates_last)

All duplicates

duplicates_all = df[df.duplicated(keep=False)]
print(duplicates_all)
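
One thing worth checking: in the sample data the two John Doe rows differ in age, so a comparison across all columns finds no duplicates. The sketch below confirms this and then appends an exact copy of the first row to show a non-empty result. The appended row and the names `mask`, `df_with_copy`, and `mask_copy` are assumptions made purely for illustration.

```python
import pandas as pd

users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])

# No row matches another row in all three columns
mask = df.duplicated(keep=False)
print(mask.any())        # False
print(int(mask.sum()))   # 0

# Illustrative only: append an exact copy of row 0 and check again
df_with_copy = pd.concat([df, df.iloc[[0]]], ignore_index=True)
mask_copy = df_with_copy.duplicated(keep=False)
print(mask_copy[mask_copy].index.tolist())  # [0, 6]
```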

Selecting duplicates based on specific columns

Based on a single column

duplicates_single_col = df[df.duplicated(['first_name'])]
print(duplicates_single_col)

Based on multiple columns

duplicates_mult_col = df[df.duplicated(['first_name', 'last_name'])]
print(duplicates_mult_col)
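
The `subset` and `keep` parameters can also be combined. The following sketch rebuilds the sample frame so it runs on its own and lists which row indices each call selects. The variable names `name_cols`, `by_first_name`, and `by_full_name_all` are chosen here for illustration.

```python
import pandas as pd

users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])

# Repeated first names, skipping the first occurrence of each name
by_first_name = df[df.duplicated(subset=['first_name'])]
print(by_first_name.index.tolist())      # [1, 4]

# Every row whose full name appears more than once
name_cols = ['first_name', 'last_name']
by_full_name_all = df[df.duplicated(subset=name_cols, keep=False)]
print(by_full_name_all.index.tolist())   # [0, 1]
```

Passing `keep=False` is useful when all members of a duplicate group need to be inspected side by side, rather than only the later occurrences.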
