DataFrame.duplicated(subset=None, keep='first')
Parameters:
subset: Column label or sequence of labels to check for duplicates. By default, all columns are checked.
keep: Which duplicates to mark. 'first' marks all duplicates except the first occurrence; 'last' marks all duplicates except the last occurrence; False marks all duplicates. The default is 'first'.
Returns:
A pandas Series of boolean values, with True for each row marked as a duplicate.
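To make the effect of the parameters concrete, here is a minimal sketch that prints the boolean mask itself rather than the selected rows. The `toy` frame and its column names `a` and `b` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({'a': [1, 2, 1], 'b': ['x', 'y', 'x']})

mask_first = toy.duplicated().tolist()              # keep='first' is the default
mask_last = toy.duplicated(keep='last').tolist()
mask_all = toy.duplicated(keep=False).tolist()
mask_subset = toy.duplicated(subset=['b']).tolist()  # compare column 'b' only

print(mask_first)   # [False, False, True]
print(mask_last)    # [True, False, False]
print(mask_all)     # [True, False, True]
print(mask_subset)  # [False, False, True]
```

Passing the mask back into the frame, as in `toy[toy.duplicated()]`, selects the marked rows.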
import pandas as pd

users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])
Selecting duplicates on all columns
All duplicates except the first entry
duplicates_all_columns = df[df.duplicated()]
print(duplicates_all_columns)
All duplicates except the last entry
duplicates_last = df[df.duplicated(keep='last')]
print(duplicates_last)
All duplicates
duplicates_all = df[df.duplicated(keep=False)]
print(duplicates_all)
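Note that no two rows of the sample data match in every column (the two 'John Doe' entries differ in age), so each of the three selections above comes back empty. The sketch below checks this and then appends a copy of the first row to show non-empty results. The extra row and the name `df_dup` are assumptions added for illustration only.

```python
import pandas as pd

users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])

# No row of the original data is a full duplicate of another.
n_full_duplicates = int(df.duplicated().sum())
print(n_full_duplicates)  # 0

# Hypothetical: append a copy of row 0 so that a full duplicate exists (index 6).
df_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

idx_first = df_dup.index[df_dup.duplicated()].tolist()
idx_last = df_dup.index[df_dup.duplicated(keep='last')].tolist()
idx_all = df_dup.index[df_dup.duplicated(keep=False)].tolist()

print(idx_first)  # [6]
print(idx_last)   # [0]
print(idx_all)    # [0, 6]
```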
Selecting duplicates based on specific columns
Based on a single column
duplicates_single_col = df[df.duplicated(['first_name'])]
print(duplicates_single_col)
Based on multiple columns
duplicates_mult_col = df[df.duplicated(['first_name', 'last_name'])]
print(duplicates_mult_col)
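For the sample users, the column-based selections can be verified by looking at which row indices are marked. This is a small sketch that rebuilds the same frame so it runs on its own. The variable names are chosen for illustration.

```python
import pandas as pd

users = [
    ('John', 'Doe', 26),
    ('John', 'Doe', 2),
    ('Sarah', 'Doe', 33),
    ('Brad', 'Vandame', 49),
    ('John', 'Schmitt', 22),
    ('Chris', 'Serious', 33)
]
df = pd.DataFrame(users, columns=['first_name', 'last_name', 'age'])

# 'John' appears at indices 0, 1 and 4; the first occurrence is not marked.
single_idx = df.index[df.duplicated(['first_name'])].tolist()

# 'John Doe' appears at indices 0 and 1; only the second one is marked.
multi_idx = df.index[df.duplicated(['first_name', 'last_name'])].tolist()

# keep=False combines with subset and marks every member of the group.
multi_all_idx = df.index[
    df.duplicated(['first_name', 'last_name'], keep=False)
].tolist()

print(single_idx)     # [1, 4]
print(multi_idx)      # [1]
print(multi_all_idx)  # [0, 1]
```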