Pandas DataFrame duplicated() Method
Example
Check which rows are duplicates and which are not:
import pandas as pd
data = {
  "name": ["John", "Mary", "John", "Sally", "Mary"],
  "age": [40, 30, 40, 50, 30],
  "city": ["Bergen", "Oslo", "Stavanger", "Oslo", "Oslo"]
}
df = pd.DataFrame(data)
s = df.duplicated()
print(s)
Definition and Usage
The duplicated() method returns a Series of True and False values that indicate, for each row in the DataFrame, whether that row is a duplicate of another row.
Use the subset parameter to specify which columns to compare when looking for duplicates. By default, all columns are compared.
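As a minimal sketch of the effect of subset, the snippet below rebuilds the DataFrame from the example at the top of the page and compares the default behavior with a check on the "city" column alone. The variable names all_columns and by_city are illustrative only.

```python
import pandas as pd

# Same data as the example at the top of the page
df = pd.DataFrame({
    "name": ["John", "Mary", "John", "Sally", "Mary"],
    "age": [40, 30, 40, 50, 30],
    "city": ["Bergen", "Oslo", "Stavanger", "Oslo", "Oslo"],
})

all_columns = df.duplicated()           # every column must match
by_city = df.duplicated(subset="city")  # only "city" must match

print(all_columns.tolist())  # [False, False, False, False, True]
print(by_city.tolist())      # [False, False, False, True, True]
```

Only the last row repeats an earlier row in every column, while "Oslo" appears three times, so two rows are flagged when only "city" is compared.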
By default, the first occurrence in a group of duplicate rows is marked False and every later occurrence is marked True. Set the keep parameter to False to mark every occurrence, including the first, as True.
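The three keep settings can be compared side by side. This is a small sketch that uses only the "name" and "age" values from the example at the top of the page; the variable names are illustrative only.

```python
import pandas as pd

# "name" and "age" values from the example at the top of the page
df = pd.DataFrame({
    "name": ["John", "Mary", "John", "Sally", "Mary"],
    "age": [40, 30, 40, 50, 30],
})

first = df.duplicated()                 # keep='first' is the default
last = df.duplicated(keep="last")       # last occurrence stays False
all_marked = df.duplicated(keep=False)  # every occurrence becomes True

print(first.tolist())       # [False, False, True, False, True]
print(last.tolist())        # [True, True, False, False, False]
print(all_marked.tolist())  # [True, True, True, False, True]
```

The row for "Sally" is unique, so it is False under all three settings.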
Syntax
dataframe.duplicated(subset, keep)
Parameters
Both parameters are optional and are usually passed as keyword arguments.
Parameter | Value | Description |
---|---|---|
subset | column label(s) | Optional. A column label, or a list of column labels, to compare when looking for duplicates. Default subset=None, meaning no subset is specified and all columns are compared. |
keep | 'first' 'last' False | Optional, default 'first'. Specifies which occurrences are marked as duplicates: 'first' means set the first occurrence to False, the rest to True. 'last' means set the last occurrence to False, the rest to True. False means set all occurrences to True. |
Return Value
A Series with a boolean value for each row in the DataFrame.
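Because the returned Series is boolean, it can be used as a mask to select or exclude the flagged rows. The sketch below assumes the same data as the example at the top of the page; mask, duplicates and unique_rows are illustrative names.

```python
import pandas as pd

# Same data as the example at the top of the page
df = pd.DataFrame({
    "name": ["John", "Mary", "John", "Sally", "Mary"],
    "age": [40, 30, 40, 50, 30],
    "city": ["Bergen", "Oslo", "Stavanger", "Oslo", "Oslo"],
})

mask = df.duplicated()   # boolean Series, one value per row

duplicates = df[mask]    # rows flagged as duplicates
unique_rows = df[~mask]  # rows that are not flagged

print(int(mask.sum()))            # 1
print(duplicates.index.tolist())  # [4]
print(len(unique_rows))           # 4
```

With the default keep setting, df[~mask] selects the same rows as df.drop_duplicates().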
More Examples
Example
Only compare the columns "name" and "age" when looking for duplicates:
s = df.duplicated(subset=["name", "age"])
print(s)
Example
Set all occurrences of duplicates to True:
s = df.duplicated(keep=False)
print(s)
Copyright 1999-2023 by Refsnes Data. All Rights Reserved.