Return boolean Series denoting duplicate rows. Considering certain columns is optional.

Parameters:
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to mark.
    - 'first' : Mark duplicates as True except for the first occurrence.
    - 'last' : Mark duplicates as True except for the last occurrence.
    - False : Mark all duplicates as True.

Returns:
Series
    Boolean series, True for each duplicated row.

Examples
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
By default, for each set of duplicated values, the first occurrence is set to False and all others to True.
>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
By using 'last', the last occurrence of each set of duplicated values is set to False and all others to True.
>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool
By setting keep to False, all duplicates are True.
>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool
To find duplicates on specific column(s), use subset.
>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or a list of columns. For this, we will use the DataFrame.duplicated() method of Pandas.
Syntax: DataFrame.duplicated(subset=None, keep='first')
Parameters:
subset: Takes a column label or a list of column labels. Its default value is None. If columns are passed, only those columns are considered when identifying duplicates.
keep: Controls how duplicate values are marked. It has three distinct values and the default is 'first'.
- If 'first', the first occurrence is considered unique and the rest of the same values are marked as duplicates.
- If 'last', the last occurrence is considered unique and the rest of the same values are marked as duplicates.
- If False, all of the same values are marked as duplicates.
Returns: Boolean Series denoting duplicate rows.
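As a quick illustration of the three keep values before moving to the full examples, here is a minimal sketch using a small hypothetical one-column DataFrame:

```python
import pandas as pd

# Hypothetical toy DataFrame: 'a' appears three times, 'b' once
df = pd.DataFrame({'x': ['a', 'a', 'a', 'b']})

print(df.duplicated(keep='first').tolist())  # [False, True, True, False]
print(df.duplicated(keep='last').tolist())   # [True, True, False, False]
print(df.duplicated(keep=False).tolist())    # [True, True, True, False]
```

Note that 'b' is never marked: a value that occurs only once is not a duplicate under any setting of keep.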
Let's create a simple DataFrame from a list of lists, with the column names 'Name', 'Age' and 'City'.
Python3
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Aaditya', 25, 'Mumbai'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai'],
             ['Aaditya', 40, 'Dehradun'],
             ['Seema', 32, 'Delhi']]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

df
Output :
Example 1: Select duplicate rows based on all columns.
Here, we do not pass any argument, so the method takes the default values for both arguments, i.e. subset=None and keep='first'.
Python3
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Aaditya', 25, 'Mumbai'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai'],
             ['Aaditya', 40, 'Dehradun'],
             ['Seema', 32, 'Delhi']]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Select rows that are exact duplicates of an earlier row
duplicate = df[df.duplicated()]

print("Duplicate Rows :")
duplicate
Output :
Example 2: Select duplicate rows based on all columns, keeping the last occurrence.
If you want to consider all duplicates except the last one then pass keep = ‘last’ as an argument.
Python3
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Aaditya', 25, 'Mumbai'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai'],
             ['Aaditya', 40, 'Dehradun'],
             ['Seema', 32, 'Delhi']]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Mark all duplicates as True except the last occurrence
duplicate = df[df.duplicated(keep='last')]

print("Duplicate Rows :")
duplicate
Output :
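Since duplicated() returns a boolean mask, inverting it with ~ selects the rows that are not marked as duplicates; this is one way to keep only unique rows. A minimal sketch, using a shortened hypothetical version of the employees data:

```python
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi']]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# ~df.duplicated() keeps the first occurrence of each row and drops the repeats,
# which is equivalent to df.drop_duplicates()
unique_rows = df[~df.duplicated()]
print(unique_rows)
```

For dropping duplicates directly, df.drop_duplicates() is the more idiomatic choice; the mask form is useful when you need the boolean Series for other filtering as well.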
Example 3: If you want to select duplicate rows based only on some selected columns then pass the list of column names in subset as an argument.
Python3
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Aaditya', 25, 'Mumbai'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai'],
             ['Aaditya', 40, 'Dehradun'],
             ['Seema', 32, 'Delhi']]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Consider only the 'City' column when identifying duplicates
duplicate = df[df.duplicated('City')]

print("Duplicate Rows based on City :")
duplicate
Output :
Example 4: Select duplicate rows based on more than one column name.
Python3
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Aaditya', 25, 'Mumbai'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai'],
             ['Aaditya', 40, 'Dehradun'],
             ['Seema', 32, 'Delhi']]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Consider the 'Name' and 'Age' columns together when identifying duplicates
duplicate = df[df.duplicated(['Name', 'Age'])]

print("Duplicate Rows based on Name and Age :")
duplicate
Output :
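Beyond selecting the duplicate rows, it is often useful to simply count them. Because the mask returned by duplicated() is boolean, summing it gives the number of duplicates; a minimal sketch with a shortened hypothetical version of the employees data:

```python
import pandas as pd

employees = [['Stuti', 28, 'Varanasi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Delhi'],
             ['Saumya', 32, 'Mumbai']]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Summing a boolean Series counts the True values, i.e. the duplicate rows
print(df.duplicated().sum())                 # 1 fully duplicated row
print(df.duplicated(['Name', 'Age']).sum())  # 2 rows duplicated on Name and Age
```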