How do you find duplicates in a dataframe in python?

DataFrame.duplicated(subset=None, keep='first')

Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default, use all of the columns.

keep : {‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to mark.

  • first : Mark duplicates as True except for the first occurrence.

  • last : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Returns: Series

Boolean Series indicating, for each row, whether it is a duplicate.

Examples

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, for each set of duplicated values, the first occurrence is set to False and all others to True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting keep to False, all duplicates are marked True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use subset.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
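subset also accepts several labels at once, in which case a row is flagged only when the combination of values across those columns repeats an earlier row. A small sketch continuing the ramen example:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Rows 1 and 4 repeat an earlier (brand, style) combination;
# rating is ignored because it is not in subset.
print(df.duplicated(subset=['brand', 'style']))
```

Note that row 4 is flagged even though its rating (5.0) differs from row 3's (15.0), because only brand and style are compared.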

    In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or a list of columns. For this, we will use the DataFrame.duplicated() method of pandas.
     

    Syntax: DataFrame.duplicated(subset=None, keep='first')
    Parameters:
    subset: a column label or list of column labels; the default is None. If columns are passed, only they are considered when identifying duplicates.
    keep: controls how duplicate values are marked. It accepts three distinct values, with ‘first’ as the default.

    • If ‘first’, the first occurrence is treated as unique and the remaining identical rows as duplicates.
    • If ‘last’, the last occurrence is treated as unique and the remaining identical rows as duplicates.
    • If False, all identical rows are marked as duplicates.

    Returns: Boolean Series denoting duplicate rows.
     

    Let’s create a simple DataFrame from a list of lists, with column names ‘Name’, ‘Age’ and ‘City’.
     

    Python3

    import pandas as pd

    employees = [['Stuti', 28, 'Varanasi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Aaditya', 25, 'Mumbai'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Mumbai'],
                 ['Aaditya', 40, 'Dehradun'],
                 ['Seema', 32, 'Delhi']]

    # Build the DataFrame with named columns.
    df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

    df

    Output:

          Name  Age      City
    0    Stuti   28  Varanasi
    1   Saumya   32     Delhi
    2  Aaditya   25    Mumbai
    3   Saumya   32     Delhi
    4   Saumya   32     Delhi
    5   Saumya   32    Mumbai
    6  Aaditya   40  Dehradun
    7    Seema   32     Delhi

    Example 1: Select duplicate rows based on all columns.
    Here, we do not pass any arguments, so both take their default values, i.e. subset=None and keep='first'.
     

    Python3

    import pandas as pd

    employees = [['Stuti', 28, 'Varanasi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Aaditya', 25, 'Mumbai'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Mumbai'],
                 ['Aaditya', 40, 'Dehradun'],
                 ['Seema', 32, 'Delhi']]

    df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

    # The boolean mask marks every repeat of an earlier row; use it to select them.
    duplicate = df[df.duplicated()]

    print("Duplicate Rows :")
    duplicate

    Output:

    Duplicate Rows :
         Name  Age   City
    3  Saumya   32  Delhi
    4  Saumya   32  Delhi

    Example 2: Select duplicate rows based on all columns, keeping the last occurrence.
    If you want to mark all duplicates except the last one, pass keep='last' as an argument.
     

    Python3

    import pandas as pd

    employees = [['Stuti', 28, 'Varanasi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Aaditya', 25, 'Mumbai'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Mumbai'],
                 ['Aaditya', 40, 'Dehradun'],
                 ['Seema', 32, 'Delhi']]

    df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

    # keep='last' treats the last occurrence as unique, marking earlier repeats.
    duplicate = df[df.duplicated(keep='last')]

    print("Duplicate Rows :")
    duplicate

    Output:

    Duplicate Rows :
         Name  Age   City
    1  Saumya   32  Delhi
    3  Saumya   32  Delhi

    Example 3: If you want to select duplicate rows based only on some selected columns, pass a column label (or a list of labels) as the subset argument.
     

    Python3

    import pandas as pd

    employees = [['Stuti', 28, 'Varanasi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Aaditya', 25, 'Mumbai'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Mumbai'],
                 ['Aaditya', 40, 'Dehradun'],
                 ['Seema', 32, 'Delhi']]

    df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

    # Only the 'City' column is compared when looking for duplicates.
    duplicate = df[df.duplicated('City')]

    print("Duplicate Rows based on City :")
    duplicate

    Output:

    Duplicate Rows based on City :
         Name  Age    City
    3  Saumya   32   Delhi
    4  Saumya   32   Delhi
    5  Saumya   32  Mumbai
    7   Seema   32   Delhi

    Example 4: Select duplicate rows based on more than one column name.
     

    Python3

    import pandas as pd

    employees = [['Stuti', 28, 'Varanasi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Aaditya', 25, 'Mumbai'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Delhi'],
                 ['Saumya', 32, 'Mumbai'],
                 ['Aaditya', 40, 'Dehradun'],
                 ['Seema', 32, 'Delhi']]

    df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

    # A row is a duplicate when its (Name, Age) pair repeats an earlier row.
    duplicate = df[df.duplicated(['Name', 'Age'])]

    print("Duplicate Rows based on Name and Age :")
    duplicate

    Output:

    Duplicate Rows based on Name and Age :
         Name  Age    City
    3  Saumya   32   Delhi
    4  Saumya   32   Delhi
    5  Saumya   32  Mumbai


    How do you check if there are duplicates in pandas DataFrame?

    You can use the duplicated() method to find duplicate rows in a pandas DataFrame.
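    For a quick yes/no answer, the boolean Series that duplicated() returns can be reduced with any(). A minimal sketch, using a made-up toy DataFrame:

```python
import pandas as pd

# Toy data: the second and third rows are identical.
df = pd.DataFrame({'Name': ['Stuti', 'Saumya', 'Saumya'],
                   'Age': [28, 32, 32]})

# any() is True if at least one row fully duplicates an earlier row.
has_duplicates = bool(df.duplicated().any())
print(has_duplicates)  # True
```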

    How do you check for duplicate data in Python?

    Check for duplicates in a list using a set and by comparing sizes:
    Add the contents of the list to a set. Since a set contains only unique elements, no duplicates will be added to it.
    Compare the sizes of the set and the list. If they are equal, the list contains no duplicates.
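    The two steps above can be sketched in plain Python:

```python
items = [3, 1, 2, 3, 1]

# A set drops repeated elements, so a size mismatch means duplicates exist.
has_duplicates = len(set(items)) != len(items)
print(has_duplicates)  # True
```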

    What is df.duplicated().sum()?

    The duplicated() method returns a Series of True and False values describing which rows in the DataFrame are duplicates; summing that Series counts the duplicate rows. Use the subset parameter to restrict which columns are considered when looking for duplicates.

    How do you find the number of duplicates in a DataFrame in Python?

    You can count the number of duplicate rows by counting the True values in the pandas Series obtained with duplicated(). The number of True values can be counted with the sum() method. If you want to count the number of False values (i.e. the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
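    Putting that together with a small made-up DataFrame:

```python
import pandas as pd

# Toy data: the third row repeats the second.
df = pd.DataFrame({'Name': ['Stuti', 'Saumya', 'Saumya', 'Seema'],
                   'Age': [28, 32, 32, 32]})

dup = df.duplicated()
print(dup.sum())     # count of duplicate rows: 1
print((~dup).sum())  # count of non-duplicate rows: 3
```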
