Okay…so you might be wondering why the hell you need to remove duplicates in spreadsheets using Python when you can simply click on the Remove Duplicates button in Excel.
And you would be justified in asking this question. However, the remove duplicates button retains the FIRST unique value and removes all the rest.
What if you want to remove ALL duplicates? Or you want to remove all but the LAST duplicate? Or you just want to know which records are duplicates?
While these situations can all be hacked in Excel, they are much easier to handle in Python and here at Love Spreadsheets we are all about using the right tool for the job.
Prerequisite
If you do not know how to run Python scripts or how to read and write files using Python and Pandas, then go through this tutorial first.
Intro to Reading and Writing Spreadsheets with Python
File
We will be using a sample file for this tutorial. You can download it here
We will count a record as a duplicate if it has the same values in the Name, Address and Call Date columns. Using these criteria, the file has three sets of duplicates, as seen below.
Full Code
See the full code below to follow along
Create the File for Code
Open a text editor and create a file called duplicates.py. Save it in the same folder as the Duplicates.xlsx file.
Import Library
Line 1
Import the pandas library, which we will use to read the spreadsheet, remove the duplicates, and write the results.
import pandas as pd
Read the File
Line 3
We read the spreadsheet using pandas and store the result in a DataFrame called file_df.
file_df = pd.read_excel("Duplicates.xlsx")
Keep only FIRST record of duplicate set
Lines 6–7
The first case we will go through is the same as Excel, where we want to remove all duplicates in a set except for the first. The code for this is one line with two important parameters.
file_df_first_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="first")
We call the drop_duplicates method on the file_df that we read in from the spreadsheet. In the drop_duplicates method there are two important parameters.
- subset: the list of column names used to decide whether a record is a duplicate. Here we have specified that duplicates are records that share the same Name, Address and Call Date.
- keep: which record to keep from each set of duplicates. This is the parameter we will change in the different scenarios.
Both parameters are OPTIONAL. If you don’t specify them, Pandas treats a record as a duplicate only when ALL columns match, and keeps only the FIRST record.
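To see the difference the two parameters make, here is a minimal sketch with a made-up mini dataset (not the tutorial’s Duplicates.xlsx):

```python
import pandas as pd

# Hypothetical mini dataset for illustration only
df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob"],
    "Address": ["1 Main St", "1 Main St", "2 Oak Ave"],
    "Call Date": ["2021-01-05", "2021-01-05", "2021-01-06"],
    "Notes": ["first call", "follow-up", "first call"],
})

# Default: a row is a duplicate only if ALL columns match, and the FIRST copy is kept.
# Here no two rows match on every column (the Notes differ), so nothing is dropped.
print(len(df.drop_duplicates()))  # 3

# With subset: rows 0 and 1 share Name/Address/Call Date, so row 1 is dropped.
deduped = df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="first")
print(len(deduped))  # 2
```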
You can read more about this method here.
Now we write this dataframe, with only the first record of each duplicate set kept, to a new file.
file_df_first_record.to_excel("Duplicates_First_Record.xlsx", index=False)
This is what that file looks like
Keep only LAST record of duplicate set
Lines 10–11
This is identical to the step above, except we change the keep parameter to last.
file_df_last_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="last")
Then we write this dataframe, with only the last record of each duplicate set kept, to a new file.
file_df_last_record.to_excel("Duplicates_Last_Record.xlsx", index=False)
This is what the file looks like
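To make the first/last distinction concrete, here is a small sketch with invented data, where the two duplicate rows differ only in a Notes column so you can see which copy survives:

```python
import pandas as pd

# Hypothetical example: two calls logged for the same person on the same date
df = pd.DataFrame({
    "Name": ["Ann", "Ann"],
    "Address": ["1 Main St", "1 Main St"],
    "Call Date": ["2021-01-05", "2021-01-05"],
    "Notes": ["morning call", "evening call"],
})

cols = ["Name", "Address", "Call Date"]
first_kept = df.drop_duplicates(subset=cols, keep="first")
last_kept = df.drop_duplicates(subset=cols, keep="last")

print(first_kept["Notes"].tolist())  # ['morning call']
print(last_kept["Notes"].tolist())   # ['evening call']
```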
Remove ALL records from a set of duplicates
Lines 14–15
Now let’s say we want to remove all records from a set of duplicates. In this case we change the keep parameter to False
file_df_remove_all = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep=False)
Now we write this dataframe with all duplicate sets removed to a file
file_df_remove_all.to_excel("Duplicates_All_Removed.xlsx", index=False)
This is what that file looks like
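A quick sketch with made-up data shows how keep=False behaves: every member of a duplicate set is dropped, and only truly unique rows survive.

```python
import pandas as pd

# Hypothetical example: Ann appears twice, Bob once
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Address": ["1 Main St", "2 Oak Ave", "1 Main St"],
    "Call Date": ["2021-01-05", "2021-01-06", "2021-01-05"],
})

# keep=False drops EVERY member of a duplicate set
unique_only = df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep=False)
print(unique_only["Name"].tolist())  # ['Bob']
```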
Get Which Records Are Duplicates
Lines 18–20
Okay, so let’s say you want to get a list of all records that belong to a duplicate set. This case is not as straightforward, but it is still very doable.
First, we get a Boolean Series indicating which rows are duplicates.
duplicate_row_index = file_df.duplicated(subset=["Name", "Address", "Call Date"], keep=False)
Note: we are using a different method called duplicated. This gives us a Series containing a True or False value for each row in the dataframe/spreadsheet.
In this method, the keep parameter specifies which rows we want identified as duplicates. Since we pass False, it identifies all rows that belong to a duplicate set. If we had specified first, it would have identified all rows EXCEPT the first row of each duplicate set. You can read more about this method here.
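The effect of the keep parameter on duplicated is easy to see on a tiny invented dataset:

```python
import pandas as pd

# Hypothetical example: rows 0 and 2 form a duplicate set
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Address": ["1 Main St", "2 Oak Ave", "1 Main St"],
    "Call Date": ["2021-01-05", "2021-01-06", "2021-01-05"],
})

cols = ["Name", "Address", "Call Date"]
# keep=False flags every member of a duplicate set
all_members = df.duplicated(subset=cols, keep=False).tolist()
print(all_members)    # [True, False, True]
# keep="first" flags everything EXCEPT the first occurrence
after_first = df.duplicated(subset=cols, keep="first").tolist()
print(after_first)    # [False, False, True]
```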
Next, we get the actual records from the dataframe. The command below gives us all the rows that were identified as duplicates.
all_duplicate_rows = file_df[duplicate_row_index]
Finally, we write this to a spreadsheet. Here we use index=True because we want to get the row numbers as well.
all_duplicate_rows.to_excel("Duplicate_Rows.xlsx", index=True)
This is what that file looks like
NOTE: The pandas index is 0-based AND does not count the column header as a row. So in this file, since we had a header, Row 1 next to apple actually refers to Row 3 in the original file.
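If you want to translate a pandas index back to a spreadsheet row number yourself, here is a minimal sketch with made-up data. With a header row, the rule is Excel row = pandas index + 2: add 1 because pandas is 0-based while Excel is 1-based, and add 1 more because the header occupies Excel row 1.

```python
import pandas as pd

# Hypothetical example: rows 0 and 2 are duplicates of each other
df = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"]})

dup_mask = df.duplicated(subset=["Name"], keep=False)
# +2 converts a 0-based pandas index to a 1-based Excel row below a header
excel_rows = [i + 2 for i in df[dup_mask].index]
print(excel_rows)  # [2, 4]: the two 'Ann' rows in the spreadsheet
```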