How to Remove Duplicates in Excel Using Python
Okay, so you might be wondering why the hell you need to remove duplicates in spreadsheets using Python when you can simply click on the Remove Duplicates button in Excel.

[Image: Remove Duplicates Easily in Excel]

And you would be justified in asking this question. However, the Remove Duplicates button keeps the FIRST unique value and removes all the rest. What if you want to remove ALL duplicates? Or keep all but the LAST duplicate? Or you just want to know which records are duplicates? While these situations can all be hacked together in Excel, they are much easier to handle in Python, and here at Love Spreadsheets we are all about using the right tool for the job.

Pre-requisite

If you do not know how to run Python scripts or how to read and write files using Python and Pandas, then go through this tutorial first: Intro to Reading and Writing Spreadsheets with Python.

File

We will be using a sample file for this tutorial. You can download it here. We will count a record as a duplicate if it has the same values in the Name, Address and Call Date columns. Using these criteria, the file has three sets of duplicates, as seen below.

Full Code

See the full code below to follow along.

Create the File for Code

Open a text editor and create a file named duplicates.py. Save it in the same folder as the Duplicates.xlsx file.

Import Library

Line 1: Import the pandas library to read the spreadsheet, remove the duplicates and write the results.

```python
import pandas as pd
```

Read the File

Line 3: We read the spreadsheet using pandas and store the result in a data frame, file_df.

```python
file_df = pd.read_excel("Duplicates.xlsx")
```

Keep only FIRST record of duplicate set

Lines 6–7: The first case we will go through is the same as Excel, where we want to remove all duplicates in a set except for the first. The code for this is one line with two important parameters.
```python
file_df_first_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="first")
```

We call the drop_duplicates method on the file_df that we read in from the spreadsheet. In the drop_duplicates method there are two important parameters:

- subset: the list of columns to compare when deciding whether records are duplicates of each other
- keep: which record of each duplicate set to keep ("first", "last", or False to keep none)
These are both OPTIONAL. If you don't specify them, then pandas treats a record as a duplicate on ALL columns and keeps only the FIRST record. You can read more about this method here.

Now we write this dataframe, with duplicates removed and only the first row of each set kept, to a new file.

```python
file_df_first_record.to_excel("Duplicates_First_Record.xlsx", index=False)
```

This is what that file looks like. As you can see, the first record of each set was kept.

Keep only LAST record of duplicate set

Lines 10–11: This is identical to the step above, except we change the keep parameter to "last".

```python
file_df_last_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="last")
```

Then we write this dataframe, with only the last row of each duplicate set kept, to a new file.

```python
file_df_last_record.to_excel("Duplicates_Last_Record.xlsx", index=False)
```

This is what the file looks like. Now the last record was kept for each set.

Remove ALL records from a set of duplicates

Lines 14–15: Now let's say we want to remove all records that belong to a set of duplicates. In this case we change the keep parameter to False (the Python boolean, not the string "False").

```python
file_df_remove_all = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep=False)
```

Now we write this dataframe, with every duplicate set removed entirely, to a file.

```python
file_df_remove_all.to_excel("Duplicates_All_Removed.xlsx", index=False)
```
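Before moving on, the three keep options can be contrasted side by side. Here is a minimal sketch using a hypothetical three-row DataFrame in place of the sample file (the data is made up for illustration; the column names mirror the tutorial's):

```python
import pandas as pd

# A hypothetical three-row stand-in for Duplicates.xlsx:
# the two "apple" rows match on Name, Address and Call Date.
df = pd.DataFrame({
    "Name": ["apple", "apple", "pear"],
    "Address": ["1 Main St", "1 Main St", "2 Oak Ave"],
    "Call Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
})
cols = ["Name", "Address", "Call Date"]

kept_first = df.drop_duplicates(subset=cols, keep="first")  # first apple + pear
kept_last = df.drop_duplicates(subset=cols, keep="last")    # last apple + pear
removed_all = df.drop_duplicates(subset=cols, keep=False)   # pear only

print(kept_first.index.tolist())   # [0, 2]
print(kept_last.index.tolist())    # [1, 2]
print(removed_all.index.tolist())  # [2]
```

Note that drop_duplicates keeps the original index labels of the surviving rows, which is a handy way to see exactly which records each option preserved.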
This is what Duplicates_All_Removed.xlsx looks like.

Get Which Records Are Duplicates

Lines 18–20: Okay, so let's say you want to get a list of all records that belong to a duplicate set. This is not as straightforward a case, but still very doable.

First, we get a boolean list indicating which rows are duplicates.

```python
duplicate_row_index = file_df.duplicated(subset=["Name", "Address", "Call Date"], keep=False)
```

Note: we are using a different method here, called duplicated. It gives us a True or False value for each row in the dataframe/spreadsheet. In this method, the keep parameter specifies which rows we want identified as duplicates. Since we pass False, it marks every row that belongs to a duplicate set. If we had specified "first", it would have marked all rows EXCEPT the first row of each set. You can read more about this method here.

Next, we get the actual records from the dataframe. The command below gives us all the rows that were identified as duplicates.

```python
all_duplicate_rows = file_df[duplicate_row_index]
```

Finally, we write this to a spreadsheet. Here we use index=True because we want the row numbers as well.

```python
all_duplicate_rows.to_excel("Duplicate_Rows.xlsx", index=True)
```

This is what that file looks like: all records that were part of a duplicate set.

NOTE: The pandas index is 0-based AND treats the column header as separate. So in this file, since we had a header, the index 1 next to apple actually refers to Row 3 in the original file.
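For reference, here is the full duplicates.py script, reassembled from the snippets above. The blank lines and comments are arranged so that the physical line numbers match the Line references used throughout the tutorial (Line 1, Lines 6–7, and so on); it needs Duplicates.xlsx in the same folder to run.

```python
import pandas as pd

file_df = pd.read_excel("Duplicates.xlsx")

# Keep only the FIRST record of each duplicate set
file_df_first_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="first")
file_df_first_record.to_excel("Duplicates_First_Record.xlsx", index=False)

# Keep only the LAST record of each duplicate set
file_df_last_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="last")
file_df_last_record.to_excel("Duplicates_Last_Record.xlsx", index=False)

# Remove ALL records that belong to a duplicate set
file_df_remove_all = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep=False)
file_df_remove_all.to_excel("Duplicates_All_Removed.xlsx", index=False)

# Get every record that belongs to a duplicate set
duplicate_row_index = file_df.duplicated(subset=["Name", "Address", "Call Date"], keep=False)
all_duplicate_rows = file_df[duplicate_row_index]
all_duplicate_rows.to_excel("Duplicate_Rows.xlsx", index=True)
```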