I have a text file with some 1,200 rows. Some of them are duplicates.
How could I find the duplicate lines in the file (ignoring case) and then print each line's text on the screen, so I can go off and find it? I don't want to delete them or anything, just find which lines they are.
jww
asked Oct 17, 2012 at 15:26
This is pretty easy with a set:
with open('file') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower in seen:
            print(line)
        else:
            seen.add(line_lower)
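If you also want to know where each duplicate sits, one possible variation (a sketch, not part of the answer above) records the 1-based line numbers for every case-folded line. The function name `find_duplicate_lines` and the choice to strip only the trailing newline are assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_lines(lines):
    """Map each case-insensitively repeated line to its 1-based line numbers."""
    positions = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Assumption: drop only the trailing newline, so a last line without
        # a final newline still compares equal to earlier copies.
        key = line.rstrip("\n").lower()
        positions[key].append(number)
    # Keep only the keys that occurred more than once.
    return {key: nums for key, nums in positions.items() if len(nums) > 1}


sample = ["one\n", "One\n", "two\n", "oNe\n", "three"]
for text, numbers in find_duplicate_lines(sample).items():
    print(text, numbers)  # prints: one [1, 2, 4]
```

An open file object can be passed directly, for example `with open('file') as f: find_duplicate_lines(f)`.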
answered Oct 17, 2012 at 15:28
mgilson
As there are only 1,200 lines, you can also use collections.Counter():
>>> from collections import Counter
>>> with open('data1.txt') as f:
...     c = Counter(c.strip().lower() for c in f if c.strip())  # for case-insensitive search
...     for line in c:
...         if c[line] > 1:
...             print(line)
...
If data1.txt is something like this:
ABC
abc
aBc
CAB
caB
bca
BcA
acb
the output is (the order may vary with the Python version):
cab
abc
bca
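To try the Counter approach without creating data1.txt, the sample can be fed in from memory. This is a minimal sketch for Python 3; `io.StringIO` stands in for the file, and the helper name `duplicated` is made up for the example.

```python
import io
from collections import Counter


def duplicated(lines):
    """Return the lowercased lines that occur more than once, skipping blank lines."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return [line for line, n in counts.items() if n > 1]


# The same eight lines as the sample file shown above.
data = io.StringIO("ABC\nabc\naBc\nCAB\ncaB\nbca\nBcA\nacb\n")
print(duplicated(data))  # on Python 3.7+: ['abc', 'cab', 'bca']
```

On Python 3.7 and later a Counter keeps keys in first-seen order, so the duplicates come out in the order they first appear in the input.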
answered Oct 17, 2012 at 15:34
Ashwini Chaudhary
Finding Case-Insensitive Duplicates
This won't give you line numbers, but it will give you a list of duplicate lines which you can then investigate further. For example:
tr 'A-Z' 'a-z' < /tmp/foo | sort | uniq -d
Example Data File
# /tmp/foo
one
One
oNe
two
three
The pipeline listed above will correctly yield:
one
Finding the Line Numbers
You could then grep for related line numbers like so:
grep --ignore-case --line-number one /tmp/foo
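For comparison, the two shell steps can be approximated in Python. This is only a sketch: the helper names `uniq_d` and `grep_in` are invented here, `str.lower()` also folds non-ASCII letters (unlike `tr 'A-Z' 'a-z'`), and the search treats the pattern as a plain substring rather than a regular expression.

```python
import io
from itertools import groupby


def uniq_d(lines):
    """Mimic `tr 'A-Z' 'a-z' | sort | uniq -d`: one copy of each repeated line."""
    lowered = sorted(line.rstrip("\n").lower() for line in lines)
    # groupby() only merges adjacent equal items, which is why the sort
    # matters, just as uniq expects sorted input.
    return [key for key, group in groupby(lowered) if len(list(group)) > 1]


def grep_in(pattern, lines):
    """Mimic `grep --ignore-case --line-number` for a plain-text pattern."""
    needle = pattern.lower()
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if needle in line.lower()]


foo = "one\nOne\noNe\ntwo\nthree\n"
print(uniq_d(io.StringIO(foo)))          # ['one']
print(grep_in("one", io.StringIO(foo)))  # [(1, 'one'), (2, 'One'), (3, 'oNe')]
```

Like the grep command, `grep_in` matches the pattern anywhere in a line, so a line such as "someone" would also be reported; with grep itself, adding `--line-regexp` restricts matches to whole lines.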
answered Oct 17, 2012 at 15:36
Todd A. Jacobs
Here is a short Python program that counts how many times each line occurs in a text file and writes the counts to Duplicate_Log_Info.txt next to the chosen file.
import collections
import tkinter as tk
from pathlib import Path
from tkinter import filedialog

root = tk.Tk()
canvas1 = tk.Canvas(root, width=800, height=300)
canvas1.pack()
label1 = tk.Label(root, text='Log Analyser')
label2 = tk.Label(root, text='Import a file...')
label1.config(font=('Arial', 20))
label2.config(font=('Arial', 10))
canvas1.create_window(400, 50, window=label1)
canvas1.create_window(200, 180, window=label2)

def getLogFile():
    import_file = filedialog.askopenfilename()
    if not import_file:  # the dialog was cancelled
        return
    entries = Path(import_file).resolve()
    # Write the report into the same directory as the input file.
    out_path = entries.with_name("Duplicate_Log_Info.txt")
    with open(entries, "r") as f:
        counts = collections.Counter(l.strip() for l in f)
    with open(out_path, "w") as fw:
        # One "line|count" record per distinct line, most frequent first.
        for line, count in counts.most_common():
            fw.write(line + "|" + str(count) + "\n")
    label3 = tk.Label(root, text=entries.name + ": Import is successful, Please check the output file - " + str(out_path) + ".")
    label3.config(font=('Arial', 10))
    canvas1.create_window(400, 220, window=label3)

browseButton_Excel = tk.Button(text='Choose a file...', command=getLogFile, bg='green', fg='white', font=('helvetica', 12, 'bold'))
canvas1.create_window(400, 180, window=browseButton_Excel)
button3 = tk.Button(root, text='Close', command=root.destroy, bg='green', font=('helvetica', 11, 'bold'))
canvas1.create_window(500, 180, window=button3)
root.mainloop()
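The counting step does not depend on the GUI. A minimal sketch of just that part, assuming the same "line|count" record format (the function name `duplicate_report` is hypothetical):

```python
from collections import Counter


def duplicate_report(lines):
    """Return "line|count" records, most frequent line first."""
    counts = Counter(line.strip() for line in lines)
    return [f"{line}|{count}" for line, count in counts.most_common()]


print(duplicate_report(["a\n", "b\n", "a\n"]))  # ['a|2', 'b|1']
```

Any record whose count is greater than 1 marks a duplicated line. Note that this comparison is case-sensitive; lowercasing each line before counting would make it case-insensitive.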