Find duplicate words in a text file in Python

This snippet doesn't read from a file, but it's easy to test and study. The only difference is that you have to open the file and read it line by line, as you did in your example; a sketch of that is shown after the snippet.

example_file = """
This is a text file example

Let's see how many times example is typed.

"""
result = {}
words = example_file.split()
for word in words:
    # if the word is not yet in the result dictionary, get() returns the default 0, then we add 1
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))

UPDATE:

As suggested by @khachik, a better solution is to use collections.Counter.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
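Applied to the duplicate-word problem, the same Counter can simply be filtered down to the words that occur more than once. A minimal sketch, assuming the file to check is again called input.txt:

# Sketch: list only the words that occur more than once, using Counter.
# "input.txt" is an assumed placeholder filename.
import re
from collections import Counter

words = re.findall(r'\w+', open('input.txt').read().lower())
duplicates = [word for word, count in Counter(words).items() if count > 1]
print(duplicates)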

Python program to find duplicate words in a file:

In this post, we will learn how to find the duplicate words in a file in Python. Python provides several built-in methods for working with files. We can use these methods to open a file, read the content of a file, and also write content to a file.

We will write a program that takes the path of a file as the input and prints out all duplicate words in that file.

Before moving to the program, let’s go through the algorithm.

Algorithm:

This program will follow the below algorithm:

  • Open the file in read mode.
  • Initialize two empty sets: one to hold all words seen so far and another to hold the duplicate words. We are using sets because a set can’t hold duplicate values.
  • Iterate through the lines of the file with a loop.
  • For each line, get the list of words by using split().
  • Iterate through the words of each line by using a loop. Check if the current word is in the first set or not.

    • If yes, add it to the second set as it is a duplicate word.
    • If it is not found, add it to the first set, since this word has not been seen before.
  • Once the loops are completed, print the content of the second set, which includes only duplicate words.

Python program:

Let’s write down the program:

words_set = set()
duplicate_set = set()

with open('input.txt') as input_file:
    file_content = input_file.readlines()

for line in file_content:
    words = line.split()
    for word in words:
        if word in words_set:
            duplicate_set.add(word)
        else:
            words_set.add(word)

for word in duplicate_set:
    print(word)

Here,

  • words_set and duplicate_set are two sets that hold the words and the duplicate words of the file.
  • The first with block reads the content of the file. The readlines() method returns the lines of the file as a list, and this value is stored in the file_content variable.
  • The for loop iterates through the lines in the list and gets the words in each line by using split().
  • The inner for loop iterates through the words of each line. For each word, it checks if it is in words_set or not. If yes, it adds that word to duplicate_set as it is a duplicate. Else, it adds it to words_set.
  • Once the loops are completed, it uses another loop to print the words of duplicate_set.

For example, if the input.txt holds the following text:

hello world
hello universe
hello again
hello world !!

It will print the below output (the order may vary, since a set is unordered):

hello
world

Method 2: By using a dictionary:

If you run the above program, it may print the output in a different order each time, because a set doesn’t maintain insertion order. If you want to preserve the order, you can use a dictionary.

Dictionaries are used to hold key-value pairs. For this example, the key will be the word and the value will be its number of occurrences in the file.

The program will iterate through the words; if a word is not yet in the dictionary, it adds it with the value 0. It then increments that value by 1.

To find the duplicate words, it will iterate through the dictionary and pick out all words with a count greater than 1.

Below is the complete program:

words_dict = {}

with open('input.txt') as input_file:
    file_content = input_file.readlines()

for line in file_content:
    words = line.split()
    for word in words:
        if word not in words_dict:
            words_dict[word] = 0
        words_dict[word] += 1

for word, count in words_dict.items():
    if count > 1:
        print(word)

If you run this program, it will print the duplicate words in the same order in which they first appear in the file.
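The same counting can also be written with collections.Counter, which, like a plain dictionary, remembers insertion order. This is only a sketch of an alternative to the dictionary version above, again assuming the file is named input.txt:

# Sketch: the dictionary method rewritten with collections.Counter.
# Counter preserves insertion order, so duplicates print in the order they are first seen.
# "input.txt" is an assumed placeholder filename.
from collections import Counter

with open('input.txt') as input_file:
    counts = Counter(word for line in input_file for word in line.split())

for word, count in counts.items():
    if count > 1:
        print(word)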


How do you find duplicate words in a text file in Python?

Open the file in read mode and initialize two empty sets. Iterate through the lines of the file, get the words of each line with split(), and check each word against the first set: if it is already there, add it to the duplicate set; otherwise add it to the first set. This is the algorithm implemented in the first program above.

How do I find the most repeated words in a text file in Python?

Open the file with open(inputFile, 'r') as filedata, traverse each line with a for loop, split each line into a list of words with split(), and append each word to a list with append(). You can then count how often each word appears in that list and pick the words with the highest counts.

How do I find the most frequent words in a text file?

This can be done by opening the file in read mode, reading it line by line, splitting each line into words and storing them in a list. Then iterate through the list, count the frequency of each word, and compare each frequency with the running maximum (maxcount) to keep the most frequent word.
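A minimal sketch of that maxcount approach, assuming the file is again called input.txt:

# Sketch: find the most frequent word by tracking a running maximum count.
# "input.txt" is an assumed placeholder filename.
counts = {}
with open('input.txt', 'r') as filedata:
    for line in filedata:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1

max_word, max_count = None, 0
for word, count in counts.items():
    if count > max_count:
        max_word, max_count = word, count

print("most frequent word: %s (%d times)" % (max_word, max_count))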

How do you find duplicate lines in Python?

A common approach is to read all the lines, then rewrite the file, keeping only the lines that have not been seen before:

lines_seen = set()  # holds lines already seen
with open("file.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i not in lines_seen:
            f.write(i)
            lines_seen.add(i)
    f.truncate()
