How to find common words in Python

I want to cross-check names from two Word documents and then print the names common to both, all in the same program. How do I do that? Should I use regex, or simply use the in operator?

S.S. Anne

asked Aug 13, 2012 at 17:05

Once you have the text out of the Word documents, it's really quite easy:

document_1_text = 'This is document one'
document_2_text = 'This is document two'

document_1_words = document_1_text.split()
document_2_words = document_2_text.split()

# words that appear in both documents
common = set(document_1_words).intersection(set(document_2_words))
# words that appear in exactly one of the two documents
unique = set(document_1_words).symmetric_difference(set(document_2_words))
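
On the regex part of the question: a plain split() treats "Carol" and "carol", or "Alice," and "Alice", as different words. A minimal sketch, assuming names should match regardless of case and surrounding punctuation, is to tokenize with re.findall before building the sets. The helper name name_set and the sample strings are made up for illustration.

```python
import re

def name_set(text):
    """Return the set of lower-cased word tokens in text (hypothetical helper)."""
    # [\w']+ keeps letters, digits, underscores and apostrophes together,
    # so commas, semicolons and periods no longer stick to the words
    return set(re.findall(r"[\w']+", text.lower()))

doc1 = "Alice, Bob and Carol attended."
doc2 = "carol met ALICE; Dave was absent."

common = name_set(doc1) & name_set(doc2)
print(sorted(common))  # ['alice', 'carol']
```

This still compares single words, so multi-word names such as "Alice Smith" would need their own matching rule.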

If you're not sure how to get the text out of Word docs:

from win32com.client import Dispatch

def get_text_from_doc(filename):
    # Drives Word through COM, so this needs Windows with Microsoft Word installed
    word = Dispatch('Word.Application')
    word.Visible = False
    wdoc = word.Documents.Open(filename)
    if wdoc:
        return wdoc.Content.Text.strip()
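
If Word itself is not available, a .docx file (not the legacy .doc format) is a zip archive whose main text lives in word/document.xml, so the standard library can read it. The following is only a rough sketch under that assumption; get_text_from_docx is a hypothetical name, and it ignores headers, footnotes and other parts of the document.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def get_text_from_docx(path):
    """Return the body text of a .docx file, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # a paragraph's text is split across one or more <w:t> runs
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string can then be split into words and compared exactly as in the first snippet.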

answered Aug 27, 2012 at 4:38

Matthew Trevor

str1 = "Hello world its a demo"
str2 = "Hello world"

str1_words = set(str1.split())
str2_words = set(str2.split())

common = str1_words & str2_words

output:

common = {'Hello', 'world'}

answered Aug 13, 2019 at 11:21

Suman

You need to store the words from one document, then go through the words of the second document checking to see if each word was in the previous document. So, if I had two strings instead of documents, I could do this:

a = "Hello world this is a string"
b = "Hello world not like the one before"

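Store the words of the first string in a dictionary, then look up each word of the second string:

d = {}
for word in a.split():
    d[word] = True
for word in b.split():
    # "word in d" avoids the KeyError that d[word] raises for unseen words
    if word in d:
        print(word)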

answered Aug 13, 2012 at 17:40

TheDude

str1 = "Hello world its a demo"
str2 = "Hello world"

# compare every word of str1 with every word of str2
for word1 in str1.split():
    for word2 in str2.split():
        if word1 == word2:
            print(word1)
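
The nested loops compare every pair of words, and a word that repeats in either string gets printed more than once. As a possible alternative, here is a small sketch that keeps the order of the first string but reports each common word once; the function name common_words_in_order is made up for this example.

```python
def common_words_in_order(first, second):
    """List words of `first` that also occur in `second`, without repeats (hypothetical helper)."""
    second_words = set(second.split())  # set lookup instead of an inner loop
    # dict.fromkeys keeps first-seen order while dropping repeated words
    return [word for word in dict.fromkeys(first.split()) if word in second_words]

print(common_words_in_order("Hello world its a demo world", "Hello world"))
# ['Hello', 'world']
```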

answered Jul 11, 2017 at 10:23

Just came across this thread and didn't see this method, so I wanted to add that you can do this:

from collections import Counter

foo = "This is a string"
bar = "This string isn't like the one before"

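# adding two Counters sums the per-word counts;
# printing a Counter lists entries from most to least common
baz = Counter(foo.split(" ")) + Counter(bar.split(" "))

baz is now a Counter (a dict subclass) that looks like this:

Counter({'This': 2, 'string': 2, 'is': 1, 'a': 1, "isn't": 1, 'like': 1, 'the': 1, 'one': 1, 'before': 1})

Now you can see that the two strings have "This" and "string" in common, since each has a count of 2. (This reading assumes no word repeats within a single string; a word used twice in one string would also show a count of 2.)

You could also convert both strings (foo and bar) to lowercase using .lower() before you use Counter() on them, so that words differing only in case are counted together.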
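
A variation on the same idea, sketched here with made-up sample strings: the & operator on two Counter objects keeps only the words present in both, with the smaller of the two counts, so a word repeated inside one string is not mistaken for a shared one. Both strings are lower-cased first.

```python
from collections import Counter

foo = "This is a string string"
bar = "this string isn't like the one before"

# & keeps each word found in both Counters, with the minimum of its two counts
shared = Counter(foo.lower().split()) & Counter(bar.lower().split())

print(sorted(shared.items()))  # [('string', 1), ('this', 1)]
```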

answered Apr 19, 2021 at 9:49

Given a data set, we can find the k most frequent words.

A solution to this problem is already covered in Find the k most frequent words from a file. But we can solve it very efficiently in Python with the help of some high-performance modules.

To do this, we'll use collections, a module that provides specialized container data types, and in particular its Counter class.


Examples :

Input : "John is the son of John second. 
         Second son of John second is William second."
Output : [('second', 4), ('John', 3), ('son', 2), ('is', 2)]

Explanation :
1. The string will be converted into a list like this :
    ['John', 'is', 'the', 'son', 'of', 'John', 
     'second', 'Second', 'son', 'of', 'John', 
     'second', 'is', 'William', 'second']
2. Now 'most_common(4)' will return the four most 
   frequent words and their counts as tuples.
   The count of 4 for 'second' counts 'Second' together 
   with 'second' and ignores the trailing periods; a plain 
   split() keeps 'Second' and 'second.' as separate tokens.


Input : "geeks for geeks is for geeks. By geeks
         and for the geeks."
Output : [('geeks', 5), ('for', 3)]

Explanation :
most_common(2) will return the two most frequent words and their counts.

Approach :

  1. Import the Counter class from the collections module.
  2. Split the string into a list of words using split().
  3. Pass the list to Counter to create a Counter instance.
  4. Call most_common() on that instance to get a list of the most frequent words and their counts.

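Below is a Python implementation of the above approach :

from collections import Counter

data_set = "Welcome to the world of Geeks " \
    "This portal has been created to provide well written well " \
    "thought and well explained solutions for selected questions " \
    "If you like Geeks for Geeks and would like to contribute " \
    "here is your chance You can write article and mail your article " \
    " to contribute at geeksforgeeks org See your article appearing on " \
    "the Geeks for Geeks main page and help thousands of other Geeks. "

# split() returns a list of all the words in the string
split_it = data_set.split()

# count the words (named word_counts so it does not shadow the Counter class)
word_counts = Counter(split_it)

# most_common(4) returns the four most frequent words with their counts
most_occur = word_counts.most_common(4)

print(most_occur)

Output :

[('Geeks', 5), ('to', 4), ('and', 4), ('well', 3)]

Note that 'well', 'for', 'your' and 'article' all occur three times. On Python 3.7 and later, ties are listed in order of first occurrence, so 'well' takes the fourth place; older versions may list a different one of these words.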
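
The implementation above counts raw tokens, so 'Geeks' and 'Geeks.' are tallied separately. As a rough sketch, assuming case and punctuation should be ignored, the first example input from earlier can be normalized with a regular expression before counting. The function name most_frequent_words is hypothetical.

```python
import re
from collections import Counter

def most_frequent_words(text, k):
    """Return the k most frequent lower-cased words in text (hypothetical helper)."""
    # lower-case first, then keep runs of word characters and apostrophes
    words = re.findall(r"[\w']+", text.lower())
    return Counter(words).most_common(k)

text = ("John is the son of John second. "
        "Second son of John second is William second.")
print(most_frequent_words(text, 4))
# [('second', 4), ('john', 3), ('is', 2), ('son', 2)]
```

Here 'of' also occurs twice; it is left out only because ties are listed in order of first occurrence.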

How do you find common words in a string in Python?

  1. Convert s0 and s1 to lowercase.
  2. s0List := a list of the words in s0.
  3. s1List := a list of the words in s1.
  4. Convert s0List and s1List to sets, intersect them to get the common words, and return the size of the intersection.
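
The steps above can be sketched as a short function. The names s0 and s1 follow the steps; count_common_words and the sample sentences are made up for illustration.

```python
def count_common_words(s0, s1):
    """Count the distinct words shared by s0 and s1, ignoring case (hypothetical helper)."""
    s0_words = set(s0.lower().split())
    s1_words = set(s1.lower().split())
    # the intersection holds the words present in both strings
    return len(s0_words & s1_words)

print(count_common_words("i love python coding", "coding in Python is easy"))  # 2
```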

How do you find common letters in Python?

Python program to find common characters in two strings:

  1. Read two input strings and store them in separate variables.
  2. Convert both strings into sets and find the letters common to both sets.
  3. Store the common letters in a list.
  4. Use a for loop to print the letters in the list.
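
A minimal sketch of those steps, using fixed sample strings instead of user input so it runs as-is; common_letters is a hypothetical name. Note that every character counts here, so spaces or punctuation shared by both strings would be reported too.

```python
def common_letters(first, second):
    """Return the characters found in both strings, sorted (hypothetical helper)."""
    # a set of a string holds its distinct characters
    return sorted(set(first) & set(second))

letters = common_letters("hello", "world")
for letter in letters:
    print(letter)
```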

How do I find the most common letter in a word in Python?

Method 2: Using collections.Counter() + max(). This is the most commonly suggested method. Counter() computes the frequency of every element, and it can also report the frequency of a single element if required. The most frequent character is then found by applying max() to the counts.
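
One way this method might look in practice, as a sketch: most_common_letter is a made-up name, and if several letters tie for the highest count, max() returns the one that appears first in the word.

```python
from collections import Counter

def most_common_letter(word):
    """Return the character that occurs most often in word (hypothetical helper)."""
    counts = Counter(word)
    # max over the distinct characters, ranked by their counts
    return max(counts, key=counts.get)

print(most_common_letter("geeksforgeeks"))  # e
```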

How do I find the most common words?

WordCounter analyzes your text and tells you the most common words and phrases. This tool helps you count words, bigrams, and trigrams in plain text. This is often the first step in quantitative text analysis.
