Python find similar strings in list

Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

How can I find the n words that are closest to "Hello" and present in the list words?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this

word = 'Hello'
for i, item in enumerate(words):
    if item.lower() > word.lower():
        ...

but it's very slow in large lists.
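For reference, one way to sketch that ranking with only the standard library is difflib.SequenceMatcher (this is the same difflib the update below reports as too slow at 630,000+ words, so treat it as an illustration rather than a final solution):

```python
import difflib

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo',
         'question', 'Hallo', 'format']
word = 'Hello'

# Rank every candidate by its similarity ratio to `word`, best first.
# `sorted` is stable, so equal scores keep their original list order.
ranked = sorted(
    words,
    key=lambda w: difflib.SequenceMatcher(None, word.lower(), w.lower()).ratio(),
    reverse=True,
)
print(ranked[:4])  # ['hello', 'Hallo', 'hallo', 'Hallo']
```

difflib also ships get_close_matches(word, words, n=5, cutoff=0.6) for exactly this "n closest words" use case.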

UPDATE: difflib works, but it is also very slow (the words list has 630,000+ words, sorted and one per line), so every search for the closest word takes 5 to 7 seconds!

When we deal with data from different sources, we often encounter one problem: some strings may be spelled differently but have the same meaning. We humans can identify the meaning right away, but machines cannot. This short tutorial covers how to find similar strings using Python.

Machines don’t really understand the human language…

For example, we have two sentences:

>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 = "the leaf is green, banana yellow, apple red, pumpkin orange"

We humans can tell that s1 and s2 have the same meaning. But in the eyes of a machine, they are two different sentences.

>>> s1 == s2
False

So how can we get the machine to recognize that these two sentences mean the same thing? Applying machine learning techniques is one way, but today we’ll show something much easier to use.

Introducing The Levenshtein Distance

It turns out that there’s a formula called the Levenshtein distance that measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. This distance is named after the Soviet mathematician Vladimir Levenshtein. In this tutorial we will leave out the theoretical details, but if you are interested, Wikipedia is a good starting point.
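To make the idea concrete, here is a minimal (unoptimized) dynamic-programming sketch of the Levenshtein distance; libraries like fuzzywuzzy use much faster implementations under the hood:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep only the previous row of the classic DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```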

Fuzzy String Matching In Python

The technical term for finding similar strings is fuzzy string matching. We are going to use a library called fuzzywuzzy. Although it has a funny name, it is a very popular library for fuzzy string matching. The fuzzywuzzy library can calculate the Levenshtein distance, and it has a few other powerful functions to help us with fuzzy string matching.

Let’s start from something easy, like comparing two words. Then we’ll move on to more complicated scenarios, such as comparing multiple sentences. Here we go!

Let’s first install the library. If you are new to this blog and need help with installing Python and libraries, read this tutorial here.

pip install fuzzywuzzy
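As a side note (optional, not required for this tutorial): fuzzywuzzy prints a warning at import time when the optional python-Levenshtein package is missing, and installing it speeds up the ratio calculations considerably:

```shell
pip install python-Levenshtein
```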

Single Word Example

Starting with the simplest example: how do you think the computer views these two words, “Bezos” and “bezos”? Of course, they are not the same, because the capital letter “B” does not equal the lower-case “b”!

>>> "Bezos" == "bezos"
False

It’s rather easy to match these two words: we can simply make both words all lower case (or upper case), then compare again. We can use the string lower() / upper() methods directly on any given string.

>>> "Bezos".lower() == "bezos".lower()
True

>>> "Bezos".upper() == "bezos".upper()
True

Multiple Words Example

You might already know that “Jeff” is the short name for “Jeffery”, but apparently machines don’t know that yet.

>>> n1 = "Jeff Bezos"
>>> n2 = "Jeffery Bezos"
>>> n1 == n2
False

Now, using the fuzzy string matching technique, we get a score of 87. The ratio() method calculates a similarity ratio based on the Levenshtein distance. Personally, I interpret these ratios as a “probability of similarity”: the higher the ratio, the more similar the two strings.

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio(n1, n2)
87

Let’s now introduce another variable n3 = “Bezos”, and then calculate the Levenshtein distance ratio with both “Jeff Bezos” and “Jeffery Bezos”.

The matching results are very poor. This is because by definition of the Levenshtein distance, it takes many edits to change “Bezos” to either “Jeff Bezos” or “Jeffery Bezos”.

n3 = "Bezos"
>>> fuzz.ratio(n1, n3)
67
>>> fuzz.ratio(n2, n3)
56

We can use fuzzywuzzy’s partial_ratio() method to help in this situation. The partial_ratio() method tries to match substrings within each sample string. In this case, because “Bezos” exists in both n1 and n2, we get a match value of 100 in both cases.

>>> fuzz.partial_ratio(n1, n3)
100

>>> fuzz.partial_ratio(n2, n3)
100
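A rough illustration of the substring idea, using only the standard library. Note this is a simplified sketch, not fuzzywuzzy’s actual algorithm (which aligns matching blocks rather than scanning every window):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(a: str, b: str) -> float:
    """Best ratio of the shorter string against every same-length
    window of the longer one (a simplification, not fuzzywuzzy's code)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio_sketch("Bezos", "Jeff Bezos"))     # 1.0
print(partial_ratio_sketch("Bezos", "Jeffery Bezos"))  # 1.0
```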

Sentence Example

Let’s take it a step further and compare two sentences.

>>> s3 = "The leaf is green, and the pumpkin is orange."
>>> s4 = "THE PUMPKIN IS ORANGE, AND THE LEAF IS GREEN"
>>> fuzz.ratio(s3.lower(), s4.lower())
47
>>> fuzz.partial_ratio(s3.lower(),s4.lower())
64

Note one sentence contains both upper and lower case letters, and the other sentence contains only upper case letters. We need to help the ratio function a little bit by converting both sentences into lower case letters.

The performance is still very poor with the ratio() method. Remember, the higher the ratio, the closer the match. The partial_ratio() method doesn’t help much in this case either, because the two sentences list their clauses in reverse order and neither string is a substring of the other.

Do not worry, because the fuzzywuzzy library provides other powerful functions for string comparison!

>>> fuzz.token_sort_ratio(s3, s4)
100

To see why the token_sort_ratio() function gives a perfect match, we need to understand what the function does:

  • “Tokenizes” the sentences, meaning it breaks each sentence up into individual words
  • Converts all words to lower case
  • Removes punctuation
  • Sorts the string tokens (i.e. individual words) alphabetically
  • Joins the individual words, then compares the newly formed strings

It’s clear that by the end of the above process, the two strings s3 and s4 are essentially the same, no wonder we got a perfect match!
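The steps above can be sketched with the standard library. This is an approximation of the preprocessing, not fuzzywuzzy’s actual code; the real function then runs fuzz.ratio() on the normalized strings:

```python
import re

def token_sort_normalize(s: str) -> str:
    # tokenize, lowercase, drop punctuation, sort, and re-join
    tokens = re.findall(r"[a-z0-9]+", s.lower())
    return " ".join(sorted(tokens))

s3 = "The leaf is green, and the pumpkin is orange."
s4 = "THE PUMPKIN IS ORANGE, AND THE LEAF IS GREEN"

print(token_sort_normalize(s3))  # and green is is leaf orange pumpkin the the
print(token_sort_normalize(s3) == token_sort_normalize(s4))  # True
```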

Sentence Example – Continued

Now let’s review our original question at the beginning of this tutorial. We’ll first try the token_sort_ratio function.

>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 = "the leaf is green, banana yellow, apple red, pumpkin orange"
>>> fuzz.token_sort_ratio(s1, s2)
77

This time, token_sort_ratio() gives a ratio of 77. The reason is that the s1 sentence contains many more duplicated words than the s2 sentence, such as “and”, “the”, and “is”. Those words appear several times in s1 but at most once in s2. However, since those duplicated words do not add meaning to the first sentence, we need to help the machine filter them out.

fuzzywuzzy provides another useful function to solve this problem: token_set_ratio().

This function is very similar to token_sort_ratio(). The key difference is that instead of just sorting the individual tokens, it also applies set() to the list of individual words (and compares combinations built from the intersection of the two token sets). So what does set() do in Python? The following example shows it well: a set can contain only unique items, which makes it a convenient way to remove duplicates from an iterable.

>>> li = [1,2,2,3,3,3,4,4,4,4]
>>> set(li)
{1, 2, 3, 4}

Let’s try it on the two strings s1 and s2. Again, a beautifully perfect match! This is because token_set_ratio() has removed all duplicated words, and what’s left of the two strings matches very well.

>>> fuzz.token_set_ratio(s1, s2)
100

What’s next?

Now we are equipped with the knowledge of fuzzy string matching in Python. In the next tutorial, we’ll walk through some examples of how to use fuzzy string matching together with pandas and Excel.

How do you find similar strings in Python?

A number of string similarity and distance measures can be used, many of them available in third-party libraries:

  • Levenshtein distance, plus its normalized and weighted variants
  • Damerau-Levenshtein distance
  • Shingle (n-gram) based similarity and distance
  • Other normalized and metric similarity and distance measures

How do you find similar words in Python?

To find synonyms of a word with NLTK WordNet:

  • Import nltk.corpus
  • Import WordNet from nltk.corpus
  • Create a set for collecting the synonym values of the word
  • Call the synsets() method on the word
  • Add the lemma names of each synset to the set

How do you find similar strings?

Hamming distance, named after the American mathematician Richard Hamming, is the simplest algorithm for calculating string similarity. It measures similarity by counting the number of positions at which two equal-length strings differ.
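A minimal sketch of the Hamming distance between two equal-length strings:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming_distance("karolin", "kathrin"))  # 3
```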

How do you check if a string is present in a list of strings in Python?

We can also use the count() method to get the number of occurrences of a string in a list. If it returns 0, the string is not present in the list.

l1 = ['A', 'B', 'C', 'D', 'A', 'A', 'C']
s = 'A'
count = l1.count(s)
if count > 0:
    print(f'{s} is present in the list {count} times.')