Deletion distance between two strings python

I was solving this problem at Pramp and I have trouble figuring out the algorithm for this problem.

On the contrary, you've done a very good job of coming up with a solution. There are ways to improve it though.

As you note, this is just the Longest Common Subsequence problem in a thin disguise. The "deletion distance" between two strings is just the total length of the strings minus twice the length of the LCS.

That is, the LCS of dogs (4 characters) and frogs (5 characters) is ogs (3 characters), so the deletion distance is (4 + 5) - 2 * 3 = 3.

Therefore, all you need to do to solve the problem is to get the length of the LCS, so let's solve that problem.

Your solution is pretty good but the primary problem is that it takes O(mn) time and memory if the strings are of length m and n. You can improve this.

The first thing to notice is that if the strings have a common prefix or suffix then you can automatically eliminate it. That is, the deletion distance for Who let the big dogs out? and Who let the little frogs out? is the same as the deletion distance for big d and little fr. It is very cheap and easy to determine if two strings have a common prefix and suffix, and you go from having an array with 25*29 elements to an array with 5*9 elements, a huge win.

The next thing to notice is: you build the entire m*n array up front, but while you are filling in the array, m[i][j] only ever looks at m[i-1][j-1] or m[i-1][j] or m[i][j-1]. Since you never look at an array line that is two away, you don't ever need more than two lines! That is, you can:

  • allocate and compute the first line
  • allocate and compute the second line given the first line
  • throw away the first line; we'll never use it again
  • allocate and compute the third line from the second line
  • throw away the second line
  • … and so on

You still do O(mn) operations, and you still allocate in total the same amount of memory, but you only have a small amount of it in memory at the same time. If the strings are large, that's a considerable savings.

25 Apr 2018

Overview

The deletion distance of two strings is the minimum number of characters you need to delete in the two strings in order to get the same string. For instance, the deletion distance between heat and hit is 3:

  • By deleting e and a in heat, and i in hit, we get the string ht in both cases.
  • We cannot get the same string from both strings by deleting 2 letters or fewer. Given the strings str1 and str2, write an efficient function deletionDistance that returns the deletion distance between them. Explain how your function works, and analyze its time and space complexities.

Examples:

input:  str1 = "dog", str2 = "frog"
output: 3

input:  str1 = "some", str2 = "some"
output: 0

input:  str1 = "some", str2 = "thing"
output: 9

input:  str1 = "", str2 = ""
output: 0

Constraints:

  • [input] string str1
  • [input] string str2
  • [output] integer

function deletionDistance($str1, $str2) {
  // your code goes here
}

Hints

  • Recommend your peer to identify the base cases first. That is, cases where one of the strings is the empty string. In this case, the deletion distance is simply the length of the other string. After that, encourage them to try cases that are somewhat similar, such as one string containing 1 or 2 characters.
  • If your peer needs additional assistance, help them by hinting toward a recursive relation between the deletionDistance(str1, str2), and the deletionDistance for some prefixes of str1 and str2. After they find the relation, you may suggest using Dynamic Programming.
  • If your peer is still stuck finding the recursive relation guide them to look at two cases:
    • Case 1: The last character in str1 is equal to the last character in str2. In that case, one may assume that these two characters aren’t deleted, and look at the prefixes that don’t include the last character.
    • Case 2: The last character in str1 is different from the last character in str2. In that case, at least one of the characters must be deleted, thus we get the following equation: d(str1,str2) = 1 + min( d(str1.substring(0, n-1), str2), d(str1, str2.substring(0, m-1)) ) where n is the length of str1, m is the length of str2, and d(x,y) is the deletion distance between x and y.

Solution

Let str1Len and str2Len be the lengths of str1 and str2, respectively. Consider the function: opt(i,j) which returns the deletion distance for the i'th prefix of str1, and the j'th prefix of str2. What we want to do in this solution, is to use dynamic programming in order to build a function that calculates opt(str1Len, str2Len). Notice the following:

  • opt(0,j) = j and opt(i,0) = i

This is true because if one string is the empty string, we have no choice but to delete all letters in the other string.

  • If i,j > 0 and str1[i] ≠ str2[j] then opt(i,j) = 1 + min(opt(i-1, j), opt(i, j-1))

This holds since we need to delete at least one of the letters str1[i] or str2[j] and the deletion of one of the letters is counted as 1 deletion (hence the 1 in the formula). Then, since we’re left with either the (i-1)'th prefix of str1, or the (j-1)'th prefix of str2, need to take the minimum between opt(i-1,j) and opt(i,j-1). We, therefore, get the equation opt(i,j) = 1 + min(opt(i-1,j), opt(i,j-1)).

  • If i,j > 0 and str1[i] = str2[j], then opt(i,j) = opt(i-1, j-1)

This holds since we don’t need to delete the last letters in order to get the same string, we simply use the same deletions we would to the (i-1)'th and (j-1)'th prefixes.

Solution 1

After finding the relations above for opt(i,j), we use dynamic programming methods to calculate opt(str1Len, str2Len), i.e. the deletion distance for the two strings, by calculating opt(i,j) for all 0 ≤ i ≤ str1Len, 0 ≤ j ≤ str2Len, and saving previous values for later use:

Pseudocode:

function deletionDistance(str1, str2):
    str1Len = str1.length
    str2Len = str2.length
    
    # allocate a 2D array with str1Len + 1 rows and str2Len + 1 columns
    memo = new Array(str1Len + 1, str2Len + 1)

    for i from 0 to str1Len:
        for j from 0 to str2Len:
            if (i == 0):
                memo[i][j] = j
            else if (j == 0):
                memo[i][j] = i
            else if (str1[i-1] == str2[j-1]):
                memo[i][j] = memo[i-1][j-1]
            else:
                memo[i][j] = 1 + min(memo[i-1][j], memo[i][j-1])

    return memo[str1Len][str2Len]

Time Complexity: we have a nested loop that executes O(1) steps at every iteration, thus we the time complexity is O(N⋅M) where N and M are the lengths of str1 and str2, respectively.

Space Complexity: we save every value of opt(i,j) in our memo 2D array, which takes O(N⋅M) space, where N and M are the lengths of str1 and str2, respectively.

Solution 2

The solution above takes O(N⋅M) space since we save all previous values, but notice that opt(i,j) requires only opt(i-1,j), opt(i,j-1) and opt(i-1,j-1). Thus, by iterating first through 0 ≤ i ≤ str1Len, and then for every i calculating 0 ≤ j ≤ str2Len, we need only to save the values for the current i and the last i. This will reduce the space needed for the function.

Pseudocode:

function deletionDistance(str1, str2):
    # make sure the length of str2 isn't
    # longer than the length of str1
    if (str1.length < str2.length)
        tmpStr = str1
        str1 = str2
        str2 = tmpStr

    str1Len = str1.length
    str2Len = str2.length
    prevMemo = new Array(str2Len  + 1)
    currMemo = new Array(str2Len  + 1)

    for i from 0 to str1Len:
        for j from 0 to str2Len:
            if (i == 0):
                currMemo[j] = j
            else if (j == 0):
                currMemo[j] = i
            else if (str1[i-1] == str2[j-1]):
                currMemo[j] = prevMemo[j-1]
            else:
                currMemo[j] = 1 + min(prevMemo[j], currMemo[j-1])

        prevMemo = currMemo
        currMemo = new Array(str2Len + 1);  
                                           
    return prevMemo[str2Len]

Time Complexity: the time complexity stays the same, i.e. O(N⋅M), since we still run a nested loop with N⋅M iterations.

Space Complexity: O(min(N,M)), as we only need to hold two rows of the double array.

Example Code

PHP code

function deletionDistance($str1, $str2) {
	$cur = range(0, strlen($str1));
	$prev = [];
	for($i=0; $i<=strlen($str2); $i++){
		$prev[$i] = 0;
	}
	
	for($i=0; $i<strlen($str1);$i++){
		$cur = $cur;
		$prev = $prev;
		$cur[0] = $i + 1;
		for($j=0; $j<strlen($str2);$j++){
			$sub = $str1[$i] == $str2[$j] ? 0 : 2;
			$cur[$j+1] = min($prev[$j]+$sub, $cur[$j]+1, $prev[$j+1]+1);
		}
	}
	return $cur[count($cur)-1];
}

Python code 1 (working)

def deletionDistance(s1, s2):
  cur = list(range(len(s2) + 1))
  prev = [0] * (len(s2) + 1)
  for i in range(len(s1)):
    cur, prev = prev, cur
    cur[0] = i + 1
    for j in range(len(s2)):
      # Substitution is same as two deletions
      sub = 0 if s1[i] == s2[j] else 2
      cur[j+1] = min(prev[j] + sub, cur[j] + 1, prev[j+1] + 1)
  return cur[-1]

Python code 2 (not working)

def deletionDistance(s1, s2):
	m = [[0 for j in range(len(s2)+1)] for i in range(len(s1)+1)]
	for i in range(len(s1)+1):
    	for j in range(len(s2)+1):
	        if i == 0:
	            m[i][j] = sum(bytearray(s2[:j]))
	        elif j == 0:
	            m[i][j] = sum(bytearray(s1[:i]))
	        elif s1[i-1] == s2[j-1]:
	            m[i][j] = m[i-1][j-1]
	        else:
	            m[i][j] = ord(s1[i-1]) + ord(s2[j-1]) + min(m[i-1][j-1], m[i-1][j], m[i][j-1])
	return m[len(s1)][len(s2)]

Python code 3 (not working)

def deletionDistance(s1, s2):
    m = [[0 for j in range(len(s2)+1)] for i in range(len(s1)+1)]
    for i in range(len(s1)+1):
        for j in range(len(s2)+1):
            if i == 0:
                m[i][j] = sum(bytearray(s2[:j]))
            elif j == 0:
                m[i][j] = sum(bytearray(s1[:i]))
            elif s1[i-1] == s2[j-1]:
                m[i][j] = m[i-1][j-1]
            else:
                s1del = ord(s1[i-1])
                s2del = ord(s2[j-1])
                s1s2del = s1del + s2del
                m[i][j] = min(m[i-1][j-1] + s1s2del, m[i-1][j] + s1del, m[i][j-1] + s2del)
    return m[len(s1)][len(s2)]

Test case

PHP

$testCase = [
	["", "", 0],
	["", "hit", 3],
	["neat", "", 4],
	["heat", "hit", 3],
	["hot", "not", 2],
	["some", "thing", 9],
	["abc", "adbc", 1],
	["awesome", "awesome", 0],
	["ab", "ba", 2],
];

foreach($testCase as $key=>$case) {
	$index = $key+1;
	if ($case[2] != deletionDistance($case[0], $case[1])) {
		echo "Test case #{$index} failed
\n"; } else { echo "Test case #{$index} passed
\n"; } }

Python

testCase = [
	["", "", 0],
	["", "hit", 3],
	["neat", "", 4],
	["heat", "hit", 3],
	["hot", "not", 2],
	["some", "thing", 9],
	["abc", "adbc", 1],
	["awesome", "awesome", 0],
	["ab", "ba", 2],
];
# print(testCase)

index = 1;
for case in testCase:
  if(case[2] != deletionDistance(case[0], case[1])):
    print('#' + str(index) + 'failed
\n'); else: print('#' + str(index) + 'passed
\n'); index+=1

https://stackoverflow.com/questions/44490091/deletion-distance-between-2-strings https://stackoverflow.com/questions/41275345/deletion-distance-between-words

How does Python calculate edit distance?

The edit distance between two strings refers to the minimum number of character insertions, deletions, and substitutions required to change one string to the other. For example, the edit distance between "kitten" and "sitting" is three: substitute the "k" for "s", substitute the "e" for "i", and append a "g".

What is the minimum edit distance between two strings?

Minimum Edit distance between two strings str1 and str2 is defined as the minimum number of insert/delete/substitute operations required to transform str1 into str2. For example if str1 = "ab", str2 = "abc" then making an insert operation of character 'c' on str1 transforms str1 into str2.

How do you find the distance between two letters in python?

Distance between two letters is difference between their positions in the alphabet. for example: dist(c, e) = dist(e, c) = 2. dist(a, z) = dist(z, a) = 25.

How do you find the distance between two strings?

There are several ways to measure the distance between two strings. The simplest one is to use hamming distance to find the number of mismatch between two strings. However, the two strings must have the same length.