Python regex remove duplicate words

I am very new to Python.

I want to change a sentence if it contains repeated words.

Correct

  • Ex. "this just so so so nice" --> "this just so nice"
  • Ex. "this is just is is" --> "this is just is"

Right now I am using this regex, but it also matches on letters inside words. Ex. "My friend and i is happy" --> "My friend and is happy" (it removes the "i" and the space). ERROR

text = re.sub(r'(\w+) \1', r'\1', text) #remove duplicated words in row

How can I make the same change, but have it check whole words instead of letters?

asked Jun 21, 2013 at 15:08

text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text) #remove duplicated words in row

The \b matches the empty string, but only at the beginning or end of a word.
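To see what the word boundaries change, here is a small sketch that runs both kinds of pattern on the sentence from the question. The names NO_BOUNDARY and WITH_BOUNDARY are only illustrative, and the boundary-free pattern is assumed to be `(\w+) \1`, which reproduces the behaviour the question describes.

```python
import re

# Hypothetical names; NO_BOUNDARY is an assumed form of the question's pattern.
NO_BOUNDARY = r'(\w+) \1'
WITH_BOUNDARY = r'\b(\w+)( \1\b)+'

text = "My friend and i is happy"

# Without \b, "i" followed by the "i" of "is" counts as a repeat.
broken = re.sub(NO_BOUNDARY, r'\1', text)
print(broken)   # My friend and is happy

# With \b, the repeat must end at a word boundary, so "i is" is left alone.
fixed = re.sub(WITH_BOUNDARY, r'\1', text)
print(fixed)    # My friend and i is happy

# Real repeats are still collapsed.
collapsed = re.sub(WITH_BOUNDARY, r'\1', "this just so so so nice")
print(collapsed)  # this just so nice
```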

answered Jun 21, 2013 at 15:15

tomtom


Non-regex solution using itertools.groupby:

>>> strs = "this is just is is"
>>> from itertools import groupby
>>> " ".join([k for k,v in groupby(strs.split())])
'this is just is'
>>> strs = "this just so so so nice" 
>>> " ".join([k for k,v in groupby(strs.split())])
'this just so nice'
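If repeats that differ only in case should also be collapsed, groupby accepts a key function. The following is a minimal sketch under that assumption; collapse_repeats is a hypothetical helper name, and it keeps the first spelling seen in each run.

```python
from itertools import groupby


def collapse_repeats(sentence):
    # groupby puts consecutive words with equal lowercase forms into one group;
    # next(group) takes the first word of each group as its representative.
    return " ".join(
        next(group) for _, group in groupby(sentence.split(), key=str.lower)
    )


print(collapse_repeats("Hello hello world world"))  # Hello world
```

Note that split() with no argument also collapses runs of whitespace, so the original spacing is not preserved.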

answered Jun 21, 2013 at 15:10


Ashwini Chaudhary


  • \b: Matches Word Boundaries

  • \w: Any word character

  • \1: Backreference to the text captured by the first group (\w+); used as the replacement, it keeps a single copy of the repeated word

      import re
    
    
      def Remove_Duplicates(Test_string):
          Pattern = r"\b(\w+)(?:\W\1\b)+"
          return re.sub(Pattern, r"\1", Test_string, flags=re.IGNORECASE)
    
    
      Test_string1 = "Good bye bye world world"
      Test_string2 = "Ram went went to to his home"
      Test_string3 = "Hello hello world world"
      print(Remove_Duplicates(Test_string1))
      print(Remove_Duplicates(Test_string2))
      print(Remove_Duplicates(Test_string3))
    

Result:

    Good bye world
    Ram went to his home
    Hello world

answered Feb 17, 2021 at 19:22


    Given a string str that represents a sentence, the task is to remove duplicate words from the sentence using a regular expression.
    Examples: 
     

    Input: str = “Good bye bye world world” 
    Output: Good bye world 
    Explanation: 
    We remove the second occurrence of bye and world from Good bye bye world world
    Input: str = “Ram went went to to to his home” 
    Output: Ram went to his home 
    Explanation: 
    We remove the second occurrence of went and the second and third occurrences of to from Ram went went to to to his home.
    Input: str = “Hello hello world world” 
    Output: Hello world 
    Explanation: 
    We remove the second occurrence of hello and world from Hello hello world world. 
     

    Approach
     

    1. Get the sentence.
    2. Form a regular expression to remove duplicate words from sentences:

       regex = "\\b(\\w+)(?:\\W+\\1\\b)+";

       The details of the above regular expression can be understood as:
      • “\\b”: A word boundary. Boundaries are needed for special cases. For example, in “My thesis is great”, the “is” at the end of “thesis” won't be matched as a repeat of the following “is”.
      • “\\w+”: One or more word characters: [a-zA-Z_0-9]
      • “\\W+”: One or more non-word characters: [^\w]
      • “\\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)
      • “+”: Match whatever it's placed after 1 or more times
    3. Match the sentence with the regex. In Java, this can be done using Pattern.matcher().
    4. Return the modified sentence.
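The pieces listed above can be laid out one per line with Python's re.VERBOSE flag. This is only an annotated sketch of the same pattern; the name DUPLICATE_WORDS is hypothetical, and the single backslashes are the raw-string form of the doubled ones shown in the list.

```python
import re

# Hypothetical name; same pattern as above, written in verbose mode.
DUPLICATE_WORDS = re.compile(
    r"""
    \b          # word boundary: the match must start at the beginning of a word
    (\w+)       # group 1: one or more word characters (the word itself)
    (?:         # non-capturing group, one per repeat
        \W+     #   one or more non-word characters (spaces, punctuation)
        \1      #   the same text that group 1 captured
        \b      #   word boundary: the repeat must be a whole word
    )+          # one or more repeats
    """,
    re.VERBOSE | re.IGNORECASE,
)

# The 'is' inside 'thesis' is not treated as a repeat of the next word.
unchanged = DUPLICATE_WORDS.sub(r"\1", "My thesis is great")
print(unchanged)  # My thesis is great

cleaned = DUPLICATE_WORDS.sub(r"\1", "Ram went went to to to his home")
print(cleaned)    # Ram went to his home
```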

    Below is the implementation of the above approach:
     

    C++

    #include <iostream>
    #include <regex>
    #include <string>

    using namespace std;

    // Collapse each run of a repeated word (case-insensitive) to a single copy.
    string removeDuplicateWords(string s)
    {
      const regex pattern("\\b(\\w+)(?:\\W+\\1\\b)+", regex_constants::icase);
      string answer = s;
      for (auto it = sregex_iterator(s.begin(), s.end(), pattern);
           it != sregex_iterator(); it++)
      {
          smatch match;
          match = *it;
          // Replace the whole run (match 0) with its first word (group 1).
          answer.replace(answer.find(match.str(0)), match.str(0).length(), match.str(1));
      }
      return answer;
    }

    int main()
    {
      string str1 = "Good bye bye world world";
      cout << removeDuplicateWords(str1) << endl;

      string str2 = "Ram went went to to his home";
      cout << removeDuplicateWords(str2) << endl;

      string str3 = "Hello hello world world";
      cout << removeDuplicateWords(str3) << endl;

      return 0;
    }

    Java

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class GFG {
        public static String removeDuplicateWords(String input)
        {
            String regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
            Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(input);
            // Replace each run of repeated words with its first word (group 1).
            // Using Matcher.replaceAll avoids treating the matched text itself
            // as a regex, which fails if it contains characters such as "(".
            return m.replaceAll("$1");
        }

        public static void main(String args[])
        {
            String str1 = "Good bye bye world world";
            System.out.println(removeDuplicateWords(str1));

            String str2 = "Ram went went to to his home";
            System.out.println(removeDuplicateWords(str2));

            String str3 = "Hello hello world world";
            System.out.println(removeDuplicateWords(str3));
        }
    }

    Python3

    import re

    def removeDuplicateWords(sentence):
        # Collapse each run of a repeated word (case-insensitive) to its first occurrence.
        regex = r'\b(\w+)(?:\W+\1\b)+'
        return re.sub(regex, r'\1', sentence, flags=re.IGNORECASE)

    str1 = "Good bye bye world world"
    print(removeDuplicateWords(str1))

    str2 = "Ram went went to to his home"
    print(removeDuplicateWords(str2))

    str3 = "Hello hello world world"
    print(removeDuplicateWords(str3))

    Output:

    Good bye world
    Ram went to his home
    Hello world


    How do you delete repeated words in regex?

    1. Get the sentence.
    2. Form a regular expression to remove duplicate words from sentences. The details of the above regular expression can be understood as: ...
    3. Match the sentence with the regex. ...
    4. Return the modified sentence.

    How do you remove duplicates from a word in Python?

    1) Split the input sentence on spaces into words. 2) Gather these word strings together in one list. 3) Create a dictionary using the Counter class, with the strings as keys and their frequencies as values. 4) Join the keys, each of which is unique, to form a single string.
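A minimal sketch of these steps, assuming Python 3.7 or later, where Counter keeps its keys in first-seen order. The function name remove_all_duplicates is hypothetical. Unlike the regex approach, this drops every later occurrence of a word, not only adjacent repeats.

```python
from collections import Counter


def remove_all_duplicates(sentence):
    # Split the sentence on spaces into a list of words.
    words = sentence.split(" ")
    # Count the words; each distinct word becomes exactly one key.
    counts = Counter(words)
    # Join the unique keys back into one string, in first-seen order.
    return " ".join(counts.keys())


print(remove_all_duplicates("Python is great and Java is also great"))
# Python is great and Java also
```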

    How do you check for repeated words in Python?

    Python:
    string = "big black bug bit a big black dog on his big black nose"
    # Convert the string to lowercase.
    string = string.lower()
    # Split the string into words using the built-in split().
    words = string.split(" ")
    print("Duplicate words in a given string : ")
    for i in range(0, len(words)):
        count = 1
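The snippet above is cut off before the counting loop completes. As an alternative, here is a short sketch that finds the repeated words with collections.Counter instead of nested loops. It is a different technique from the truncated code, and repeated_words is a hypothetical helper name.

```python
from collections import Counter


def repeated_words(sentence):
    # Lowercase first so that "Big" and "big" count as the same word.
    words = sentence.lower().split()
    counts = Counter(words)
    # Keep the words seen more than once, in first-seen order.
    return [word for word, n in counts.items() if n > 1]


print(repeated_words("big black bug bit a big black dog on his big black nose"))
# ['big', 'black']
```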