How to remove HTML tags using regex?
You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.
The following examples are Java, but the regex will be similar -- if not identical -- for other languages.
Assuming your non-HTML text does not contain any < or > and that your input string is correctly structured, you can match every tag as a <, followed by any run of characters other than >, followed by a >, and replace each match with an empty string.
If you know the tags are all of one specific kind, you can restrict the pattern to that tag name instead of matching every tag.
Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags, because adjacent tagged values would be joined with nothing between them. In a situation where multiple tags are expected, replace each tag with a single space, then collapse runs of whitespace, and then trim the ends.
HTML stands for HyperText Markup Language and is used to display information in the browser. Regular expressions can be used to find HTML tags in text, extract them, or remove them. Generally, it's not a good idea to parse HTML with regex, but a limited, known set of HTML can sometimes be parsed.
Below is a simple regex to test a string against an HTML tag pattern. It can later be used to remove all tags and leave only the text:
/<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/
One of the most common operations with HTML and regex is extracting the text between certain tags (a.k.a. scraping). A regular expression can be used for this operation as well.
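As a rough illustration of the ideas above, here is a minimal Java sketch. It is not the code from the original answers, which did not survive here. The class name `TagStripper`, the helper names, and the `td` sample input are all hypothetical, and the `stripTag` and `textBetween` patterns are assumptions of this sketch rather than patterns quoted from the text. Only the quote-aware pattern is taken from the text, re-escaped for a Java string literal. The same caveats apply: this assumes well-formed input, no stray < or > in the text, and no nested tags of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStripper {
    // Anything between < and >; assumes the text itself contains no < or >.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // The quote-aware tag pattern quoted in the text, escaped for a Java string.
    private static final Pattern QUOTE_AWARE_TAG =
            Pattern.compile("<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces every tag with a space, collapses whitespace, trims the ends. */
    static String stripTags(String html) {
        String spaced = ANY_TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    /** Removes only opening and closing tags with the given name (hypothetical helper). */
    static String stripTag(String html, String tagName) {
        Pattern tag = Pattern.compile(
                "</?" + Pattern.quote(tagName) + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        String spaced = tag.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    /** Reports whether the text contains something that looks like a tag. */
    static boolean containsTag(String text) {
        return QUOTE_AWARE_TAG.matcher(text).find();
    }

    /** Collects the text between non-nested <tagName>...</tagName> pairs (hypothetical helper). */
    static List<String> textBetween(String html, String tagName) {
        String name = Pattern.quote(tagName);
        Pattern pair = Pattern.compile(
                "<" + name + "\\b[^>]*>(.*?)</" + name + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = pair.matcher(html);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String row = "<td>Something</td><td>Another Thing</td>";
        System.out.println(stripTags(row));                        // Something Another Thing
        System.out.println(stripTag("<td><b>Hi</b></td>", "td"));  // <b>Hi</b>
        System.out.println(containsTag("no markup here"));         // false
        System.out.println(textBetween(row, "td"));                // [Something, Another Thing]
    }
}
```

Replacing each tag with a space rather than an empty string is what keeps adjacent values from running together. For anything beyond small, known fragments, a real HTML parser is still the safer choice.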
You should never use regular expressions to fully parse HTML documents, as regular expressions are not intended for such tasks. Instead, you can use HTML or XML document parsers, which can validate alongside parsing.
A friend of mine asked for a regex to remove all HTML tags from a webpage and leave everything else, including what's between the tags, and I came up with a regular expression for him. Another option is to strip out only certain tags.
You can remove HTML and other <> tags from any field. In the selected field, any text appearing between < and > (like |