How to remove HTML tags using regex?

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are Java, but the regex will be similar -- if not identical -- for other languages.

String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > characters and that your input string is correctly structured.
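As a minimal sketch of that assumption in practice, the snippet below applies the simple `<[^>]*>` pattern to one well-behaved input and to one input where a `>` appears inside an attribute value. The class and method names (`TagStripper`, `stripTags`) are hypothetical and used only for illustration.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches "<", then any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes everything the pattern considers a tag.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Well-formed input with no stray angle brackets: prints "Hello, world!"
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));

        // The ">" inside the attribute ends the first match early,
        // so part of the tag leaks into the output: prints ' b">link'
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
    }
}
```

The second call shows the kind of edge case the warning above refers to: the pattern has no notion of quoted attribute values, so it stops at the first `>` it sees.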

If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this (the optional / also covers the closing tags):

String target = someString.replaceAll("(?i)</?td[^>]*>", "");

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, then collapses runs of whitespace, and then trims any whitespace on the ends.
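The difference between the two replacements can be sketched as follows. The tag name `td`, the class name `CellText`, and the method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CellText {
    // Case-insensitive match for opening or closing <td> tags (tag name assumed).
    private static final Pattern TD = Pattern.compile("(?i)</?td[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replacing tags with nothing glues neighbouring cells together.
    static String squished(String html) {
        return TD.matcher(html).replaceAll("");
    }

    // Replacing tags with a space, collapsing whitespace, then trimming
    // keeps the cells separated by exactly one space.
    static String spaced(String html) {
        String withSpaces = TD.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withSpaces).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String input = "<td>Something</td><TD>Another Thing</TD>";
        System.out.println(squished(input)); // prints "SomethingAnother Thing"
        System.out.println(spaced(input));   // prints "Something Another Thing"
    }
}
```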

HTML stands for HyperText Markup Language and is used to display information in the browser. Regular expressions can be used to find tags in text, extract them, or remove them. Generally, it's not a good idea to parse HTML with regex, but a limited, known subset of HTML can sometimes be parsed.

Match all HTML tags

Below is a simple regex that matches HTML tags in a string. It can then be used to remove all tags and leave only the text.

/<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g

Example code in JavaScript:

// Remove all tags from a string
var htmlRegexG = /<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g;
'<div>Hello, world!</div>'.replace(htmlRegexG, ''); // returns 'Hello, world!'
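For readers following the Java examples earlier on, here is a rough Java counterpart of the JavaScript snippet. It assumes a quote-aware tag pattern, one that treats quoted attribute values as a unit so that a `>` inside quotes does not end the tag. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteAwareStripper {
    // A tag is "<", then one or more of: a double-quoted string, a single-quoted
    // string, or any character that is not a quote or ">", and finally ">".
    private static final Pattern TAG =
            Pattern.compile("<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div>Hello, world!</div>"));      // prints "Hello, world!"
        // The ">" inside the quoted attribute value stays part of the tag.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));   // prints "link"
    }
}
```

This still is not a parser: comments, scripts, and malformed markup can defeat it, so it is best kept to small, known inputs.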

Extract text between certain tags

One of the most common operations with HTML and regex is the extraction of the text between certain tags (a.k.a. scraping). For this operation, the following regular expression can be used.

var r1 = /<div>(.*?)<\/div>/g; // Tag only
var r2 = /(?
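Extraction of the text between tags can be sketched in Java with a lazy capturing group. The `div` tag, the class name `TagTextExtractor`, and the method name `extract` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    // The lazy group (.*?) captures the shortest run of text between <div> and </div>.
    private static final Pattern DIV = Pattern.compile("<div>(.*?)</div>");

    // Collects the captured text of every non-overlapping match, in order.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = DIV.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>first</div><p>skip</p><div>second</div>";
        System.out.println(extract(html)); // prints "[first, second]"
    }
}
```

By default `.` does not match line terminators in Java, so content that spans several lines would need `Pattern.DOTALL`. Nested `div` elements are not handled correctly, because the lazy match stops at the first closing tag.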
