Java remove html tags from string regex

Starting from aioobe's code, I tried something more daring:

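// Sample markup and pattern reconstructed for illustration; the originals were lost in extraction.
String input = "<p>some text</p>\n<p>another text</p>";
String stripped = input.replaceAll("</?(?:p|div)\\b[^>]*>", "");
System.out.println(stripped);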

The code to strip every HTML tag would look like this:

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

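    static {
        // The loop body was lost in extraction; this reconstruction joins the
        // tag names into a single alternation: tag1|tag2|...
        StringBuffer tags = new StringBuffer();
        for (int i = 0; i < tagsTab.length; i++) {
            if (i > 0) {
                tags.append('|');
            }
            tags.append(tagsTab[i]);
        }
        // (?i) ignores case; \b keeps "a" from matching the start of an unlisted tag name.
        pattern = "(?i)</?(?:" + tags.toString() + ")\\b[^>]*>";
    }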

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

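        // Sample markup reconstructed for illustration; the original tags were lost in extraction.
        System.out.println(HtmlSanitizer.sanitize("<p>some text</p>\n<p>another text</p>"));
    }
}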

I wrote this to be Java 1.4 compliant (for some sad reasons), so feel free to use a for-each loop and StringBuilder instead.

Advantages:

  • You can generate lists of tags you want to strip, which means you can keep those you want
  • You avoid stripping stuff that isn't an HTML tag
  • You keep the whitespace

Drawbacks:

  • You have to list every HTML tag you want to strip from your string, which can be a lot, for example if you want to strip everything.

If you see any other drawbacks, I would really be glad to know them.
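For readers who are not tied to Java 1.4, the same idea can be sketched with a for-each loop, StringBuilder, and a precompiled Pattern. This is only an illustration, not the answer's original code: the class name SelectiveTagStripper and its strip method are hypothetical, and the pattern assumes tag names are followed by a word boundary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original answer): strips only the listed tags.
final class SelectiveTagStripper {

    private final Pattern pattern;

    // tagNames: the tags to strip; every other tag is left in place.
    SelectiveTagStripper(List<String> tagNames) {
        if (tagNames.isEmpty()) {
            // An empty alternation would match every tag, so reject it early.
            throw new IllegalArgumentException("at least one tag name is required");
        }
        StringBuilder alternation = new StringBuilder();
        for (String tag : tagNames) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Pattern.quote guards against regex metacharacters in a tag name.
            alternation.append(Pattern.quote(tag));
        }
        // </? allows closing tags; \b ends the tag name; [^>]* skips attributes.
        this.pattern = Pattern.compile(
                "</?(?:" + alternation + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    }

    String strip(String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        SelectiveTagStripper stripper =
                new SelectiveTagStripper(Arrays.asList("p", "font"));
        // prints: some <b>bold</b> text
        System.out.println(stripper.strip("<p>some <b>bold</b> text</p>"));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what String.replaceAll does internally.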


String is a final class in Java and it is immutable: the object itself cannot be changed, but a variable can be reassigned to refer to a different String. HTML tags can be removed from a string by calling the replaceAll() method of the String class with a regular expression that matches the tags and an empty replacement. The method does not modify the original string; it returns a new string containing only the plain text.

Syntax

public String replaceAll(String regex, String replacement)

Example

public class RemoveHTMLTagsTest {
   public static void main(String[] args) {
      // Sample markup reconstructed; the original tags were lost in extraction.
      String str = "<p><b>Welcome to Tutorials Point</b></p>";
      System.out.println("Before removing HTML Tags: " + str);
      // <[^>]*> matches "<", any run of characters other than ">", then ">".
      str = str.replaceAll("<[^>]*>", "");
      System.out.println("After removing HTML Tags: " + str);
   }
}

Output

Before removing HTML Tags: <p><b>Welcome to Tutorials Point</b></p>
After removing HTML Tags: Welcome to Tutorials Point
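The replaceAll() approach can also be sketched with a precompiled Pattern, which makes the immutability point visible: the input string is left untouched and a new string is returned. The class name TagRemover and the method removeTags are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TagRemover {

    // Compiled once and reused; [^>]* also matches tags that span several lines.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String removeTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String original = "<p><b>Welcome to Tutorials Point</b></p>";
        String plain = removeTags(original);
        System.out.println(plain);    // prints: Welcome to Tutorials Point
        System.out.println(original); // still contains the tags: String is immutable
    }
}
```

A pattern this simple has known limits: it treats any text between a "<" and the next ">" as a tag, so plain text such as "1 < 2 and 3 > 2" loses everything between the two symbols.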


Updated on 01-Jul-2020 07:57:01


How do you replace HTML tags in a string in Java?

Call replaceAll() on the string with a regular expression that matches the tags, and pass the empty string as the replacement. Because String is immutable, the call returns a new string, so assign the result to a variable.

How do I remove text tags in HTML?

Removing HTML tags from text in Microsoft Word (this example converts <i>...</i> pairs to italic formatting):
Press Ctrl+H.
Click the More button, if it is available.
Make sure the Use Wildcards check box is selected.
In the Find What box, enter the following: \<i\>([!<]@)\</i\>
In the Replace With box, enter the following: \1
With the insertion point still in the Replace With box, press Ctrl+I once.
Click Replace All.

Which function is used to remove all HTML tags from a string passed to a form?

In PHP, the strip_tags() function strips HTML, XML, and PHP tags from a string. Note: HTML comments are always stripped; this cannot be changed with the allowed_tags parameter.
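Java has no built-in equivalent, but similar behavior can be approximated with a regex. The sketch below is not PHP's implementation: StripTagsSketch and stripTags are hypothetical names, and the pattern assumes well-formed tags with no ">" inside attribute values. It removes comments and every tag except those named in the allowed list.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java approximation of strip_tags with an allow list.
final class StripTagsSketch {

    // Alternatives: an HTML comment, a named tag (group 1 = name), any other <...> construct.
    private static final Pattern MARKUP = Pattern.compile(
            "<!--.*?-->|</?([A-Za-z][A-Za-z0-9]*)[^>]*>|<[^>]*>", Pattern.DOTALL);

    static String stripTags(String input, String... allowedTags) {
        Set<String> allowed = new HashSet<>();
        for (String tag : allowedTags) {
            allowed.add(tag.toLowerCase(Locale.ROOT));
        }
        Matcher m = MARKUP.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1); // null for comments and unnamed constructs
            boolean keep = name != null && allowed.contains(name.toLowerCase(Locale.ROOT));
            // quoteReplacement stops '$' and '\' in a kept tag from acting as group references.
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Hello <b>World</b>
        System.out.println(stripTags("<p>Hello <b>World</b><!-- note --></p>", "b"));
    }
}
```

Matching the full tag name before checking the allow list means "b" keeps <b> but not <body> or <br>.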

How do I delete a tag in Jsoup?

Document docsoup = Jsoup.parse(htmlin);
docsoup.head().remove();

This parses the HTML held in htmlin and removes the <head> element from the resulting document.