How do I convert HTML content to plain text in Python?
I am trying to convert an HTML block to text using Python. Input:
Desired output:
I tried the
The
asked Feb 4, 2013 at 19:55 by Aaron Bandelli (edited by Rob Bednark)
output:
To keep newlines:
To be identical to your example, you can replace a newline with two newlines:
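The doubling step itself is plain string handling. A minimal sketch, assuming the text has already been extracted into a string with single newlines between blocks (the sample value of `text` is made up for illustration):

```python
# Hypothetical extracted text; in practice this comes from your HTML-to-text step.
text = "Title\nFirst paragraph.\nSecond paragraph."

# Turn every single newline into a blank line between blocks.
spaced = text.replace("\n", "\n\n")

print(spaced)
```

Note that this also doubles newlines that were already doubled, so it works best on text with exactly one newline between blocks.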
answered Feb 4, 2013 at 20:06 (edited by Rob Bednark)
It's possible using Python standard
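A dependency-free approach of this kind can be sketched with `html.parser.HTMLParser` from the standard library. This is an illustrative sketch, not the answer's original code; the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &amp; are decoded.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.get_text()


print(html_to_text("<p>Hello, <b>world</b> &amp; friends</p>"))
```

As written, the sketch also keeps the contents of `<script>` and `<style>` elements, so those would need to be skipped separately if they occur in your input.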
answered Apr 24, 2019 at 8:03 by FrBrGeorge (edited by julienc)
You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:
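A minimal sketch of such a regex-based tag stripper, assuming reasonably simple markup; the pattern and the name `strip_tags` are illustrative rather than taken from the answer:

```python
import html
import re

# Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove tag-like substrings, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", markup))


print(strip_tags("<p>Hello, <b>world</b> &amp; friends</p>"))
```

This is fragile for the usual reasons: a `>` inside an attribute value, HTML comments, and `<script>` or `<style>` contents are all handled incorrectly, which is why a real parser is usually preferred.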
Output
answered Feb 4, 2013 at 20:02 by ATOzTOA (edited by Rob Bednark)
The main problem is how to keep some basic formatting. Here is my own minimal approach to keep new lines and bullets. I am sure it doesn't cover everything you might want to keep, but it's a starting point:
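One way to sketch this idea with only the standard library is shown here. It is a stand-in under stated assumptions, not the author's code: the set of tags that end a line, the `* ` bullet marker, and the names `FormattedTextExtractor` and `html_to_formatted_text` are all hypothetical choices.

```python
import re
from html.parser import HTMLParser


class FormattedTextExtractor(HTMLParser):
    """Hypothetical helper: extract text while keeping line breaks and bullets."""

    # Assumed set of tags whose closing tag should end a line.
    LINE_ENDING_TAGS = {"p", "div", "ul", "ol", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "li":
            # Start each list item on its own bulleted line.
            self.parts.append("\n* ")

    def handle_endtag(self, tag):
        if tag in self.LINE_ENDING_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        text = "".join(self.parts)
        # Collapse runs of newlines produced by adjacent block tags.
        return re.sub(r"\n{2,}", "\n", text).strip()


def html_to_formatted_text(markup):
    parser = FormattedTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.get_text()


sample = "<p>Shopping list:</p><ul><li>Eggs</li><li>Milk</li></ul><p>Line one<br>line two</p>"
print(html_to_formatted_text(sample))
```

Collapsing repeated newlines keeps the output compact; drop that step if blank lines between paragraphs should survive.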
The above adds a new line for
answered Mar 18, 2021 at 11:57 by Andreas
The
answered Feb 4, 2013 at 20:11 by t-8ch
I liked @FrBrGeorge's no-dependency answer so much that I expanded it to only extract the
See comment for usage. This converts all of the text inside the
answered Jun 3, 2020 at 18:45 by Mark Chackerian
There are some nice things here, and I might as well throw in my solution:
answered Sep 15, 2020 at 9:50 by dermasmid
gazpacho might be a good choice for this! Input:
Output:
answered Oct 9, 2020 at 20:38 by emehex
I needed a way to do this on a client's system without downloading additional libraries. I never found a good solution, so I created my own. Feel free to use it if you like.