How do I convert HTML content to plain text in Python?

Question

I am trying to convert an HTML block to text using Python.

Input:


Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I tried the html2text module without much success:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print(html2text.html2text(txt))

The txt object produces the html block above. I'd like to convert it to text and print it on the screen.

asked Feb 4, 2013 at 19:55

Aaron Bandelli

soup.get_text() outputs what you want:

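from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())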

output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

To keep newlines:

print(soup.get_text('\n'))

To be identical to your example, you can replace a newline with two newlines:

soup.get_text().replace('\n','\n\n')
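As a small illustration of the separator argument, here is a sketch on a made-up snippet (the `html` string below is an assumption, not the asker's page). Note that the separator is inserted between every text node, including nodes split apart by inline tags such as `<strong>`, and that `strip=True` trims whitespace around each fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical input; the markup from the original question is not reproduced here.
html = "<div class='body'><p>First <strong>paragraph</strong>.</p><p>Second paragraph.</p></div>"

soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated with no separator.
joined = soup.get_text()

# One text node per line, each stripped of surrounding whitespace.
lines = soup.get_text("\n", strip=True)

print(joined)
print(lines)
```

If the inline splitting is unwanted, calling `get_text()` on each `<p>` element separately and joining the results with newlines is one way around it.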

answered Feb 4, 2013 at 20:06

It's possible using Python's standard html.parser module:

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)  # data is the HTML string to convert
print(f.text)

answered Apr 24, 2019 at 8:03

FrBrGeorge

You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:

import re

data = """
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa"""

data = re.sub(r'<.*?>', '', data)

print(data)

Output

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
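To make the "not recommended" caveat concrete, here is a short sketch of three cases where this pattern goes wrong. The input strings are invented for illustration:

```python
import re
from html import unescape

pattern = r'<.*?>'

# 1. A '>' inside an attribute value ends the match too early.
attr_case = re.sub(pattern, '', '<a title="a > b">link</a>')

# 2. '.' does not match newlines by default, so a tag spanning two lines survives.
multiline_case = re.sub(pattern, '', '<p\nclass="x">text</p>')

# 3. Entities are left encoded; html.unescape decodes them afterwards.
entity_case = re.sub(pattern, '', '<p>Tom &amp; Jerry</p>')
decoded = unescape(entity_case)

print(repr(attr_case))
print(repr(multiline_case))
print(repr(entity_case), '->', repr(decoded))
```

Passing `flags=re.DOTALL` addresses the second case, but the first one needs a real parser.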

answered Feb 4, 2013 at 20:02

ATOzTOA

The main problem is how you keep some basic formatting. Here is my own minimal approach to keep new lines and bullets. I am sure it's not the solution to everything you want to keep but it's a starting point:

from bs4 import BeautifulSoup

def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text

The above adds a new line for 'br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr' and 'th' elements, and a new line with "- " in front of the text for li elements.

answered Mar 18, 2021 at 11:57

Andreas

The '\n' places a newline between the paragraphs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
print(soup.get_text('\n'))

answered Feb 4, 2013 at 20:11

t-8ch

I liked @FrBrGeorge's no dependency answer so much that I expanded it to only extract the body tag and added a convenience method so that HTML to text is a single line:

from abc import ABC
from html.parser import HTMLParser


class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
          str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()

See the docstring for usage.

This converts all of the text inside the body, which could include the contents of style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables such as in_style or in_script.
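A minimal sketch of that extension, assuming a single counter is enough to track whether the parser is inside a `<script>` or `<style>` element. The names `BodyTextFilter`, `skip_depth` and `body_text` are made up for this example and are not part of the answer above:

```python
from html.parser import HTMLParser


class BodyTextFilter(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names.
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)


def body_text(html):
    parser = BodyTextFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical document used only to exercise the filter.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello </p><script>var x = 1;</script><p>world</p></body></html>"
)
print(body_text(sample))
```

Collecting fragments in a list and joining once avoids repeated string concatenation on large pages.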

answered Jun 3, 2020 at 18:45

Mark Chackerian

There are some nice things here, and I might as well throw in my solution:

from html.parser import HTMLParser
def _handle_data(self, data):
    self.text += data + '\n'

# Caution: this patches HTMLParser itself, so it affects every parser in the process.
HTMLParser.handle_data = _handle_data

def get_html_text(html: str):
    parser = HTMLParser()
    parser.text = ''
    parser.feed(html)

    return parser.text.strip()

answered Sep 15, 2020 at 9:50

dermasmid

gazpacho might be a good choice for this!

Input:

from gazpacho import Soup

html = """\

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
"""

Output:

text = Soup(html).strip(whitespace=False) # to keep "\n" characters intact
print(text)

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

answered Oct 9, 2020 at 20:38

emehex

I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.

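from urllib.request import urlopen

# Note: the tag literals and the indentation in this function are reconstructed;
# the page extraction stripped everything that looked like an HTML tag.
def html2text(strText):
    str1 = strText
    # Keep only the part between <body ...> and </body>.
    int2 = str1.lower().find("<body")
    if int2 > 0:
        str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2 > 0:
        str1 = str1[:int2]
    # A fragment in list1 is replaced by the character at the same index in list2.
    list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>']
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
    bolFlag1 = True  # False while inside script, noscript or style content
    bolFlag2 = True  # False while inside a tag
    strReturn = ""
    for int1 in range(len(str1)):
        str2 = str1[int1]
        for int2 in range(len(list1)):
            if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
                strReturn = strReturn + list2[int2]
        if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
            bolFlag1 = False
        if str1[int1:int1+6].lower() == '<style':
            bolFlag1 = False
        if str1[int1:int1+7].lower() == '</style':
            bolFlag1 = True
        if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
            bolFlag1 = True
        if str2 == '<':
            bolFlag2 = False
        if bolFlag1 and bolFlag2 and (ord(str2) != 10):
            strReturn = strReturn + str2
        if str2 == '>':
            bolFlag2 = True
        if bolFlag1 and bolFlag2:
            # Collapse whitespace around line breaks and runs of line breaks.
            strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn

url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
# Python 3: urlopen returns bytes, so decode before treating the page as text.
html = urlopen(url).read().decode('utf-8', errors='replace')
print(html2text(html))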

answered Sep 25, 2014 at 20:47

It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:

from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
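# find_all returns a list, so it is safe to decompose tags while looping,
# and it also reaches scripts nested deeper than the direct children of body.
for script in soup.body.find_all('script'):
    script.decompose()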
print(soup.body.get_text())
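A variation that also drops `<style>` and `<noscript>` content, sketched on an inline page so it runs without a network request. The `html` string is invented for illustration, and which tags to exclude is an assumption you would adjust per site:

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for a downloaded document.
html = """
<html><body>
<div><script>track();</script><p>Visible text</p></div>
<style>p { margin: 0 }</style>
<noscript>Enable JavaScript</noscript>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every unwanted element, at any nesting depth, before extracting text.
for tag in soup.find_all(["script", "style", "noscript"]):
    tag.decompose()

# Join the remaining text nodes with spaces and drop whitespace-only nodes.
text = soup.get_text(" ", strip=True)
print(text)
```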

answered Dec 12, 2017 at 22:58

Sarah Messer

A two-step lxml-based approach with markup sanitizing before converting to plain text.

The script accepts either a path to an HTML file or piped stdin.

It removes script blocks and other possibly undesired content. You can configure the lxml Cleaner instance to suit your needs.

#!/usr/bin/env python3

import sys
from lxml import html
from lxml.html import tostring
from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )

    return cleaner.clean_html(dirty_html)


if len(sys.argv) > 1:
  fin = open(sys.argv[1], encoding='utf-8')
else:
  fin = sys.stdin

source = fin.read()
source = sanitize(source)
source = source.replace('<br>', '\n')

tree = html.fromstring(source)
plain = tostring(tree, method='text', encoding='utf-8')

print(plain.decode('utf-8'))

answered Oct 25, 2021 at 13:48

ccpizza

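I personally like the gazpacho solution by emehex, but it only uses a regular expression to filter out the tags, with no further magic. This means that solution keeps the text inside elements such as <style> and <script>.

The function below strips those with regular expressions and unescapes HTML entities using only the standard library. (Its opening lines were lost from this copy of the answer; the imports, the function signature, named html_to_text here, and the <script> substitution are reconstructed from the surviving tail.)

import re
from html import unescape

def html_to_text(html):
    # remove scripts and styles (non-greedy match)
    text = re.sub("<script.*?</script>", "", html, flags=re.DOTALL)
    text = re.sub("<style.*?</style>", "", text, flags=re.DOTALL)
    # remove other tags
    text = re.sub("<[^>]+>", " ", text)
    # strip whitespace
    text = " ".join(text.split())
    # unescape html entities
    text = unescape(text)
    return text

Of course, this is not as error-proof as BeautifulSoup or other parser-based solutions, but you don't need any third-party package.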

answered Oct 29, 2021 at 11:39

quick

I encountered the same problem using Scrapy. You may try adding this to settings.py:

#settings.py
FEED_EXPORT_ENCODING = 'utf-8'

answered Jun 28 at 23:46

Jaypee Tan

There is a library called inscriptis that is really simple and light, and it can get its input from a file or directly from a URL:

from inscriptis import get_text
text = get_text(html)
print(text)

The output is:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

answered Aug 19 at 13:06

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ''
    def handle_data(self, data):
        self.text += f'{data}\n'

def html2text(html):
    filter = HTMLFilter()
    filter.feed(html)

    return filter.text

content = html2text(content_temp)

answered Jan 18 at 8:02

How do you convert HTML to text in Python?

In this short guide, we'll see how to convert HTML to raw text with Python and Pandas. It is also known as text extraction from HTML tags.

Setup. ...

Step 1: Install Beautiful Soup library. ...

Step 2: Extract text from HTML tags by Python. ...

Step 3: HTML to raw text in Pandas.

How do I convert HTML format to plain text?

Save the web page as a web page file (.HTM or .HTML file extension).

Click the File tab again, then click the Save as option.

In the Save as type drop-down list, select the Plain Text (*.txt) option. ...

Click the Save button to save as a text document.

How do I extract all text from a website in Python?

To extract data using web scraping with Python, you need to follow these basic steps:

Find the URL that you want to scrape.

Inspect the page.

Find the data you want to extract.

Write the code.

Run the code and extract the data.

Store the data in the required format.

How do I get data from HTML to Python?

To scrape a website using Python, you need to perform these four basic steps:

Sending an HTTP GET request to the URL of the webpage that you want to scrape, which will respond with HTML content. ...

Fetching and parsing the data using BeautifulSoup and maintaining the data in some data structure such as a dict or list.
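The two steps listed above can be sketched roughly as follows. This version deliberately uses the standard library's `urllib.request` and `html.parser` in place of BeautifulSoup so that it has no dependencies. The names `fetch_html`, `TagTextCollector` and `collect_text` are hypothetical, and grouping text by its enclosing tag is just one possible choice of data structure:

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def fetch_html(url):
    """Send an HTTP GET request and return the decoded response body."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TagTextCollector(HTMLParser):
    """Group each piece of text under the tag that directly encloses it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.texts = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.texts[self.open_tags[-1]].append(text)


def collect_text(html):
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.texts)


# Parsing is shown on an inline string; fetch_html(url) would supply real pages.
sample = "<h1>Title</h1><p>One</p><p>Two <a href='#'>link</a></p>"
print(collect_text(sample))
```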
