
Processing Html Files Python

I don't know much about HTML... How do you remove just the text from the page? For example, if the HTML page reads as: http://www.crummy.com/software/BeautifulSoup/ instead.

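# BeautifulSoup 4 import; `from BeautifulSoup import BeautifulSoup` is version 3 only
from bs4 import BeautifulSoup
soup = BeautifulSoup('Your resource<title>hi</title>', 'html.parser')
soup.title.string  # Your title string ('hi' here).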

Solution 2:

Use an HTML parser for that. One option is BeautifulSoup.

To get text content of the page:

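# BeautifulSoup 4 import; `from BeautifulSoup import BeautifulSoup` is version 3 only
from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html, 'html.parser')
# Collect every text node in the document, then join them with spaces.
text_nodes = soup.find_all(string=True)
result = ' '.join(text_nodes)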
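
Collecting every text node also picks up the contents of <script> and <style> elements, which is usually not what you want. A minimal sketch of one way around that, assuming BeautifulSoup 4 (the `bs4` package) is installed; the function name `visible_text` and the `sample` string are made up for illustration:

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text with <script> and <style> contents dropped."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose text is code or styling, not page content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Strip each remaining text node and skip the ones that are only whitespace.
    return " ".join(soup.stripped_strings)


# Hypothetical sample page, not taken from the question.
sample = (
    "<html><head><title>hi</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

Joining with a single space keeps words from adjacent tags from running together; use "\n" instead if you would rather have one text node per line.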

Solution 3:

I usually use http://lxml.de/ for HTML parsing. It is really easy to use, and you can select tags with XPath, which makes things easy as well as fast.

I have an example of its use in a script I wrote to read an XML feed and count the words:

https://gist.github.com/1425228

You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html
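
For a rough idea of what the XPath approach looks like, here is a small sketch, assuming lxml is installed. It is not the script from the gist; the helper name `xpath_text` and the `sample` string are made up for illustration:

```python
import lxml.html


def xpath_text(html, expression):
    """Return the text of every element matched by an XPath expression."""
    doc = lxml.html.fromstring(html)
    # text_content() gathers the text of an element and all of its children.
    return [node.text_content().strip() for node in doc.xpath(expression)]


# Hypothetical sample page, not taken from the question.
sample = (
    "<html><head><title>hi</title></head>"
    "<body><p>Hello <b>world</b></p><p>Bye</p></body></html>"
)
print(xpath_text(sample, "//title"))
print(xpath_text(sample, "//p"))
```

The helper expects an expression that selects elements, such as "//p" or "//title". An expression ending in text() returns plain strings instead, so it would need different handling.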
