Hi,
Today I found out how to get a web page as text using Python.
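# Python 2: urllib.urlopen returns a file-like object for the page
import urllib
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
source = myurl.read()  # the whole page as one HTML string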
This simple code fetches the above web page into a string.
The string contains the html source of the web page.
If you print 'source', it prints the web page content along with its HTML tags.
How do we strip the HTML tags from that string?
Yes, we can do that in the following way.
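The snippet above uses the Python 2 urllib API. In Python 3, urlopen lives in urllib.request and read() returns bytes, so the result has to be decoded. Here is a minimal sketch of the same fetch for Python 3. The function name fetch_html and the UTF-8 fallback are my own assumptions, not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the body at `url` as a str (hypothetical helper, Python 3)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header if there is one;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Usage (needs network access):
# source = fetch_html("http://tuxworld.wordpress.com")
# print(source)
```

As in the Python 2 version, the returned string is the raw HTML source, tags included.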
$ sudo apt-get install python-setuptools
$ sudo easy_install stripogram
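# Python 2: fetch the page, then convert its HTML to plain text
import urllib
from stripogram import html2text
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
html_string = myurl.read()
# html2text removes the tags and returns the readable text
text = html2text(html_string)
print text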
Now you will have the whole web page in the "text" variable as plain text, not as an HTML string.
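If stripogram is not available (for example on Python 3), roughly the same idea can be sketched with only the standard library's html.parser module. This is not how stripogram works internally, and the output will differ from html2text. The names _TextExtractor and strip_tags are hypothetical, and skipping script and style contents is my own choice.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the result reads as plain text.
        return " ".join("".join(self._chunks).split())


def strip_tags(html_string):
    """Return the text content of `html_string` with the tags removed."""
    parser = _TextExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.text()
```

For example, strip_tags("<p>Hello <b>world</b></p>") gives "Hello world". Unlike html2text, this sketch flattens everything onto one line and does not try to preserve paragraphs or links.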
Enjoy with Python 🙂
Regards,
Arulalan.T

But how to remove those tags and all?
Nice one 😉. How would that code look if we first need to log in to an https:// page and then strip that page's contents?
Wow… that's awesome…