How to get the webpage content & How to extract the text from html string in Python

Posted on September 14, 2010 by arulalant

Hi,

Today I found out how to get a web page as text via Python.


import urllib
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
source = myurl.read()

This simple code fetches the above web page as a string.
The string contains the HTML source of the web page.

If you print ‘source’, it will print the web page content along with its HTML tags.
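
The snippet above is Python 2 code. On Python 3, urllib.urlopen has moved to urllib.request.urlopen and read() returns bytes, so the result needs decoding. A minimal sketch of the same idea follows; the function name fetch_html, the timeout value, and the UTF-8 fallback are my own assumptions, not part of the original code.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the HTML source of `url` as a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the server, if any; otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# source = fetch_html("http://tuxworld.wordpress.com")
```

As before, ‘source’ then holds the page content with all its HTML tags.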

How to strip the HTML tags from that string and keep only the text?

We can do that in the following way.

$ sudo apt-get install python-setuptools
$ sudo easy_install stripogram


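import urllib

from stripogram import html2text

# Fetch the page and read its HTML source into a string (Python 2).
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
html_string = myurl.read()

# Strip the HTML tags, leaving only the plain text.
text = html2text(html_string)

print text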

Now you will have the whole web page in the “text” variable as normal text, not as an HTML string.
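
If stripogram is not available (for example on Python 3, where I have not tried installing it), roughly the same result can be sketched with only the standard library's html.parser module. This is not what stripogram does internally. The names TextExtractor and html_to_text are hypothetical, and skipping script and style contents is my own choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html_string):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.get_text()


# Example call:
# text = html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>")
```

Each text node ends up on its own line, so inline tags such as <b> split a sentence across lines. Joining with a space instead of a newline is an easy change if that suits the page better.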

Enjoy Python 🙂

Regards,
Arulalan.T

About arulalant

Currently working as "Project Scientist – C" in National Centre for Medium Range Weather Forecasting (NCMRWF), MoES, Noida, India

This entry was posted in Python, Web.

3 Responses to How to get the webpage content & How to extract the text from html string in Python

susi says:

February 12, 2013 at 3:05 pm

But how to remove those tags and all?

kiske says:

December 11, 2012 at 9:06 pm

Nice one 😉. What would that code look like if we first need to log in to https:\\page and then strip that page's contents?

mohi says:

September 15, 2010 at 5:28 am

Wow… that's awesome…
