How to get the webpage content & How to extract the text from html string in Python

Hi,

Today I found out how to get a web page as text using Python.


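import urllib

# Python 2 API; in Python 3 the same function is urllib.request.urlopen
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
source = myurl.read()  # the page's HTML source as a string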

This simple code fetches the web page above as a string.
The string contains the HTML source of the web page.

If you print ‘source’, it will show the web page content along with its HTML tags.
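The snippet above uses the Python 2 urllib module. If you are on Python 3, a rough equivalent could look like the sketch below. The function name fetch_html and the 10 second timeout are my own choices for illustration, and falling back to UTF-8 when the server does not report a charset is an assumption.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the source behind `url` decoded to a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset from the Content-Type header when there is one;
        # assuming UTF-8 otherwise is a guess, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising an error.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://tuxworld.wordpress.com") should then give the same kind of HTML string as ‘source’ above.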

How do we extract just the text from that string, without the HTML tags?

Yes, we can do that in the following way.

$ sudo apt-get install python-setuptools
$ sudo easy_install stripogram


import urllib

from stripogram import html2text

# Fetch the page (Python 2 urllib API)
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
html_string = myurl.read()

# Strip the HTML tags, keeping only the text
text = html2text(html_string)

print text

Now you will have the whole web page in the “text” variable as plain text, not as an HTML string.
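If stripogram is not available to you (for example on Python 3), the same idea can be sketched with only the standard library's html.parser module. This is not how stripogram works internally, just a minimal alternative. The names _TextExtractor and html_to_text are made up for this example, and skipping script and style contents is my own assumption about what "plain text" should mean here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse repeated whitespace.
        # Note: an inline tag in the middle of a word will add a space there.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html_string):
    """Return the text of an HTML string with the tags removed (sketch)."""
    parser = _TextExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.text()
```

Passing the downloaded HTML string to html_to_text gives the page text on a single line with the whitespace collapsed, so the layout will differ from what html2text prints.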

Enjoy Python 🙂

Regards,
Arulalan.T

About arulalant

Currently working as "Project Scientist – C" in National Centre for Medium Range Weather Forecasting (NCMRWF), MoES, Noida, India

3 Responses to How to get the webpage content & How to extract the text from html string in Python

  1. susi says:

    But how to remove those tags and all?

  2. kiske says:

    Nice one 😉. Please, what would that code look like if we first need to log in to https:\\page and then strip that page's contents?

  3. mohi says:

    Wow… thats awesome…