Learn Web Scraping using Python
Many of us have come across situations where we need to extract information from websites. We usually just copy what we need, but what if the data is so large that copying it by hand becomes impractical? This is where web scraping comes into play, and in this article we will learn to do web scraping using Python. Let’s start!
What is Web scraping?
Web scraping is an automated method of extracting data from websites. Wondering why we can’t just copy and paste? Of course you can, but separating the data you need from everything else quickly becomes tiring, and data on websites is usually unstructured. Web scraping helps us collect this unstructured data and store it in a structured form.
Wondering whether it is legal to scrape any website? Some websites allow web scraping while others don’t. To find out, check the website’s “robots.txt” file by appending “/robots.txt” to the URL of the site you want to scrape.
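As a quick illustration, here is a minimal sketch that fetches and prints a site’s robots.txt using the requests library (introduced later in this article); the URL is only an example.
import requests

# Fetch the robots.txt file of the site you plan to scrape
response = requests.get('https://data-flair.training/robots.txt')
# The file lists which paths crawlers are allowed or disallowed to access
print(response.text)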
Applications of Web Scraping
We just said that scraping extracts data. Wondering what we actually do with that data? Here are some applications.
1. Price Comparison: This is used for comparing the prices of similar products from different online shopping websites.
2. Gathering emails: We have all received marketing emails from websites we subscribed to. Do you think these emails are sent individually? Of course not! Web scraping is used to collect email IDs and then send bulk emails.
3. Scraping Social Media Content: Social Media websites are scraped to find out what’s trending.
4. Research and Analysis: Large sets of data (statistics, reviews, temperatures, etc.) are collected for analysis and R&D, for example to develop and test a model.
5. Listings: Details of job openings, interviews, etc. are collected from different websites and made available in one place.
Why is Python Good for Web Scraping?
We will be using Python for web scraping, and it is very well suited for the task for the following reasons.
1. Ease of Use: Python is simple to code, with an easy-to-learn syntax.
2. Large Collection of Libraries: It has a huge collection of libraries, which provides methods and services making the coding task easier. It also has modules specifically for web scraping purposes.
3. Dynamically typed: Python does not need you to define data types for variables. This saves time and makes the job faster.
4. Small code, large task: As mentioned previously, Python has built-in functions that let small amounts of code perform large tasks.
5. Community: It has a huge community working on improvements and answering queries. This active community helps you whenever you are stuck.
Libraries for Web scraping
Python has libraries built explicitly for the purpose of web scraping, and they come with multiple built-in functions that make scraping easy. These include:
- Requests
- Beautiful Soup
- lxml
- Selenium
The requests Library
Requests is a library used for making HTTP requests to a specific URL and getting the response. It also contains some inbuilt functions for managing both the request and response.
This library can be installed using the below command:
pip install requests
Making the request
Once installed, you can send an HTTP GET request to the required URL using the get() function. The response object returned contains information from the server about the website, so it can be used to get details like the URL, status code, content, etc., as shown in the code below.
Example of making an HTTP request:
import requests
# Making a GET request
r = requests.get('https://data-flair.training/')
print(r)
#printing the URL
print(r.url)
# print the status code
print(r.status_code)
# print content of request
print(r.content)
Output
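Before parsing a response, it is a good habit to confirm that the request actually succeeded (status code 200). A minimal sketch of such a check, using the same URL as above:
import requests

r = requests.get('https://data-flair.training/')
# Proceed only if the server responded with success (status code 200)
if r.status_code == 200:
    print('Request succeeded, content length:', len(r.content))
else:
    print('Request failed with status code:', r.status_code)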
Beautiful Soup
Beautiful Soup is a Python library built specifically for web scraping. It works with a parser to extract data from HTML, generating a parse tree that can be searched and modified, and it can handle even invalid markup. However, it cannot request HTML from web servers itself, which is why we use it together with the requests library.
This library can be installed using the following command:
pip install beautifulsoup4
Inspecting Website
Before extracting any information from a website, it is important to understand its structure. This helps in understanding the format of the data inside the website and in extracting the required information. You can do this by right-clicking on the web page and selecting the Inspect option.
After doing this, you get to see the Document Object Model (DOM) of the website as shown below.
Parsing HTML using Beautiful Soup
Now let’s see how to get the HTML that makes up a website using Beautiful Soup.
Example of parsing HTML using Beautiful soup:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# parsing HTML
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
Output
Here, we first make a request to the website using the get() function and pass the content to the BeautifulSoup() class. The resulting object, shown in the figure below, contains the HTML of the page in a readable format.
lxml
lxml is a Python parsing library that works with both HTML and XML. It is a fast, powerful, and easy-to-use library, especially when extracting data from large documents. However, poorly designed HTML can hamper its parsing capabilities. Even with this library, we first need to fetch the HTML using the requests library. It can be installed using the command below:
pip install lxml
Scraping with lxml works in the following way:
1. We first request the page from the website as discussed above.
2. Then, using the fromstring() function from the html module of the lxml library, we build the tree object as shown below.
from lxml import html
import requests

response = requests.get('https://data-flair.training/')
tree = html.fromstring(response.text)
3. Then we use the xpath() function to write a query that extracts the required information from the page, as shown in the sketch below.
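A minimal sketch continuing from the tree object built above (the h1 query is only an illustration; adapt the XPath expression to the page you are scraping):
# XPath query: collect the text of every h1 element on the page
headings = tree.xpath('//h1/text()')
print(headings)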
Selenium
Nowadays, almost all websites are responsive and dynamic, which poses a problem for Python web scraping libraries like requests. This is where Selenium comes in: an open-source browser automation tool (web driver) that renders web pages just like any browser.
It mainly requires three components:
- Web Browser – Chrome, Edge, Firefox and Safari
- Driver for the browser
- The Python selenium package
You can install the package using the below command
pip install selenium
Now we start by importing the appropriate class for the browser from the selenium package. Then we create an object of that class, giving it the path of the driver executable.
After this, we will use the get() method to load any page in the browser as shown below.
Example of loading a page using selenium:
from selenium.webdriver import Chrome

driverObj = Chrome(executable_path='/driver/path/on/your/device')
driverObj.get('https://data-flair.training/')
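Since dynamic pages may finish rendering after the initial load, you can also ask Selenium to wait for elements before scraping. A minimal sketch continuing from the driverObj created above (the 10-second timeout is just an illustrative value):
# Ask the driver to wait up to 10 seconds for elements to appear
driverObj.implicitly_wait(10)
driverObj.get('https://data-flair.training/')
print(driverObj.title)  # title of the loaded page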
Selenium also allows us to use CSS selectors and XPath to extract data from websites. Let’s see an example of getting all the blog titles using CSS selectors.
Example of getting all the blog titles using CSS selectors:
blog_titles = driverObj.find_elements_by_css_selector('h2.blog-card__content-title')
for title in blog_titles:
    print(title.text)
driverObj.quit()  # closing the browser
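Selenium can locate the same elements with XPath instead of a CSS selector. A rough sketch of an equivalent query (assuming the same blog-card__content-title class exists on the page):
# Locate the blog titles with an XPath expression instead of a CSS selector
blog_titles = driverObj.find_elements_by_xpath("//h2[@class='blog-card__content-title']")
for title in blog_titles:
    print(title.text)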
While Selenium can handle dynamic websites, it makes the web scraping process slow, because it must first execute the JavaScript code of each page before making it available for parsing. This makes it less suitable for large-scale data extraction.
Comparison between the three web scraping libraries
| | Requests | Beautiful Soup | lxml | Selenium |
| --- | --- | --- | --- | --- |
| Purpose | Making HTTP requests | Parsing | Parsing | Browser automation and rendering |
| Ease of use | High | High | Medium | Medium |
| Speed | Fast | Fast | Very fast | Slow |
| Ease of learning | High | High | Medium | Medium |
| Documentation | Very good | Very good | Good | Good |
| JavaScript support | None | None | None | Yes |
| CPU and memory usage | Low | Low | Low | High |
| Size of project supported | Large and small | Large and small | Large and small | Small |
Picking a web driver and browser
Every web scraper needs a browser to connect to the destination URL. We recommend using a regular (not headless) browser for testing purposes, especially for newcomers. Seeing the browser work makes troubleshooting and debugging simpler and gives a better understanding of the entire process.
On the other hand, headless browsers can be used later on, as they are more efficient for complex tasks. Here we will be using the Chrome browser; you can also use Firefox. Download the web driver that matches your browser’s version.
To do this, select the requisite package, download it, and unzip it. Then copy the driver’s executable file to any easily accessible directory.
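If you later move to a headless setup, a minimal sketch for launching Chrome without a visible window could look like this (the driver path is a placeholder for your own installation):
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a visible window
driverObj = Chrome(executable_path='/driver/path/on/your/device', options=options)
driverObj.get('https://data-flair.training/')
print(driverObj.title)
driverObj.quit()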
Selecting an appropriate URL
Previously we saw how to inspect a website and get a better understanding of its structure. Here are some more tips to help you pick a URL:
- It is very important to ensure that you are scraping public data and are not infringing on third-party rights. You can check the robots.txt file for guidance.
- Avoid data hidden in JavaScript elements, as these sometimes need to be scraped by performing specific actions and require a more sophisticated use of Python and its logic.
- Avoid scraping images with this approach; if you need them, you can download them directly with Selenium.
Extracting information
There seems to be a lot of HTML content in the output. Don’t worry, we also have methods to extract the useful information, such as the title.
Example of extracting data from the website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# check status code for response received
# success code - 200
soup = BeautifulSoup(r.content, 'html.parser')
#Extracting the title
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title)
print(page_head)
print(page_body)
Output
We can also get tags and their details. Let’s see how to get the title tag, its name, and the name of its parent tag.
Example of getting tags and tag information from the website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# check status code for response received
# success code - 200
soup = BeautifulSoup(r.content, 'html.parser')
# Getting the title tag of the page
print(soup.title)
# Getting the name of the title tag
print(soup.title.name)
# Getting the name of parent tag of the title
print(soup.title.parent.name)
Output
title
head
Selecting with Beautiful Soup
Beautiful Soup has a select() method that returns a list of the elements we want to select, such as headings. Let’s see an example of selecting all the level 2 headings (h2) and printing the second one from the resulting list.
Example of selecting a heading from a website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://www.amazon.in/')
# check status code for response received
# success code - 200
soup = BeautifulSoup(r.content, 'html.parser')
second_head = soup.select('h2')[1].text
print(second_head)
Output
Wondering what to do if we want to get all the headings or other components? Which construct do we use to access multiple elements or run through different conditions?
Yes, you are correct: looping statements. Let’s now see how to get all the elements under a tag.
Example of selecting all headings from the website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
soup = BeautifulSoup(r.content, 'html.parser')
all_h1_tags = []
for element in soup.select('h1'):
    all_h1_tags.append(element.text)
print(all_h1_tags)
Output
Finding components
This library also has other methods, like find(), that can be used to search for the required elements. Let’s dive deeper into it with some examples.
Finding by class
Let’s search based on the class. In the example below, we search for a div tag with the class ‘a-section’, then find all the h2 headings inside it and print them.
Example of finding elements by class:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://www.amazon.in/')
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('div', class_='a-section')
content = s.find_all('h2')
print(content)
Output
Finding by Id
We know that when we add components to a page, we use class, id, etc. to add properties and uniqueness to the elements. Now that we have searched based on class, let’s search based on id. It is the same as the previous example, except that we use the id parameter of the find() function to search.
Example of finding elements by id:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('div',id="cf-wrapper")
content = s.find_all('p')
print(content)
Output
Extracting Text from the tags
If you look at the outputs above, you will see that the tags got included along with the text. But in real-life applications, the important part is the content. Let’s take the above example and see how we remove the tags from it.
Example of getting text from the tags:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('div',id="cf-wrapper")
content = s.find_all('p')
for line in content:
    print(line.text)
Output
Here we find all the paragraphs and then run a loop to print the text of each element found. This prevents the tags from appearing in the output.
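If you only need the plain text of the whole section rather than each paragraph separately, Beautiful Soup’s get_text() method can return it in a single call. A small sketch, reusing the same div as above:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://data-flair.training/')
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('div', id="cf-wrapper")
# Get all the text inside the div at once, one line per piece of text
print(s.get_text(separator='\n', strip=True))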
Extracting Links
In many cases, links are attached to the content on a website. This is done using the <a> tag, with the link stored in the tag’s ‘href’ attribute. We will use this information, along with the find_all() function, to extract the links. Let’s see an example.
Example of getting links from a website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://data-flair.training/')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Output
Extracting Image Information
Websites also have images embedded in them, and extracting their details is what we do in the example below.
Example of extracting images from a website:
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://www.amazon.in/')
soup = BeautifulSoup(r.content, 'html.parser')
images_list = []
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    images_list.append({"src": src, "alt": alt})
for image in images_list:
    print(image)
Output
Here we select all the ‘img’ tags from the website, then get the ‘src’ and ‘alt’ attributes of each image and print them.
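Note that the src values may be relative paths. If you want full URLs, the standard library’s urljoin() can resolve them against the page address; a small sketch continuing from the images_list built above:
from urllib.parse import urljoin

page_url = 'https://www.amazon.in/'  # the page the images were scraped from
for image in images_list:
    if image["src"]:  # skip images without a src attribute
        print(urljoin(page_url, image["src"]))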
Scraping multiple Pages
Scraping information from multiple pages or multiple websites can be tedious by hand. Beautiful Soup can also scrape through multiple pages of the same website or through different URLs; we will see both cases.
When a website has multiple pages, we take a base URL and run a for loop to go through each page. Say we have a website with 10 pages and we want to extract some information, such as the title, from each page.
Example of running through multiple pages:
import requests
from bs4 import BeautifulSoup as bs
URL = ''
for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    print(soup.title.text)
We can follow the same process to run through multiple URLs by storing them in a list.
Example of running through multiple URLs:
import requests
from bs4 import BeautifulSoup as bs
URLs = ['https://www.amazon.in/','https://www.flipkart.com/']
for i in URLs:
    req = requests.get(i)
    soup = bs(req.text, 'html.parser')
    print(soup.title.text)
Output
Online Shopping Site for Mobiles, Electronics, Furniture, Grocery, Lifestyle, Books & More. Best Offers!
Here we store all the URLs in a list and then run through each URL in the list using a loop to extract required information from each website.
Saving Information in CSV
We can also save the information obtained from the website(s) on our device, for example in the form of a CSV file. In the example below, we use a for loop to run through different URLs and get the title of each page. Finally, we save the titles and URLs as a CSV.
import requests
from bs4 import BeautifulSoup as bs
import csv
URLs = ['https://www.amazon.in/gp/bestsellers/?ref_=nav_em_cs_bestsellers_0_1_1_2',
'https://www.amazon.in/gp/new-releases/?ref_=nav_em_cs_newreleases_0_1_1_3',
'https://www.amazon.in/gp/movers-and-shakers/?ref_=nav_em_ms_0_1_1_4',
'https://www.amazon.in/finds?ref_=nav_em_sbc_desktop_foundit_0_1_1_27',
'https://www.flipkart.com/mobile-phones-store?otracker=nmenu_sub_Electronics_0_Mobiles',
'https://www.flipkart.com/laptops-store?otracker=nmenu_sub_Electronics_0_Laptops',
'https://www.flipkart.com/books-store?otracker=nmenu_sub_Sports%2C%20Books%20%26%20More_0_Books',
'https://www.flipkart.com/offers-store?otracker=nmenu_offer-zone'
]
titles_list = []
for i in URLs:
    req = requests.get(i)
    soup = bs(req.text, 'html.parser')
    d = {}
    d['URL'] = i
    d['Title Name'] = soup.title.text
    titles_list.append(d)

filename = r'C:\Users\Sai Siva Teja\Downloads\titles.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['URL', 'Title Name'])
    w.writeheader()
    w.writerows(titles_list)
Python Web Scraping Output
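To confirm the file was written correctly, you can read it back with csv.DictReader; a small sketch reusing the filename variable from the code above:
# Read the saved CSV back and print each row as a dictionary
with open(filename, 'r', newline='') as f:
    for row in csv.DictReader(f):
        print(row['URL'], '->', row['Title Name'])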
Conclusion
Here we are at the end of the article on web scraping with Python. In this article, we were introduced to the concept of web scraping, sending HTTP requests, and extracting website data using Beautiful Soup in Python. Hope you enjoyed this article. Happy learning!