Python Project – Web Crawlers
The Web Scraper project is developed in Python using the requests and BeautifulSoup libraries. It provides a simple tool that scrapes book titles and prices from a website and saves the extracted data to a CSV file. The project demonstrates the basic principles of web scraping and data extraction.
About Python Web Crawler Project
The web crawler project automates data extraction from web pages using Python. It employs Requests for fetching pages and BeautifulSoup for parsing HTML content. The crawl runs in a background thread (Python's threading module) so the interface stays responsive; for crawling many pages at once, a ThreadPoolExecutor can be used for parallel fetching, as sketched below. Extracted data is stored in a structured format such as CSV, which makes further analysis and use in data-driven tasks straightforward.
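A minimal sketch of that concurrent-fetching idea using concurrent.futures.ThreadPoolExecutor; the URL list and the fetch_title helper are illustrative placeholders, not part of the project code below:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def fetch_title(url):
    # Fetch one page and return its <title> text (or the error message).
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        return url, soup.title.get_text(strip=True) if soup.title else ''
    except requests.RequestException as e:
        return url, f"error: {e}"

urls = ['https://example.com', 'https://example.org']  # placeholder URLs

# Fetch all URLs in parallel with a small thread pool.
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, title in executor.map(fetch_title, urls):
        print(url, '->', title)

executor.map returns results in the same order as the input URLs, which keeps each page paired with its result.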
Objectives of Python Web Crawler Project
- Develop a robust web crawler to fetch and parse web pages automatically.
- Extract targeted information from HTML documents, such as titles, links, or specific content.
- Implement concurrent crawling techniques to enhance performance and speed.
- Store extracted data in a structured format for subsequent analysis or use.
Python Web Crawler Project Setup
Required Libraries
The project requires the following Python libraries (all three are third-party packages, not part of the standard library):
- Requests: Used for making HTTP requests to fetch web pages.
- Beautiful Soup (bs4): Used to parse HTML and XML documents and extract structured data.
- Pandas: Used for data manipulation and storage; here it holds the extracted data in tabular form and writes it to a CSV file.
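If these packages are not already installed, they can typically be added with pip install requests beautifulsoup4 pandas.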
Technology Stack
- Python
- Requests
- BeautifulSoup
Prerequisites for Python Web Crawler
- Basic understanding of Python programming.
- Familiarity with HTML and web scraping concepts.
Download Python Web Crawler Project
Please download the source code of the Python Web Crawler Project: Python Web Crawler Project Code.
Step-by-Step Implementation of Python Web Crawler Project
1. Importing Libraries
The first step imports the libraries needed for the Web Crawler and its GUI:
- tkinter: Used for creating the graphical user interface (GUI).
- messagebox: Provides dialog boxes for displaying messages to the user.
- requests: This is used to make HTTP requests to fetch web pages from the internet.
- BeautifulSoup: It parses HTML and XML documents.
- threading: Enables running tasks concurrently; here it runs the crawl in a background thread so the GUI does not freeze.
- pandas: Used for data manipulation and analysis.
import tkinter as tk
from tkinter import messagebox
import requests
from bs4 import BeautifulSoup
import pandas as pd
import threading
2. WebCrawlerApp Class
- __init__ is the constructor of the WebCrawlerApp class; it stores the root window and sets its title.
- tk.Label and tk.Entry create a label and an entry widget for the user to input the URL.
- self.results_text is a Text widget that displays the crawling results.
- Finally, it adds a "Start Crawling" button that calls the start_crawling method.
class WebCrawlerApp:
    def __init__(self, root):
        self.root = root
        self.root.title("PythonGeeks@Web Scraper")
        tk.Label(root, text="Enter URL:").grid(row=0, column=0, padx=10, pady=5)
        self.url_entry = tk.Entry(root, width=50)
        self.url_entry.grid(row=0, column=1, padx=10, pady=5)
        tk.Button(root, text="Start Crawling", command=self.start_crawling).grid(row=0, column=2, padx=10, pady=5)
        self.results_text = tk.Text(root, width=80, height=20)
        self.results_text.grid(row=1, column=0, columnspan=3, padx=10, pady=5)
3. Start Crawling Method
- This method reads the URL from the entry widget and strips surrounding whitespace.
- If the URL is empty, it shows an error dialog and returns.
- Otherwise it clears the results box and starts a new thread that runs the crawl method with the provided URL, so the GUI does not freeze.
def start_crawling(self):
    url = self.url_entry.get().strip()
    if not url:
        messagebox.showerror("Error", "Please enter a URL")
        return
    self.results_text.delete(1.0, tk.END)
    self.results_text.insert(tk.END, "Starting crawl...\n")
    thread = threading.Thread(target=self.crawl, args=(url,))
    thread.start()
4. Crawl Method
- This method fetches the web page with requests.get and checks that the request succeeded using raise_for_status.
- It then parses the HTML content with BeautifulSoup and calls extract_books on the parsed document.
- If any books were found, it saves them to a CSV file and displays them in the text widget; otherwise it reports that nothing was extracted.
- Any request error is caught and written to the results box.
def crawl(self, url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        books = self.extract_books(soup)
        if books:
            self.save_to_csv(books)
            self.display_results(books)
        else:
            self.results_text.insert(tk.END, "No books were extracted.\n")
    except requests.RequestException as e:
        self.results_text.insert(tk.END, f"Error fetching {url}: {e}\n")
5. Extract Books Method
- books = []: Initializes an empty list to store the extracted book information.
- book_elements: Uses BeautifulSoup's select to find all HTML elements with the class product_pod.
- For each book element, it extracts the title from the title attribute of the link inside the h3 tag and the price from the element with the class price_color.
- The method returns a list of dictionaries containing book titles and prices.
def extract_books(self, soup):
    books = []
    book_elements = soup.select('.product_pod')
    print(f"Found {len(book_elements)} book elements.")
    for book in book_elements:
        title = book.h3.a['title']
        price = book.select_one('.price_color').get_text(strip=True)
        books.append({'title': title, 'price': price})
    return books
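To see what extract_books does in isolation, the same selector logic can be run outside the GUI against a small hand-written HTML fragment; the markup below is only a stand-in for the real page structure, not actual scraped data:

from bs4 import BeautifulSoup

# Minimal stand-in markup with the classes the crawler expects.
sample_html = """
<article class="product_pod">
  <h3><a title="Sample Book" href="#">Sample Bo...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
for book in soup.select('.product_pod'):
    title = book.h3.a['title']
    price = book.select_one('.price_color').get_text(strip=True)
    print(title, price)  # prints: Sample Book £10.00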
6. Save & Display Results Method
- save_to_csv converts the list of books into a pandas DataFrame and writes it to books.csv.
- display_results writes a completion message, each extracted title and price, and a note that the data was saved to books.csv into the text widget.
- The check for whether anything was extracted happens in the crawl method, which only calls these two methods when the books list is not empty.
def save_to_csv(self, books):
    df = pd.DataFrame(books)
    df.to_csv('books.csv', index=False)

def display_results(self, books):
    self.results_text.insert(tk.END, "Crawl completed. Books extracted:\n")
    for book in books:
        self.results_text.insert(tk.END, f"Title: {book['title']}, Price: {book['price']}\n")
    self.results_text.insert(tk.END, "Books saved to books.csv\n")
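Because save_to_csv writes a plain CSV file, the results can later be loaded back with pandas for further analysis; the column names match the dictionaries built in extract_books:

import pandas as pd

df = pd.read_csv('books.csv')  # columns: title, price
print(df.head())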
7. Main Execution
- root = tk.Tk(): Creates the main application window.
- app = WebCrawlerApp(root): Creates an instance of the WebCrawlerApp class, passing root as an argument; this builds the crawler interface.
- root.mainloop(): Starts the Tkinter event loop, which listens for events and updates the GUI.
if __name__ == "__main__":
    root = tk.Tk()
    app = WebCrawlerApp(root)
    root.mainloop()
Python Web Crawler Output
Application Interface
Adding URL
Scraping Done
Added to CSV File
Conclusion
The Web Crawler application effectively demonstrates the integration of Tkinter, Requests, and BeautifulSoup for web scraping. It provides a practical solution for extracting and managing data from web pages. The project can be further enhanced with features like pagination handling, more sophisticated data extraction, and improved error handling for a more comprehensive web scraping solution.
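As a rough sketch of the pagination enhancement suggested above, the loop below keeps following a "next" link until none is found. The .next > a selector, the timeout value, and the use of urljoin are assumptions about the target site's pager markup, not part of the project code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl_all_pages(start_url):
    # Follow "next" links page by page and collect book data from each page.
    url = start_url
    books = []
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        for book in soup.select('.product_pod'):
            title = book.h3.a['title']
            price = book.select_one('.price_color').get_text(strip=True)
            books.append({'title': title, 'price': price})
        # Assumes the pager marks its "next" link with a "next" class.
        next_link = soup.select_one('.next > a')
        url = urljoin(url, next_link['href']) if next_link else None
    return books

The collected list has the same structure as the one produced by extract_books, so it could be passed straight to save_to_csv.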




