{"id":64292,"date":"2025-05-03T13:20:39","date_gmt":"2025-05-03T13:20:39","guid":{"rendered":"https:\/\/www.askpython.com\/?p=64292"},"modified":"2025-11-19T13:21:53","modified_gmt":"2025-11-19T13:21:53","slug":"python-beautifulsoup-web-scraping-example","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/python-beautifulsoup-web-scraping-example","title":{"rendered":"Python BeautifulSoup Web Scraping Example"},"content":{"rendered":"\n<p>I was working on a Python project the other day when I needed to grab some product prices from an online store. Going through pages manually would have taken hours, but Python made it simple &#8211; just a few lines of code with BeautifulSoup and requests, and I had all the data I needed in minutes.<\/p>\n\n\n\n<p>Web scraping automates the process of extracting data from websites. When you visit a webpage, your browser receives HTML content from a server. Web scraping tools follow the same process &#8211; they send HTTP requests, receive HTML responses, and parse that data for specific information.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p><em>Web scraping turns manual data collection into automated workflows, saving developers countless hours of repetitive work. Python&#8217;s BeautifulSoup library makes this process straightforward by providing intuitive methods to navigate HTML structures and extract desired content.<\/em><\/p><\/blockquote><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Setting Up Your BeautifulSoup Environment<\/h2>\n\n\n\n<p>Before you start scraping websites, you&#8217;ll need three essential libraries:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npip3 install requests\npip3 install beautifulsoup4\npip3 install html5lib\n\n<\/pre><\/div>\n\n\n<p>The <code>requests<\/code> library handles HTTP requests to websites. 
BeautifulSoup parses HTML content, while html5lib acts as the parsing engine that creates a tree structure from raw HTML.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport requests\nfrom bs4 import BeautifulSoup\n\n<\/pre><\/div>\n\n\n<p>These imports give you everything needed to start extracting web data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Making Your First Request with Python bs4<\/h2>\n\n\n\n<p>Every web scraping project starts with fetching a webpage&#8217;s HTML content:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nURL = &quot;https:\/\/example.com&quot;\nresponse = requests.get(URL)\nprint(response.content)\n\n<\/pre><\/div>\n\n\n<p>This code sends a GET request to the specified URL and stores the server&#8217;s response. The <code>response.content<\/code> contains the raw HTML of the webpage.<\/p>\n\n\n\n<p>Sometimes servers block automated requests. Adding a user agent header mimics a real browser:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nheaders = {\n    &#039;User-Agent&#039;: &#039;Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36&#039;\n}\nresponse = requests.get(URL, headers=headers)\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Parsing HTML with BeautifulSoup<\/h2>\n\n\n\n<p>Raw HTML isn&#8217;t very useful on its own. 
BeautifulSoup transforms it into a navigable structure:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nsoup = BeautifulSoup(response.content, &#039;html5lib&#039;)\nprint(soup.prettify())\n\n<\/pre><\/div>\n\n\n<p>The <code>prettify()<\/code> method displays the HTML with proper indentation, making it readable:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: xml; title: ; notranslate\" title=\"\">\n&lt;html&gt;\n &lt;body&gt;\n  &lt;div class=&quot;product&quot;&gt;\n   &lt;h2&gt;Product Name&lt;\/h2&gt;\n   &lt;span class=&quot;price&quot;&gt;$29.99&lt;\/span&gt;\n  &lt;\/div&gt;\n &lt;\/body&gt;\n&lt;\/html&gt;\n\n<\/pre><\/div>\n\n\n<p>BeautifulSoup creates a parse tree where you can search for specific elements using tag names, classes, IDs, or attributes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Finding Elements<\/h2>\n\n\n\n<p>BeautifulSoup offers two main methods for locating elements:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The find() Method<\/h3>\n\n\n\n<p>Returns the first matching element:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Find first div with class &#039;product&#039;\nproduct = soup.find(&#039;div&#039;, class_=&#039;product&#039;)\n\n# Find element by ID\nheader = soup.find(&#039;div&#039;, id=&#039;header&#039;)\n\n# Find by multiple attributes\nspecial_item = soup.find(&#039;div&#039;, attrs={&#039;class&#039;: &#039;product&#039;, &#039;data-sale&#039;: &#039;true&#039;})\n\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">The find_all() Method<\/h3>\n\n\n\n<p>Returns all matching elements as a list:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Find all product divs\nproducts = soup.find_all(&#039;div&#039;, class_=&#039;product&#039;)\n\n# Find all links\nlinks = 
soup.find_all(&#039;a&#039;)\n\n# Limit results\nfirst_five_products = soup.find_all(&#039;div&#039;, class_=&#039;product&#039;, limit=5)\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Extracting Data<\/h2>\n\n\n\n<p>Once you&#8217;ve found elements, you can extract their content and attributes:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nproduct = soup.find(&#039;div&#039;, class_=&#039;product&#039;)\n\n# Get text content\nproduct_name = product.h2.text\nprice = product.find(&#039;span&#039;, class_=&#039;price&#039;).text\n\n# Get attributes\nimage_url = product.img&#x5B;&#039;src&#039;]\nproduct_link = product.a&#x5B;&#039;href&#039;]\n\n# Handle missing elements safely\ndescription = product.find(&#039;p&#039;, class_=&#039;description&#039;)\nif description:\n    desc_text = description.text\nelse:\n    desc_text = &quot;No description available&quot;\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Reading the Parse Tree<\/h2>\n\n\n\n<p>BeautifulSoup lets you move through HTML elements using dot notation:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Access nested elements\ncontainer = soup.find(&#039;div&#039;, class_=&#039;container&#039;)\nfirst_product = container.div\nproduct_title = first_product.h2.text\n\n# Navigate siblings\nnext_product = first_product.find_next_sibling(&#039;div&#039;)\nprevious_product = first_product.find_previous_sibling(&#039;div&#039;)\n\n# Navigate parents\nproduct_section = first_product.parent\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Scraping Product Data<\/h2>\n\n\n\n<p>Let&#8217;s build a complete scraper that extracts product information. I&#8217;m using a sample website with clean HTML for demonstration but you can apply the methods similarly to most websites. 
<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport requests\nfrom bs4 import BeautifulSoup\nimport csv\n\ndef scrape_products(url):\n    headers = {&#039;User-Agent&#039;: &#039;Mozilla\/5.0&#039;}\n    response = requests.get(url, headers=headers)\n    soup = BeautifulSoup(response.content, &#039;html5lib&#039;)\n    \n    products = &#x5B;]\n    \n    # Find all product containers\n    for item in soup.find_all(&#039;div&#039;, class_=&#039;product-item&#039;):\n        product = {}\n        \n        # Extract product details\n        product&#x5B;&#039;name&#039;] = item.find(&#039;h3&#039;, class_=&#039;product-name&#039;).text.strip()\n        product&#x5B;&#039;price&#039;] = item.find(&#039;span&#039;, class_=&#039;price&#039;).text.strip()\n        product&#x5B;&#039;url&#039;] = item.find(&#039;a&#039;)&#x5B;&#039;href&#039;]\n        \n        # Handle optional fields\n        rating = item.find(&#039;div&#039;, class_=&#039;rating&#039;)\n        product&#x5B;&#039;rating&#039;] = rating.text.strip() if rating else &#039;No rating&#039;\n        \n        products.append(product)\n    \n    return products\n\n# Scrape and save data\nproducts = scrape_products(&#039;https:\/\/example-shop.com\/products&#039;)\n\n# Save to CSV\nwith open(&#039;products.csv&#039;, &#039;w&#039;, newline=&#039;&#039;, encoding=&#039;utf-8&#039;) as file:\n    writer = csv.DictWriter(file, fieldnames=&#x5B;&#039;name&#039;, &#039;price&#039;, &#039;url&#039;, &#039;rating&#039;])\n    writer.writeheader()\n    writer.writerows(products)\n\n<\/pre><\/div>\n\n\n<p>Sample output (products.csv):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nname,price,url,rating\nWireless Headphones,$79.99,\/products\/wireless-headphones,4.5 stars\nSmart Watch,$199.99,\/products\/smart-watch,4.8 stars\nBluetooth 
Speaker,$49.99,\/products\/bluetooth-speaker,No rating\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Handling Dynamic Content<\/h2>\n\n\n\n<p>Many modern websites load content dynamically. BeautifulSoup works with static HTML only. For dynamic content, you&#8217;ll need to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Check the Network tab in browser developer tools<\/li>\n\n\n\n<li>Find API endpoints that return JSON data<\/li>\n\n\n\n<li>Use requests to call these APIs directly<\/li>\n<\/ol>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Example: Scraping from an API\napi_url = &quot;https:\/\/example.com\/api\/products&quot;\nresponse = requests.get(api_url)\nproducts = response.json()\n\nfor product in products&#x5B;&#039;items&#039;]:\n    print(f&quot;{product&#x5B;&#039;name&#039;]}: ${product&#x5B;&#039;price&#039;]}&quot;)\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Best Practices<\/h2>\n\n\n\n<p>Web scraping requires responsible behavior:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Respect robots.txt<\/strong>: Check if the website allows scraping<\/li>\n\n\n\n<li><strong>Add delays<\/strong>: Space out requests to avoid overwhelming servers<\/li>\n\n\n\n<li><strong>Handle errors gracefully<\/strong>: Websites change, elements disappear<\/li>\n\n\n\n<li><strong>Cache responses<\/strong>: Store data locally to minimize requests<\/li>\n<\/ol>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport time\nfrom requests.exceptions import RequestException\n\ndef safe_scrape(url, delay=1):\n    try:\n        response = requests.get(url, timeout=10)\n        response.raise_for_status()\n        time.sleep(delay)  # Be polite\n        return BeautifulSoup(response.content, &#039;html5lib&#039;)\n    except RequestException as e:\n        print(f&quot;Error scraping {url}: {e}&quot;)\n        return 
None\n\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Common Pitfalls<\/h2>\n\n\n\n<p>Watch out for these issues:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Class names change<\/strong>: Websites update their HTML structure<\/li>\n\n\n\n<li><strong>IP blocking<\/strong>: Too many requests trigger security measures<\/li>\n\n\n\n<li><strong>Legal concerns<\/strong>: Some sites prohibit scraping in their terms<\/li>\n\n\n\n<li><strong>Encoding issues<\/strong>: Handle special characters properly<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Handle encoding\nsoup = BeautifulSoup(response.content, &#039;html5lib&#039;, from_encoding=&#039;utf-8&#039;)\n\n# Deal with special characters\ntext = element.text.encode(&#039;ascii&#039;, &#039;ignore&#039;).decode(&#039;ascii&#039;)\n\n<\/pre><\/div>\n\n\n<p>Web scraping opens up countless possibilities for data collection and automation. Start with simple projects, respect website policies, and gradually tackle more complex scraping challenges as you gain experience with BeautifulSoup&#8217;s powerful features.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was working on a Python project the other day when I needed to grab some product prices from an online store. 
Going through pages manually would have taken hours, but Python made it simple &#8211; just a few lines of code with BeautifulSoup and requests, and I had all the data I needed in [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":64296,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-64292","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/64292","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=64292"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/64292\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/64296"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=64292"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=64292"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=64292"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
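To try the `find_all()` and text-extraction techniques without hitting a live site, you can feed BeautifulSoup an inline HTML string. This is a minimal sketch: the markup and product names are made up, and `html.parser` (Python's built-in parser) stands in for html5lib so no extra parser install is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded product page
html = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.99</span></div>
</body></html>
"""

# html.parser is bundled with Python; html5lib works the same way here
soup = BeautifulSoup(html, "html.parser")

# Dot notation grabs the first matching child; find() scopes the search to one div
names = [div.h2.text for div in soup.find_all("div", class_="product")]
prices = [div.find("span", class_="price").text
          for div in soup.find_all("div", class_="product")]

print(names)   # ['Widget', 'Gadget']
print(prices)  # ['$9.99', '$19.99']
```

Working against a fixed string like this is also a handy way to write regression tests for a scraper, since the selectors can be checked without any network traffic.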