How to Remove HTML Tags from a String in Python?

As a Python developer working on web scraping projects, I often encounter strings that contain unwanted HTML tags. These tags can interfere with further processing and analysis of the extracted data. In this tutorial, I will explain how to remove HTML tags from a string in Python using two methods: regular expressions and BeautifulSoup. Let us learn them with the help of examples.

Remove HTML Tags from a String in Python

Let’s say you’re working on a project for a client in the United States. You’ve scraped some data from a website, but the extracted strings contain HTML tags. For example:

html_string = "<p>John Doe, a <strong>renowned scientist</strong> from <a href='https://example.com'>New York</a>, discovered a new species.</p>"

Your task is to remove all the HTML tags from this string while preserving the text content.

Method 1: Use Regular Expressions

One common approach to removing HTML tags is to use regular expressions. Python’s re module provides tools for pattern matching and string manipulation. Here’s how you can use it to remove HTML tags:

import re

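def remove_html_tags(text):
    # Match "<", then as few characters as possible, then ">"
    clean = re.compile(r'<.*?>')
    # Replace every match with an empty string
    return re.sub(clean, '', text)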

html_string = "<p>John Doe, a <strong>renowned scientist</strong> from <a href='https://example.com'>New York</a>, discovered a new species.</p>"
clean_text = remove_html_tags(html_string)
print(clean_text)

Output:

John Doe, a renowned scientist from New York, discovered a new species.

In this example, we define a function called remove_html_tags() that takes a string text as input. We compile the regular expression pattern '<.*?>' using re.compile(). This pattern matches a <, followed by as few characters as possible, up to the next >, which is the shape of a typical HTML tag.
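
To see why the ? in the pattern matters, here is a small illustrative sketch that compares the non-greedy pattern with a greedy one. The sample string and variable names are made up for this comparison.

```python
import re

sample = "<p>Hello <b>world</b></p>"

# Greedy: .* grabs as much as possible, so one match runs from the
# first "<" to the last ">" and swallows the text in between.
greedy_result = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the first ">" it can, so each tag is
# matched separately and the text between tags survives.
lazy_result = re.sub(r"<.*?>", "", sample)

print(repr(greedy_result))  # ''
print(repr(lazy_result))    # 'Hello world'
```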

We then use the re.sub() function to replace every occurrence of the pattern with an empty string '', which removes the HTML tags from the string. The resulting clean text is stored in the clean_text variable and printed.
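
One caveat worth knowing: by default, . does not match a newline, so a tag that is split across lines slips past this pattern. The sketch below shows one possible adjustment, which is to pass the re.DOTALL flag. It is not part of the function above; the name remove_html_tags_multiline and the sample string are made up for illustration.

```python
import re

def remove_html_tags_multiline(text):
    # re.DOTALL lets "." match newlines too, so tags that span
    # several lines are matched as well.
    pattern = re.compile(r"<.*?>", re.DOTALL)
    return pattern.sub("", text)

# A tag whose attributes continue on a second line.
multiline_html = "<a\n   href='https://example.com'>New York</a>"

without_flag = re.sub(r"<.*?>", "", multiline_html)
with_flag = remove_html_tags_multiline(multiline_html)

print(repr(without_flag))  # the opening <a ...> tag is still there
print(repr(with_flag))     # 'New York'
```

Even with this flag, a regular expression can still be tripped up by HTML comments, script blocks, or a > inside an attribute value, so it is best suited to simple, well-formed snippets.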

Method 2: Use BeautifulSoup

Another popular approach to removing HTML tags is to use the BeautifulSoup library. BeautifulSoup is a third-party library for parsing HTML and XML documents (install it with pip install beautifulsoup4), and it provides a convenient way to extract data from web pages. Here’s how you can use BeautifulSoup to remove HTML tags:

from bs4 import BeautifulSoup

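def remove_html_tags(text):
    # Parse the markup with Python's built-in HTML parser
    soup = BeautifulSoup(text, "html.parser")
    # Return only the text content, without any tags
    return soup.get_text()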

html_string = "<p>John Doe, a <strong>renowned scientist</strong> from <a href='https://example.com'>New York</a>, discovered a new species.</p>"
clean_text = remove_html_tags(html_string)
print(clean_text)

Output:

John Doe, a renowned scientist from New York, discovered a new species.

In this example, we import the BeautifulSoup class from the bs4 module. We define a function remove_html_tags() that takes a string text as input.

Inside the function, we create a BeautifulSoup object by passing the text and specifying the HTML parser as “html.parser”. This creates a parsed representation of the HTML document.

We then use the get_text() method of the BeautifulSoup object to extract all the text content from the parsed HTML, effectively removing the HTML tags. The resulting clean text is returned by the function.
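
When the input holds several block-level elements, get_text() joins their text with nothing in between. Here is a short sketch of one way to handle that, using the separator and strip arguments that get_text() accepts. The sample HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

multi_block_html = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
soup = BeautifulSoup(multi_block_html, "html.parser")

# Default: text pieces are concatenated directly.
joined = soup.get_text()

# separator puts a string between the pieces; strip=True trims
# whitespace from each piece and drops pieces that are empty.
spaced = soup.get_text(separator=" ", strip=True)

print(joined)  # TitleFirst paragraph.Second paragraph.
print(spaced)  # Title First paragraph. Second paragraph.
```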

Conclusion

In this tutorial, I explained how to remove HTML tags from a string in Python. I covered two methods: regular expressions with the re module, and the BeautifulSoup library.
