Stories by Chandanhari on Medium

Preprocessing in NLP: Unlocking the Hidden Power of Text

Chandanhari — Sun, 12 Jan 2025 19:37:40 GMT

Essential Steps in Text Preprocessing for NLP Simplified 🚀📚

Natural Language Processing (NLP) requires converting raw text into a structured and clean format before analysis or building machine learning models. This process is called text preprocessing, and it involves several steps based on the requirements of your problem statement (P.S).

Preprocessing of Text:

Preprocessing is the technique of converting raw data into clean, preprocessed data.

It mainly happens in two stages:
1. Cleaning, which depends on the problem statement

2. Simple preprocessing

Raw text → single case

Characters and words are converted to a single case (all lowercase or all uppercase) when the problem statement allows it.

Problem Statement-1: Fake News Detection
Sentence: Today Registration is open for Jio mobile at 1000 ₹.
Here, converting text to lowercase or uppercase doesn't affect the result because grammar isn't important.

Problem Statement 2: Chatbot
Sentence: We give input to CHATBOT it returns output.
Here, grammar might be important, so both uppercase and lowercase are preserved.

Why Text Preprocessing?

Preprocessing reduces the dimensionality of the data and improves the performance of machine learning models.

Tabular data: Columns are dimensions, and rows are data points.
Images: Pixels are dimensions.
Text: Every character or word is considered a dimension.

As the dimensionality of the input grows (↑), the performance of an ML model tends to drop (↓).
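A small sketch of how preprocessing shrinks the number of dimensions, assuming every distinct whitespace-separated token counts as one dimension. The two sentences and the `vocabulary` helper are made up for illustration.

```python
# Hypothetical toy corpus; the sentences are illustrative only.
docs = ["Registration is OPEN today", "registration is open Today"]

def vocabulary(documents):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for doc in documents for token in doc.split()}

raw_vocab = vocabulary(docs)
lower_vocab = vocabulary([doc.lower() for doc in docs])

# Lowercasing merges "OPEN"/"open", "Today"/"today", and so on.
print(len(raw_vocab), len(lower_vocab))  # prints: 7 4
```

Here a single step (lowercasing) takes the vocabulary from 7 dimensions down to 4 without losing any words.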

1. Case Conversion

Changing text to lowercase or uppercase helps standardize data.

When to Use?

Convert to lowercase if grammar isn’t important (e.g., sentiment analysis).

Preserve case when grammar matters (e.g., chatbots or grammatical analysis).

# Case (coln is the name of the text column; use one of the two lines, not both)
data[coln] = data[coln].str.lower()  # convert to lowercase
data[coln] = data[coln].str.upper()  # convert to uppercase

2. Emoji Handling

Convert emojis to their textual descriptions or remove them completely.

When to Use?

Convert emojis to text if their sentiment or meaning is important.
Remove emojis if they do not add value to your analysis. Most of the time we preserve them.

# Emoji
import emoji
# By default, ":" is used as the delimiter around each emoji name.
# To remove or replace it, use the delimiters parameter.
data[coln] = data[coln].apply(lambda x: emoji.demojize(x, delimiters=("", "")))

3. Tags Handling

Remove HTML or XML tags from the text.

When to Use?

When dealing with web-scraped data containing tags that don’t contribute to the analysis.

# Tags (HTML, XML) removal: match "<", then everything up to the next ">"
import re
data[coln] = data[coln].apply(lambda x: re.sub(r"<[^>]+>", " ", x))

4. URL Removal

Remove URLs from the text.

When to Use?

When URLs do not add significant information to your analysis (e.g., for sentiment analysis).

# URL removal
data[coln] = data[coln].apply(lambda x: re.sub(r"https?://\S+", " ", x))

5. Email Removal

Remove email addresses from the text.

When to Use?

When email addresses are not relevant to the context or analysis.

# Email removal
data[coln] = data[coln].apply(lambda x: re.sub(r"\S+@\S+", " ", x))

6. Mentions Handling

Remove mentions and hashtags (e.g., @username or #hashtag) from the text.

When to Use?

When mentions or hashtags do not contribute meaningfully to the analysis.

# Mention and hashtag removal
data[coln] = data[coln].apply(lambda x: re.sub(r"\B[@#]\S+", " ", x))
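To see steps 3 to 6 working together, here is a minimal sketch that applies them to a single made-up string instead of a DataFrame column. The function name `strip_web_noise` and the sample text are hypothetical, and the tag pattern `<[^>]+>` is an assumption that matches any single opening, closing, or self-closing tag.

```python
import re

def strip_web_noise(text):
    """Apply the tag, URL, email, and mention patterns in sequence."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML/XML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)       # email addresses
    text = re.sub(r"\B[@#]\S+", " ", text)     # mentions and hashtags
    return " ".join(text.split())              # collapse leftover whitespace

sample = "<p>Mail me@example.com or visit https://example.com #nlp @user now</p>"
print(strip_web_noise(sample))  # prints: Mail or visit now
```

Each pattern replaces its match with a space rather than an empty string, so neighbouring words never get glued together; the final `split`/`join` cleans up the extra spaces.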

7. Digit Removal

Remove numerical digits from the text.

When to Use?

When numbers do not add relevant context (e.g., in reviews or text-based analysis).

# Digit removal
data[coln] = data[coln].apply(lambda x: re.sub(r"\d", " ", x))

8. Date Removal

Remove dates from the text.

When to Use?

When dates are not relevant or are considered noise in your analysis.

# Dates: run this before digit removal, otherwise the digits are already gone
# Matches dates like 'yyyy/mm/dd' anywhere in the text
data[coln] = data[coln].apply(lambda x: re.sub(r"\b[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\b", " ", x))
# Matches dates like 'dd/mm/yyyy' anywhere in the text
data[coln] = data[coln].apply(lambda x: re.sub(r"\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b", " ", x))
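The order of digit removal and date removal matters. A quick sketch on one made-up sentence shows why; `remove_dates` and `remove_digits` are hypothetical helper names wrapping the same kind of patterns used in this article.

```python
import re

DATE_PATTERNS = [
    r"\b[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\b",  # yyyy/mm/dd
    r"\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b",  # dd/mm/yyyy
]

def remove_dates(text):
    """Replace every date that matches one of the patterns with a space."""
    for pattern in DATE_PATTERNS:
        text = re.sub(pattern, " ", text)
    return " ".join(text.split())

def remove_digits(text):
    """Replace every digit with a space."""
    return " ".join(re.sub(r"\d", " ", text).split())

sample = "Offer valid till 12/01/2025 for 1000 users"
print(remove_dates(sample))   # prints: Offer valid till for 1000 users
print(remove_digits(sample))  # prints: Offer valid till / / for users
```

If digits are stripped first, the slashes of the date are left behind and the date patterns can no longer match, so it is safer to remove dates before digits.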

9. Punctuation Removal

Remove punctuation marks from the text.

When to Use?

When punctuation does not add meaningful context to your analysis.

# Punctuation removal (the hyphen, brackets, and backslash are escaped inside the character class)
data[coln] = data[coln].apply(lambda x: re.sub(r"[!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]", " ", x))
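An alternative sketch that avoids typing the character class by hand: Python's standard `string.punctuation` constant lists the ASCII punctuation characters, and `re.escape` takes care of the escaping. The helper name `remove_punctuation` is hypothetical.

```python
import re
import string

# Build the character class from the standard ASCII punctuation list.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")

def remove_punctuation(text):
    """Replace each ASCII punctuation character with a space."""
    return " ".join(PUNCT_PATTERN.sub(" ", text).split())

print(remove_punctuation("Hello, world! Is this NLP-ready?"))
# prints: Hello world Is this NLP ready
```

Note that this only covers ASCII punctuation; curly quotes or the ₹ sign would need to be added separately.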

10. Tokenization

Tokenization splits text into smaller units such as words or sentences. When working with NLP, you can use libraries like the Natural Language Toolkit (NLTK), which provides word and sentence tokenizers but no dedicated character tokenizer.

Types:

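1. Word Tokenization: Breaking the sentence into words.

# pip install nltk
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models; newer NLTK releases may ask for "punkt_tab" instead
text = "I love biryani"
words = word_tokenize(text)
print(words)

# Output: ['I', 'love', 'biryani']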

2. Sentence Tokenization: Breaking the document into sentences.

from nltk.tokenize import sent_tokenize
paragraph = "I love selenium. I love beautiful soup."
sentences = sent_tokenize(paragraph)
print(sentences)

# Output: ['I love selenium.', 'I love beautiful soup.']
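Character tokenization needs no library at all. A minimal sketch in plain Python (the variable names are illustrative):

```python
text = "I love NLP"

# Every character becomes a token, spaces included.
char_tokens = list(text)

# Same idea, but skipping whitespace characters.
char_tokens_no_space = [ch for ch in text if not ch.isspace()]

print(char_tokens)
print(char_tokens_no_space)
```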

11. Stop Words (Advanced Preprocessing)

What are Stop Words?
Stop words are common words in a language that typically do not carry significant meaning, such as "is", "the", and "and". These words may not contribute much to the context of documents or sentences.

When to Remove or Keep Stop Words?

Removing Stop Words:
If stop words are not important for the problem statement you’re solving, they can be removed. This helps reduce the dimensionality of the data, making computations faster and models more efficient.

Preserving Stop Words:
If grammar or sentence structure is essential for your task (for example, chatbots or grammatical analysis), it is better to preserve stop words.

How to Handle Stop Words in NLP?
NLTK ships ready-made stop word lists for several languages.

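# Importing the library
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop word lists

# Getting the list of stop words for English
stop_words = stopwords.words("english")  # pass the language explicitly; without it NLTK returns the lists of all languages

# Checking the total number of stop words in English
print(len(stop_words))  # Output: 179 (the exact count depends on the NLTK data version)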

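To remove stop words from the data, we can use the following code:

stop_set = set(stop_words)  # a set makes membership checks fast
for doc in data[coln]:
    kept = []
    for word in word_tokenize(doc):
        # keep only the words that are not stop words
        if word.lower() not in stop_set:
            kept.append(word)
    print(" ".join(kept))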
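The same idea as a reusable function, sketched without NLTK so it runs on its own. The `STOP_WORDS` set here is a tiny hand-picked subset for illustration only (not NLTK's list), and plain `split()` stands in for `word_tokenize`.

```python
# Illustrative subset only; in practice pass set(stopwords.words("english")).
STOP_WORDS = {"is", "the", "and", "a", "for"}

def remove_stop_words(doc, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    kept = [word for word in doc.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("The registration is open for Jio mobile"))
# prints: registration open Jio mobile
```

Because it returns the cleaned string instead of printing it, a function like this can be passed directly to `data[coln].apply(...)`.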

12. Handling Contractions in NLP

What are Contractions?
Contractions are shortened forms of words or phrases.

For example:

asap stands for as soon as possible.
you’re expands to you are.

Machines cannot easily understand contractions when processing text in Natural Language Processing (NLP). To make text clearer for analysis, we need to expand these contractions into their full forms.

# pip install contractions
import contractions

# Expanding contractions
print(contractions.fix("asap"))
# Output: "as soon as possible"
print(contractions.fix("I'd come with you"))
# Output: "I would come with you"

# Apply to a whole column
data["Expanded_col"] = data[coln].apply(lambda x: contractions.fix(x))
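Conceptually, expanding contractions is a dictionary lookup. The sketch below is not how the `contractions` package is implemented; it is a minimal stand-in with a three-entry `CONTRACTION_MAP` made up for illustration.

```python
import re

# Hypothetical, deliberately tiny mapping; real libraries ship hundreds of entries.
CONTRACTION_MAP = {"you're": "you are", "i'd": "I would", "don't": "do not"}

# One alternation pattern that matches any key as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    re.IGNORECASE,
)

def expand(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand("I'd come if you're free"))
# prints: I would come if you are free
```

This toy version does not preserve capitalisation ("You're" becomes "you are") and only handles straight apostrophes, which is one reason to prefer a maintained library for real data.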

13a. Stemming

What is Stemming?
Stemming converts an inflected word to its root form (the stem) by stripping suffixes. The main disadvantage is that the stem may or may not be an actual English word.

For example, happiness becomes happi.

Types of Stemming Algorithms:

1. Porter Stemming:

This is a popular rule-based algorithm that processes a word in 5 sequential phases, each applying its own suffix-stripping rules.
Disadvantage: It is designed for English only, so it does not work well for other languages.

2. Snowball Stemming:

Snowball is an improved version of the Porter stemmer that works for multiple languages, not just English.

3. Lancaster Stemming:

This method applies its rules iteratively, which makes it the most aggressive of the three.
Disadvantage: Because of the repeated stripping, it can remove too much from a word and produce non-English words. This results in lower-quality stems.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer(language='english')
lancaster = LancasterStemmer()
word = "programmers"
print("Porter Stemmer:", porter.stem(word))
print("Snowball Stemmer:", snowball.stem(word))
print("Lancaster Stemmer:", lancaster.stem(word))

# Output of the above code
# Porter Stemmer: programm
# Snowball Stemmer: programm
# Lancaster Stemmer: prog
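To get a feel for why rule-based stemming produces non-words, here is a deliberately naive suffix stripper. It is not the Porter, Snowball, or Lancaster algorithm; the `SUFFIXES` list and the minimum stem length of 3 are arbitrary choices for illustration.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ["ness", "ing", "ers", "ed", "s"]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["happiness", "running", "programmers", "cats"]:
    print(w, "->", naive_stem(w))
# happiness -> happi
# running -> runn
# programmers -> programm
# cats -> cat
```

The rules never check a dictionary, so "happi", "runn", and "programm" come out even though none of them is an English word.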

13b. Lemmatization

What is Lemmatization?
Lemmatization is another technique for reducing words to their root form, but it works differently than stemming. Unlike stemming, lemmatization ensures that the word is valid and meaningful in the language.

How Does Lemmatization Work?
Lemmatization uses a database (like WordNet) to check if the root word is a valid English word. Here’s how it works:

For example, take the word running. The algorithm will first remove suffixes like ing to get run.
It will then check if “run” is a valid word in the database. If yes, it stops there.
If the word run isn’t valid, it continues removing more suffixes until it gets a valid word.

from nltk.stem import WordNetLemmatizer
import nltk

# Download WordNet database
nltk.download('wordnet')

# Create a WordNetLemmatizer instance
lemmatizer = WordNetLemmatizer()

# Give a word
word = "programming"

# Lemmatize it as a verb; the default part of speech is noun
lemmatized_word = lemmatizer.lemmatize(word, pos="v")

# Display the result
print("Lemmatized Word:", lemmatized_word)

Lemmatized Word: program
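The "strip a suffix, then check the database" idea described above can be sketched in a few lines. This is a toy model, not NLTK's implementation: `VALID_WORDS` is a four-word stand-in for WordNet, and `SUFFIX_RULES` is a made-up rule list.

```python
VALID_WORDS = {"run", "program", "study", "cat"}  # stand-in for the WordNet database
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word):
    """Return the first rule-derived candidate found in VALID_WORDS."""
    if word in VALID_WORDS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Also try undoing a doubled final consonant ("runn" -> "run").
            if candidate[-2:-1] == candidate[-1:]:
                undoubled = candidate[:-1]
            else:
                undoubled = candidate
            for option in (candidate, undoubled):
                if option in VALID_WORDS:
                    return option
    return word  # no valid form found: leave the word unchanged

for w in ["running", "programming", "studies", "cats", "happi"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike the stemmer, this function never returns a made-up stem: if no candidate is in the dictionary, the original word comes back untouched.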

Summary:

Stemming is faster but might produce non-English words because it applies rules without checking whether the result is a valid word.
Lemmatization is slower but returns valid words because it checks each candidate against a database.

Both techniques are useful in different scenarios depending on whether you need speed (stemming) or accuracy (lemmatization).

PREPROCESSING FUNCTION
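One way such a function might look is sketched below. It stitches together the regex-based steps from this article using only the standard library; the function name `preprocess`, its parameters, and the sample sentence are assumptions for illustration. Dates are removed before digits so that the date pattern still has something to match.

```python
import re
import string

PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

def preprocess(text, lowercase=True, stop_words=None):
    """Clean one document; each step mirrors a section of this article."""
    if lowercase:
        text = text.lower()                                 # 1. case conversion
    text = re.sub(r"<[^>]+>", " ", text)                    # 3. tags
    text = re.sub(r"https?://\S+", " ", text)               # 4. URLs
    text = re.sub(r"\S+@\S+", " ", text)                    # 5. emails
    text = re.sub(r"\B[@#]\S+", " ", text)                  # 6. mentions, hashtags
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", " ", text)  # 8. dd/mm/yyyy dates
    text = re.sub(r"\d", " ", text)                         # 7. digits
    text = re.sub(PUNCT_CLASS, " ", text)                   # 9. punctuation
    tokens = text.split()                                   # 10. simple tokenization
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]  # 11. stop words
    return " ".join(tokens)

sample = "<b>Today</b> Registration is open for Jio mobile at 1000! Details: https://example.com"
print(preprocess(sample, stop_words={"is", "for", "at"}))
# prints: today registration open jio mobile details
```

Which steps to keep still depends on the problem statement; for a chatbot, for instance, `lowercase=False` and no stop word list would preserve more of the original sentence.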

Introduction to Natural Language Processing (NLP):

Chandanhari — Mon, 06 Jan 2025 13:24:35 GMT

Natural Language Processing (NLP)

Bridging the Gap Between Humans and Machines

Imagine this: When Earth was created, there were only two people based on biological concepts. Over time, the human population began to grow. The first challenge humans faced was communication. Initially, they had no way to share ideas or work together. But eventually, they created languages to communicate with one another.

As time passed, these languages evolved, enabling humans to share thoughts, emotions, and collaborate on complex tasks.

Now that humans communicate effortlessly using natural languages, another challenge has arisen: communicating with machines. Humans know natural languages, but machines only understand their own machine-level languages. This gap inspired the creation of Natural Language Processing (NLP), a field of Artificial Intelligence (AI) that helps machines process and understand human languages.

What is NLP?

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on helping machines process, understand, and analyze human (natural) language. These human languages vary based on culture, religion, and regions, but collectively, they are referred to as natural languages. NLP makes it possible for machines to understand and work with these natural languages.

Thus, NLP was introduced. It enables machines to mimic natural languages, making it easier for humans to interact with them. For example, applications like ChatGPT and Gemini are built using NLP techniques.

Why do we need NLP?

Humans can easily communicate with each other in their preferred languages. But for machines to understand and respond to humans naturally, NLP is required.

For example, voice assistants like Alexa or Google Assistant help humans interact with machines comfortably by converting spoken language into commands machines can understand.

Life Cycle of an NLP Project

Let’s walk through the life cycle of an NLP project using a real-world example.

Problem Statement: Imagine a client who owns a news agency. They want a system that can classify news as fake or real based on the text data.

Collect Data:

Data can be collected from servers through an API (using an API key), by web scraping, which is the usual route for text data, or by downloading a ready-made dataset from Kaggle.

Simple Exploratory Data Analysis (EDA):

The raw data is checked for quality. This involves identifying issues like HTML tags, emojis, URLs, or irrelevant symbols. For example, emojis may need to be replaced with equivalent text and unwanted tags removed.

Preprocessing:

The data is cleaned based on the problem statement. For this project, we might remove URLs, hashtags, and stopwords (e.g. “is,” “the,” “and”).

Feature Engineering:

Text data is converted into numerical representations (vectors). For instance, Bag of Words or TF-IDF can be used to convert text into numbers.

Model Training:

A machine learning model is trained on the preprocessed data.

Testing:

The model is tested to classify whether a given news article is fake or real.

Deploying and Monitoring:

The model is deployed into production and monitored for accuracy over time.

Key Terminologies in NLP

Here are some important terms and concepts to understand in NLP:

1. Corpus: A collection of documents (e.g., news articles, books).

2. Document: A single entity within a corpus, such as a paragraph or sentence.

3. Tokenization: Breaking down text into smaller units, called tokens.

Sentence Tokenization: Splitting text into sentences.
Example: “I love biryani. I hate junk food.”
Tokens: [“I love biryani.”, “I hate junk food.”]

Word Tokenization: Splitting sentences into words.
Example: “I love data.”
Tokens: [“I”, “love”, “data”]

4. Stop Words: Words like “is,” “the,” and “and” that don’t contribute much meaning to the text.

5. Vectorization: Converting text into numerical format. Common techniques include:

One-Hot Encoding

Bag of Words (BoW)

TF-IDF (Term Frequency-Inverse Document Frequency)

Word Embeddings: Techniques like Word2Vec, GloVe, and FastText.
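A minimal sketch of Bag of Words and TF-IDF in plain Python, using three short made-up documents. It uses the textbook formula tf × log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Illustrative corpus, already lowercased and cleaned.
docs = ["i love biryani", "i hate junk food", "i love data"]
tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the order of the vector dimensions.
vocab = sorted({token for doc in tokenized for token in doc})

def bag_of_words(tokens):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tf_idf(tokens):
    """Weight each term by its frequency and by how rare it is in the corpus."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in tokenized if term in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector

print(vocab)
print(bag_of_words(tokenized[0]))
print(tf_idf(tokenized[0]))
```

The word "i" appears in every document, so its TF-IDF weight is 0, while a rarer word like "biryani" gets a positive weight. That is the sense in which TF-IDF down-weights stop-word-like terms.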

Real-Life Applications of NLP

Natural Language Processing (NLP) is transforming the way we interact with technology, making machines smarter and communication more seamless. Below are some everyday applications of NLP:

1. Chatbots and Virtual Assistants

Examples: Siri, Alexa, Google Assistant, and customer support chatbots. These systems use NLP to understand user queries, process them, and respond intelligently, enabling smooth human-computer interaction.

2. Language Translation

Examples: Google Translate, Microsoft Translator. NLP algorithms translate text or speech between languages while maintaining the meaning and context, bridging communication gaps worldwide.

3. Text Summarization

Examples: News aggregators, tools like Resoomer. NLP condenses long documents or articles into concise summaries, extracting key information for quick understanding.

4. Spam Detection

Examples: Gmail’s spam filtering system. NLP analyzes email content, identifying spam based on language patterns, keywords, and contextual clues to keep your inbox clutter-free.

5. Sentiment Analysis

Examples: Social media monitoring tools and customer feedback analysis platforms. NLP evaluates the emotional tone in text to determine whether opinions are positive, negative, or neutral, helping businesses understand customer sentiments.

6. Personalized Content Recommendations

Examples: Platforms like Netflix, YouTube, and Amazon. NLP processes user behavior and preferences to recommend tailored content, enhancing user experience and engagement.