<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Chandanhari on Medium]]></title>
        <description><![CDATA[Stories by Chandanhari on Medium]]></description>
        <link>https://medium.com/@chandanhari1924?source=rss-f3d71360599a------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*ZFOH_idbZGIw2YQl</url>
            <title>Stories by Chandanhari on Medium</title>
            <link>https://medium.com/@chandanhari1924?source=rss-f3d71360599a------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 02 Jul 2026 22:11:17 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@chandanhari1924/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Preprocessing in NLP: Unlocking the Hidden Power of Text]]></title>
            <link>https://medium.com/@chandanhari1924/preprocessing-in-nlp-unlocking-the-hidden-power-of-text-364bc7bbbe7e?source=rss-f3d71360599a------2</link>
            <guid isPermaLink="false">https://medium.com/p/364bc7bbbe7e</guid>
            <dc:creator><![CDATA[Chandanhari]]></dc:creator>
            <pubDate>Sun, 12 Jan 2025 19:37:40 GMT</pubDate>
            <atom:updated>2025-01-12T19:37:40.895Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GSpYIJg-wPXZBozotzMtPA.png" /><figcaption><strong>Essential Steps in Text Preprocessing for NLP Simplified 🚀📚</strong></figcaption></figure><p>Natural Language Processing (NLP) requires converting raw text into a structured and clean format before analysis or building machine learning models. This process is called <strong>text preprocessing</strong>, and it involves several steps based on the requirements of your problem statement (P.S).</p><h3><strong>Preprocessing of Text:</strong></h3><p><strong>Preprocessing:</strong></p><p>It is a technique for converting raw data into preprocessed data.</p><blockquote>There are two main stages:<br> 1. Cleaning, based on the problem statement</blockquote><blockquote>2. Simple Preprocessing</blockquote><blockquote>Raw text → single case</blockquote><blockquote>We convert characters or words to a single case, depending on the requirements of the P.S.</blockquote><p><strong>Problem Statement 1: Fake News Detection</strong><br>Sentence: Today Registration is open for Jio mobile at 1000 ₹.<br>Here, converting text to lowercase or uppercase doesn&#39;t affect the result because grammar isn&#39;t important.</p><p><strong>Problem Statement 2: Chatbot</strong><br>Sentence: We give input to <strong>CHATBOT</strong> it returns output.<br>Here, grammar might be important, so both uppercase and lowercase are preserved.</p><h3>Why Text Preprocessing?</h3><p>Preprocessing reduces the dimensionality of the data and can improve the performance of machine learning models.</p><ul><li><strong>Tabular data</strong>: Columns are dimensions, and rows are data points.</li><li><strong>Images</strong>: Pixels are dimensions.</li><li><strong>Text</strong>: Every character or word is considered a dimension.</li></ul><p>As the dimensionality of the data given to an <strong>ML algorithm</strong> <strong>↑</strong>, the performance of <strong>ML</strong> 
<strong>↓</strong></p><h3>1. Case Conversion</h3><p>Changing text to lowercase or uppercase helps standardize data.</p><p><strong>When to Use?</strong></p><p>Convert to <strong>lowercase</strong> if grammar isn’t important (e.g., sentiment analysis).</p><p>Preserve case when grammar matters (e.g., chatbots or grammatical analysis).</p><pre># Case (apply one of the two, depending on the problem statement)<br>data[coln] = data[coln].str.lower()  # convert to lowercase<br>data[coln] = data[coln].str.upper()  # convert to uppercase</pre><h3>2. Emoji Handling</h3><p>Convert emojis to their textual descriptions or remove them completely.</p><p><strong>When to Use?</strong></p><ul><li>Convert <strong>emojis</strong> to text if their sentiment or meaning is important.</li><li>Remove <strong>emojis</strong> if they do not add value to your analysis. Most of the time, we preserve them.</li></ul><pre>import emoji<br><br># Emoji<br># By default, &quot;:&quot; is used as the delimiter around each emoji name.<br># To remove or replace it, use the delimiters parameter.<br>data[coln] = data[coln].apply(lambda x: emoji.demojize(x, delimiters=(&quot;&quot;, &quot;&quot;)))</pre><h3>3. Tags Handling</h3><p>Remove <strong>HTML</strong> or <strong>XML</strong> tags from the text.</p><p><strong>When to Use?</strong></p><ul><li>When dealing with web-scraped data containing tags that don’t contribute to the analysis.</li></ul><pre>import re<br><br># Tag (HTML, XML) removal: replace anything between &quot;&lt;&quot; and &quot;&gt;&quot; with a space<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;&lt;[^&gt;]+&gt;&quot;, &quot; &quot;, x))</pre><h3>4. URL Removal</h3><p>Remove <strong>URLs</strong> from the text.</p><p><strong>When to Use?</strong></p><ul><li>When URLs do not add significant information to your analysis (e.g., for sentiment analysis).</li></ul><pre># URL removal<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;https?://\S+&quot;, &quot; &quot;, x))</pre><h3>5. 
Email Removal</h3><p>Remove <strong>email </strong>addresses from the text.</p><p><strong>When to Use?</strong></p><ul><li>When email addresses are not relevant to the context or analysis.</li></ul><pre># Email removal<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;\S+@\S+&quot;, &quot; &quot;, x))</pre><h3>6. Mentions Handling</h3><p>Remove mentions and hashtags (e.g., <strong>@username</strong> or <strong>#hashtag</strong>) from the text.</p><p><strong>When to Use?</strong></p><ul><li>When <strong>mentions</strong> or <strong>hashtags</strong> do not contribute meaningfully to the analysis.</li></ul><pre># Mention and hashtag removal<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;\B[@#]\S+&quot;, &quot; &quot;, x))</pre><h3>7. Digit Removal</h3><p>Remove <strong>numerical digits</strong> from the text.</p><p><strong>When to Use?</strong></p><ul><li>When numbers do not add relevant context (e.g., in reviews or text-based analysis).</li></ul><pre># Digit removal<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;\d&quot;, &quot; &quot;, x))</pre><h3>8. Date Removal</h3><p>Remove dates from the text.</p><p><strong>When to Use?</strong></p><ul><li>When dates are not relevant or are considered <strong>noise</strong> in your analysis.</li></ul><pre># Dates<br># Matches a string that consists only of a date like &#39;dd/mm/yyyy&#39;<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$&quot;, &quot; &quot;, x))<br># Matches dates like &#39;yyyy/mm/dd&#39; anywhere in the text<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;\b[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\b&quot;, &quot; &quot;, x))<br># Matches dates like &#39;dd/mm/yyyy&#39; anywhere in the text<br>data[coln] = data[coln].apply(lambda x: re.sub(r&quot;\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b&quot;, &quot; &quot;, x))</pre><h3>9. 
Punctuation Removal</h3><p>Remove <strong>punctuation</strong> marks from the text.</p><p><strong>When to Use?</strong></p><ul><li>When <strong>punctuation</strong> does not add meaningful context to your analysis.</li></ul><pre>import string<br># Punctuation removal (string.punctuation holds the ASCII punctuation characters)<br>data[coln] = data[coln].apply(lambda x: re.sub(f&quot;[{re.escape(string.punctuation)}]&quot;, &quot; &quot;, x))</pre><h3>10. Tokenization</h3><p>Tokenization splits text into smaller units like words or sentences. <strong>NLTK</strong> does not provide a dedicated character tokenizer. For word and sentence tokenization, you can use libraries like <strong>Natural Language Toolkit (NLTK)</strong>.</p><h3>Types:</h3><ol><li><strong>Word Tokenization</strong>: Breaking the <strong>sentence</strong> into <strong>words.</strong></li></ol><pre># Install first (shell): pip install nltk<br># word_tokenize needs NLTK&#39;s tokenizer data, e.g. nltk.download(&quot;punkt&quot;) (&quot;punkt_tab&quot; on newer NLTK releases)<br>from nltk.tokenize import word_tokenize<br>text = &quot;I love biryani&quot;<br>words = word_tokenize(text)<br>print(words) <br><br># Output: [&#39;I&#39;, &#39;love&#39;, &#39;biryani&#39;]</pre><p><strong>2. Sentence Tokenization</strong>: Breaking the<strong> document</strong> into <strong>sentences.</strong></p><pre># Install first (shell): pip install nltk<br>from nltk.tokenize import sent_tokenize<br>paragraph = &quot;I love selenium. I love beautiful soup.&quot;<br>sentences = sent_tokenize(paragraph)<br>print(sentences) <br><br># Output: [&#39;I love selenium.&#39;, &#39;I love beautiful soup.&#39;]</pre><h3>11. Stop Words (Advanced Preprocessing)</h3><p><strong>What are Stop Words?</strong><br>Stop words are common words in a language that typically do not carry significant meaning, such as <strong>is</strong>, <strong>the</strong>, and <strong>and</strong>. These words may not contribute much to the context of <strong>documents</strong> or <strong>sentences.</strong></p><p><strong>When to Remove or Keep Stop Words?</strong></p><ul><li><strong>Removing Stop Words</strong>:<br>If stop words are not important for the problem statement you’re solving, they can be removed. 
This helps reduce the <strong>dimensionality</strong> of the data, making computations faster and models <strong>more efficient</strong>.</li><li><strong>Preserving Stop Words</strong>:<br>If grammar or sentence structure is essential for your task, such as chatbots or grammatical analysis, it is better to preserve stop words.</li><li><strong>How to Handle Stop Words in NLP?</strong><br>When working with NLP, you can use libraries like <strong>Natural Language Toolkit (NLTK)</strong> to handle stop words.</li></ul><pre># Importing the library<br>from nltk.corpus import stopwords<br># The stop word lists are needed once: nltk.download(&quot;stopwords&quot;)<br><br># Getting the list of stop words for English<br>stop_words = stopwords.words(&quot;english&quot;)  # pass the language explicitly<br><br># Checking the total number of stop words in English<br>print(len(stop_words))  # Output: 179 (the count can differ between NLTK releases)</pre><p>To remove <strong>stop words</strong> from the <strong>data</strong>, we use the following code:</p><pre>from nltk.tokenize import word_tokenize<br><br>for doc in data[col]:<br>    kept = []<br>    for word in word_tokenize(doc):<br>        # Keep only the words that are not stop words<br>        if word.lower() not in stop_words:<br>            kept.append(word)<br>    print(&quot; &quot;.join(kept))</pre><h3>12. Handling Contractions in NLP</h3><p><strong>What are Contractions?</strong><br>Contractions are <strong>shortened forms</strong> of words or phrases.</p><p>For example:</p><ul><li><strong>asap</strong> stands for <strong>as soon as possible</strong>.</li><li><strong>you’re</strong> expands to <strong>you are</strong>.</li></ul><p>Machines cannot easily interpret contractions when processing text in Natural Language Processing (<strong>NLP</strong>). 
To make text clearer for analysis, we need to expand these contractions into their full forms.</p><pre># Install first (shell): pip install contractions<br>import contractions<br># Expanding contractions<br>print(contractions.fix(&quot;asap&quot;)) <br># Output: &quot;as soon as possible&quot;<br>print(contractions.fix(&quot;I&#39;d come with you&quot;)) <br># Output: &quot;I would come with you&quot;</pre><pre>data[&quot;Expanded_col&quot;] = data[col].apply(lambda x: contractions.fix(x))</pre><h3>13a. Stemming</h3><p><strong>What is Stemming?</strong><br>Stemming converts an inflected word to its root form (the stem). The main disadvantage is that the stem may not be an actual English word.</p><ul><li><strong>happiness</strong> becomes <strong>happi</strong>.</li></ul><p><strong>Types of Stemming Algorithms</strong>:</p><ol><li><strong>Porter Stemming</strong>:</li></ol><ul><li>This is a popular rule-based algorithm that applies five phases of suffix-stripping rules to a word.</li><li><strong>Disadvantage</strong>: It is designed for English only, so it is not suitable for other languages.</li></ul><p><strong>2. Snowball Stemming</strong>:</p><ul><li>Snowball is an improved version of the Porter stemmer that works for multiple languages, not just English.</li></ul><p><strong>3. Lancaster Stemming</strong>:</p><ul><li>This method is iterative and more aggressive than the other two.</li><li><strong>Disadvantage</strong>: Because it strips so aggressively, it can remove too much from a word (over-stemming), leading to the creation of non-English words. 
This results in lower-quality stems.</li></ul><pre>from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer<br>porter = PorterStemmer()<br>snowball = SnowballStemmer(language=&#39;english&#39;)<br>lancaster = LancasterStemmer()<br>word = &quot;programmers&quot;<br>print(&quot;Porter Stemmer:&quot;, porter.stem(word))<br>print(&quot;Snowball Stemmer:&quot;, snowball.stem(word))<br>print(&quot;Lancaster Stemmer:&quot;, lancaster.stem(word))</pre><pre># Output of the above code<br>Porter Stemmer: programm<br>Snowball Stemmer: programm<br>Lancaster Stemmer: prog</pre><h3>13b. Lemmatization</h3><p><strong>What is Lemmatization?</strong><br>Lemmatization is another technique for reducing words to their root form, but it works differently from stemming. Unlike stemming, <strong>lemmatization</strong> ensures that the word is valid and meaningful in the language.</p><p><strong>How Does Lemmatization Work?</strong><br>Lemmatization uses a <strong>database</strong> (like WordNet) to check if the root word is a valid English word. Here’s how it works:</p><ul><li>For example, take the word <strong>running</strong>. The algorithm will first remove suffixes like <strong>ing</strong> to get <strong>run</strong>.</li><li>It will then check if “run” is a valid word in the database. 
If yes, it stops there.</li><li>If the word run isn’t valid, it continues removing more suffixes until it gets a valid word.</li></ul><pre>from nltk.stem import WordNetLemmatizer<br>import nltk<br><br># Download WordNet database<br>nltk.download(&#39;wordnet&#39;)<br><br># Create a WordNetLemmatizer instance<br>lemmatizer = WordNetLemmatizer()<br><br># Give a word<br>word = &quot;programming&quot;<br><br># Lemmatize it; pos=&quot;v&quot; (verb) is assumed here, because with the default pos=&quot;n&quot; the noun &quot;programming&quot; is returned unchanged<br>lemmatized_word = lemmatizer.lemmatize(word, pos=&quot;v&quot;)<br><br># Display the result<br>print(&quot;Lemmatized Word:&quot;, lemmatized_word)</pre><pre>Lemmatized Word: program</pre><h3>Summary:</h3><ul><li><strong>Stemming</strong> is faster but might produce non-English words because it uses rules without checking if the word is valid.</li><li><strong>Lemmatization</strong> is slower but returns valid dictionary words because it checks a database.</li></ul><p>Both techniques are useful in different scenarios depending on whether you need speed (stemming) or accuracy (lemmatization).</p><h3><strong>PREPROCESSING FUNCTION</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0Oa26S-9d8ApgnogX5JhrA.png" /></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introduction to Natural Language Processing (NLP):]]></title>
            <link>https://medium.com/@chandanhari1924/introduction-to-natural-language-processing-nlp-db3d62a9ac7f?source=rss-f3d71360599a------2</link>
            <guid isPermaLink="false">https://medium.com/p/db3d62a9ac7f</guid>
            <dc:creator><![CDATA[Chandanhari]]></dc:creator>
            <pubDate>Mon, 06 Jan 2025 13:24:35 GMT</pubDate>
            <atom:updated>2025-01-06T13:24:35.482Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YyGYRTDMcXkilzjGDnZavQ.jpeg" /><figcaption><strong>Natural Language Processing (NLP)</strong></figcaption></figure><h3>Bridging the Gap Between Humans and Machines</h3><p><strong>Imagine this</strong>: When Earth was created, there were only two people based on biological concepts. Over time, the human population began to grow. The first challenge humans faced was <strong>communication</strong>. Initially, they had no way to share ideas or work together. But eventually, they created <strong>languages </strong>to communicate with one another.</p><p>As time passed, these <strong>languages </strong>evolved, enabling <strong>humans </strong>to share thoughts and emotions and to collaborate on complex tasks.</p><p>Now that humans communicate effortlessly using <strong>natural languages</strong>, another challenge has arisen: <strong>communicating</strong> with <strong>machines</strong>. Humans know <strong>natural</strong> <strong>languages</strong>, but machines only understand their own <strong>machine</strong>-<strong>level languages</strong>. This gap inspired the creation of <strong>Natural Language Processing</strong> (<strong>NLP</strong>), a field of <strong>Artificial Intelligence (AI)</strong> that helps machines process and understand human languages.</p><h3>What is NLP?</h3><p><strong>Natural Language Processing (NLP)</strong> is a subfield of <strong>Artificial Intelligence</strong> (<strong>AI</strong>) that focuses on helping machines process, understand, and analyze human (<strong>natural</strong>) language. These human languages vary based on culture, religion, and region, but collectively, they are referred to as natural languages. <strong>NLP </strong>makes it possible for machines to understand and work with these natural languages.</p><p>Thus, <strong>NLP </strong>was introduced. 
It enables machines to mimic natural languages, making it easier for humans to interact with them. For example, applications like <strong>ChatGPT</strong> and <strong>Gemini</strong> are built using <strong>NLP </strong>techniques.</p><h3>Why do we need NLP?</h3><p>Humans can easily communicate with each other in their preferred languages. But for machines to understand and respond to humans naturally, <strong>NLP</strong> is required.</p><p>For example, voice assistants like Alexa or Google Assistant help humans interact with machines comfortably by converting spoken language into commands machines can understand.</p><h3>Life Cycle of an NLP Project</h3><p>Let’s walk through the life cycle of an NLP project using a real-world example.</p><p><strong>Problem Statement:</strong> Imagine a client who owns a news agency. They want a system that can classify news as <strong>fake</strong> or <strong>real</strong> based on the text data.</p><p><strong>Collect Data:</strong></p><p>Data can be collected from servers through an API, using an <strong>API key</strong>. <strong>Web scraping</strong> is commonly used when collecting <strong>text data</strong>. Datasets can also be downloaded directly from <strong>Kaggle</strong>.</p><p><strong>Simple Exploratory Data Analysis (EDA):</strong></p><p>The raw data is checked for quality. This involves identifying issues like <strong>HTML</strong> <strong>tags</strong>, <strong>emojis</strong>, <strong>URLs</strong>, or irrelevant symbols. For example, emojis may need to be replaced with their text equivalents, and unwanted <strong>tags</strong> may need to be removed.</p><p><strong>Preprocessing:</strong></p><p>The data is cleaned based on the problem statement. For this project, we might remove URLs, hashtags, and <strong>stopwords</strong> (e.g., <em>“is,” “the,” “and”</em>).</p><p><strong>Feature Engineering:</strong></p><p><strong>Text data</strong> is converted into numerical representations (<strong>vectors</strong>). 
For instance, <strong>Bag of Words</strong> or <strong>TF-IDF</strong> techniques can be used to convert <strong>text</strong> into <strong>numbers</strong>.</p><p><strong>Model Training:</strong></p><p>A machine learning model is trained on the preprocessed data.</p><p><strong>Testing:</strong></p><p>The model is tested to check how well it <strong>classifies</strong> a given news article as <strong>fake</strong> or <strong>real</strong>.</p><p><strong>Deploying and Monitoring:</strong></p><p>The model is deployed into production and monitored for accuracy over time.</p><h3>Key Terminologies in NLP</h3><p>Here are some important terms and concepts to understand in NLP:</p><p><strong>1. Corpus</strong>: A collection of <strong>documents </strong>(e.g., news articles, books).</p><p><strong>2. Document:</strong> A single entity within a <strong>corpus</strong>, such as a <strong>paragraph </strong>or <strong>sentence</strong>.</p><p><strong>3. Tokenization:</strong> Breaking down text into smaller units, called <strong>tokens</strong>.</p><p><strong>Sentence Tokenization:</strong> Splitting text into sentences.<br><strong>Example:</strong> “I love biryani. I hate junk food.”<br><strong>Tokens:</strong> [“I love biryani.”, “I hate junk food.”]</p><p><strong>Word Tokenization:</strong> Splitting sentences into words.<br><strong>Example:</strong> “I love data.”<br><strong>Tokens:</strong> [“I”, “love”, “data”]</p><p><strong>4. Stop Words:</strong> Words like “is,” “the,” and “and” that don’t contribute much meaning to the text.</p><p><strong>5. Vectorization:</strong> Converting text into numerical format. 
Common techniques include:</p><p><strong>One-Hot Encoding</strong></p><p><strong>Bag of Words (BoW)</strong></p><p><strong>TF-IDF </strong>(Term Frequency-Inverse Document Frequency)</p><p><strong>Word Embeddings:</strong> Techniques like Word2Vec, GloVe, and FastText.</p><h3>Real-Life Applications of NLP</h3><p>Natural Language Processing (NLP) is transforming the way we interact with technology, making machines smarter and communication more seamless. Below are some everyday applications of NLP:</p><h4>1. Chatbots and Virtual Assistants</h4><p><em>Examples:</em> <strong>Siri, Alexa, Google Assistant</strong>, and customer support chatbots. These systems use NLP to understand user queries, process them, and respond intelligently, enabling smooth human-computer interaction.</p><h4>2. Language Translation</h4><p><em>Examples:</em> <strong>Google Translate, Microsoft Translator</strong>. NLP algorithms translate text or speech between languages while maintaining the meaning and context, bridging communication gaps worldwide.</p><h4>3. Text Summarization</h4><p><em>Examples:</em> <strong>News aggregators</strong>, tools like Resoomer. NLP condenses long documents or articles into concise summaries, extracting key information for quick understanding.</p><h4>4. Spam Detection</h4><p><em>Examples:</em> <strong>Gmail’s</strong> spam filtering system. NLP analyzes email content, identifying spam based on language <strong>patterns</strong>, <strong>keywords</strong>, and contextual clues to keep your inbox clutter-free.</p><h4>5. Sentiment Analysis</h4><p><em>Examples:</em> <strong>Social media</strong> monitoring tools and customer feedback analysis platforms. NLP evaluates the emotional tone in text to determine whether opinions are <strong>positive</strong>, <strong>negative</strong>, or <strong>neutral</strong>, helping businesses understand customer sentiments.</p><h4>6. Personalized Content Recommendations</h4><p><em>Examples:</em> Platforms like 
<strong>Netflix, YouTube</strong>, and <strong>Amazon</strong>. NLP processes user behavior and preferences to recommend tailored content, enhancing user experience and engagement.</p>]]></content:encoded>
        </item>
    </channel>
</rss>