<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Krish Soni on Medium]]></title>
        <description><![CDATA[Stories by Krish Soni on Medium]]></description>
        <link>https://medium.com/@isonikrish?source=rss-80877cfaa6dc------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*I4gbMOpmGQVBwFSs5qO76w.jpeg</url>
            <title>Stories by Krish Soni on Medium</title>
            <link>https://medium.com/@isonikrish?source=rss-80877cfaa6dc------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 06:38:09 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@isonikrish/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Feature Engineering in Machine Learning — The Secret Sauce Behind Smart Models]]></title>
            <link>https://medium.com/@isonikrish/feature-engineering-in-machine-learning-the-secret-sauce-behind-smart-models-ee598fbafed4?source=rss-80877cfaa6dc------2</link>
            <guid isPermaLink="false">https://medium.com/p/ee598fbafed4</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[predictions]]></category>
            <dc:creator><![CDATA[Krish Soni]]></dc:creator>
            <pubDate>Wed, 03 Sep 2025 06:46:13 GMT</pubDate>
            <atom:updated>2025-09-03T06:46:13.817Z</atom:updated>
            <content:encoded><![CDATA[<h3>Feature Engineering in Machine Learning — The Secret Sauce Behind Smart Models</h3><p>When we think of machine learning, most people picture fancy algorithms like neural networks or random forests. But here’s the truth: <strong>a simple model with well-engineered features can often beat a complex model with messy data.</strong></p><p>This process of shaping raw data into meaningful inputs for a model is called <strong>Feature Engineering</strong>, and it’s one of the most powerful (yet underrated) parts of the ML pipeline.</p><p>In this blog, we’ll explore the main areas of feature engineering with simple explanations and examples:</p><ul><li>Feature Transformation</li><li>Feature Selection</li><li>Other useful tricks (handling missing values, outliers, and feature creation)</li></ul><h3><strong>1. What is Feature Engineering?</strong></h3><p>Feature engineering is the process of <strong>transforming raw data into features that make machine learning algorithms work better.</strong></p><p>Think of it like cooking: raw vegetables aren’t very tasty, but when you cut, season, and cook them, they become a delicious dish. Similarly, raw data often needs to be cleaned, scaled, and encoded before models can “digest” it.</p><h3><strong>2. Feature Transformation</strong></h3><p>Feature transformation means changing the way a feature is represented without changing the underlying information. This makes data easier for models to process.</p><p>The two most common transformations are <strong>Scaling</strong> and <strong>Encoding</strong>.</p><h4><strong>A) Feature Scaling</strong></h4><p>Not all features are measured on the same scale. 
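</p><p>One common remedy is <strong>min-max scaling</strong>, which maps each column to the [0, 1] range. Here’s a minimal pure-Python sketch of the idea (scikit-learn’s MinMaxScaler does the same thing per numeric column):</p>

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range using the observed min/max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

km_driven = [500, 300_000, 45_000]
print([round(v, 2) for v in min_max_scale(km_driven)])  # → [0.0, 1.0, 0.15]
```

<p>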
Imagine predicting car prices with these features:</p><ul><li>km_driven ranges from <strong>500 to 300,000</strong></li><li>owner ranges from <strong>0 to 3</strong></li></ul><p>If we feed this directly into a distance-based model like KNN (or a linear model trained with gradient descent), the model will treat km_driven as far more important just because it has bigger numbers.</p><p>That’s where <strong>scaling</strong> comes in.</p><p>Before:</p><pre>km_driven: [500, 300000, 45000]<br>owner:     [0, 1, 2]</pre><p>After:</p><pre>km_driven: [0.0, 1.0, 0.15]<br>owner:     [0.0, 0.33, 0.67]</pre><p>✅ Now both features are on a comparable scale, and the model can judge importance fairly.</p><h4>B) Feature Encoding</h4><p>Models cannot understand text directly.</p><p>👉 Suppose we want to include fuel type when predicting prices:</p><ul><li>Possible values: <strong>Petrol, Diesel, CNG</strong></li></ul><p>If we just assign numbers (Petrol = 1, Diesel = 2, CNG = 3), the model will think Diesel &gt; Petrol or CNG &gt; Diesel, which is wrong.</p><p>That’s why we use <strong>encoding</strong>.</p><p><strong>Before:</strong></p><pre>fuel: [Petrol, Diesel, CNG, Petrol]</pre><p><strong>After (One-Hot Encoding):</strong></p><pre>fuel_Petrol  fuel_Diesel  fuel_CNG<br>     1            0           0<br>     0            1           0<br>     0            0           1<br>     1            0           0</pre><p>✅ Now each fuel type is represented fairly, without a fake ordering.</p><h3>3. Feature Selection</h3><p>Not all features help the model. Some add noise or unnecessary complexity.</p><p>👉 Example: predicting car prices:</p><ul><li>Useful features: brand, fuel, km_driven, owner</li><li>Less useful: seller_name, car_color</li></ul><p><strong>Before:</strong></p><pre>[brand, fuel, km_driven, owner, car_color, seller_name]</pre><p><strong>After (Selected):</strong></p><pre>[brand, fuel, km_driven, owner]</pre><p>✅ Removing irrelevant features reduces overfitting and speeds up training.</p><h3>4. Other Handy Tricks in Feature Engineering</h3><ul><li><strong>Handling Missing Values:</strong> Fill missing km_driven with the median instead of dropping rows.</li><li><strong>Outlier Treatment:</strong> Remove extreme values like “1 crore km driven.”</li><li><strong>Feature Creation:</strong> From date_of_birth, create a new feature, age.</li></ul><h3>Final Takeaway</h3><p>Feature Engineering is the backbone of Machine Learning. It prepares raw data so algorithms can do their job effectively.</p><ul><li><strong>Scaling</strong> → puts numbers on the same level</li><li><strong>Encoding</strong> → converts text to numbers</li><li><strong>Selection</strong> → keeps only what matters</li><li><strong>Transformation</strong> → reshapes features to reveal patterns</li></ul><blockquote><em>Better features → Better models → Better predictions.</em></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee598fbafed4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predict Instagram Likes with Just Hashtags using Python + Linear Regression]]></title>
            <link>https://medium.com/@isonikrish/predict-instagram-likes-with-just-hashtags-using-python-linear-regression-6e684b0b4faa?source=rss-80877cfaa6dc------2</link>
            <guid isPermaLink="false">https://medium.com/p/6e684b0b4faa</guid>
            <category><![CDATA[linear-regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[predictions]]></category>
            <dc:creator><![CDATA[Krish Soni]]></dc:creator>
            <pubDate>Fri, 11 Jul 2025 14:44:05 GMT</pubDate>
            <atom:updated>2025-07-11T14:44:05.615Z</atom:updated>
            <content:encoded><![CDATA[<p>Ever wondered how many likes your Instagram post might get <strong>based on the hashtags</strong> you use? In this article, we’ll build a fun, hands-on machine learning project that predicts the number of <strong>likes</strong> using past hashtag performance data.</p><h3>📦 What We’ll Build</h3><p>A small Python script that:</p><ul><li>Takes your post’s hashtags as input</li><li>Analyzes their past performance</li><li>Predicts how many likes your post might get based on that</li></ul><p>All using just <strong>Pandas</strong>, <strong>Scikit-Learn</strong>, and a CSV file.</p><h3>🧠 Prerequisites</h3><ul><li>Basic understanding of Python and machine learning</li><li>pandas, scikit-learn, and a CSV of your Instagram data (with columns like: Likes, Hashtags, etc.)</li></ul><h3>🧾 Step 1: Import Libraries and Load Data</h3><pre>import pandas as pd<br>from sklearn.linear_model import LinearRegression<br>from sklearn.model_selection import train_test_split<br><br># Load the dataset<br>df = pd.read_csv(&quot;instagram_data.csv&quot;, encoding=&#39;latin1&#39;)</pre><h3>🧹 Step 2: Clean Hashtags Column</h3><pre># Fill missing values and lowercase everything for consistency<br>df[&#39;Hashtags&#39;] = df[&#39;Hashtags&#39;].fillna(&#39;&#39;).str.lower()</pre><h3>🔍 Step 3: Extract and Encode Hashtags</h3><pre># Explode all hashtags into individual tags<br>all_hashtags = df[&#39;Hashtags&#39;].str.split().explode()<br><br># Get the 20 most common hashtags<br>top_hashtags = all_hashtags.value_counts().head(20).index.tolist()<br><br># Create binary columns for each top hashtag (1 if present, 0 if not)<br># Match whole tags, not substrings, so &#39;#ai&#39; doesn&#39;t also match &#39;#aiart&#39;<br>for tag in top_hashtags:<br>    df[tag] = df[&#39;Hashtags&#39;].apply(lambda x: 1 if tag in x.split() else 0)</pre><h3>📊 Step 4: Prepare Features and Target Variable</h3><pre>X = df[top_hashtags]  # Independent variables<br>y = df[&#39;Likes&#39;]       # Target variable</pre><h3>🏋️ Step 5: Train the Linear Regression Model</h3><pre>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)<br><br>model = LinearRegression()<br>model.fit(X_train, y_train)</pre><h3>🧪 Step 6: Predict Likes for New Hashtags</h3><pre># Ask the user for input (lowercased and split to match the training data)<br>input_tags = input(&quot;Enter hashtags for your post: &quot;).lower().split()<br><br># Check if at least one known hashtag is present<br>known = any(tag in input_tags for tag in top_hashtags)<br><br>if not known:<br>    print(&quot;⚠️ No known high-performing hashtags found.&quot;)<br>else:<br>    # Create a feature vector from the input (as a DataFrame so column names match training)<br>    input_vector = pd.DataFrame([[1 if tag in input_tags else 0 for tag in top_hashtags]], columns=top_hashtags)<br>    <br>    # Predict likes<br>    predicted_likes = model.predict(input_vector)<br>    print(f&quot;🎯 Predicted Likes: {int(predicted_likes[0])}&quot;)</pre><h3>💡 Example Input &amp; Output</h3><p>Input:</p><pre>Enter hashtags for your post: #python #ai</pre><p>Output:</p><pre>🎯 Predicted Likes: 66</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6e684b0b4faa" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Tokenization in LLMs]]></title>
            <link>https://medium.com/@isonikrish/understanding-tokenization-in-llms-e3b4c43e5c6c?source=rss-80877cfaa6dc------2</link>
            <guid isPermaLink="false">https://medium.com/p/e3b4c43e5c6c</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Krish Soni]]></dc:creator>
            <pubDate>Sat, 28 Jun 2025 09:40:26 GMT</pubDate>
            <atom:updated>2025-06-28T09:40:26.956Z</atom:updated>
            <content:encoded><![CDATA[<p>When you interact with a Large Language Model (LLM) like GPT, your input text doesn’t go into the model as-is. It first goes through a crucial process called <strong>tokenization</strong>. This step is what allows the model to break down human language into something it can understand and work with.</p><h3>🤔 <strong>What is tokenization?</strong></h3><p><strong>Tokenization</strong> is the process of splitting text into smaller units called <strong>tokens</strong>. These tokens can be words, parts of words (subwords), or even characters, depending on the tokenizer used.</p><h3>📌 Example:</h3><p>“Tokenization is powerful.”</p><p>This might get tokenized as:</p><p>[“Token”, “ization”, “ is”, “ powerful”, “.”]</p><h3>🔍 Breaking Down the Example</h3><p>Notice how the word &quot;Tokenization&quot; is split into two tokens: &quot;Token&quot; and &quot;ization&quot;. This is done because the tokenizer tries to reuse common word parts. If a word is uncommon, it&#39;s more efficient to break it into familiar segments.</p><p>Also, &quot; is&quot; and &quot; powerful&quot; include the space in the token; that&#39;s how the tokenizer remembers where words start or end.</p><h3>🤔 Why Tokenization?</h3><ol><li><strong>Efficient Vocabulary:</strong> Instead of remembering every possible word (which is infinite), the model learns a limited set of subwords that can represent any text.</li><li><strong>Handles New or Rare Words:</strong> If the model sees a rare word like neurogenomics, it might not know the full word, but it can understand &quot;neuro&quot;, &quot;genom&quot;, and &quot;ics&quot; separately.</li><li><strong>Smaller Context Units:</strong> Smaller tokens give the model more precision. 
For example, breaking &quot;unbelievable&quot; into &quot;un&quot;, &quot;believ&quot;, and &quot;able&quot; helps the model understand the meaning better.</li><li><strong>Optimizes Performance and Cost:</strong> Most LLMs work with a <strong>token limit</strong> (e.g., 4,000 or 8,000 tokens). Also, API usage is priced <em>per token</em>. Understanding tokenization helps you stay within limits and optimize usage.</li></ol><h3>🧪 Types of Tokenizers</h3><ol><li><strong>GPT-3/4</strong>: Byte Pair Encoding (BPE) or Byte-level BPE</li><li><strong>BERT</strong>: WordPiece</li><li><strong>T5</strong>: SentencePiece</li></ol><h3>🔚 Conclusion</h3><p>Tokenization might seem like a technical detail, but it’s one of the most important parts of how LLMs work. It bridges the gap between natural language and machine understanding. The better you understand tokenization, the better you’ll be at using and building AI systems.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e3b4c43e5c6c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Synchronous vs. Asynchronous in JavaScript]]></title>
            <link>https://medium.com/@isonikrish/understanding-synchronous-vs-asynchronous-in-javascript-c72b6966423d?source=rss-80877cfaa6dc------2</link>
            <guid isPermaLink="false">https://medium.com/p/c72b6966423d</guid>
            <category><![CDATA[js]]></category>
            <category><![CDATA[asynchronous-programming]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[synchronous]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Krish Soni]]></dc:creator>
            <pubDate>Sun, 24 Nov 2024 04:38:09 GMT</pubDate>
            <atom:updated>2024-11-24T04:38:09.250Z</atom:updated>
            <content:encoded><![CDATA[<p>JavaScript is <strong>everywhere</strong>. From interactive websites to server-side applications, it’s the go-to language for developers. One of the fundamental concepts in JavaScript is understanding how it handles tasks, specifically the difference between synchronous and asynchronous behavior. Let’s break it down simply.</p><h3><strong>What is Synchronous JavaScript?</strong></h3><p>Synchronous execution is JavaScript’s default behavior. Code runs line by line, in the order it appears, and each line waits for the previous one to finish. This is sometimes called blocking behavior, because the program waits for the current task to complete before moving on.</p><p>Example:</p><pre>console.log(&quot;Step 1&quot;);  <br>console.log(&quot;Step 2&quot;);  <br>console.log(&quot;Step 3&quot;);  </pre><p>Output:</p><pre>Step 1  <br>Step 2  <br>Step 3</pre><p>As shown in the example, each console.log runs only after the previous one finishes.</p><h3>What is Asynchronous JavaScript?</h3><p>Asynchronous code allows JavaScript to handle time-consuming operations, such as API requests, file reading, or database access, without blocking the execution of other tasks. While the asynchronous operation runs, the main program continues to execute synchronous code. Once the asynchronous task is complete, its result is processed.</p><p>Example:</p><pre>console.log(&quot;Step 1&quot;);<br><br>async function step2(){<br>    // await defers the rest of this function to the microtask queue,<br>    // even when the awaited value is not a Promise<br>    console.log(await &quot;Step 2&quot;)<br>}<br>step2();<br>console.log(&quot;Step 3&quot;);</pre><p>Output:</p><pre>Step 1<br>Step 3<br>Step 2</pre><p>As shown in the example, Step 1 and Step 3 run immediately because they are synchronous. Step 2, however, sits behind an await, so it doesn’t run right away: await schedules the rest of the function to run as a microtask. Once all the synchronous code (Steps 1 and 3) has finished executing, Step 2 is then processed.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c72b6966423d" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>