- ✅ Handle missing values
- ✅ Correct data formatting
- ✅ Standardize and normalize data
Data wrangling is the process of transforming raw data into a more useful format for analysis. It includes handling missing values, correcting data formats, normalizing values, and more.
Missing values can skew statistics and cause errors in calculations, so they should be identified and handled before analysis. Pandas provides tools for both steps.
- Identify missing data
- Deal with missing data
- Correct data format
In Pandas, we can detect missing values using:
# Check for missing values
df.isnull()
df.notnull()

In datasets, missing values may be represented by "?". We can replace them with NaN:
import numpy as np
df.replace("?", np.nan, inplace=True)

We can handle missing data in different ways:
- Drop Data
  - 🔻 Drop the whole row
  - 🔻 Drop the whole column
- Replace Data
  - 🔄 Replace with mean
  - 🔄 Replace with most frequent value
  - 🔄 Replace using other functions
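The options listed above can be sketched on a small table. This is a minimal illustration, not a prescribed workflow: the DataFrame, its column names (`price`, `doors`, `notes`), and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "doors": ["four", "two", np.nan, "four"],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# Drop the whole column: remove columns in which every value is missing.
df = df.dropna(axis=1, how="all")

# Drop the whole row: remove rows where "price" is missing.
rows_with_price = df.dropna(subset=["price"])

# Replace with mean: fill missing numeric values with the column average.
price_filled = df["price"].fillna(df["price"].mean())

# Replace with most frequent value: fill missing categories with the mode.
most_frequent = df["doors"].value_counts().idxmax()
doors_filled = df["doors"].fillna(most_frequent)
```

Dropping is usually reserved for rows or columns that carry little usable information, while replacement keeps the remaining values in the row available for analysis.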
Ensuring that all data is in the correct format is crucial for analysis.
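# Check the data type of each column
df.dtypes
# Convert a column to the desired type
df["column_name"] = df["column_name"].astype("desired_type")

Normalization scales values to a consistent range so that columns measured in different units can be compared. Common techniques: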
- Mean normalization (shifting values so the average is 0)
- Variance normalization (scaling values so the variance is 1)
- Min-Max scaling (scaling values to the range 0 to 1)
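As a rough sketch of the techniques listed above, the snippet below applies mean centering, a z-score (which combines mean and variance normalization), and Min-Max scaling to one column. The column name `length` and its values are assumptions made for the example. Note that pandas `std()` computes the sample standard deviation by default.

```python
import pandas as pd

# Toy column, invented for illustration only.
df = pd.DataFrame({"length": [150.0, 160.0, 170.0, 180.0, 190.0]})
col = df["length"]

# Mean normalization: shift values so the column average becomes 0.
df["length_centered"] = col - col.mean()

# Z-score: mean 0 and (sample) standard deviation 1.
df["length_zscore"] = (col - col.mean()) / col.std()

# Min-Max scaling: map the smallest value to 0 and the largest to 1.
df["length_minmax"] = (col - col.min()) / (col.max() - col.min())
```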
# Simple feature scaling: divide each value by the column maximum
df['normalized_column'] = df['column_name'] / df['column_name'].max()

Binning groups continuous numerical values into discrete categories.
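To see what equal-width binning produces, here is a small self-contained sketch. The horsepower values are made up for illustration; the three labels match the ones used in this guide.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only.
hp = pd.Series([48, 70, 95, 110, 150, 200, 262], name="horsepower")

# Four evenly spaced edges define three equal-width bins.
bins = np.linspace(hp.min(), hp.max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
binned = pd.cut(hp, bins, labels=group_names, include_lowest=True)

# Count how many values fall into each bin.
counts = binned.value_counts()
```

Because the bins have equal width rather than equal counts, a skewed column can leave most values in a single bin, as happens with "Low" here.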
# Four evenly spaced edges define three equal-width bins
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ["Low", "Medium", "High"]
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)

Dummy variables convert categorical values into numeric indicator (0/1) columns so they can be used in regression analysis.
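As a small illustration on made-up data, the sketch below shows the indicator columns that `pd.get_dummies` creates. The `price` column and all values are assumptions for the example. Passing `dtype=int` requests 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [100, 200, 150],
})

# One indicator column per category, with 1 marking the row's category.
dummies = pd.get_dummies(df["fuel-type"], dtype=int)

# Attach the indicator columns and remove the original text column.
encoded = pd.concat([df.drop("fuel-type", axis=1), dummies], axis=1)
```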
# Create one indicator column per fuel type
dummies = pd.get_dummies(df["fuel-type"])
# Append the indicator columns, then drop the original categorical column
df = pd.concat([df, dummies], axis=1)
df.drop("fuel-type", axis=1, inplace=True)

By completing this, you now understand how to:
- ✔ Handle missing data
- ✔ Correct data formatting
- ✔ Normalize and standardize data
- ✔ Use binning for grouped analysis
- ✔ Convert categorical data into numerical dummy variables