🔹 Missing Values Handling in Machine Learning
Cleaning Incomplete Data for Reliable Models
Missing values occur when no data is stored for a variable in an observation. Handling them properly is important because many machine learning algorithms cannot work with missing data.
🔎 Why Missing Values Occur
- Data entry errors
- Sensor failure
- Survey non-response
- Data corruption
- Optional fields left blank
🔹 Methods to Handle Missing Values
1️⃣ Remove Missing Data (Deletion Method)
✔ A. Remove Rows (Listwise Deletion)
Use this when only a few rows contain missing values.
| Age | Salary |
|---|---|
| 25 | 50000 |
| NaN | 60000 |
| 30 | 55000 |
Remove the second row.
- ✅ Simple
- ❌ Loses data
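A minimal sketch of listwise deletion on the table above, assuming the data is held in a pandas DataFrame (the variable names `df` and `cleaned` are illustrative):

```python
import numpy as np
import pandas as pd

# Same three rows as the table above; the second row has a missing Age
df = pd.DataFrame({"Age": [25, np.nan, 30], "Salary": [50000, 60000, 55000]})

# Drop every row that contains at least one missing value
cleaned = df.dropna()
print(cleaned)
```

Note that `dropna()` keeps the original index labels, so the remaining rows are labelled 0 and 2 unless you call `reset_index(drop=True)`.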
✔ B. Remove Columns
Use this when a column has too many missing values (e.g., 70% missing).
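One way this could look in pandas is sketched below. The 70% cut-off and the column names are assumptions made for illustration, not a fixed rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 31, 30, 42],
    "Notes": [np.nan, np.nan, "ok", np.nan],  # 3 of 4 values missing
})

threshold = 0.7  # hypothetical cut-off: drop columns with >= 70% missing

# Fraction of missing values per column
missing_fraction = df.isna().mean()

# Keep only the columns below the threshold
cols_to_drop = missing_fraction[missing_fraction >= threshold].index
reduced = df.drop(columns=cols_to_drop)
print(reduced.columns.tolist())
```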
2️⃣ Mean / Median / Mode Imputation
Replace missing values with statistical measures.
- Mean – Numerical data with a roughly symmetric (near-normal) distribution
- Median – Numerical data with outliers or a skewed distribution
- Mode – Categorical data
Gender
Male
Female
Male
NaN
Mode = Male → Replace NaN with Male
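The three strategies can be sketched with pandas `fillna`. The small DataFrame below is made up for illustration; the Salary column deliberately contains an outlier to show why the median is the safer choice there:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35],
    "Salary": [50000, 60000, np.nan, 1000000],  # 1000000 is an outlier
    "Gender": ["Male", "Female", "Male", np.nan],
})

filled = df.copy()

# Mean for a roughly symmetric numeric column
filled["Age"] = df["Age"].fillna(df["Age"].mean())

# Median for a numeric column with outliers
filled["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Mode (most frequent value) for a categorical column
filled["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(filled)
```

Here the mean of Salary would be pulled far upward by the outlier, while the median stays at 60000.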
3️⃣ Forward Fill / Backward Fill
Mostly used in time-series data.
- Forward Fill: Use previous value
- Backward Fill: Use next value
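A short sketch of both fills on a made-up daily series (the dates and values are illustrative):

```python
import numpy as np
import pandas as pd

readings = pd.Series(
    [20.0, np.nan, np.nan, 24.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward
forward = readings.ffill()

# Backward fill: pull the next known value backward
backward = readings.bfill()

print(forward.tolist())
print(backward.tolist())
```

Backward fill leaves the last entry missing because there is no later value to copy from. Forward fill has the same limitation for gaps at the start of a series.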
4️⃣ Interpolation
Estimate missing values based on trends (mainly for time-series).
Example: Temperature missing between 20°C and 24°C → estimate 22°C.
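The temperature example can be reproduced with linear interpolation in pandas, assuming the readings are evenly spaced:

```python
import numpy as np
import pandas as pd

# One missing reading between 20°C and 24°C
temps = pd.Series([20.0, np.nan, 24.0])

# Linear interpolation places the missing point on the straight line
# between its two neighbours
estimated = temps.interpolate(method="linear")
print(estimated.tolist())  # [20.0, 22.0, 24.0]
```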
5️⃣ Predictive Imputation (Advanced)
Use machine learning models to predict missing values.
- K-Nearest Neighbors
- Random Forest
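As a rough sketch of the K-Nearest Neighbors option, scikit-learn provides `KNNImputer`. The data and the choice of `n_neighbors=2` below are assumptions for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Salary. The second row has a missing Age.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 52000.0],
    [30.0, 55000.0],
    [45.0, 90000.0],
])

# Each missing value is replaced by the average of that feature
# over the 2 most similar rows (similarity uses the observed features)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Because the neighbors are found by distance, features on very different scales (such as Age and Salary) are usually scaled first in practice; that step is omitted here to keep the sketch short.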
6️⃣ Using Constant Value
Replace missing values with:
- 0
- "Unknown"
- -1
Useful when the fact that a value is missing carries meaning in itself.
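A small sketch of constant-value filling, with made-up column names. Adding a separate indicator column (here `Children_missing`) is an optional extra that keeps the "was missing" information available to the model:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", np.nan, "Delhi"],
    "Children": [2, np.nan, 0],
})

# Record which rows were missing before overwriting them
df["Children_missing"] = df["Children"].isna()

# Fill with a placeholder category and a sentinel number
df["City"] = df["City"].fillna("Unknown")
df["Children"] = df["Children"].fillna(-1)

print(df)
```

A sentinel such as 0 or -1 should be a value that cannot occur naturally in the column; otherwise it becomes indistinguishable from real data.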
🔹 When to Use What?
| Situation | Best Method |
|---|---|
| Few rows with missing values | Remove rows |
| Column with many missing values | Remove column |
| Normal distribution | Mean |
| Outliers present | Median |
| Categorical data | Mode |
| Time series | Forward fill / Interpolation |
| Large dataset | Predictive imputation |
📘 Simple Python Example
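import pandas as pd
from sklearn.impute import SimpleImputer

# Small example frame with one missing value in each column
data = pd.DataFrame({
    'Age': [25, None, 30],
    'Salary': [50000, 60000, None]
})

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')

# fit_transform returns a NumPy array, so restore the column names
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(data_imputed)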
🔹 Important Tip (Exam / Interview)
- Check the percentage of missing data
- Understand why the data is missing
- Choose the method based on the data type
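The first check can be done in one line with pandas. The DataFrame below is a made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 41],
    "Salary": [50000, 60000, np.nan, np.nan],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# isna() marks missing cells as True; the column mean of those
# booleans is the fraction missing, so multiply by 100 for a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```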