Data Preprocessing in Machine Learning
Cleaning & Transforming Raw Data for Models
What is Data Preprocessing?
Data preprocessing is the process of cleaning and transforming raw data into a format suitable for machine learning models.
Good preprocessing often improves model accuracy more than changing the algorithm itself.
Why Data Preprocessing is Important
- Improves model accuracy
- Reduces training time
- Helps reduce overfitting (for example, by removing noisy or irrelevant features)
- Makes data understandable to algorithms
Main Steps in Data Preprocessing
- Handling missing values
- Handling categorical data
- Feature scaling
- Outlier removal
- Feature selection
- Train-test split
1. Handling Missing Values
Real-world datasets often contain missing values.
Example:
Name Age Salary
John 25 50000
Anna NaN 60000
Mike 30 NaN
Common Methods:
- Remove rows or columns
- Replace with mean, median, or mode
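Both methods can be tried on the example table above. The following is a minimal sketch using pandas; the DataFrame simply reproduces the three rows shown, and filling with the column mean is one choice among the options listed (median or mode work the same way).

```python
import numpy as np
import pandas as pd

# The example table from above, with NaN marking the missing entries.
df = pd.DataFrame({
    "Name": ["John", "Anna", "Mike"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
})

# Method 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Method 2: fill each numeric column with its own mean.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean())

print(dropped)
print(filled)
```

Dropping rows is simple but discards data (here only John's row survives), while filling keeps every row at the cost of introducing estimated values.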
2. Handling Categorical Data
Most machine learning models require numerical input, so text categories must be converted to numbers first.
Example:
Gender
Male
Female
Encoding Techniques:
- Label Encoding (Male → 0, Female → 1)
- One-Hot Encoding (Male → [1,0], Female → [0,1])
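The two encodings can be sketched with pandas as follows. The four-row sample column and the explicit `label_map` dictionary are assumptions made for illustration; the column order of the one-hot result is rearranged so that it matches the vectors shown above.

```python
import pandas as pd

# Hypothetical sample column for illustration.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Label encoding: map each category to an integer.
label_map = {"Male": 0, "Female": 1}
df["Gender_label"] = df["Gender"].map(label_map)

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["Gender"], dtype=int)
one_hot = one_hot[["Male", "Female"]]  # order columns as in the text

print(df)
print(one_hot)
```

Label encoding implies an ordering between categories (0 < 1), which some models may treat as meaningful; one-hot encoding avoids this at the cost of extra columns.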
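3. Feature Scaling
Algorithms that rely on distances between data points, such as KNN and SVM, are sensitive to feature scale: a feature with a large range (Salary) can dominate one with a small range (Age).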
Age Salary
25 50000
Scaling Methods:
- Normalization (Min-Max Scaling)
- Standardization (Z-score)
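Min-max scaling computes x' = (x - min) / (max - min) for each column, and standardization computes z = (x - mean) / std. The sketch below applies both with scikit-learn to a small Age/Salary array; the three rows are illustrative values, not data from a real dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: Age, Salary (illustrative values).
X = np.array([[25, 50000],
              [30, 60000],
              [35, 70000]], dtype=float)

# Normalization: rescale each column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescale each column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

After either transformation Age and Salary are on comparable scales, so neither dominates a distance calculation.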
4. Removing Outliers
Outliers are extreme values that can negatively impact model performance.
Example: Most salaries are between 40k–60k, but one value is 5,000,000.
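One common way to detect such a value is the interquartile range (IQR) rule, which flags anything more than 1.5 × IQR below the first quartile or above the third. The text does not prescribe a particular method, so the rule and the salary values below are assumptions chosen to mirror the example.

```python
import pandas as pd

# Hypothetical salaries: most fall between 40k and 60k, one is extreme.
salaries = pd.Series([42000, 45000, 48000, 50000, 52000, 55000, 58000, 5000000])

# Compute the quartiles and the interquartile range.
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
cleaned = salaries[salaries.between(lower, upper)]

print(cleaned.tolist())
```

Whether an extreme value should be removed depends on the problem: it may be a data-entry error, or it may be a genuine observation the model needs to handle.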
5. Feature Selection
Keeping only the features that are relevant to the target can improve performance and reduce overfitting. For example, when predicting house prices:
- House size ✓
- Location ✓
- Owner's favorite color ✗
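A selection like this can be automated by scoring each feature against the target. The sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score on synthetic house data; the column names, the generated values, and the choice of scoring function are all assumptions for illustration, since the text does not name a specific method.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: price depends on size and location, not on color.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "size_m2": rng.uniform(50, 200, n),
    "location_score": rng.integers(1, 6, n),
    "favorite_color": rng.integers(0, 5, n),  # unrelated to price
})
price = (3000 * X["size_m2"]
         + 50000 * X["location_score"]
         + rng.normal(0, 10000, n))

# Keep the two features with the strongest linear relationship to price.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, price)
selected = list(X.columns[selector.get_support()])

print(selected)
```

Univariate scores like this only capture each feature's individual linear relationship with the target, so they are a starting point rather than a complete selection strategy.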
6. Splitting the Dataset
- Training set (70–80%)
- Testing set (20–30%)
Evaluating the model on data it did not see during training gives a fair estimate of how it will perform on new data.
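A split along these lines can be sketched with scikit-learn's `train_test_split`. The feature array and labels below are placeholder values; `random_state` is set only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Passing features and labels together keeps each row of `X` aligned with its label in both subsets.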
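Simple Python Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# Split first so the scaler never sees the test rows (avoids data leakage).
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test data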
FAQs on Data Preprocessing
Q1. Why is data preprocessing required?
Because raw data is often noisy, inconsistent, or incomplete, and most ML models cannot use it directly.
Q2. Is preprocessing more important than model selection?
Often yes. Clean data can outperform complex models on bad data.
Q3. Should scaling be done before train-test split?
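No. Split first, then fit the scaler on the training set only and apply the same fitted scaler to the test set. Fitting on the full dataset lets information from the test set influence training, which is data leakage.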
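One way to make this ordering hard to get wrong is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is fitted only on whatever data is passed to `fit`. The sketch below assumes a synthetic dataset from `make_classification` and a KNN classifier; both are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_train only, then trains KNN on the scaled data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# At prediction time the test data is scaled with the training statistics.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same pipeline object can be passed to cross-validation utilities, where the scaler is refitted on each training fold.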
Key Points to Remember
- Preprocessing improves accuracy
- Always avoid data leakage
- Scale numerical features properly
- Clean data beats complex models