🔹 Outlier Detection in Machine Learning

Identifying Abnormal Data Points

An outlier is a data point that is very different from other observations. It can occur due to errors, rare events, or natural variation.

🔎 Example (Salary in $)


40,000
45,000
50,000
48,000
5,000,000  ❗ Outlier

That single extreme value pulls the mean salary from about $45,750 to over $1,000,000, so it can distort any model fitted to the data.

🔹 Why Outlier Detection is Important

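Keeps extreme values from distorting summary statistics and model parameters
Can improve accuracy for models that are sensitive to extreme values (e.g. linear regression)
Reduces the risk of the model fitting noise or data errors
Central to fraud & anomaly detection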

🔹 Methods of Outlier Detection

1️⃣ Z-Score Method (Standard Deviation)

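Measures how many standard deviations a point lies from the mean.

Z = (X − μ) / σ

where μ is the mean and σ is the standard deviation of the data.

If |Z| > 3 → Outlier

Works best when the data is roughly normally distributed and the sample is reasonably large. In small samples an extreme value inflates both μ and σ, which can hide the outlier.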
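As a minimal sketch, the rule above can be applied to the salary example with NumPy. The variable names are illustrative, and the population standard deviation (NumPy's default) is assumed. Note that on these five salaries nothing is flagged: the 5,000,000 value inflates the mean and σ so much that its own |Z| is only about 2.

```python
import numpy as np

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

# Z-score of every point, using the population standard deviation (NumPy's default)
z_scores = (data - data.mean()) / data.std()

# Flag points whose absolute Z-score exceeds the threshold of 3
outliers = data[np.abs(z_scores) > 3]

print("Z-scores:", np.round(z_scores, 2))
print("Outliers:", outliers)
```

With only n points, |Z| computed this way can never exceed √(n − 1), so a threshold of 3 needs at least 11 observations before it can flag anything.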

2️⃣ IQR Method (Interquartile Range)

One of the most widely used methods. It is based on quartiles, which are barely affected by extreme values.

Steps:

Find Q1 (25th percentile)
Find Q3 (75th percentile)
IQR = Q3 − Q1

Outlier condition:


X < Q1 − 1.5 × IQR

OR

X > Q3 + 1.5 × IQR

Makes no normality assumption, so it is usually a better choice than the Z-score for skewed data.

3️⃣ Box Plot Method

Visual method based on IQR.

The whiskers usually extend to the most extreme points within 1.5 × IQR of the box. Outliers appear as individual points outside the whiskers.
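A box plot of the salary example could be drawn with Matplotlib roughly as follows. This is a sketch that assumes Matplotlib is installed; the output file name salary_boxplot.png is a placeholder, and the non-interactive Agg backend is chosen only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

fig, ax = plt.subplots()
result = ax.boxplot(data)  # default whiskers: 1.5 x IQR
ax.set_ylabel("Salary ($)")

# Points drawn beyond the whiskers ("fliers") are the outliers
flier_values = result["fliers"][0].get_ydata()
print("Points outside the whiskers:", flier_values)

fig.savefig("salary_boxplot.png")  # placeholder file name
```

Because one value is so much larger than the rest, the box itself is squashed near the bottom of the plot; a logarithmic y-axis can make it easier to read.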

4️⃣ Using Machine Learning Algorithms

Advanced techniques for large or high-dimensional data:

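Isolation Forest (isolates points with random splits; outliers need fewer splits)
Local Outlier Factor (LOF) (compares a point's local density with that of its neighbors)
One-Class SVM (learns a boundary around the normal data)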
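As one illustration of these techniques, here is a minimal sketch using scikit-learn's LocalOutlierFactor on the salary example. It assumes scikit-learn is installed, and n_neighbors=2 is an illustrative choice for this tiny sample rather than a recommended setting.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])
X = data.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

# n_neighbors=2 is an assumption chosen for this very small dataset
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

print("Labels:", labels)
print("Outliers:", data[labels == -1])
```

On real data these models are normally fitted on many rows and several features, and their parameters (number of neighbors, expected share of outliers) need tuning.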

📘 Simple Python Example (IQR Method)


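import numpy as np

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

# Quartiles and interquartile range
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Lower and upper limits for normal values
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only the values that fall outside the limits
outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers)  # Outliers: [5000000]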

🔹 What to Do After Detecting Outliers?

✔ Remove them (if error)
✔ Cap them (Winsorization)
✔ Transform data (log transformation)
✔ Keep them (if meaningful, e.g. fraud)
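The first three options can be sketched with NumPy as shown below. The variable names are illustrative. Capping at the IQR limits is only one way to cap; classic winsorization caps at chosen percentiles instead.

```python
import numpy as np

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

# IQR limits, as in the IQR method
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: remove points outside the limits
removed = data[(data >= lower) & (data <= upper)]

# Option 2: cap points at the limits (here 5,000,000 becomes 57,500)
capped = np.clip(data, lower, upper)

# Option 3: log transform, which compresses very large values
logged = np.log10(data)

print("Removed:", removed)
print("Capped: ", capped)
print("Log10:  ", np.round(logged, 2))
```

Which option fits depends on why the outlier exists: a data-entry error is a candidate for removal, while a genuine but extreme value is usually better capped, transformed, or kept.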

🔹 Real-Life Examples

Fraud detection (unusual transactions)
Network intrusion detection
Medical abnormal readings
Stock market crash data
