🔹 Outlier Detection in Machine Learning

Identifying Abnormal Data Points

An outlier is a data point that differs markedly from the rest of the observations. Outliers can arise from measurement or data-entry errors, rare events, or natural variation.

🔎 Example (Salary in $)
40,000
45,000
50,000
48,000
5,000,000  ❗ Outlier

That single extreme value pulls the mean far above every typical salary and can distort a model fitted to the data.
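To put numbers on that distortion, here is a minimal sketch using the salary list above. Comparing the mean with the median is an illustrative choice made here, not part of the original example.

```python
import numpy as np

# Salaries from the example above; the last value is the outlier.
salaries = np.array([40_000, 45_000, 50_000, 48_000, 5_000_000])

mean_all = salaries.mean()            # pulled up by the extreme value
median_all = np.median(salaries)      # barely affected by it
mean_without = salaries[:-1].mean()   # mean of the four typical salaries

print(f"Mean with outlier:    {mean_all:,.0f}")
print(f"Median with outlier:  {median_all:,.0f}")
print(f"Mean without outlier: {mean_without:,.0f}")
```

The mean moves by more than a factor of twenty, while the median stays among the typical salaries.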

🔹 Why Outlier Detection is Important

  • Prevents model distortion
  • Often improves accuracy on typical data
  • Can reduce overfitting to noisy or erroneous points
  • Important for fraud & anomaly detection

🔹 Methods of Outlier Detection

1️⃣ Z-Score Method (Standard Deviation)

Measures how many standard deviations a point lies from the mean (μ = mean, σ = standard deviation).

Z = (X − μ) / σ

If |Z| > 3 → Outlier

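Works best when data is roughly normally distributed and the sample is reasonably large. The mean and standard deviation are themselves pulled by outliers, so in a small sample even an extreme value may never reach |Z| > 3.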
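A minimal Z-score sketch, assuming NumPy and the population standard deviation (NumPy's default, ddof=0). The helper name `zscore_outliers` is made up for illustration. Applied to the five salaries from the earlier example, it flags nothing, because the 5,000,000 value inflates both the mean and the standard deviation.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (z_scores, mask); mask marks points with |Z| > threshold.

    Assumes the values are not all identical, so sigma > 0.
    """
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()          # population standard deviation (ddof=0)
    z = (values - mu) / sigma
    return z, np.abs(z) > threshold

salaries = [40_000, 45_000, 50_000, 48_000, 5_000_000]
z, mask = zscore_outliers(salaries)

print(np.round(z, 2))   # Z-score of each salary
print(mask.any())       # False: nothing exceeds |Z| > 3 in this tiny sample
```

On a sample this small the rule cannot fire, which is one reason the IQR rule is often preferred for short or heavily contaminated data.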

2️⃣ IQR Method (Interquartile Range)

One of the most commonly used methods.

Steps:
  • Find Q1 (25th percentile)
  • Find Q3 (75th percentile)
  • IQR = Q3 − Q1

Outlier condition:
X < Q1 − 1.5 × IQR
OR
X > Q3 + 1.5 × IQR

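Does not assume a normal distribution, and the quartiles are barely affected by extreme values, so it copes with skewed data better than the Z-score.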

3️⃣ Box Plot Method

Visual method based on IQR.

Outliers appear as individual points beyond the whiskers, which conventionally reach to the most extreme values within 1.5 × IQR of the box.
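The numbers behind such a plot can be sketched with NumPy alone. This assumes the common 1.5 × IQR whisker convention; the variable names are illustrative.

```python
import numpy as np

data = np.array([40_000, 45_000, 50_000, 48_000, 5_000_000])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are the ones drawn individually.
fliers = data[(data < lower_fence) | (data > upper_fence)]

print("Box:", q1, "to", q3)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Points beyond whiskers:", fliers)
```

If Matplotlib is installed, `plt.boxplot(data)` should draw the same picture, since its default whisker setting (`whis=1.5`) follows the same rule.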

4️⃣ Using Machine Learning Algorithms

Advanced techniques for large or high-dimensional data:

  • Isolation Forest
  • Local Outlier Factor (LOF)
  • One-Class SVM
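As a rough sketch of the first option, assuming scikit-learn is installed: `IsolationForest` labels each row 1 (inlier) or -1 (outlier). The `contamination=0.2` setting is an assumption made for this toy data (one suspected outlier among five points). On real data it has to be estimated or left at its default.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# scikit-learn expects a 2-D array: one row per sample, one column per feature.
salaries = np.array([40_000, 45_000, 50_000, 48_000, 5_000_000]).reshape(-1, 1)

# contamination=0.2 is an assumed outlier share for this toy example.
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(salaries)   # 1 = inlier, -1 = outlier

flagged = salaries[labels == -1].ravel()
print("Labels:", labels)
print("Flagged:", flagged)
```

`LocalOutlierFactor` and `OneClassSVM` follow a similar fit-and-predict pattern in scikit-learn, so they can be swapped in with few changes.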

📘 Simple Python Example (IQR Method)

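import numpy as np

# Salaries from the example above
data = np.array([40000, 45000, 50000, 48000, 5000000])

# Quartiles and interquartile range
Q1 = np.percentile(data, 25)   # 45000.0
Q3 = np.percentile(data, 75)   # 50000.0
IQR = Q3 - Q1                  # 5000.0

# Fences: anything outside [lower, upper] counts as an outlier
lower = Q1 - 1.5 * IQR         # 37500.0
upper = Q3 + 1.5 * IQR         # 57500.0

outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers)   # Outliers: [5000000]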

🔹 What to Do After Detecting Outliers?

  • ✔ Remove them (if they are errors)
  • ✔ Cap them (Winsorization)
  • ✔ Transform the data (e.g. log transformation)
  • ✔ Keep them (if meaningful, e.g. fraud)
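The first three options can be sketched in a few lines of NumPy. Capping at the IQR fences is an assumption made here for simplicity; winsorization is often done at fixed percentiles instead.

```python
import numpy as np

data = np.array([40_000, 45_000, 50_000, 48_000, 5_000_000], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = data[(data >= lower) & (data <= upper)]   # drop the outlier
capped = np.clip(data, lower, upper)                # pull it back to the fence
logged = np.log10(data)                             # compress the scale

print("Removed:", removed)
print("Capped: ", capped)
print("Log10:  ", np.round(logged, 2))
```

Which option fits depends on why the value is extreme: removal suits clear errors, while capping or a log transform keeps the row but limits its influence.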

🔹 Real-Life Examples

  • Fraud detection (unusual transactions)
  • Network intrusion detection
  • Abnormal medical readings
  • Stock market crashes