📥 Data Collection for Machine Learning
A practical, engineer-focused overview (with examples)
Before modeling, tuning, or deploying, you need reliable data sources.
In real-world ML systems, data typically comes from four major channels:
- Flat files (CSV, Excel, JSON)
- APIs
- Web scraping
- Databases (SQL)
Let’s break each down clearly and practically.
1️⃣ CSV, Excel, JSON — The Structured File Layer
These are the most common starting points for ML projects.
📄 CSV (Comma-Separated Values)
Simple tabular format.
Example (sales.csv):
customer_id,age,city,purchase_amount
101,25,New York,200
102,32,Chicago,150
Load with Python:
import pandas as pd
df = pd.read_csv("sales.csv")
✔ Best for:
- Clean tabular datasets
- Kaggle-style projects
- Quick experimentation
⚠ Limitations:
- No nested structure
- Large files can become memory-heavy
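When a CSV is too large to load in one go, pandas can stream it in fixed-size chunks via the `chunksize` parameter. A minimal sketch (the column names and data below are illustrative, with an in-memory buffer standing in for a large file on disk):

```python
import io
import pandas as pd

# Stand-in for a large CSV file; pd.read_csv accepts a file path the same way.
csv_data = io.StringIO(
    "customer_id,purchase_amount\n"
    "101,200\n102,150\n103,300\n104,50\n105,75\n106,120\n"
)

# Stream the data in fixed-size chunks instead of loading it all at once.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["purchase_amount"].sum()

print(total)  # 895
```

Each chunk is a regular DataFrame, so aggregates can be accumulated without ever holding the full file in memory.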
📊 Excel (.xlsx)
Common in business environments.
df = pd.read_excel("sales.xlsx")
✔ Good for:
- Corporate data
- Financial reports
- Multi-sheet datasets
⚠ Be careful:
- Hidden formatting issues
- Mixed data types
- Manual editing errors
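Mixed data types are a common result of manual editing: a numeric column ends up containing strings like "N/A". One cleanup pattern is `pd.to_numeric` with `errors="coerce"`. A sketch, using an in-memory DataFrame as a stand-in for a messy sheet (so it runs without an actual .xlsx file):

```python
import pandas as pd

# Stand-in for an Excel sheet where manual edits mixed numbers and text.
df = pd.DataFrame({"purchase_amount": ["200", "150", "N/A", "300"]})

# Coerce non-numeric cells to NaN so the column gets a proper numeric dtype.
df["purchase_amount"] = pd.to_numeric(df["purchase_amount"], errors="coerce")

print(df["purchase_amount"].sum())  # 650.0 -- the "N/A" cell became NaN
```

After coercion you can decide explicitly how to handle the NaN values (drop, impute, or flag) instead of hitting type errors mid-pipeline.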
🗂 JSON (JavaScript Object Notation)
Semi-structured format. Supports nested data.
Example:
{
  "customer_id": 101,
  "orders": [
    {"product": "Laptop", "price": 1200},
    {"product": "Mouse", "price": 25}
  ]
}
Load in Python:
df = pd.read_json("data.json")
✔ Best for:
- APIs
- Hierarchical data
- Logs
- Event tracking
JSON is more flexible than CSV, but nested records usually need to be normalized (flattened) into a table before modeling.
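Flattening the nested record shown above can be done with `pd.json_normalize`, which explodes a nested list into rows while carrying parent fields along:

```python
import pandas as pd

record = {
    "customer_id": 101,
    "orders": [
        {"product": "Laptop", "price": 1200},
        {"product": "Mouse", "price": 25},
    ],
}

# Explode the nested "orders" list into rows, repeating customer_id on each.
df = pd.json_normalize(record, record_path="orders", meta=["customer_id"])

print(df)  # two rows (one per order), columns: product, price, customer_id
```

The result is a flat table ready for feature engineering, which `pd.read_json` alone would not give you for this shape of data.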
2️⃣ APIs (Application Programming Interfaces)
APIs allow you to fetch live data programmatically.
For example:
- Weather APIs
- Social media APIs
- Financial market APIs
Using Python:
import requests
response = requests.get("https://api.example.com/data", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors (4xx/5xx)
data = response.json()
Real-world example:
You might pull stock data from a finance API instead of downloading static files.
✔ Best for:
- Real-time systems
- Dynamic data
- Automation
⚠ Consider:
- Rate limits
- Authentication (API keys)
- Data format consistency
APIs are foundational in production ML systems.
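Rate limits are typically signalled with an HTTP 429 status, and a common response is retry with exponential backoff. The sketch below takes the fetch function as a parameter so the retry logic can be demonstrated without a live endpoint; the response shape `(status, data)` and the fake fetcher are illustrative assumptions:

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0):
    """Call fetch(); on a rate-limit (429) response, wait and retry.

    `fetch` should return (status_code, data) -- e.g. a small wrapper
    around requests.get(...) against a real API.
    """
    for attempt in range(max_retries):
        status, data = fetch()
        if status != 429:          # not rate-limited: we're done
            return status, data
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return status, data            # give up after max_retries

# Fake fetcher: rate-limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"price": 101.5})])
status, data = fetch_with_backoff(lambda: next(responses), base_delay=0.01)
print(status, data)  # 200 {'price': 101.5}
```

Injecting the fetch function also makes this logic easy to unit-test, which matters once API calls sit inside a production pipeline.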
3️⃣ Web Scraping (Basics)
When data isn’t provided via API, you extract it directly from websites.
Using libraries like:
- BeautifulSoup
- Selenium
Basic example:
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
✔ Used for:
- Product price monitoring
- Job listings analysis
- News sentiment datasets
⚠ Important:
- Respect robots.txt
- Follow legal and ethical guidelines
- Avoid aggressive scraping
Web scraping is powerful, but it must be handled responsibly.
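Python's standard library includes `urllib.robotparser` for checking robots.txt rules before you fetch a page. A sketch that parses rules offline (the rules and URLs below are illustrative; in practice you would fetch the site's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; normally fetched from
# https://example.com/robots.txt and split into lines.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Calling `can_fetch` before every request is a cheap way to keep a scraper within a site's stated rules.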
4️⃣ Databases (SQL Introduction)
In real-world ML systems, most data lives in databases.
Instead of reading files, you query structured data using SQL.
What is SQL?
SQL (Structured Query Language) retrieves and manipulates relational data.
Basic query example:
SELECT age, purchase_amount
FROM customers
WHERE age > 30;
Using Python:
import pandas as pd
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()  # release the connection when done
Common databases:
- MySQL
- PostgreSQL
- SQLite
✔ Best for:
- Structured enterprise data
- Large datasets
- Relational data models
SQL is non-negotiable for ML engineers working in production environments.
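The query above can be run end-to-end against a throwaway in-memory SQLite database; the table name and columns mirror the earlier examples, and the rows are illustrative:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (age INTEGER, purchase_amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(25, 200), (32, 150), (41, 300)],
)

# Push the filter into SQL so only matching rows ever reach pandas.
df = pd.read_sql(
    "SELECT age, purchase_amount FROM customers WHERE age > 30", conn
)
conn.close()

print(df)  # two rows: ages 32 and 41
```

Filtering in the database rather than in pandas is the habit that scales: with millions of rows, `WHERE age > 30` runs on the server and only the result set crosses the wire.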
📊 Strategic Comparison

| Source       | Structure       | Best For         | Real-Time? |
|--------------|-----------------|------------------|------------|
| CSV          | Flat tabular    | Small projects   | ❌         |
| Excel        | Tabular         | Business data    | ❌         |
| JSON         | Nested          | APIs & logs      | ❌         |
| APIs         | Structured JSON | Live systems     | ✅         |
| Web scraping | Semi-structured | Unavailable data | Sometimes  |
| Databases    | Relational      | Production ML    | ✅         |