📥 Data Collection for Machine Learning

A practical, engineer-focused overview (with examples)

Before modeling, tuning, or deploying, you need reliable data sources.

In real-world ML systems, data typically comes from four major channels:

  • Flat files (CSV, Excel, JSON)
  • APIs
  • Web scraping
  • Databases (SQL)

Let’s break each down clearly and practically.

1️⃣ CSV, Excel, JSON — The Structured File Layer

These are the most common starting points for ML projects.

📄 CSV (Comma-Separated Values)

Simple tabular format.

Example (sales.csv):

customer_id,age,city,purchase_amount
101,25,New York,200
102,32,Chicago,150

Load with Python:

import pandas as pd

df = pd.read_csv("sales.csv")

✔ Best for:

  • Clean tabular datasets
  • Kaggle-style projects
  • Quick experimentation

⚠ Limitation:

  • No nested structure
  • Large files can become memory-heavy
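One common way around the memory limit is chunked reading: pandas can stream a large CSV in fixed-size pieces instead of loading it all at once. A minimal sketch (using an in-memory string in place of a real file path such as "sales.csv"):

```python
import io
import pandas as pd

# Simulated CSV contents; in practice you would pass a file path instead
csv_data = io.StringIO(
    "customer_id,age,city,purchase_amount\n"
    "101,25,New York,200\n"
    "102,32,Chicago,150\n"
    "103,41,Boston,300\n"
)

# chunksize=2 yields DataFrames of at most 2 rows each,
# so only a small slice of the file is in memory at any time
totals = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    totals.append(chunk["purchase_amount"].sum())

total_purchases = sum(totals)
print(total_purchases)
```

Aggregating per chunk and combining the partial results is the usual pattern when the full dataset will not fit in RAM.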

📊 Excel (.xlsx)

Common in business environments.

df = pd.read_excel("sales.xlsx")

✔ Good for:

  • Corporate data
  • Financial reports
  • Multi-sheet datasets

⚠ Be careful:

  • Hidden formatting issues
  • Mixed data types
  • Manual editing errors
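Multi-sheet workbooks are worth a closer look: passing sheet_name=None to read_excel returns every sheet as a dict of DataFrames. A sketch using an in-memory workbook in place of a real file (assumes the openpyxl engine, pandas' default for .xlsx, is installed):

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory; this stands in for a real .xlsx file
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:  # requires openpyxl for .xlsx output
    pd.DataFrame({"q": [1, 2]}).to_excel(writer, sheet_name="Q1", index=False)
    pd.DataFrame({"q": [3, 4]}).to_excel(writer, sheet_name="Q2", index=False)
buf.seek(0)

# sheet_name=None loads every sheet into a dict keyed by sheet name
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets.keys()))
```

This lets you validate each sheet's schema separately before concatenating, which catches the mixed-type and formatting issues listed above early.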

🗂 JSON (JavaScript Object Notation)

Semi-structured format. Supports nested data.

Example:

{
  "customer_id": 101,
  "orders": [
    {"product": "Laptop", "price": 1200},
    {"product": "Mouse", "price": 25}
  ]
}

Load in Python:

df = pd.read_json("data.json")

✔ Best for:

  • APIs
  • Hierarchical data
  • Logs
  • Event tracking

JSON is more flexible than CSV — but requires normalization before modeling.
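That normalization step can be sketched with pd.json_normalize, which flattens a nested record into one row per inner element, using the customer/orders example above:

```python
import pandas as pd

record = {
    "customer_id": 101,
    "orders": [
        {"product": "Laptop", "price": 1200},
        {"product": "Mouse", "price": 25},
    ],
}

# record_path picks the nested list to explode into rows;
# meta carries top-level fields (customer_id) onto each row
df = pd.json_normalize(record, record_path="orders", meta="customer_id")
print(df)
```

The result is a flat table (one row per order) that can feed directly into feature engineering or modeling.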

2️⃣ APIs (Application Programming Interfaces)

APIs allow you to fetch live data programmatically.

For example:

  • Weather APIs
  • Social media APIs
  • Financial market APIs

Using Python:

import requests

response = requests.get("https://api.example.com/data", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors (4xx/5xx)
data = response.json()

Real-world example:
You might pull stock data from a finance API instead of downloading static files.

✔ Best for:

  • Real-time systems
  • Dynamic data
  • Automation

⚠ Consider:

  • Rate limits
  • Authentication (API keys)
  • Data format consistency

APIs are foundational in production ML systems.
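The authentication and rate-limit concerns above usually show up as headers and query parameters on the request. A sketch of building such a request with the requests library (the URL, parameter names, and key format here are placeholders, not a real API):

```python
import requests

session = requests.Session()

# Hypothetical endpoint: an API key in the Authorization header,
# plus query parameters encoded into the URL
req = requests.Request(
    "GET",
    "https://api.example.com/data",
    params={"symbol": "AAPL", "interval": "1d"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
prepared = session.prepare_request(req)
print(prepared.url)  # parameters are URL-encoded for you

# A real call would then send it and validate the response:
# response = session.send(prepared, timeout=10)
# response.raise_for_status()
# data = response.json()
```

Using a Session also reuses the underlying connection, which matters when you poll an API repeatedly under a rate limit.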

3️⃣ Web Scraping (Basics)

When data isn’t provided via API, you extract it directly from websites.

Using libraries like:

  • BeautifulSoup
  • Selenium

Basic example:

from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

titles = soup.find_all("h2")

✔ Used for:

  • Product price monitoring
  • Job listings analysis
  • News sentiment datasets

⚠ Important:

  • Respect robots.txt
  • Follow legal and ethical guidelines
  • Avoid aggressive scraping

Web scraping is powerful — but must be handled responsibly.
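After find_all returns the tags, you still need to extract the visible text. A self-contained sketch, parsing a small HTML string that stands in for response.text from a real page:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a real request would return
html = """
<html><body>
  <h2>First headline</h2>
  <h2>Second headline</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text(strip=True) pulls the text content out of each tag
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```

Parsing from a saved string like this is also a good habit: you fetch a page once, then iterate on the parsing logic offline instead of re-hitting the site.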

4️⃣ Databases (SQL Introduction)

In real-world ML systems, most data lives in databases.

Instead of reading files, you query structured data using SQL.

What is SQL?

SQL (Structured Query Language) retrieves and manipulates relational data.

Basic query example:

SELECT age, purchase_amount
FROM customers
WHERE age > 30;

Using Python:

import pandas as pd
import sqlite3

conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

Common databases:

  • MySQL
  • PostgreSQL
  • SQLite

✔ Best for:

  • Structured enterprise data
  • Large datasets
  • Relational data models

SQL is non-negotiable for ML engineers working in production environments.
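When a query depends on user input, bind values as parameters instead of formatting them into the SQL string. A runnable sketch using an in-memory SQLite database that stands in for a real customers table:

```python
import sqlite3
import pandas as pd

# In-memory database with a minimal customers table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, purchase_amount INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(25, 200), (32, 150), (41, 300)],
)

# The ? placeholder binds the value safely (no SQL injection risk)
df = pd.read_sql("SELECT * FROM customers WHERE age > ?", conn, params=(30,))
conn.close()
print(len(df))
```

The same pattern works against MySQL or PostgreSQL via their drivers or SQLAlchemy; only the connection object changes.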

📊 Strategic Comparison

Source        Structure        Best For          Real-Time?
CSV           Flat tabular     Small projects    ❌
Excel         Tabular          Business data     ❌
JSON          Nested           APIs & logs       ❌
APIs          Structured JSON  Live systems      ✅
Web Scraping  Semi-structured  Unavailable data  Sometimes
Databases     Relational       Production ML     ✅