📥 Data Collection for Machine Learning

A practical, engineer-focused overview (with examples)

Before modeling, tuning, or deploying, you need reliable data sources.

In real-world ML systems, data typically comes from four major channels:

  • Flat files (CSV, Excel, JSON)
  • APIs
  • Web scraping
  • Databases (SQL)

Let’s break each down clearly and practically.

1️⃣ CSV, Excel, JSON — The Structured File Layer

These are the most common starting points for ML projects.

📄 CSV (Comma-Separated Values)

Simple tabular format.

Example (sales.csv):

customer_id,age,city,purchase_amount
101,25,New York,200
102,32,Chicago,150

Load with Python:

import pandas as pd

df = pd.read_csv("sales.csv")

✔ Best for:

  • Clean tabular datasets
  • Kaggle-style projects
  • Quick experimentation

⚠ Limitation:

  • No nested structure
  • Large files can become memory-heavy
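One common way around the memory limit is chunked reading: pandas can stream a large CSV in fixed-size pieces instead of loading it all at once. A minimal sketch (using an in-memory string in place of a real file path such as "sales.csv"):

```python
import io
import pandas as pd

# Simulated CSV contents; in practice you would pass a file path instead
csv_data = io.StringIO(
    "customer_id,age,city,purchase_amount\n"
    "101,25,New York,200\n"
    "102,32,Chicago,150\n"
    "103,41,Boston,300\n"
)

# chunksize=2 yields DataFrames of at most 2 rows each,
# so only a small slice of the file is in memory at any time
totals = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    totals.append(chunk["purchase_amount"].sum())

total_purchases = sum(totals)
print(total_purchases)
```

Aggregating per chunk and combining the partial results is the usual pattern when the full dataset will not fit in RAM.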

📊 Excel (.xlsx)

Common in business environments.

df = pd.read_excel("sales.xlsx")

✔ Good for:

  • Corporate data
  • Financial reports
  • Multi-sheet datasets

⚠ Be careful:

  • Hidden formatting issues
  • Mixed data types
  • Manual editing errors
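Multi-sheet workbooks are worth a closer look: passing sheet_name=None to read_excel returns every sheet as a dict of DataFrames. A sketch using an in-memory workbook in place of a real file (assumes the openpyxl engine, pandas' default for .xlsx, is installed):

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory; this stands in for a real .xlsx file
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:  # requires openpyxl for .xlsx output
    pd.DataFrame({"q": [1, 2]}).to_excel(writer, sheet_name="Q1", index=False)
    pd.DataFrame({"q": [3, 4]}).to_excel(writer, sheet_name="Q2", index=False)
buf.seek(0)

# sheet_name=None loads every sheet into a dict keyed by sheet name
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets.keys()))
```

This lets you validate each sheet's schema separately before concatenating, which catches the mixed-type and formatting issues listed above early.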

🗂 JSON (JavaScript Object Notation)

Semi-structured format. Supports nested data.

Example:

{
  "customer_id": 101,
  "orders": [
    {"product": "Laptop", "price": 1200},
    {"product": "Mouse", "price": 25}
  ]
}

Load in Python:

df = pd.read_json("data.json")

✔ Best for:

  • APIs
  • Hierarchical data
  • Logs
  • Event tracking

JSON is more flexible than CSV — but requires normalization before modeling.
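That normalization step can be sketched with pd.json_normalize, which flattens a nested record into one row per inner element, using the customer/orders example above:

```python
import pandas as pd

record = {
    "customer_id": 101,
    "orders": [
        {"product": "Laptop", "price": 1200},
        {"product": "Mouse", "price": 25},
    ],
}

# record_path picks the nested list to explode into rows;
# meta carries top-level fields (customer_id) onto each row
df = pd.json_normalize(record, record_path="orders", meta="customer_id")
print(df)
```

The result is a flat table (one row per order) that can feed directly into feature engineering or modeling.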

2️⃣ APIs (Application Programming Interfaces)

APIs allow you to fetch live data programmatically.

For example:

  • Weather APIs
  • Social media APIs
  • Financial market APIs

Using Python:

import requests

response = requests.get("https://api.example.com/data", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors (4xx/5xx)
data = response.json()

Real-world example:
You might pull stock data from a finance API instead of downloading static files.

✔ Best for:

  • Real-time systems
  • Dynamic data
  • Automation

⚠ Consider:

  • Rate limits
  • Authentication (API keys)
  • Data format consistency

APIs are foundational in production ML systems.
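The authentication and rate-limit concerns above usually show up as headers and query parameters on the request. A sketch of building such a request with the requests library (the URL, parameter names, and key format here are placeholders, not a real API):

```python
import requests

session = requests.Session()

# Hypothetical endpoint: an API key in the Authorization header,
# plus query parameters encoded into the URL
req = requests.Request(
    "GET",
    "https://api.example.com/data",
    params={"symbol": "AAPL", "interval": "1d"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
prepared = session.prepare_request(req)
print(prepared.url)  # parameters are URL-encoded for you

# A real call would then send it and validate the response:
# response = session.send(prepared, timeout=10)
# response.raise_for_status()
# data = response.json()
```

Using a Session also reuses the underlying connection, which matters when you poll an API repeatedly under a rate limit.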

3️⃣ Web Scraping (Basics)

When data isn’t provided via API, you extract it directly from websites.

Using libraries like:

  • BeautifulSoup
  • Selenium

Basic example:

from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

titles = soup.find_all("h2")

✔ Used for:

  • Product price monitoring
  • Job listings analysis
  • News sentiment datasets

⚠ Important:

  • Respect robots.txt
  • Follow legal and ethical guidelines
  • Avoid aggressive scraping

Web scraping is powerful — but must be handled responsibly.
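After find_all returns the tags, you still need to extract the visible text. A self-contained sketch, parsing a small HTML string that stands in for response.text from a real page:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a real request would return
html = """
<html><body>
  <h2>First headline</h2>
  <h2>Second headline</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text(strip=True) pulls the text content out of each tag
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```

Parsing from a saved string like this is also a good habit: you fetch a page once, then iterate on the parsing logic offline instead of re-hitting the site.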

4️⃣ Databases (SQL Introduction)

In real-world ML systems, most data lives in databases.

Instead of reading files, you query structured data using SQL.

What is SQL?

SQL (Structured Query Language) retrieves and manipulates relational data.

Basic query example:

SELECT age, purchase_amount
FROM customers
WHERE age > 30;

Using Python:

import pandas as pd
import sqlite3

conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

Common databases:

  • MySQL
  • PostgreSQL
  • SQLite

✔ Best for:

  • Structured enterprise data
  • Large datasets
  • Relational data models

SQL is non-negotiable for ML engineers working in production environments.
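When a query depends on user input, bind values as parameters instead of formatting them into the SQL string. A runnable sketch using an in-memory SQLite database that stands in for a real customers table:

```python
import sqlite3
import pandas as pd

# In-memory database with a minimal customers table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, purchase_amount INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(25, 200), (32, 150), (41, 300)],
)

# The ? placeholder binds the value safely (no SQL injection risk)
df = pd.read_sql("SELECT * FROM customers WHERE age > ?", conn, params=(30,))
conn.close()
print(len(df))
```

The same pattern works against MySQL or PostgreSQL via their drivers or SQLAlchemy; only the connection object changes.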

📊 Strategic Comparison

Source        Structure        Best For          Real-Time?
CSV           Flat tabular     Small projects    ❌
Excel         Tabular          Business data     ❌
JSON          Nested           APIs & logs       ❌
APIs          Structured JSON  Live systems      ✅
Web Scraping  Semi-structured  Unavailable data  Sometimes
Databases     Relational       Production ML     ✅