Data Engineering

NBA Data Pipeline

Python & API integration pipeline extracting and transforming NBA statistics

Python, Pandas, REST API, SQL, Git

Overview

A data engineering portfolio project that demonstrates end-to-end pipeline development: extracting live NBA statistics from public API endpoints, performing data cleaning and transformation in Pandas, and loading the results into a structured relational database for downstream analysis.

The Challenge

NBA stats APIs return nested JSON payloads with inconsistent schemas across seasons and endpoints. The pipeline needed to normalize these structures, handle missing values gracefully, and produce clean, query-ready tables without manual intervention.

The Solution

Built a modular Python pipeline with a clear extract-transform-load separation. The extraction layer handles rate limiting and transient-failure retries with exponential backoff. The transformation layer uses Pandas for normalization, type casting, and deduplication. The load layer upserts records into a structured database with a defined schema.
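The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `TransientError`, `backoff_delays`, and `fetch_with_retry` are hypothetical names, and the retry count, base delay, cap, and 10% jitter are assumed values. The `fetch` argument stands in for whatever function performs the HTTP call.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429/5xx)."""


def backoff_delays(max_retries, base_delay=1.0, cap=30.0):
    """Return the un-jittered wait before each retry: base, 2*base, 4*base, ... capped."""
    return [min(cap, base_delay * 2 ** attempt) for attempt in range(max_retries)]


def fetch_with_retry(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is spent."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_retries:
                raise  # budget exhausted: surface the last error
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delays[attempt] * (1 + random.uniform(0, 0.1)))
```

Passing `sleep` as a parameter is a design choice that keeps the function testable without real waiting; in production the default `time.sleep` applies.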

System Architecture

Extraction Layer

Pulls player, team, and game statistics from NBA API endpoints with rate limiting and retry handling

Python requests, NBA Stats API, JSON parsing
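One simple way to implement the rate limiting mentioned for this layer is to enforce a minimum interval between calls. The sketch below assumes that approach; the page does not say which strategy the project uses. `MinIntervalLimiter` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time


class MinIntervalLimiter:
    """Blocks so that consecutive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

In use, the extraction code would call `limiter.wait()` immediately before each HTTP request, so the throttle applies regardless of which endpoint is being fetched.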

↓

Transformation Layer

Cleans, normalizes, and enriches raw API payloads into structured tabular data

pandas, NumPy, data type coercion
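As a rough illustration of this layer, the sketch below flattens nested records with `pd.json_normalize`, pins a stable column set, coerces types, and deduplicates. The payload shape, field names, and the `(player_id, game_id)` deduplication key are all assumptions made for the example, not the real API schema.

```python
import pandas as pd

# Hypothetical payload: field names and nesting are illustrative, not the real API schema.
raw_games = [
    {"player": {"id": 1, "name": "A. Example"}, "game_id": "g1", "stats": {"pts": "31", "reb": 8}},
    {"player": {"id": 1, "name": "A. Example"}, "game_id": "g1", "stats": {"pts": "31", "reb": 8}},
    {"player": {"id": 2, "name": "B. Sample"}, "game_id": "g1", "stats": {"pts": None}},
]


def transform(records):
    """Flatten nested records, coerce types, and drop duplicate player-game rows."""
    df = pd.json_normalize(records, sep="_")
    # Some payloads may omit fields; reindex guarantees a stable column set.
    columns = ["player_id", "player_name", "game_id", "stats_pts", "stats_reb"]
    df = df.reindex(columns=columns)
    for col in ["stats_pts", "stats_reb"]:
        # Unparseable or missing values become NaN, then nullable Int64 keeps them as <NA>.
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
    df["player_id"] = df["player_id"].astype("int64")
    return df.drop_duplicates(subset=["player_id", "game_id"]).reset_index(drop=True)


clean = transform(raw_games)
```

The nullable `Int64` dtype is used here so that a missing rebound count stays an integer column with `<NA>` rather than silently becoming a float column.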

↓

Load Layer

Inserts cleaned records into a relational database with upsert logic to avoid duplicates

SQLAlchemy, PostgreSQL / SQLite, schema migrations
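The upsert idea can be sketched with SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` clause. To stay dependency-free, this example uses the standard-library `sqlite3` module rather than SQLAlchemy, and the `player_game_stats` table and its columns are hypothetical. It assumes a SQLite build recent enough to support upsert (3.24 or later).

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO player_game_stats (player_id, game_id, pts, reb)
VALUES (:player_id, :game_id, :pts, :reb)
ON CONFLICT (player_id, game_id) DO UPDATE SET
    pts = excluded.pts,
    reb = excluded.reb
"""


def create_schema(conn):
    # Hypothetical table: the composite primary key is what the upsert conflicts on.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS player_game_stats (
            player_id INTEGER NOT NULL,
            game_id   TEXT    NOT NULL,
            pts       INTEGER,
            reb       INTEGER,
            PRIMARY KEY (player_id, game_id)
        )
        """
    )


def upsert_rows(conn, rows):
    """Insert new rows and overwrite existing ones that share (player_id, game_id)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
upsert_rows(conn, [{"player_id": 1, "game_id": "g1", "pts": 30, "reb": 8}])
# Re-running with a corrected stat line updates the row instead of duplicating it.
upsert_rows(conn, [{"player_id": 1, "game_id": "g1", "pts": 31, "reb": 8}])
```

PostgreSQL accepts a very similar `ON CONFLICT ... DO UPDATE` statement, so the same pattern carries over when the target database changes.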

Key Features

Automated Data Extraction

Fetches player stats, team standings, and game logs from API endpoints with built-in rate limiting and retry logic.

Pandas Transformation Pipeline

Normalizes nested JSON, handles missing values, casts types, and deduplicates records before loading.

Structured Database Output

Loads clean data into a relational schema optimized for analytical queries and reporting.

Git-Based Version Control

Full commit history with meaningful messages, branch strategy, and documented best practices for reproducibility.

Results & Impact

100% Automated

Zero manual steps from API call to database record

30+ Stats per player

Per-game and season-aggregate metrics extracted and normalized

Tech Stack Deep Dive

Languages & Libraries

Python: Core pipeline language

pandas: Data cleaning and transformation

requests: HTTP client for API calls

SQLAlchemy: Database ORM and connection management

Infrastructure & Tooling

Git / GitHub: Version control and commit history

PostgreSQL: Structured data storage

Lessons Learned

Public APIs throttle aggressive clients; building rate limiting and retries in from the start beats bolting them on after the first ban.

Explicit schema definitions before loading prevent silent type errors downstream.

Meaningful commit messages are documentation; treat them as such from day one.
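The lesson about explicit schema definitions might look like this in its simplest form: declare the expected type of every column once, then cast each row against that declaration so bad values fail loudly before they reach the database. `SCHEMA` and `validate_row` are hypothetical names, and the columns are illustrative only.

```python
# Hypothetical schema: column names and types are illustrative.
SCHEMA = {"player_id": int, "game_id": str, "pts": int}


def validate_row(row, schema=SCHEMA):
    """Return a copy of row cast to the schema, raising instead of loading bad data."""
    missing = schema.keys() - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    clean = {}
    for column, expected in schema.items():
        value = row[column]
        if value is None:
            # Nulls pass through; the database constraints decide if they are allowed.
            clean[column] = None
            continue
        try:
            clean[column] = expected(value)
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"{column}={value!r} is not a valid {expected.__name__}"
            ) from exc
    return clean
```

For example, a row whose `pts` value is the string `"DNP"` raises a `ValueError` naming the offending column instead of being stored as text in an integer field.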
