NBA Data Pipeline
Python pipeline that extracts NBA statistics from API endpoints, transforms them, and loads them into a database
Overview
A data engineering portfolio project that demonstrates end-to-end pipeline development: extracting live NBA statistics from public API endpoints, performing data cleaning and transformation in Pandas, and loading the results into a structured relational database for downstream analysis.
The Challenge
NBA stats APIs return nested JSON payloads with inconsistent schemas across seasons and endpoints. The pipeline needed to normalize these structures, handle missing values gracefully, and produce clean, query-ready tables without manual intervention.
The Solution
Built a modular Python pipeline with a clear extract-transform-load separation. The extraction layer handles API pagination and rate limiting. The transformation layer uses Pandas for normalization, type casting, and deduplication. The load layer upserts records into a structured database with a defined schema.
System Architecture
Extraction Layer
Pulls player, team, and game statistics from NBA API endpoints with pagination and error handling
Transformation Layer
Cleans, normalizes, and enriches raw API payloads into structured tabular data
Load Layer
Upserts cleaned records into a relational database so that reruns update existing rows instead of creating duplicates
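The upsert step can be sketched as follows. This is a minimal illustration, not the project's actual load code: the source does not name the database, so SQLite (via Python's built-in sqlite3 module, SQLite 3.24 or newer for ON CONFLICT) stands in for it, and the table name player_game_stats and its columns are hypothetical.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
DDL = """
CREATE TABLE IF NOT EXISTS player_game_stats (
    player_id INTEGER NOT NULL,
    game_id   TEXT    NOT NULL,
    points    INTEGER,
    rebounds  INTEGER,
    PRIMARY KEY (player_id, game_id)
)
"""

# On a key collision, overwrite the stat columns with the incoming values.
UPSERT = """
INSERT INTO player_game_stats (player_id, game_id, points, rebounds)
VALUES (:player_id, :game_id, :points, :rebounds)
ON CONFLICT (player_id, game_id) DO UPDATE SET
    points   = excluded.points,
    rebounds = excluded.rebounds
"""


def load(conn, rows):
    """Insert new rows and update rows whose primary key already exists."""
    conn.execute(DDL)
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
load(conn, [{"player_id": 1, "game_id": "G1", "points": 20, "rebounds": 5}])
# Loading the same key again with a corrected stat line updates the row in place.
load(conn, [{"player_id": 1, "game_id": "G1", "points": 22, "rebounds": 5}])

rows_after = conn.execute(
    "SELECT player_id, game_id, points, rebounds FROM player_game_stats"
).fetchall()
```

Keying the table on (player_id, game_id) is what makes the load idempotent: running the pipeline twice over the same games leaves one row per player per game.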
Key Features
Automated Data Extraction
Fetches player stats, team standings, and game logs from API endpoints with built-in rate limiting and retry logic.
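One way rate limiting and retry logic can fit together is shown below. It is a sketch under assumptions: RateLimiter and fetch_with_retry are hypothetical names, the fetch argument stands for whatever function performs the real HTTP call, and treating ConnectionError as the retryable failure is an illustrative choice rather than the project's documented behavior.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_with_retry(fetch, limiter, max_attempts=4, base_delay=0.5,
                     sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        limiter.wait()  # every attempt, including retries, respects the limit
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing the sleep function in as a parameter keeps the helper testable without real waiting; in production the default time.sleep applies.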
Pandas Transformation Pipeline
Normalizes nested JSON, handles missing values, casts types, and deduplicates records before loading.
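The four transformation steps might look like this in Pandas. The payload below is invented to mimic a nested API response, and the field and column names are assumptions; only pd.json_normalize, pd.to_numeric, astype, and drop_duplicates are real Pandas calls.

```python
import pandas as pd

# Made-up records: one exact duplicate, one row with missing stats,
# and points delivered as strings, as loosely typed APIs sometimes do.
raw = [
    {"player": {"id": 1, "name": "A. Guard"}, "game": {"id": "G1"},
     "stats": {"pts": "20", "reb": 5}},
    {"player": {"id": 1, "name": "A. Guard"}, "game": {"id": "G1"},
     "stats": {"pts": "20", "reb": 5}},
    {"player": {"id": 2, "name": "B. Center"}, "game": {"id": "G1"},
     "stats": {"pts": None}},
]


def transform(records):
    # 1. Flatten nested objects into columns such as player_id and stats_pts.
    df = pd.json_normalize(records, sep="_")
    df = df.rename(columns={"stats_pts": "points", "stats_reb": "rebounds"})

    # 2. Missing or unparseable numbers become NaN instead of raising.
    for col in ("points", "rebounds"):
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # 3. Cast to explicit dtypes; nullable Int64 keeps gaps as <NA>.
    df = df.astype({"player_id": "int64", "points": "Int64", "rebounds": "Int64"})

    # 4. Keep one row per player per game.
    df = df.drop_duplicates(subset=["player_id", "game_id"], keep="last")
    return df.reset_index(drop=True)


clean = transform(raw)
```

Missing stats are kept as nulls here rather than filled with zero, so a player with no recorded rebounds is not confused with one who recorded none.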
Structured Database Output
Loads clean data into a relational schema optimized for analytical queries and reporting.
Git-Based Version Control
Full commit history with meaningful messages, a consistent branch strategy, and documented conventions so the work can be reproduced.
Results & Impact
Zero manual steps from API call to database record
Per-game and season-aggregate metrics extracted and normalized
Tech Stack Deep Dive
Languages & Libraries
Infrastructure & Tooling
Lessons Learned
API pagination patterns vary significantly; building a generic iterator early saves rework.
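A generic iterator of this kind could be as small as the sketch below. The name paginate and its three callables are hypothetical; the idea is that each endpoint supplies its own way to fetch a page, pull out the items, and find the next cursor, while the loop itself is written once.

```python
from typing import Any, Callable, Iterator, Optional


def paginate(
    fetch_page: Callable[[Optional[Any]], dict],
    get_items: Callable[[dict], list],
    get_next: Callable[[dict], Optional[Any]],
) -> Iterator[Any]:
    """Yield items page by page until get_next returns None."""
    cursor = None  # None means "first page"
    while True:
        page = fetch_page(cursor)
        yield from get_items(page)
        cursor = get_next(page)
        if cursor is None:
            break


# Fake two-page response keyed by cursor, standing in for a real endpoint.
FAKE_PAGES = {
    None: {"data": [1, 2], "next": "p2"},
    "p2": {"data": [3], "next": None},
}

items = list(paginate(FAKE_PAGES.get, lambda p: p["data"], lambda p: p["next"]))
# items == [1, 2, 3]
```

The same function covers page-number, offset, and cursor styles: only the three callables change, for example a get_next that returns page + 1 until an empty page arrives.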
Explicit schema definitions before loading prevent silent type errors downstream.
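To illustrate the point, here is a dependency-free sketch of checking rows against a declared schema before they reach the database. SCHEMA, its columns, and conform are hypothetical names; the project's real schema is not shown in the source.

```python
# Hypothetical schema: column name -> (Python type, nullable)
SCHEMA = {
    "player_id": (int, False),
    "game_id": (str, False),
    "points": (int, True),
}


def conform(row, schema=SCHEMA):
    """Return a copy of row cast to the schema, or raise naming the bad column."""
    unknown = set(row) - set(schema)
    if unknown:
        raise ValueError(f"unexpected columns: {sorted(unknown)}")

    out = {}
    for col, (typ, nullable) in schema.items():
        value = row.get(col)
        if value is None:
            if not nullable:
                raise ValueError(f"{col} must not be null")
            out[col] = None
            continue
        try:
            out[col] = typ(value)  # cast with the declared Python type
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"{col}: cannot cast {value!r} to {typ.__name__}"
            ) from exc
    return out


good = conform({"player_id": "7", "game_id": "G1", "points": "31"})
```

A value such as "N/A" in the points column raises immediately with the column name in the message, instead of being stored as text in a numeric column and surfacing later in a query.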
Meaningful commit messages are documentation; treat them as such from day one.