Data Engineering

NBA Data Pipeline

Python & API integration pipeline extracting and transforming NBA statistics

Python, Pandas, REST API, SQL, Git

Overview

A data engineering portfolio project that demonstrates end-to-end pipeline development: extracting live NBA statistics from public API endpoints, performing data cleaning and transformation in Pandas, and loading the results into a structured relational database for downstream analysis.

The Challenge

NBA stats APIs return nested JSON payloads with inconsistent schemas across seasons and endpoints. The pipeline needed to normalize these structures, handle missing values gracefully, and produce clean, query-ready tables without manual intervention.

The Solution

Built a modular Python pipeline with a clear extract-transform-load separation. The extraction layer handles API pagination and rate limiting. The transformation layer uses Pandas for normalization, type casting, and deduplication. The load layer upserts records into a structured database with a defined schema.
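
The extract-transform-load separation described above can be sketched as three interchangeable callables wired together by one runner. This is a minimal illustration, not the project's actual code: `run_pipeline` and the three stand-in stages are hypothetical names, and the sample rows are made up so the sketch runs without a network or database.

```python
from typing import Any, Callable, Iterable

Record = dict[str, Any]


def run_pipeline(
    extract: Callable[[], Iterable[Record]],
    transform: Callable[[Iterable[Record]], list[Record]],
    load: Callable[[list[Record]], int],
) -> int:
    """Run the three stages in order and return the number of rows loaded."""
    raw = extract()
    clean = transform(raw)
    return load(clean)


# Stand-in stages (hypothetical) so the sketch runs offline.
def fake_extract() -> list[Record]:
    return [{"player": "A", "pts": "31"}, {"player": "A", "pts": "31"}]


def dedupe_and_cast(rows: Iterable[Record]) -> list[Record]:
    seen, out = set(), []
    for row in rows:
        key = row["player"]
        if key not in seen:
            seen.add(key)
            out.append({"player": key, "pts": int(row["pts"])})
    return out


store: list[Record] = []


def load_to_list(rows: list[Record]) -> int:
    store.extend(rows)
    return len(rows)


loaded = run_pipeline(fake_extract, dedupe_and_cast, load_to_list)
```

Because each stage is only a callable, any one of them can be swapped for a test double without touching the other two.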

System Architecture

01

Extraction Layer

Pulls player, team, and game statistics from NBA API endpoints with pagination and error handling

Python requests, NBA Stats API, JSON parsing
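
As a sketch of the JSON parsing step, the helper below converts a payload that stores column names and rows in parallel arrays into a list of dicts. The payload shape (`resultSets`, `headers`, `rowSet`) is an assumption made for illustration, since the exact response format is not specified here. The function name and the sample values are hypothetical.

```python
import json
from typing import Any


def result_set_to_records(payload: dict[str, Any], name: str) -> list[dict[str, Any]]:
    """Turn one named result set (parallel headers + rows) into a list of dicts."""
    for result_set in payload.get("resultSets", []):
        if result_set.get("name") == name:
            headers = result_set["headers"]
            return [dict(zip(headers, row)) for row in result_set["rowSet"]]
    raise KeyError(f"result set {name!r} not found in payload")


# Hand-written sample payload (illustrative values, not real API output).
sample = json.loads("""
{
  "resultSets": [
    {
      "name": "PlayerStats",
      "headers": ["PLAYER_ID", "TEAM", "PTS"],
      "rowSet": [[1, "AAA", 27.5], [2, "BBB", null]]
    }
  ]
}
""")

records = result_set_to_records(sample, "PlayerStats")
```

Raising `KeyError` for a missing result set makes a schema change on the API side fail loudly at extraction time, before it can surface as an empty table.
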
02

Transformation Layer

Cleans, normalizes, and enriches raw API payloads into structured tabular data

pandas, NumPy, data type coercion
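
The normalization, type coercion, and deduplication steps can be sketched with pandas as follows. The nested record layout, the column names, and the choice of `player_id` as the deduplication key are assumptions for illustration. The values are made up.

```python
import pandas as pd

# Hand-written nested records (illustrative values only).
raw = [
    {"player": {"id": 1, "name": "Player A"}, "stats": {"pts": "27.5", "reb": "7"}},
    {"player": {"id": 1, "name": "Player A"}, "stats": {"pts": "27.5", "reb": "7"}},
    {"player": {"id": 2, "name": "Player B"}, "stats": {"pts": None, "reb": "n/a"}},
]

# Flatten nested keys into columns such as player_id and stats_pts.
df = pd.json_normalize(raw, sep="_")

# Coerce numeric columns; unparseable values become NaN instead of raising.
for col in ["stats_pts", "stats_reb"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Nullable integer dtype keeps a missing rebound count as <NA> in an integer column.
df["stats_reb"] = df["stats_reb"].astype("Int64")

# Keep the first row per player and drop later repeats.
df = df.drop_duplicates(subset=["player_id"]).reset_index(drop=True)
```

Using `errors="coerce"` trades strictness for resilience: a bad value becomes a missing value rather than stopping the run, so null counts are worth logging after this step.
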
03

Load Layer

Inserts cleaned records into a relational database with upsert logic to avoid duplicates

SQLAlchemy, PostgreSQL / SQLite, schema migrations
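
A minimal sketch of upsert logic follows. To stay self-contained it uses the standard-library `sqlite3` module with an in-memory database rather than SQLAlchemy. The table name, columns, and composite key are hypothetical. It relies on the `INSERT ... ON CONFLICT ... DO UPDATE` clause, which needs SQLite 3.24 or newer; PostgreSQL accepts the same clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE player_game_stats (
        player_id INTEGER NOT NULL,
        game_id   TEXT    NOT NULL,
        pts       INTEGER,
        PRIMARY KEY (player_id, game_id)
    )
    """
)

# On a key collision, overwrite the stored value with the incoming one.
UPSERT_SQL = """
INSERT INTO player_game_stats (player_id, game_id, pts)
VALUES (:player_id, :game_id, :pts)
ON CONFLICT (player_id, game_id) DO UPDATE SET pts = excluded.pts
"""


def upsert(rows):
    """Insert new rows and update existing ones in a single transaction."""
    with conn:
        conn.executemany(UPSERT_SQL, rows)


upsert([{"player_id": 1, "game_id": "G1", "pts": 20}])
# Re-running with a corrected value updates the row instead of duplicating it.
upsert([
    {"player_id": 1, "game_id": "G1", "pts": 24},
    {"player_id": 2, "game_id": "G1", "pts": 11},
])

rows = conn.execute(
    "SELECT player_id, game_id, pts FROM player_game_stats ORDER BY player_id"
).fetchall()
```

Keying the conflict on the natural identifier of a row makes the load step idempotent, so the pipeline can be re-run for the same date range safely.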

Key Features

01

Automated Data Extraction

Fetches player stats, team standings, and game logs from API endpoints with built-in rate limiting and retry logic.
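
One way to combine rate limiting with retry logic is sketched below: a limiter that enforces a minimum gap between calls, and a wrapper that retries transient failures with exponential backoff. The class and function names, the default delays, and the exception types are assumptions for illustration. The fetch function is a simulated stand-in, so nothing touches the network.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # timestamp of the previous call, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def call_with_retry(fn, *, attempts=3, base_delay=0.5,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated endpoint that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"status": "ok"}


delays = []  # record backoff delays instead of actually sleeping
result = call_with_retry(flaky_fetch, sleep=delays.append)
```

With the `requests` library, `retry_on` would typically list `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout`, and `limiter.wait()` would run immediately before each request.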

02

Pandas Transformation Pipeline

Normalizes nested JSON, handles missing values, casts types, and deduplicates records before loading.

03

Structured Database Output

Loads clean data into a relational schema optimized for analytical queries and reporting.

04

Git-Based Version Control

Full commit history with meaningful messages, branch strategy, and documented best practices for reproducibility.

Results & Impact

100%
Automated

Zero manual steps from API call to database record

30+
Stats per player

Per-game and season-aggregate metrics extracted and normalized

Tech Stack Deep Dive

Languages & Libraries

Python: Core pipeline language
pandas: Data cleaning and transformation
requests: HTTP client for API calls
SQLAlchemy: Database ORM and connection management

Infrastructure & Tooling

Git / GitHub: Version control and commit history
PostgreSQL: Structured data storage

Lessons Learned

API pagination patterns vary significantly; building a generic iterator early saves rework.
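
A generic iterator of this kind might look like the sketch below. The caller supplies three small functions describing how to fetch a page, where its items live, and how to find the next cursor, so the loop itself never changes. All names here are hypothetical, and the pages are an in-memory stand-in for real responses.

```python
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]


def paginate(
    fetch_page: Callable[[Optional[str]], Page],
    get_items: Callable[[Page], list],
    get_next: Callable[[Page], Optional[str]],
    max_pages: int = 1000,
) -> Iterator[Any]:
    """Yield items across pages until get_next returns None."""
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from get_items(page)
        cursor = get_next(page)
        if cursor is None:
            return
    # Guard against an endpoint that never signals the last page.
    raise RuntimeError(f"pagination did not finish within {max_pages} pages")


# In-memory stand-in for a cursor-based endpoint.
FAKE_PAGES = {
    None: {"data": [1, 2], "next": "p2"},
    "p2": {"data": [3], "next": "p3"},
    "p3": {"data": [], "next": None},
}

items = list(
    paginate(FAKE_PAGES.__getitem__, lambda p: p["data"], lambda p: p["next"])
)
```

An offset-based endpoint fits the same loop by treating the stringified offset as the cursor and returning `None` from `get_next` once a page comes back short.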

Explicit schema definitions before loading prevent silent type errors downstream.
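
To illustrate the point, the sketch below declares the expected columns and types once and checks every row against them before loading, so a bad value raises immediately instead of being stored as the wrong type. The schema contents, the nullable set, and the `conform` helper are hypothetical. The rows are made up.

```python
SCHEMA = {"player_id": int, "team": str, "pts": float}
NULLABLE = {"pts"}  # columns allowed to hold missing values


def conform(row, schema=SCHEMA, nullable=NULLABLE):
    """Return a copy of row cast to the schema, or raise ValueError."""
    missing = schema.keys() - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    out = {}
    for col, typ in schema.items():
        value = row[col]
        if value is None:
            if col not in nullable:
                raise ValueError(f"{col} may not be null")
            out[col] = None
            continue
        try:
            out[col] = typ(value)
        except (TypeError, ValueError) as exc:
            raise ValueError(f"{col}={value!r} is not a valid {typ.__name__}") from exc
    return out


# Columns outside the schema (here "extra") are dropped.
good = conform({"player_id": "7", "team": "AAA", "pts": "12.5", "extra": 1})

try:
    conform({"player_id": "seven", "team": "AAA", "pts": 3})
    error = None
except ValueError as exc:
    error = str(exc)
```

The same declaration can also drive the table definition, which keeps the transformation output and the database columns from drifting apart.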

Meaningful commit messages are documentation; treat them as such from day one.