Data Engineering

Real-Time Analytics Platform

Processing 500K+ events/sec with sub-second dashboard latency

Apache Kafka · Spark · Delta Lake · Python · AWS

Overview

This platform was built to replace a legacy batch-processing system that delivered analytics with a 24-hour delay. The business needed real-time visibility into user behavior, transaction patterns, and system health to make time-sensitive decisions. I led the end-to-end design and implementation of a streaming-first architecture that processes over 500,000 events per second and delivers insights to stakeholders through live dashboards with sub-second latency.

The Challenge

The existing batch ETL pipeline ran nightly, meaning business stakeholders were always looking at stale data. During peak traffic events (product launches, flash sales), the team was flying blind — issues went undetected for hours, and opportunities were missed. The legacy system also couldn't scale horizontally, causing frequent job failures as data volume grew 40% year-over-year. We needed a solution that could handle bursty traffic, guarantee exactly-once processing, and integrate seamlessly with existing BI tools.

The Solution

I designed a Lambda architecture that combines a real-time streaming layer for immediate insights with a batch layer for historical accuracy and backfill. The streaming layer uses Apache Kafka as the event backbone, Spark Structured Streaming for transformation and enrichment, and Delta Lake as the unified storage layer supporting both streaming writes and batch reads. A serving layer powered by materialized views in a columnar store feeds the live dashboards.

System Architecture

01. Ingestion Layer

High-throughput event collection with schema validation and dead-letter routing

Apache Kafka (3 clusters, 120 partitions) · Schema Registry (Avro) · Custom producers (Python, Go)
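The validate-or-dead-letter decision at the ingestion edge can be sketched as follows. This is a minimal illustration: the topic names and required-field set are hypothetical, and the production pipeline validates against Avro schemas from Schema Registry rather than a hand-rolled check.

```python
# Hypothetical minimal schema: the real check is Avro-based.
REQUIRED_FIELDS = {"event_id", "event_type", "timestamp"}

def route_event(event: dict) -> str:
    """Return the destination topic: the main stream if the event passes
    schema validation, the dead-letter topic otherwise."""
    if not REQUIRED_FIELDS.issubset(event):
        return "events.dlq"          # malformed event -> dead-letter routing
    if not isinstance(event["timestamp"], (int, float)):
        return "events.dlq"          # wrong type -> dead-letter routing
    return "events.main"

good = {"event_id": "e1", "event_type": "click", "timestamp": 1700000000}
bad = {"event_id": "e2"}             # missing fields
```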
02. Stream Processing

Real-time transformation, enrichment, aggregation, and windowed computations

Spark Structured Streaming · Stateful processing with RocksDB · Watermark-based late event handling
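Watermark-based late event handling can be illustrated in miniature. This is plain Python mimicking Spark's semantics, not Spark itself, and the lateness budget is a made-up value: the watermark trails the maximum event time seen by the allowed lateness, and anything older is dropped.

```python
class WatermarkFilter:
    """Drop events older than (max event time seen - allowed lateness),
    mirroring Spark Structured Streaming's watermark semantics in miniature."""

    def __init__(self, allowed_lateness_s: float):
        self.allowed_lateness = allowed_lateness_s
        self.max_event_time = float("-inf")

    def accept(self, event_time: float) -> bool:
        # The watermark only ever advances as newer events arrive.
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        return event_time >= watermark

wf = WatermarkFilter(allowed_lateness_s=10)
```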
03. Storage Layer

Unified lakehouse supporting both streaming upserts and time-travel queries

Delta Lake on S3 · Z-order optimization · Automated compaction & vacuum
04. Serving Layer

Low-latency materialized views optimized for dashboard queries

Apache Druid · Redis (cache layer) · Pre-aggregated rollup tables
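The pre-aggregated rollup idea is simple to sketch: collapse raw events into the per-minute counters that dashboard queries actually hit, so the serving layer never scans raw data. The event shape here is hypothetical; in production Druid builds these rollups at ingestion time.

```python
from collections import defaultdict

def rollup_per_minute(events):
    """Pre-aggregate raw (timestamp, event_type) events into per-minute
    counts keyed by (minute_bucket, event_type)."""
    table = defaultdict(int)
    for ts, event_type in events:
        bucket = int(ts) // 60 * 60          # floor to the start of the minute
        table[(bucket, event_type)] += 1
    return dict(table)

events = [(0, "click"), (30, "click"), (61, "view")]
```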
05. Orchestration & Monitoring

Pipeline orchestration, data quality checks, and observability

Apache Airflow · Great Expectations · Datadog + PagerDuty

Key Features

01. Exactly-Once Processing

Implemented idempotent writes with Kafka transactional producers and Spark checkpoint-based recovery to guarantee no duplicates or data loss, even during node failures.
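The idempotent-sink half of that guarantee can be sketched as follows: commit each record together with its source (partition, offset), so that a replay after a failure becomes a no-op. This is a toy in-memory illustration, not the actual Kafka-transactional implementation, and the class name is hypothetical.

```python
class IdempotentSink:
    """Skip records whose (partition, offset) has already been committed,
    so replays after a node failure cannot produce duplicates."""

    def __init__(self):
        self.store = []                              # the "table"
        self.committed = set()                       # (partition, offset) pairs

    def write(self, partition: int, offset: int, record: dict) -> bool:
        key = (partition, offset)
        if key in self.committed:
            return False                             # replayed record: no-op
        self.store.append(record)
        self.committed.add(key)                      # data + offset committed together
        return True
```

The essential point is that the record and its offset are committed atomically; if either were committed without the other, a crash between the two steps would produce a duplicate or a gap.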

02. Dynamic Scaling

Auto-scaling Spark executors based on Kafka consumer lag metrics. During peak events, the cluster scales from 20 to 80 executors in under 2 minutes with zero downtime.
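The scaling decision reduces to a small function: size the cluster to clear the observed consumer lag, clamped to the 20-executor floor and 80-executor ceiling. The per-executor lag budget below is a hypothetical tuning constant, not the production value.

```python
def target_executors(lag: int,
                     lag_per_executor: int = 50_000,   # hypothetical budget
                     lo: int = 20, hi: int = 80) -> int:
    """Executors needed to clear the current Kafka consumer lag,
    clamped to the configured floor and ceiling."""
    needed = -(-lag // lag_per_executor)               # ceiling division
    return max(lo, min(hi, needed))
```

Keying the policy on lag rather than CPU or memory means the cluster grows as soon as it falls behind the stream, not after executors are already saturated.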

03. Schema Evolution

Full backward/forward compatible schema evolution using Avro and Schema Registry, allowing producers and consumers to deploy independently without breaking changes.

04. Live Dashboard Engine

Custom WebSocket-based push system that streams pre-aggregated metrics to Grafana and a React dashboard, achieving P99 latency under 800ms from event to pixel.
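The fan-out at the heart of that push system can be sketched with stdlib asyncio queues standing in for WebSocket connections. The class and method names are hypothetical; the real engine pushes over WebSockets to Grafana and the React dashboard.

```python
import asyncio

class MetricBroadcaster:
    """Fan out pre-aggregated metrics to every connected dashboard client."""

    def __init__(self):
        self._clients = set()                    # one queue per connected client

    def subscribe(self) -> asyncio.Queue:
        q = asyncio.Queue()
        self._clients.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._clients.discard(q)

    async def publish(self, metric: dict) -> None:
        for q in self._clients:                  # every client gets every update
            await q.put(metric)

async def demo():
    b = MetricBroadcaster()
    a, c = b.subscribe(), b.subscribe()
    await b.publish({"events_per_sec": 512_000})
    return await a.get(), await c.get()

m1, m2 = asyncio.run(demo())
```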

05. Data Quality Gates

Automated quality checks at every pipeline stage using Great Expectations. Failed checks route data to quarantine topics and trigger PagerDuty alerts.
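The gate-and-quarantine pattern can be sketched as a function over declarative checks. The check names and record shape are hypothetical; in production the checks are Great Expectations suites and the quarantine is a Kafka topic wired to PagerDuty.

```python
def quality_gate(records, checks):
    """Apply declarative checks (name -> predicate) to each record.
    Passing records continue downstream; failing records are quarantined
    along with the names of the checks they failed."""
    passed, quarantined = [], []
    for rec in records:
        failures = [name for name, check in checks.items() if not check(rec)]
        if failures:
            quarantined.append((rec, failures))   # -> quarantine topic + alert
        else:
            passed.append(rec)
    return passed, quarantined

checks = {
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
    "has_user_id": lambda r: "user_id" in r,
}
records = [{"user_id": "u1", "amount": 10}, {"amount": -5}]
```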

06. Time-Travel & Backfill

Delta Lake time-travel enables instant rollback and historical replay. Backfill jobs can reprocess months of data without affecting the live streaming pipeline.
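The semantics of time-travel reads can be illustrated with a toy versioned store: every commit produces a new readable version, and any past version stays queryable. This is only an illustration of the behavior; Delta Lake implements it with a transaction log over Parquet files, not full snapshots, and the class name is hypothetical.

```python
class VersionedTable:
    """Toy time-travel semantics: each commit yields a new version, and
    reads can target any past version (Delta's versionAsOf behavior)."""

    def __init__(self):
        self.versions = [{}]                     # version 0 is the empty table

    def commit(self, updates: dict) -> int:
        snapshot = {**self.versions[-1], **updates}
        self.versions.append(snapshot)
        return len(self.versions) - 1            # new version number

    def read(self, version_as_of=None) -> dict:
        if version_as_of is None:
            version_as_of = len(self.versions) - 1   # default: latest
        return self.versions[version_as_of]

t = VersionedTable()
v1 = t.commit({"revenue": 100})
v2 = t.commit({"revenue": 250})
```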

Results & Impact

500K+ Events/Second

Peak sustained throughput across 3 Kafka clusters with 99.99% delivery guarantee

<800ms End-to-End Latency

P99 latency from event ingestion to dashboard rendering

99.97% Pipeline Uptime

Over 12 months of operation with zero data loss incidents

40% Cost Reduction

Compared to the legacy system, through better resource utilization and spot instances

24hr → <1s Insight Delay

Reduced analytics latency from next-day batch to sub-second streaming

15TB/day Data Processed

Daily volume across all event streams with automatic compaction

Tech Stack Deep Dive

Streaming

Apache Kafka: Event streaming backbone — 3 production clusters with MirrorMaker 2 for cross-region replication
Spark Structured Streaming: Real-time transformation and enrichment with exactly-once semantics
Schema Registry: Centralized schema management with Avro for type-safe event contracts

Storage

Delta Lake: ACID-compliant lakehouse storage with time-travel, schema enforcement, and streaming support
Amazon S3: Durable object storage for the data lake with lifecycle policies
Apache Druid: Real-time OLAP database for sub-second dashboard queries

Orchestration

Apache Airflow: DAG-based orchestration for batch reprocessing, compaction, and data quality jobs
Great Expectations: Declarative data quality validation at every pipeline stage

Infrastructure

AWS (EKS, EMR, S3, MSK): Cloud infrastructure with Kubernetes-based deployment
Terraform: Infrastructure as code for reproducible multi-environment provisioning
Docker: Containerized microservices for all pipeline components

Monitoring

Datadog: Unified metrics, traces, and logs with custom Kafka and Spark dashboards
PagerDuty: On-call alerting with escalation policies tied to SLA thresholds

Lessons Learned

Start with exactly-once semantics from day one — retrofitting it later is exponentially harder and riskier.

Invest heavily in observability before scaling. We built custom Kafka consumer lag dashboards and Spark stage-level metrics that became essential for debugging.

Schema evolution discipline pays dividends. Enforcing backward compatibility from the start prevented multiple potential production incidents.

Delta Lake's time-travel feature was invaluable not just for debugging but for A/B testing pipeline changes against historical data.

Auto-scaling based on consumer lag rather than CPU/memory gave us much better responsiveness to traffic bursts.