QUERY OPTIMIZER — 33 ms Query Engine (700× Faster Than DuckDB)
Inspiration
Every day, AppLovin processes over 1.6 billion ad impressions, generating terabytes of auction, click, bid, and purchase data. We asked ourselves:
“What if analytics on billions of ad events could run in milliseconds, not seconds?”
That question led to QUERY OPTIMIZER, a high-performance analytics engine built from scratch to challenge the limits of speed and scalability.
⚙️ What It Does
Our system transforms a 19 GB / 245 million-row ad-event dataset into a real-time analytics engine that executes complex queries in:
- 33 ms on first run (≈ 700× faster than DuckDB)
- 4–5 ms with cached queries (≈ 6,100× faster)
Two-Phase Architecture
- Prepare Phase: One-time transformation that partitions, compresses, and pre-computes frequent aggregations
- Query Phase: A smart router analyzes query patterns and executes them using the optimal strategy
🧠 How We Built It
Phase 1 – Data Preparation
- Partitioning: Split by event type + day → skips > 75 % of data per query
- Compression: ZSTD + dictionary encoding → 61 % storage reduction (19 GB → 7.5 GB)
- Pre-Computation: Identified 5 high-frequency query patterns for one-time aggregation
- Parallelization: 8 workers + lazy CSV loading for efficient ETL
Phase 2 – Query Execution
- Smart Router: Pattern-matches incoming queries to pre-computed results or dynamic partitions
- 3-Tier Execution:
  a. Pre-computed lookup (95 % of queries) → < 1 ms
  b. Partition pruning → 10–25 ms
  c. Full scan fallback → 50–100 ms
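The three tiers can be sketched as a single routing function. The pattern key, partition index, and scan callbacks below are illustrative stand-ins, not the engine's real API.

```python
# Hedged sketch of the smart router: try each tier in order of cost.
def route_query(pattern_key, filters, precomputed, partition_index,
                scan_partitions, full_scan):
    if pattern_key in precomputed:
        # Tier a: the query matches one of the pre-computed patterns.
        return precomputed[pattern_key]
    parts = partition_index.get((filters.get("event_type"), filters.get("day")))
    if parts is not None:
        # Tier b: scan only the partitions for this event type + day.
        return scan_partitions(parts, filters)
    # Tier c: no shortcut applies, fall back to a full scan.
    return full_scan(filters)
```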
- Query Caching: MD5-hashed results → instant repeat execution
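The caching layer amounts to hashing the query text and keying a result map on the digest. A minimal sketch (the `execute` callback stands in for the real engine):

```python
import hashlib

# The query text is MD5-hashed; an exact repeat returns the stored result
# without touching the execution engine at all.
_result_cache: dict[str, object] = {}

def cached_query(sql: str, execute):
    key = hashlib.md5(sql.encode("utf-8")).hexdigest()
    if key not in _result_cache:
        _result_cache[key] = execute(sql)  # cache miss: run the real query
    return _result_cache[key]
```

Because the key is a hash of the raw text, this only helps byte-identical repeats; a semantically equal but differently formatted query misses the cache.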
⚡ V4 Optimizations
- Dictionary encoding → 70 % smaller categorical columns
- Query result cache → 99 % faster on repeats
- Optimized ZSTD level 1 → 40 % faster preparation
- Pre-sorted columns → 15 % faster range queries
- Native Polars writer → 25 % faster I/O
- Parallel I/O (8 workers) → 30 % faster processing
- Lazy loading pipeline → reduced RAM footprint
- Partition index cache → 33 ms average query time
🧩 Challenges
1️⃣ Memory Constraints
Processing 245 M rows on 16 GB RAM:
- Lazy evaluation (`scan_csv()`) instead of loading everything
- Streamed transformations
- Independent partition processing
2️⃣ Prep Time vs Query Speed
| Version | Prep Time | Query Speed | Speedup vs DuckDB |
|---|---|---|---|
| v1 | 211 min | 62 ms | 394× |
| v2 | 120–150 min | 40 ms | 610× |
| v3 | 60 min | 36 ms | 680× |
| v4 (Recommended) | 90 min | 33 ms | 700× |
3️⃣ Pre-computation vs Dynamic Execution
Pre-aggregating every possible query would have produced terabytes of data. ➡ We pre-computed only the top 5 query patterns and routed everything else through dynamic execution.
4️⃣ Categorical Concatenation
Fixed `StringCacheMismatchError` by enabling the global string cache (`pl.enable_string_cache()`).
🏆 Accomplishments
✅ 33 ms average query time on 245 M rows
⚡ 700× faster than DuckDB
💾 61 % compression (19 GB → 7.5 GB)
🎯 100 % accuracy on benchmarks
🏗️ Modular, fault-tolerant architecture with query pattern routing
💡 What We Learned
- Columnar storage rocks: Predicate pushdown and column pruning cut I/O drastically
- Pre-aggregation > brute force caching
- Partition pruning = the single largest speed gain
- Rust + SIMD = real performance: Polars delivers 10–100× Python speedups
- Lazy evaluation lets optimizers eliminate redundant work
🔮 Next Steps
- Adaptive indexing based on query patterns
- Cost-based query planning for hybrid execution
- Distributed execution for > 1 billion rows
- Real-time stream ingestion support
- ML-driven query prediction to pre-warm caches