QUERY OPTIMIZER — 33 ms Query Engine (700× Faster Than DuckDB)

Inspiration

Every day, AppLovin processes over 1.6 billion ad impressions, generating terabytes of auction, click, bid, and purchase data. We asked ourselves:

“What if analytics on billions of ad events could run in milliseconds, not seconds?”

That question led to QUERY OPTIMIZER, a high-performance analytics engine built from scratch to challenge the limits of speed and scalability.


⚙️ What It Does

Our system transforms a 19 GB, 245-million-row ad-event dataset into a real-time analytics engine that executes complex queries in:

  • 33 ms on first run (≈ 700× faster than DuckDB)
  • 4–5 ms with cached queries (≈ 6,100× faster)

Two-Phase Architecture

  1. Prepare Phase: One-time transformation that partitions, compresses, and pre-computes frequent aggregations
  2. Query Phase: A smart router analyzes query patterns and executes them using the optimal strategy

🧠 How We Built It

Phase 1 – Data Preparation

  • Partitioning: Split by event type + day → skips > 75 % of data per query
  • Compression: ZSTD + dictionary encoding → 61 % storage reduction (19 GB → 7.5 GB)
  • Pre-Computation: Identified 5 high-frequency query patterns for one-time aggregation
  • Parallelization: 8 workers + lazy CSV loading for efficient ETL
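
Taken together, the prepare phase boils down to something like the following Polars sketch. The file paths and column names are assumptions for illustration, as is the pre-derived day column; the real pipeline also fans the per-partition writes out across 8 workers rather than looping sequentially.

    import polars as pl
    from pathlib import Path

    # Illustrative prepare phase: lazily scan the raw CSV, then write one
    # ZSTD-compressed Parquet file per (event_type, day) partition.
    lf = pl.scan_csv("ad_events.csv")  # lazy: nothing is loaded into RAM yet

    # Discover the partition keys without materializing the full dataset.
    keys = lf.select("event_type", "day").unique().collect()

    for event_type, day in keys.iter_rows():
        Path(f"partitions/{event_type}").mkdir(parents=True, exist_ok=True)
        (
            lf.filter((pl.col("event_type") == event_type) & (pl.col("day") == day))
            # ZSTD level 1 trades a little compression ratio for much faster prep.
            .sink_parquet(
                f"partitions/{event_type}/{day}.parquet",
                compression="zstd",
                compression_level=1,
            )
        )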

Phase 2 – Query Execution

  • Smart Router: Pattern-matches incoming queries to pre-computed results or dynamic partitions
  • 3-Tier Execution:
      a. Pre-computed lookup (95 % of queries) → < 1 ms
      b. Partition pruning → 10–25 ms
      c. Full scan fallback → 50–100 ms
  • Query Caching: MD5-hashed results → instant repeat execution
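
Condensed to a hedged sketch, the router looks roughly like this. The dict-shaped query, pattern names, and paths are illustrative assumptions rather than our exact API, and MD5 serves purely as a cache key, not for security.

    import hashlib
    import polars as pl

    PRECOMPUTED = {"revenue_by_day": "agg/revenue_by_day.parquet"}
    CACHE: dict[str, pl.DataFrame] = {}

    def classify(query: dict) -> str | None:
        """Map a query onto one of the 5 pre-computed patterns, if any fits."""
        if query.get("group_by") == ["day"] and query.get("metric") == "revenue":
            return "revenue_by_day"
        return None

    def route(query: dict) -> pl.DataFrame:
        key = hashlib.md5(repr(sorted(query.items())).encode()).hexdigest()
        if key in CACHE:                                # repeat query: ~4–5 ms
            return CACHE[key]

        pattern = classify(query)
        if pattern is not None:                         # tier a: < 1 ms lookup
            out = pl.read_parquet(PRECOMPUTED[pattern])
        elif "event_type" in query and "day" in query:  # tier b: partition pruning
            out = pl.scan_parquet(
                f"partitions/{query['event_type']}/{query['day']}.parquet"
            ).collect()
        else:                                           # tier c: full scan fallback
            out = pl.scan_parquet("partitions/**/*.parquet").collect()

        CACHE[key] = out
        return out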

⚡ V4 Optimizations

  1. Dictionary encoding → 70 % smaller categorical columns
  2. Query result cache → 99 % faster on repeats
  3. Optimized ZSTD level 1 → 40 % faster preparation
  4. Pre-sorted columns → 15 % faster range queries
  5. Native Polars writer → 25 % faster I/O
  6. Parallel I/O (8 workers) → 30 % faster processing
  7. Lazy loading pipeline → reduced RAM footprint
  8. Partition index cache → 33 ms average query time
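
As an illustration of how optimizations 1 and 3 fit together (column names are assumptions): cast repetitive string columns to Categorical for dictionary encoding, then rewrite the partition with ZSTD level 1.

    import polars as pl

    df = pl.read_parquet("partitions/click/2024-01-01.parquet")
    # Dictionary encoding: each distinct string is stored once and rows hold
    # small integer codes, which shrinks categorical columns dramatically.
    df = df.with_columns(
        pl.col("event_type", "country", "campaign_id").cast(pl.Categorical)
    )
    # ZSTD level 1: noticeably faster writes than higher levels, still compact.
    df.write_parquet(
        "partitions/click/2024-01-01.parquet",
        compression="zstd",
        compression_level=1,
    )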

🧩 Challenges

1️⃣ Memory Constraints

Processing 245 M rows on 16 GB of RAM required:

  • Lazy evaluation (pl.scan_csv()) instead of loading everything up front
  • Streamed transformations (see the sketch after this list)
  • Independent partition processing
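
A small illustration of that lazy, streamed style (column names are assumptions):

    import polars as pl

    # Count events per type without ever holding the 19 GB CSV in memory:
    # the lazy plan is executed in batches by Polars' streaming engine.
    counts = (
        pl.scan_csv("ad_events.csv")
        .group_by("event_type")
        .agg(pl.len().alias("events"))
        .collect(streaming=True)
    )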

2️⃣ Prep Time vs Query Speed

Version            Prep Time     Query Speed   Speedup vs DuckDB
v1                 211 min       62 ms         394×
v2                 120–150 min   40 ms         610×
v3                 60 min        36 ms         680×
v4 (Recommended)   90 min        33 ms         700×

3️⃣ Pre-computation vs Dynamic Execution

Pre-aggregating every possible query pattern would have produced terabytes of output. ➡ We pre-computed only the 5 most frequent query patterns and routed everything else through dynamic execution; one such pattern is sketched below.
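
As a hypothetical example, a daily revenue roll-up can be computed once at prep time and then served by the router's tier-a lookup (paths and column names are illustrative):

    import polars as pl

    # One of the five pre-computed aggregations, written once during prep.
    (
        pl.scan_parquet("partitions/purchase/*.parquet")
        .group_by("day", "campaign_id")
        .agg(pl.col("revenue").sum().alias("revenue"))
        .collect()
        .write_parquet("agg/revenue_by_day.parquet", compression="zstd")
    )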

4️⃣ Categorical Concatenation

Fixed a StringCacheMismatchError when concatenating categorical columns by enabling Polars' global string cache (pl.enable_string_cache()).
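
In Polars the fix is a one-liner before any categorical columns are created; a minimal reproduction with illustrative values:

    import polars as pl

    pl.enable_string_cache()  # all Categorical columns now share one encoding

    a = pl.DataFrame({"event_type": ["click", "bid"]}).with_columns(
        pl.col("event_type").cast(pl.Categorical)
    )
    b = pl.DataFrame({"event_type": ["purchase"]}).with_columns(
        pl.col("event_type").cast(pl.Categorical)
    )
    combined = pl.concat([a, b])  # no StringCacheMismatchError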


🏆 Accomplishments

  • 33 ms average query time on 245 M rows
  • ⚡ 700× faster than DuckDB
  • 💾 61 % compression (19 GB → 7.5 GB)
  • 🎯 100 % accuracy on benchmarks
  • 🏗️ Modular, fault-tolerant architecture with query pattern routing


💡 What We Learned

  • Columnar storage rocks: predicate pushdown and column pruning cut I/O drastically
  • Pre-aggregation > brute-force caching
  • Partition pruning = the single largest speed gain
  • Rust + SIMD = real performance: Polars delivers 10–100× speedups over pure Python
  • Lazy evaluation lets the optimizer eliminate redundant work (see the plan sketch below)
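
The first and last lessons are easy to see in a lazy query plan (path and column names are assumptions):

    import polars as pl

    plan = (
        pl.scan_parquet("partitions/click/2024-01-01.parquet")
        .filter(pl.col("country") == "US")
        .select("campaign_id")
        .explain()
    )
    # The printed plan shows the filter and the single-column projection pushed
    # down into the Parquet scan, so unneeded rows and columns are never read.
    print(plan)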

🔮 Next Steps

  • Adaptive indexing based on query patterns
  • Cost-based query planning for hybrid execution
  • Distributed execution for > 1 billion rows
  • Real-time stream ingestion support
  • ML-driven query prediction to pre-warm caches

Built With

Python · Polars · ZSTD