QUERY OPTIMIZER — 33 ms Query Engine (700× Faster Than DuckDB)
Inspiration
Every day, AppLovin processes over 1.6 billion ad impressions, generating terabytes of auction, click, bid, and purchase data. We asked ourselves:
“What if analytics on billions of ad events could run in milliseconds, not seconds?”
That question led to QUERY OPTIMIZER, a high-performance analytics engine built from scratch to challenge the limits of speed and scalability.
⚙️ What It Does
Our system transforms a 19 GB / 245 million-row ad-event dataset into a real-time analytics engine that executes complex queries in:
- 33 ms on first run (≈ 700× faster than DuckDB)
- 4–5 ms with cached queries (≈ 6,100× faster)
Two-Phase Architecture
- Prepare Phase: One-time transformation that partitions, compresses, and pre-computes frequent aggregations
- Query Phase: A smart router analyzes query patterns and executes them using the optimal strategy
🧠 How We Built It
Phase 1 – Data Preparation
- Partitioning: Split by event type + day → skips > 75 % of data per query
- Compression: ZSTD + dictionary encoding → 61 % storage reduction (19 GB → 7.5 GB)
- Pre-Computation: Identified 5 high-frequency query patterns for one-time aggregation
- Parallelization: 8 workers + lazy CSV loading for efficient ETL
Phase 2 – Query Execution
- Smart Router: Pattern-matches incoming queries to pre-computed results or dynamic partitions
- 3-Tier Execution:
  a. Pre-computed lookup (95 % of queries) → < 1 ms
  b. Partition pruning → 10–25 ms
  c. Full scan fallback → 50–100 ms
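The three tiers can be sketched as a single routing function. The pattern key, partition index, and scan callbacks below are illustrative stand-ins, not the engine's real API.

```python
# Hedged sketch of the smart router: try each tier in order of cost.
def route_query(pattern_key, filters, precomputed, partition_index,
                scan_partitions, full_scan):
    if pattern_key in precomputed:
        # Tier a: the query matches one of the pre-computed patterns.
        return precomputed[pattern_key]
    parts = partition_index.get((filters.get("event_type"), filters.get("day")))
    if parts is not None:
        # Tier b: scan only the partitions for this event type + day.
        return scan_partitions(parts, filters)
    # Tier c: no shortcut applies, fall back to a full scan.
    return full_scan(filters)
```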
- Query Caching: MD5-hashed results → instant repeat execution
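The caching layer amounts to hashing the query text and keying a result map on the digest. A minimal sketch (the `execute` callback stands in for the real engine):

```python
import hashlib

# The query text is MD5-hashed; an exact repeat returns the stored result
# without touching the execution engine at all.
_result_cache: dict[str, object] = {}

def cached_query(sql: str, execute):
    key = hashlib.md5(sql.encode("utf-8")).hexdigest()
    if key not in _result_cache:
        _result_cache[key] = execute(sql)  # cache miss: run the real query
    return _result_cache[key]
```

Because the key is a hash of the raw text, this only helps byte-identical repeats; a semantically equal but differently formatted query misses the cache.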
⚡ V4 Optimizations
- Dictionary encoding → 70 % smaller categorical columns
- Query result cache → 99 % faster on repeats
- Optimized ZSTD level 1 → 40 % faster preparation
- Pre-sorted columns → 15 % faster range queries
- Native Polars writer → 25 % faster I/O
- Parallel I/O (8 workers) → 30 % faster processing
- Lazy loading pipeline → reduced RAM footprint
- Partition index cache → 33 ms average query time
🧩 Challenges
1️⃣ Memory Constraints
Processing 245 M rows on 16 GB RAM:
- Lazy evaluation (`scan_csv()`) instead of loading everything
- Streamed transformations
- Independent partition processing
2️⃣ Prep Time vs Query Speed
| Version | Prep Time | Query Speed | Speedup vs DuckDB |
|---|---|---|---|
| v1 | 211 min | 62 ms | 394× |
| v2 | 120–150 min | 40 ms | 610× |
| v3 | 60 min | 36 ms | 680× |
| v4 (Recommended) | 90 min | 33 ms | 700× |
3️⃣ Pre-computation vs Dynamic Execution
Pre-aggregating every possible query would have produced terabytes of data. ➡ We pre-computed only the top 5 query patterns and routed everything else through dynamic execution.
4️⃣ Categorical Concatenation
Fixed `StringCacheMismatchError` by enabling the global string cache (`pl.enable_string_cache()`).
🏆 Accomplishments
✅ 33 ms average query time on 245 M rows
⚡ 700× faster than DuckDB
💾 61 % compression (19 GB → 7.5 GB)
🎯 100 % accuracy on benchmarks
🏗️ Modular, fault-tolerant architecture with query pattern routing
💡 What We Learned
- Columnar storage rocks: Predicate pushdown and column pruning cut I/O drastically
- Pre-aggregation > brute force caching
- Partition pruning = the single largest speed gain
- Rust + SIMD = real performance: Polars delivers 10–100× Python speedups
- Lazy evaluation lets optimizers eliminate redundant work
🔮 Next Steps
- Adaptive indexing based on query patterns
- Cost-based query planning for hybrid execution
- Distributed execution for > 1 billion rows
- Real-time stream ingestion support
- ML-driven query prediction to pre-warm caches