Inspiration

Every day, 1.6+ billion people see AppLovin ads, generating massive data volumes that need real-time analysis. Traditional databases scan entire datasets for every query—wasting time and resources. We asked: What if we could predict query patterns and pre-compute answers?

Inspired by how search engines pre-index content, we built a system that "pre-thinks" about analytics queries before they're even asked.

What it does

Dan-Query transforms 20GB of ad event data into a 3GB optimized database that answers queries more than 1,000x faster than baseline DuckDB, typically in under 100ms.

Key features:

  • Smart pre-aggregation: 51 tables covering 90% of query patterns
  • Intelligent routing: Automatically picks the fastest execution path
  • Sorted storage: Enables 50x I/O reduction through zone maps
  • Sub-100ms queries: Even on a 245M-row dataset

Example: "Show daily revenue by country, filtered to Japan" drops from 70 seconds to 50 milliseconds.
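
To make that concrete, here is a minimal sketch in Python with DuckDB. The file paths and table names (raw_events/, agg/country_daily.parquet) are illustrative, not the exact production layout: the router answers the same question from a tiny (day, country) rollup instead of scanning the raw events.

```python
import duckdb

con = duckdb.connect()

# Baseline: scan every raw event (slow - reads essentially the whole dataset).
baseline = con.sql("""
    SELECT day, SUM(revenue) AS revenue
    FROM read_parquet('data/raw_events/*.parquet')
    WHERE country = 'JP'
    GROUP BY day
    ORDER BY day
""").df()

# Routed: read a pre-aggregated (day, country) rollup with a few hundred rows.
routed = con.sql("""
    SELECT day, SUM(revenue) AS revenue
    FROM read_parquet('agg/country_daily.parquet')
    WHERE country = 'JP'
    GROUP BY day
    ORDER BY day
""").df()
```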

How we built it

Architecture (3 layers):

  1. Storage: Type-partitioned Parquet files sorted by (day, country, advertiser_id)
  2. Pre-aggregations: 51 tables covering time (daily/weekly/hourly), geography (country), and entities (publisher/advertiser)
  3. Query router: Scoring algorithm that matches queries to optimal aggregation tables
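
The router in layer 3 works roughly like this simplified sketch (table names, row counts, and weights are illustrative, not the exact production scoring): a candidate aggregation must cover every column the query needs, extra dimensions are penalized heavily, and if nothing matches we fall back to a partition scan. This penalty is also what resolved Challenge 1 below.

```python
# Illustrative sketch of dimension-based table scoring (hypothetical names).
AGG_TABLES = {
    "country_daily":            {"dims": {"day", "country"},                  "rows": 366},
    "advertiser_country_daily": {"dims": {"day", "country", "advertiser_id"}, "rows": 4_800_000},
}

def score(table, query_cols):
    """Lower is better; returns None if the table cannot answer the query."""
    dims = AGG_TABLES[table]["dims"]
    if not query_cols <= dims:          # table must cover every column the query needs
        return None
    extra = len(dims - query_cols)      # heavily penalize unneeded dimensions
    return extra * 1_000 + AGG_TABLES[table]["rows"] / 1_000_000

def route(query_cols):
    scored = [(s, t) for t in AGG_TABLES if (s := score(t, query_cols)) is not None]
    return min(scored)[1] if scored else "partition_scan"   # fall back to raw partitions

print(route({"day", "country"}))   # -> country_daily (366 rows, not 4.8M)
```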

Tech stack:

  • DuckDB (query engine) + Pandas (data manipulation)
  • Parquet columnar format with SNAPPY compression
  • Python for orchestration

Development process:

  1. Built baseline with type partitions
  2. Generated 70+ test queries to avoid overfitting (see the sketch after this list)
  3. Profiled bottlenecks (file I/O, sorting, filtering)
  4. Iteratively optimized: sorted storage → native datetime types → predicate pushdown
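
For step 2, a minimal sketch of the kind of query generation we mean; the column names and the events table are hypothetical:

```python
import random

# Hypothetical columns; the real dataset's schema may differ.
DIMENSIONS = ["day", "country", "advertiser_id", "publisher_id"]
METRICS    = ["SUM(revenue)", "COUNT(*)"]
COUNTRIES  = ["JP", "US", "DE", "BR"]

def random_query():
    dims   = random.sample(DIMENSIONS, k=random.randint(1, 2))
    metric = random.choice(METRICS)
    where  = f"WHERE country = '{random.choice(COUNTRIES)}'" if "country" not in dims else ""
    return (f"SELECT {', '.join(dims)}, {metric} AS value "
            f"FROM events {where} GROUP BY {', '.join(dims)}")

test_queries = [random_query() for _ in range(70)]
```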

Challenges we ran into

Challenge 1: The "wrong table" problem

  • Router initially picked large tables (4.8M rows) when small ones (366 rows) existed
  • Solution: Refined scoring algorithm to heavily penalize extra dimensions

Challenge 2: String comparison hell

  • Date filters like day='2024-06-01' took 800ms (character-by-character matching)
  • Solution: Stored dates as native datetime types → 234ms (3.4x faster via binary comparison)
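
The fix is a one-time cast when the Parquet files are built. A sketch with DuckDB (paths and column name illustrative):

```python
import duckdb

# Cast the string column to a native DATE once, at build time.
# Filters like day = DATE '2024-06-01' then compare fixed-width date values
# instead of matching strings character by character.
duckdb.sql("""
    COPY (
        SELECT * REPLACE (CAST(day AS DATE) AS day)
        FROM read_parquet('data/raw_events/*.parquet')
    )
    TO 'data/events_typed.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")
```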

Challenge 3: Sorting 500k+ rows

  • Pandas struggled with large sorts (500ms+)
  • Solution: Adaptive strategy—DuckDB for >10k rows (20-50ms), Pandas for smaller data
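
A sketch of the adaptive strategy, assuming results arrive as pandas DataFrames; the 10k-row cutoff matches the strategy above, everything else is illustrative:

```python
import duckdb
import pandas as pd

def adaptive_sort(df: pd.DataFrame, by: list[str]) -> pd.DataFrame:
    # Small results: pandas sort is cheap and avoids a round-trip to DuckDB.
    if len(df) <= 10_000:
        return df.sort_values(by)
    # Large results: let DuckDB's parallel sort handle it (20-50ms at ~500k rows).
    order = ", ".join(by)
    return duckdb.sql(f"SELECT * FROM df ORDER BY {order}").df()
```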

Challenge 4: Useless Parquet statistics

  • Unsorted data had statistics spanning entire dataset (e.g., day: Jan 1 to Dec 31)
  • Solution: Sort before export → row groups contain single day/country → 50x I/O reduction
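
A sketch of the fix with DuckDB (paths illustrative): sorting before COPY makes each Parquet row group span a narrow (day, country) range, so the per-row-group min/max statistics let pushed-down filters skip most of the file.

```python
import duckdb

# Sort before export so each Parquet row group covers a narrow (day, country)
# range, giving the row-group min/max statistics real pruning power.
duckdb.sql("""
    COPY (
        SELECT * FROM read_parquet('data/events_typed.parquet')
        ORDER BY day, country, advertiser_id
    )
    TO 'data/events_sorted.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

# A selective filter now reads only the matching row groups, not the whole file.
duckdb.sql("""
    SELECT country, SUM(revenue) AS revenue
    FROM read_parquet('data/events_sorted.parquet')
    WHERE day = DATE '2024-06-01'
    GROUP BY country
""").show()
```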

Accomplishments that we're proud of

1,000-2,000x speedup - Consistently fast across diverse query patterns

100% correctness - All 15 test queries match baseline results exactly

85% compression - 20GB → 3GB while enabling faster queries

Sub-100ms queries - Production-ready performance on 245M rows

Smart routing - The system automatically picks the optimal execution path without manual tuning

What we learned

Technical insights:

  • Sorting is everything: Organized data makes statistics useful—turned unusable min/max ranges into precise zone maps
  • Pre-aggregation ROI: 6 minutes of prep time saves 60-180 seconds per query (break-even after 2-6 queries)
  • Native types matter: String comparisons are 3-4x slower than binary operations
  • Parquet + sorting = magic: Columnar format + sorted data + predicate pushdown = 70-90% I/O reduction

Design principles:

  • Profile before optimizing: Break queries down into phases (read, filter, aggregate, sort) to find the real bottlenecks
  • Anti-overfit testing: Generate diverse queries, not just given samples
  • Hierarchical design: Build fallbacks (aggregation → partition scan) for flexibility

Database concepts we internalized:

  • Zone maps (sorted data → useful statistics)
  • Predicate pushdown (filter during read, not after)
  • Columnar storage advantages
  • Query planning heuristics

What's next for Dan-Query

Performance:

  • [ ] Parallelize aggregation building (less setup time)
  • [ ] Bloom filters for high-cardinality columns (user_id, auction_id)

Scalability:

  • [ ] Incremental updates (add new day's data without full rebuild)
  • [ ] Partition pruning metadata (pre-computed per-partition statistics)
  • [ ] Multi-level aggregations (daily → weekly → monthly hierarchy)

Intelligence:

  • [ ] Query pattern learning (track which aggregations are actually used)
  • [ ] Auto-tune aggregation set based on workload
  • [ ] Cost-based optimizer (estimate I/O before execution)

The vision: A self-optimizing analytics database that learns from query patterns and automatically tunes itself—making data analysts feel like they're querying a 1KB table, even when it's 1TB.
