Inspiration

Every day, 1.6+ billion people see AppLovin ads, generating massive data volumes that need real-time analysis. Traditional databases scan entire datasets for every query—wasting time and resources. We asked: What if we could predict query patterns and pre-compute answers?

Inspired by how search engines pre-index content, we built a system that "pre-thinks" about analytics queries before they're even asked.

What it does

Dan-Query transforms 20GB of ad event data into a 3GB optimized database that answers queries more than 1,000x faster than baseline DuckDB, typically in under 100ms.

Key features:

  • Smart pre-aggregation: 51 tables covering 90% of query patterns
  • Intelligent routing: Automatically picks the fastest execution path
  • Sorted storage: Enables 50x I/O reduction through zone maps
  • Sub-100ms queries: Even on a 245M-row dataset

Example: "Show daily revenue by country, filtered to Japan" drops from 70 seconds to 50 milliseconds.
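
To make that concrete, here is a minimal sketch in Python with DuckDB. The file paths and table names (raw_events/, agg/country_daily.parquet) are illustrative, not the exact production layout: the router answers the same question from a tiny (day, country) rollup instead of scanning the raw events.

```python
import duckdb

con = duckdb.connect()

# Baseline: scan every raw event (slow - reads essentially the whole dataset).
baseline = con.sql("""
    SELECT day, SUM(revenue) AS revenue
    FROM read_parquet('data/raw_events/*.parquet')
    WHERE country = 'JP'
    GROUP BY day
    ORDER BY day
""").df()

# Routed: read a pre-aggregated (day, country) rollup with a few hundred rows.
routed = con.sql("""
    SELECT day, SUM(revenue) AS revenue
    FROM read_parquet('agg/country_daily.parquet')
    WHERE country = 'JP'
    GROUP BY day
    ORDER BY day
""").df()
```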

How we built it

Architecture (3 layers):

  1. Storage: Type-partitioned Parquet files sorted by (day, country, advertiser_id)
  2. Pre-aggregations: 51 tables covering time (daily/weekly/hourly), geography (country), and entities (publisher/advertiser)
  3. Query router: Scoring algorithm that matches queries to optimal aggregation tables
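
The router in layer 3 works roughly like this simplified sketch (table names, row counts, and weights are illustrative, not the exact production scoring): a candidate aggregation must cover every column the query needs, extra dimensions are penalized heavily, and if nothing matches we fall back to a partition scan. This penalty is also what resolved Challenge 1 below.

```python
# Illustrative sketch of dimension-based table scoring (hypothetical names).
AGG_TABLES = {
    "country_daily":            {"dims": {"day", "country"},                  "rows": 366},
    "advertiser_country_daily": {"dims": {"day", "country", "advertiser_id"}, "rows": 4_800_000},
}

def score(table, query_cols):
    """Lower is better; returns None if the table cannot answer the query."""
    dims = AGG_TABLES[table]["dims"]
    if not query_cols <= dims:          # table must cover every column the query needs
        return None
    extra = len(dims - query_cols)      # heavily penalize unneeded dimensions
    return extra * 1_000 + AGG_TABLES[table]["rows"] / 1_000_000

def route(query_cols):
    scored = [(s, t) for t in AGG_TABLES if (s := score(t, query_cols)) is not None]
    return min(scored)[1] if scored else "partition_scan"   # fall back to raw partitions

print(route({"day", "country"}))   # -> country_daily (366 rows, not 4.8M)
```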

Tech stack:

  • DuckDB (query engine) + Pandas (data manipulation)
  • Parquet columnar format with SNAPPY compression
  • Python for orchestration

Development process:

  1. Built baseline with type partitions
  2. Generated 70+ test queries to avoid overfitting (see the sketch after this list)
  3. Profiled bottlenecks (file I/O, sorting, filtering)
  4. Iteratively optimized: sorted storage → native datetime types → predicate pushdown
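
For step 2, a minimal sketch of the kind of query generation we mean; the column names and the events table are hypothetical:

```python
import random

# Hypothetical columns; the real dataset's schema may differ.
DIMENSIONS = ["day", "country", "advertiser_id", "publisher_id"]
METRICS    = ["SUM(revenue)", "COUNT(*)"]
COUNTRIES  = ["JP", "US", "DE", "BR"]

def random_query():
    dims   = random.sample(DIMENSIONS, k=random.randint(1, 2))
    metric = random.choice(METRICS)
    where  = f"WHERE country = '{random.choice(COUNTRIES)}'" if "country" not in dims else ""
    return (f"SELECT {', '.join(dims)}, {metric} AS value "
            f"FROM events {where} GROUP BY {', '.join(dims)}")

test_queries = [random_query() for _ in range(70)]
```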

Challenges we ran into

Challenge 1: The "wrong table" problem

  • Router initially picked large tables (4.8M rows) when small ones (366 rows) existed
  • Solution: Refined scoring algorithm to heavily penalize extra dimensions

Challenge 2: String comparison hell

  • Date filters like day='2024-06-01' took 800ms (character-by-character matching)
  • Solution: Stored dates as native datetime types → 234ms (3.4x faster via binary comparison)
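
The fix is a one-time cast when the Parquet files are built. A sketch with DuckDB (paths and column name illustrative):

```python
import duckdb

# Cast the string column to a native DATE once, at build time.
# Filters like day = DATE '2024-06-01' then compare fixed-width date values
# instead of matching strings character by character.
duckdb.sql("""
    COPY (
        SELECT * REPLACE (CAST(day AS DATE) AS day)
        FROM read_parquet('data/raw_events/*.parquet')
    )
    TO 'data/events_typed.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")
```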

Challenge 3: Sorting 500k+ rows

  • Pandas struggled with large sorts (500ms+)
  • Solution: Adaptive strategy—DuckDB for >10k rows (20-50ms), Pandas for smaller data
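
A sketch of the adaptive strategy, assuming results arrive as pandas DataFrames; the 10k-row cutoff matches the strategy above, everything else is illustrative:

```python
import duckdb
import pandas as pd

def adaptive_sort(df: pd.DataFrame, by: list[str]) -> pd.DataFrame:
    # Small results: pandas sort is cheap and avoids a round-trip to DuckDB.
    if len(df) <= 10_000:
        return df.sort_values(by)
    # Large results: let DuckDB's parallel sort handle it (20-50ms at ~500k rows).
    order = ", ".join(by)
    return duckdb.sql(f"SELECT * FROM df ORDER BY {order}").df()
```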

Challenge 4: Useless Parquet statistics

  • Unsorted data had statistics spanning entire dataset (e.g., day: Jan 1 to Dec 31)
  • Solution: Sort before export → row groups contain single day/country → 50x I/O reduction
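
A sketch of the fix with DuckDB (paths illustrative): sorting before COPY makes each Parquet row group span a narrow (day, country) range, so the per-row-group min/max statistics let pushed-down filters skip most of the file.

```python
import duckdb

# Sort before export so each Parquet row group covers a narrow (day, country)
# range, giving the row-group min/max statistics real pruning power.
duckdb.sql("""
    COPY (
        SELECT * FROM read_parquet('data/events_typed.parquet')
        ORDER BY day, country, advertiser_id
    )
    TO 'data/events_sorted.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

# A selective filter now reads only the matching row groups, not the whole file.
duckdb.sql("""
    SELECT country, SUM(revenue) AS revenue
    FROM read_parquet('data/events_sorted.parquet')
    WHERE day = DATE '2024-06-01'
    GROUP BY country
""").show()
```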

Accomplishments that we're proud of

1,000-2,000x speedup - Consistently fast across diverse query patterns

100% correctness - All 15 test queries match baseline results exactly

85% compression - 20GB → 3GB while enabling faster queries

Sub-100ms queries - Production-ready performance on 245M rows

Smart routing - The system automatically picks the optimal execution path without manual tuning

What we learned

Technical insights:

  • Sorting is everything: Organized data makes statistics useful—turned unusable min/max ranges into precise zone maps
  • Pre-aggregation ROI: 6 minutes of prep time saves 60-180 seconds per query (break-even after 2-6 queries)
  • Native types matter: String comparisons are 3-4x slower than binary operations
  • Parquet + sorting = magic: Columnar format + sorted data + predicate pushdown = 70-90% I/O reduction

Design principles:

  • Profile before optimizing: Break queries down into phases (read, filter, aggregate, sort) to find the real bottlenecks
  • Anti-overfit testing: Generate diverse queries, not just given samples
  • Hierarchical design: Build fallbacks (aggregation → partition scan) for flexibility

Database concepts we internalized:

  • Zone maps (sorted data → useful statistics)
  • Predicate pushdown (filter during read, not after)
  • Columnar storage advantages
  • Query planning heuristics

What's next for Dan-Query

Performance:

  • [ ] Parallelize aggregation building (less setup time)
  • [ ] Bloom filters for high-cardinality columns (user_id, auction_id)

Scalability:

  • [ ] Incremental updates (add new day's data without full rebuild)
  • [ ] Partition pruning metadata (pre-computed per-partition statistics)
  • [ ] Multi-level aggregations (daily → weekly → monthly hierarchy)

Intelligence:

  • [ ] Query pattern learning (track which aggregations are actually used)
  • [ ] Auto-tune aggregation set based on workload
  • [ ] Cost-based optimizer (estimate I/O before execution)

The vision: A self-optimizing analytics database that learns from query patterns and automatically tunes itself—making data analysts feel like they're querying a 1KB table, even when it's 1TB.
