FastQuery | Devpost

Inspiration

FastQuery was inspired by AppLovin's Query Planning Challenge, which tests whether a data system can handle a large volume of real-world event data (14GB+ of Ad-Tech/E-commerce logs) and execute a diverse suite of complex analytical queries under a strict memory limit (e.g., 14GB of RAM). The core problem is getting through computationally intensive high-cardinality aggregations and full data scans without hitting an Out-of-Memory (OOM) error.

What it does

FastQuery provides a lightning-fast, memory-safe data analysis pipeline that processes multi-file CSV datasets and answers complex, aggregated SQL queries in milliseconds. It acts as an intelligent layer that translates high-level user requests (via JSON) into optimized SQL statements that run against pre-calculated summary tables, ensuring sub-second, reliable performance where a standard query engine would fail or crash due to resource exhaustion.

How we built it

We implemented a two-phase, hybrid processing pipeline using Polars and DuckDB:

Preparation Phase (prepare.py):

- Data Ingestion: Used Polars for highly efficient, memory-safe loading and type-casting of the raw CSV files.

- Normalization & Partitioning: Converted the data into a single, optimized Parquet dataset, partitioned by key columns like country for fast I/O.

- Aggregations: Used DuckDB's persistent database to create numerous, specialized, pre-aggregated tables (e.g., agg_bids_by_publisher, agg_auction_funnel, agg_unique_auctions_by_day). This moved the most computationally expensive operations, like distinct counting on UUIDs, offline.
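The offline aggregation step can be illustrated with a minimal sketch. It is not the project's prepare.py: it swaps in Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without dependencies, and the sample rows and every column name other than auction_id and distinct_auction_count are assumptions made for illustration.

```python
import sqlite3

# Hypothetical miniature of the event log: (day, publisher_id, auction_id, bid_price).
EVENTS = [
    ("2025-10-01", "pub_a", "auc-1", 0.50),
    ("2025-10-01", "pub_a", "auc-1", 0.75),  # second bid in the same auction
    ("2025-10-01", "pub_b", "auc-2", 1.25),
    ("2025-10-02", "pub_a", "auc-3", 2.00),
]


def build_aggregates(conn: sqlite3.Connection) -> None:
    """Load the raw events once, then materialize small summary tables."""
    conn.execute(
        "CREATE TABLE events "
        "(day TEXT, publisher_id TEXT, auction_id TEXT, bid_price REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", EVENTS)
    # The expensive distinct count is paid once, during preparation.
    conn.execute(
        """
        CREATE TABLE agg_unique_auctions_by_day AS
        SELECT day, COUNT(DISTINCT auction_id) AS distinct_auction_count
        FROM events
        GROUP BY day
        """
    )
    conn.execute(
        """
        CREATE TABLE agg_bids_by_publisher AS
        SELECT publisher_id,
               COUNT(*) AS bid_count,
               SUM(bid_price) AS total_bid_price
        FROM events
        GROUP BY publisher_id
        """
    )


conn = sqlite3.connect(":memory:")
build_aggregates(conn)

# At runtime the query only touches the small summary table.
rows = conn.execute(
    "SELECT day, distinct_auction_count "
    "FROM agg_unique_auctions_by_day ORDER BY day"
).fetchall()
print(rows)
```

The runtime query reads one row per day instead of scanning every event, which is the property the pre-aggregated tables are meant to provide.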

Runtime Phase (run.py & query_planner.py):

- Query Planning: A custom Python Query Planner analyzes the user's JSON query and determines which of the pre-aggregated tables can satisfy the request, based on the columns and aggregations it requires.

- Optimization: It then rewrites the query to run only against the smallest relevant aggregate table. For example, a request for COUNT(DISTINCT auction_id) becomes a direct lookup of the precomputed distinct_auction_count column.
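The routing idea can be sketched roughly as follows. This is not the project's query_planner.py: the catalog schemas, row counts, the JSON query shape, and the names AggTable, covers, and plan are all hypothetical. The sketch also assumes that precomputed distinct counts are only reused at the exact grouping they were computed for, since they cannot be summed across groups.

```python
from dataclasses import dataclass


@dataclass
class AggTable:
    name: str
    dimensions: frozenset  # columns the table is grouped by
    measures: dict         # aggregate expression -> (precomputed column, additive?)
    row_count: int         # rough size, used to prefer the smallest table


# Hypothetical catalog: schemas and row counts are invented for illustration.
CATALOG = [
    AggTable(
        "agg_unique_auctions_by_day",
        frozenset({"day"}),
        {"COUNT(DISTINCT auction_id)": ("distinct_auction_count", False)},
        365,
    ),
    AggTable(
        "agg_auction_funnel",
        frozenset({"day", "country"}),
        {
            "COUNT(DISTINCT auction_id)": ("distinct_auction_count", False),
            "COUNT(*)": ("event_count", True),
        },
        80_000,
    ),
]


def covers(table: AggTable, group_by: list, wanted: list) -> bool:
    """A table qualifies if it has the columns and can be rolled up safely."""
    if not set(group_by) <= table.dimensions:
        return False
    if not all(agg in table.measures for agg in wanted):
        return False
    exact_grain = set(group_by) == table.dimensions
    # Distinct counts cannot be summed across groups, so they need an exact match.
    return exact_grain or all(table.measures[agg][1] for agg in wanted)


def plan(query: dict) -> str:
    """Translate a JSON-style query into SQL over the smallest covering table."""
    group_by = query.get("group_by", [])
    wanted = query["select"]
    candidates = [t for t in CATALOG if covers(t, group_by, wanted)]
    if not candidates:
        raise LookupError("no aggregate table covers this query")
    table = min(candidates, key=lambda t: t.row_count)
    exact_grain = set(group_by) == table.dimensions
    columns = list(group_by)
    for agg in wanted:
        column, _additive = table.measures[agg]
        # Additive measures are re-summed when the table is finer than the query.
        columns.append(column if exact_grain else f"SUM({column}) AS {column}")
    sql = f"SELECT {', '.join(columns)} FROM {table.name}"
    if group_by and not exact_grain:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


print(plan({"select": ["COUNT(DISTINCT auction_id)"], "group_by": ["day"]}))
print(plan({"select": ["COUNT(*)"], "group_by": ["country"]}))
```

A request that no table covers raises LookupError here; a real planner would need some fallback, such as a guarded scan of the Parquet data.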

Challenges we ran into

The primary challenge was ensuring memory safety and accuracy simultaneously:

Out-of-Memory (OOM) Errors: High-cardinality aggregations (COUNT(DISTINCT auction_id) or GROUP BY auction_id) on the raw dataset caused the system to exceed the 14GB memory limit. This required the development of explicit, specialized aggregate tables (like agg_auction_funnel) to move these resource-intensive tasks to the offline prepare phase.

Accuracy Issues (Float Mismatch): Minor precision errors arose from aggregating floating-point numbers (prices) in different orders across various pre-calculated tables. This necessitated meticulous debugging of the query planner's routing logic to ensure consistency with the ground truth.
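The float mismatch is easy to reproduce in isolation. The snippet below is a small illustration with made-up prices, not the project's code: it shows that grouping the same additions differently changes the result, and that a tolerance-based comparison (one common approach, not necessarily the one used here) treats the two sums as equal. The helper matches_ground_truth and its tolerances are assumptions.

```python
import math

# Hypothetical prices, chosen because they expose rounding differences.
prices = [0.1, 0.2, 0.3]

# The same three values, added with different grouping.
left_to_right = (prices[0] + prices[1]) + prices[2]
right_to_left = prices[0] + (prices[1] + prices[2])

# Exact equality fails even though the inputs are identical.
print(left_to_right == right_to_left)


def matches_ground_truth(actual: float, expected: float, rel_tol: float = 1e-9) -> bool:
    """Compare two aggregates while tolerating rounding noise."""
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-12)


print(matches_ground_truth(left_to_right, right_to_left))

# math.fsum returns the correctly rounded sum, so input order does not matter.
print(math.fsum(prices) == math.fsum(reversed(prices)))
```

When results must match a reference bit for bit, tolerance is not enough, and the aggregation order itself has to be kept consistent, which is why routing a query to a different pre-calculated table can change the answer.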

Accomplishments that we're proud of

We achieved a verifiable and dramatic improvement in both speed and reliability:

99.9% Speed Improvement: The system processes a critical subset of 5 queries (from the original challenge README) on the entire dataset in 35 milliseconds, compared to 52 seconds for the provided baseline. That is a speedup of roughly 1,500x.

Memory Resilience: Successfully implemented robust, permanent solutions for all OOM-inducing queries by utilizing offline aggregation, ensuring the system runs reliably within the 14GB memory budget.

High Accuracy: Passed all 60+ diverse test queries with high precision against the ground truth.

What we learned

The core lesson is that for high-volume data analysis, query planning and pre-aggregation are essential for good performance. By using a fast DataFrame library like Polars for preparation and a feature-rich engine like DuckDB for persistent aggregation, it is possible to shift expensive computation from runtime to the offline preparation phase. With this approach, even very large datasets can answer complex queries in the millisecond range.

What's next for FastQuery

Future plans focus on further hardening and expanding the system:

Generalization of Planning Logic: Developing a more abstract system that automatically handles high-cardinality distinct counts (COUNT(DISTINCT column) for any column) without needing a new, hand-coded routing rule for every new column.

Advanced Query Support: Integrating support for more complex SQL operations (e.g., joins across different pre-aggregated tables and window functions) while strictly preserving memory constraints.

Dynamic Aggregation: Creating a mechanism to automatically determine the optimal set of aggregate tables based on observed real-world query patterns, rather than relying on a manually pre-defined schema.

Built With

python

Updates

Suprad Parashar started this project — Oct 26, 2025 12:08 PM EDT
