SQL optimization advice for large skewed left joins in Spark SQL
Spark SQL/Databricks

I'm dealing with a serious SQL performance problem in Spark 3.2.2. My job runs a left join between a large fact table (~100M rows) and a dimension table (~5M rows, ~200MB). During the join, a few tasks take far longer than the rest due to extreme key skew, and the job sometimes fails with OOM.
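For context, the query is shaped roughly like this (table and column names are placeholders, not my real schema):

```sql
-- Illustrative shape of the job: a ~100M-row fact table left joins a
-- ~5M-row (~200MB) dimension table on a heavily skewed key.
SELECT f.*, d.segment
FROM fact_events f
LEFT JOIN dim_customers d
  ON f.customer_id = d.customer_id;
```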

I already increased executor memory to 16GB, which only helped temporarily. I enabled AQE (spark.sql.adaptive.enabled = true), but the skew-join optimization never triggers. I also tried broadcast join hints, but Spark still chooses a shuffle join. Salting the keys with random suffixes (which requires replicating the dimension rows across all suffixes) inflated the data roughly 10x and caused worse memory issues.
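For reference, this is roughly what I tried (same placeholder names as above):

```sql
-- AQE enabled, but the skew-join optimization never kicks in:
SET spark.sql.adaptive.enabled = true;

-- Broadcast hint on the dimension side; Spark still picks a shuffle join:
SELECT /*+ BROADCAST(d) */ f.*, d.segment
FROM fact_events f
LEFT JOIN dim_customers d
  ON f.customer_id = d.customer_id;
```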

My questions:

  • Why would Spark refuse to apply a broadcast join when the table looks small enough? Could data types, nulls, or statistics prevent it?

  • Why does AQE not detect such a clear skew, and what exact conditions are needed for it to activate? (I've listed the skew settings I'm aware of after these questions.)

  • Beyond memory increases and random suffix hacks, what real SQL-level optimization strategies could help, like repartitioning, bucketing, custom partitioning, or specific Spark SQL configs?

  • Any practical experience or insights with large skewed left joins in SQL / Spark SQL would be very helpful.
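For the AQE question above, these are the skew-related settings I'm aware of, with what I believe are the Spark 3.2 defaults (please correct me if I've misread the docs):

```sql
-- My understanding: AQE only treats a shuffle partition as skewed if it
-- is both larger than skewedPartitionFactor times the median partition
-- size AND larger than skewedPartitionThresholdInBytes.
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5;
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB;

-- The default broadcast threshold is 10MB, well below my ~200MB dimension
-- table; raising it is one thing I could try, though I'm unsure whether
-- the BROADCAST hint should already override it.
SET spark.sql.autoBroadcastJoinThreshold = 256MB;
```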