Benchmarks
Benchmark Result
Here is the benchmark result of TPC-DS 1TB Dataset, running under Apache Spark-3.5.6 and Apache Auron (Incubating)-6.0.0-preview (dc8d7a9).
Below is a brief introduction of how we run TPC-DS benchmark with Apache Spark/Apache Auron (Incubating).
Get TPC-DS tools
git clone https://github.com/auron-project/tpcds-tools
Generate TPC-DS dataset
Compile datagen tool (derived from maropu/spark-tpcds-datagen).
cd tpcds/datagen
mvn package -DskipTests
Generate 1TB dataset with spark.
# use correct SPARK_HOME and output data location
# --use-double-for-decimal and --use-string-for-char are optional, see dsdgen usage
SPARK_HOME=$HOME/software/spark ./bin/dsdgen \
--output-location /user/hive/data/tpcds-1000 \
--scale-factor 1000 \
--format parquet \
--overwrite \
--use-double-for-decimal \
--use-string-for-char
Run benchmark
Compile benchmark tool (derived from databricks/spark-sql-perf).
cd tpcds/benchmark-runner
mvn package -DskipTests
Edit your $SPARK_HOME/conf/spark-default.conf to enable/disable Apache Auron (Incubating) (see the following conf), then launch benchmark runner. If benchmarking with Apache Auron (Incubating), ensure that the Apache Auron (Incubating) jar package is correctly built and moved into $SPARK_HOME/jars. (How to build Apache Auron (Incubating)?)
# use correct SPARK_HOME and data location
SPARK_HOME=$HOME/software/spark ./bin/run \
--data-location /user/hive/data/tpcds-1000 \
--format parquet \
--output-dir ./benchmark-result
Monitor benchmark status:
tail -f ./benchmark-result/YYYYMMDDHHmm/log
Summarize query times of all cases:
./bin/stat ./benchmark-result/YYYYMMDDHHmm
Benchmark configuration
here is a simple configuration used for benchmarking, please notice that the benchmark result will slight differ when running on different environments.
spark.master yarn
spark.yarn.queue offline
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///home/spark-eventlog
spark.history.fs.logDirectory hdfs:///home/spark-eventlog
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.driver.memory 20g
spark.driver.memoryOverhead 4096
spark.executor.instances 10000
spark.dynamicallocation.maxExecutors 10000
spark.executor.cores 8
spark.io.compression.codec lz4
spark.sql.parquet.compression.codec zstd
# benchmark without auron
#spark.executor.memory 20g
#spark.executor.memoryOverhead 4096
# benchmark with auron
spark.executor.memory 8g
spark.executor.memoryOverhead 16384
spark.sql.extensions org.apache.spark.sql.auron.AuronSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.auron.shuffle.AuronShuffleManager
spark.memory.offHeap.enabled false
spark.auron.enable true
spark.auron.memoryFraction 0.8
spark.auron.process.vmrss.memoryFraction 0.8
spark.auron.tokio.worker.threads.per.cpu 1
spark.auron.forceShuffledHashJoin true
spark.auron.smjfallback.enable true
spark.auron.smjfallback.mem.threshold 512000000
spark.auron.udafFallback.enable true
spark.auron.partialAggSkipping.skipSpill true
