82,634 questions
0
votes
1
answer
68
views
Optimize code to flatten meta ads metrics data in spark
I have two spark scripts, first as a bronze script need to data form kafka topics each topic have ads platform data ( tiktok_insights, meta_insights, google_insights ). Structure are same,
( id, ...
0
votes
0
answers
62
views
spark flatMapToPair reaching "no space left on device" due to large duplication of entries
First, my question is not on increasing disk space to avoid no space left error, but to understand what spark does, and hopefully how to improve my code.
In short, here is the pseudo code:
JavaRDD&...
1
vote
1
answer
91
views
Difference between org.apache.hadoop.io.compress.CompressionCodec and org.apache.spark.io.CompressionCodec
I want to use a compression in bigdata processing, but there are two compression codecs.
Anyone know the difference?
Advice
0
votes
4
replies
72
views
Use RSA key snowflake connection options instead of Password
I want to connect to a Snowflake database from the Data Bricks notebook. I have an RSA key(.pem file) and I don't want to use a traditional method like username and password as it is not as secure as ...
0
votes
1
answer
98
views
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups.
Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
...
1
vote
0
answers
108
views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning (pl.scan_delta(temp_path) a delta table that ...
1
vote
1
answer
46
views
How to detect Spark application failure in SparkListener when no jobs are executed?
I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
0
votes
1
answer
62
views
How to delete specific version(s) from a Delta table? [closed]
When using the Delta format, it is possible to time-travel to a specific version of the table. In my case, some of these versions are corrupted. I would like to delete/remove/drop them. For instance, ...
0
votes
0
answers
37
views
How to dynamically cast columns in a dbt-spark custom materialization to resolve UNION ALL schema mismatch?
I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy.
The Logic I compare the ...
2
votes
0
answers
52
views
How log model in mlflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
MLFlow tracking server should serve all artifacts and is configured this way:
--backend-store-...
0
votes
1
answer
66
views
Handle corrupted files in spark load()
I have a spark job that runs daily to load data from S3.
These data are composed of thousands of gzip files. However, in some cases, there is one or two corrupted files in S3, and it causes the whole ...
-1
votes
2
answers
53
views
Connectivity issues in standalone Spark 4.0
In Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
1
vote
1
answer
134
views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new in Spark (specifically, have just started with learning), and I have encountered a recursion error in a very simple code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
-1
votes
0
answers
380
views
Implementing Incremental Data Quality Validation in Large-Scale ETL Pipelines with Schema Evolution
I'm working on a large-scale ETL pipeline processing ~500GB daily across multiple data sources. We're currently using Great Expectations for data quality validation, but facing performance bottlenecks ...
4
votes
2
answers
281
views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...