82,642 questions
Advice · 0 votes · 4 replies · 48 views
Use RSA key Snowflake connection options instead of password
I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use a traditional method like username and password, as it is not as secure as ...
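A minimal sketch of key-pair authentication with the Snowflake Spark connector, assuming the connector is installed on the cluster and the .pem key is unencrypted; the account, user, table, and key path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The connector expects the base64 body of the key, so strip the
# PEM header/footer lines and any line breaks.
with open("/dbfs/keys/rsa_key.pem") as f:  # placeholder path
    pem_body = "".join(line.strip() for line in f if "-----" not in line)

df = (
    spark.read.format("snowflake")
    .option("sfURL", "myaccount.snowflakecomputing.com")  # placeholder
    .option("sfUser", "MY_USER")
    .option("sfDatabase", "MY_DB")
    .option("sfSchema", "PUBLIC")
    .option("sfWarehouse", "MY_WH")
    .option("pem_private_key", pem_body)  # key-pair auth in place of sfPassword
    .option("dbtable", "MY_TABLE")
    .load()
)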
0 votes · 0 answers · 49 views
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups.
Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
    WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
    ...
...
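One way to probe this empirically: put an expression that fails loudly in the branch that should be skipped and see whether the query still succeeds. A sketch using Spark's raise_error() as the tripwire (the table and data are made up, and the optimizer can occasionally evaluate expressions it deems safe, so treat the result as indicative rather than a guarantee):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [("1.2.3.4",), ("::1",)], ["ip_address"]
).createOrReplaceTempView("ips")

# If CASE evaluated both branches for every row, raise_error() would
# fire even for the IPv4 row; if this query succeeds, the untaken
# branch was not evaluated for '1.2.3.4'.
spark.sql("""
    SELECT ip_address,
           CASE
             WHEN ip_address LIKE '%:%:%' THEN raise_error('IPv6 branch ran')
             ELSE 'ipv4'
           END AS probe
    FROM ips
    WHERE ip_address = '1.2.3.4'
""").show()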
-1 votes · 0 answers · 47 views
Optimizing Spark job to detect and extract Enum values (low-cardinality columns) across thousands of Hive/MySQL tables [closed]
I am building a data profiling tool to iterate through all tables in our data warehouse (a mix of Hive and MySQL tables) to identify and extract all possible values for "Enum-like" columns. ...
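One common shape for the per-table step, sketched in PySpark with hypothetical names: a single aggregation pass using approximate distinct counts to flag low-cardinality columns, then a second pass that collects values only for the flagged ones:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ENUM_THRESHOLD = 20  # illustrative cutoff for "enum-like"
df = spark.table("warehouse.some_table")  # hypothetical table

# One pass over the table: approximate distinct count per column.
counts = df.agg(
    *[F.approx_count_distinct(c).alias(c) for c in df.columns]
).first().asDict()

enum_cols = [c for c, n in counts.items() if n <= ENUM_THRESHOLD]

# Exact values only for the few columns that passed the filter.
for c in enum_cols:
    values = [r[0] for r in df.select(c).distinct().collect()]
    print(c, values)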
1 vote · 0 answers · 71 views
Warning and performance issues when scanning Delta tables
Why do I get multiple warnings "[WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed" when scanning a Delta table with pl.scan_delta(temp_path), a table that ...
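For context, the warning appears during an ordinary lazy scan; a minimal repro sketch (the path is a placeholder, and the message itself comes from the Rust delta-kernel log reader, apparently when its consumer stops pulling log records early):

import polars as pl

temp_path = "/tmp/my_delta_table"  # placeholder

lf = pl.scan_delta(temp_path)  # lazy: the Delta log is read here
df = lf.head(10).collect()     # data files are only read at collect()
print(df)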
1 vote · 1 answer · 24 views
How to detect Spark application failure in SparkListener when no jobs are executed?
I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
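PySpark cannot subclass the JVM SparkListener directly, so as a driver-side alternative (plainly a different technique than a listener): record the outcome around the main logic and act on it before spark.stop(), which is the moment onApplicationEnd fires. A sketch with a placeholder workload:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

failed = False
try:
    spark.range(10).count()  # placeholder workload
except Exception:
    # Catches driver-side errors even when no job ever started.
    failed = True
    raise
finally:
    print("application failed:", failed)
    spark.stop()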
0 votes · 1 answer · 44 views
How to delete specific version(s) from a Delta table?
When using the Delta format, it is possible to time-travel to a specific version of the table. In my case, some of these versions are corrupted. I would like to delete/remove/drop them. For instance, ...
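Delta has no API for dropping an individual version; what it does support is trimming history, so that old (including corrupted) versions become unreachable and VACUUM removes their files. A sketch using the delta-spark Python API; the retention value is illustrative, and going below the default 7 days requires disabling the safety check:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    # Only needed when vacuuming below the default retention window.
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    .getOrCreate()
)

dt = DeltaTable.forPath(spark, "/path/to/table")  # placeholder path

# Keep 24 hours of history; versions older than that can no longer be
# time-travelled to once their files are removed.
dt.vacuum(24)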
0 votes · 0 answers · 28 views
How to dynamically cast columns in a dbt-spark custom materialization to resolve UNION ALL schema mismatch?
I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy.
The logic: I compare the ...
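The materialization would emit this as Jinja/SQL, but the alignment idea can be sketched in PySpark (hypothetical table names, not a dbt macro): treat the existing table's schema as the contract, add missing columns as NULLs, and cast everything before the union:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

existing = spark.table("target_table")   # hypothetical
incoming = spark.table("staging_table")  # hypothetical

# Add missing columns as NULL and cast all columns to the target type.
aligned = incoming.select(*[
    (F.col(f.name) if f.name in incoming.columns else F.lit(None))
        .cast(f.dataType)
        .alias(f.name)
    for f in existing.schema.fields
])

unioned = existing.unionByName(aligned)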
2 votes · 0 answers · 43 views
How to log a model in MLflow using Spark Connect
I have the following setup:
a Kubernetes cluster with Spark Connect 4.0.1 and
an MLflow tracking server 3.5.0.
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
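For reference, the usual logging call is sketched below; whether mlflow.spark.log_model works over Spark Connect is version-dependent, since the Spark flavor historically saved the model through the JVM session. The URLs are placeholders:

import mlflow
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()  # placeholder
mlflow.set_tracking_uri("http://mlflow-tracking:5000")  # placeholder

df = spark.createDataFrame([(1.0, 2.1), (2.0, 4.0), (3.0, 6.2)], ["x", "y"])
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)

with mlflow.start_run():
    # The save step is the part most likely to differ across
    # MLflow / Spark Connect versions.
    mlflow.spark.log_model(model, "model")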
0 votes · 1 answer · 57 views
Handle corrupted files in Spark load()
I have a Spark job that runs daily to load data from S3.
The data consist of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, and that causes the whole ...
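One hedged option: tell Spark to skip unreadable files rather than fail the job. spark.sql.files.ignoreCorruptFiles is a standard setting, though for gzip it only helps when the corruption is detectable at decompression time; the path and format are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Skip files that throw on read instead of failing the whole job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = (
    spark.read
    .option("mode", "PERMISSIVE")     # tolerates malformed records (JSON/CSV)
    .json("s3a://bucket/daily/*.gz")  # placeholder location
)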
-1 votes · 2 answers · 47 views
Connectivity issues in standalone Spark 4.0
On an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
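A typical connection sketch for a standalone master on the same VM; the master URL must match what the master web UI (port 8080 by default) reports, and a driver/executor Python mismatch is a common source of confusing errors. The interpreter path is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")  # must match the master UI's URL
    .appName("notebook-test")
    # Point executors at the same interpreter as the notebook.
    .config("spark.pyspark.python", "/usr/bin/python3.11")  # placeholder
    .getOrCreate()
)

print(spark.range(5).collect())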
1 vote · 1 answer · 114 views
PicklingError: Could not serialize object: RecursionError in PySpark code in Jupyter Notebook
I am very new to Spark (I have just started learning), and I have encountered a recursion error in very simple code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
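A frequent trigger for "Could not serialize object" is a UDF whose closure captures driver-side objects (the SparkSession, a DataFrame, a notebook class); whether that is the cause here would need the actual code, but the safe pattern is a function that touches only its arguments:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

def double(x: int) -> int:
    # Pure function of its input: nothing from the driver
    # (no spark, no DataFrames) ends up in the closure.
    return x * 2

double_udf = F.udf(double, IntegerType())

spark.range(5).withColumn("doubled", double_udf(F.col("id"))).show()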
-1 votes · 0 answers · 373 views
Implementing Incremental Data Quality Validation in Large-Scale ETL Pipelines with Schema Evolution
I'm working on a large-scale ETL pipeline processing ~500GB daily across multiple data sources. We're currently using Great Expectations for data quality validation, but we're facing performance bottlenecks ...
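One common pattern, sketched here in plain PySpark rather than Great Expectations (hypothetical table and column names): validate only the newly arrived slice instead of re-checking history, and fail fast on that slice:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Check only today's partition, not the full history.
batch = spark.table("events").where(F.col("ds") == "2024-01-15")  # hypothetical

checks = batch.agg(
    F.count("*").alias("rows"),
    F.sum(F.col("user_id").isNull().cast("int")).alias("null_user_ids"),
    F.countDistinct("event_id").alias("distinct_event_ids"),
).first()

assert checks.rows > 0, "empty batch"
assert checks.null_user_ids == 0, "null user_id found"
assert checks.distinct_event_ids == checks.rows, "duplicate event_id"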
2 votes · 1 answer · 187 views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...
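For what it's worth, this NumberFormatException is usually version skew: newer Hadoop ships duration-valued defaults such as 60s that an older hadoop-aws parses as a bare long, so the first thing to check is that hadoop-aws matches the Hadoop version bundled with Spark 4.0.0. A MinIO configuration sketch, assuming matched JARs (endpoint and credentials are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # placeholder
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")       # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")       # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.range(10).write.format("delta").save("s3a://bucket/test-table")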
0 votes · 0 answers · 27 views
Large variation in Spark runtimes
Long story short, my team was hired to take on some legacy code, and it was running in about 5 hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
3 votes · 1 answer · 102 views
Spark-Redis write loses rows when writing large DataFrame to Redis
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small ...
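A sketch of the write path with the spark-redis connector, plus the first thing worth ruling out: Redis is a key-value store, so rows whose key.column values collide silently overwrite each other and look like data loss. Table and column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("big_table")  # placeholder

# If the key column is not unique, later rows overwrite earlier ones
# in Redis and the "lost" rows never existed as separate keys.
dups = df.groupBy("id").count().where(F.col("count") > 1).count()
print(f"duplicate keys: {dups}")

(df.write
   .format("org.apache.spark.sql.redis")
   .option("table", "big_table")
   .option("key.column", "id")
   .mode("append")
   .save())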