Newest 'google-cloud-dataproc' Questions

0 votes

2 answers

94 views

Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure

I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches. The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...

Aakash Shrivastav

reputation score 1

asked Mar 11 at 10:08

1 vote

1 answer

153 views

How to authenticate with a Service Account in a Dataproc cluster for a DuckDB connection to BigQuery

I'm trying to authenticate with an already pre-signed-in service account (SA) in a Dataproc cluster. I'm configuring a DuckDB connection with the BigQuery extension and I can't seem to reuse the ...

Aleksander Lipka

reputation score 538

asked Dec 22, 2025 at 14:37

1 vote

1 answer

85 views

DataprocSparkSession package in Python error - "RuntimeError: Error while creating Dataproc Session"

I am using the code below to create a Dataproc Spark session to run a job: from google.cloud.dataproc_spark_connect import DataprocSparkSession from google.cloud.dataproc_v1 import Session session = Session(...

Siddiq Syed

reputation score 11

asked Oct 2, 2025 at 8:07

0 votes

0 answers

67 views

What connector can be used for Google Cloud Pub/Sub along with Cloud Dataproc (Spark 3.5.x)?

I am using Spark 3.5.x and would like to use the readStream() API to read data with Structured Streaming in Java. I don't see any Pub/Sub connector available. I couldn't try Pub/Sub Lite because it is deprecated ...

Sunil

reputation score 441

asked Sep 24, 2025 at 13:07

0 votes

1 answer

73 views

ModuleNotFoundError in GCP after trying to submit a job

I am new to GCP. I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...

SofiaNiki

reputation score 1

asked Sep 3, 2025 at 11:16

3 votes

1 answer

166 views

BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"

We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error: Caused by: com.google.cloud.spark.bigquery.repackaged....

Anshul Dubey

reputation score 305

asked Jul 7, 2025 at 18:05

1 vote

0 answers

73 views

Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"

Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...

Lê Văn Đức

reputation score 11

asked Jun 7, 2025 at 11:00

2 votes

1 answer

224 views

Spark memory error in thread spark-listener-group-eventLog

I have a PySpark application which uses GraphFrames to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...

Jesus Diaz Rivero

reputation score 332

asked May 7, 2025 at 12:16

1 vote

0 answers

76 views

Out of memory for a smaller dataset

I have a PySpark job that reads only ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs with 1-15 workers under autoscaling. Each worker VM of type n2-highmem-...

user16798185

reputation score 387

asked May 5, 2025 at 16:07

1 vote

2 answers

101 views

How do you run Python Hadoop Jobs on Dataproc?

I am trying to run my Python code as a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...

The Beast

reputation score 163

asked Apr 29, 2025 at 11:03

2 votes

0 answers

58 views

Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?

I am trying to migrate a job that runs on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11). However, I encounter an error on one of my queries. After further ...

AlexisBRENON

reputation score 3189

asked Mar 17, 2025 at 11:48

1 vote

0 answers

46 views

Partial records being read in PySpark through Dataproc

I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers: Content-type: application/octet-stream, Content-encoding: gzip, FileName: gs://...

Bob

reputation score 383

asked Mar 15, 2025 at 22:26

2 votes

1 answer

143 views

Default behavior of Spark 3.5.1 when writing NUMERIC/BIGNUMERIC to BigQuery

According to the documentation, with Spark 3.5.1 and the latest spark-bigquery-connector, a Spark Decimal(38,0) should be written as NUMERIC in BigQuery. https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...

Abhilash

reputation score 31

asked Mar 10, 2025 at 18:15

1 vote

2 answers

190 views

How to pass arguments from GCP Workflows into Dataproc

I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters which are provided through the workflow API. I defined a GCP ...

54m

reputation score 777

asked Mar 5, 2025 at 16:36

1 vote

0 answers

123 views

Error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use Airflow DataprocSubmitJobOperator

I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....

Abhijit Aravind

reputation score 11

asked Mar 4, 2025 at 22:11
