Different Types of Read Modes and Write Modes in Spark?

Omid Ebrahimi | TheCodeZ — Thu, 03 Apr 2025 19:06:32 GMT

In Apache Spark, every DataFrame has a schema: the name and data type of each column. When reading data, there are three ways Spark can obtain that schema:

1. InferSchema (Automatically Inferring Data Types)

In this method, Spark automatically analyzes the data and tries to guess the type of each column. For example, if a column contains numbers, Spark will determine whether it should be an Integer or a Double. This method is easy to use, but it requires an extra pass over the data and does not always pick the type you intended.

df = spark.read.option("inferSchema", "true").csv("data.csv")

Here, Spark automatically determines the type of each column.
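
To see how inference can pick a type you did not intend, here is a minimal plain-Python sketch. It is a toy model written for this post, not Spark's actual inference algorithm, and the function name `infer_type` and the sample columns are made up for illustration.

```python
# Toy model of column type inference; NOT Spark's actual algorithm.
def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # treat empty strings as missing values
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


# Hypothetical sample columns, as raw strings read from a CSV file.
columns = {
    "age": ["34", "27", ""],
    "price": ["9.99", "12", "3.5"],
    "zip_code": ["00501", "02134", "10001"],  # identifiers, not numbers
}

for name, values in columns.items():
    print(name, "->", infer_type(values))
```

In this sketch the zip_code column is guessed as an integer because every value parses as one, which would drop the leading zeros. Declaring that column as a string avoids the problem.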

2. Implicit Schema (Schema Already Defined in the File Format)

Some file formats, like Parquet, ORC, and Avro, store schema information (column names and types) inside the file. When loading these files, Spark uses the existing schema, so you don’t need to define the data types manually. JSON is different: it carries field names but no type declarations, so Spark still has to scan the records to infer the types.

df = spark.read.parquet("data.parquet")

Since Parquet files store the schema in their metadata, Spark reads it from the file itself.

3. Explicit Schema (Manually Defining the Schema)

Sometimes, it’s better to explicitly define the schema, especially if the data is not standardized or you want to have full control over the data types. There are two main ways to do this:

3.1. Programmatically Specified Schema

In this method, we use StructType and StructField to manually define the type of each column.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.read.schema(schema).csv("data.csv")

Here, we explicitly set the "name" column as a String and the "age" column as an Integer.

3.2. DDL String Schema

Instead of using StructType, we can define the schema as a simple string.

df = spark.read.schema("name STRING, age INT").csv("data.csv")

This quickly defines "name" as a String and "age" as an Integer.
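
To make concrete what a schema such as "name STRING, age INT" declares, here is a small plain-Python sketch that splits the string into (name, type) pairs and casts raw values accordingly. It is only an illustration: Spark parses DDL strings with its own SQL parser, and the helpers `parse_ddl` and `apply_schema`, as well as the choice to turn unparseable values into None, are assumptions made for this example.

```python
# Plain-Python illustration only; Spark uses its own SQL parser for DDL strings.
CASTS = {"STRING": str, "INT": int, "DOUBLE": float}


def parse_ddl(ddl):
    """Split 'name STRING, age INT' into [('name', 'STRING'), ('age', 'INT')]."""
    fields = []
    for part in ddl.split(","):
        name, type_name = part.split()
        fields.append((name, type_name.upper()))
    return fields


def apply_schema(row, fields):
    """Cast each raw string to its declared type; unparseable values become None."""
    typed = {}
    for raw, (name, type_name) in zip(row, fields):
        try:
            typed[name] = CASTS[type_name](raw)
        except ValueError:
            typed[name] = None  # a choice made for this sketch
    return typed


fields = parse_ddl("name STRING, age INT")
print(fields)
print(apply_schema(["Alice", "34"], fields))
print(apply_schema(["Bob", "unknown"], fields))
```

The point of the sketch is that an explicit schema fixes the column types before any data is read, so a bad value shows up as a visible gap instead of silently changing the type of the whole column.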

Summary

InferSchema → Spark automatically guesses the data types.
Implicit Schema → If the file format (like Parquet or ORC) already stores a schema, Spark will use it automatically.
Explicit Schema → You manually define the data types for better accuracy and control.

💡 Tip: For real-world data, an explicit schema is generally considered best practice: it avoids the extra inference pass and keeps column types the same from run to run.

Case Study: How to quickly and accurately locate specific Telegram messages within a massive volume of data?

Omid Ebrahimi | TheCodeZ — Sun, 19 Jan 2025 15:01:07 GMT

Post Hunter ETL Pipeline

My client managed several large Telegram groups, each generating thousands of messages daily. They needed a system to identify specific messages containing certain keywords, phrases, or content amidst this overwhelming volume.
The manual approach they had been using was highly time-consuming, inefficient, and error-prone. Additionally, the client required a graphical and user-friendly interface to interact with the results easily. Their main needs included:

Finding key messages from different time periods.
Analyzing data quickly and accurately for better decision-making.
Displaying the results in a simple interface, accessible to users without technical expertise.

Solution: Designing and Implementing the Post Hunter Pipeline to Search and Display Messages

To address this challenge, I designed an integrated and flexible system with the following components:

Data Extraction from Telegram:
I developed a scraper using the Telegram API and Python to extract messages from specific groups and channels. The scraper was designed to retrieve targeted data with high speed and accuracy.
Data Processing with Kafka:
The extracted messages were sent to Apache Kafka for efficient transfer and real-time processing of large volumes of data. Kafka also facilitated seamless coordination between various components of the system.
Storage and Search in Elasticsearch:
The processed messages were stored in Elasticsearch, enabling advanced querying capabilities to locate specific messages. This feature allowed the client to quickly find desired messages using filters and defined keywords.
Graphical User Interface (GUI):
To present the messages to the client, I developed a simple and user-friendly graphical application. This application allowed users to view and analyze the queried messages in detail. The interface was intuitive, making it accessible even to non-technical users.
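
The flow of the components above can be sketched in a few lines of plain Python. This is not the project's code: a `Queue` stands in for a Kafka topic, a dictionary stands in for an Elasticsearch index, and every function name, field name, and sample message is hypothetical.

```python
# In-memory stand-ins: a Queue plays the role of a Kafka topic and a dict
# plays the role of an Elasticsearch index. All names here are hypothetical.
from collections import defaultdict
from queue import Queue


def extract(messages, topic):
    """Producer side: push scraped messages onto the topic."""
    for message in messages:
        topic.put(message)


def index_messages(topic, index, store):
    """Consumer side: drain the topic and build a word -> message-id index."""
    while not topic.empty():
        message = topic.get()
        store[message["id"]] = message
        for word in message["text"].lower().split():
            index[word].add(message["id"])


def search(keyword, index, store):
    """Return stored messages containing the keyword, ordered by id."""
    ids = sorted(index.get(keyword.lower(), set()))
    return [store[i] for i in ids]


# Made-up sample data standing in for scraped Telegram messages.
scraped = [
    {"id": 1, "group": "sales", "text": "New invoice sent today"},
    {"id": 2, "group": "sales", "text": "Lunch at noon"},
    {"id": 3, "group": "support", "text": "Invoice overdue reminder"},
]

topic, index, store = Queue(), defaultdict(set), {}
extract(scraped, topic)
index_messages(topic, index, store)
print([m["id"] for m in search("invoice", index, store)])
```

The design point this illustrates is the decoupling: the scraper only writes to the topic and the indexer only reads from it, so either side can be scaled or replaced without touching the other.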

Results and Achievements:

This system effectively resolved the client’s challenges, delivering the following outcomes:

Reduced Search Time: The time required to find specific messages dropped from hours to minutes.
High Accuracy in Results: With Elasticsearch and advanced queries, specific messages were located without errors.
Increased Efficiency: The client could focus on analysis and decision-making instead of manual searches.
Simple User Interface: The tool allowed the client to interact with the data effortlessly, even without technical expertise.
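
As an illustration of the kind of keyword-plus-filter search described above, the following sketch builds a request body in Elasticsearch Query DSL. The field names ("text", "group", "date") and the helper `build_query` are assumptions for this example, not the project's actual mapping or code, and the term filter assumes "group" is indexed as a keyword field.

```python
# Builds a request body in Elasticsearch Query DSL. The field names
# ("text", "group", "date") are hypothetical, not taken from the project.
import json


def build_query(phrase, group=None, since=None, until=None):
    """Phrase match on message text, optionally filtered by group and date range."""
    filters = []
    if group is not None:
        filters.append({"term": {"group": group}})
    date_range = {}
    if since is not None:
        date_range["gte"] = since
    if until is not None:
        date_range["lte"] = until
    if date_range:
        filters.append({"range": {"date": date_range}})
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"text": phrase}}],
                "filter": filters,
            }
        }
    }


body = build_query("payment failed", group="support", since="2025-01-01")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what a client would send as the body of a search request. Putting the group and date conditions under "filter" instead of "must" means they narrow the results without affecting relevance scoring.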

Key Takeaways and Project Highlights:

This project showcases my ability to design and implement advanced data-driven systems that solve complex challenges for clients. Post Hunter is now a powerful tool that enables the client to locate and analyze specific Telegram messages with speed, accuracy, and efficiency.

If you’re facing similar challenges and need a data engineer to build customized solutions, I’d love to collaborate with you. 😊
