How to save a DataFrame to PostgreSQL in pyspark


Recipe Objective: How to save a DataFrame to PostgreSQL in pyspark?

In most big data scenarios, data merging and aggregation are an essential part of day-to-day activities on big data platforms. In this scenario, we will create a DataFrame and save it to a PostgreSQL table.

System requirements:

  • Install Ubuntu in a virtual machine
  • Set up a single-node Hadoop installation
  • Install PySpark (Spark) on Ubuntu
  • The code below can be run in a Jupyter notebook or any Python console.
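Before Spark can talk to Postgres, the PostgreSQL JDBC driver jar must be on Spark's classpath. Step 1 below does this via the spark.jars config; an alternative is to pass the jar when launching the shell. The jar path here is an assumption — adjust it to wherever you downloaded the driver.

```shell
# Launch pyspark with the PostgreSQL JDBC driver on the classpath
# (the jar path is an example; point it at your downloaded driver)
pyspark --jars /usr/local/postgresql-42.2.5.jar
```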


Step 1: Import the modules

In this scenario, we import the PySpark and PySpark SQL modules and create a Spark session as shown below:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder \
    .config("spark.jars", "/usr/local/postgresql-42.2.5.jar") \
    .master("local") \
    .appName("PySpark_Postgres_test") \
    .getOrCreate()

The output of the code:

bigdata_1.jpg

Step 2: Create Dataframe to store in Postgres

Here we create a DataFrame to save in a Postgres table. The Row class lives in the pyspark.sql module, which we imported above.

studentDf = spark.createDataFrame([
    Row(id=1, name='vijay', marks=67),
    Row(id=2, name='Ajay', marks=88),
    Row(id=3, name='jay', marks=79),
    Row(id=4, name='vinay', marks=67),
])


The output of the code:

bigdata_2.jpg

Step 3: To View the Data in the DataFrame

Here we view the top 5 rows of the dataframe as shown below.

studentDf.show(5)

The output of the code:

bigdata_3.jpg

Step 4: To Save the DataFrame to a Postgres Table

Here we save the dataframe to the Postgres table we created earlier. To do this, we use the write method with the JDBC format and then call save, as shown in the code below.

studentDf.select("id", "name", "marks").write.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/dezyre_new") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "students") \
    .option("user", "hduser") \
    .option("password", "bigdata") \
    .save()
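One caveat: with no save mode set, the JDBC writer uses the default "error" mode and fails if the target table already exists. A mode can be set explicitly. This sketch reuses the connection details assumed above and can only run against a reachable Postgres instance:

```python
# Same write as above, but with an explicit save mode:
#   "append"    - add rows to an existing table
#   "overwrite" - replace the table's contents with the dataframe's rows
# (connection details repeat the assumptions used earlier in this recipe)
studentDf.select("id", "name", "marks").write.format("jdbc") \
    .mode("append") \
    .option("url", "jdbc:postgresql://localhost:5432/dezyre_new") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "students") \
    .option("user", "hduser") \
    .option("password", "bigdata") \
    .save()
```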

The output of the code:

bigdata_4.jpg

To check the output of the saved dataframe in the Postgres table, log in to the Postgres database.
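One way to inspect the table from the shell is with psql (the host, user, and database names below are the ones assumed in this recipe; psql will prompt for the password):

```shell
# Query the saved rows from the command line
psql -h localhost -U hduser -d dezyre_new -c "SELECT * FROM students;"
```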

The output of the saved dataframe:

bigdata_5.jpg

As shown in the above image, the dataframe has been written to a table in Postgres.

Conclusion

Here we learned to save a DataFrame to PostgreSQL in PySpark.


Relevant Projects

SQL Project for Data Analysis using Oracle Database-Part 2
In this SQL Project for Data Analysis, you will learn to efficiently analyse data using JOINS and various other operations accessible through SQL in Oracle Database.

Build Data Pipeline using Azure Medallion Architecture Approach
In this Azure Project, you will build a data pipeline to analyze large sensor data collected from water bodies across different European countries over several years using Azure Services and SQL Server to generate visualizations to gain valuable insights into water quality trends and determinands.

AWS CDK Project for Building Real-Time IoT Infrastructure
AWS CDK Project for Beginners to build real-time IoT infrastructure and migrate and analyze data.

SQL Project for Data Analysis using Oracle Database-Part 4
In this SQL Project for Data Analysis, you will learn to efficiently write queries using WITH clause and analyse data using SQL Aggregate Functions and various other operators like EXISTS, HAVING.

How to deal with slowly changing dimensions using Snowflake?
Implement Slowly Changing Dimensions using Snowflake Method - Build Type 1 and Type 2 SCD in Snowflake using the Stream and Task Functionalities

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.

Log Analytics Project with Spark Streaming and Kafka
In this spark project, you will use the real-world production logs from NASA Kennedy Space Center WWW server in Florida to perform scalable log analytics with Apache Spark, Python, and Kafka.

Azure Data Factory and Databricks End-to-End Project
Azure Data Factory and Databricks End-to-End Project to implement analytics on trip transaction data using Azure Services such as Data Factory, ADLS Gen2, and Databricks, with a focus on data transformation and pipeline resiliency.

Build a Data Pipeline in AWS using NiFi, Spark, and ELK Stack
In this AWS Project, you will learn how to build a data pipeline using Apache NiFi, Apache Spark, AWS S3, Amazon EMR cluster, Amazon OpenSearch, Logstash and Kibana.

Deploy an Application to Kubernetes in Google Cloud using GKE
In this Kubernetes Big Data Project, you will automate and deploy an application using Docker, Google Kubernetes Engine (GKE), and Google Cloud Functions.