In this article, we will work through a scenario-based Spark question to understand window functions in Spark.
Problem Statement:
Consider that we have a CSV file with some duplicate records in it, as shown in the picture. Our requirement is to find the duplicate records (duplicate rows) in a Spark DataFrame and report them in an output Spark DataFrame, as shown in the diagram (output DataFrame).
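For illustration, duplicate.csv might contain something like the following sample (borrowed from the dataset shared in the comments below):
Name,Age,Education,Year
Rakesh,53,MBA,1985
Madhu,22,B.Com,2018
Rakesh,53,MBA,1985
Bill,32,ME,2007
Madhu,22,B.Com,2018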
Solution:
We can solve this problem and find the duplicate rows in two ways:
- PySpark GroupBy
- PySpark Window Rank Function
For an explanation and demo of the two methods above, please watch the video embedded below.
Subscribe to my YouTube channel for more Spark-related questions.
Code snippets:
Step 1: Initialize the SparkSession and read the sample CSV file
import findspark
findspark.init()
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Report_Duplicate").getOrCreate()
# Read the CSV file, treating the first line as the header
in_df = spark.read.csv("duplicate.csv", header=True)
in_df.show()
Out[]:
Approach 1: GroupBy
# Group on all columns; any group whose count exceeds 1 is a duplicate row
in_df.groupBy("Name", "Age", "Education", "Year") \
    .count() \
    .where("count > 1") \
    .drop("count").show()
Out[]:
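As a side note, here is a sketch of a variation not covered in the original post: if you want to report every occurrence of a duplicate row rather than one row per group, you can join the duplicate keys back to the input DataFrame.
# Collect the keys of rows that occur more than once
dup_keys = in_df.groupBy("Name", "Age", "Education", "Year") \
    .count() \
    .where("count > 1") \
    .drop("count")
# An inner join returns each duplicate occurrence from the input
in_df.join(dup_keys, ["Name", "Age", "Education", "Year"], "inner").show()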
Approach 2: Window Ranking Function
from pyspark.sql.window import Window
from pyspark.sql.functions import col,row_number
# Create a window partitioned by all columns so that only true duplicate
# rows get a row_number greater than 1. Partitioning by "Name" alone would
# wrongly flag different students who share a name (see the comments below).
win = Window.partitionBy("Name", "Age", "Education", "Year").orderBy(col("Year").desc())
in_df.withColumn("rank", row_number().over(win)) \
    .filter("rank > 1") \
    .drop("rank").dropDuplicates().show()
Out[]:
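A related sketch (my own variation, not from the original post): a count() window over all columns flags duplicates without needing row_number or an orderBy.
from pyspark.sql.functions import count
# No orderBy needed: count("*") is the same for every row in the partition
win_all = Window.partitionBy("Name", "Age", "Education", "Year")
in_df.withColumn("cnt", count("*").over(win_all)) \
    .where("cnt > 1") \
    .drop("cnt").dropDuplicates().show()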
Happy Learning !!!
4 Comments
Kindly let me know how to do it in Spark Scala.
The df.dropDuplicates() is enough.
There is still a bug; use the below dataset:
Name,Age,Education,Year
RAM,28,BE,2012
Rakesh,53,MBA,1985
Madhu,22,B.Com,2018
Rakesh,53,MBA,1985
Bill,32,ME,2007
Madhu,22,B.Com,2018
Rakesh,53,MBA,1985
RAM,25,MA,2012
Now you can see there are two RAMs, but they are different students, not duplicates.
This is not a bug in the code. You will have to add all the required columns inside dropDuplicates. In your scenario, adding Name and Age will help you. Hope you understood this concept.
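For illustration, passing an explicit column subset to dropDuplicates (a sketch based on the reply above) keeps the two different RAM students, since they differ in Age, while collapsing the true duplicates:
in_df.dropDuplicates(["Name", "Age"]).show()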
Hello bhai, I now know the theory about Spark, that it includes Spark SQL and PySpark, but where can I learn it step by step? Please reply.