Spark SQL - Split and Concat columns in DF:
Today's topic for our discussion is how to split the value inside a column of a Spark Dataframe into multiple columns. In the banking and retail domains we often encounter this scenario, and this kind of small use-case is also a question frequently asked in Spark interviews. In this
chapter, we will learn how to achieve this requirement with the help of one simple illustration. I would recommend that you try this yourself on your personal computer to get more insight into it. If you don't have a Spark environment set up on your machine, follow the link to install Spark on a Windows system.
Problem Statement:
We have a column in a Spark Dataframe with a person's First Name and Last Name separated by a comma, and we need to:
a) Split the Name column into two columns, First Name and Last Name.
b) Create an Email-id column in the format firstname.lastname@email.com.
Have a look at the above diagram for your reference.
Solution:
Before scrolling further, give it a try yourself to solve this simple and basic Spark problem. Well, this sort of requirement can easily be done with the help of Microsoft Excel itself. But if the data that needs to be processed is huge in volume, then we can't perform these operations using MS-Excel, and this is the point where the Spark framework comes into the picture. Come, let's solve this problem statement and learn one simple technique in Apache Spark.
Input File:
The sample input file is a CSV-format file with two columns, Name and Age, holding 7 records. The fields are delimited by a pipe (|), since the Name value itself contains a comma. The data looks as shown in the below figure.
Read the input file as a dataframe after creating a SparkSession. The code snippet for reading is given below.
Code Snippet:
#Create (or reuse) the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#Read the input CSV file (fields are pipe-delimited)
input_df = spark.read.options(delimiter='|').csv('input.csv', header=True)
input_df.show(truncate=0)
Out[]:
a) Split Columns in PySpark Dataframe:
We need to split the Name column into First Name and Last Name. This operation can be done in two ways; let's look into both methods.
Method 1: Using Select statement:
We can leverage Spark SQL here by using the select statement to split the full Name into First Name and Last Name. The code snippet to achieve the split result is as follows,
Code Snippet:
#Split logic
import pyspark.sql.functions as f

split_df = input_df.select('Name',
                           f.split('Name', ',')[0].alias('First Name'),
                           f.split('Name', ',')[1].alias('Last Name'))
split_df.show()
Out[]:
From the output it is observed that if there is a huge list of columns to be selected in the output, then this method becomes somewhat difficult, as we need to add all the columns manually to the select statement.
Method 2: Using Dataframe API - withColumn():
To overcome the difficulty of selecting columns manually, the Spark Dataframe API provides withColumn() to derive the above result. Follow the below snippet to do the same.
Code Snippet:
input_df.withColumn('First Name', f.split('Name', ',')[0]) \
        .withColumn('Last Name', f.split('Name', ',')[1]) \
        .show()
Out[]:
We can observe the Spark DataFrame with the split output columns in it. This method is also useful when there is an unknown number of splits to be made. To do so, you can use a for loop like the one below, adjusting the column names and the number of splits to your own data.
for i in range(split_length):
    df = df.withColumn('newCol_' + str(i), f.split('split_column', ',')[i])
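If the number of splits is not known up front, split_length itself can be computed from the data. Below is a minimal sketch of that idea, reusing input_df and its comma-separated Name column from above (the column names Name_part_0, Name_part_1, ... are just illustrative placeholders):
#Sketch: derive the number of split parts from the data itself
split_col = f.split('Name', ',')
split_length = input_df.select(f.max(f.size(split_col))).collect()[0][0]

dyn_df = input_df
for i in range(split_length):
    dyn_df = dyn_df.withColumn('Name_part_' + str(i), split_col[i])
dyn_df.show(truncate=0)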
b) Email-Id Column - Concatenation using Spark SQL:
Now, we need to generate an email-id for each user by concatenating the First Name and Last Name in the same format given in the problem statement. We can achieve this by using a select statement as well as by using the withColumn() API. Here we use the select statement to add the Email-Id column.
Try yourself: Try getting the Email-Id column using the withColumn() API (one possible version is sketched after the output below).
Using Select clause:
Before concatenation, we need to trim the leading and trailing spaces observed in the column and also append the additional string @email.com to the trimmed string. For this, we can use the trim(), concat() and lit() functions available in pyspark.sql.functions.
Code Snippet:
#Generate Mail id
out_df = split_df.select('Name', 'First Name', 'Last Name',
                         f.concat(f.col('First Name'),
                                  f.lit('.'),
                                  f.trim(f.col('Last Name')),
                                  f.lit('@email.com')).alias('Email'))
out_df.show(2, truncate=0)
Out[]:
We got the required output.
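For the "Try yourself" exercise above, one possible withColumn() version is sketched below, reusing split_df from the earlier step:
#Sketch: the same Email column derived via withColumn()
out_df2 = split_df.withColumn('Email',
                              f.concat(f.col('First Name'),
                                       f.lit('.'),
                                       f.trim(f.col('Last Name')),
                                       f.lit('@email.com')))
out_df2.show(truncate=0)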
Full Program:
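Putting the snippets above together, the full program looks roughly like this (a consolidated sketch; the input file name input.csv is an assumption):
#Full program: split Name and build the Email-id column
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

#Read the pipe-delimited input file
input_df = spark.read.options(delimiter='|').csv('input.csv', header=True)

#Split Name into First Name and Last Name
split_df = input_df.select('Name',
                           f.split('Name', ',')[0].alias('First Name'),
                           f.split('Name', ',')[1].alias('Last Name'))

#Concatenate to form the Email column
out_df = split_df.select('Name', 'First Name', 'Last Name',
                         f.concat(f.col('First Name'),
                                  f.lit('.'),
                                  f.trim(f.col('Last Name')),
                                  f.lit('@email.com')).alias('Email'))
out_df.show(truncate=0)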
Hope you enjoyed reading this article. Try this on your own computer and comment with your results. If you have any hurdles in solving this requirement, leave your comments below or contact us.
Happy Learning!!!