In this blog, we are going to learn to answer one of the most frequently asked Spark interview questions; I would say 90 percent of candidates encounter it in their interviews: What is the main difference between Map and FlatMap in Spark? In some cases, candidates are also asked to write a piece of code to illustrate the working principle behind Map vs FlatMap. If you are a beginner to Big Data and need a quick look at PySpark programming, I would recommend reading How to Write Word Count in Spark. Come, let's learn to answer this question with one simple real-time example.
Apache Spark provides basic operations to be performed on top of the basic building block of Spark Core called the RDD. These operations are nothing but functions or methods with some logic in them to transform the RDD and get the expected output from it. It can be a simple logic to filter, to sort, or to summarize the overall results. The operations performed on top of a Spark RDD can be classified into two types, namely,
- Transformation
- Actions
What is a Transformation?
Spark Transformations are the process of converting a given RDD, which is immutable in nature, into another RDD by applying some transformation logic to it. If you need more details on What is Spark RDD? follow the link to learn about Spark RDD in detail. The most important point to note is that when we apply a transformation on top of any RDD in Spark, the operation is not performed immediately. Spark saves the list of operations to be performed on top of the source RDD in sequence by creating a DAG (Directed Acyclic Graph). Only once a Spark Action is called are all the Transformations in the DAG sequence executed. This property of Spark is known as lazy evaluation. We will learn more about Transformations, Actions and Speculative Execution in our upcoming chapters. For now, Transformations are the basic operations executed on top of a Spark RDD, and a few examples of transformations are map, flatMap, filter, mapPartitions, etc.
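To see lazy evaluation in action, here is a minimal sketch; the file name sample.txt and the variable names are illustrative assumptions, not from the original post:

#Lazy evaluation sketch: assumes an existing SparkContext named sc and a file sample.txt
rdd = sc.textFile("sample.txt")           #transformation: nothing is read yet
upper_RDD = rdd.map(lambda x: x.upper())  #transformation: only recorded in the DAG
upper_RDD.count()                         #action: the whole DAG is executed now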
Use Case to Understand:
Let us consider an input text file that contains a few sentences, as shown below. Our task is to apply the map and flatMap transformations one by one and observe the results produced, to understand how they work and learn where to use map and where to use flatMap. We do this by applying the split() function inside map() and flatMap() in PySpark. The same logic can be applied in Scala and Java as well, with slight modifications to the syntax.
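The original input file is not reproduced here, so assume a hypothetical sample.txt with four lines and twelve words in total, consistent with the record counts discussed below:

learn apache spark
map vs flatmap
spark is fun
rdd basics here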
Code snippet to read the text file using SparkSession:
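A minimal sketch of reading the file, assuming a local SparkSession; the app name and file path are assumptions:

#Read the text file using SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapVsFlatMap").getOrCreate()
input_RDD = spark.sparkContext.textFile("sample.txt")
input_RDD.collect()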
Map Operation:
Map is a type of Spark Transformation that operates at the record level. The Spark map operation applies the logic defined in the developer's custom code to each element in the RDD and returns the result for each row as a new RDD.
In simple words, the map transformation transforms an input RDD of a given length, say L, into a processed RDD of the same length L. In other words, the number of records or elements in the input and output RDDs always remains the same. Now it is time for some hands-on practice: let us apply the map operation with the split function to the given input file and check the count of input and output records produced.
Code Snippet:
#Map operation: split each line into a list of words
map_RDD = input_RDD.map(lambda x: x.split(' '))
map_RDD.collect()
Out[]:
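[['learn', 'apache', 'spark'], ['map', 'vs', 'flatmap'], ['spark', 'is', 'fun'], ['rdd', 'basics', 'here']]

(The exact output depends on your input file; the list above corresponds to the assumed sample.txt, with one list of words per input line.)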
From the output, it is evident that with the map function the number of output records exactly matches the number of input records. We can verify this using the count() function, as shown below; both the input and output RDDs have a record count of 4.
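A quick sketch of the count check, using the variable names from the snippets above:

#Verify that map preserves the record count
input_RDD.count()   #returns 4 for the assumed input
map_RDD.count()     #also returns 4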
In real-world scenarios, the map function with split logic is often used to form a Spark DataFrame for table-level operations. To know more about DataFrames, go through this link: How to create Dataframe in Spark.
FlatMap Operation:
flatMap in Apache Spark is a transformation operation that produces zero or more output elements for each element present in the input RDD. It is similar to the map function: it applies the user-defined logic to each record in the RDD and returns the output records as a new RDD. However, with flatMap(), if an input RDD of length L is passed to the user-defined logic, it can produce an output RDD of a different length, say M.
A code snippet applying the split() function with the flatMap() transformation is given below.
Code Snippet:
#FlatMap operation: split each line and flatten the results into one list of words
flatmap_RDD = input_RDD.flatMap(lambda x: x.split(' '))
flatmap_RDD.collect()
Out[]:
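['learn', 'apache', 'spark', 'map', 'vs', 'flatmap', 'spark', 'is', 'fun', 'rdd', 'basics', 'here']

(Again, this output corresponds to the assumed sample.txt; notice that the nested word lists produced by map are flattened into a single list of words.)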
We can observe that the number of input rows passed to flatMap is not equal to the number of output records we got. By applying the count() function on top of flatmap_RDD, we can get the number of records in it.
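A sketch of the comparison, with variable names as above:

#Compare record counts before and after flatMap
input_RDD.count()     #returns 4 for the assumed input
flatmap_RDD.count()   #returns 12: one record per word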
We can notice that the input RDD has 4 records, whereas the flattened output RDD has 12 records. flatMap() is commonly used for word-count style tasks, such as counting how often each word appears in a given document, which is helpful in the field of text analytics.
Full Program:
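Putting the pieces together, here is a minimal end-to-end sketch; the app name and the file path sample.txt are assumptions:

#Full program: map vs flatMap on a sample text file
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapVsFlatMap").getOrCreate()

#Read the input file into an RDD of lines
input_RDD = spark.sparkContext.textFile("sample.txt")

#map: one output element (a list of words) per input line
map_RDD = input_RDD.map(lambda x: x.split(' '))
print(map_RDD.collect())
print(map_RDD.count())        #same count as input_RDD

#flatMap: the word lists are flattened into individual words
flatmap_RDD = input_RDD.flatMap(lambda x: x.split(' '))
print(flatmap_RDD.collect())
print(flatmap_RDD.count())    #one record per word

spark.stop()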
I hope you observed the difference in output between the map and flatMap operations and are now ready to answer this question in your upcoming Spark interview (all the best!). I would recommend practicing the same on your machine for a better understanding. Post your comments if you need any further assistance on the above topic.
Happy Learning!!!