Tuning Spark Performance with the Parquet File Format:
In this chapter, we deal with a Spark performance tuning question asked in most interviews: read a given Parquet file located in Hadoop and write or save the output dataframe in Parquet format using PySpark. Beyond answering this question, we also look in detail at the architecture of a Parquet file and the advantages of the Parquet format over the other file formats commonly used in the market.
Why did you choose this particular file format for handling the data in your project? This has become a frequent question asked by interviewers. Apart from Parquet, there are many other file formats widely used in the IT industry, such as ORC, SequenceFile, Avro, CSV and JSON. Each format has its own advantages and typical use cases, which we will discuss in later chapters. Spark developers should know the reasoning behind selecting the file format that fits their project requirements. Today we learn about the Parquet format used for reading and saving data with Apache Spark. Let's get started with the importance of file formats.
Why Selection of File Format is Important?
The file format plays a major role in optimizing the performance of the Spark code we develop. A developer should consider the following criteria before settling on a file format for development. The chosen file format should allow us to:
- Read the data faster for processing.
- Write the data to the target location faster and without any data loss, be it the Hadoop Distributed File System (HDFS), an Amazon Web Services (AWS) S3 bucket, Microsoft Azure Blob Storage or Google Cloud Platform (GCP) storage.
- Store the data in a splittable way, which helps the application run many tasks in parallel across the cluster.
- Make schema changes easily without any impact on the existing data.
- Benefit from efficient compression, which adds significant value while processing the data.
Parquet - Overview:
The Apache Parquet file format, mainly backed by Cloudera, is an open-source, column-oriented data storage format inspired by Google's Dremel paper. There is not much difference between Parquet and the other columnar storage file formats available in Hadoop, namely the RCFile and ORC formats. It is efficient enough to handle huge volumes of data; even terabytes of data can be handled with Parquet, as it maintains a good compression ratio and efficient encoding schemes.
The Parquet file format is broadly made up of three components, namely:
- Parquet Header
- Data Block
- Parquet Footer
The header comprises the four-byte magic number 'PAR1', which tells the application that the file being processed is in Parquet format.
A data block in Parquet is a group of records (a row group) consisting of column chunks and column metadata. Column chunks are further divided into pages, and each page holds the values of a particular column for the records in the given dataset. The column metadata contains information about the column such as its type, path, encoding type, number of values and compressed size. The overall picture of a data block is shown in the figure below.
The footer contains the footer metadata, a four-byte footer length and the magic number 'PAR1'. The footer metadata consists of the version of the format, the schema of the data blocks, any key-value pairs and the metadata of each column present in the data blocks.
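To make this structure concrete, here is a minimal sketch that inspects the footer metadata of a Parquet file using the pyarrow library; this assumes pyarrow is installed, and the part file name is only illustrative since Spark appends a unique suffix to each part file it writes.
#Inspect Parquet footer metadata with pyarrow (file name is illustrative)
import pyarrow.parquet as pq
pf = pq.ParquetFile('out_parq/part-00000.parquet')
print(pf.metadata)          # format version, schema, number of row groups and rows
print(pf.schema)            # schema stored in the footer
col_chunk = pf.metadata.row_group(0).column(0)   # column-chunk metadata of the first data block
print(col_chunk.physical_type, col_chunk.compression, col_chunk.total_compressed_size)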
Save as parquet using PySpark:
Now that we have learned the architecture of Parquet, it is time for some hands-on activity. Let the input data be a CSV file, which we read into a dataframe as shown below. Our goal is to save this dataframe of student data in Parquet format using PySpark.
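For reference, a minimal sketch of reading such a CSV into a dataframe is given below; the file name students.csv and the header and inferSchema options are assumptions about the input data.
#Read the students CSV as a dataframe (file name and options are assumed)
input_df = spark.read \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .csv('students.csv')
input_df.show()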
Snippet for saving the dataframe as parquet file is given below.
#Save as parquet file
input_df.coalesce(1).write.format('parquet') \
.mode('overwrite') \
.save('out_parq')
coalesce(1) in our snippet combines all the partitions, so a single part file is written to the target location in Parquet format. From the figure below, one can notice that the data is stored in Parquet format.
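For comparison, a quick sketch of what happens without coalesce(1); the output path out_parq_multi is illustrative.
#Without coalesce, Spark writes one part file per partition of the dataframe
print(input_df.rdd.getNumPartitions())      # roughly the number of part files you would get
input_df.write.format('parquet').mode('overwrite').save('out_parq_multi')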
Note: Unlike CSV or text files, the Parquet file format can't be read directly using hadoop commands. To read the data in Parquet, we need to either create a Hive table on top of the data or use a Spark command to read it.
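To illustrate the first option from the note, below is a hedged sketch of creating an external Hive table over the Parquet output; the table name, column names and location are assumptions, and the Spark session must be created with Hive support enabled.
#Create an external Hive table over the Parquet directory (names and path are assumed)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS students_parquet (name STRING, marks INT)
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/out_parq'
""")
spark.sql('SELECT * FROM students_parquet').show()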
Read the parquet file using PySpark:
As stated in the note above, we can't read Parquet data using the hadoop cat command. We can read it either by building a Hive table on top of the Parquet data or by using a Spark command. Here we look at reading the Parquet file with a Spark command. Let us read the file that we wrote as Parquet data in the snippet above.
#Read the parquet file format
read_parquet = spark.read.parquet('out_parq/part*.parquet')
read_parquet.show()
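Since the footer already carries the schema, the whole output directory can also be read directly instead of pointing at individual part files; a short sketch:
#Read the entire output directory; Spark picks up all part files and the schema from the footer
read_parquet = spark.read.parquet('out_parq')
read_parquet.printSchema()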
Parquet File Format - Advantages:
Nested data: The Parquet file format is well suited to storing nested data; if the data stored in HDFS has a deeper hierarchy, Parquet serves best, as it stores the data in a tree structure.
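As a small, hedged illustration of nested data, the snippet below writes a dataframe with a struct column to Parquet; the column names are made up for this example.
#Write a nested (struct) column to Parquet; column names are illustrative
from pyspark.sql import Row
nested_df = spark.createDataFrame([
    Row(id=1, name=Row(first='Asha', last='Rao')),
    Row(id=2, name=Row(first='John', last='Doe'))
])
nested_df.printSchema()                  # 'name' appears as a nested struct
nested_df.write.mode('overwrite').parquet('out_parq_nested')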
Predicate pushdown efficiency: Parquet is very useful for query optimization. Suppose you have to apply some filter logic to a huge amount of data; the Spark Catalyst optimizer takes advantage of the file format and pushes a lot of the work down to it. As part of its metadata, Parquet stores statistics such as the minimum and maximum values of each column chunk, which lets a query skip entire data blocks using the metadata itself.
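A hedged way to see the pushdown in action is to inspect the physical plan; the column marks and the threshold below are assumptions about the data.
#Filters on a Parquet source appear as PushedFilters in the physical plan
filtered = spark.read.parquet('out_parq').filter('marks > 80')
filtered.explain()                       # look for PushedFilters in the FileScan node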
Compression: Parquet offers more efficient compression than other file formats such as Avro, JSON and CSV.
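For example, the compression codec can be chosen at write time; snappy is Spark's default for Parquet, and the gzip choice and output path below are only illustrative.
#Write the dataframe with an explicit compression codec (codec and path are illustrative)
input_df.write \
    .option('compression', 'gzip') \
    .mode('overwrite') \
    .parquet('out_parq_gzip')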
Hope you enjoyed learning about one of the most widely used file formats in today's Big Data world. Leave your valuable comments about the article and contact us anytime if you face any difficulty with the concepts we covered here.
Happy Learning !!!