In this tutorial, we will learn a Databricks trick: how to download output results stored in the Databricks File System (DBFS) to your local machine.
Databricks Community Edition:
Databricks is a unified Spark platform that helps Data Engineers and Data Scientists perform ETL operations and build machine learning models easily. We can access Databricks Community Edition without spending any money; all we need is a verified email ID. If you need a step-by-step guide to open a Databricks Community Edition account, follow the embedded video below.
How to Download Data From DBFS to Local:
We have three options to download files from DBFS to our local machine. They are,
- using the Display option
- using a Web URL for the file
- using the Databricks CLI
Note: The first two options can be tried in Databricks Community Edition, whereas the Databricks CLI requires a paid version of Databricks, since the CLI is not available in Community Edition.
We will discuss each of the above methods one by one and understand how the Databricks utilities work.
Video Explanation:
Method - 1: Using Display Option
We can use the display option to download the resultant Spark dataframe as a CSV file. It has a limitation, though: we can download a maximum of one million records from the Spark dataframe as a CSV file to our local machine.
The display option lets us download either the first 1,000 records or the full result, as long as it is less than or equal to 1 million records. The screenshot given below gives you a clear picture of this method.
This method is suitable for small datasets, where the output will not exceed 1 million records. If the resultant data contains more than 1 million records, proceed with one of the other two available options.
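For example, a notebook cell along the lines of the sketch below (the dataframe contents are made up purely for illustration; spark and display are provided by the Databricks notebook environment) renders the result grid with the download option underneath it:
# build a small example dataframe (made-up data, just for illustration)
df = spark.createDataFrame(
    [(1, "apple"), (2, "banana"), (3, "cherry")],
    ["id", "fruit"],
)
# display() renders the result table with the CSV download button below it
display(df)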
Method - 2: Using Web-URL to download files
We can generate an https:// URL for the data file's location in Databricks and use that link to download the file to your local machine.
Note: This method can only be used when we store our resultant Spark dataframe under the /FileStore/ path in DBFS (or under a path mounted into DBFS). Otherwise, we can't use this method to download the file to local.
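As a rough sketch of that setup (the output folder name is hypothetical, and df is your resultant dataframe), you could write the result as a single CSV part file under /FileStore/ like this:
# coalesce(1) produces a single part file, which is easier to point a URL at
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", True)
   .csv("dbfs:/FileStore/my_output"))
# list the folder to find the exact part file name for the URL
display(dbutils.fs.ls("dbfs:/FileStore/my_output"))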
Let us check how to form the web URL. Below are the steps to generate a link.
URL Formation Guide:
Step 1: Get the host URL and the workspace ID (the number after ?o=) from the address bar. The link will look as shown in the above figure.
Step 2: Copy the DBFS path of the file you need to copy to your local machine.
Step 3: Add the keyword files in between the host and the DBFS path as shown in the above figure. Paste the final URL in a new tab to start the download.
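Putting the three steps together, a rough sketch of the final URL (the host, workspace ID, and file path below are placeholders, not real values) looks like this:
# host and workspace ID (the value after ?o=) come from the address bar
host = "https://community.cloud.databricks.com"
workspace_id = "1234567890123456"
# DBFS path of the file, stored under /FileStore/
dbfs_path = "/FileStore/my_output/part-00000.csv"
# insert "files" after the host and drop the /FileStore prefix
download_url = host + "/files/" + dbfs_path.replace("/FileStore/", "") + "?o=" + workspace_id
print(download_url)
# https://community.cloud.databricks.com/files/my_output/part-00000.csv?o=1234567890123456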
Method - 3: Using Databricks CLI
As I said earlier, we don't have the Databricks CLI for Community Edition and can only set it up in the paid version. We will see how to set up the Databricks CLI and copy a file from DBFS to local.
Step 1: Open a command prompt and install the CLI using the command below
pip install --upgrade databricks-cli
Step 2: Set up a token to authenticate the session. Run the command below, which prompts you to enter the host and token as shown below.
databricks configure --token
Note: You can find that a config file named .databrickscfg is created with the details that you entered.
To generate a Personal Access Token, sign in to your Databricks UI, navigate to User Settings --> Access Tokens, and click Generate New Token. Note down the token, as we can't retrieve the same token again.
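Once the configure command completes, the .databrickscfg file typically holds just the host and token you entered, along the lines of this sketch (placeholders shown instead of real values):
[DEFAULT]
host = https://<your-databricks-workspace-url>
token = <your-personal-access-token>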
Step 3: Use the DBFS utility to copy the data from Databricks DBFS to the local system.
#to list all the directories and files
dbfs ls dbfs:/
The above command will list the files and directories at the root of your Databricks filesystem [dbfs].
#command to copy file to local
dbfs cp dbfs:/FileStore/shared_uploads/azar.s91@gmail.com/<file-name> <local-file-path>
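Since Spark usually writes its output as a folder of part files, it can also be handy to copy the whole directory in one go; here is a sketch with hypothetical paths (the -r flag copies recursively):
#command to copy an entire folder to local
dbfs cp -r dbfs:/FileStore/my_output ./my_output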
Conclusion:
Thus, with the above three methods, we can download data from the Databricks filesystem to the local system. Try it on your own and let me know in the comment box below if you face any issues.
Happy Learning!!!