Spark SQL basics

Spark SQL

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

Creating a free Databricks cluster

You can create a free Spark cluster on the Databricks community cloud. Sign up there and start creating your cluster. The cluster is automatically dropped after 120 minutes of inactivity, so you can create a new one each time you work. A cluster created on the community cloud has only one node and is useful for learning purposes only. (Create Cluster)

Mapping Azure storage and querying data

There are two ways of accessing data from Azure storage:
  • Mount the storage in the Databricks file system (DBFS).
  • Use the Spark conf context to access the storage.
Here we will access CSV data using the Spark conf context and a DataFrame.
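For comparison, the first option (mounting) can be sketched roughly as below. This is only a sketch under assumptions: the helper name build_mount_args, the mount name, and the key placeholder are hypothetical and not part of the original post. dbutils.fs.mount is the Databricks notebook utility for mounting storage, and it is only available inside a Databricks notebook.

```python
def build_mount_args(account, container, access_key, mount_name):
    """Assemble keyword arguments for dbutils.fs.mount (Azure Blob, wasbs scheme)."""
    host = account + ".blob.core.windows.net"
    return {
        "source": "wasbs://" + container + "@" + host,
        "mount_point": "/mnt/" + mount_name,
        "extra_configs": {"fs.azure.account.key." + host: access_key},
    }


# Hypothetical values; replace the placeholder with a real storage account key.
mount_args = build_mount_args(
    "vehicleparking", "salesdata", "<storage-account-key>", "salesdata"
)

# In a Databricks notebook, where dbutils and spark are predefined:
# dbutils.fs.mount(**mount_args)
# ds = spark.read.csv("/mnt/salesdata/Sales.csv", header=True, inferSchema=True)
```

Once mounted, the container behaves like a normal DBFS folder, so the account key does not have to be set again in each notebook session.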

Configuring Azure storage using PySpark

You can use the code below to configure Azure storage. (Steps to create azure storage account)


account = "vehicleparking"
container = "salesdata"
# The second argument is the storage account access key (left blank here).
spark.conf.set("fs.azure.account.key.vehicleparking.blob.core.windows.net", "")
path = "wasbs://" + container + "@" + account + ".blob.core.windows.net/Sales.csv"
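If you configure more than one account or file, the same steps can be wrapped in a small helper so that the config key and the path are always built from the same account name. This is a sketch, not part of the original post: the function name configure_blob_access is hypothetical, and so are the secret scope and key names in the usage comment.

```python
def configure_blob_access(spark, account, container, access_key, file_name):
    """Register the account key on the Spark session and return the wasbs path."""
    host = account + ".blob.core.windows.net"
    # Same setting as above, with the account name filled in from the argument.
    spark.conf.set("fs.azure.account.key." + host, access_key)
    return "wasbs://" + container + "@" + host + "/" + file_name


# In a Databricks notebook (scope and key names are hypothetical):
# access_key = dbutils.secrets.get(scope="my-scope", key="storage-key")
# path = configure_blob_access(spark, "vehicleparking", "salesdata", access_key, "Sales.csv")
```

Reading the key from a secret scope, rather than pasting it into the notebook, keeps it out of the code you share.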

Code to read CSV from Azure Blob storage

ds = spark.read.csv(path, header=True, inferSchema=True)
ds.createOrReplaceTempView("SalesData")
mydata = spark.sql("SELECT * FROM SalesData WHERE City = 'Hyderabad'")
mydata.show()
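As noted at the start, the same execution engine is used whichever API expresses the computation. The sketch below shows the same filter written once as SQL and once with the DataFrame API. The function names are hypothetical, and ds is assumed to be the DataFrame read from Sales.csv.

```python
def hyderabad_via_sql(spark, ds):
    """Filter through a temp view and a SQL query."""
    ds.createOrReplaceTempView("SalesData")
    return spark.sql("SELECT * FROM SalesData WHERE City = 'Hyderabad'")


def hyderabad_via_dataframe(ds):
    """Same filter with the DataFrame API; filter() accepts a SQL expression string."""
    return ds.filter("City = 'Hyderabad'")


# In a notebook, both calls should show the same rows:
# hyderabad_via_sql(spark, ds).show()
# hyderabad_via_dataframe(ds).show()
```

Which form to use is mostly a matter of taste: SQL is convenient for ad hoc queries, while the DataFrame API is easier to compose inside Python code.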

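Your overall code looks like the below:

account = "vehicleparking"
container = "salesdata"
spark.conf.set("fs.azure.account.key.vehicleparking.blob.core.windows.net", "")
path = "wasbs://" + container + "@" + account + ".blob.core.windows.net/Sales.csv"
ds = spark.read.csv(path, header=True, inferSchema=True)
ds.createOrReplaceTempView("SalesData")
mydata = spark.sql("SELECT * FROM SalesData WHERE City = 'Hyderabad'")
mydata.show()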

