Converting flat data to complex JSON in PySpark

 

We will convert table data with repeating rows for an employee into nested JSON using Spark.

You can create a CSV file like the one below.

EmployeeID,Name,Color
1,Guru,Red
1,Guru,Green
2,Gova,Red
2,Giva,Yellow

You can import the data into the Databricks file system. I have imported it and wrote the code below to validate that it is there.


spark.read.text("/FileStore/tables/EmpData.csv").show()

Below is the output:

+--------------------+
|               value|
+--------------------+
|EmployeeID,Name,C...|
|          1,Guru,Red|
|        1,Guru,Green|
|          2,Gova,Red|
|       2,Giva,Yellow|
+--------------------+

Code to read the CSV and create a DataFrame

Pschema = "EmployeeID INTEGER, EmpName string, Color string"
srcDF = spark.read\
        .option("header","true")\
        .schema(Pschema)\
        .csv("/FileStore/tables/EmpData.csv")

Code to store the data as nested JSON

from pyspark.sql.functions import * 

df = srcDF.groupBy("EmployeeID","EmpName").agg(collect_list("Color").alias("ColorFile"))

df.write\
  .format("json")\
  .mode("overwrite")\
  .save("/FileStore/tables/EmpData.json")
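Each line of the output file contains one JSON document per group. For the sample data above, the output should look roughly like the following (the order of the rows, and of the colors inside ColorFile, may vary):

{"EmployeeID":1,"EmpName":"Guru","ColorFile":["Red","Green"]}
{"EmployeeID":2,"EmpName":"Gova","ColorFile":["Red"]}
{"EmployeeID":2,"EmpName":"Giva","ColorFile":["Yellow"]}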

Code to read the JSON data and convert it back to a flat DataFrame

from pyspark.sql import functions as f

df = spark.read.json("/FileStore/tables/EmpData.json")
df.show()
df2 = df.select("EmployeeID","EmpName",f.explode(f.col("ColorFile")).alias("Color"))
df2.show()
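For the sample data, the flattened DataFrame df2 should show something like this (row order may vary):

+----------+-------+------+
|EmployeeID|EmpName| Color|
+----------+-------+------+
|         1|   Guru|   Red|
|         1|   Guru| Green|
|         2|   Gova|   Red|
|         2|   Giva|Yellow|
+----------+-------+------+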

Types of Analytics: descriptive, prescriptive, predictive, and inferential analytics

TYPES OF ANALYTICS

The process of discovering meaningful patterns or information from a set of data is called analytics. Another question we always hear is: what is the difference between data and information? Data can be termed as the raw records stored in a system, such as transactions, while information is the inference drawn from data through calculation, e.g. the quarterly sales of Maruti Suzuki cars for 2019.

So analytics can be of four types:

  • Descriptive
  • Prescriptive
  • Predictive
  • Inferential

Descriptive analytics

Descriptive analytics is a very simple form of data analysis, like sales by location for the year 2019. It examines historical data and answers questions like: How were my sales in the year 2019? Which locations had better sales? Which products were sold and for what amount?

Descriptive analytics is the interpretation of historical data to better understand changes that have occurred in a business.

Prescriptive Analytics

Prescriptive analytics is the area of data analytics that focuses on finding the best course of action in a scenario given the available data. It emphasizes actionable insights instead of the data monitoring of descriptive analytics.
So basically, once we find that the business was not good in the 2nd quarter of 2019, we get into details such as which products and which days had the issue, and we start doing root cause analysis and course correction to improve the situation.

Predictive Analytics

In predictive analytics we try to predict the future by reading past data. For example, given past credit card transactions, we try to predict whether a future transaction is fraudulent, or we predict the future price of real estate by looking at past behavior.

Inferential analytics

Inferential analytics consists of techniques that allow us to use samples to make generalizations about the populations from which the samples were drawn. It is therefore important that the sample accurately represents the population. The most common example would be an exit poll in India.

SendGrid email setup


Create SendGrid Account

Azure customers can unlock 25,000 free emails each month. These 25,000 free monthly emails will give you access to advanced reporting and analytics and all APIs (Web, SMTP, Event, Parse and more). For information about additional services provided by SendGrid, visit the SendGrid Solutions page.

To sign up for a SendGrid account

1. Sign in to the Azure portal.
2. In the menu on the left, click Create a resource.
3. Click Add-ons and then SendGrid Email Delivery.
4. Complete the signup form and select Create.
5. Enter a Name to identify your SendGrid service in your Azure settings. Names must be between 1 and 100 characters in length and contain only alphanumeric characters, dashes, dots, and underscores. The name must be unique in your list of subscribed Azure Store Items.
6. Enter and confirm your Password.
7. Choose your Subscription.
8. Create a new Resource group or use an existing one.
9. In the Pricing tier section, select the SendGrid plan you want to sign up for.
10. Enter a Promotion Code if you have one.
11. Enter your Contact Information.
12. Review and accept the Legal terms.
13. After confirming your purchase, you will see a Deployment Succeeded pop-up and your account will be listed in the All resources section.

After you have completed your purchase and clicked the Manage button to initiate the email verification process, you will receive an email from SendGrid asking you to verify your account. If you do not receive this email, or have problems verifying your account, please see this FAQ.

You can only send up to 100 emails/day until you have verified your account.
To modify your subscription plan or see the SendGrid contact settings, click the name of your SendGrid service to open the SendGrid Marketplace dashboard.

To send an email using SendGrid, you must supply your API Key.

To find your SendGrid API Key

1. Click Manage.
2. In your SendGrid dashboard, select Settings and then API Keys in the menu on the left.
3. Click Create API Key.
4. At a minimum, provide a Name for this key, grant Full Access to Mail Send, and select Save.
5. Your API key will be displayed only once at this point, so be sure to store it safely.
6. The key needs to be placed in Key Vault so that the app can load it (SecretName = EmailAPIKey).
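Once the key is in Key Vault, a minimal Python sketch for sending a test email could look like the following. It assumes the sendgrid, azure-identity, and azure-keyvault-secrets packages are installed; the vault URL and the sender/recipient addresses are placeholders you would replace.

# Minimal sketch: load the API key from Key Vault and send a test email
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# Load the API key from Key Vault (SecretName = EmailAPIKey, as noted above)
vault_url = "https://<your-key-vault-name>.vault.azure.net"   # placeholder
secret_client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
api_key = secret_client.get_secret("EmailAPIKey").value

# Build and send a simple test message
message = Mail(
    from_email="sender@example.com",     # placeholder, must be a verified sender
    to_emails="recipient@example.com",   # placeholder
    subject="SendGrid test from Azure",
    html_content="<p>Hello from SendGrid!</p>")

sg = SendGridAPIClient(api_key)
response = sg.send(message)
print(response.status_code)              # 202 means the mail was accepted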


Creating free databricks cluster

Databricks community cloud

You can use Databricks Community Cloud to create a free Databricks cluster. The cluster you create will have 1 node only. You can run Spark and Scala code on the cluster, and you can mount storage and external file systems like Azure Data Lake and Azure Storage.

Steps to create a free cluster:

  1. Navigate to https://community.cloud.databricks.com/login.html
  2. Click on Sign Up; you will be redirected to https://databricks.com/try-databricks
  3. Click on the Get Started icon below the Community Edition: https://databricks.com/signup/signup-community
  4. Fill in all the required details and continue using the cluster.
You can run your code and save your notebooks. Your cluster automatically gets dropped after 120 minutes of inactivity. You can create a new cluster and attach it to your notebook to run your code.

Spark SQL basics

Spark SQL

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.
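As a small illustration (assuming a SparkSession named spark and a tiny made-up people dataset), the same query can be expressed either with SQL or with the DataFrame API, and both run on the same execution engine:

from pyspark.sql import functions as F

# A tiny hypothetical dataset for illustration
people = spark.createDataFrame(
    [("Guru", 34), ("Gova", 28), ("Giva", 45)],
    ["name", "age"])
people.createOrReplaceTempView("people")

# 1) SQL API
spark.sql("SELECT name FROM people WHERE age > 30").show()

# 2) DataFrame API - same result, same execution engine
people.filter(F.col("age") > 30).select("name").show()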

Creating free databricks cluster

You can create a free Spark cluster on Databricks Community Cloud. You can sign up there and start creating your cluster. Your cluster automatically gets dropped after 120 minutes of inactivity, so you can create a new cluster every time you work. The cluster you create on the community cloud will have only one node and is useful for learning purposes only. (Create Cluster)

Mapping Azure storage and querying data

There are two ways of accessing data from Azure storage:
  • Mount the storage in the Databricks file system (a minimal sketch is shown below).
  • Use the Spark conf context to access the storage.
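A minimal sketch of the first option, mounting an Azure Blob Storage container with the Databricks utilities, could look like this; the account, container, mount point, and access key are placeholders:

# Sketch: mount an Azure Blob Storage container in the Databricks file system
# (account name, container, mount point, and access key below are placeholders)
dbutils.fs.mount(
    source="wasbs://<container>@<account>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<account>.blob.core.windows.net": "<access-key>"})

# Once mounted, the data can be read like a local path
df = spark.read.csv("/mnt/<mount-name>/Sales.csv", header="true", inferSchema="true")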
Here we will access CSV data using the Spark conf context and a DataFrame.

Configuring Azure storage using PySpark

You can use the code below to configure Azure storage. (Steps to create an Azure storage account)


account = "vehicleparking"
container = "salesdata"
spark.conf.set("fs.azure.account.key.vehicleparking.blob.core.windows.net", "")
path = "wasbs://" + container + "@" + account + ".blob.core.windows.net/Sales.csv"

Code to read the CSV from Azure Blob storage

ds = spark.read.csv(path, header="true", inferSchema="true")
ds.createOrReplaceTempView("SalesData")
mydata = spark.sql("SELECT * FROM SalesData WHERE City = 'Hyderabad'")
mydata.show()

Your overall code looks like the below:
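Putting the two snippets above together in one notebook cell (the storage account key is left blank here and must be filled in with key1/key2 from the Access keys blade):

# Combined cell: configure access to the storage account and query the CSV
account = "vehicleparking"
container = "salesdata"

# Paste key1/key2 from the storage account's Access keys blade here
spark.conf.set("fs.azure.account.key.vehicleparking.blob.core.windows.net", "")

path = "wasbs://" + container + "@" + account + ".blob.core.windows.net/Sales.csv"

ds = spark.read.csv(path, header="true", inferSchema="true")
ds.createOrReplaceTempView("SalesData")
mydata = spark.sql("SELECT * FROM SalesData WHERE City = 'Hyderabad'")
mydata.show()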


Creating storage account in azure

How to create an Azure storage account?

Prerequisite: You must have an Azure subscription.
Below are the steps to create an Azure storage account.
  • Log in to the Azure Portal.
  • Click on the +Create Resource icon.
  • On the next page, click on "Storage account - blob, file, table, queue".
  • Select the subscription details and fill in the account name; the name must be in lowercase letters, start with a letter, and be unique.
  • There are other options like the access tier (Hot/Cool) and the account kind (V2/V1); you can fill them in according to your requirements.
  • Once you have filled in all the details, click on "Review + create".

Creating containers in storage account

Navigate to Blob service, click on Containers, then click on +Container and follow the steps.

Collecting the access key

Navigate to the storage account you created in the earlier step. You can view all the basic details on the Overview tab. Navigate to the Settings section and click on "Access keys". From this window you can copy key1/key2 for future use.

Introduction to Apache Spark

Introduction to Apache Spark

Spark is a general distributed data processing engine built for speed, ease of use, and flexibility. The combination of these three properties is what makes Spark so popular and widely adopted in the industry.
In terms of flexibility, Spark offers a single unified data processing stack that can be used to solve multiple types of data processing workloads, including batch processing, interactive queries, iterative processing needed by machine learning algorithms, and real-time streaming processing to extract actionable insights at near real-time.
A big data ecosystem consists of many pieces of technology, including a distributed storage engine called HDFS, a cluster management system to efficiently manage a cluster of machines, and different file formats to store a large amount of data efficiently in binary and columnar formats. Spark integrates really well with the big data ecosystem, which is another reason why Spark adoption has been growing at a really fast pace. Another really cool thing about Spark is that it is open source; therefore, anyone can download the source code to examine it, to figure out how a certain feature was implemented, or to extend its functionality. In some cases, this can dramatically help with reducing the time to debug problems.

Spark core concept and architecture

The Spark core architecture includes:
  • Spark clusters
  • The resource management system 
  • Spark applications 
  • Spark drivers 
  • Spark executors

Spark Cluster and resource management system

Spark is essentially a distributed system that was designed to process large volumes of data efficiently and quickly. This distributed system is typically deployed onto a collection of machines, known as a Spark cluster. A cluster can be as small as a few machines or as large as thousands of machines.
To efficiently and intelligently manage a collection of machines, companies rely on a resource management system such as Apache YARN or Apache Mesos. We can see the master-slave pattern in this design. The two main components in a typical resource management system are the cluster manager (master) and the worker (slave). The cluster manager knows where the workers are located, how much memory they have, and the number of CPU cores each one has. One of the main responsibilities of the cluster manager is to orchestrate the work by assigning it to each worker. Each worker offers resources (memory, CPU, etc.) to the cluster manager and performs the assigned work.

Spark Application

A Spark application consists of two parts. The first is the application data processing logic expressed using Spark APIs, and the other is the Spark driver. The application data processing logic can be as simple as a few lines of code to perform a few data processing operations or can be as complex as training a large machine learning model that requires many iterations and could run for many hours to complete. The Spark driver is the central coordinator of a Spark application, and it interacts with a cluster manager to figure out which machines to run the data processing logic on. For each one of those machines, the Spark driver requests that the cluster manager launch a process called the Spark executor. Another important job of the Spark driver is to manage and distribute Spark tasks onto each executor on behalf of the application. If the data processing logic requires the Spark driver to display the computed results to a user, then it will coordinate with each Spark executor to collect the computed results and merge them together. The entry point into a Spark application is a class called SparkSession, which provides facilities for setting up configurations as well as APIs for expressing data processing logic.
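As a small illustration of that entry point, here is a minimal sketch of creating a SparkSession in PySpark; the application name and config value are arbitrary examples:

from pyspark.sql import SparkSession

# Build (or get) the SparkSession - the entry point for configuration
# and for the data processing APIs
spark = SparkSession.builder \
    .appName("MyFirstSparkApp") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# A trivial piece of data processing logic that the driver will distribute to executors
df = spark.range(0, 1000)
print(df.count())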

Spark driver and executor

Each Spark executor is a JVM process and is exclusively allocated to a specific Spark application. This was a conscious design decision to avoid sharing a Spark executor between multiple Spark applications, in order to isolate them from each other so one badly behaving Spark application wouldn't affect other Spark applications. The lifetime of a Spark executor is the duration of a Spark application, which could run for a few minutes or for a few days. Since Spark applications run in separate Spark executors, sharing data between them requires writing the data to an external storage system like HDFS. As depicted in the figure above, Spark employs a master-slave architecture, where the Spark driver is the master and the Spark executor is the slave. Each of these components runs as an independent process on a Spark cluster. A Spark application consists of one and only one Spark driver and one or more Spark executors. Playing the slave role, each Spark executor does what it is told, which is to execute the data processing logic in the form of tasks. Each task is executed on a separate CPU core. This is how Spark can speed up the processing of a large amount of data by processing it in parallel. In addition to executing assigned tasks, each Spark executor has the responsibility of caching a portion of the data in memory and/or on disk when it is told to do so by the application logic.
At the time of launching a Spark application, you can request how many Spark executors an application needs and how much memory and the number of CPU cores each executor should have. Figuring out an appropriate number of Spark executors, the amount of memory, and the number of CPU requires some understanding of the amount of data that will be processed, the complexity of the data processing logic, and the desired duration by which a Spark application should complete the processing logic.
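For example, executor resources can be requested through configuration when building the SparkSession. A hedged sketch in PySpark follows; the numbers are arbitrary and the right values depend on your cluster manager, data volume, and workload:

from pyspark.sql import SparkSession

# Illustrative resource requests - the actual values depend on data volume,
# the complexity of the processing logic, and the desired completion time
spark = SparkSession.builder \
    .appName("ResourceSizedApp") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()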

Apache spark ecosystem components

Apache Spark's promise of faster data processing and simpler development is possible only because of its components, which address many of the issues that arose while using Hadoop MapReduce. Now, let's talk about each Spark ecosystem component one by one.

1. Apache Spark Core


Spark Core is the general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and can reference datasets in external storage systems.
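A tiny sketch of the Spark Core layer in PySpark, using the low-level RDD API that the higher-level components build on (it assumes an existing SparkSession named spark; the numbers are arbitrary):

# Low-level RDD example on Spark Core
rdd = spark.sparkContext.parallelize(range(1, 11))

# Transformations are lazy; the action collect() triggers the in-memory computation
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())   # [4, 16, 36, 64, 100]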

2. Spark SQL

Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (now the DataFrame), which provides support for structured and semi-structured data.

3. Spark Streaming

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
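A minimal DStream-style sketch of that mini-batch model; it assumes an existing SparkSession named spark and a text stream on localhost:9999 (for example, from nc -lk 9999):

from pyspark.streaming import StreamingContext

# Mini-batches of 5 seconds; each batch becomes an RDD that we transform
ssc = StreamingContext(spark.sparkContext, batchDuration=5)
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()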

4. MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework on top of Spark, built on the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
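Since ALS is mentioned above, here is a hedged MLlib sketch; the tiny ratings dataset is made up purely for illustration, and it assumes an existing SparkSession named spark:

from pyspark.ml.recommendation import ALS

# Made-up ratings: (userId, itemId, rating)
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0), (2, 2, 4.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5, regParam=0.1)
model = als.fit(ratings)

# Top-2 item recommendations per user
model.recommendForAllUsers(2).show(truncate=False)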

5. GraphX

GraphX is Spark's distributed graph-processing framework. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.

6. SparkR

SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It enables data scientists to analyze large datasets and to run jobs on them interactively from the R shell. The main idea behind SparkR was to explore different techniques to combine the usability of R with the scalability of Spark.


T-SQL LEAD LAG and SUM function based query

  Query on T-SQL window clause Below is the sales table Order_Date Name Product SubCategory ...