Converting flat data to complex JSON in PySpark

 

We will convert table data with repeating rows per employee into nested JSON using Spark.

You can create a CSV file like the one below.

EmployeeID,Name,Color
1,Guru,Red
1,Guru,Green
2,Gova,Red
2,Giva,Yellow
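
For EmployeeID 1, the two rows should collapse into a single JSON record with the colors collected into an array. Using the column names from the schema and aggregation shown later in this post (EmpName, ColorFile), the target record would look roughly like this:

{"EmployeeID":1,"EmpName":"Guru","ColorFile":["Red","Green"]}

Note that the two rows for EmployeeID 2 carry different names (Gova and Giva), so with the grouping used below they will stay in separate records unless the name is corrected in the source file.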

You can import the data into the Databricks File System (DBFS). I have imported it and wrote the code below to validate that it is there.


spark.read.text("/FileStore/tables/EmpData.csv").show()

Below is the output:

+--------------------+
|               value|
+--------------------+
|EmployeeID,Name,C...|
|          1,Guru,Red|
|        1,Guru,Green|
|          2,Gova,Red|
|       2,Giva,Yellow|
+--------------------+

Code to read the CSV and create a DataFrame

# DDL-style schema string; these column names override the header names in the file
Pschema = "EmployeeID INTEGER, EmpName string, Color string"
srcDF = spark.read\
        .option("header","true")\
        .schema(Pschema)\
        .csv("/FileStore/tables/EmpData.csv")

Code to store the data as nested JSON

from pyspark.sql.functions import *

# collect_list gathers all Color values for each (EmployeeID, EmpName) group into an array
df = srcDF.groupBy("EmployeeID","EmpName").agg(collect_list("Color").alias("ColorFile"))

df.write\
  .format("json")\
  .mode("overwrite")\
  .save("/FileStore/tables/EmpData.json")

Code to read the JSON data and convert it back to a flat DataFrame

from pyspark.sql import functions as f

df = spark.read.json("/FileStore/tables/EmpData.json")
df.show()
# explode turns each element of the ColorFile array into its own row
df2 = df.select("EmployeeID","EmpName",f.explode(f.col("ColorFile")).alias("Color"))
df2.show()
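
When the JSON is read back, Spark infers the schema and orders the columns alphabetically, so df.show() should look roughly like this (row order and the order of colors inside the array may vary):

+------------+-------+----------+
|   ColorFile|EmpName|EmployeeID|
+------------+-------+----------+
|[Red, Green]|   Guru|         1|
|       [Red]|   Gova|         2|
|    [Yellow]|   Giva|         2|
+------------+-------+----------+

df2 then explodes the ColorFile array back into one row per color, giving the same flat shape we started with.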
