We will try to convert a table data with repeating rows for an employee to nested json using spark
You can created a csv file like the below.
EmployeeID,Name,Color
1,Guru,Red
1,Guru,Green
2,Gova,Red
2,Giva,Yellow
You can import the data to databricks file system, I have imported the same and wrote the below code to validate if its there.
spark.read.text("/FileStore/tables/EmpData.csv").show()
below is the output
+--------------------+
| value|
+--------------------+
|EmployeeID,Name,C...|
| 1,Guru,Red|
| 1,Guru,Green|
| 2,Gova,Red|
| 2,Giva,Yellow|
+--------------------+
Code to read csv and create dataframe
Pschema = "EmployeeID INTEGER, EmpName string, Color string"
srcDF = spark.read\
.option("header","true")\
.schema(Pschema)\
.csv("/FileStore/tables/EmpData.csv")
Code to store the data into nested json
from pyspark.sql.functions import *
df = srcDF.groupBy("EmployeeID","EmpName").agg(collect_list("Color").alias("ColorFile"))
df.write\
.format("json")\
.mode("overwrite")\
.save("/FileStore/tables/EmpData.json")
Code to read json data and convert to flat dataframe
from pyspark.sql import functions as f
df = spark.read.json("/FileStore/tables/EmpData.json")
df.show()
df2 = df.select("EmployeeID","EmpName",f.explode(col("ColorFile")).alias("Color"))
df2.show()