Converting flat data to complex json inPySpark

 

We will try to convert a table data with repeating rows for an employee to nested json using spark

You can created a csv file like the below.

EmployeeID,Name,Color
1,Guru,Red
1,Guru,Green
2,Gova,Red
2,Giva,Yellow

You can import the data to databricks file system, I have imported the same and wrote the below code to validate if its there.


spark.read.text("/FileStore/tables/EmpData.csv").show()

below is the output 

+--------------------+ | value| +--------------------+ |EmployeeID,Name,C...| | 1,Guru,Red| | 1,Guru,Green| | 2,Gova,Red| | 2,Giva,Yellow| +--------------------+

Code to read csv and create dataframe

Pschema = "EmployeeID INTEGER, EmpName string, Color string"
srcDF = spark.read\
        .option("header","true")\
        .schema(Pschema)\
        .csv("/FileStore/tables/EmpData.csv")

Code to store the data into nested json

from pyspark.sql.functions import * 

df = srcDF.groupBy("EmployeeID","EmpName").agg(collect_list("Color").alias("ColorFile"))

df.write\
  .format("json")\
  .mode("overwrite")\
  .save("/FileStore/tables/EmpData.json")

Code to read json data and convert to flat dataframe

from pyspark.sql import functions as f

df = spark.read.json("/FileStore/tables/EmpData.json")
df.show()
df2 = df.select("EmployeeID","EmpName",f.explode(col("ColorFile")).alias("Color"))
df2.show()

1 comment:

T-SQL LEAD LAG and SUM function based query

  Query on T-SQL window clause Below is the sales table Order_Date Name Product SubCategory ...