Data Engineering Technical Blog

Posts

Showing posts from July, 2021

Databricks magic commands

Magic commands in databricks notebook We create a databricks notebook with a default language like SQL, SCALA or PYTHON and then we write codes in cells. Databricks notebooks allows us to write non executable instructions or also gives us ability to show charts or graphs for structured data. Databricks gives ability to change language of a specific cell or interact with the file system commands with the help of few commands and these are called magic commands. What are these magic commands in databricks ? %fs This command allows us to write file system commands in a cell after writing the above command. Lets say we have created a notebook with python as default language but we can use the below code in a cell and execute file system command. %fs ls mnt/data %Python In case if you have selected default language other than python but you want to execute a specific python code then you can use %Python as first line in the cell and write down your python code b...

Set operations in Sql server

SQL Server set operation set operation allows us combine or select common data from more 2 tables/ queries. If you want to get data from more than two tables then you will have to repeat set operator between each query. There are basically 3 types of set operations like Union Union all (special case A + B) Intersection Except (Minus) For demo we will use the below query to create a table and use it throughout out examples CREATE TABLE Product1 (ProductID INT, ProductName VARCHAR(15)) CREATE TABLE Product2 (ProductID INT, ProductName VARCHAR(15)) INSERT INTO Product1 VALUES (1,'Pen'),(3,'Notebook'),(4,'Pin'),(6,'Punch'),(7,'Glue') INSERT INTO Product2 VALUES (1,'Pen'),(2,'Stapler'),(4,'Pin'),(8,'Book'),(9,'Glitter') UNION union allows us to combine data from two different queries/tables. If the data is common between 1st and 2nd query then it gives distinct data. The output of union is sorted. A = {1,...

Correlated subqueries

What is a correlated subquery ? Correlated subquery is a sql query where the subquery portion depends on the main/outer query to interpret its result. For example if you want to get 2nd max SalesAmt you can use the correlated subquery as below. Create table #temp ( Country VARCHAR(10), [State] VARCHAR(10), SalesAmt Money ) INSERT INTO #temp VALUES('India','Karnataka',50000),('India','Andhra',30000),('India','Telangana',15000), ('USA','Texas',50000),('USA','NY',32000),('USA','Redmond',15000), ('UK','London',50000),('UK','Reading',30100),('UK','Edinburg',15000) SELECT * FROM #temp T WHERE T.SalesAmt = ( SELECT MAX(SalesAmt) Amt FROM #temp A WHERE SalesAmt < (SELECT MAX(SalesAmt) FROM #temp WHERE A.Country = Country) AND A.Country = T.Country) There are easier methods to get 2nd max like using Ranking functions. You can also combine Gro...

How to delete duplicate from a table ?

Delete duplicates from table in SQL There are different ways to delete duplicates from a table, we often get duplicates while loading data in tables which does not contain any primary key. Lets say I have below table and we need to delete duplicates from the table. EmpName Ram Gopi Ram Radha Delete duplicates with help of temp table We first try to load the distinct records in a temporary table and then delete all records from main table and the reload the main table from temporary table. SELECT DISTINCT EmpName INTO #Data FROM Employee DELETE FROM Employee INSERT INTO Employee (EmpName) SELECT EmpName FROM #Data Delete duplicate using CTE We use row_number() function and common table expression to delete duplicate. ;with cte as ( SELECT *,ROW_NUMBER() OVER (PARTITION BY EmpName ORDER BY EmpName) AS RNK FROM Employee ) DELETE FROM cte WHERE RNK > 1

SQL Server interview questions on data types

Common Interview Questions on SQL Server DataTypes What is the difference between CHAR and VARCHAR ? CHAR always allocates fixed data length as defined to store strings but varchar uses variable length. So If we have CHAR(5) and VARCHAR(5) and we set it with 'AB' then CHAR will append 3 spaces after 'AB' and will be 'AB ' but VARCHAR(5) will store 'AB'. Then why CHAR ? Reason is CHAR with fixed length can give you better indexing and compression than VARCHAR which may lead to better query performance. When should we use CHAR and VARCHAR ? If we are using short code like State abbreviations which are 2-3 character long then CHAR is suggested but we have names, description then VARCHAR is recommended. What is the difference between CHAR and NCHAR, VARCHAR and NVARCHAR ? In the above question N represents Unicode character when we are sure that we are not going to store any unicode character like Chinese or arabic letter then we use VARCHAR but for unic...