How to handle data load failure in SSIS?

 There are different ways to handle data load failure. 

1. On dataflow you can redirect failure rows to another location and later retrieve it. 

2. You can also set ignore failure rows on destination settings. 

3. Easiest way is configure destination columns to easily convertible datatype like Varchar/strong with proper length. 

4. If failure can happen due to network issue or connection issues then you can set up retry mechanism. 

5. If it's happening due to memory then you can review the batch size and go for sequential load instead of parallel loads. 

Accessing azure blob storage in azure databricks

 Accessing azure blob storage using azure databricks

There are two different ways to access azure blob storage.
  1. Mounting
  2. Directly accessing storage 

Mounting azure storage 

When we mount azure storage in databricks it behaves like DBFS. We can run all the DBFS commands on the mount point. The mount point persist till we unmount it. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS tokens derived from the storage account key when it accesses this mount point. Below is the code to mount azure storage.

Python code

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}) 

  • <conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net

Scala code 

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))

Directly access azure blob storage

Using account key (python)

spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>")

Using SAS (python)

spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
  "<complete-query-string-of-sas-for-the-container>")

When you directly access the azure storage then scope of access is limited to your notebook but if you mount the storage then you can access data across different notebooks.

Databricks magic commands

 Magic commands in databricks notebook 

We create a databricks notebook with a default language like SQL, SCALA or PYTHON and then we write codes in cells. Databricks notebooks allows us to write non executable instructions or also gives us ability to show charts or graphs for structured data. 
Databricks gives ability to change language of a specific cell or interact with the file system commands with the help of few commands and these are called magic commands. 

What are these magic commands in databricks ?

%fs 

This command allows us to write file system commands in a cell after writing the above command. Lets say we have created a notebook with python as default language but we can use the below code in a cell and execute file system command.

%fs
ls mnt/data

%Python

In case if you have selected default language other than python but you want to execute a specific python code then you can use %Python as first line in the cell and write down your python code below that.

%md

It is called markdown and specifically used to write comment or documentation inside the notebook to explain what kind of code we are writing.
example

%md
<a href="$./myNotebook">Link to notebook in same folder as current notebook</a>
<a href="$../myFolder">Link to folder in parent folder of current notebook</a>
<a href="$./myFolder2/myNotebook2">Link to nested notebook</a>

%sh

%sh is used as first line of the cell if we are planning to write some shell command.

%scala

similar to python you can write %scala and write the scala code.

%r

you can use R code in a cell with this magic command.

%sql

If you are using python/scala notebook and have a dataframe, you can create a temp view from the dataframe and use %sql command to access and query the view using SQL query

Set operations in Sql server

 SQL Server set operation

set operation allows us combine or select common data from more 2 tables/ queries. If you want to get data from more than two tables then you will have to repeat set operator between each query.

There are basically 3 types of set operations like 
  1. Union 
  2. Union all (special case A + B)
  3. Intersection
  4. Except (Minus) 
For demo we will use the below query to create a table and use it throughout out examples

CREATE TABLE Product1 (ProductID INT, ProductName VARCHAR(15)) 

CREATE TABLE Product2 (ProductID INT, ProductName VARCHAR(15)) 

INSERT INTO Product1 VALUES (1,'Pen'),(3,'Notebook'),(4,'Pin'),(6,'Punch'),(7,'Glue')

INSERT INTO Product2 VALUES (1,'Pen'),(2,'Stapler'),(4,'Pin'),(8,'Book'),(9,'Glitter')

UNION

union allows us to combine data from two different queries/tables. If the data is common between 1st and 2nd query then it gives distinct data. The output of union is sorted. 

A = {1,2,3,6}, B = {2,3,4}

UNION - A U B = {1,2,3,4,6}

Example

SELECT ProductID, ProductName FROM Product1
UNION
SELECT ProductID, ProductName FROM Product1

UNION ALL

union all is slightly different than union where data from 2nd query directly gets appended below the first query. Union is always faster than union operation as the data is simple appended and not distinct or sorting operation is done. 

SELECT ProductID, ProductName FROM Product1
UNION ALL
SELECT ProductID, ProductName FROM Product1

INTERSECT

Intersect gives you the common data between two queries. 
A = {1,3,5}
B = {1,5,7}

A ∩ B = {1,5}

SELECT ProductID, ProductName FROM Product1
INTERSECT
SELECT ProductID, ProductName FROM Product1

EXCEPT (MINUS)

This operator gives the result from first query which are not present in 2nd query 

A = {1,3,5}
B = {1,2,5}

A - B ={3}

SELECT ProductID, ProductName FROM Product1
EXCEPT
SELECT ProductID, ProductName FROM Product1

Considerations of set operations

While applying set operator make sure the data type of the column position between the query must be same. for e.g in above queries ProductID in query 1 and 2 at position 1 are both INT and ProductName is in position 2 with data type as VARCHAR. Since set operation allows columns of same data types as same position, make sure that you places columns in correct position, for instance you should not swap two varchar field else you might get wrong result.
 although the query will compile correctly.

Common questions which you can get from set operations are below. 
  • Difference between union and union all
  • Get list of products which were sold in july-2021 but not in june-2021.
  • Get list of product which were sold both in june-2021 and july-2021

Correlated subqueries

What is a correlated subquery ?

Correlated subquery is a sql query where the subquery portion depends on the main/outer query to interpret its result. 

For example if you want to get 2nd max SalesAmt you can use the correlated subquery as below.


Create table #temp
(
Country VARCHAR(10),
[State] VARCHAR(10),
SalesAmt Money
)

INSERT INTO #temp VALUES('India','Karnataka',50000),('India','Andhra',30000),('India','Telangana',15000),
('USA','Texas',50000),('USA','NY',32000),('USA','Redmond',15000),
('UK','London',50000),('UK','Reading',30100),('UK','Edinburg',15000)


SELECT * FROM #temp T WHERE T.SalesAmt = (
SELECT MAX(SalesAmt) Amt FROM #temp A
WHERE SalesAmt < (SELECT MAX(SalesAmt) FROM #temp WHERE A.Country = Country)
AND  A.Country = T.Country)

There are easier methods to get 2nd max like using Ranking functions. You can also combine Group By and joins to do so. 

In correlated subquery you can use operators like =,<>, < ,> or you can use in and not in.

for details you can refer to MSDN

How to delete duplicate from a table ?

Delete duplicates from table in SQL

There are different ways to delete duplicates from a table, we often get duplicates while loading data in tables which does not contain any primary key.

Lets say I have below table and we need to delete duplicates from the table.

EmpName
Ram
Gopi
Ram
Radha

Delete duplicates with help of temp table

We first try to load the distinct records in a temporary table and then delete all records from main table and the reload the main table from temporary table.

SELECT DISTINCT EmpName INTO #Data FROM Employee

DELETE FROM Employee

INSERT INTO Employee (EmpName)
SELECT EmpName FROM #Data

Delete duplicate using CTE

We use row_number() function and common table expression to delete duplicate.

;with cte as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY EmpName ORDER BY EmpName) AS RNK
FROM Employee
)
DELETE FROM cte WHERE RNK > 1




SQL Server interview questions on data types

 Common Interview Questions on SQL Server DataTypes

What is the difference between CHAR and VARCHAR ?

CHAR always allocates fixed data length as defined to store strings but varchar uses variable length.
So If we have CHAR(5) and VARCHAR(5) and we set it with 'AB' then CHAR will append 3 spaces after 'AB' and will be 'AB   ' but VARCHAR(5) will store 'AB'. Then why CHAR ? Reason is CHAR with fixed length can give you better indexing and compression than VARCHAR which may lead to better query performance. 
When should we use CHAR and VARCHAR ?
If we are using short code like State abbreviations which are 2-3 character long then CHAR is suggested but we have names, description then VARCHAR is recommended.

What is the difference between CHAR and NCHAR, VARCHAR and NVARCHAR ?

In the above question N represents Unicode character when we are sure that we are not going to store any unicode character like Chinese or arabic letter then we use VARCHAR but for unicode character we use it as NVARCHAR.

You can try the below SQL to see the difference 

DECLARE @UnicodeChar NVARCHAR(5) = N'٭'
DECLARE @NonUnicodeChar VARCHAR(5) = '٭'
SELECT @UnicodeChar, @NonUnicodeChar

Can we convert an int to bit ?

Yes we can convert an int to bit, any non zero value gets converted to 1 and 0 to 0.

Can we convert bigint to int ?

We can convert only those values which lies in int range else we get arithmetic overflow error.

What is the difference between DATETIME and DATETIME2 ?

DateTime2 gives better time accuracy by extending the decimal point from milliseconds to microseconds.

DECLARE @DT2 DATETIME2 = GETDATE()
DECLARE @DT DATETIME = GETDATE()
SELECT @DT2 [DATETIME2],@DT [DATETIME]

T-SQL LEAD LAG and SUM function based query

  Query on T-SQL window clause Below is the sales table Order_Date Name Product SubCategory ...