Combine multiple features after CountVectorizer and TF-IDF

Combine multiple text features using CountVectorizer in Python

Case: I need to run NLP on the title, description, and product_url columns and combine them for prediction.

Using FeatureUnion

To do so we can pipeline the steps of field extraction and conversion through

from sklearn.pipeline import FeatureUnion, Pipeline

FeatureUnion lets you combine multiple feature extractors into a single feature matrix, and Pipeline chains the processing steps together.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(data_subset, y, random_state=50, test_size=0.2)

# TfidfVectorizer is equivalent to a CountVectorizer followed by a
# TfidfTransformer, so we use it directly inside each per-column pipeline.

transformer = [('FeatureUnion',FeatureUnion([
                ('title_tfidf',
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['title'],
                                                  validate=False)),
                            ('tfidf',
                              TfidfVectorizer())])),
                ('description_tfidf',
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['description'],
                                                  validate=False)),
                            ('tfidf',
                              TfidfVectorizer())])),
                ('producturl_tfidf',
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['product_url'],
                                                  validate=False)),
                            ('tfidf',
                              TfidfVectorizer())])),]))]
transformer.append(('rfc',RandomForestClassifier(n_estimators = 100)))
newPipeline = Pipeline(transformer)
newPipeline.fit(X_train.loc[:,['title','description','product_url']],y_train)
print(newPipeline.score(X_test.loc[:,['title','description','product_url']],y_test))
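
On scikit-learn 0.20 and later, the same per-column TF-IDF stacking can be written more directly with ColumnTransformer, which applies one vectorizer per column and horizontally stacks the resulting sparse matrices, just like the FeatureUnion above. A minimal sketch, with a tiny made-up dataset standing in for the real product data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for the real title/description/product_url data
data = pd.DataFrame({
    'title': ['red shoes', 'blue shirt', 'green shoes', 'blue hat'],
    'description': ['comfy red shoes', 'cotton blue shirt',
                    'running green shoes', 'warm blue hat'],
    'product_url': ['shop.example/shoes1', 'shop.example/shirt1',
                    'shop.example/shoes2', 'shop.example/hat1'],
})
y = [0, 1, 0, 1]

# One TfidfVectorizer per text column; a string selector passes the
# column as a 1-D array, which is what TfidfVectorizer expects.
preprocess = ColumnTransformer([
    ('title_tfidf', TfidfVectorizer(), 'title'),
    ('description_tfidf', TfidfVectorizer(), 'description'),
    ('url_tfidf', TfidfVectorizer(), 'product_url'),
])

model = Pipeline([('features', preprocess),
                  ('rfc', RandomForestClassifier(n_estimators=100,
                                                 random_state=0))])
model.fit(data, y)
print(model.score(data, y))
```

The upside over the FunctionTransformer approach is that column selection is declarative, so there are no lambdas to break pickling of the fitted model.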

Azure Data Factory V2 - SSIS Deployment

How can we deploy SSIS in Azure Data Factory V2?

With Azure Data Factory V2 (Preview) you can easily deploy and schedule Integration Services packages.
Prerequisite: an Azure subscription (a free trial subscription works).

Steps to deploy SSIS in Azure

1. Log in to the Azure portal, open the menu, and click New as shown.
Data Factory selection in the "New" pane
2. You can also search for Data Factory from the search box and click it; you will be redirected to the data factory creation page as below.
"New data factory" page
3. Fill in all the details (the version should be V2 (Preview)) and click Create.
4. Once the data factory is created, navigate to it and click the Author & Monitor tile as below.
Home page for the data factory
5. You will be redirected to the authoring page, which offers several options such as Create pipeline, Copy data, and Configure SSIS Integration Runtime. Since we want to deploy an SSIS package, we need to configure the SSIS runtime; click that option as shown below.
"Configure SSIS Integration Runtime" tile
6. On the runtime settings page, fill in the desired configuration details such as name, number of nodes, and location, as below.
General settings
7. Click Next to reach the SQL settings page and fill in the required details. (You need an Azure SQL database provisioned to host the SSIS catalog, so create one accordingly.)
SQL settings
8. Click Next and fill in the maximum number of parallel executions per node.
Advanced settings

If you are using a VNet, select the corresponding checkbox.

Finally, click Finish. Your runtime setup is now complete; next you need to deploy the SSIS package.

To deploy the SSIS package, use the same SQL database details you entered for the SSIS runtime when deploying through SSDT.

T-SQL Query Script to monitor Memory Usage of a SQL Server Instance

How can we monitor memory usage by a SQL Server instance?


 -- We don't need the row count 
 SET NOCOUNT ON 
 
 -- Get size of SQL Server Page in bytes 
 DECLARE @pg_size INT, @Instancename varchar(50) 
 SELECT @pg_size = low from master..spt_values where number = 1 and type = 'E' 
 
 -- Extract perfmon counters to a temporary table 
 IF OBJECT_ID('tempdb..#perfmon_counters') is not null DROP TABLE #perfmon_counters 
 SELECT * INTO #perfmon_counters FROM sys.dm_os_performance_counters 
 
 -- Get SQL Server instance name 
 SELECT @Instancename = LEFT([object_name], (CHARINDEX(':',[object_name]))) FROM #perfmon_counters WHERE counter_name = 'Buffer cache hit ratio' 
 
 -- Print Memory usage details 
 PRINT '----------------------------------------------------------------------------------------------------' 
 PRINT 'Memory usage details for SQL Server instance ' + @@SERVERNAME + ' (' + CAST(SERVERPROPERTY('productversion') AS VARCHAR) + ' - ' + SUBSTRING(@@VERSION, CHARINDEX('X',@@VERSION),4) + ' - ' + CAST(SERVERPROPERTY('edition') AS VARCHAR) + ')' 
 PRINT '----------------------------------------------------------------------------------------------------' 
 SELECT 'Memory visible to the Operating System' 
 -- Note: on SQL Server 2012 and later, sys.dm_os_sys_info exposes physical_memory_kb and virtual_memory_kb instead of the *_in_bytes columns used below 
 SELECT CEILING(physical_memory_in_bytes/1048576.0) as [Physical Memory_MB], CEILING(physical_memory_in_bytes/1073741824.0) as [Physical Memory_GB], CEILING(virtual_memory_in_bytes/1073741824.0) as [Virtual Memory GB] FROM sys.dm_os_sys_info 
 SELECT 'Buffer Pool Usage at the Moment' 
 SELECT (bpool_committed*8)/1024.0 as BPool_Committed_MB, (bpool_commit_target*8)/1024.0 as BPool_Commit_Tgt_MB,(bpool_visible*8)/1024.0 as BPool_Visible_MB FROM sys.dm_os_sys_info 
 SELECT 'Total Memory used by SQL Server Buffer Pool as reported by Perfmon counters' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Total Server Memory (KB)' 
 SELECT 'Memory needed as per current Workload for SQL Server instance' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Target Server Memory (KB)' 
 SELECT 'Total amount of dynamic memory the server is using for maintaining connections' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Connection Memory (KB)' 
 SELECT 'Total amount of dynamic memory the server is using for locks' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Lock Memory (KB)' 
 SELECT 'Total amount of dynamic memory the server is using for the dynamic SQL cache' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'SQL Cache Memory (KB)' 
 SELECT 'Total amount of dynamic memory the server is using for query optimization' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Optimizer Memory (KB) ' 
 SELECT 'Total amount of dynamic memory used for hash, sort and create index operations.' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Granted Workspace Memory (KB) ' 
 SELECT 'Total Amount of memory consumed by cursors' 
 SELECT cntr_value as Mem_KB, cntr_value/1024.0 as Mem_MB, (cntr_value/1048576.0) as Mem_GB FROM #perfmon_counters WHERE counter_name = 'Cursor memory usage' and instance_name = '_Total' 
 SELECT 'Number of pages in the buffer pool (includes database, free, and stolen).' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name= @Instancename+'Buffer Manager' and counter_name = 'Total pages' 
 SELECT 'Number of Data pages in the buffer pool' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Database pages' 
 SELECT 'Number of Free pages in the buffer pool' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Free pages' 
 SELECT 'Number of Reserved pages in the buffer pool' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Reserved pages' 
 SELECT 'Number of Stolen pages in the buffer pool' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Stolen pages' 
 SELECT 'Number of Plan Cache pages in the buffer pool' 
 SELECT cntr_value as [8KB_Pages], (cntr_value*@pg_size)/1024.0 as Pages_in_KB, (cntr_value*@pg_size)/1048576.0 as Pages_in_MB FROM #perfmon_counters WHERE object_name=@Instancename+'Plan Cache' and counter_name = 'Cache Pages' and instance_name = '_Total' 
 SELECT 'Page Life Expectancy - Number of seconds a page will stay in the buffer pool without references' 
 SELECT cntr_value as [Page Life in seconds],CASE WHEN (cntr_value > 300) THEN 'PLE is Healthy' ELSE 'PLE is not Healthy' END as 'PLE Status' FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Page life expectancy' 
 SELECT 'Number of requests per second that had to wait for a free page' 
 SELECT cntr_value as [Free list stalls/sec] FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Free list stalls/sec' 
 SELECT 'Number of pages flushed to disk/sec by a checkpoint or other operation that require all dirty pages to be flushed' 
 SELECT cntr_value as [Checkpoint pages/sec] FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Checkpoint pages/sec' 
 SELECT 'Number of buffers written per second by the buffer manager''s lazy writer' 
 SELECT cntr_value as [Lazy writes/sec] FROM #perfmon_counters WHERE object_name=@Instancename+'Buffer Manager' and counter_name = 'Lazy writes/sec' 
 SELECT 'Total number of processes waiting for a workspace memory grant' 
 SELECT cntr_value as [Memory Grants Pending] FROM #perfmon_counters WHERE object_name=@Instancename+'Memory Manager' and counter_name = 'Memory Grants Pending' 
 SELECT 'Total number of processes that have successfully acquired a workspace memory grant' 
 SELECT cntr_value as [Memory Grants Outstanding] FROM #perfmon_counters WHERE object_name=@Instancename+'Memory Manager' and counter_name = 'Memory Grants Outstanding'
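
The conversions used throughout the script are fixed factors: a SQL Server page is 8 KB (the 8192 that @pg_size reads from master..spt_values), KB to MB divides by 1024, and KB to GB divides by 1048576 (1024 * 1024). A quick Python sanity check of that arithmetic:

```python
# Unit conversions mirroring the T-SQL script above.
PAGE_SIZE_BYTES = 8192  # value @pg_size reads from master..spt_values (type 'E', number 1)

def pages_to_mb(page_count):
    # pages * @pg_size / 1048576.0 in the script
    return page_count * PAGE_SIZE_BYTES / 1048576.0

def kb_to_gb(kb):
    # cntr_value / 1048576.0 in the script
    return kb / 1048576.0

# A buffer pool of 131072 8-KB pages is exactly 1 GB
print(pages_to_mb(131072))   # 1024.0
print(kb_to_gb(2097152))     # 2.0
```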

SQL Server: get procedures by text content

Scenario: Find the list of procedures which contain the text "datafinder".


Solution: To find procedures by text content we need to query two catalog views - sys.procedures and sys.sql_modules.

The query to get all procedures containing the text "datafinder" is below.

SELECT [Name] FROM sys.procedures AS A
INNER JOIN sys.sql_modules AS B
ON A.object_id = B.object_id
AND B.definition like '%datafinder%'


Explanation: sys.sql_modules keeps the definition of every SQL module (view, procedure, or function) keyed by object_id, and sys.procedures lists each procedure with its object_id. Since we are specifically looking for procedures containing a given text, we join sys.sql_modules with sys.procedures. We can do the same for SQL Server views, as shown below.

SELECT [Name],definition FROM sys.views AS A
INNER JOIN sys.sql_modules AS B
ON A.object_id = B.object_id
AND B.definition like '%datafinder%'

Or for function we can do as below

SELECT B.name, A.definition FROM sys.sql_modules AS A
INNER JOIN sys.sysobjects AS B
ON A.object_id = B.id
WHERE B.type = 'FN'
AND A.definition like '%datafinder%'


sys.sysobjects lists all objects in the database (functions, tables, procedures, views, etc.), distinguished by the type column.


SSIS 2017 scale out deployment

1. Install SQL Server features
In the SQL Server installation wizard, on the Feature Selection page, select the following items:
·        Database Engine Services
·        Integration Services
o   Scale Out Master
o   Scale Out Worker
On the Server Configuration page, click Next to accept the default service accounts and startup types.
On the Database Engine Configuration page, select Mixed Mode and click Add Current User.
On the Integration Services Scale Out Configuration - Master Node and Integration Services Scale Out Configuration - Worker Node pages, click Next to accept the default settings for the port and certificates.
Finish the SQL Server installation Wizard.
2. Install SQL Server Management Studio
Download and install SQL Server Management Studio (SSMS).

3. Enable Scale Out

Open SSMS (in Administrator mode) and connect to the local SQL Server instance. In Object Explorer, right-click Integration Services Catalogs and select Create Catalog.
In the Create Catalog dialog, the option Enable this server as SSIS scale out master is selected by default.

4. Enable a Scale Out Worker

In SSMS, right-click SSISDB and select Manage Scale Out.

Manage Scale Out
The Integration Services Scale Out Manager app opens.

1. Open Scale Out Manager from SQL Server Management Studio

Open SQL Server Management Studio (SSMS) and connect to the SQL Server instance of Scale Out Master.
In Object Explorer, right-click SSISDB, and select Manage Scale Out.

2. Open Scale Out Manager by running ISManager.exe

Locate ISManager.exe under %SystemDrive%\Program Files (x86)\Microsoft SQL Server\140\DTS\Binn\Management. Right-click ISManager.exe and select Run as administrator.
After Scale Out Manager opens, enter the SQL Server instance name of Scale Out Master and connect to it to manage your Scale Out environment.

Tasks available in Scale Out Manager

In Scale Out Manager, you can do the following things:

Enable Scale Out

After connecting to SQL Server, if Scale Out is not enabled, you can select Enable to enable it.

View Scale Out Master status

The status of Scale Out Master is shown on the Dashboard page.

View Scale Out Worker status

The status of Scale Out Worker is shown on the Worker Manager page. You can select each worker to see the individual status.

Add a Scale Out Worker

To add a Scale Out Worker, select + at the bottom of the Scale Out Worker list.
Enter the computer name of the Scale Out Worker you want to add and click Validate. Scale Out Manager checks whether the current user has access to the certificate stores on the Scale Out Master and Scale Out Worker computers.
If validation succeeds, Scale Out Manager tries to read the worker service configuration file and get the certificate thumbprint of the worker. For more info, see Scale Out Worker. If Scale Out Manager can't read the worker service configuration file, there are two alternative ways for you to provide the worker certificate.
1. You can enter the thumbprint of the worker certificate directly.
2. Or, you can provide the certificate file.
After gathering information, Scale Out Manager describes the actions to be performed. Typically, these actions include installing the certificate, updating the worker service configuration file, and restarting the worker service.
If the worker certificate is not accessible, you have to update it manually and restart the worker service.
Select the Confirm checkbox and then select OK to start adding a Scale Out Worker.

Delete a Scale Out Worker

To delete a Scale Out Worker, select the Scale Out Worker and then select - at the bottom of the Scale Out Worker list.

Enable or disable a Scale Out Worker

To enable or disable a Scale Out Worker, select the Scale Out Worker and then select Enable Worker or Disable Worker. If the worker is not offline, the status of the worker displayed in Scale Out Manager changes accordingly.

Edit a Scale Out Worker description

To edit the description of a Scale Out Worker, select the Scale Out Worker and then select Edit. After you finish editing the description, select Save.

5. Run packages in Scale Out

Run packages with the Execute Package in Scale Out dialog box

  1. Open the Execute Package In Scale Out dialog box.
    In SQL Server Management Studio, connect to the Integration Services server. In Object Explorer, expand the tree to display the nodes under Integration Services Catalogs. Right-click the SSISDB node or the project or the package you want to run, and then click Execute in Scale Out.
  2. Select packages and set the options.
    On the Package Selection page, select one or more packages to run. Set the environment, parameters, connection managers, and advanced options for each package. Click a package to set these options.
    On the Advanced tab, set a Scale Out option called Retry count to specify the number of times a package execution will retry if it fails.
  3. Select worker computers.
    On the Machine Selection page, select the Scale Out Worker computers to run the packages. By default, any computer is allowed to run the packages.
  4. Run the packages and view reports.
    Click OK to start the package executions. To view the execution report for a package, right-click the package in Object Explorer, click Reports, click All Executions, and find the execution.

Run packages with stored procedures

  1. Create executions.
    Call [catalog].[create_execution] for each package. Set the parameter @runinscaleout to True. If not all Scale Out Worker computers are allowed to run the package, set the parameter @useanyworker to False.
  2. Set execution parameters.
    Call [catalog].[set_execution_parameter_value] for each execution.
  3. Set the Scale Out Workers.
    Call [catalog].[add_execution_worker]. If all computers are allowed to run the package, you don't have to call this stored procedure.
  4. Start the executions.
    Call [catalog].[start_execution]. Set the parameter @retry_count to the number of times a package execution will retry if it fails.
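
The four calls above can be assembled into a single T-SQL batch to run against SSISDB. A minimal Python sketch that builds and prints such a batch; the folder, project, and package names are hypothetical placeholders, not from the original post, and the batch would be executed with whatever SQL client you use:

```python
# Sketch: the Scale Out execution sequence from the steps above,
# assembled as one T-SQL batch string. Names are placeholders.
folder, project, package = "MyFolder", "MyProject", "MyPackage.dtsx"

batch = f"""
DECLARE @execution_id BIGINT;

-- 1. Create the execution in Scale Out (@runinscaleout = 1, any worker allowed)
EXEC [SSISDB].[catalog].[create_execution]
    @folder_name = N'{folder}', @project_name = N'{project}',
    @package_name = N'{package}',
    @runinscaleout = 1, @useanyworker = 1,
    @execution_id = @execution_id OUTPUT;

-- 2. Set execution parameters (object_type 50 = system parameter)
EXEC [SSISDB].[catalog].[set_execution_parameter_value]
    @execution_id, @object_type = 50,
    @parameter_name = N'LOGGING_LEVEL', @parameter_value = 1;

-- 3. (Optional) pin the execution to specific workers; skip if any worker may run it
-- EXEC [SSISDB].[catalog].[add_execution_worker] @execution_id, @workeragent_id = ...;

-- 4. Start the execution, retrying up to 3 times on failure
EXEC [SSISDB].[catalog].[start_execution] @execution_id, @retry_count = 3;
"""
print(batch)
```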

T-SQL LEAD LAG and SUM function based query

  Query on T-SQL window clause Below is the sales table Order_Date Name Product SubCategory ...