
Using the MERGE Statement in SSIS Via Stored Procedure


(this is a modified version of the article I wrote for SQLServerCentral.com, published on 1/23/2012, 'Using the MERGE Statement in SSIS Via A Stored Procedure')

Background

In our Enterprise Data Warehouse (EDW) within the Northwestern University Biomedical Informatics Center (NUBIC), exact copies of many source systems exist. From that source system data we create several data marts to organize clinical data in meaningful ways to assist with reporting. Our current process to populate such data marts is to do the following.

  1. Stage the data in a staging table
  2. Delete rows from the destination/production table that exist in the staging table
  3. Insert the remaining data from the staging table into the production table

Many times in our SSIS ETL packages we will overlap slightly with the last load date/time to make sure all new data is collected. This practice causes numerous delete transactions on the production table on a daily basis before the new rows are inserted from staging. For the most part, these deletes are unnecessary since many of the deleted rows are re-inserted with the same data (because of the deliberate overlap mentioned earlier). In the long-term these unnecessary deletes can cause issues with fragmentation.
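As a rough sketch, the delete-and-insert pattern looks something like this (using the example staging and production tables that appear later in this article):

--Step 2: delete production rows that also exist in staging
--(many of these rows are re-inserted unchanged because of the deliberate overlap)
DELETE tgt
FROM edw.adventure_hospital_dm.visits tgt
WHERE EXISTS (SELECT 1
              FROM edw.staging.stg_visits src
              WHERE src.financial_nbr = tgt.financial_nbr);

--Step 3: insert everything from staging into production
INSERT INTO edw.adventure_hospital_dm.visits
    (name, financial_nbr, medical_number, registration_date,
     meta_orignl_load_dts, meta_update_dts)
SELECT name, financial_nbr, medical_number, registration_date,
       meta_orignl_load_dts, meta_update_dts
FROM edw.staging.stg_visits;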

Problem

To avoid these excessive deletes we investigated the option of a MERGE statement in place of the 3-step process listed previously.  This functionality initially looked very promising since it would allow us to stage whatever data we needed and then that data could simply be merged into the production data.  The MERGE would update rows that already existed in production and would insert the rows that did not exist.  This would eliminate the unnecessary deletes that occur in our current process.

Our evaluation process uncovered some issues with the MERGE statement and SSIS.

  1. There is currently no simple MERGE task available in SSIS 2008. The easiest way to do this is to create an SSIS Execute SQL task and manually type out the MERGE statement.
  2. SSIS Execute SQL Task MERGE statement maintenance. If we were to use the Execute SQL Task to execute a MERGE statement, the statements would be rather large. Many of our data mart tables are very wide and are prone to having new columns added to track new clinical information. Manually entering the MERGE statement in an Execute SQL Task creates maintenance issues if a data mart were ever to be modified. If changes to the data mart occur, the MERGE statement would need to be manually updated as well to reflect such changes.
  3. SSIS Execute SQL Task character limit. Manual entry of the merge statement into the query editor window of the SSIS Execute SQL task is subject to a 32,767 character limit. Again, many of our data mart tables are very wide and contain several columns where this could potentially be an issue.
  4. SSIS Execute SQL Task manual entry errors. Even if the 32,767 character limit is not met, the MERGE statement itself can still be very long. To type all of the statement manually creates several opportunities for typographical errors, including omission of specific columns.

If we were going to use the MERGE statement via SSIS a solution needed to be created that would be easy to use but also would resolve the issues we uncovered in our investigation.

Proposed Solution

Our solution was to create a stored procedure that could dynamically generate the MERGE statement outside of the SSIS Execute SQL Task and execute it.  This article will walk through step-by-step how we built our stored procedure.  An example MERGE statement is provided below.  As we review each step we will see how the stored procedure would generate this example MERGE statement.

To begin, let’s first review the MERGE statement and its construction (Here is a link to a very useful SQLServerCentral technical article which provides a thorough overview of the MERGE statement). A MERGE is comprised of the following parts.

  • A source table  (this would be our staging table)
  • A target table (this would be our production table)
  • A predicate (how to join the source and target tables)
  • An Update command to be executed when the source and target rows match on the predicate
  • An Insert command to be executed when the source row does not match a target row
  • A Delete command to be executed when the target rows do not exist in the source

For our purposes there was no need for the delete command because we want to keep the rows on the target table that do not exist in the source table. Here is the example MERGE statement that we will attempt to generate and execute via this stored procedure.  Within this example we have a staging/source table called edw.staging.stg_visits and a production/target table called edw.adventure_hospital_dm.visits.

MERGE INTO
--1)  Target table
edw.adventure_hospital_dm.visits tgt
USING
--2)  Source table
edw.staging.stg_visits src
ON
--3)  Predicate
src.financial_nbr = tgt.financial_nbr
--4)  Update the Target table when matched
WHEN matched THEN
UPDATE
SET tgt.name                 = src.name,
    tgt.financial_nbr        = src.financial_nbr,
    tgt.medical_number       = src.medical_number,
    tgt.registration_date    = src.registration_date,
    tgt.meta_orignl_load_dts = src.meta_orignl_load_dts,
    tgt.meta_update_dts      = src.meta_update_dts
--5)  Insert into the Target table when not matched
WHEN NOT matched THEN
INSERT (
    name,
    financial_nbr,
    medical_number,
    registration_date,
    meta_orignl_load_dts,
    meta_update_dts
    )
VALUES (
    src.name,
    src.financial_nbr,
    src.medical_number,
    src.registration_date,
    src.meta_orignl_load_dts,
    src.meta_update_dts);

Before the stored procedure was built we determined that a useful MERGE stored procedure would need to satisfy the following requirements.

  1. Accept parameters to enter the Source database/schema/table and the Target database/schema/table.
  2. Automatically determine the predicate between the Source and Target tables
  3. Automatically determine whether a Source column data type matches a Target column data type and store the matching columns in a temp table
  4. Is smart enough to NOT update primary key column(s) on the Target table
  5. Generate a dynamic SQL MERGE statement based on the matched columns stored in the temp table
  6. Execute the dynamic SQL MERGE statement

Step 1: Accept parameters to enter the Source database, schema, and table along with the Target database, schema, and table.

Our goal for this solution was for it to be easy to use.  We wanted to be able to execute a stored procedure and simply tell it which two tables to merge.  To do this we chose to create a stored procedure with 8 parameters.  The first 3 parameters (database, schema, and table name) are used to pass in the Source table.  The second 3 parameters (database, schema, and table name) are used to pass in the Target table.  Even though the stored procedure will automatically generate the matching predicate (which is discussed in Step #2), we created the 7th parameter as an option to manually pass in a comma-separated list of predicate items to match on if the tables being merged either do not have a primary key or the user would like to match on something other than the primary key.  The 8th parameter, @debug, optionally returns the text of the generated MERGE statement instead of executing it.

Here is the section of code that was used to create the stored procedure along with its arguments and necessary variables.

CREATE PROCEDURE [adventure_hospital].[generate_merge]
                @SrcDB          SYSNAME,         --Name of the Source database
                @SrcSchema      SYSNAME,         --Name of the Source schema
                @SrcTable       SYSNAME,         --Name of the Source table
                @TgtDB          SYSNAME,         --Name of the Target database
                @TgtSchema      SYSNAME,         --Name of the Target schema
                @TgtTable       SYSNAME,         --Name of the Target table
                @predicate      SYSNAME  = NULL, --(optional)Override to automatic predicate generation.  A comma-separated list of predicate match items
                @debug          SMALLINT = NULL  --(optional)Pass in 1 to kick out just the MERGE statement text without executing it
AS
BEGIN
	DECLARE @merge_sql      NVARCHAR(MAX);  --overall dynamic sql statement for the merge
	DECLARE @columns_sql    NVARCHAR(MAX);  --the dynamic sql that finds the columns shared by the source and target tables (populates @columns)
	DECLARE @pred_sql       NVARCHAR(MAX);  --the dynamic sql to generate the predicate/matching-statement of the merge dynamic sql (populates @pred)
	DECLARE @pk_sql         NVARCHAR(MAX);  --the dynamic sql to populate the @pk table variable that holds the primary keys of the target table
	DECLARE @updt           NVARCHAR(MAX);  --contains the comma-separated columns used in the UPDATE portion of the merge dynamic sql (populated from @columns)
	DECLARE @INSERT         NVARCHAR(MAX);  --contains the comma-separated columns used in the INSERT portion of the merge dynamic sql (populated from @columns)
	DECLARE @vals           NVARCHAR(MAX);  --contains the comma-separated columns used in the VALUES portion of the merge dynamic sql (populated from @columns)
	DECLARE @pred           NVARCHAR(MAX);  --contains the predicate/matching-statement of the merge dynamic sql (populated by @pred_sql)
	DECLARE @pred_param     NVARCHAR(MAX) = @predicate;  --populated by @predicate.  used in the dynamic generation of the predicate statement of the merge
	DECLARE @pred_item      NVARCHAR(MAX);  --used as a placeholder for each individual item contained within the explicitly passed-in predicate
	DECLARE @done_ind       SMALLINT = 0;   --used in the dynamic generation of the predicate statement of the merge
	DECLARE @dsql_param     NVARCHAR(500);  --contains the necessary parameters for the dynamic sql execution

Step 2: Automatically determine the predicate used between the Source and Target tables
Before the predicate is automatically generated there must be a check to see whether or not a predicate was passed in as a parameter. If the predicate parameter is populated then the comma-separated list that was passed in is broken down into its individual items and the predicate matching statement is generated and assigned to @pred.

If the predicate was not passed in as a parameter, a dynamic SQL statement (@pred_sql) is generated to query the INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE view using the Target database, schema, and table parameters. The INFORMATION_SCHEMA views are available in SQL Server 2000 and above. Since the output of the dynamic SQL (@pred_sql) needs to be collected and assigned to a variable (@pred), a parameter definition (@dsql_param) is passed to sp_executesql so that the output can be returned and assigned appropriately. Once the dynamic SQL (@pred_sql) and the dynamic SQL parameter (@dsql_param) have been created, they can be executed and the output assigned to @pred. We now have our predicate statement.

/****************************************************************************************
* This generates the matching (aka predicate) statement of the Merge.                  *
* If a predicate is explicitly passed in, use that to generate the matching statement. *
* Else execute the @pred_sql statement to decide what to match on and generate the     *
* matching statement automatically.                                                    *
****************************************************************************************/
 
IF @pred_param IS NOT NULL
  -- If a comma-separated list of predicate match items was passed in via @predicate
  BEGIN
  -- These next two SET statements do basic clean-up on the comma-separated list of predicate items (@pred_param)
  -- if the user passed in a predicate that begins with a comma, strip it out
  SET @pred_param = CASE WHEN SUBSTRING(ltrim(@pred_param),1,1) = ',' THEN SUBSTRING(@pred_param,(charindex(',',@pred_param)+1),LEN(@pred_param)) ELSE @pred_param END
  --if the user passed in a predicate that ends with a comma, strip it out
  SET @pred_param = CASE WHEN SUBSTRING(rtrim(@pred_param),LEN(@pred_param),1) = ',' THEN SUBSTRING(@pred_param,1,LEN(@pred_param)-1) ELSE @pred_param END
  -- End clean-up of (@pred_param)
  -- loop through the comma-separated predicate that was passed in via the parameter and construct the predicate statement
  WHILE (@done_ind = 0)
    BEGIN
    SET @pred_item = CASE WHEN charindex(',',@pred_param) > 0 THEN SUBSTRING(@pred_param,1,(charindex(',',@pred_param)-1)) ELSE @pred_param END
    SET @pred_param = SUBSTRING(@pred_param,(charindex(',',@pred_param)+1),LEN(@pred_param))
    SET @pred = CASE WHEN @pred IS NULL THEN (COALESCE(@pred,'') + 'src.[' + @pred_item + '] = ' + 'tgt.[' + @pred_item + ']') ELSE (COALESCE(@pred,'') + ' and ' + 'src.[' + @pred_item + '] = ' + 'tgt.[' + @pred_item + ']') END
    SET @done_ind = CASE WHEN @pred_param = @pred_item THEN 1 ELSE 0 END
    END
  END
ELSE
  -- If an explicit list of predicate match items was NOT passed in then automatically construct the predicate
  -- match statement based on the primary keys of the Source and Target tables
  BEGIN
  SET @pred_sql = ' SELECT @predsqlout = COALESCE(@predsqlout+'' and '','''')+' +
                  '(''''+''src.''+column_name+'' = tgt.''+ccu.column_name)' +
                  ' FROM ' +
                  @TgtDB + '.INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc_tgt' +
                  ' INNER JOIN ' + @TgtDB +'.INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu' +
                  ' ON tc_tgt.CONSTRAINT_NAME = ccu.Constraint_name' +
                  ' AND tc_tgt.table_schema = ccu.table_schema' +
                  ' AND tc_tgt.table_name = ccu.table_name' +
                  ' WHERE' +
                  ' tc_tgt.CONSTRAINT_TYPE = ''Primary Key''' +
                  ' and tc_tgt.table_catalog = ''' + @TgtDB + '''' +
                  ' and tc_tgt.table_name = ''' + @TgtTable + '''' +
                  ' and tc_tgt.table_schema = ''' + @TgtSchema + ''''
  SET @dsql_param = '	@predsqlout nvarchar(max) OUTPUT'
 
  EXEC sp_executesql
  @pred_sql,
  @dsql_param,
  @predsqlout = @pred OUTPUT;
END

The benefit of generating the predicate this way is that it automatically handles composite primary keys on the Target table: every column in the primary key constraint is evaluated and added to the predicate statement. The example output from the @pred variable is listed here.

src.financial_nbr = tgt.financial_nbr
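If the target table had a composite primary key (say, hypothetical key columns financial_nbr and visit_nbr), the same logic would simply chain the matches together:

src.financial_nbr = tgt.financial_nbr and src.visit_nbr = tgt.visit_nbr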

Step 3: Automatically determine if a Source column data type matches the corresponding Target column data type and store the matching columns in a temp table
To do this we first need to create the table variable (@columns) that will be used to hold all the columns shared between the Source and the Target tables. These columns must have the same name and a compatible data type. By 'compatible data type' I mean both that the data type is the same and that the precision/length of the Target table column is equal to or greater than that of the Source table column. For example, the medical_number column on the Source table may be varchar(10). As long as the corresponding medical_number column on the Target/Production table is at least a varchar(10), the logic will consider that a match and store that column in the temp table.

--Create the temporary table to collect all the columns shared
--between both the Source and Target tables.
 
DECLARE @COLUMNS TABLE (
 table_catalog            VARCHAR(100) NULL,
 table_schema             VARCHAR(100) NULL,
 TABLE_NAME               VARCHAR(100) NULL,
 column_name              VARCHAR(100) NULL,
 data_type                VARCHAR(100) NULL,
 character_maximum_length INT NULL,
 numeric_precision        INT NULL,
 src_column_path          VARCHAR(100) NULL,
 tgt_column_path          VARCHAR(100) NULL
)

Once the @columns temp table has been created, a dynamic SQL statement (@columns_sql) is generated and executed to populate it. The @columns_sql statement takes the values passed in via parameters and uses them to query the INFORMATION_SCHEMA.COLUMNS view to find matching columns between the Source and Target tables. Once the @columns_sql dynamic SQL has been constructed it is executed and the returned values are inserted into the @columns temp table.

/************************************************************************************************
* Generate the dynamic sql (@columns_sql) statement that will                                  *
* populate the @columns temp table with the columns that will be used in the merge dynamic sql *
* The @columns table will contain columns that exist in both the source and target             *
* tables that have the same data types.                                                        *
************************************************************************************************/    
 
SET @columns_sql =
'SELECT
tgt.table_catalog,
tgt.table_schema,
tgt.table_name,
tgt.column_name,
tgt.data_type,
tgt.character_maximum_length,
tgt.numeric_precision,
(src.table_catalog+''.''+src.table_schema+''.''+src.table_name+''.''+src.column_name) AS src_column_path,
(tgt.table_catalog+''.''+tgt.table_schema+''.''+tgt.table_name+''.''+tgt.column_name) AS tgt_column_path
FROM
     ' + @TgtDB + '.information_schema.columns tgt
     INNER JOIN ' + @SrcDB + '.information_schema.columns src
       ON tgt.column_name = src.column_name
       AND tgt.data_type = src.data_type
       AND (tgt.character_maximum_length IS NULL OR tgt.character_maximum_length >= src.character_maximum_length)
       AND (tgt.numeric_precision IS NULL OR tgt.numeric_precision >= src.numeric_precision)
     WHERE tgt.table_catalog     = ''' + @TgtDB + '''
     AND tgt.table_schema        = ''' + @TgtSchema + '''
     AND tgt.table_name          = ''' + @TgtTable + '''
     AND src.table_catalog       = ''' + @SrcDB + '''
     AND src.table_schema        = ''' + @SrcSchema + '''
     AND src.table_name          = ''' + @SrcTable + '''
     ORDER BY tgt.ordinal_position'
 
     --execute the @columns_sql dynamic sql and populate @columns table with the data
     INSERT INTO @COLUMNS
     EXEC sp_executesql @columns_sql

Now that the @columns temp table has been populated we have a list of all the columns that will be referenced in the overall MERGE statement.

Step 4:  Has logic to NOT update the primary key column(s) on the target table

While investigating the efficiency gains of using the MERGE statement we found that we were not getting the gains we expected.  We knew we were updating all columns within the MERGE statement whenever the predicate match was satisfied.  The problem is that when you update a row's primary key (even if you are updating it to its original value), the update is essentially handled as a delete and re-insert, and you lose a great deal of efficiency.  Thanks to Paul White for his blog post on this topic, "The Impact of Non-Updating Updates".

To prevent the primary key from being updated within the MERGE update we created a temp table to record the primary key column(s) from the target table.  This temp table is later referenced to filter out the primary key column(s) when the update portion of the MERGE statement (@updt) is generated in the next step.

/**************************************************************************************
* Create the temporary table to collect all the primary key columns                  *
* These primary key columns will be filtered out of the update portion of the merge  *
* We do not want to update any portion of the clustered index for performance        *
**************************************************************************************/
 
DECLARE @pk TABLE (
  column_name              VARCHAR(100) NULL
);
 
SET @pk_sql = 'SELECT ' +
              'ccu.column_name ' +
              'FROM ' +
              @TgtDB + '.INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc_tgt ' +
              'INNER JOIN ' + @TgtDB +'.INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu ' +
              'ON tc_tgt.CONSTRAINT_NAME = ccu.Constraint_name ' +
              'AND tc_tgt.table_schema = ccu.table_schema ' +
              'AND tc_tgt.table_name = ccu.table_name ' +
              'WHERE ' +
              'tc_tgt.CONSTRAINT_TYPE = ''Primary Key'' ' +
              'and tc_tgt.table_catalog = ''' + @TgtDB + ''' ' +
              'and tc_tgt.table_name = ''' + @TgtTable + ''' ' +
              'and tc_tgt.table_schema = ''' + @TgtSchema + ''' ' 
 
INSERT INTO @pk
EXEC sp_executesql @pk_sql

Step 5:  Generate a dynamic SQL MERGE statement based on the matched columns stored in the temp table

For the overall MERGE statement there are 3 sets of comma-separated columns that need to be generated from the data collected in the @columns temp table: the update columns, the insert columns, and the insert-values columns. In our example MERGE statement these are the column lists in the UPDATE SET clause, the INSERT column list, and the INSERT VALUES list.

For each of the 3 sets of columns a query needs to be executed against the @columns temp table. We needed a way to query the columns in the @columns temp table and loop through them, creating a comma-separated list. Since such a query returns more than one row, trying to assign its output to a variable produces the error "Subquery returned more than 1 value."

To get around this we used SQL Server's FOR XML functionality. For anyone not familiar with FOR XML: simply put, it allows you to run a query and format the output as XML, which can then be assigned to a variable. We essentially ran the same query but used FOR XML PATH(''), which let us loop through the data results by generating an XML version of the data with absolutely no XML formatting. Since FOR XML returns a value of type XML, we then cast the results to convert them from XML to NVARCHAR.

What you are left with is a string that contains all the comma-separated columns that can be used for the MERGE statement. Obviously, this is not the intended use of FOR XML, but this suited our needs nicely.
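Here is a minimal, self-contained sketch of the technique (the table variable and its values are made up for illustration; they are not part of the stored procedure):

-- Assigning a multi-row query result to a variable directly raises
-- "Subquery returned more than 1 value", but FOR XML PATH('') collapses
-- the rows into one string, which CAST converts from XML to NVARCHAR.
DECLARE @demo TABLE (column_name VARCHAR(100));
INSERT INTO @demo VALUES ('name'), ('financial_nbr'), ('medical_number');

DECLARE @list NVARCHAR(MAX);
SET @list = CAST((SELECT ',' + column_name
                  FROM @demo
                  FOR XML PATH(''))
                 AS NVARCHAR(MAX));

SELECT @list;  -- returns ',name,financial_nbr,medical_number'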

Here is the section of the stored procedure used to generate the comma-separated columns used for the update portion of the overall MERGE statement.  You can see below that the @pk temp table we created in the previous step is now referenced to filter out primary key column(s) from the update.

--1) List of columns used for Update Statement
--Populate @updt with the list of columns that will be used to construct the Update Statement portion of the Merge
 
SET @updt = CAST((SELECT ',tgt.[' + column_name + '] = src.[' + column_name + ']'
            FROM @COLUMNS c
            WHERE c.column_name != 'meta_orignl_load_dts'                               --we do not want the original time the row was created to be overwritten
            AND NOT EXISTS (SELECT 'x' FROM @pk p WHERE p.column_name = c.column_name)  --we do not want the primary key columns updated for performance
            FOR XML PATH(''))
            AS NVARCHAR(MAX)
            )

**Special Note**   For the columns used in the update portion of the MERGE statement, we filtered out the meta_orignl_load_dts column. The EDW uses this column as metadata to record when a row was originally written out to the data mart. Upon an update, we do not want this column to be overwritten with the meta_orignl_load_dts from the staging/source table; we want to preserve the original load date/time of that row. The date and time the row was updated is recorded in the meta column meta_update_dts.

Next, the comma-separated list of columns used for the insert portion of the overall MERGE statement needs to be generated. Here is the section of the stored procedure used to generate those columns.

--2) List of columns used for Insert Statement
--Populate @insert with the list of columns that will be used to construct the Insert Statement portion of the Merge
 
SET @INSERT = CAST((SELECT ',' + '[' + column_name + ']'
              FROM @COLUMNS
              FOR XML PATH(''))
              AS NVARCHAR(MAX)
              )

Here is the output of the @insert variable for our example.

,name
,financial_nbr
,medical_number
,registration_date
,meta_orignl_load_dts
,meta_update_dts

Here is the section of the stored procedure used to create the comma-separated list of columns used for the insert-values portion of the overall MERGE statement.

--3) List of columns used for Insert-Values Statement
--Populate @vals with the list of columns that will be used to construct the Insert-Values Statement portion of the Merge
 
SET @vals = CAST((SELECT ',src.' + '[' + column_name + ']'
            FROM @COLUMNS
            FOR XML PATH(''))
            AS NVARCHAR(MAX)
            )

Here is the output of the @vals variable for our example.

,src.name
,src.financial_nbr
,src.medical_number
,src.registration_date
,src.meta_orignl_load_dts
,src.meta_update_dts

Step 6: Execute the dynamic SQL MERGE statement
Now that we have the Source database/schema/table, the Target database/schema/table, the predicate matching statement, and the update/insert/values comma-separated lists of columns, we have everything we need to generate the entire dynamic SQL MERGE statement. Once the MERGE statement is generated it then gets executed automatically.

Here is the section of the stored procedure used to pull everything together into the overall MERGE statement (@merge_sql) and then execute it.

/*************************************************************************************
*  Generate the final Merge statement using the following...                        *
*    -The parameters (@TgtDB, @TgtSchema, @TgtTable, @SrcDB, @SrcSchema, @SrcTable) *
*    -The predicate matching statement (@pred)                                      *
*    -The update column list (@updt)                                                *
*    -The insert column list (@insert)                                              *
*    -The insert-value column list (@vals)                                          *
*    -Filter out Primary Key from the update (updating primary key essentially      *
*     turns the update into an insert and you lose all efficiency benefits)         *
*************************************************************************************/
 
SET @merge_sql = (' MERGE into ' + @TgtDB + '.' + @TgtSchema + '.' + @TgtTable + ' tgt ' +
                  ' using ' + @SrcDB + '.' + @SrcSchema + '.' + @SrcTable + ' src ' +
                  ' on ' + @pred +
                  ' when matched then update ' +
                  ' set ' + SUBSTRING(@updt, 2, LEN(@updt)) +
                  ' when not matched then insert (' + SUBSTRING(@INSERT, 2, LEN(@INSERT)) + ')' +
                  ' values ( ' + SUBSTRING(@vals, 2, LEN(@vals)) + ');'
                 );
 
--Either execute the final Merge statement to merge the staging table into production
--Or kick out the actual merge statement text if debug is turned on (@debug=1)
IF @debug = 1
  BEGIN
  -- If debug is turned on simply select the text of merge statement and return that
  SELECT @merge_sql;
  END
ELSE
  BEGIN
  -- If debug is not turned on then execute the merge statement
  EXEC sp_executesql @merge_sql;
END

The SUBSTRING() function is used on @updt, @INSERT, and @vals to remove the leading comma from each set of columns. This could have been done earlier in the stored procedure but we decided to take care of it here.
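For example, using the @INSERT output shown earlier (truncated here for brevity; this is a standalone illustration, not part of the stored procedure):

DECLARE @INSERT NVARCHAR(MAX) = ',name,financial_nbr,medical_number';
SELECT SUBSTRING(@INSERT, 2, LEN(@INSERT));  -- name,financial_nbr,medical_number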

Here is the output of the @merge_sql variable from our example.

MERGE INTO edw.adventure_hospital_dm.visits tgt
USING edw.staging.stg_visits src
ON src.financial_nbr = tgt.financial_nbr
WHEN matched THEN UPDATE
SET tgt.name = src.name,
tgt.medical_number = src.medical_number,
tgt.registration_date = src.registration_date,
tgt.meta_orignl_load_dts = src.meta_orignl_load_dts,
tgt.meta_update_dts = src.meta_update_dts
WHEN NOT matched THEN INSERT (
name,
financial_nbr,
medical_number,
registration_date,
meta_orignl_load_dts,
meta_update_dts
)
VALUES (
src.name,
src.financial_nbr,
src.medical_number,
src.registration_date,
src.meta_orignl_load_dts,
src.meta_update_dts);

Conclusion

We are currently working towards utilizing this stored procedure within our SSIS packages for populating data marts. Using our current process, a simplified version of one of our SSIS packages would stage the data in a staging table, delete the overlapping rows from the production table, and then insert the remaining staging rows into production.

By using the stored procedure to merge the data from staging into the production table we can replace the last two tasks with a single Execute SQL Task that simply executes the stored procedure.

Here is the SQL code that is executed by the Execute SQL task.
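For our example tables, the task calls the generate_merge procedure and passes in the Source and Target identifiers:

EXEC adventure_hospital.generate_merge
     @SrcDB     = 'edw',
     @SrcSchema = 'staging',
     @SrcTable  = 'stg_visits',
     @TgtDB     = 'edw',
     @TgtSchema = 'adventure_hospital_dm',
     @TgtTable  = 'visits';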

Although this stored procedure was a bit tedious to build, it was worth it. We now have a stored procedure to run from SSIS that suits our data warehousing needs for merging staging data into production.

There are many benefits to merging data via a stored procedure.

  • Easily executed through an SSIS SQL Task
  • Easily reusable
  • Automatically builds and executes the merge statement so there is no need for manual creation or maintenance of the merge SQL statement.
  • Reduces fragmentation since fewer deletes are performed on the production table
  • Since rows are updated instead of deleted and reinserted, the ETL package runs faster
  • Overcomes the SSIS SQL Task 32,767 character limit
  • Preserves original row load date and time (since the row is updated instead of deleted and re-inserted)

Here is a link to the entire generate_merge stored procedure SQL code: generate_merge.

In Practice
The largest table the EDW imports from the primary inpatient Electronic Health Record (EHR) source system stores clinical results and contains about 2.4 billion rows. Each night we merge between 600,000 and 900,000 rows from that source system table into the corresponding table within our EDW. With our old process of deleting rows from the production table that exist in the staging table and then inserting the new rows, the whole load took around 2 hours to complete each night. Once we implemented the merge stored procedure into the ETL package that time was reduced to about 6 minutes! The bulk of the processing time in the old process was due to the deletes performed on a table that contained over 2 billion rows.

Limitations
This stored procedure is meant for data warehousing type functionality.  It assumes you are using a Staging-To-Production type of model where the staging table and production table are almost identically built with the same column names and very similar data types. Also, this stored procedure does not utilize the delete functionality of the MERGE statement.

Resources
Dhaneenja, T. (2008). Understanding the MERGE DML Statement in SQL Server 2008. SQLServerCentral.com. Retrieved from http://www.sqlservercentral.com/articles/SQL+Server+2008/64365/

