SQL Server 2008- SSIS Data Profiling task

By using SQL Server 2008's new Data Profiling task, you can do a lot to ensure that data being imported via SSIS is valid, and you can develop a system that can take the necessary steps to correct the commonest problems. Bob Sheldon shows you how to get started with it

One of the new features in SQL Server 2008 is the Data Profiling task, a control flow component in SQL Server Integration Services (SSIS). The task lets you analyze data in a SQL Server database and, from the results of that analysis, generate XML reports that can be saved to a file or an SSIS variable. By configuring one or more of the task’s profile types, you can generate a report that provides details such as a column’s minimum and maximum values, or the number and percentage of null values.

Despite the variety of statistics that the Data Profiling task can provide, the task’s practical applications might appear limited when you first try to implement it. After all, what use are data statistics to the automated processes in an SSIS package? However, by accessing the results generated by the Data Profiling task, you can design a workflow that automatically determines the appropriate actions to take based on the validity of the source data.

In this article, I describe a sample SSIS package that includes the Data Profiling task. The task is configured to generate a report based on the Value Inclusion profile type, which provides details that let you determine whether one or more columns in the source table are suitable as a foreign key. Although this type of profile might not seem particularly useful to an SSIS package, it can actually be quite valuable when loading data because you can use it to compare source data with valid data in your target database to determine the validity of the source data.

The article assumes that you have a working knowledge of SQL Server, Transact-SQL, and SSIS and full source code for the SSIS solution can be obtained here.

An Overview of the Data Profiling Task

For our example data profiling task, let’s suppose we’re extracting product descriptions (such as colors, sizes, and other details) from a text file and want to ensure that the product IDs in that file match the product IDs in the target database before loading the description data. We can stage the data in a SQL Server table and then use the Data Profiling task to generate a Value Inclusion report, which will tell us whether the source IDs are valid. After we determine the data’s validity, we can then take the appropriate actions within our SSIS package. Figure 1 shows what such a solution might look like, as it appears in the Control Flow tab of SSIS Designer (in Business Intelligence Development Studio).


Figure 1: The Control Flow tab in SSIS Designer

In this solution, we first stage the data because the Data Profiling task requires that the data be in a SQL Server database in order to perform the analysis. Next, we generate the Value Inclusion report (saved to a variable), and then use a Script task to extract the analysis results from that variable. If one or more IDs are invalid, we generate the report as an XML document and email it to an administrator. If all the IDs are valid, we load the data into the target database.

Let’s take a closer look at each of these steps to better understand how they work; As you work through each one, refer back to Figure 1 as necessary.  For this example, I created an SSIS project and solution named DataProfiler in Business Intelligence Development Studio. I then opened the default package (Package.dtsx) in SSIS Designer and configured the control flow shown in Figure 1.

Staging the Data

As described above, the Data Profiling task requires that the data be in a SQL Server database. So, the first step is to configure an Execute SQL task to create a staging table. For this solution, I’m using data from the AdventureWorks2008 sample database, so that’s where I’m creating the staging table (you can download the database from http://codeplex.com/SqlServerSamples)

The following Transact-SQL drops the staging table (Production.Descriptions), if it exists, and then re-creates it:

Next, we need to add a Data Flow task and connect the precedence constraint from the Execute SQL task to the Data Flow task. (A precedence constraint is the arrowed line that connects two tasks to define the workflow from one task to the next.) After we connect the precedence constraint, we can configure the Data Flow task to extract data from a tab-delimited text file, and insert it into the Descriptions table.

Note: I created the text file by retrieving data from the Production.vProductAndDescription view in the AdventureWorks2008 database. I then changed the product ID on three rows to ensure that the source file included mismatched data.

Of course, you might take a different approach to staging the data. For instance, you might truncate an existing table or implement the staging area in a different database. As long as you stage the data in a SQL Server database, how you do that is up to you.

Generating the Inclusion Profile

The next step is to add the Data Profiling task and connect the precedence constraint from the Data flow task to the Data Profiling task. When you set up a Data Profiling task, you can configure it to save data to an XML file or to an SSIS string variable. In this case, I want to save the report information to a variable so I can later use that variable in the Script task.

Ultimately, the package will require two string variables-one to hold the report, and a second one to hold the analysis results retrieved through the Script task (this process is explained later in the article). So before configuring the Data Profiling task, we need to create these two string variables, both of which will be configured at the package scope:

  • InclusionRpt, to hold the Value Inclusion report
  • Results, to hold the analysis results retrieved through the Script task.

When configuring the Data Profiling task, select Variable as the destination type and then specify the InclusionRpt variable as the destination, as shown in Figure 2:


Figure 2: Creating the InclusionRpt Variable

Figure2 shows the General page of the Data Profiling Task editor. Notice that the OverwriteDestination property is grayed out when the destination type is a variable.

After specifying our variable, we can now start configuring the Value Inclusion profile. Click to the Profile Requests page, and select Value Inclusion Profile Request in the Profile Type drop-down list (in the first cell of the grid). You can then configure the data profile in the bottom section of the Profile Requests page, as shown in the Figure 3:


Figure 3: Configuring the Value Inclusion Profile

In this example, I select the ADO.NET connection manager that connects to the AdventureWorks2008 database (note that you must use an ADO.NET connection manager when configuring the profiles). For the SubsetTableOrView property, I select our staging table (Production.Descriptions). The table must exist before you can configure the Value Inclusion profile and the easiest way to do this is to run the Execute SQL task individually (rather than execute the entire package).

For the SupersetTableOrView property, I select a table that contains the correct product IDs, which for this example is Production.Product. Finally, for the InclusionColumns property, I select the ProductID column from both tables so that the comparisons can be made. SSIS will compare the ProductID values in the Descriptions table with the ProductID values in the Product table.

The next step I take is to set the InclusionThresholdSetting and SupersetColumnsKeyThresholdSetting properties to None, in order to ensure that necessary information is returned whether or not there are mismatched rows. Specifically, the option returns the IsExactInclusion element in the results whether or not there are mismatched rows. If there are mismatched rows, the element value is false. If there are no mismatched rows, the element value is true. The other two options-Exact and Specified-return the IsExactInclusion element only when specified conditions are met, so in this case, these options are not as useful as None. It is also why the InclusionStrengthThreshold and the SupersetColumnsKeyThreshold properties are grayed out; the thresholds don’t apply when None is selected. In general, the None option helps to make the package configuration simpler, at least for this solution. For more details about the specific options, see the “Value Inclusion Profile Request Options (Data Profiling Task)” topic in SQL Server Books Online.

One other property that is important to note is MaxNumberOfViolations. Because I’m using this report only to determine whether there are violations, and not which product IDs are mismatched, I do not need to retrieve these product IDs themselves, at least not at this time. Later in the control flow, we will add another Data Profiling task that will capture the mismatched IDs, but for now, all we need to know is whether there are any mismatches, so I set the option to 0.

That’s all there is to configuring the Value Inclusion profile type and the Data Profiling task. When you execute the task, it generates an XML report that includes a DataProfileOutput element and its subelements, as shown in the following XML:

This section of the XML document includes details about property settings, column information, and the information about value mismatches. The element that we’re specifically concerned with is IsExactInclusion, which will contain a value of either true or false. In this case, it is false, which means that at least one product ID in the source data is invalid. Only if all values in the source column match values in the lookup column does the IsExactInclusion element return a value of true. As a result, you can use the IsExactInclusion value to control your package’s workflow, so let’s take a look at how that’s done.

Retrieving the Profile Results

After you set up the Data Profiling task, you’re ready to set up the Script task retrieve the true or false value from the IsExactInclusion element in the report results. Add the task to your control flow, connect the precedence constraint from the Data Profiling task to the Script task, and open the Script task editor. In the editor, specify the necessary variables, as shown in Figure 4:


Figure 4: Specifying variables for the Script task

As you can see, I add InclusionRpt as a read-only variable so that the script has access to the Value Inclusion report. I then add Results as a read-write variable so I can save the analysis results to that variable.

One other interesting property to point out in the Script Task editor is ScriptLanguage. In SSIS 2005, you could write your scripts only in Visual Basic. However, in SSIS 2008, now that the Script task (and Script component) uses Visual Studio 2005 Tools for Applications (VSTA), rather than Visual Studio for Application (VSA), you can write your scripts in C# or Visual Basic.NET.

Now let’s take a look at the script itself (which you access by clicking the Edit Script button). When you add a Script task, SSIS automatically generates the files and most of the code necessary to get started. For this solution, I needed only to add code to the ScriptMain.cs file in VSTA. The file contains the ScriptMain class, which is the class used by default to interface with SSIS. Most of the changes we make to the file are within this class. Specifically, the changes we make extract the true or false value from the IsExactInclusion element in the report results. The final code in the ScriptMain.cs file is as follows:

Let’s take a closer look at this code. My first step is to add the following statement to the set of predefined using statements at the beginning of the script:

This simply allows me to more easily reference the System.Xml namespace within my script. That way, I don’t have to use the fully qualified class name within my code. The other using statements are included by default, so I can easily access any of those namespaces as well.

Next, I add several statements to the Main() method. In my first statement, I declare a C# variable (InclusionRpt) and assign it a value retrieved from the SSIS variable, InclusionRpt:

In the next two statements, I instantiate an XmlDocument object and assign it to the variable, XmlDoc. I then use the object’s LoadXml() method to load the InclusionRpt string value into XmlDoc:

Once the XML data is loaded into XmlDoc, I can use the GetElementsByTagName() method to retrieve the IsExactInclusion element and save it to the incl variable, which is an XmlNodeList object:

Finally, I access the element’s value by retrieving the InnerText property of the incl variable:

Notice that I specify [0] after incl. XmlNodeList is actually an object array that uses a zero-based index to reference the elements in that array. Because there is only one value in incl, I simply call that value by specifying [0]. After I retrieve that value, I assign it to the SSIS Results variable. The Results variable will always be assigned a value of either true or false. As a result, I can then use the variable later in the package to determine the workflow.

And that’s all there is to the script. All the other C# code is already included in ScriptMain.cs. The original code also includes placeholders-as comments-that instruct you where to add your additional code. It’s worth noting, however that, for the sake of brevity, I did not include any error handling in my script. However, in a production environment, I would include the code necessary to deal with any exceptions that might be thrown. That issue aside, we can now move on to the remaining workflow.

Determining the Workflow

After we’ve configured the Script task, we can use the value in the Results variable to determine what steps to take next. (As you’ll recall, the value for the Results variable is generated in the Script task.) For this example, I use the Results variable within the precedence constraints that define the workflows that follow the Script task. (Refer back to Figure 1.) There are two precedence constraints after the Script task: one to create a workflow that generates a full inclusion profile and sends it to an administrator and one to create a workflow that loads the data into the target database.

To set up the first workflow, I add a second Data Profiling task and name it Create full inclusion profile. I then connect a precedence constraint from the Script task to the new Data Profiling task. To set up the second workflow, I add a second Data Flow task and name it Load staged data. I then connect  a precedence constraint from the Script task to the new Data Flow task

The first step is to configure the precedence constraints to use expressions to determine the workflow. For example, for the precedence constraint that connects the Script task to the Data Profiling task, I double-click the precedence constraint to open the Precedence Constraint editor. Then, I set the Evaluation operation option to Expression, and then I add the expression @Results == “false”, as shown in Figure 5:


Figure 5: Configuring the precedence constraints.

The Results variable must equal false for the precedence constraint to evaluate to true and run the Data Profiling task. In other words, if the source data contains any mismatches, the precedence constraint will evaluate to true and the second Data Profiling task will run.

I then configure the precedence constraint that connects the Script task to the new Data Flow task with the expression @Results == “true”. In this case, for the precedence constraint to evaluate to true (and subsequently have the Data Flow task run), all the source data must be valid.

Generating a Full Inclusion Profile

 In the last step, we created two workflows, one to generate a full inclusion profile and the other to load the staged data. Now let’s take a closer look at the first workflow. As described above, if the value in Results variable is false (and the data is invalid), we will run the first workflow, which includes the second Data Profiling task (named Create full inclusion profile). The task is configured almost identically to the first Data Profiling task (named Create initial inclusion profile) except for two important differences. First, we save the results to a file, rather than a variable, so we can easily email the report to an administrator for further analysis. Second, we set the MaxNumberOfViolations to 100 (the default), rather than 0, so that the report includes a list of all the mismatched product IDs. This, of course, assumes that there will be fewer than 100 mismatched product IDs. It is up to you how many errors you want to specify. However, I would suggest that if your source data contains excessively high numbers of mismatched values, you should lobby for a different method for generating that source data!

 After configuring the new Data Profiling task, simply add a Send Mail task to send an email and the XML report (as an attachment) to the appropriate recipient. In this case, I also added a second  Execute SQL task (named Drop staging table), which I use to drop the staging table. Whether or not you include such a task depends on how you set up your staging environment, but usually some sort of clean-up will be necessary.

Note, however, that I also connect a precedence constraint from the second Data Flow task (named Load staged data) to the new Execute SQL task. When you have two different workflows converging on a single task, you have two options that determine the logic of how the precedence constraints will evaluate to true (you can view the two options at the bottom of the Precedence Constraint editor in Figure 5). The first option lets you configure the constraints so that all constraints must evaluate to true in order for the next task to run. This is the default configuration. As an alternative, you can configure the constraints so that only one constraint needs to evaluate to true for the next task to run. In my solution only one of the two workflows will run, so I select the second option, which means that only one of the two constraints connecting to the Execute SQL task must evaluate to true.

Loading the Staged Data

If the Results variable is true (and the data is valid), you can then run the second workflow, which includes a Data Flow task (named Load staged data) that loads the data from the staging table to the target database. The exact configuration of the data flow will depend on your solution, as is true for many of the elements in this package. More important, however, than the package’s specific elements is that you have an overall understanding of how you can use a Value Inclusion report and the Script task to determine that workflow.

You can also use the Data Profiling task to generate other types of reports (such as a Column Null Ratio report or a Column Pattern report) in order to validate your source data and then create the scripts necessary to retrieve the relevant analysis results. In the meantime, the solution I’ve demonstrated here should, I hope, provide you with a good foundation in understanding how to use the Data Profiling task to validate data. Be sure to check out SQL Server Books Online for more details about the task and the various profile types that it supports.