Administering SQL Server Integration Services – Planning, Documenting and Troubleshooting

SQL Server Integration Services is an essential component of SQL Server, designed to perform a variety of routine and on-demand tasks. It specializes in ETL work and in administrative jobs across servers. The DBA also needs to be aware of their role in optimising SSIS: planning, troubleshooting, tuning performance, and documenting installations.

SSIS (SQL Server Integration Services) has been an intrinsic part of the SQL Server package since the 2005 release. It is not ‘instance aware’, in that one SSIS service serves all instances of SQL Server on the machine, but each instance can be used as a separate repository for SSIS packages. We will talk more about this in the Installation section later on.

SSIS replaced DTS (Data Transformation Services) which was the main ETL tool in previous SQL Server editions. On a personal note, I still have flashbacks from the time when I used to do the back-breaking work of developing DTS packages and writing long VB scripts to clean data.

Without going into a detailed comparison of DTS and SSIS, SSIS is a far superior product to DTS for, amongst others, the following reasons:

  • .NET integration
  • Better transactional management
  • Improved performance and better hardware handling
  • Improved security
  • Full-feature integrated development environment (BIDS)
  • Allowing the developer to compose or develop customized tasks in addition to all the available predefined ones. For examples, see SSIS Community Tasks and Components on CodePlex.

The full list would be very long.

Application and architecture:

SSIS is a vast product and has many relevant applications, both routine and on-demand.

When I say ‘routine’, I mean processes which are developed, established and executed regularly on a certain schedule. For example, any SSIS package created with BIDS can be part of a daily extraction, transformation and loading (ETL) process. Note that SQL Server Maintenance Plans also use parts of SSIS.

Note: It is fairly easy to turn a SSIS package into a routine; all that is needed is to save the package in a repository and set up an Agent job which configures, authenticates (with or without a proxy) and executes the package via the DTExec.exe process.
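As an illustrative sketch, such an Agent job step ultimately boils down to a command line like this (the file paths here are hypothetical):

```shell
:: Run a package stored on the file system, loading its configuration
:: from a .dtsConfig file and reporting errors (paths are examples only)
DTExec.exe /FILE "C:\SSIS\Packages\NightlyETL.dtsx" ^
    /CONFIGFILE "C:\SSIS\Config\NightlyETL.dtsConfig" ^
    /REPORTING E
```

When the package is stored in the msdb repository instead, the /SQL option is used in place of /FILE.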

When I say ‘on-demand’, I mean tasks such as those performed with the integrated Import-Export wizard in Management Studio, which is also part of SSIS.

Note: Even a package built with the Import-Export wizard can be saved permanently in a repository and scheduled later on as a routine.

Whether used as part of a daily routine or on-demand, SSIS can be used to meet many data transformation requirements.

For example, some of the most common uses of SSIS are:

  • Cleaning and standardizing data – regardless of the data source, the data is rarely in a 100% pure state, and it needs to be looked up, converted and trimmed before it can be loaded to its destination. SSIS provides many built-in components for doing this: Lookup transformations, Fuzzy Lookups, Data Conversion transformations etc.
  • Performing administrative tasks over multiple servers – SSIS provides conditional flow logic and loop containers which help enumerate a number of servers and perform the same routine against each one; for example, examine index fragmentation and defragment only the indexes over a certain threshold, or perform a statistics update.
  • Data warehouse bulk loading and data pre-aggregation – SSIS gives great power to the developer, especially the chance to create tasks which manage locking, check constraints, set network packet sizes, determine rows per batch, manage concurrency and so on.
  • Merging data from multiple data sources – let’s face it, all you need is a proper driver to get the data into SSIS from any source, and from there on the data can be managed with ease.
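For example, the per-server logic inside such a loop container might boil down to a query along these lines (the 10% and 30% thresholds are arbitrary examples):

```sql
-- Find fragmented indexes in the current database and generate the
-- appropriate maintenance command for each (thresholds are examples)
SELECT OBJECT_NAME(ips.[object_id])        AS table_name,
       i.[name]                            AS index_name,
       ips.avg_fragmentation_in_percent,
       CASE WHEN ips.avg_fragmentation_in_percent > 30
            THEN 'ALTER INDEX ' + QUOTENAME(i.[name]) + ' ON '
                 + QUOTENAME(OBJECT_NAME(ips.[object_id])) + ' REBUILD;'
            ELSE 'ALTER INDEX ' + QUOTENAME(i.[name]) + ' ON '
                 + QUOTENAME(OBJECT_NAME(ips.[object_id])) + ' REORGANIZE;'
       END                                 AS maintenance_command
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.[object_id] = ips.[object_id]
 AND i.index_id    = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
  AND i.[name] IS NOT NULL;
```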

From an architectural perspective, SSIS is a complicated product, due to the vast range of its functionality and due to its extensibility (custom workflow elements can be designed for use in the data flow process; just look at the list of SSIS Tasks and Components on CodePlex: http://ssisctc.codeplex.com/).

The building block of the SSIS workload is the Package, which is a collection of predefined connections, data flow operations, precedence constraints, variables and conditional logic. The Package is composed (usually in BIDS) and is rendered into an XML file with the dtsx extension. The dtsx files can be saved on the file system, or in a repository (I will cover repositories later on).

Installation:

As mentioned, SSIS ships with SQL Server and it is fairly easy to install as part of the SQL Server instance installation.

SSIS consists of several components:

  • a service, which is added to the OS services and needs to be started and secured under a proper account with sufficient permissions
  • an executable, DTExec.exe, which is used to execute the SSIS packages
  • repository options: while the SSIS service is not instance aware and there is only a single SSIS service per server (regardless of how many instances there are), there are multiple options for storing the packages: either on the file system, or in a repository in the msdb database of each instance (there are several options to save and secure SSIS packages, which will be covered later in the article)

Here are two installation caveats which are often misunderstood or ignored: SSIS is not a cluster-aware service, and a single SSIS service supports all instances of SQL Server on the machine.

Because the SSIS service is not cluster-aware, it cannot be set up as part of the clustered resources. In other words, if a cluster failover occurs during the execution of a SSIS package, then the data connections are lost, the cache is lost (the lookup caches, for example) and the process is generally interrupted.

What helps in this case is the ability of a SSIS package to restart at a predetermined checkpoint. In the case of a failover, if the package is developed properly, it will know which task was the last to complete, and it will pick up from there the next time it is run.
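As a hedged sketch, the restart looks like this from the command line (the paths are hypothetical, and the package itself must have SaveCheckpoints, CheckpointFileName and CheckpointUsage configured at design time):

```shell
:: Re-run the failed package; because checkpointing is on and the
:: checkpoint file exists, already-completed tasks are skipped
DTExec.exe /FILE "C:\SSIS\Packages\NightlyETL.dtsx" ^
    /CheckPointing on ^
    /CheckFile "C:\SSIS\Checkpoints\NightlyETL.chk"
```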

The SSIS service is shared between all instances on the server: when SSIS is installed, it takes the name of the instance it is installed with, and it uses the repository of that instance.

If there is no default instance, or if you would like to use the repository of a different instance, then you need to make some changes to the SSIS config file (since there is no UI to help the administrator configure access to each instance’s repository).

The SSIS config file is located in the DTS\Binn folder under the SQL Server installation folder and it is called MsDtsSrvr.ini.xml. The steps are easy:

  • Locate and make a copy of the file to a different location (for example, to your Desktop)
  • Open it with a text editor (notepad, Notepad++ or similar)
  • Find the <Folder> tag with the ‘SqlServerFolder’ value and create an additional one for each instance, specifying the <Name> and <ServerName> accordingly
  • Copy the new config file over the old one (after backing up the original)
  • Restart the SSIS service
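As a sketch, the edited file might end up looking something like this (the folder names and the named instance SQL2008R2 are hypothetical):

```xml
<?xml version="1.0" encoding="utf-8"?>
<DtsServiceConfiguration xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <StopExecutingPackagesOnShutdown>true</StopExecutingPackagesOnShutdown>
  <TopLevelFolders>
    <!-- Repository of the default instance -->
    <Folder xsi:type="SqlServerFolder">
      <Name>MSDB (default)</Name>
      <ServerName>.</ServerName>
    </Folder>
    <!-- Repository of a named instance -->
    <Folder xsi:type="SqlServerFolder">
      <Name>MSDB (SQL2008R2)</Name>
      <ServerName>.\SQL2008R2</ServerName>
    </Folder>
    <Folder xsi:type="FileSystemFolder">
      <Name>File System</Name>
      <StorePath>..\Packages</StorePath>
    </Folder>
  </TopLevelFolders>
</DtsServiceConfiguration>
```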

After you do this, you will be able to connect to SSIS from Management Studio by using the name of the server, and under the Stored Packages node you will see a repository for each instance on the server.

Performance:

A few words on SSIS performance: keep in mind that SSIS packages are executed by the DTExec.exe process, within the security context of the executing account.

If you need the SSIS package to run in a specific execution context every time, then you need to set up a credential as part of the SQL Server security, and then use this credential to authenticate a proxy account under the SQL Server Agent context.
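The setup can be sketched in T-SQL as follows (the credential name, proxy name, Windows account and password are all hypothetical):

```sql
-- 1. Create a credential mapped to the Windows account under which
--    the packages should run (names and password are examples)
CREATE CREDENTIAL SSISRunAs
    WITH IDENTITY = N'DOMAIN\ssis_exec',
         SECRET   = N'StrongPasswordHere';
GO

-- 2. Create an Agent proxy which authenticates with that credential,
--    and grant it access to the SSIS subsystem
USE msdb;
GO
EXEC dbo.sp_add_proxy
     @proxy_name      = N'SSIS_Proxy',
     @credential_name = N'SSISRunAs',
     @enabled         = 1;

EXEC dbo.sp_grant_proxy_to_subsystem
     @proxy_name     = N'SSIS_Proxy',
     @subsystem_name = N'SSIS';
```

Once the proxy exists, the Agent job step which runs the package can be set to ‘Run as’ SSIS_Proxy instead of the Agent service account.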

Once again, keep in mind that the SSIS packages are not executed by the SSIS Service, but by the DTExec.exe process. The SSIS Service is used to control the ability to execute packages on the system (depending on the configuration settings in the MsDtsSrvr.ini.xml file).

When a SSIS package is executed, all resources needed for the execution are allocated by the DTExec.exe process, and during the execution the resource allocation may happen dynamically (i.e. as more resources are needed, they may be allocated). Furthermore, for each SSIS package being executed, there is a new instantiation of the DTExec.exe process in the OS.

What this means is that you have to be very careful and plan for the performance of your SSIS packages. The transformation caching options, the CPU usage and so on will be reflected in the DTExec process, and you may see the resource allocation in real time as the package executes.

There are some catches, though:

  • there is no way to limit the memory (especially if a wrong caching option is used for large datasets being processed); i.e. there is no way to guarantee that DTExec is not going to take over the entire memory of the server
  • there is no way to guarantee that DTExec will not be pushed out of memory if another process takes all the memory
  • there is no way to prevent the disk system from being overloaded
  • the network settings play a very important role: check the TDS packet sizes and the network packet settings

Troubleshooting:

Troubleshooting SSIS package performance may not always be easy, but here are a few suggestions:

Monitoring performance

If you would like to monitor the performance of a SSIS package which is currently executing, you can do so by using the Perfmon counters in the Process class, filtered by the DTExec process. There are CPU and IO counters.

The problem here is that for each package which is running, there is yet another DTExec process started on the host system, and there is no way to tell which process is related to which package (unless you are familiar with the execution patterns of your packages and can somehow guess the process).

So if you have one package running or if you are just testing the package on a development machine, then you can look at the Perfmon counters for the Process class for the DTExec process.

Note: In Perfmon there are also two other object classes which give information about SSIS execution statistics: SQLServer:SSIS Pipeline and SQLServer:SSIS Service. The first gives information about the execution of the pipeline (buffers and BLOBs, as well as a cumulative count of the data rows processed), and the second gives information about how many packages are currently running.

Monitoring errors

SSIS writes errors, warnings and information to the Application log of the OS (which can be accessed through the Event Viewer). Review this log often to monitor and prevent serious errors in your data processing.

Performance Tuning

Aside from the scarce information in Perfmon and the errors and warnings in the Application log, SSIS is not very transparent when it comes to performance tuning. In order to tune the performance of your SSIS packages, I would recommend visiting the following areas and fine-tuning each one of them:

  • The network context of the data: network throughput, network load, packet size (yet again, keep in mind that the data flow is only as fast as the slowest component involved in it!)
  • The first-come-first-processed concept: keep in mind that SSIS does not process any data until the first set of data rows comes into the pipeline. In other words, if you have a data flow which consists of DataSource, DataConversion and DataDestination, then the DataConversion will not start working until the DataSource task has fed the DataConversion task with the first set of rows. So, if the DataSource task is performing some heavy processing of the data before feeding the pipeline with the first set of rows, the DataConversion will spend all this time waiting and the entire execution duration of the package may be prolonged.

    In some cases, the query hint OPTION (FAST N) might get you the first N rows much faster than the query without the hint, and thus the DataConversion task might start working on the data much earlier, which will most likely reduce the execution duration of your SSIS package.

    This is not an ultimate solution, though. Test before releasing to production.

  • Tune the individual queries in your data sources: as mentioned earlier, the SSIS cannot start processing data before the data has reached the pipeline; make sure your data sources return data fast and in the most efficient state. Try to avoid sorting, cleaning (string manipulations) and duplicates removal in the SSIS pipeline, unless you have a real proof that it is more efficient than performing it on the data source engine’s side.
  • Consider the indexing overhead: when loading data, consider the index update overhead. Depending on the storage system, the data volume, and the consistency of the loaded data, the underlying non-clustered indexes will drag performance down to a greater or lesser extent, because they need to be updated with the new records and eventually they will need to be defragmented. Test, and consider removing / disabling / rebuilding and managing your indexes in the most efficient way.
  • Consider the fragmentation: in the context of the mechanical spindle drives, this is a real problem for both data sources and destinations. As mentioned above – test, test and test your data load and defragmentation options.
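The OPTION (FAST N) hint mentioned above looks like this in practice (the table and column names are made up for illustration):

```sql
-- Ask the optimizer to favour a plan which returns the first 100 rows
-- as quickly as possible, so the downstream SSIS transformations can
-- start working on data sooner
SELECT s.OrderID,
       s.CustomerID,
       s.OrderAmount
FROM dbo.SalesStaging AS s
OPTION (FAST 100);
```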

Documentation:

Documenting the SSIS environment is a task that is difficult to generalize, mostly because of the wide variety of topologies and solutions.

What I would recommend, though, is that you back up your SSIS service config file (MsDtsSrvr.ini.xml) and also document the connection configurations for each SSIS package on the server.

I would like to end this article with a few scripts which can help you extract the XML definition of SSIS packages from the msdb repository.

The first script shows the Package Name, description, creator and created date, as well as the package type and XML definition.
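A minimal version of such a script might look like this; it assumes the SQL Server 2008 catalog table msdb.dbo.sysssispackages (on SQL Server 2005 the equivalent table is sysdtspackages90):

```sql
-- List the packages stored in the msdb repository, together with their
-- XML definition (the image column is cast via varbinary(max) to xml)
SELECT p.[name],
       p.[description],
       SUSER_SNAME(p.ownersid)                            AS creator,
       p.createdate,
       p.packagetype,
       CAST(CAST(p.packagedata AS varbinary(max)) AS xml) AS package_xml
FROM msdb.dbo.sysssispackages AS p
ORDER BY p.[name];
```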

By developing this script further, we should be able to parse the XML and automate the documentation of the SSIS packages to some extent. We can get a variety of information including:

  • data source strings
  • package configurations
  • list of package variables

In another article I will go into detail on SSIS scripting and automation, since there is so much more to be explored.

In this article I have covered some of the benefits and caveats in SSIS planning and performance troubleshooting. I have also provided a script to extract the SSIS package definitions from the msdb repository, which can be used as a backup / documentation option.

In conclusion, I would say that SSIS is one of the components of SQL Server with the most impressive growth and development in recent years. SSIS is a robust platform which allows users to build custom-made additions on top of it.

There are still plenty of caveats and unanswered questions when it comes to performance information gathering and ease-of-troubleshooting. However, the product is robust and provides endless ETL possibilities.