Oracle to SQL Server: Crossing the Great Divide, Part 1

When a SQL expert moves from Oracle to SQL Server, he can spot obvious strengths and weaknesses in the product that are too familiar to the SQL Server DBA to be apparent. Jonathan Lewis is one such expert: in this article he records his train of thought whilst investigating the mechanics of the SQL Server database engine. The result makes interesting reading.

After many years of trouble-shooting Oracle databases, making them run more efficiently and teaching other people how to do the same, I was suddenly asked to give a presentation discussing the suitability of SQL Server 2008 as an “Enterprise Database Solution”. It was an interesting challenge. I had no previous experience with the product, and only a short time to prepare. After downloading and installing the 180-day trial version of SQL Server Enterprise Edition, along with “Books Online”, I managed to find a total of 21 hours to investigate whether SQL Server 2008 could give me everything that I would want from a database management system.

I got out of the presentation alive, and the audience seemed happy with the result, even though I had to leave them with a few open questions. Furthermore, the experience of having to learn as much as I could about SQL Server, as fast as I could, was fascinating enough that I decided to write it up as a short series of articles. The result is essentially a “journey of SQL Server discovery”, but from the point of view of someone who understands the essential requirements of an enterprise RDBMS.

The article series will ultimately cover the key requirements of the Enterprise RDBMS which are (in no specific order of importance):

  • Availability – will the database be there for the people who need it?
  • Security – will the database be available ONLY to the people who are supposed to use it?
  • Correctness – how good is the database at ensuring that you see the right results?
  • Performance – will the results appear fast enough to serve their purpose?

First, however, I had to establish a few basics: how to create a database, a table, an index, and then get some test data into that table. More generally, I had to get a basic feel for the whole structure of a SQL Server RDBMS; so the first few topics I cover will be biased towards performance.

Setting Targets

My experience with Oracle meant that I knew the sort of things I was looking for in a database even if, in SQL Server, I didn’t know exactly where to find them, or what form they would take. I was able to list a set of key terms, pursue them through the manuals and, in most cases, find concepts and explanations very quickly. As I learned and experimented, there were numerous cases (noted in the text) where I flagged an issue as something I’d need to investigate in more detail, a little later in the journey.

The SQL Server technology is a little different from Oracle, of course, and some of the details are bound to vary for all sorts of reasons, but at a fundamental performance level there are only three aspects of data handling in either database management system that you have to address:

  1. Minimize the work done by each individual SQL statement
  2. Minimize the number of individual SQL statements you use to get a job done
  3. Minimize the degree of concurrent activity in the database and avoid contention issues

If you accept this simplistic, but fairly accurate, viewpoint then you will recognize the need for a few tools that let you see the impact of current database activity (points 2 and 3 above) and a few that let you see how an SQL statement is being executed, why it’s being executed that way, and how it could be executed more efficiently (point 1 above).

Initially I’ll concern myself only with tracking individual SQL statements, so I’ll be looking for methods to see how the data is physically located in the database, what structures exist to make data access more efficient, what statistics the optimizer uses to choose an execution plan, how I can see the execution plan and the work it does, and how I can override the optimizer’s choice of plan if necessary.

In Oracle terms, I’ll be looking for the equivalent of querying the data dictionary for information about data segments, extents and blocks; I’ll be looking for heap tables, index clusters and hash clusters, index-organized tables, partitioned tables, indexes and materialized views; and I’ll want to investigate the SQL Server equivalent of object statistics, execution plans, hints and dynamic performance views (the “v$” tables).
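
To pin that down with a couple of concrete examples, these are the sorts of Oracle queries I want to be able to translate:

    -- how big is a segment, and how is it built up?
    select segment_name, extents, blocks
    from   user_segments;

    -- which cached statements are doing the most work?
    select sql_id, executions, buffer_gets, disk_reads
    from   v$sql
    order by
           buffer_gets desc;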

But first, I’ll have to work out how to create a database, a table, some data, and an index.

Getting Started

After setting up an old laptop (2 CPUs, 2GB RAM) with Windows XP Pro, downloading and unzipping the trial product (Download Trial Software) and “Books Online” (SQL Server 2008 Books Online), I found that installing the software was a simple point-and-click operation, taking just a few minutes.

I installed everything under the “administrator” account on my laptop and ran all my tests while logged on as the administrator. Of course, this isn’t a sensible strategy for serious development, although I have heard that it’s considered sensible for the SQL Server DBA to be the System Administrator because of the way the database and the O/S are so tightly coupled.

When I start learning about security and audit I’ll strip the laptop down and start again with a proper layering of privileges. For the moment, though, I just want to be able to browse around the system with an extreme level of privilege, and “install and run as administrator” is a quick and easy temporary strategy.

Instances, Databases, Schemas…

Once the software was running, I needed to find out how to “do something”. A Google search immediately led me to a series of useful Microsoft video offerings, SQL Server 2008 for Oracle DBA. However, I quickly decided that I didn’t have the time to go at the speed of the lecturer, so I turned to “Books Online”, typed in a search for “Create Database” (including the quotation marks), and took it from there.

Item 7 on the hit-list from this search was “How to create a Database (SQL Server Management Studio)”, and item 11 was “Creating a Database (Database Engine)”. Both titles looked promising. The same search also gave me leads on creating database schemas and creating database users (a little surprise there because, in Oracle, a schema is synonymous with a user).

Browsing my way through the “Creating a database” links, I soon learned a little bit about what SQL Server means by an “instance”, and how an instance manages several databases (unlike Oracle, where an instance manages a single database), which are made up of “file groups” and files. I also found the names of a few of the dictionary structures that would help me find information about databases, files and schemas.

I also discovered the sqlcmd utility, and a couple of the command-line options, including :setvar, that go with it. This meant that I didn’t have to do everything from a GUI (real Oracle DBAs don’t use GUIs!). As I progressed, however, it became clear that sqlcmd is not intended to be a “proper” reporting tool like Oracle’s SQL*Plus. It is a pain to use: it has no features like “breaking” or “computing” on column values, it can’t impose formats on individual columns of the output data, and it took some time before I worked out that, basically, you have to format everything in the SQL rather than in the tool. Still, it does all I want, produces an audit trail of my experiments, and makes it easy for me to refine and extend repeatable test cases, even if the output is not as tidy as I’d like it to be.
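
For the record, a typical call looks something like this (the server, script and output file names are just placeholders):

    C:\> sqlcmd -S localhost -E -i setup.sql -o setup.lst

    -- and, inside setup.sql, :setvar defines a variable
    -- that can then be referenced as $(m_database)
    :setvar m_database testdata
    use $(m_database)
    go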

After a couple of hours I’d pulled together my first SQL Server 2008 script, had a couple of surprises, and learned a few key differences between SQL Server 2008 and Oracle:
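
In outline the script looked something like this (a sketch, but the database name is the one I kept for the rest of my experiments):

    create database testdata;
    go

    select name, database_id from sys.databases;
    go

    use testdata
    go

    select name, physical_name, size from sys.database_files;
    go

    select schema_id, name from sys.schemas;
    go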

I may have been a little profligate with my use of “go”: it marks the end of a batch of commands that SQL Server should execute but, at present, I find I have an urge to type it after virtually every statement I write. This is an approximate way of emulating Oracle’s SQL*Plus tool, where you end each SQL statement with a semi-colon and each statement is executed as you terminate it.

In any event, this is the effect of running my script from sqlcmd:

Notice that the physical_name output is truncated, so it’s not possible to see the actual names of the various files. As noted, sqlcmd is a major pain to use when you’ve been used to the formatting and presentation commands available in SQL*Plus, and when I first ran this script I had not yet worked out how to get a mixture of short and long text columns in the output.

So what had I managed to learn so far?

Instances and Databases

  • Every SQL Server instance comes with some pre-determined databases including:
    • a temporary database (tempdb) which will be used by all the other databases under the instance for “scratchpad” data and for the “version store” (the equivalent of Oracle’s undo records)
    • a model database, which is used as the template for each new database you create
    • a hidden database (i.e. one that does not appear in the list above) called mssqlsystemresource which, I’ve been told, makes it possible to apply upgrades and patches to SQL Server 2008 very quickly.
  • Contrary to “common knowledge” in the Oracle world, an SQL Server database does not equate to an Oracle schema. It seems to be exactly the same as an Oracle database. Of course, if a single instance manages several databases, the distinction between schema and database becomes a little blurred, but a critical point that highlights the distinction is that there are explicit commands to backup and restore a single database in SQL Server, but the same is not true for a schema (in SQL Server or Oracle).
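
That backup command, incidentally, is a one-liner (the path here is invented):

    backup database testdata to disk = 'c:\sql_backup\testdata.bak';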

Filegroups and Log Files

  • When you create a database you automatically create a primary “file group” (roughly speaking the equivalent of an Oracle “tablespace”) and a transaction log file (very similar in concept to the Oracle “redo logs”). The primary file group will, by default, consist of a single file called {database_name}.mdf and the single associated transaction log file (Oracle requires at least two “log file groups”) will be called {database_name}_log.ldf.
  • It is possible to create a database with multiple file groups, and multiple files in each file group. Physical database objects are each associated with a single file group so it’s probably a good idea to look on the file group as a key unit of database recovery. The Oracle equivalent is the “tablespace” which is made up of one or many files, and Oracle has a recovery feature known as “tablespace point in time recovery”. In fact, both SQL Server and Oracle can do backup and recovery at the single file level and at the page (block) level, but the file-group (tablespace) level is a good level to work with when thinking about self-consistent data sets.
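
Translating that into syntax, a database with a second file group might be created like this (a sketch; the names, paths and sizes are invented):

    create database testdata
        on primary (
            name = 'testdata',
            filename = 'c:\sql_data\testdata.mdf',
            size = 100MB
        ),
        filegroup user_data (
            name = 'testdata_user_1',
            filename = 'c:\sql_data\testdata_user_1.ndf',
            size = 100MB
        )
        log on (
            name = 'testdata_log',
            filename = 'c:\sql_data\testdata_log.ldf',
            size = 50MB
        );
    go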

Schemas

  • Given the appropriate level of privilege, you can query across all the databases known to the instance, but if you want to keep things simple you can “use” a database, and then all references to {schema}.{object} will default to that database. My query against sys.schemas, for example, was implicitly testdata.sys.schemas because of the preceding use testdata command, but I could just as easily have issued a query against master.sys.schemas (see the sketch after this list). Note to self: are transactions across multiple databases under the same instance treated as distributed transactions? If not, the boundary between schemas and databases becomes even more indistinct.
  • Every database contains the following schemas (and the repetition of the same schema name across multiple databases is another point in favor of the “databases are not schemas” argument):
    • An INFORMATION_SCHEMA (an ANSI standard requirement which gives you some limited information about the metadata)
    • A sys schema (which tells you a lot more about what’s in the database)
    • A dbo (DataBaseOwner) schema which is where objects get created if you’re (a) allowed to create objects and (b) not connected to any specific schema.
  • The view sys.schemas also reports db_owner as a schema, but it’s possible that this is a role name rather than a schema name. Note to self: notice the odd pattern of values in the list of schema ids above, which suggests that dbo might be different from db_owner.
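
The sketch I promised, showing the implicit and explicit forms of the same query:

    use testdata
    go

    select schema_id, name from sys.schemas;            -- implicitly testdata.sys.schemas
    select schema_id, name from master.sys.schemas;     -- explicit three-part name
    go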

Generating Test Data

Here’s an example of how I might generate a fairly typical set of test data in Oracle:
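
(The listing below is a sketch: the table name is arbitrary, but the generator CTE and the column expressions are the ones I discuss in the rest of this section.)

    execute dbms_random.seed(0)

    create table test_table
    as
    with generator as (
        select
            rownum  id
        from    dual
        connect by
            rownum <= 1000
    )
    select
        rownum                              id,
        trunc((rownum - 1)/1000)            clustered,
        mod(rownum - 1, 1000)               scattered,
        trunc(dbms_random.value(0, 1000))   randomized,
        dbms_random.string('U', 12)         random_string,
        lpad(rownum, 10)                    vc_small,
        rpad('x', 100, 'x')                 vc_padding
    from
        generator   v1,
        generator   v2
    where
        rownum <= 1000000
    ;

    alter table test_table add constraint tt_pk primary key (id);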

What I’ve done is extremely simple and very quick, taking about one minute to generate and index a million rows. If you replace the occurrences of 1000 in my sample with 5 so that the code generates just 25 rows, this is what they would look like (excluding the padding column with its string of 100 ‘x’s).

Unfortunately, there are at least eight reasons why my data generation code is not going to work in SQL Server, and even when I change the SQL to something that can run, it won’t do the same as it does in Oracle (for reasons which I’ll cover later). Clearly, the sort of investigative work I do with Oracle has to wait until I can work out how to translate this Oracle code into SQL Server code. So let’s take a look at the code and some of the problems with translating it to SQL Server.

The intention of the code is to create a simple heap table of one million rows of ‘random’ numerical and string data. Let’s take a look first at the CREATE TABLE portion of the script, where we use a common table expression (CTE), a mechanism usually referred to as “subquery factoring” in Oracle. In this CTE, we generate a small data set (using a couple of features that are completely dependent on Oracle), which subsequently will be ‘exploded’ into a large data set.

  • CREATE TABLE…AS SELECT – Oracle’s ‘create as select’ construct is not supported by SQL Server. Instead, I have to ‘create table’, then ‘insert into table’. (Note to self – is there anything in SQL Server that equates to the “append” hint and “nologging” option that you could use in Oracle to minimize the costs of data loading by disabling the generation of “undo” (version store) and “redo” (log)?)
  • The generator CTE – if I do use the common table expression mechanism to insert data after creating the table, then the overall shape of the Oracle insert statement would be ‘insert into table XXX with {alias} as {query} select …’ while the SQL Server code would be ‘with {alias} as {query} insert into table XXX select …’. It’s just one of those minor details where SQL Server and Oracle differ in the way you write the SQL (see the sketch after this list).
  • The dual table – dual is a very special construct in Oracle: a publicly visible table with exactly one row and one column that is guaranteed to exist (and in 10g it is a memory-only construct). I might have to create something explicitly in my SQL Server code to emulate it but, since SQL Server allows you to “select” a value without selecting it from anything, there may be no need for such a special table after all.
  • The “connect by” clause – an Oracle extension to the SQL standard to deal with “hierarchical”, or “parts explosion / bill of materials” queries. SQL Server has a special HierarchyID data type for dealing with this type of thing.
  • Rownum – a special Oracle “pseudo-column” feature; it’s convenient for adding a sequential row number to data as it comes out of a rowsource (“rowset” in SQL Server dialect). The closest approximation for generating that rownum value in SQL Server is the (standard) analytic function row_number() over(…), which Oracle also supports; however, this function requires a nominal sort to be applied to the data and could introduce an undesirable overhead. There is an “identity” type, of course, which could replace rownum as a column generator, and we could think about using the “top N” mechanism to get rid of the appearance of rownum in the predicate.
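
The sketch mentioned above; assuming a scratch table t1 with a single numeric column n, the two statement shapes look like this:

    -- Oracle: the factored subquery follows the "insert into"
    insert into t1
    with generator as (
        select rownum n from dual connect by rownum <= 5
    )
    select n from generator;

    -- SQL Server: the CTE comes first, and there is no need for dual;
    -- top (5) and row_number() stand in for "connect by" and rownum
    with generator as (
        select top (5)
            row_number() over (order by (select null)) as n
        from sys.all_objects
    )
    insert into t1
    select n from generator;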

In the subsequent select, we explode this initial small data set into a much larger one. Although I’ve used an internal Oracle mechanism (rownum) to generate it, the id column is there to represent a user-supplied unique identifier for each row. The overall effect, if we walk through the data in order of id, is that the scattered, clustered, and randomized columns will each show repeated occurrences of the values from 0 to 999, but with very different patterns of data distribution:

  • clustered – the clustered data will be 1,000 (five) zeros followed by 1,000 (five) ones and so on, generated using the trunc() function. As a minor irritant, “clustered” is a reserved word in SQL Server, so I’d have to use a different name for this column
  • scattered – the scattered data, generated using the mod() function, will cycle from zero to 999 (four), and then repeat the cycle 1,000 (five) times
  • randomized – the built-in Oracle package, dbms_random supplies a convenient way to generate pseudo-random data. Used with the trunc() function, we’ll end up with roughly equal numbers of each value scattered pseudo-randomly through the data set. For small sample sizes the distribution is likely to be a little uneven, as it is in the previous 25-row output.

The mod() and trunc() functions don’t exist in SQL Server, but the first hit of my search for “functions” on Books Online was a link to a comprehensive set of pages about the different types of functions available to SQL Server. The number of functions seems to be a little limited compared to Oracle. However, the arithmetic functions do include floor() and ceiling(), so we have a reasonable starting point, and there is a modulo operator (%), which I didn’t find until I thought of searching for “arithmetic operators” in Books Online.
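
So the translations I pencilled in were as follows, bearing in mind that floor() and trunc() give different answers for negative numbers:

    -- Oracle
    select mod(7, 5), trunc(7.9) from dual;

    -- SQL Server: % for mod(), floor() for trunc()
    select 7 % 5, floor(7.9);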

The semantics barrier
As a side note, my first search for a replacement for mod() was just the word “operators”, and the first hit was a list of the optimizer operations – a truly lucky find, as I probably wouldn’t have thought of searching for such a list, and certainly wouldn’t have expected the word “operators” to be the way to find it.

The random_string column will hold 12-character strings of randomly selected upper-case characters. (The call to dbms_random.seed(0) in the code ensures that the same “random” set of data is produced each time I run the script.) The vc_small column will be a 10-character representation of the id, space-padded on the left and, finally, the vc_padding column is just a string of 100 repetitions of the letter ‘x’ that I am using to pad the rows out to a reasonable size.

In the from clause, our generator CTE reappears in a Cartesian (or cross) join as our means of starting with a small data set and exploding it to a large data set. In passing, Oracle will probably operate the join as a merge join while SQL Server will probably report a nested loop join.

Once I’d created the data, I defined a primary key on the id column, which implicitly generates a unique index on the column and adds a NOT NULL constraint. Even though the syntax is common to Oracle and SQL Server, the effects will be different because SQL Server will default to a clustered index, which means (in Oracle terms) that it will rebuild the table as something very similar to an “index-organized table”.
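
Sketching the two variants against my test table, the difference is just one keyword:

    -- default: a clustered index, so the table is rebuilt around the key
    alter table test_table add constraint tt_pk primary key (id);

    -- alternative: keep the table as a simple heap with a separate index
    alter table test_table add constraint tt_pk primary key nonclustered (id);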

Clustered indexes and rebuilds…
It was something of a surprise to realize that if you populate a table in SQL server and then add a clustered index, the table – and any other current indexes – will be physically rebuilt. I’ll have more to say about clustered indexes in a future article.

In the next article in this series, I’ll describe the options I investigated while trying to work out a convenient way of generating data quickly. For the moment I’ll just leave you with my first attempt to create a very small amount of data:
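
(Again, a sketch: the column list is illustrative, but the if exists test, the nonclustered primary key, the multi-row insert and the final commit are the details I want to talk about.)

    if exists (
            select  null
            from    sys.tables
            where   name = 't1'
            and     schema_name(schema_id) = 'dbo'
    )
        drop table t1;
    go

    create table t1 (
        id          int         not null,
        small_vc    varchar(10),
        constraint t1_pk primary key nonclustered (id)
    );
    go

    insert into t1 (id, small_vc) values
        (1, 'one'),
        (2, 'two'),
        (3, 'three');
    go

    commit;
    go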

I doubt if anyone from the Oracle world would have problems understanding this code, although if exists is not available in that form in Oracle’s SQL or PL/SQL (the “procedural language extension to SQL”), so the procedural part of the script would have had to be written as an “anonymous PL/SQL block”.

I’ve written an existence check to look in the dbo schema, which picks up my earlier point that dbo is the default schema if you don’t connect to, or otherwise specify, a schema when you create an object. There seem to be several popular ways of writing code to avoid an error on “drop table”; but the only reason I’ve used one at all is to give an example of mixing pure SQL with procedural code in SQL Server.

The function schema_name() is an interesting SQL Server detail: like many other functions relating to the data dictionary, it effectively seems to take the place of a join. Note to self: does the use of the function change the execution plan, or does it somehow extend the SQL with a join to the sys.schemas view, which would be the more “pure”, and possibly more efficient, way of writing the SQL?
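
In other words, the question is whether these two queries (sketched against sys.tables) end up with the same execution plan:

    -- using the function
    select name
    from   sys.tables
    where  schema_name(schema_id) = 'dbo';

    -- using the explicit join the function seems to stand in for
    select t.name
    from   sys.tables  t
    join   sys.schemas s on s.schema_id = t.schema_id
    where  s.name = 'dbo';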

There are a couple of other details with which the Oracle user will not be familiar. In particular, the nonclustered keyword that appears in the constraint definition is worth noting: primary keys are clustered by default in SQL Server which, in Oracle-speak, would result in an “index-organized table”. My choice here was to make the table a simple heap table with a separate index – even though this seems to be close-to-heretical behavior in the world of SQL Server.

The other interesting feature is the ability to load multiple rows of data into a table with a single insert statement. For small tables this is very convenient, but it doesn’t address my need to generate a large amount of data very quickly.

The last surprise for me, on this first leg of my journey, was to discover the effect of that commit command; not only was it redundant, it actually produced an error message:
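
The message is SQL Server’s standard complaint that there is no open transaction to commit; it reads something like this:

    Msg 3902, Level 16, State 1
    The COMMIT TRANSACTION request has no corresponding BEGIN TRANSACTION.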

By default, my SQL Server database isn’t running in what I would call “transaction-based, read-consistent” mode. I’ve executed a call to insert some data, and it’s committed and visible the moment the command completes; I can’t roll back the change, and everyone else can see the data immediately. In the short term, as I look for ways to generate test data, this isn’t going to matter; but it’s an area that I’m going to have to examine very carefully if I want to make a proper comparison between Oracle and SQL Server. Read-consistency and transaction isolation are very important features, and at some point I’ll need to work through all the different isolation levels that SQL Server offers to find out how best to use them.
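
For the moment, the obvious workaround is to open a transaction explicitly (sketched against my little t1 table), which makes a rollback legal again:

    begin transaction;

    insert into t1 (id, small_vc) values (4, 'four');

    rollback transaction;   -- allowed, because there is now an open transaction
    go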

Summary

My initial brief was to talk about the requirements of an Enterprise Database Management system, and this encouraged me to start learning about SQL Server 2008. In this article, I’ve tried to capture my train of thought as I started my research, although I’ve restricted my writing to my investigation into the mechanics of the database engine.

So far I’ve learnt how to use the manuals, create a database, pick a little information from the data dictionary, and create a small amount of data. My next target is to find an effective way of generating large volumes of data fairly quickly so that I can start to examine the different structures I can use to store data and understand the options I have for optimizing storage and retrieval of data.