A MongoDB to SQL Server Migration: Data Typing and Un-nesting

Importing data from one system to another is a common task for database professionals. It’s possible to spend a lot of time figuring out what the schema should be for the target table. In this article, Rodney Landrum demonstrates an interesting solution to automatically figure out the data types and un-nest delimited data.

When recently assigned a project to migrate data from MongoDB to SQL Server, my initial concern was my lack of expertise with MongoDB itself. I can say now, after completing the project, that MongoDB was the absolute least of my worries and certainly not a challenge. In fact, the ultimate solution that was implemented bypassed the MongoDB service altogether and used a combination of a MongoDB dump file and the native command line tool bsondump to convert the dump files from BSON (Binary JSON) to JSON for direct load into SQL Server tables. I was fortunate to avoid the C# coding required for that initial load step to pull the data into staging tables whose columns were all NVARCHAR(MAX). (I have a very smart colleague, Lou Talarico, who noodled through that chore, and I have some references at the end of this article for further information.) My tasks were comparatively simple: all I had to do was generate CREATE TABLE scripts with correct data types from the data in the staging tables and transpose the delimited array data stored in many of the columns. In this article, I will provide the sample data and code that will accomplish these two tasks.

The MongoDB Data

Most of us have had it drilled into our SQL brains, for good reason, that while NVARCHAR(MAX) has its place, it is not ideal for every column in every table. There is plenty of supporting evidence that you can find with a quick search showing that the query optimizer does not always play nice when generating an optimal execution plan for these large data types. While it was the path of least resistance to use this data type to populate the SQL Server staging tables from the Extended JSON files MongoDB produced, it would not have been ideal for the final data types that the analysts would be writing queries against.

I decided to try to generate the code to create the final tables using a combination of metadata and a sampling of the actual data in the staging tables. I would need to do this for each column, pulling in any non-null values to determine what data type it should be, including its maximum length. Before reviewing the code that generates the final CREATE TABLE statements, take a look at some sample data so that you may understand the challenge.

Listing 1: Create and Populate Sample Table, doc_staging

The above query creates the table and inserts 12 sample records. You can see that all of the columns use nvarchar(max) as the data type. The data itself contains common data types that will need to be converted, such as int, datetime and decimal.

Several fields are double pipe “||” delimited strings, each containing concatenated values. For example, doc_id in the first row has the value N'63||93||28||50||38'. In MongoDB, these were nested records in the collection, and this is how the conversion brought these over to the staging table, as one flat record. Further, the doc_id field is ordinally related to the doc_type and doc_requirements fields. The requirement is to transpose these values to individual rows for the final conversion. But first you have to build the ultimate, properly data typed “production” table.
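
To make the examples that follow concrete, the staging table and a couple of its rows might look something like the sketch below. This is only a stand-in for Listing 1: the id, doc_date and doc_amount columns and all of the values are illustrative assumptions rather than the actual project data, and the full sample in Listing 1 creates twelve rows.

-- A stand-in for the doc_staging table of Listing 1; the id, doc_date and
-- doc_amount columns and all of the values below are illustrative assumptions.
CREATE TABLE dbo.doc_staging
(
    id               NVARCHAR(MAX) NULL,
    doc_id           NVARCHAR(MAX) NULL,
    doc_type         NVARCHAR(MAX) NULL,
    doc_requirements NVARCHAR(MAX) NULL,
    doc_date         NVARCHAR(MAX) NULL,
    doc_amount       NVARCHAR(MAX) NULL
);

INSERT INTO dbo.doc_staging
    (id, doc_id, doc_type, doc_requirements, doc_date, doc_amount)
VALUES
    (N'1', N'63||93||28||50||38', N'A||B||C||D||E',
     N'Req1||Req2||Req3||Req4||Req5', N'2018-06-01', N'125.50'),
    (N'2', N'43||79||51||36', N'B||C||A||B',
     N'Req2||Req3||Req1||Req2', N'2018-06-02', N'85.25');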

The Solution

The following code builds the CREATE TABLE script using a simple iterative technique (thankfully not cursors): it walks the column list gleaned from the metadata in INFORMATION_SCHEMA.COLUMNS, reads the non-null values in each column, and tries to ascertain the data type of those values. It is limited to a small set of data types, namely datetime, int, decimal and character-based values, but I have found it to be reliable and fast. Listing 2 shows the code in its entirety, commented to explain the process flow.

Listing 2: The code to create a CREATE TABLE script
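
In outline, the approach looks something like the following sketch for a single table; the TRY_CONVERT checks, the decimal(18,4) precision, the flat 20-character padding and the variable names here are simplifications and assumptions rather than the listing's exact code.

-- A condensed sketch of the derive-the-data-type approach for one table.
-- Column names come from INFORMATION_SCHEMA.COLUMNS; each column's non-null
-- values are sampled to guess int, decimal, datetime or character data.
DECLARE @schema       sysname = N'dbo',
        @table        sysname = N'doc_staging',
        @column       sysname,
        @ordinal      int = 1,
        @max_ordinal  int,
        @sql          nvarchar(max),
        @max_len      int,
        @is_int       int,
        @is_decimal   int,
        @is_datetime  int,
        @column_defs  nvarchar(max) = N'';

SELECT @max_ordinal = MAX(ORDINAL_POSITION)
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_SCHEMA = @schema AND TABLE_NAME = @table;

WHILE @ordinal <= @max_ordinal
BEGIN
    SELECT @column = COLUMN_NAME
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  TABLE_SCHEMA = @schema
      AND  TABLE_NAME   = @table
      AND  ORDINAL_POSITION = @ordinal;

    -- Interrogate the column: longest non-null value, and whether every
    -- non-null value converts cleanly to int, decimal or datetime.
    SET @sql = N'
        SELECT @max_len     = MAX(LEN(' + QUOTENAME(@column) + N')),
               @is_int      = MIN(CASE WHEN TRY_CONVERT(int, ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END),
               @is_decimal  = MIN(CASE WHEN TRY_CONVERT(decimal(18,4), ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END),
               @is_datetime = MIN(CASE WHEN TRY_CONVERT(datetime, ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END)
        FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N'
        WHERE ' + QUOTENAME(@column) + N' IS NOT NULL;';

    EXEC sp_executesql @sql,
         N'@max_len int OUTPUT, @is_int int OUTPUT, @is_decimal int OUTPUT, @is_datetime int OUTPUT',
         @max_len OUTPUT, @is_int OUTPUT, @is_decimal OUTPUT, @is_datetime OUTPUT;

    SET @column_defs += N'    ' + QUOTENAME(@column) + N' ' +
        CASE WHEN @is_int      = 1 THEN N'int'
             WHEN @is_decimal  = 1 THEN N'decimal(18,4)'
             WHEN @is_datetime = 1 THEN N'datetime'
             -- character data: pad the longest observed value by 20 characters
             ELSE N'nvarchar(' + CAST(ISNULL(@max_len, 1) + 20 AS nvarchar(10)) + N')'
        END + N',' + CHAR(13) + CHAR(10);

    SET @ordinal += 1;
END;

-- Drop the _staging suffix for the final table name and print the script.
SET @sql = N'CREATE TABLE ' + QUOTENAME(@schema) + N'.'
         + QUOTENAME(REPLACE(@table, N'_staging', N'')) + CHAR(13) + CHAR(10)
         + N'(' + CHAR(13) + CHAR(10)
         + LEFT(@column_defs, LEN(@column_defs) - 3)  -- trim trailing comma + CRLF
         + CHAR(13) + CHAR(10) + N');';

PRINT @sql;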

The code itself simply iterates through one or more tables, gathering the column names and ordinal positions from the metadata, interrogating each column to determine its maximum length and assumed data type, and then generating a CREATE TABLE script, padding the rounded-up length of character-based fields by 20. This version of the code uses Unicode data types. For the small sample table, it immediately returns the following:

[Output: the generated CREATE TABLE script for dbo.doc]

Notice that the _staging suffix has been removed and the final table will be dbo.doc. This, of course, is easily modifiable.

In the real project, there were tens of thousands of rows and the code still performed surprisingly well.

After running the CREATE TABLE statement to create the empty base table, appropriately data typed, it is time to populate it from the same staging data. Recall the requirement to transpose the delimited values from the doc_id, doc_type and doc_requirements columns so that each value becomes a separate row.

You will need to transpose this:

43||79||51||36

Into this:

43

79

51

36

And further, you will need to join these doc_ids to their precise doc_type and doc_requirements values.

The figure below shows the first row, using the sample data.

The next figure shows what the final result should be. Notice that the doc_ids have maintained their ordinal positions rather than being sorted numerically, which is crucial because the values in doc_type and doc_requirements occupy the same positional location as the doc_id they relate to.

Fortunately, SQL Server 2016 introduced a function called STRING_SPLIT() that handles this kind of splitting of delimited values with ease and does so very efficiently. The trick to getting these values to line up is to CROSS APPLY the results of the STRING_SPLIT function against the base table and use the window function ROW_NUMBER to maintain each value's position.

The following listing shows the simple query to perform the work.

Listing 3: Splitting the field
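
In outline, the split looks something like the following sketch, which uses the column names from the staging sketch above (the id column in particular is an assumption).

-- Split the doc_id column into one row per delimited value, numbering the
-- split values within each source row.
SELECT  s.id,
        splt.value AS doc_id,
        ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
FROM    dbo.doc_staging AS s
CROSS APPLY STRING_SPLIT(REPLACE(s.doc_id, '||', '|'), '|') AS splt;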

Notice that ROW_NUMBER is partitioned and ordered on the ID value of each row, which returns a running sequence over the delimited values that are cross applied from that row.

You can then use the row number to join to the same cross-applied row numbers for doc_type and doc_requirements. Remember, each field to transpose has a variable number of actual delimited values; some rows may have five or more values, and others may have only one, but each row will contain the same number of doc_ids as there are doc_type and doc_requirements values.

It is also worth pointing out that the REPLACE function changes the double pipe to a single pipe character. Using the double pipe was a choice we made to have more assurance that there would not be actual data values containing this combination of characters, but the downside is that the STRING_SPLIT function only accepts a single-character delimiter. Hence, the double pipe must be replaced with a single-character delimiter before splitting. The final code to return exactly what is needed for the sample data, all 46 new rows, is in the following listing, which uses a CTE-based solution.

Listing 4: CTE to transpose the data values from the delimited values
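
The pattern looks something like the sketch below, again using the assumed column names from the staging sketch and leaving out the non-delimited columns for brevity; the INSERT line is commented out so that the results can be inspected first.

-- One CTE per delimited column, each numbering its split values per source
-- row, joined back together on the row id and the row number.
WITH ids AS
(
    SELECT s.id,
           splt.value AS doc_id,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_id, '||', '|'), '|') AS splt
),
types AS
(
    SELECT s.id,
           splt.value AS doc_type,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_type, '||', '|'), '|') AS splt
),
reqs AS
(
    SELECT s.id,
           splt.value AS doc_requirements,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_requirements, '||', '|'), '|') AS splt
)
--INSERT INTO dbo.doc (id, doc_id, doc_type, doc_requirements)
SELECT  ids.id,
        ids.doc_id,
        types.doc_type,
        reqs.doc_requirements
FROM    ids
JOIN    types ON types.id = ids.id AND types.rn = ids.rn
JOIN    reqs  ON reqs.id  = ids.id AND reqs.rn  = ids.rn
ORDER BY ids.id, ids.rn;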

Each CTE splits one of the delimited columns, and the CTEs are joined together in the outer query. Since the OVER clause for ROW_NUMBER is identical in each case, they join together perfectly on the ID and row number.

By using this code, you can easily insert the results into the newly created dbo.doc table. I have commented out that insert statement so that you can see the results first.

Summary

I like the idea of having reusable code, even if that means picking out certain pieces from a larger body of work. I believe the code I am sharing here, which certainly has much room for improvement and extension, can be readily used for a variety of other purposes with some slight modification. One such use would be to generate CREATE TABLE scripts for each table in an existing database; instead of deriving the data types, it is possible simply to use the existing ones. I will most likely go back to this code many times for future projects that may involve a REST API, for example, rather than a MongoDB conversion. I know, too, that I will find a need to transpose delimited data somewhere down the road. Even if you do not have a specific need for a MongoDB conversion, I hope that the techniques shown here will help you in your projects.
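
As a rough illustration of that variation (not code from the project), a script can read each table's declared types straight from INFORMATION_SCHEMA.COLUMNS and assemble one CREATE TABLE statement per table. The sketch below uses STRING_AGG, so it needs SQL Server 2017 or later, and it ignores details such as decimal precision, identity columns and collations.

-- Script a basic CREATE TABLE statement for every base table in the current
-- database, using the declared data types instead of deriving them.
SELECT  'CREATE TABLE ' + QUOTENAME(c.TABLE_SCHEMA) + '.' + QUOTENAME(c.TABLE_NAME) + ' ('
        + STRING_AGG(
              CAST(QUOTENAME(c.COLUMN_NAME) + ' ' + c.DATA_TYPE
                   + CASE WHEN c.CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)'
                          WHEN c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
                               THEN '(' + CAST(c.CHARACTER_MAXIMUM_LENGTH AS varchar(10)) + ')'
                          ELSE '' END
                   + CASE WHEN c.IS_NULLABLE = 'NO' THEN ' NOT NULL' ELSE ' NULL' END
                   AS nvarchar(max)), ', ')
          WITHIN GROUP (ORDER BY c.ORDINAL_POSITION)
        + ');' AS create_script
FROM    INFORMATION_SCHEMA.COLUMNS AS c
JOIN    INFORMATION_SCHEMA.TABLES  AS t
        ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
        AND t.TABLE_NAME   = c.TABLE_NAME
WHERE   t.TABLE_TYPE = 'BASE TABLE'
GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME;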

References

For further information on the JSON framework used for the C# code as well as the bsondump utility, please see the following links.

https://www.newtonsoft.com/json

https://docs.mongodb.com/manual/reference/program/bsondump/

It is worth pointing out that the documentation for bsondump says that it is a diagnostic tool, not a tool for data ingestion or other application use.