A MongoDB to SQL Server Migration: Data Typing and Un-nesting

Importing data from one system to another is a common task for database professionals. It’s possible to spend a lot of time figuring out what the schema should be for the target table. In this article, Rodney Landrum demonstrates an interesting solution to automatically figure out the data types and un-nest delimited data.

When recently assigned a project to migrate data from MongoDB to SQL Server, my initial concern was my lack of expertise with MongoDB itself. I can say now, after completing the project, that MongoDB was the absolute least of my worries and certainly not a challenge. In fact, the ultimate solution that was implemented bypassed the MongoDB service altogether and used a combination of a MongoDB dump file and the native command line tool bsondump to convert the dump files from BSON (Binary JSON) to JSON for direct load into SQL Server tables. I was fortunate to avoid the C# coding required for that initial load step to pull the data into staging tables whose columns were all NVARCHAR(MAX). (I have a very smart colleague, Lou Talarico, who noodled through that chore, and I have some references at the end of this article for further information.) My tasks were comparatively simple: all I had to do was generate CREATE TABLE scripts with correct data types from the data in the staging tables and transpose the delimited array data stored in many of the columns. In this article, I will provide the sample data and code that will accomplish these two tasks.

The MongoDB Data

Most of us have had it drilled into our SQL brains, for good reason, that while NVARCHAR(MAX) has its place, it is not ideal for every column in every table. There is plenty of supporting evidence that you can find with a quick search showing that the query optimizer does not always play nice when generating an optimal execution plan for these large data types. While it was the path of least resistance to use this data type to populate the SQL Server staging tables from the Extended JSON files MongoDB produced, it would not have been ideal for the final data types that the analysts would be writing queries against.

I decided to try to generate the code to create the final tables using a combination of metadata and a sampling of the actual data in the staging tables. I would need to do this for each column, pulling in any non-null values to determine what data type it should be, including its maximum length. Before reviewing the code that generates the final CREATE TABLE statements, take a look at some sample data so that you may understand the challenge.

Listing 1: Create and Populate Sample Table, doc_staging

The above query creates the table and inserts 12 sample records. You can see that all of the columns use nvarchar(max) as the data type. The data itself contains common data types that will need to be converted, such as int, datetime and decimal.

Several fields are double pipe “||” delimited strings, each containing concatenated values. For example, doc_id in the first row has the value N'63||93||28||50||38'. In MongoDB, these were nested records in the collection, and this is how the conversion brought these over to the staging table, as one flat record. Further, the doc_id field is ordinally related to the doc_type and doc_requirements fields. The requirement is to transpose these values to individual rows for the final conversion. But first you have to build the ultimate, properly data typed “production” table.
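
To make the examples that follow concrete, the staging table and a couple of its rows might look something like the sketch below. This is only a stand-in for Listing 1: the id, doc_date and doc_amount columns and all of the values are illustrative assumptions rather than the actual project data, and the full sample in Listing 1 creates twelve rows.

-- A stand-in for the doc_staging table of Listing 1; the id, doc_date and
-- doc_amount columns and all of the values below are illustrative assumptions.
CREATE TABLE dbo.doc_staging
(
    id               NVARCHAR(MAX) NULL,
    doc_id           NVARCHAR(MAX) NULL,
    doc_type         NVARCHAR(MAX) NULL,
    doc_requirements NVARCHAR(MAX) NULL,
    doc_date         NVARCHAR(MAX) NULL,
    doc_amount       NVARCHAR(MAX) NULL
);

INSERT INTO dbo.doc_staging
    (id, doc_id, doc_type, doc_requirements, doc_date, doc_amount)
VALUES
    (N'1', N'63||93||28||50||38', N'A||B||C||D||E',
     N'Req1||Req2||Req3||Req4||Req5', N'2018-06-01', N'125.50'),
    (N'2', N'43||79||51||36', N'B||C||A||B',
     N'Req2||Req3||Req1||Req2', N'2018-06-02', N'85.25');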

The Solution

The following code builds the CREATE TABLE script using a simple iterative technique (thankfully not cursors): it walks the column list gleaned from the metadata in INFORMATION_SCHEMA.COLUMNS, reads the non-null values in each column, and tries to ascertain the data type of those values. It is limited to a small set of data types, namely datetime, int, decimal and character-based values, but I have found it to be reliable and fast. Listing 2 shows the code in its entirety, commented to explain the process flow.

Listing 2: The code to create a CREATE TABLE script
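
In outline, the approach looks something like the following sketch for a single table; the TRY_CONVERT checks, the decimal(18,4) precision, the flat 20-character padding and the variable names here are simplifications and assumptions rather than the listing's exact code.

-- A condensed sketch of the derive-the-data-type approach for one table.
-- Column names come from INFORMATION_SCHEMA.COLUMNS; each column's non-null
-- values are sampled to guess int, decimal, datetime or character data.
DECLARE @schema       sysname = N'dbo',
        @table        sysname = N'doc_staging',
        @column       sysname,
        @ordinal      int = 1,
        @max_ordinal  int,
        @sql          nvarchar(max),
        @max_len      int,
        @is_int       int,
        @is_decimal   int,
        @is_datetime  int,
        @column_defs  nvarchar(max) = N'';

SELECT @max_ordinal = MAX(ORDINAL_POSITION)
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_SCHEMA = @schema AND TABLE_NAME = @table;

WHILE @ordinal <= @max_ordinal
BEGIN
    SELECT @column = COLUMN_NAME
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  TABLE_SCHEMA = @schema
      AND  TABLE_NAME   = @table
      AND  ORDINAL_POSITION = @ordinal;

    -- Interrogate the column: longest non-null value, and whether every
    -- non-null value converts cleanly to int, decimal or datetime.
    SET @sql = N'
        SELECT @max_len     = MAX(LEN(' + QUOTENAME(@column) + N')),
               @is_int      = MIN(CASE WHEN TRY_CONVERT(int, ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END),
               @is_decimal  = MIN(CASE WHEN TRY_CONVERT(decimal(18,4), ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END),
               @is_datetime = MIN(CASE WHEN TRY_CONVERT(datetime, ' + QUOTENAME(@column) + N') IS NULL THEN 0 ELSE 1 END)
        FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N'
        WHERE ' + QUOTENAME(@column) + N' IS NOT NULL;';

    EXEC sp_executesql @sql,
         N'@max_len int OUTPUT, @is_int int OUTPUT, @is_decimal int OUTPUT, @is_datetime int OUTPUT',
         @max_len OUTPUT, @is_int OUTPUT, @is_decimal OUTPUT, @is_datetime OUTPUT;

    SET @column_defs += N'    ' + QUOTENAME(@column) + N' ' +
        CASE WHEN @is_int      = 1 THEN N'int'
             WHEN @is_decimal  = 1 THEN N'decimal(18,4)'
             WHEN @is_datetime = 1 THEN N'datetime'
             -- character data: pad the longest observed value by 20 characters
             ELSE N'nvarchar(' + CAST(ISNULL(@max_len, 1) + 20 AS nvarchar(10)) + N')'
        END + N',' + CHAR(13) + CHAR(10);

    SET @ordinal += 1;
END;

-- Drop the _staging suffix for the final table name and print the script.
SET @sql = N'CREATE TABLE ' + QUOTENAME(@schema) + N'.'
         + QUOTENAME(REPLACE(@table, N'_staging', N'')) + CHAR(13) + CHAR(10)
         + N'(' + CHAR(13) + CHAR(10)
         + LEFT(@column_defs, LEN(@column_defs) - 3)  -- trim trailing comma + CRLF
         + CHAR(13) + CHAR(10) + N');';

PRINT @sql;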

The code itself simply iterates through one or more tables, gathering the column names and ordinal positions from the metadata, interrogating each column to determine its maximum length and assumed data type, and then generating a CREATE TABLE script, padding the rounded-up length of character-based fields by 20. This version of the code uses Unicode data types. For the small sample table, it immediately returns the following:

[Output: the generated CREATE TABLE script for dbo.doc]

Notice that the _staging suffix has been removed and the final table will be dbo.doc. This, of course, is easily modifiable.

In the real project, there were tens of thousands of rows and the code still performed surprisingly well.

After running the CREATE TABLE statement to create the empty base table, appropriately data typed, it is time to populate it from the same staging data. Recall the requirement to transpose the delimited values from the doc_id, doc_type and doc_requirements columns so that each value becomes a separate row.

You will need to transpose this:

43||79||51||36

Into this:

43

79

51

36

And further, you will need to join these doc_ids to their precise doc_type and doc_requirements values.

The figure below shows the first row, using the sample data.

The next figure shows what the final result should be. Notice that the doc_ids have maintained their ordinal positions rather than being sorted numerically, which is crucial because the values in doc_type and doc_requirements occupy the same positional location as the doc_id they relate to.

Fortunately, SQL Server 2016 introduced a function called STRING_SPLIT() that handles this kind of splitting of delimited values with ease and does so very efficiently. The trick to getting these values to line up is to CROSS APPLY the results of the STRING_SPLIT function against the base table and use the window function ROW_NUMBER to maintain each value's position.

The following listing shows the simple query to perform the work.

Listing 3: Splitting the field
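
In outline, the split looks something like the following sketch, which uses the column names from the staging sketch above (the id column in particular is an assumption).

-- Split the doc_id column into one row per delimited value, numbering the
-- split values within each source row.
SELECT  s.id,
        splt.value AS doc_id,
        ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
FROM    dbo.doc_staging AS s
CROSS APPLY STRING_SPLIT(REPLACE(s.doc_id, '||', '|'), '|') AS splt;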

Notice that ROW_NUMBER is partitioned and ordered on the ID value of each row, which returns a running sequence over the delimited values that are cross applied from that row.

You can then use the row number to join to the same cross-applied row numbers for doc_type and doc_requirements. Remember, each field to transpose has a variable number of actual delimited values; some rows may have five or more values, and others may have only one, but each row will contain the same number of doc_ids as there are doc_type and doc_requirements values.

It is also worth pointing out that the REPLACE function changes the double pipe to a single pipe character. Using the double pipe was a choice we made to have more assurance that there would not be actual data values containing this combination of characters, but the downside is that the STRING_SPLIT function only accepts a single-character delimiter. Hence, the double pipe must be replaced with a single-character delimiter before splitting. The final code to return exactly what is needed for the sample data, all 46 new rows, is in the following listing, which uses a CTE-based solution.

Listing 4: CTE to transpose the data values from the delimited values
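
The pattern looks something like the sketch below, again using the assumed column names from the staging sketch and leaving out the non-delimited columns for brevity; the INSERT line is commented out so that the results can be inspected first.

-- One CTE per delimited column, each numbering its split values per source
-- row, joined back together on the row id and the row number.
WITH ids AS
(
    SELECT s.id,
           splt.value AS doc_id,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_id, '||', '|'), '|') AS splt
),
types AS
(
    SELECT s.id,
           splt.value AS doc_type,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_type, '||', '|'), '|') AS splt
),
reqs AS
(
    SELECT s.id,
           splt.value AS doc_requirements,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.id) AS rn
    FROM   dbo.doc_staging AS s
    CROSS APPLY STRING_SPLIT(REPLACE(s.doc_requirements, '||', '|'), '|') AS splt
)
--INSERT INTO dbo.doc (id, doc_id, doc_type, doc_requirements)
SELECT  ids.id,
        ids.doc_id,
        types.doc_type,
        reqs.doc_requirements
FROM    ids
JOIN    types ON types.id = ids.id AND types.rn = ids.rn
JOIN    reqs  ON reqs.id  = ids.id AND reqs.rn  = ids.rn
ORDER BY ids.id, ids.rn;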

Each CTE splits one of the delimited columns, and the CTEs are joined together in the outer query. Since the OVER clause for ROW_NUMBER is identical in each case, they join together perfectly on the ID and row number.

By using this code, you can easily insert the results into the newly created dbo.doc table. I have commented out that insert statement so that you can see the results first.

Summary

I like the idea of having reusable code, even if that means picking out certain pieces from a larger body of work. I believe the code I am sharing here, which certainly has much room for improvement and extension, can be readily used for a variety of other purposes with some slight modification. One such use would be to generate CREATE TABLE scripts for each table in an existing database; instead of deriving the data types, it is possible simply to use the existing ones. I will most likely go back to this code many times for future projects that may involve a REST API, for example, rather than a MongoDB conversion. I know, too, that I will find a need to transpose delimited data somewhere down the road. Even if you do not have a specific need for a MongoDB conversion, I hope that the techniques shown here will help you in your projects.
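
As a rough illustration of that variation (not code from the project), a script can read each table's declared types straight from INFORMATION_SCHEMA.COLUMNS and assemble one CREATE TABLE statement per table. The sketch below uses STRING_AGG, so it needs SQL Server 2017 or later, and it ignores details such as decimal precision, identity columns and collations.

-- Script a basic CREATE TABLE statement for every base table in the current
-- database, using the declared data types instead of deriving them.
SELECT  'CREATE TABLE ' + QUOTENAME(c.TABLE_SCHEMA) + '.' + QUOTENAME(c.TABLE_NAME) + ' ('
        + STRING_AGG(
              CAST(QUOTENAME(c.COLUMN_NAME) + ' ' + c.DATA_TYPE
                   + CASE WHEN c.CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)'
                          WHEN c.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
                               THEN '(' + CAST(c.CHARACTER_MAXIMUM_LENGTH AS varchar(10)) + ')'
                          ELSE '' END
                   + CASE WHEN c.IS_NULLABLE = 'NO' THEN ' NOT NULL' ELSE ' NULL' END
                   AS nvarchar(max)), ', ')
          WITHIN GROUP (ORDER BY c.ORDINAL_POSITION)
        + ');' AS create_script
FROM    INFORMATION_SCHEMA.COLUMNS AS c
JOIN    INFORMATION_SCHEMA.TABLES  AS t
        ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
        AND t.TABLE_NAME   = c.TABLE_NAME
WHERE   t.TABLE_TYPE = 'BASE TABLE'
GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME;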

References

For further information on the JSON framework used for the C# code as well as the bsondump utility, please see the following links.

https://www.newtonsoft.com/json

https://docs.mongodb.com/manual/reference/program/bsondump/

It is worth pointing out that the documentation for bsondump says that it is a diagnostic tool, not a tool for data ingestion or other application use.