Saving Data to Various Formats, Using SQL

I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? How long does it take to output in MongoDB’s Extended JSON? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.

In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.

My motivation for doing this was to explore ways of quickly transferring data to MongoDB, and to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky: the easy and obvious ways were slow.

How? Why?

The reason I like AdventureWorks is that it presents a bit of an assault course. There are illegal column names, reserved words used as column names, CLR types, and free-text search. Not as much of this as I’d like, of course, but it makes things a bit more tricky, so it is good for testing, along with old Pubs with its awkward, obsolete data types.

Because the SQL Server implementations of XML and JSON are rather delicate creatures that quail at the smell of a CLR datatype, or a column name with an embedded space, we generally need some supporting logic, rather than belting out the tables to file with a simple SELECT *.
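
FOR JSON, for example, simply refuses to serialise a CLR type. As an illustration of the kind of supporting logic required, here is a minimal sketch (not the routine used in this article) that builds a JSON-safe select list for a table, converting CLR types to strings and quoting awkward column names; it assumes SQL Server 2017 or later for STRING_AGG:

DECLARE @columnList NVARCHAR(MAX);
SELECT @columnList =
  STRING_AGG(CONVERT(NVARCHAR(MAX),
      CASE WHEN t.name IN ('hierarchyid', 'geometry', 'geography')
           -- FOR JSON cannot serialise CLR types, so cast them to strings
           THEN 'CONVERT(NVARCHAR(MAX), ' + QUOTENAME(c.name) + ') AS '
                + QUOTENAME(c.name)
           -- QUOTENAME deals with embedded spaces and reserved words
           ELSE QUOTENAME(c.name)
      END), ', ')
  FROM sys.columns c
    INNER JOIN sys.types t ON c.user_type_id = t.user_type_id
  WHERE c.object_id = OBJECT_ID('HumanResources.Employee');
-- @columnList can then be spliced into a dynamic 'SELECT ... FOR JSON PATH'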

As most people know, you can use BCP very simply, just specifying each table. Here we output in tab-delimited format in UTF-8, which is a format that Excel likes. It is safe as long as you don’t include tab characters in your spreadsheet cells! Note that xp_cmdshell must be enabled to try this example.
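
A minimal sketch of such a call follows; the database name, directory path and trusted-connection switch are assumptions to adjust for your own server, and -C 65001, which requests UTF-8 output, needs the bcp utility shipped with SQL Server 2016 or later:

-- Export one table in tab-delimited UTF-8 (xp_cmdshell must be enabled)
-- -c gives character mode, which uses tab as its default field terminator
DECLARE @command NVARCHAR(4000) =
  'bcp "AdventureWorks2016.HumanResources.Employee" out '
  + '"C:\data\TabDelimited\HumanResources.Employee.Tab" '
  + '-c -C 65001 -T -S ' + @@ServerName;
EXECUTE master..xp_cmdshell @command;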

Native and “comedy-limited” formats aren’t much harder. Proper CSV format output from the tables of a database is only possible via ODBC.
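
For comparison, the same call with -n gives native format, and adding -t"," gives the comma-delimited variant. A sketch, with the same assumed database and paths:

-- Native format: fastest, but only readable by bcp itself
DECLARE @native NVARCHAR(4000) =
  'bcp "AdventureWorks2016.HumanResources.Employee" out '
  + '"C:\data\Native\HumanResources.Employee.bcp" -n -T -S ' + @@ServerName;
EXECUTE master..xp_cmdshell @native;

-- Comma-delimited: character mode with a comma terminator (not true CSV!)
DECLARE @csvish NVARCHAR(4000) =
  'bcp "AdventureWorks2016.HumanResources.Employee" out '
  + '"C:\data\CommaDelimited\HumanResources.Employee.csv" -c -t"," -T -S '
  + @@ServerName;
EXECUTE master..xp_cmdshell @csvish;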

Creating the Test Harness

For this task, I set up several directories on the server for the different types of output.

Then I ran the script from The Code section of this article, which wrote out the data from AdventureWorks into each directory in the different formats. Within the batch, I included all the necessary temporary procedures, along with a routine that switched on the configuration option that allowed me to use xp_cmdshell and then switched it off again.
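
The switching logic is the standard sp_configure dance. A minimal sketch of the idea (the full script in The Code section remembers and restores the original settings):

-- Enable xp_cmdshell for the duration of the run...
EXECUTE sp_configure 'show advanced options', 1; RECONFIGURE;
EXECUTE sp_configure 'xp_cmdshell', 1; RECONFIGURE;

-- ...do the exports here...

-- ...and switch it off again afterwards
EXECUTE sp_configure 'xp_cmdshell', 0; RECONFIGURE;
EXECUTE sp_configure 'show advanced options', 0; RECONFIGURE;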

Into each directory were written all the files, one for each table, like this.

One can, and of course should, open these files up and check them. Here is the Tab-delimited format.

OK, it would perhaps be good not to have the quoted identifiers in the filename, [HumanResources].[Employee].Tab, but they are legal. If they irritate you, it is very quick and easy to remove them.

Mixed Usefulness of the Different Formats

Not all these formats are particularly useful. Tab-delimited is fine just so long as you don’t have tabs in any strings. The comma-separated format isn’t true CSV: values containing commas are neither quoted nor escaped. If you run …

… you will see instantly that it hasn’t coped.
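
To see why, consider a value that contains a comma. A true CSV writer would quote it; bcp’s character mode just writes it out raw, so the field boundaries are lost. A contrived illustration:

-- The stored value is a single field...
SELECT N'Redmond, WA' AS City, 42 AS Total;
-- ...but bcp -c -t"," writes the line:   Redmond, WA,42
-- which any CSV reader will parse as three fields, not two.
-- An RFC 4180 CSV writer would instead emit:   "Redmond, WA",42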

The Results

The XML via BCP and native BCP both ran very rapidly. Tab-delimited and comma-delimited took a little longer. SQLCMD and the object-in-array JSON were slightly laggard but respectable. Extended JSON had a third more text to write than ordinary JSON. Bringing up the rear was the array-in-array format. I originally tried exporting array-in-array JSON via XML, but it turned out to be too slow: although XML is quick to export, the overhead of converting it to JSON is considerable. A method using OpenJSON was likewise prohibitively slow, taking two or three minutes.
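
To make the distinction concrete: FOR JSON produces object-in-array output natively, whereas array-in-array has to be constructed by hand. Here is one possible sketch using STRING_AGG and STRING_ESCAPE (SQL Server 2017 onwards; it is an illustration, not necessarily the method used in the scripts below):

-- Object-in-array, straight from FOR JSON:
SELECT TOP (2) BusinessEntityID, JobTitle
  FROM HumanResources.Employee
  ORDER BY BusinessEntityID
  FOR JSON PATH;
-- [{"BusinessEntityID":1,"JobTitle":"Chief Executive Officer"}, ... ]

-- Array-in-array, built by hand: each row becomes a bare JSON array
SELECT '[' + STRING_AGG(
         CONVERT(NVARCHAR(MAX),
           '[' + CONVERT(NVARCHAR(10), BusinessEntityID) + ',"'
               + STRING_ESCAPE(JobTitle, 'json') + '"]'), ',')
         WITHIN GROUP (ORDER BY BusinessEntityID)
     + ']' AS JsonArray
  FROM HumanResources.Employee;
-- [[1,"Chief Executive Officer"],[2,"Vice President of Engineering"], ... ]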

A word of caution with these results. They can vary considerably, even on the same server and general setup.

The performance of array-in-array JSON was as good as I could manage to get. CLR, or even PowerShell, could easily do better, but I’d prefer to wait for SQL Server to support the format. Even so, it takes only five times as long as the fastest method, native BCP, so it is usable for most purposes.

What about the space they take up?

The advantage of the array-in-array format of JSON is that the text takes half the space of the object-in-array format. This advantage almost vanishes when the files are zipped, though.

XML and JSON object-in-array formats both take around twice as much space as the simpler tabular formats, which is unsurprising, as they are more versatile in the structure of the information they can hold. The large space used by the comma-delimited files is due to the fact that I had difficulty storing the file in UTF-8 format, so I opted for Unicode instead; the difference disappeared when the files were zipped. The comma-delimited format was only there for the sake of completeness.

All the formats, when zipped, take roughly the same space. This shouldn’t be a surprise, because they all hold exactly the same information; it is just that some formats repeat it a lot!

The Code

So, here is the SQL to run the comparative tests. They take some time to run: something like three and a half minutes for AdventureWorks. You’ll see that we start by hurriedly ensuring that we are allowed to use xp_cmdshell, changing the configuration if necessary. Then we run the tests and finally, if necessary, close off the configuration to prevent it being used.

To run this, change the path values to a directory on your server if required. Remember that you are writing these files to your server, not the workstation.

I describe the methodology for the test harness, and how to set it up as a SQL Prompt snippet, in an article ‘How to record T-SQL execution times using a SQL Prompt snippet’.
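
The bare mechanism behind the timing is simple; a minimal sketch (the snippet in that article also logs the results to a table, which this omits):

DECLARE @TimeStart DATETIME2 = SYSDATETIME();
-- ... the export under test goes here ...
SELECT DATEDIFF(MILLISECOND, @TimeStart, SYSDATETIME()) AS ElapsedMs;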

These routines rely on temporary stored procedures. These are all stored on GitHub, in the JSONSQLServerRoutines repository, along with several others that I use for processing JSON, and I’ll keep them updated. For this example, create the #SaveJsonDataFromTable, #SaveExtendedJsonDataFromTable, and #ArrayInArrayJsonDataFromTable temporary stored procedures in the same query window before running the following code.

Conclusions

It is very useful to run tests like these, and I caught several bugs while doing so. In any case, I much prefer testing things to waving my hands and arguing.

What I’d conclude from all this is that JSON is a good way of storing data, and isn’t really any bulkier as a format than any other way of doing it. When data is zipped, it seems to boil down to much the same size whatever format you use. Obviously, native BCP is the quickest and neatest way of ‘dumping’ data, but it isn’t easy to read, and you cannot prepare data from another source and import it as native BCP, because it isn’t a data transport medium. If you want an unzipped tabular format with the robustness of JSON and the size of CSV, then JSON array-in-array is the one to go for.

None of the delimited formats are safe, in that it is always possible that a column value could contain an unescaped character that is used as a delimiter. CSV, in contrast, would be fine if there were a reliable way of producing it with BCP. If you want a safe way of transporting data, then XML or JSON is the obvious choice, but who would use XML when there is an alternative?
