{"id":91469,"date":"2021-07-01T17:45:57","date_gmt":"2021-07-01T17:45:57","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=91469"},"modified":"2021-07-21T20:05:26","modified_gmt":"2021-07-21T20:05:26","slug":"query-blob-storage-sql-using-azure-synapse","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/query-blob-storage-sql-using-azure-synapse\/","title":{"rendered":"How to query blob storage with SQL using Azure Synapse"},"content":{"rendered":"<p><strong>The series so far:<\/strong><\/p>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/query-blob-storage-sql-using-azure-synapse\/\">How to query blob storage with SQL using Azure Synapse<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/query-private-blob-storage-sql\/\">How to query private blob storage with SQL and Azure Synapse<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/performance-of-querying-blob-storage-with-sql\/\">Performance of querying blob storage with SQL<\/a>\u00a0<\/li>\n<\/ol>\n\n<p>When you deal with <strong>Big Data<\/strong> in <strong>Azure<\/strong>, you will store lots of files on <strong>Azure Storage<\/strong>. From time to time, you may need to investigate these files. Wouldn&#8217;t it be cool if you could just open a window and execute a SQL query over the storage? In this article, I explain how to query blob storage with SQL using Azure Synapse.<\/p>\n<p><strong>Synapse Analytics<\/strong> makes this not only possible but also very affordable since you pay as you go. At the time of this writing, the estimated cost is only $5.00 USD per terabyte processed. Synapse Analytics makes it easy to query your storage and analyse files using the SQL language.<\/p>\n<p><strong>Disclaimer:<\/strong> This is only the price for data processing on the <strong>Synapse Serverless SQL pool<\/strong>. 
It doesn\u2019t include the storage price and other services used. This price may change anytime. Please check the <a href=\"https:\/\/azure.microsoft.com\/en-us\/pricing\/details\/synapse-analytics\/\">pricing on the Azure website<\/a> for more precise information.<\/p>\n<p>Next, I\u2019ll show you how to implement Azure Synapse and dig into the details to show you how it works.<\/p>\n<h2>Provisioning the workspace<\/h2>\n<p>I will not duplicate the good Microsoft documentation. You can follow this <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/quickstart-create-workspace?WT.mc_id=DP-MVP-4014132\">QuickStart to provision the workspace<\/a>. Use the option to create a new storage account together with the workspace.<\/p>\n<p>When the workspace is provisioned, only the <strong>SQL On Demand<\/strong> pool is created. It\u2019s the most affordable one, and you can stick to that: use the <strong>SQL On Demand<\/strong> pool to query your blob storage.<\/p>\n<p>The Synapse Workspace name needs to be unique because it can become a public URL. In fact, it is one by default. I will call my <strong>Synapse Workspace<\/strong> <em>MaltaLake<\/em>. You need to create your own custom name for your <strong>Synapse Workspace<\/strong>. Every time I refer to <em>MaltaLake<\/em>, replace it with your workspace name.<\/p>\n<h2>Provisioning a storage account<\/h2>\n<p>The storage account you created in the previous step will be used by <strong>Synapse Analytics<\/strong> itself for things like metadata. To demonstrate a small data lake, you will need another storage account.<\/p>\n<p>During the example, I will call the storage <em>LakeDemo<\/em>. This is the name for my storage, which also needs to be unique. 
You will need to create a unique name for your storage and use it every time you run an example that uses <em>LakeDemo<\/em>, adjusting the scripts by replacing <em>LakeDemo<\/em> with your storage account name.<\/p>\n<p>You can use a PowerShell script to provision the storage account and upload demo files to it. You can <a href=\"https:\/\/bufaloinfo-my.sharepoint.com\/:u:\/g\/personal\/dennes_bufaloinfo_onmicrosoft_com\/EUMDIq1vMHdFu1oWuuzlsx8Bh9QbsRa59vqxu-Av8p-C_g?e=9KN1JY\">download the scripts and demos<\/a> and execute the file <em>Install.bat<\/em>. It will ask you some questions and create the environment on your Azure account.<\/p>\n<p>This is what the script will do for you:<\/p>\n<ul>\n<li>Create a resource group or use the existing resource group you provide<\/li>\n<li>Create a storage account or use an existing storage account you provide<\/li>\n<li>Upload the files to the storage account<\/li>\n<\/ul>\n<p>If you would like, you can also follow these steps to <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/storage\/common\/storage-account-create?tabs=azure-portal&amp;WT.mc_id=DP-MVP-4014132\">provision a new storage account<\/a>. You can still run the script after this, providing the name of the resource group and the storage account. The script will create the containers and upload the files.<\/p>\n<p>Once again, if you would like, you can create the containers and upload the files yourself. You will need two containers for the demonstrations, one private and one with public access: <em>DataLake<\/em> and <em>OpenDataLake<\/em>. If you chose not to use the script, you can use <a href=\"https:\/\/azure.microsoft.com\/en-us\/features\/storage-explorer?WT.mc_id=DP-MVP-4014132\">Azure Storage Explorer<\/a> to upload the files.<\/p>\n<h2>Querying the blob storage data<\/h2>\n<p>Azure provides a nice environment, Synapse Studio, for running queries against your storage. 
Next, I\u2019ll show you how to get started.<\/p>\n<h3>Opening the environment<\/h3>\n<p>Follow these steps to learn the details of Synapse Studio and how you will run these queries.<\/p>\n<ol>\n<li>Open Azure Synapse Studio. You will find it under <em>Getting Started<\/em> on the <em>Overview<\/em> tab of the <em>MaltaLake<\/em> workspace<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91470\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image.png\" alt=\"An image showing how to open Synapse Studio\" width=\"420\" height=\"232\" \/><\/p>\n<p><strong>Synapse Studio<\/strong> may ask you to authenticate again; you can use your Azure account.<\/p>\n<ol start=\"2\">\n<li>Open the <em>Develop<\/em> tab. It\u2019s the 3<sup>rd<\/sup> icon from the top on the left side of the <strong>Synapse Studio<\/strong> window<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91471\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-1.png\" alt=\"An image showing how to find the Develop tab\" width=\"132\" height=\"318\" \/><\/p>\n<ol start=\"3\">\n<li>Create a new SQL Script\n<ol style=\"list-style-type: lower-alpha;\">\n<li>On the <em>Develop<\/em> window, click the <em>\u201c+\u201d<\/em> sign<\/li>\n<li>Click the <em>SQL Script<\/em> item on the menu<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91472\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-2.png\" alt=\"An image showing how to create a new SQL Script\" width=\"439\" height=\"281\" \/><\/p>\n<p>The script will be connected to <em>BuiltIn<\/em>, meaning the <strong>SQL On Demand<\/strong> pool. There are two databases, <em>master<\/em> and <em>default<\/em>. 
You can&#8217;t create objects in either of them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91473\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-3.png\" alt=\"An image showing the SQL Script toolbar\" width=\"912\" height=\"105\" \/><\/p>\n<h3>A side note about scripts<\/h3>\n<p>You can save your scripts in <em>Synapse<\/em>. The button <em>Publish<\/em> on the image above is used to save your scripts. It is recommended that you change the name of your script before publishing because many scripts called <em>SQL Script X<\/em> can become confusing very fast.<\/p>\n<p>To rename a script, you can use the <em>Properties<\/em> window on the right side of the screen. It appears by default; if it\u2019s hidden, the button on the toolbar, directly above where the window should be, will make it appear again.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"364\" height=\"339\" class=\"wp-image-91474\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-4.png\" \/><\/p>\n<p>Synapse Studio also has additional features to manage the scripts. These features are hidden on the <em>\u201c\u2026\u201d<\/em> menu each script type has. Once you click the <em>\u201c\u2026\u201d<\/em>, the following options appear:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"361\" height=\"43\" class=\"wp-image-91475\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-5.png\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91476\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-6.png\" alt=\"An image showing how to create a new folder\" width=\"446\" height=\"137\" \/><\/p>\n<p><strong>New Folder:<\/strong> You can organize scripts in folders under the script type. 
The Script type (<em>\u201cSQL scripts\u201d<\/em>) can have many folders under it.<\/p>\n<p><strong>Import:<\/strong> You can import existing scripts from your machine to <strong>Synapse Studio<\/strong><\/p>\n<p>Besides these features, Synapse Studio also automatically breaks down the content according to the type of the script. For example, the image below shows a <em>Develop<\/em> environment with many kinds of elements.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91477\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-7.png\" alt=\"An image of a develop environment with many types of files\" width=\"367\" height=\"192\" \/><\/p>\n<h3>File Type Support<\/h3>\n<p>Azure Synapse can read two types of files:<\/p>\n<p><strong>PARQUET:<\/strong> A columnar format with defined data types for the columns, very common in Big Data environments<\/p>\n<p><strong>CSV:<\/strong> The classic Comma Separated Values file format, without specified data types for the columns.<\/p>\n<p>These file types can be in their regular format or compressed. The two compressions available are:<\/p>\n<p><strong>Snappy:<\/strong> Usually used with PARQUET<\/p>\n<p><strong>GZIP:<\/strong> Usually used with CSV<\/p>\n<h3>Work-Around for additional types<\/h3>\n<p>You can play around with the CSV configuration, defining the row and column delimiters, to access different file types. The idea is simple: Read each entire file into a varchar(max) field and then use T-SQL features to process that field.<\/p>\n<p>For example, this works for JSON file types. After reading the files, you can process the fields using JSON functions.<\/p>\n<h2>Reading Parquet files<\/h2>\n<p>Copy the following query to the new script window and execute it. 
Don\u2019t forget to change the URL to your storage URL:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\u00a0\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/trips\/',\r\n\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\r\n\u00a0\u00a0)\u00a0AS\u00a0[rides]<\/pre>\n<p>The <code>OPENROWSET<\/code> function allows reading data from blob storage or other external locations. It works only with <strong>SQL On Demand<\/strong> pools; it\u2019s not available with <strong>SQL Dedicated<\/strong> pools yet.<\/p>\n<p>Usually, in data lakes, the data is broken down into many files, and many pieces of data need to be loaded together as a single set. That\u2019s why the <strong>URL<\/strong> points to a folder: <strong>Synapse<\/strong> can read from all the files in the folder as if they were one. You don\u2019t need to handle the data file-by-file; <strong>Synapse<\/strong> will return all the data. Beyond <strong>Synapse<\/strong>, this is a <strong>Big Data<\/strong> concept.<\/p>\n<p>It\u2019s also important that this URL ends with \u201c\/\u201d; otherwise, it may be mistaken for a file. This requirement is not present for all statements that use a URL, but it\u2019s better to always finish it with a \u201c\/\u201d.<\/p>\n<p>Another interesting rule is that Synapse will ignore any file inside the folder whose name starts with \u201c_\u201d. This happens because many data lake tools, such as Azure Data Factory, use these files as metadata files to control the saving and ETL process.<\/p>\n<p>The <code>OPENROWSET<\/code> statement also defines the format as <strong>PARQUET<\/strong>. This format is very common in data lakes. 
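<\/p>\n<p>Because the URL points to a folder, you can also narrow down which files are read and identify where each row came from. The query below is a minimal sketch of this idea, assuming the same <em>trips<\/em> folder: a wildcard limits the read to parquet files, and the <code>filepath()<\/code> function exposed by the <strong>Serverless SQL pool<\/strong> returns the file each row was read from:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select top 10 rides.filepath() as SourceFile, DateId\r\nfrom OPENROWSET(\r\n     BULK 'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/trips\/*.parquet',\r\n     FORMAT='PARQUET'\r\n  ) AS [rides]<\/pre>\n<p>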
Here are some characteristics of the <strong>PARQUET<\/strong> file:<\/p>\n<ul>\n<li>It includes not only column names but also types, although it has its own type system.<\/li>\n<li>It\u2019s a columnar file, similar to <strong>ColumnStore<\/strong> indexes. This means it\u2019s optimized to retrieve only the columns you ask for.<\/li>\n<li>It can be compressed using Snappy compression. The <code>OPENROWSET<\/code> function automatically recognizes whether it\u2019s compressed or not.<\/li>\n<\/ul>\n<p>Try one more query. You can create a new SQL script, or you can copy the query to the same script and select the query before executing it; it\u2019s up to you.<\/p>\n<p>Execute the following query:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0Month(Cast(Cast(DateId\u00a0as\u00a0Varchar)\u00a0as\u00a0Date))\u00a0Month,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0count(*)\u00a0Trips\r\n\u00a0from\u00a0\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/trips\/',\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\r\n\u00a0\u00a0)\u00a0AS\u00a0[rides]\r\ngroup\u00a0by\u00a0Month(Cast(Cast(DateId\u00a0as\u00a0Varchar)\u00a0as\u00a0Date))\r\norder\u00a0by\u00a0Month<\/pre>\n<p>This example groups and counts the taxi rides by month and orders the result. The use of functions over the date is not optimized at all, but still, this query executes in about one second. This query is an example showing more complex calculations and summaries from the data in the blob storage.<\/p>\n<h3>SQL Script execution features<\/h3>\n<p>After executing the previous query, you can use some interesting features to better analyse the data. 
The query result, on the lower part of the window, is a list of months and the taxi trips made during that month.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91478\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-8.png\" alt=\"An image showing the results of running the query\" width=\"366\" height=\"240\" \/><\/p>\n<p>Changing the view to <em>Chart<\/em>, you immediately see a line chart, as in the image below:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91479\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-9.png\" alt=\"An image showing the chart view\" width=\"1005\" height=\"477\" \/><\/p>\n<p>The chart is not correct: the Month and Trips fields are both in the Legend area, but Month should be a Category Column. Changing Month to a Category column will make a big difference in the graphic, as you may note below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91480\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-10.png\" alt=\"An image showing the chart view after correcting the category\" width=\"1008\" height=\"434\" \/><\/p>\n<p>You can also change the type of chart to column, resulting in the image below:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91481\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-11.png\" alt=\"An image showing the column chart type\" width=\"1020\" height=\"431\" \/><\/p>\n<p>The ability to easily query information on blob storage and other sources with a Serverless Pool has many uses. One of them is ad-hoc analysis queries, and the UI feature to view the result in many different ways contributes even more to this.<\/p>\n<h2>Reading CSV files<\/h2>\n<p>Copy the following query to a SQL script and execute it. 
Once again, you can create a new script, or you can use the same script and select the query before executing it.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0\u00a0\u00a0firstrow=2,\r\n\u00a0\u00a0\u00a0\u00a0parser_version='1.0'\r\n\u00a0\u00a0\u00a0\u00a0)\u00a0\u00a0\u00a0AS\u00a0[holidays]\u00a0<\/pre>\n<p>This time, the query is trying to read <strong>CSV<\/strong> files. This example includes some additional information on the <code>OPENROWSET<\/code> function:<\/p>\n<p><strong>Field Terminator:<\/strong> The <strong>CSV<\/strong> format is flexible about the characters used as field terminator, record terminator, and string delimiter. In this case, you are customizing only the field terminator.<\/p>\n<p><strong>First Row:<\/strong> This parameter tells the <code>OPENROWSET<\/code> function that the records start on the 2<sup>nd<\/sup> line. When this happens, it\u2019s usually due to the header of the columns, although not all <strong>CSV<\/strong> files have the column headers included.<\/p>\n<p><strong>Parser Version:<\/strong> There is more than one <strong>CSV<\/strong> parser available in <strong>Synapse<\/strong>. This example specifies parser version 1.0.<\/p>\n<p>After the execution, you will see an error message complaining that it was not possible to identify the schema of the file.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"875\" height=\"115\" class=\"wp-image-91482\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-12.png\" \/><\/p>\n<p>As opposed to the <strong>Parquet<\/strong> format, the <strong>CSV<\/strong> at most has the column names, never the types. 
This means that for the <strong>CSV<\/strong> files, you need to apply a technique called <strong>Schema-On-Read<\/strong>. You will apply the schema now, when reading the file, instead of applying it before, when the file was being saved.<\/p>\n<p>You can specify the schema on the <code>SELECT<\/code> statement:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0\u00a0 BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0\u00a0firstrow=2,\r\n\u00a0\u00a0\u00a0parser_version='1.0'\r\n\u00a0\u00a0\u00a0\u00a0)\r\nWITH\r\n(\u00a0\u00a0\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0country\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0holiday\u00a0\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0varchar(10),\r\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\r\n)\u00a0\u00a0AS\u00a0[holidays]\u00a0<\/pre>\n<p>You may notice the <code>WITH<\/code> statement is considered part of the <code>FROM<\/code> clause, as if the entire <code>OPENROWSET<\/code> function and the <code>WITH<\/code> statement were only the table name. That\u2019s why the table alias, <code>AS [HOLIDAYS]<\/code>, only appears at the end.<\/p>\n<h2>Using Parser 2.0 for CSV files<\/h2>\n<p>Parser 2.0 has some interesting differences. 
Execute the same query as above, but now with parser 2.0:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0\u00a0HEADER_ROW=true,\r\n\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0\u00a0parser_version='2.0'\r\n\u00a0\u00a0\u00a0\u00a0)\r\nWITH\r\n(\u00a0\u00a0\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0country\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0holiday\u00a0\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\r\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\r\n)\u00a0\u00a0AS\u00a0[holidays]\u00a0<\/pre>\n<p>You may notice the syntax has a small difference: instead of using the <code>FirstRow<\/code> parameter, it\u2019s using the <code>Header_Row<\/code> parameter. In summary, the syntax is not the same between parser 1.0 and 2.0.<\/p>\n<p>After the execution, some columns will return only <code>NULL<\/code> values. This is another difference with parser 2.0: it matches the column names you specify against the column names in the file. If they don\u2019t match, the columns you specify are treated as new columns, not present in the file, so they are all filled with <code>NULL<\/code>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91483\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-13.png\" alt=\"An image showing the results with lots of NULLs\" width=\"1045\" height=\"282\" \/><\/p>\n<p>Luckily, parser 2.0 can also auto-detect the types in the CSV file. 
If you just remove the <code>WITH<\/code> clause, you will get back the existing fields in the file.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0OPENROWSET(\r\n\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0HEADER_ROW=true,\r\n\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0parser_version='2.0'\r\n\u00a0\u00a0)\u00a0AS\u00a0[holidays]\u00a0<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91484\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-14.png\" alt=\"An image showing the results when Parser 2.0 automatically detects the columns\" width=\"1501\" height=\"335\" \/><\/p>\n<p>It\u2019s interesting to notice this is basically the same query that failed under parser 1.0 because of the lack of field identification, while 2.0 auto-identifies the fields when possible. Not all CSVs have a header row; in that case, auto-identifying the fields would be impossible.<\/p>\n<p>In summary, this leaves three different ways to read CSV files:<\/p>\n<ul>\n<li>Using Parser 1.0 with schema<\/li>\n<li>Using Parser 2.0 with schema<\/li>\n<li>Using Parser 2.0 auto-detecting fields<\/li>\n<\/ul>\n<h2>Parser differences<\/h2>\n<p>Parser 1.0 is designed to be more feature-complete, supporting more complex scenarios. 
Parser 2.0, on the other hand, is focused on performance and doesn\u2019t have all the features Parser 1.0 has.<\/p>\n<p>My recommendation is to use Parser 2.0 whenever possible and fall back to Parser 1.0 when a feature you need is not available in 2.0.<\/p>\n<p>A small summary about parser 2.0 extracted from <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/sql\/develop-openrowset?WT.mc_id=DP-MVP-4014132\">https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/sql\/develop-openrowset?WT.mc_id=DP-MVP-4014132<\/a>:<\/p>\n<p>\u201cCSV parser version 2.0 specifics:<\/p>\n<ul>\n<li>Not all data types are supported.<\/li>\n<li>Maximum character column length is 8000.<\/li>\n<li>Maximum row size limit is 8 MB.<\/li>\n<li>Following options aren&#8217;t supported: DATA_COMPRESSION.<\/li>\n<li>Quoted empty string (&#8220;&#8221;) is interpreted as empty string.<\/li>\n<li>DATEFORMAT SET option is not honored.<\/li>\n<li>Supported format for DATE data type: YYYY-MM-DD<\/li>\n<li>Supported format for TIME data type: HH:MM:SS[.fractional seconds]<\/li>\n<li>Supported format for DATETIME2 data type: YYYY-MM-DD HH:MM:SS[.fractional seconds]\u201d<\/li>\n<\/ul>\n<h2>Data types<\/h2>\n<p>The field identification makes the query much easier, but it\u2019s important to be sure you are using the correct data types. <strong>CSV<\/strong> files don\u2019t have data types for the fields, so the type identification may not be precise, no matter how much Parser 2.0 has evolved in this area.<\/p>\n<p>You can confirm this by checking the result set returned in detail using the <code>sp_describe_first_result_set<\/code> stored procedure. 
Execute this procedure passing a query as a parameter, like the example below:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">sp_describe_first_result_set\u00a0N'select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0\u00a0OPENROWSET(\r\n\u00a0BULK\u00a0''https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/'',\r\n\u00a0\u00a0\u00a0\u00a0FORMAT=''CSV'',\r\n\u00a0\u00a0\u00a0\u00a0HEADER_ROW=true,\r\n\u00a0\u00a0\u00a0\u00a0FIELDTERMINATOR=''|'',\r\n\u00a0\u00a0\u00a0\u00a0parser_version=''2.0''\r\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[holidays]\u00a0'<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91485\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-15.png\" alt=\"An image showing the data types\" width=\"1516\" height=\"320\" \/><\/p>\n<p>This stored procedure will help with many tasks in a data lake, such as discovering field names and data types. You may notice in the result of this execution that the field <code>isPaidTimeOff<\/code> was recognized as <code>BigInt<\/code> instead of <code>Bit<\/code>.<\/p>\n<p>It\u2019s important to be as precise as possible with the data types when using SQL over a data lake. Small mistakes will not prevent the query from working. The CSV format has no types at all, while the PARQUET format has a type system in which some types have more than one close equivalent in SQL Server, and there are no data type sizes.<\/p>\n<p>Due to that, the query will work even with some mistakes; it\u2019s just a matter of a data type conversion. However, you will pay the price of data type mistakes with slower queries. This is the worst kind of problem: the silent one, where everything seems to be working while it quietly charges you a performance cost.<\/p>\n<p>Even having the auto-detect feature, you can still override it to define better data types. 
You just need to keep the same column names, as in the query below:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\u00a0from\r\n\u00a0OPENROWSET(\r\n\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0\u00a0HEADER_ROW=true,\r\n\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0\u00a0parser_version='2.0'\r\n\u00a0\u00a0\u00a0\u00a0)\r\nWITH\r\n(\u00a0\u00a0\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\r\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\r\n)\u00a0\u00a0AS\u00a0[holidays]\u00a0<\/pre>\n<p>The query above is still not using the best data types possible. I will dig into the data types later.<\/p>\n<p>Once again, you can make groupings and aggregations. 
Run this query to see how many holidays each country has:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0countryOrRegion,\u00a0count(*)\u00a0as\u00a0TotalHolidays\u00a0from\r\n\u00a0 OPENROWSET(\r\n\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\r\n\u00a0\u00a0FORMAT='CSV',\r\n\u00a0\u00a0HEADER_ROW=true,\r\n\u00a0\u00a0FIELDTERMINATOR='|',\r\n\u00a0\u00a0parser_version='2.0'\r\n\u00a0\u00a0\u00a0\u00a0)\r\nWITH\r\n(\u00a0\u00a0\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\r\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0varchar(8000),\r\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\r\n)\u00a0\u00a0AS\u00a0[holidays]\r\ngroup\u00a0by\u00a0countryOrRegion<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91486\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-16.png\" alt=\"An image showing a column chart total holidays by country\" width=\"1052\" height=\"429\" \/><\/p>\n<h2>Querying JSON files<\/h2>\n<p>JSON files are prevalent nowadays, and the need to query them is also very common in a data lake or data lake house.<\/p>\n<p>A JSON file has elements and values separated by commas. It may appear strange, but this allows us to read a JSON file as a CSV file, resulting in a string, and later parse the string as JSON. You use the CSV parser, then process the JSON using the <strong>OPENJSON<\/strong> function introduced in SQL Server 2016. In this way, you can retrieve specific data from the JSON files in your data lake.<\/p>\n<p>The query below performs the first step: reading the JSON file. 
The query uses 0x0b (a character that doesn\u2019t appear in JSON files) as the terminator, so each entire file is read as one row.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0top\u00a010\u00a0*\r\nfrom\u00a0openrowset(\r\n\u00a0\u00a0\u00a0bulk\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/events\/',\r\n\u00a0\u00a0\u00a0format\u00a0=\u00a0'csv',\r\n\u00a0\u00a0\u00a0fieldterminator\u00a0='0x0b',\r\n\u00a0\u00a0\u00a0fieldquote\u00a0=\u00a0'0x0b',\r\n\u00a0\u00a0\u00a0rowterminator\u00a0=\u00a0'0x0b'\u00a0\r\n\u00a0\u00a0\u00a0\u00a0)\u00a0with\u00a0(doc\u00a0nvarchar(max))\u00a0as\u00a0rows<\/pre>\n<p>Each JSON file has a structure that, if broken, makes the JSON invalid. The query above returns each file in the folder as a different record, keeping the JSON structure valid.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-91487\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-17.png\" alt=\"An image showing the returned JSON\" width=\"885\" height=\"102\" \/><\/p>\n<p>After returning the records, you can use the <strong>OPENJSON<\/strong> function released in <strong>SQL Server 2016<\/strong> to process the JSON. 
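<\/p>\n<p>If <code>OPENJSON<\/code> is new to you, here is a minimal, self-contained sketch using a hypothetical JSON literal instead of blob storage; it shows how the function turns the JSON properties under a path into relational columns:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">declare @doc nvarchar(max) =\r\n   N'{\"items\":[{\"id\":1,\"eventName\":\"Summit\",\"eventType\":2}]}';\r\nselect id, eventName, eventType\r\nfrom openjson(@doc, '$.items')\r\n     with (id int, eventName varchar(200), eventType int);<\/pre>\n<p>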
The query below does exactly this to extract some fields from the JSON.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">select\u00a0id,eventName, eventType,title\r\nfrom\u00a0openrowset(\r\n\u00a0\u00a0\u00a0bulk\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/events\/',\r\n\u00a0\u00a0\u00a0format\u00a0=\u00a0'csv',\r\n\u00a0\u00a0\u00a0fieldterminator\u00a0='0x0b',\r\n\u00a0\u00a0\u00a0fieldquote\u00a0=\u00a0'0x0b',\r\n\u00a0\u00a0\u00a0rowterminator\u00a0=\u00a0'0x0b'\u00a0\r\n\u00a0\u00a0\u00a0\u00a0)\u00a0with\u00a0(doc\u00a0nvarchar(max))\u00a0as\u00a0rows\u00a0\u00a0\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0cross\u00a0apply\u00a0openjson\u00a0(doc,'$.items')\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0with\u00a0(\u00a0\u00a0id\u00a0int,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eventName\u00a0varchar(200),\r\n\t\t   title varchar(200),\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0userHasAccess\u00a0bit,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eventType\u00a0int)<\/pre>\n<p>Here are the highlights about the OPENJSON syntax:<\/p>\n<ul>\n<li>The first parameter is a JSON document in string format<\/li>\n<li>Use the cross apply, so you can execute the <strong>OPENJSON<\/strong> for each row retrieved from the previous query<\/li>\n<li>The 2<sup>nd<\/sup> parameter of the <strong>OPENJSON<\/strong> function is a JSON expression applied from the root. The records will be the elements below this expression.<\/li>\n<li>You need to apply a schema to the data, but the field names are automatically matched with json properties. 
You will only need additional expressions if there isn\u2019t a precise match, which doesn\u2019t happen on this query.<\/li>\n<\/ul>\n<h2>How to query blob storage with SQL<\/h2>\n<p>In this first part of the series, you saw the basic concepts about using the <strong>Serverless Pool<\/strong> to query your blob storage. This is just the beginning. Next, I\u2019ll go deeper into authentication and performance, and conclude with how to use these features with <strong>Power BI<\/strong>.<\/p>\n<p><em>If you liked this article, you might also like\u00a0<a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/power-bi-jupyter-notebooks\/\">Embed Power BI in Jupyter Notebooks.<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How do you include blob storage in reports? In this article Dennes Torres explains how to query blob storage with SQL using Azure Synapse.&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[137091,53],"tags":[143586,124952,124953],"coauthors":[6810],"class_list":["post-91469","post","type-post","status-publish","format-standard","hentry","category-azure","category-featured","tag-blob-storage-query","tag-redgate-deploy","tag-synapse"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/91469","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=91469"}],"version-history":[{"count":10,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-js
on\/wp\/v2\/posts\/91469\/revisions"}],"predecessor-version":[{"id":91995,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/91469\/revisions\/91995"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=91469"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=91469"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=91469"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=91469"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}