{"id":92696,"date":"2021-10-21T12:47:53","date_gmt":"2021-10-21T12:47:53","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=92696"},"modified":"2025-06-27T14:03:31","modified_gmt":"2025-06-27T14:03:31","slug":"performance-of-querying-blob-storage-with-sql","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/performance-of-querying-blob-storage-with-sql\/","title":{"rendered":"Azure Synapse Serverless SQL: Querying Blob Storage with OPENROWSET &#8211; Filepath Filtering, Partitioning, Parquet vs CSV, and External Tables"},"content":{"rendered":"<p><b>Azure Synapse Analytics&#8217;s Serverless SQL pool lets you run T-SQL queries directly against files in Azure Blob Storage or Azure Data Lake Storage &#8211; no data movement, no ingest step, pay only for bytes processed per query. The primary access mechanism is OPENROWSET (for ad-hoc queries specifying a path, format, and schema inline) and External Tables (for persistent table-like references to storage paths). Query performance on blob storage depends on three main factors: (1) how much data the query needs to scan &#8211; directly proportional to cost and time, controlled via filepath filtering and partitioning; (2) the file format &#8211; Parquet (columnar, compressed, typed) is dramatically faster than CSV for analytical queries; (3) how the schema is specified &#8211; explicit typed schemas outperform auto-detection. This article walks through the performance techniques: filepath() function for filtering to specific directories, partition-aware layouts, schema definition tradeoffs, Parquet vs CSV benchmarks, CSV parser variants (auto-detect vs legacy vs 2.0), and External Tables vs OPENROWSET comparison. Practical guidance: design your data layout for filtering before you optimise queries &#8211; a query that scans less data is always faster than a query that scans everything efficiently.<\/b><\/p>\n<p><strong>The series so far:<\/strong><\/p>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/query-blob-storage-sql-using-azure-synapse\/\">How to query blob storage with SQL using Azure Synapse<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/query-private-blob-storage-sql\/\">How to query private blob storage with SQL and Azure Synapse<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/performance-of-querying-blob-storage-with-sql\/\">Performance of querying blob storage with SQL<\/a>\u00a0<\/li>\n<\/ol>\n\n<p>In the third part of the series Querying Blob Storage with SQL, I will focus on the performance behaviour of queries: What makes them faster, slower, and some syntax beyond the basics.<\/p>\n<p>The performance tests in this article are repeated, and the best time of the queries is recorded. This doesn\u2019t mean you will always achieve the same timing. Many architectural details will affect the timing, such as cache, first execution, and so on. The timing exposed on each query is only a reference pointing to the differences of the query methods that can affect the time and the usual result for better or worse performance.<\/p>\n<h2>Filtering data by filepath<\/h2>\n<p>Microsoft keeps some large datasets available for anyone to play with them using Synapse and other tools. The examples in this article use one of these datasets, the New York Yellow cab trips from many years.<\/p>\n<h2>Partitioning<\/h2>\n<p>Following Big Data rules, the detail rows about the taxi rides are broken down by year and month in different folders. It\u2019s like partitioning the data.<\/p>\n<p>You can query all folders, or partitions, at once by using wildcards in the query. The query below is a sample query summarizing the rides by year. Note that the query runs in Synapse Studio, which is part of the Synapse Workspace. See the previous articles to set up the workspace if it\u2019s not already set up.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT\n\u00a0\u00a0\u00a0\u00a0Year(tpepPickupDateTime),\u00a0count(*)\u00a0as\u00a0Trips\nFROM\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/nyctlc\/yellow\/puYear=*\/puMonth=*\/*.parquet',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[nyc]\ngroup\u00a0by\u00a0Year(tpepPickupDateTime)<\/pre>\n<p>The path may appear a bit strange, <em>\u2018puYear=\u2019<\/em> and <em>\u2018puMonth=\u2019<\/em> . These are part of the path. Only the wildcard symbol, <em>\u2018*\u2019<\/em>, replaces the year and month.<\/p>\n<p>The source data has a date field, tpepPickupDateTime. This query above uses this field to summarize the years. The Synapse Serverless engine cannot tell that the Year used on the query is the same information hidden by wildcard on the path. That\u2019s why it will read all the records.<\/p>\n<p>Here are the execution results of the query:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1033\" height=\"95\" class=\"wp-image-92697\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/10\/word-image-13.png\" \/><\/p>\n<p><strong>Records:<\/strong> 29<\/p>\n<p><strong>Data Scanned:<\/strong> 19203MB (19GB)<\/p>\n<p><strong>Data Moved:<\/strong> 1 MB<\/p>\n<p><strong>Execution Time:<\/strong> 01:48.676<\/p>\n<p>This query above does not make use of the folder partitioning at all; it\u2019s just reading everything. Synapse Serverless has a function called <code>FilePath<\/code> that can use <code>OPENROWSET<\/code> to retrieve pieces of the path using wildcards. Each wildcard has a number, starting with 1, and the <code>FilePath<\/code> function can retrieve its value.<\/p>\n<h2>FilePath function<\/h2>\n<p>Replacing the year on the previous query by the <code>FilePath<\/code> function, the new query will look like this:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT\n\u00a0\u00a0\u00a0\u00a0nyc.filepath(1)\u00a0as\u00a0Year,\u00a0count(*)\u00a0as\u00a0Trips\nFROM\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/nyctlc\/yellow\/puYear=*\/puMonth=*\/*.parquet',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[nyc]\ngroup\u00a0by\u00a0nyc.filepath(1)<\/pre>\n<p>This is the execution result of the new query:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1023\" height=\"104\" class=\"wp-image-92698\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/10\/word-image-14.png\" \/><\/p>\n<p><strong>Records:<\/strong> 29<\/p>\n<p><strong>Data Scanned:<\/strong> 187 MB<\/p>\n<p><strong>Data Moved:<\/strong> 1 MB<\/p>\n<p><strong>Execution Time:<\/strong> 00:14.939<\/p>\n<p>As you may notice, the first query takes over 1:30 longer than the second one. The second query reads less than half a gigabyte of data to retrieve the same amount of information and records and the exact same amount of content.<\/p>\n<h2>Using views to simplify the syntax<\/h2>\n<p>The <code>OPENROWSET<\/code> syntax can be quite long and complex, and adding the <code>FilePath<\/code> doesn\u2019t make it easier.<\/p>\n<p>The queries need to be easier to write so they can be used for those who will use Power BI or other BI tools, not only for those who will use the Serverless Pool directly to reach the Data Lake.<\/p>\n<p>Views are a great solution to make the environment easier for our users. The Serverless Pool can\u2019t hold tables, but you can create a database to host other objects, such as views.<\/p>\n<p>It\u2019s possible to build many different solutions using views, creating views with pre-defined queries to aggregate the data in different ways.<\/p>\n<p>One of the most flexible ways to use the views in Synapse Serverless is to include the values of the FilePath in the view. By doing so, the users can use this value to create queries in Power BI using the partitioning schema built on the Data Lake. The users can also use these queries to configure aggregations in Power BI, turning a huge model into something possible to analyse in seconds.<\/p>\n<p>The view can be created using the following query, but be sure to switch to the NYTaxi database if you are connected to master:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">Create\u00a0View\u00a0Tripsvw\u00a0AS\nSELECT\n\u00a0\u00a0\u00a0\u00a0vendorID,tpepPickupDateTime,tpepDropOffDateTime,passengerCount,\n\u00a0\u00a0\u00a0\u00a0tripDistance,puLocationId,\u00a0doLocationId,startLon,\n\u00a0\u00a0\u00a0\u00a0startLat,endLon,endLat,rateCodeId,storeAndFwdFlag,\u00a0paymentType,\u00a0fareAmount,\u00a0extra,\u00a0mtaTax,\n\u00a0\u00a0\u00a0\u00a0improvementSurcharge,tipAmount\u00a0,tollsAmount,\u00a0totalAmount,\n\u00a0\u00a0\u00a0\u00a0nyc.filepath(1)\u00a0as\u00a0Year,nyc.filepath(2)\u00a0as\u00a0Month\nFROM\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/nyctlc\/yellow\/puYear=*\/puMonth=*\/*.parquet',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[nyc]<\/pre>\n<p>The preview query using FilePath can be simplified using the view:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT\n\u00a0\u00a0\u00a0\u00a0Year,\u00a0count(*)\u00a0as\u00a0Trips\nFROM\n\u00a0\u00a0\u00a0\u00a0Tripsvw\ngroup\u00a0by\u00a0Year<\/pre>\n<h2>How the schema affects performance<\/h2>\n<p>The previous article of this series demonstrated that the schema can be auto-detected or defined in the query. The method used and the schema defined can make a difference in the query performance.<\/p>\n<p>For a matter of comparison, these different ways to build a query are considered:<\/p>\n<ul>\n<li>Auto-detect the schema<\/li>\n<li>Define a schema with the wrong size for the data types<\/li>\n<li>Define a schema with correct size but wrong string types, including collation<\/li>\n<li>Define a schema with the correct size and type, including collation<\/li>\n<\/ul>\n<p>Besides these options for CSV types, all the options can be tested with both parsers except the auto-detect option, which only works with Parser 2.0:<\/p>\n<ul>\n<li>Parser 1.0<\/li>\n<li>Parser 2.0<\/li>\n<\/ul>\n<p>Some file types, such as CSV, don\u2019t have a schema at all, while other file types, such as PARQUET, don\u2019t have the size of the fields.<\/p>\n<p>Auto-detecting the schema has a name: <strong>Schema-on-Read<\/strong>. You can specify the schema while querying the data instead of the usual RDBMS, where you define the table schema when creating the table.<\/p>\n<p>It\u2019s common, when defining the schema, to make mistakes on the field sizes. Because the source files don\u2019t hold the field sizes, this is hard to define. It\u2019s also easy to make mistakes on the field types. Mistakes between <em>varchar<\/em> and <em>nvarchar<\/em> and integer types are very common.<\/p>\n<p>The examples illustrate the performance implications of any mistake made on the field type when specifying a schema.<\/p>\n<h2>Defining the schema<\/h2>\n<p>Previous articles of this series showed how to use the procedure <code>sp_describe_first_result_set<\/code> to check the schema of a result set. Take a look at the example below:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">sp_describe_first_result_set\u00a0N'select\u00a0top\u00a010\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0''https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/'',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT=''PARQUET''\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[holidays]\u00a0'<\/pre>\n<p>The result is shown in the image below. It contains field sizes but all the field sizes are their maximum size. The source files don\u2019t have the size; it\u2019s filled with the maximum size.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1017\" height=\"305\" class=\"wp-image-92699\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/10\/table-description-automatically-generated-1.png\" alt=\"Table\n\nDescription automatically generated\" \/><\/p>\n<p>Using the maximum size on the schema definition is a usual mistake, exactly because these are the values returned on the schema specification. To discover the real size of the fields, use the query below:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0max(len(countryOrRegion)) AS maxCountryOrRegion, \n    max(len(holidayName)) AS maxHolidayName,\n\u00a0\u00a0\u00a0\u00a0max(len(normalizeHolidayName)) AS maxNormalizedHolidayName,     max(len(countryRegionCode)) AS maxCountryRegionCode\n\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[holidays]<\/pre>\n<p>This is the result:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"803\" height=\"77\" class=\"wp-image-92700\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/10\/word-image-15.png\" \/><\/p>\n<h2>Parquet source files<\/h2>\n<p>The results can be analysed using a PARQUET data source.<\/p>\n<p><strong>Wrong type and size<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\nWITH\n(\u00a0\u00a0\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(8000),\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(8000),\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(8000),\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0varchar(8000),\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\n)\u00a0\u00a0AS\u00a0[holidays]<\/pre>\n<p><strong>Correct size, wrong collation<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select * from\n    OPENROWSET(\n        BULK 'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n        FORMAT='PARQUET'\n    )\nWITH\n(    \n    countryOrRegion varchar(16),\n    holidayName  varchar(170),\n    normalizeHolidayName varchar(170),\n    isPaidTimeOff bit,\n    countryRegionCode char(2),\n    date DATETIME2(7)\n)  AS [holidays]<\/pre>\n<p><strong>Solving collation with Unicode Strings<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select * from\n    OPENROWSET(\n        BULK 'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n        FORMAT='PARQUET'\n    )\nWITH\n(    \n    countryOrRegion nvarchar(16),\n    holidayName  nvarchar(170),\n    normalizeHolidayName nvarchar(170),\n    isPaidTimeOff bit,\n    countryRegionCode nchar(2),\n    date DATETIME2(7)\n)  AS [holidays]<\/pre>\n<p><strong>Auto-detect<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\u00a0\u00a0\u00a0AS\u00a0[holidays]<\/pre>\n<p><strong>Collation specified on string fields<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='PARQUET'\n\u00a0\u00a0\u00a0\u00a0)\nWITH\n(\u00a0\u00a0\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(16)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(170)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(170)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0char(2)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2(7)\n)\u00a0\u00a0AS\u00a0[holidays]<\/pre>\n<p>Here are the results:<\/p>\n<table>\n<tbody>\n<tr>\n<td><strong>Scenario<\/strong><\/td>\n<td><strong>Data scanned<\/strong><\/td>\n<td><strong>Data moved<\/strong><\/td>\n<td><strong>Duration<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>Wrong size and type<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>9.962 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Correct size, wrong collation<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>8.298 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Solving collation with Unicode strings<\/strong><\/td>\n<td>1MB<\/td>\n<td>8MB<\/td>\n<td>13.734 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Auto-detect<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>9.491 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Collation specified on string fields<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>8.755 seconds<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>After all these comparisons, there are some interesting conclusions to make:<\/p>\n<ul>\n<li>Using Unicode string types hide the collation problem when reading strings, but it doubles the size of the string, resulting in a higher data movement and terrible performance.<\/li>\n<li>The auto-detect feature is quite efficient; it\u2019s almost as good as the correct choice of data types.<\/li>\n<li>The difference between the query without collation specification and with collation specification is almost nothing. I would recommend specifying the collation.<\/li>\n<li>The scanned data is smaller than the moved data because the files are compressed.<\/li>\n<\/ul>\n<h2>External Tables<\/h2>\n<p>As you notice, the query with the schema becomes quite complex. Once again, it\u2019s possible to make a solution, so the queries are easier for users.<\/p>\n<p>The previous article of the series demonstrated External Data Sources and creating a database in Synapse Serverless. This example uses the same techniques:<\/p>\n<p>First, create a new external data source:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE\u00a0EXTERNAL\u00a0DATA\u00a0SOURCE\u00a0[msSource]\u00a0WITH\u00a0\n(\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0LOCATION\u00a0=\u00a0'https:\/\/azureopendatastorage.blob.core.windows.net\/'\n);\ngo<\/pre>\n<p>You can create external tables to map the schema to a table. This table doesn\u2019t exist physically on the Serverless Pool; it\u2019s only a schema mapping for the external storage.<\/p>\n<p>The next step is to create two additional objects: An external file that will specify the format of the files on the external storage and the external table, mapping the schema of the external files.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE\u00a0EXTERNAL\u00a0FILE\u00a0FORMAT\u00a0parquetcompressed\u00a0\u00a0\nWITH\u00a0(\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0FORMAT_TYPE\u00a0=\u00a0PARQUET,\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0DATA_COMPRESSION\u00a0=\u00a0'org.apache.hadoop.io.compress.SnappyCodec'\u00a0\u00a0\n);\u00a0\u00a0\nCreate\u00a0external\u00a0table\u00a0[extHolidays]\n(\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(16)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(170)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(170)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0char(2)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2(7)\n)\nWITH\u00a0(LOCATION\u00a0=\u00a0N'holidaydatacontainer\/Processed\/',\u00a0\n\u00a0\u00a0\u00a0\u00a0DATA_SOURCE\u00a0=\u00a0[msSource],\u00a0\n\u00a0\u00a0\u00a0\u00a0FILE_FORMAT\u00a0=\u00a0[parquetcompressed]);\ngo\nselect\u00a0*\u00a0from\u00a0[extHolidays]<\/pre>\n<h2>External Tables: Performance Comparison<\/h2>\n<p>The first limitation you may notice on external tables is the lack of support for the FilePath function. This is a severe limitation because often, the data will be partitioned by folders, like the earlier examples.<\/p>\n<p>Besides this severe difference, compare the performance difference between External Tables and OPENROWSET function.<\/p>\n<p><strong>OPENROWSET method<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select * from\n    OPENROWSET(\n        BULK 'https:\/\/azureopendatastorage.blob.core.windows.net\/holidaydatacontainer\/Processed\/',\n        FORMAT='PARQUET'\n    )\nWITH\n(    \n    countryOrRegion varchar(16) COLLATE Latin1_General_100_BIN2_UTF8,\n    holidayName  varchar(170) COLLATE Latin1_General_100_BIN2_UTF8,\n    normalizeHolidayName varchar(170) COLLATE Latin1_General_100_BIN2_UTF8,\n    isPaidTimeOff bit,\n    countryRegionCode char(2) COLLATE Latin1_General_100_BIN2_UTF8,\n    date DATETIME2(7)\n)  AS [holidays]<\/pre>\n<p><strong>External table method<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0\nfrom\u00a0[extHolidays]<\/pre>\n<p>Here are the results:<\/p>\n<table>\n<tbody>\n<tr>\n<td><strong>Scenario<\/strong><\/td>\n<td><strong>Data scanned<\/strong><\/td>\n<td><strong>Data moved<\/strong><\/td>\n<td><strong>Duration<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>OPENROWSET<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>8.459 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>External table<\/strong><\/td>\n<td>1MB<\/td>\n<td>5MB<\/td>\n<td>8.314 seconds<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You may notice the results are quite similar, allowing us to use external tables to simplify OPENROWSET queries when the FilePath function is not needed.<\/p>\n<h2>CSV File Format: Comparing parser and auto-detect<\/h2>\n<p>When reading data from CSV files, the main difference is the parser used. Parser 2.0 may appear to be a logical choice because of some advantages, such as auto-detecting schema and because the appearance to be a more advanced version.<\/p>\n<p>However, the two versions have different features and limitations, and you will find situations where Parser 1.0 will be needed.<\/p>\n<p>Here\u2019s a comparison of features of the two versions:<\/p>\n<table style=\"width: 599.503px;\">\n<tbody>\n<tr>\n<td style=\"width: 279px;\">Parser 1.0<\/td>\n<td style=\"width: 302.503px;\">Parser 2.0<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 279px;\">\n<ul>\n<li>Following options aren&#8217;t supported: HEADER_ROW.<\/li>\n<li>Default terminators are \\r\\n, \\n and \\r.<\/li>\n<li>If you specify \\n (newline) as the row terminator, it will be automatically prefixed with a \\r (carriage return) character, which results in a row terminator of \\r\\n.<\/li>\n<\/ul>\n<\/td>\n<td style=\"width: 302.503px;\">\n<ul>\n<li>Not all data types are supported.<\/li>\n<li>Maximum character column length is 8000.<\/li>\n<li>Maximum row size limit is 8 MB.<\/li>\n<li>Following options aren&#8217;t supported: DATA_COMPRESSION.<\/li>\n<li>Quoted empty string (&#8220;&#8221;) is interpreted as empty string.<\/li>\n<li>DATEFORMAT SET option is not honored.<\/li>\n<li>Supported format for DATE data type: YYYY-MM-DD<\/li>\n<li>Supported format for TIME data type: HH:MM:SS[.fractional seconds]<\/li>\n<li>Supported format for DATETIME2 data type: YYYY-MM-DD HH:MM:SS[.fractional seconds]<\/li>\n<li>Default terminators are \\r\\n and \\n.<\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 581.503px;\" colspan=\"2\">Reference: https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/sql\/develop-openrowset<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>My personal suggestion: Use Parser 2.0 unless some specific requirement appears for Parser 1.0.<\/p>\n<h2>CSV Parsers and Collation<\/h2>\n<p>There are some differences between how Parsers 1.0 and 2.0 read strings. On our sample data, Parser 1.0 can read the string fields without much difficulty, while Parser 2.0 will require the correct collation specification or Unicode data type, increasing a bit the size of the field.<\/p>\n<p>Synapse makes an automatic conversion from UTF8 collation to the collation used by string fields. In some situations, this automatic conversion may create problems on the data.<\/p>\n<p>Parser 2.0, when used without a collation specification, results in a complaining message about collation, like you may notice on the image below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"965\" height=\"224\" class=\"wp-image-92701\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/10\/text-description-automatically-generated-2.png\" alt=\"Text\n\nDescription automatically generated\" \/><\/p>\n<p>When creating a database in Synapse Serverless, you can specify the collation as UTF8 using the following statement:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE\u00a0DATABASE\u00a0mydb\n\u00a0\u00a0\u00a0\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8;<\/pre>\n<p>Another option is to change the collation in an existing database:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">ALTER\u00a0DATABASE\u00a0mydb\n\u00a0\u00a0\u00a0\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8;<\/pre>\n<p>Besides defining the database collation, another option is to specify the collation of the string fields on the query schema. You need to discover if by doing so, you create some kind of performance difference.<\/p>\n<h2>Comparing the Performance<\/h2>\n<p>Here are some simple performance comparisons with the 4 possible situations:<\/p>\n<ul>\n<li>Using Parser 1.0 with schema<\/li>\n<li>Using Parser 2.0 with schema and collation<\/li>\n<li>Using Parser 2.0 with schema and nvarchar type<\/li>\n<li>Using Parser 2.0 auto-detecting fields<\/li>\n<\/ul>\n<p><strong>Using Parser 1.0 with schema<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='CSV',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0firstrow=2,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0parser_version='1.0'\n\u00a0\u00a0\u00a0\u00a0)\nWITH\n(\u00a0\u00a0\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(16)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(225),\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(225),\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0char(2)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\n)\u00a0\u00a0AS\u00a0[holidays]<\/pre>\n<p><strong>Using Parser 2.0 with schema and collation<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='CSV',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0HEADER_ROW=true,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0parser_version='2.0'\n\u00a0\u00a0\u00a0\u00a0)\nWITH\n(\u00a0\u00a0\u00a0\u00a0\n\u00a0\u00a0\u00a0\u00a0countryOrRegion\u00a0varchar(16)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0holidayName\u00a0\u00a0varchar(500)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0normalizeHolidayName\u00a0varchar(500)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0isPaidTimeOff\u00a0bit,\n\u00a0\u00a0\u00a0\u00a0countryRegionCode\u00a0char(2)\u00a0COLLATE\u00a0Latin1_General_100_BIN2_UTF8,\n\u00a0\u00a0\u00a0\u00a0date\u00a0DATETIME2\n)\u00a0\u00a0AS\u00a0[holidays]<\/pre>\n<p><strong>Using Parser 2.0 with schema and nvarchar type<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">select * from\n    OPENROWSET(\n        BULK 'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\n        FORMAT='CSV',\n        HEADER_ROW=true,\n        FIELDTERMINATOR='|',\n        parser_version='2.0'\n    )\nWITH\n(    \n    countryOrRegion nvarchar(16) COLLATE Latin1_General_100_BIN2_UTF8,\n    holidayName  nvarchar(225),\n    normalizeHolidayName nvarchar(225),\n    isPaidTimeOff bit,\n    countryRegionCode nchar(2) COLLATE Latin1_General_100_BIN2_UTF8,\n    date DATETIME2\n)  AS [holidays]<\/pre>\n<p><strong>Using Parser 2.0 auto-detecting fields<\/strong><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select\u00a0*\u00a0from\n\u00a0\u00a0\u00a0\u00a0OPENROWSET(\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BULK\u00a0'https:\/\/lakedemo.blob.core.windows.net\/opendatalake\/holidays\/',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FORMAT='CSV',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0HEADER_ROW=true,\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0FIELDTERMINATOR='|',\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0parser_version='2.0'\n\u00a0\u00a0\u00a0\u00a0)\u00a0AS\u00a0[holidays]<\/pre>\n<p>Here are the results:<\/p>\n<table>\n<tbody>\n<tr>\n<td><strong>Scenario<\/strong><\/td>\n<td><strong>Data scanned<\/strong><\/td>\n<td><strong>Data moved<\/strong><\/td>\n<td><strong>Duration<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>Using Parser 1.0 with schema<\/strong><\/td>\n<td>8MB<\/td>\n<td>5MB<\/td>\n<td>8.769 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Using Parser 2.0 with schema and collation<\/strong><\/td>\n<td>14MB<\/td>\n<td>6MB<\/td>\n<td>10.050 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Using Parser 2.0 with schema and nvarchar type<\/strong><\/td>\n<td>14MB<\/td>\n<td>9MB<\/td>\n<td>15.422 seconds<\/td>\n<\/tr>\n<tr>\n<td><strong>Using Parser 2.0 auto-detecting fields<\/strong><\/td>\n<td>14MB<\/td>\n<td>6MB<\/td>\n<td>9.664 seconds<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The difference in the varchar fields causes processing overhead for the query. The auto-detect on Parser 2.0 manages this relatively well. Although still complaining about collation, it reaches a performance similar to Parser 1.0.<\/p>\n<h2>Conclusion<\/h2>\n<p>There are many variations about how to query files in a Data Lake. This article gives you some directions about how these variations can impact the query performance.<\/p>\n<h2>Additional Reference<\/h2>\n<p>Collation in Synapse Analytics: <a href=\"https:\/\/techcommunity.microsoft.com\/t5\/azure-synapse-analytics\/always-use-utf-8-collations-to-read-utf-8-text-in-serverless-sql\/ba-p\/1883633\">https:\/\/techcommunity.microsoft.com\/t5\/azure-synapse-analytics\/always-use-utf-8-collations-to-read-utf-8-text-in-serverless-sql\/ba-p\/1883633<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Performance-tuning SQL queries against Azure Blob Storage using Synapse Serverless SQL pool &#8211; filepath filtering with the filepath() function, partition-based filtering, schema definition tradeoffs, Parquet vs CSV format comparisons, and External Tables vs OPENROWSET with benchmarks.&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[137091,53],"tags":[5134],"coauthors":[6810],"class_list":["post-92696","post","type-post","status-publish","format-standard","hentry","category-azure","category-featured","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92696","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=92696"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92696\/revisions"}],"predecessor-version":[{"id":107296,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92696\/revisions\/107296"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=92696"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=92696"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=92696"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=92696"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}