{"id":98289,"date":"2023-09-27T20:02:13","date_gmt":"2023-09-27T20:02:13","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=98289"},"modified":"2024-09-03T20:15:18","modified_gmt":"2024-09-03T20:15:18","slug":"microsoft-fabric-and-the-delta-tables-secrets","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/microsoft-fabric-and-the-delta-tables-secrets\/","title":{"rendered":"Microsoft Fabric and the Delta Tables Secrets"},"content":{"rendered":"<p>Microsoft Fabric storage uses <strong>OneLake<\/strong> and <strong>Delta Tables<\/strong>, the core storage of all Fabric objects, as explained <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/data-intelligence-on-light-speed-microsoft-fabric\/\">in my introduction to Microsoft Fabric<\/a>.<\/p>\n<p>Whether you use lakehouses, data warehouses, or even datasets, all the data is stored in <strong>OneLake<\/strong>. <strong>OneLake<\/strong> uses Delta Tables as storage. Because of this, we need to be aware of some &#8220;secrets&#8221; related to delta table storage in order to make the best decisions possible for our data platform solution.<\/p>\n<p>Let&#8217;s analyze some of the <strong>Parquet<\/strong> and <strong>Delta Table<\/strong> behaviors.<\/p>\n<h2>Parquet and Delta<\/h2>\n<p>Parquet is an open source format that became famous as one of the most used formats in data lakes. It&#8217;s a columnar storage format intended to store historical data.<\/p>\n<p>One very important &#8220;secret&#8221;, and the main subject of this blog, is this: <strong><em>the Parquet format is immutable<\/em><\/strong>. It&#8217;s not possible to change the content of a Parquet file. It&#8217;s intended for historical data.<\/p>\n<p>Although Parquet is intended for historical data, many people and companies worked to overcome this limitation. 
The result was <strong>Delta Tables<\/strong>.<\/p>\n<p><strong>Delta Tables<\/strong> use the <strong>Parquet<\/strong> format; they are not a different format. The immutable behavior continues. However, <strong>Delta Tables<\/strong> use an additional folder as a transaction log. The operations are registered in the delta logs, marking records with updates and deletes.<\/p>\n<p>In this way, the underlying immutable behavior is still present, but we can work with the data in a transactional way, allowing updates and deletes, for example.<\/p>\n<h2>Time Travel<\/h2>\n<p>The delta logs allow us to perform what&#8217;s called <strong>Time Travel<\/strong>. We can retrieve the information from a delta table the way it was on a specific date and time, as long as the logs are kept complete.<\/p>\n<p>Access to the entire history of data changes is an important resource for data storage.<\/p>\n<h2>Data Modelling<\/h2>\n<p>The need to keep historical data is much older than technologies such as Delta Time Travel, which allow us to keep it. Data Modelling techniques, such as <strong>Data Warehouse<\/strong> modelling, proposed solutions for historical storage a long time ago.<\/p>\n<p>One of the features used for this purpose is called <strong>Slowly Changing Dimensions<\/strong>, or <strong>Dimension Type 2<\/strong>. When we design a star schema, we decide which dimensions should keep an entire history and which ones aren\u2019t worth the trouble, where a simple update on the records is enough.<\/p>\n<p>For example, let&#8217;s consider a dimension called <em>Customer<\/em>. 
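<\/p>
<p>As a minimal sketch of the Type 2 mechanics (plain Python with in-memory rows, purely illustrative; the column names are assumptions, not the article&#8217;s schema): instead of updating a row in place, we close the current version and append a new one.<\/p>

```python
from datetime import date

# Hypothetical in-memory Customer dimension with classic SCD Type 2
# columns: validity dates plus a flag marking the current version.
dim_customer = [
    {"CustomerKey": 1, "Name": "Contoso", "City": "Seattle",
     "ValidFrom": date(2020, 1, 1), "ValidTo": None, "IsCurrent": True},
]

def scd2_update(dim, customer_key, changes, change_date):
    """Close the current version of the record and append a new version."""
    for row in dim:
        if row["CustomerKey"] == customer_key and row["IsCurrent"]:
            row["ValidTo"] = change_date       # close the old version
            row["IsCurrent"] = False
            # the new version copies the old row and applies the changes
            dim.append({**row, **changes, "ValidFrom": change_date,
                        "ValidTo": None, "IsCurrent": True})
            return
    raise KeyError(f"no current row for CustomerKey {customer_key}")

# The customer changes city: history is preserved as a second row.
scd2_update(dim_customer, 1, {"City": "Redmond"}, date(2023, 6, 1))
```

<p>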
If we decide to keep the dimension as an <strong>SCD<\/strong> dimension, every time a customer record is changed in production, we create a new version of the record in the intelligence storage (data warehouse, data lake, whatever the name).<\/p>\n<p>On the other hand, if we decide that a dimension isn&#8217;t worth keeping a history for, we can just update the record in the intelligence storage when needed.<\/p>\n<p>The decision whether to use an <strong>SCD<\/strong> dimension, and many others, are all made during modelling. These decisions are independent of any technical feature capable of keeping the history of the data. The history is designed during modelling and kept by us.<\/p>\n<h2>Choosing Between Time Travel and Data Modelling<\/h2>\n<p>We can use data modelling to control the history of our storage, or we can rely on technical features, such as Time Travel.<\/p>\n<p>This leads to several possibilities with different consequences:<\/p>\n<table style=\"border-style: solid\" border=\"2\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"text-align: center;background-color: #22d679\">\n<p><strong>Approach<\/strong><\/p>\n<\/td>\n<td style=\"text-align: center;background-color: #22d679\">\n<p><strong>Possible Results<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>We can choose to rely on time travel for the entire history storage of our data<\/p>\n<\/td>\n<td>\n<p>This will tie the history of your data solution to a specific technological feature. It also creates the risk of performance issues related to executing updates in a delta table. Let&#8217;s talk more in depth about the technology and leave the decision to you.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>We can choose to rely on the modelling for the entire history and not depend on the time travel feature<\/p>\n<\/td>\n<td>\n<p>This creates additional modelling and ingestion work, plus additional work to avoid performance issues with the delta log. 
However, the work to avoid performance issues with the delta log may be easier than it would be if we relied on the log for history.<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The decision whether we should rely on modelling, on technology, or stay somewhere in the middle is complex enough to generate many books. What&#8217;s important in this blog is to understand that this decision is present when designing an architecture with <strong>Microsoft Fabric<\/strong>.<\/p>\n<p>In order to make a well-informed decision, we need to understand how the delta tables process updates\/deletes.<\/p>\n<h2>Lakehouse Example<\/h2>\n<p>The example uses a table called <em>Fact_Sale<\/em>. You can use a pipeline to import the data from https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata to the files area of a Fabric Lakehouse. The article about <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/microsoft-fabric-lakehouse-and-data-factory-in-power-bi-environment\/\">Lakehouse and ETL<\/a> explains how to build a pipeline to bring data to the <strong>Files<\/strong> area.<\/p>\n<p>The article <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/microsoft-fabric-using-notebooks-and-table-partitioning-to-convert-files-to-tables\/\">https:\/\/www.red-gate.com\/simple-talk\/blogs\/microsoft-fabric-using-notebooks-and-table-partitioning-to-convert-files-to-tables\/<\/a>\u00a0explains this import process and how we can partition the data by Year and Quarter, making a more interesting example. 
The notebook code to import the partitioned data to the <strong>Tables<\/strong> area of the lakehouse is the following:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #008800;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: bold\">pyspark.sql.functions<\/span> <span style=\"color: #008800;font-weight: bold\">import<\/span> col, year, month, quarter\r\ntable_name <span style=\"color: #333333\">=<\/span> <span style=\"background-color: #fff0f0\">'fact_sale'<\/span>\r\ndf <span style=\"color: #333333\">=<\/span> spark<span style=\"color: #333333\">.<\/span>read<span style=\"color: #333333\">.<\/span>format(<span style=\"background-color: #fff0f0\">\"parquet\"<\/span>)<span style=\"color: #333333\">.<\/span>load(<span style=\"background-color: #fff0f0\">'Files\/fact_sale_1y_full'<\/span>)\r\ndf <span style=\"color: #333333\">=<\/span> df<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">'Year'<\/span>, year(col(<span style=\"background-color: #fff0f0\">\"InvoiceDateKey\"<\/span>)))\r\ndf <span style=\"color: #333333\">=<\/span> df<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">'Quarter'<\/span>, quarter(col(<span style=\"background-color: #fff0f0\">\"InvoiceDateKey\"<\/span>)))\r\ndf <span style=\"color: #333333\">=<\/span> df<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">'Month'<\/span>, month(col(<span style=\"background-color: #fff0f0\">\"InvoiceDateKey\"<\/span>)))\r\ndf<span style=\"color: #333333\">.<\/span>write<span style=\"color: #333333\">.<\/span>mode(<span style=\"background-color: #fff0f0\">\"overwrite\"<\/span>)<span style=\"color: #333333\">.<\/span>format(<span 
style=\"background-color: #fff0f0\">\"delta\"<\/span>)<span style=\"color: #333333\">.<\/span>partitionBy(<span style=\"background-color: #fff0f0\">\"Year\"<\/span>,<span style=\"background-color: #fff0f0\">\"Quarter\"<\/span>)<span style=\"color: #333333\">.<\/span>save(<span style=\"background-color: #fff0f0\">\"Tables\/\"<\/span> <span style=\"color: #333333\">+<\/span> table_name)\r\n<\/pre>\n<\/div>\n<p>The default size for a Parquet file in the Fabric environment is 1GB (1073741824 bytes). This is defined by the session configuration <strong>spark.microsoft.delta.optimizeWrite.binSize<\/strong> and is intended to avoid the delta lake small files problem. While small files are a common problem, writing to delta tables can have unwanted consequences when a file is too big. Let&#8217;s analyze this as well.<\/p>\n<p>You can find more about <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/fabric-lakehouse-convert-to-table-feature-and-workspace-level-spark-configuration\/\">Spark configuration on this blog<\/a> and more about the <a href=\"https:\/\/delta.io\/blog\/2023-01-25-delta-lake-small-file-compaction-optimize\/\">small files problem on this link<\/a>.<\/p>\n<p>In our example, this configuration generated a single parquet file for each quarter, which will help us identify what&#8217;s happening during the example.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1195\" height=\"174\" class=\"wp-image-98290\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-close-up-of-a-white-background-description-auto.png\" alt=\"A close up of a white background\n\nDescription automatically generated\" \/><\/p>\n<h2>The Data we Have<\/h2>\n<p>We need to identify the total number of records we have, the total records for each quarter, and the minimum and maximum <em>SaleKey<\/em> value in each quarter in order to build the example.<\/p>\n<p>We can use the SQL Endpoint to run the following queries:<\/p>\n<div><span style=\"font-family: 
Courier New;font-size: 10pt\"> <span style=\"color: blue\">SELECT<\/span>\u00a0<span style=\"color: fuchsia;font-style: italic\">Count<\/span><span style=\"color: maroon\">(<\/span><span style=\"color: silver\">*<\/span><span style=\"color: maroon\">)<\/span>\u00a0<span style=\"color: blue\">AS<\/span>\u00a0<span style=\"color: maroon\">TotalRecords<\/span> <br \/>\n<span style=\"color: blue\">FROM<\/span>\u00a0\u00a0\u00a0<span style=\"color: maroon\">fact_sale<\/span>\u00a0 <\/span><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"329\" height=\"102\" class=\"wp-image-98291\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<div><span style=\"font-family: Courier New;font-size: 10pt\"> <span style=\"color: blue\">SELECT<\/span>\u00a0<span style=\"color: maroon\">quarter<\/span><span style=\"color: silver\">,<\/span> <br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: fuchsia;font-style: italic\">Count<\/span><span style=\"color: maroon\">(<\/span><span style=\"color: silver\">*<\/span><span style=\"color: maroon\">)<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: maroon\">SalesCount<\/span><span style=\"color: silver\">,<\/span> <br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: fuchsia;font-style: italic\">Min<\/span><span style=\"color: maroon\">(<\/span><span style=\"color: maroon\">salekey<\/span><span style=\"color: maroon\">)<\/span>\u00a0<span style=\"color: maroon\">MinKey<\/span><span style=\"color: silver\">,<\/span> <br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: fuchsia;font-style: italic\">Max<\/span><span style=\"color: maroon\">(<\/span><span style=\"color: maroon\">salekey<\/span><span style=\"color: maroon\">)<\/span>\u00a0<span style=\"color: maroon\">MaxKey<\/span> <br \/>\n<span 
style=\"color: blue\">FROM<\/span>\u00a0\u00a0\u00a0<span style=\"color: maroon\">fact_sale<\/span> <br \/>\n<span style=\"color: blue\">GROUP<\/span>\u00a0\u00a0<span style=\"color: blue\">BY<\/span>\u00a0<span style=\"color: maroon\">quarter<\/span>\u00a0 <\/span><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1348\" height=\"181\" class=\"wp-image-98292\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-close-up-of-a-line-description-automatically-ge.png\" alt=\"A close-up of a line\n\nDescription automatically generated\" \/><\/p>\n<h2>Test query and Execution Time<\/h2>\n<p>For test purposes, let&#8217;s use the following query:<\/p>\n<div><span style=\"font-family: Courier New;font-size: 10pt\"> <span style=\"color: blue\">SELECT<\/span>\u00a0<span style=\"color: maroon\">customerkey<\/span><span style=\"color: silver\">,<\/span> <br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: fuchsia;font-style: italic\">Sum<\/span><span style=\"color: maroon\">(<\/span><span style=\"color: maroon\">totalincludingtax<\/span><span style=\"color: maroon\">)<\/span>\u00a0<span style=\"color: maroon\">TotalIncludingTax<\/span> <br \/>\n<span style=\"color: blue\">FROM<\/span>\u00a0\u00a0\u00a0<span style=\"color: maroon\">fact_sale<\/span> <br \/>\n<span style=\"color: blue\">GROUP<\/span>\u00a0\u00a0<span style=\"color: blue\">BY<\/span>\u00a0<span style=\"color: maroon\">customerkey<\/span>\u00a0 <\/span><\/div>\n<p>This query makes a grouping on all our records by <em>CustomerKey<\/em>, across the quarter partitions, creating a total of Sales by customer. 
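<\/p>
<p>The timing procedure behind the tables below can be sketched in plain Python (a hypothetical harness; the real measurements were taken against the SQL Endpoint, not this stand-in function):<\/p>

```python
import statistics
import time

def measure(run_query, executions=10):
    """Run a query function repeatedly and summarize its execution times."""
    timings = []
    for _ in range(executions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {
        "initial": timings[0],          # first run, cold caches
        "minimum": min(timings),
        "maximum": max(timings),
        "average": statistics.mean(timings),
    }

# Stand-in for sending the GROUP BY query to the SQL Endpoint.
def sample_query():
    return sum(range(100_000))

stats = measure(sample_query)
```

<p>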
On each test, I will execute this query 10 times and check the initial, minimum, maximum, and average execution time.<\/p>\n<table style=\"height: 321px;width: 21.9259%;border-style: solid\" border=\"2\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"background-color: #219acf\" colspan=\"2\">\n<p style=\"text-align: center\"><strong>First 10 Executions<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Initial<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>2.639 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Minimum<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>1.746 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Maximum<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>3.32 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Average<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>2.292 seconds<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Updating a single record<\/h2>\n<p>As the first test, let&#8217;s update a single sales record in each quarter. 
We will use a notebook to execute the following code:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #333333\">%%<\/span>sql\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">=<\/span> <span style=\"color: #0000dd;font-weight: bold\">6000000<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">=<\/span> <span style=\"color: #0000dd;font-weight: bold\">15624569<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">=<\/span> <span style=\"color: #0000dd;font-weight: bold\">35270205<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">=<\/span> <span style=\"color: #0000dd;font-weight: bold\">45032105<\/span>;\r\n<\/pre>\n<\/div>\n<p>After executing these updates, if you look at the table&#8217;s <strong>Parquet<\/strong> files, you will notice that each parquet file containing an updated record 
was duplicated.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1192\" height=\"196\" class=\"wp-image-98293\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-screenshot-of-a-computer-description-automatica-2.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>The parquet file continues to be immutable: in order to update a record, the entire file is duplicated with the record updated, and the delta logs register this update.<\/p>\n<p>In our example, we updated 4 records, but each one was in a different parquet file. As a result, all the parquet files were duplicated (one for each quarter).<\/p>\n<p>Remember the default file size is 1GB. A single record update will result in the duplication of a 1GB file. This big file size may have a bad side effect if you use upserts or deletes too much.<\/p>\n<h2>Testing the Execution<\/h2>\n<p>Let\u2019s test the execution of our sample query again and get the numbers after this duplication of parquet files:<\/p>\n<table style=\"height: 422px;width: 43.4282%\" border=\"2\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"background-color: #219acf\" colspan=\"2\">\n<p style=\"text-align: center\"><strong>After Updating 1 record<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Initial<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>6.692 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Minimum<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>1.564 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Maximum<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>3.166 seconds (ignoring initial)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Average<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>2.8955 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Average Ignoring Initial<\/p>\n<\/td>\n<td 
style=\"text-align: right;background-color: #29f0e6\">\n<p>2.4374 seconds<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The initial execution is substantially slower, and the average, even ignoring the initial execution, is slower, but not by much.<\/p>\n<h2>Updating many records<\/h2>\n<p>Let&#8217;s change the script a bit and update a higher volume of records in each quarter. You can execute the script below many times, and each time you execute it you will see the parquet files being duplicated.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #333333\">%%<\/span>sql\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">&lt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">7000000<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">&gt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">14624564<\/span> AND SaleKey <span style=\"color: #333333\">&lt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">18624564<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">&gt;<\/span> <span style=\"color: 
#0000dd;font-weight: bold\">28270201<\/span> AND SaleKey <span style=\"color: #333333\">&lt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">35270201<\/span>;\r\n\r\nupdate fact_sale <span style=\"color: #007020\">set<\/span> TotalIncludingTax<span style=\"color: #333333\">=<\/span>TotalIncludingTax <span style=\"color: #333333\">*<\/span> <span style=\"color: #0000dd;font-weight: bold\">2<\/span>\r\nwhere SaleKey <span style=\"color: #333333\">&gt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">42032102<\/span> AND SaleKey <span style=\"color: #333333\">&lt;<\/span> <span style=\"color: #0000dd;font-weight: bold\">45032102<\/span>;\r\n<\/pre>\n<\/div>\n<h2>Result<\/h2>\n<p>After executing the update script 4 times, we end up with 6 files in each partition: the original file, the file created by the update of a single record, and the 4 files created when updating multiple records.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1146\" height=\"348\" class=\"wp-image-98294\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-screenshot-of-a-computer-description-automatica-3.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>These are the results of the test query execution:<\/p>\n<table style=\"height: 375px;width: 44.0049%\" border=\"2\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"background-color: #219acf\" colspan=\"2\">\n<p style=\"text-align: center\"><strong>After Updating Many Records<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Initial<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>11.894 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Minimum<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>1.553 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Maximum<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>2.286 seconds 
(ignoring initial)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Average<\/p>\n<\/td>\n<td style=\"text-align: right\">\n<p>2.9394 seconds<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"background-color: #29f0e6\">\n<p>Average Ignoring Initial<\/p>\n<\/td>\n<td style=\"text-align: right;background-color: #29f0e6\">\n<p>1.9494 seconds<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The test seems to illustrate that only the initial query is affected, and affected a lot. After the data is in cache, the files in the delta table don\u2019t affect the query, or at least, it seems so.<\/p>\n<p>On one hand, this illustrates the problem. On the other hand, we are talking about a few seconds of difference for a set of 50 million records.<\/p>\n<h2>Cleaning the Table<\/h2>\n<p>The process of cleaning unlinked parquet files from the table is executed by the <strong>VACUUM<\/strong> statement. There are some important points to highlight about this process:<\/p>\n<ul>\n<li>If you decide to manage the data history yourself using data modelling, this needs to be a regular process on tables affected by updates and deletes.<\/li>\n<li>On the other hand, if you decide to use Time Travel to manage history, you can&#8217;t execute this process, otherwise you will lose the time travel capability.<\/li>\n<\/ul>\n<p>This process needs to be executed very carefully. You can&#8217;t delete files while you have some process executing over the data. You need to ensure this will only be executed while you don&#8217;t have anything running over the data.<\/p>\n<p>The default method to ensure this is to only delete files older than one specific time. 
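<\/p>
<p>The selection rule can be simulated in a few lines of Python (a simplified model of the retention check, not the actual Delta implementation): a file is only removable when it is both unreferenced by the delta log and older than the retention window.<\/p>

```python
from datetime import datetime, timedelta

def files_to_vacuum(files, referenced, retention_hours, now):
    """Pick data files that are unreferenced by the delta log AND older
    than the retention window; recent files are kept so running readers
    (and time travel within the window) are not broken."""
    cutoff = now - timedelta(hours=retention_hours)
    return [name for name, modified in files.items()
            if name not in referenced and modified < cutoff]

now = datetime(2023, 9, 27, 12, 0)
files = {
    "part-0001.parquet": now - timedelta(hours=200),  # old, orphaned
    "part-0002.parquet": now - timedelta(hours=2),    # fresh rewrite
    "part-0003.parquet": now - timedelta(hours=300),  # still referenced
}
referenced = {"part-0003.parquet"}

# With the default 168-hour (7-day) window, only the old orphan goes.
doomed = files_to_vacuum(files, referenced, 168, now)  # ["part-0001.parquet"]
```

<p>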
For example, if you want to delete unlinked files younger than 168 hours, you need to activate a special spark session configuration to ensure you are aware of what you are doing.<\/p>\n<p>For this reason, the example below, which activates this configuration and executes <strong>VACUUM<\/strong> with 0 retention, is only for test purposes, not for production scenarios.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\">spark<span style=\"color: #333333\">.<\/span>conf<span style=\"color: #333333\">.<\/span>set(<span style=\"background-color: #fff0f0\">\"spark.databricks.delta.retentionDurationCheck.enabled\"<\/span>, <span style=\"background-color: #fff0f0\">\"false\"<\/span>)\r\n<\/pre>\n<\/div>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #333333\">%%<\/span>sql\r\nVACUUM <span style=\"background-color: #fff0f0\">'Tables\/fact_sale'<\/span> RETAIN <span style=\"color: #0000dd;font-weight: bold\">0<\/span> HOURS\r\n<\/pre>\n<\/div>\n<p>After executing this cleaning, the additional files will disappear and only the most updated ones will remain.<\/p>\n<h2>OneLake is for all<\/h2>\n<p>This process affects not only the <strong>lakehouse<\/strong>, but the data warehouse as well. 
In the lakehouse, the <strong>SQL Endpoint<\/strong> is read-only, but the <strong>Data Warehouse<\/strong> is read-write, with <strong>MERGE<\/strong> operations.<\/p>\n<h2>Conclusion<\/h2>\n<p><strong>Microsoft Fabric<\/strong> is a PaaS environment for the entire Enterprise Data Architecture, capable of enabling the most modern architectures, such as <strong>Data Mesh<\/strong>.<\/p>\n<p>However, we should never lose track of the data concepts, such as the fact that the data intelligence storage is intended to be read-only and for historical purposes. Our architectural decisions may have an impact on the result that may not be so obvious in a PaaS environment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft Fabric storage uses OneLake and Delta Tables, the core storage of all Fabric objects, as explained in my introduction to Microsoft Fabric. Whether you use lakehouses, data warehouses, or even datasets, all the data is stored in OneLake. OneLake uses Delta Tables as storage. 
In this way, we need to be aware&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,159164,159166],"tags":[123646,159034,158998,158997,101611],"coauthors":[6810],"class_list":["post-98289","post","type-post","status-publish","format-standard","hentry","category-blogs","category-microsoft-fabric","category-powerbi","tag-data-lake","tag-delta","tag-lakehouse","tag-microsoft-fabric","tag-power-bi"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98289","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=98289"}],"version-history":[{"count":11,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98289\/revisions"}],"predecessor-version":[{"id":98393,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98289\/revisions\/98393"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=98289"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=98289"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=98289"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=98289"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}