{"id":101726,"date":"2024-04-03T17:00:04","date_gmt":"2024-04-03T17:00:04","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=101726"},"modified":"2024-09-03T20:04:46","modified_gmt":"2024-09-03T20:04:46","slug":"dynamic-partitioning-and-a-simple-incremental-load","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/dynamic-partitioning-and-a-simple-incremental-load\/","title":{"rendered":"Dynamic Partitioning and a Simple Incremental Load"},"content":{"rendered":"<p>Let&#8217;s consider a simple statement for partitioning and save a table in a lakehouse:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\">df<span style=\"color: #333333\">.<\/span>write<span style=\"color: #333333\">.<\/span>mode(<span style=\"background-color: #fff0f0\">\"overwrite\"<\/span>)<span style=\"color: #333333\">.<\/span>format(<span style=\"background-color: #fff0f0\">\"delta\"<\/span>)<span style=\"color: #333333\">.<\/span>partitionBy(<span style=\"background-color: #fff0f0\">\"Year\"<\/span>,<span style=\"background-color: #fff0f0\">\"Month\"<\/span>,<span style=\"background-color: #fff0f0\">\"Day\"<\/span>)<span style=\"color: #333333\">.<\/span>save(<span style=\"background-color: #fff0f0\">\"Tables\/\"<\/span> <span style=\"color: #333333\">+<\/span> table_name)\r\n<\/pre>\n<\/div>\n<p>Let&#8217;s consider we load the data daily, with all the transactions from the day. The table will save the transactions for each day in different partitions. We can expect the table to keep the partitions from previous day, months and years to be kept, achieving a kind of incremental load, right?<\/p>\n<p>Wrong.<\/p>\n<p>The files are neither, overwritten or deleted, but they are removed from the delta log. The records are not directly acessible in the delta table any more.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"894\" height=\"395\" class=\"wp-image-101727\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-blue-rectangular-with-white-text-description-au.png\" alt=\"A blue rectangular with white text\n\nDescription automatically generated\" \/><\/p>\n<p>The table removes the previous partitions from the Delta Log<\/p>\n<p>There are two possible solutions: If you load the data once a day,\u00a0 you can replace the <em>&#8220;overwrite&#8221;<\/em> by <em>&#8220;append&#8221;<\/em>. This solution only works if each ingestion never brings a record already loaded before.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\">df<span style=\"color: #333333\">.<\/span>write<span style=\"color: #333333\">.<\/span>mode(<span style=\"background-color: #fff0f0\">\"append\"<\/span>)<span style=\"color: #333333\">.<\/span>format(<span style=\"background-color: #fff0f0\">\"delta\"<\/span>)<span style=\"color: #333333\">.<\/span>partitionBy(<span style=\"background-color: #fff0f0\">\"Year\"<\/span>,<span style=\"background-color: #fff0f0\">\"Month\"<\/span>,<span style=\"background-color: #fff0f0\">\"Day\"<\/span>)<span style=\"color: #333333\">.<\/span>save(<span style=\"background-color: #fff0f0\">\"Tables\/\"<\/span> <span style=\"color: #333333\">+<\/span> table_name)\r\n<\/pre>\n<\/div>\n<p>However, if we load the data multiple times a day, bringing all the transactions from the day and containing records which are possibly already loaded in the warehouse, the <em>&#8220;append&#8221;<\/em> mode doesn&#8217;t work, because it will create duplicated records.<\/p>\n<p>One of the solutions is to overwrite specific partitions instead of all of them. In this case, overwrite the &#8220;Day&#8221; partition.<\/p>\n<p>The Spark configuration called <em>&#8220;partitionOverwriteMode&#8221;<\/em> can have the values <em>static<\/em> or <em>dynamic<\/em>. The default value is <em>static<\/em>. The table removes all old partitions from the delta log by default. When we set this configuration to <em>dynamic<\/em>,\u00a0 the table keeps the old partitions. It overwrites the ones coming in the dataframe.<\/p>\n<p>After we set this configuration to <em>&#8220;dynamic&#8221;, <\/em>we can execute the same original statement to ensure the table will keep the old partitions:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\">spark<span style=\"color: #333333\">.<\/span>conf<span style=\"color: #333333\">.<\/span>set(<span style=\"background-color: #fff0f0\">\"spark.sql.sources.partitionOverwriteMode\"<\/span>, <span style=\"background-color: #fff0f0\">\"dynamic\"<\/span>)\r\ndf<span style=\"color: #333333\">.<\/span>write<span style=\"color: #333333\">.<\/span>mode(<span style=\"background-color: #fff0f0\">\"overwrite\"<\/span>)<span style=\"color: #333333\">.<\/span>format(<span style=\"background-color: #fff0f0\">\"delta\"<\/span>)<span style=\"color: #333333\">.<\/span>partitionBy(<span style=\"background-color: #fff0f0\">\"Year\"<\/span>,<span style=\"background-color: #fff0f0\">\"Month\"<\/span>,<span style=\"background-color: #fff0f0\">\"Day\"<\/span>)<span style=\"color: #333333\">.<\/span>save(<span style=\"background-color: #fff0f0\">\"Tables\/\"<\/span> <span style=\"color: #333333\">+<\/span> table_name)\r\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"400\" class=\"wp-image-101728\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-blue-rectangular-with-white-text-description-au-1.png\" alt=\"A blue rectangular with white text\n\nDescription automatically generated\" \/><\/p>\n<p>The table keeps the existing partitions, instead of overwritting them.<\/p>\n<h2>Summary<\/h2>\n<p>Using dynamic partitioning and defining the partition key correctly is one simple and fast way to achieve an incremental load. Of course, this is only one among other options we will talk about later.<\/p>\n<p>You may also would like to take a look on <a href=\"https:\/\/www.youtube.com\/watch?v=KG53rX0zUJ4\" target=\"_self\" rel=\"noopener\">how to set the configuration as default using environments<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s consider a simple statement for partitioning and save a table in a lakehouse: df.write.mode(&#8220;overwrite&#8221;).format(&#8220;delta&#8221;).partitionBy(&#8220;Year&#8221;,&#8221;Month&#8221;,&#8221;Day&#8221;).save(&#8220;Tables\/&#8221; + table_name) Let&#8217;s consider we load the data daily, with all the transactions from the day. The table will save the transactions for each day in different partitions. We can expect the table to keep the partitions from previous day,&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,159164],"tags":[158998,158997],"coauthors":[6810],"class_list":["post-101726","post","type-post","status-publish","format-standard","hentry","category-blogs","category-microsoft-fabric","tag-lakehouse","tag-microsoft-fabric"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101726","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=101726"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101726\/revisions"}],"predecessor-version":[{"id":102026,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101726\/revisions\/102026"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=101726"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=101726"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=101726"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=101726"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}