{"id":98274,"date":"2023-09-13T17:00:02","date_gmt":"2023-09-13T17:00:02","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=98274"},"modified":"2024-09-03T20:15:18","modified_gmt":"2024-09-03T20:15:18","slug":"microsoft-fabric-using-notebooks-and-table-partitioning-to-convert-files-to-tables","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/microsoft-fabric-using-notebooks-and-table-partitioning-to-convert-files-to-tables\/","title":{"rendered":"Microsoft Fabric: using Notebooks and Table Partitioning to Convert Files to Tables"},"content":{"rendered":"<p>When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.<\/p>\n<p>As I explained on the article about <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/microsoft-fabric-lakehouse-and-data-factory-in-power-bi-environment\/\">lakehouse and ETL<\/a>, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.<\/p>\n<p>Let\u2019s make a step-by-step on this blog about how to use notebooks and table partitioning.<\/p>\n<p>The steps explained here are based on the same sample data from the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/azure\/microsoft-fabric-lakehouse-and-data-factory-in-power-bi-environment\/\">lakehouse and ETL article<\/a>. In that article you can find the steps to build a pipeline to import the data to the files area.<\/p>\n<h2>The Starting Point<\/h2>\n<p>The starting point for this example is a <strong>lakehouse<\/strong> called <em>demolake<\/em> with the files already imported to the <strong>Files<\/strong> area using a pipeline.<\/p>\n<p>No table is created yet.<\/p>\n<h2>Loading a Notebook<\/h2>\n<p>We will use a notebook for the conversion. You can <a href=\"https:\/\/github.com\/microsoft\/fabric-samples\/tree\/main\/docs-samples\/data-engineering\/Lakehouse%20Tutorial%20Source%20Code\">download the notebook here<\/a>. We will use the notebook called <em>\u201c01 &#8211; Create Delta Tables.ipynb\u201d<\/em><\/p>\n<ol>\n<li>Click the <em>Experience<\/em> button and select the <em>Data Engineering Experience<\/em><\/li>\n<li>Click the <em>Import Notebook<\/em> button.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"944\" height=\"270\" class=\"wp-image-98275\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-word-de.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo, Word\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Click the <em>Upload<\/em> button and select the notebook you just downloaded.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"341\" height=\"211\" class=\"wp-image-98276\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-texto-aplicativo-e.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Texto, Aplicativo, Email\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Click the button <em>lakehouse demo<\/em> in the left button bar to return to the workspace.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"66\" height=\"69\" class=\"wp-image-98277\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-texto-aplicativo-d.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Texto, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Click on the notebook you imported.<\/li>\n<li>On the <em>Lakehouse explorer<\/em> window, click the <em>Add<\/em> button to link the notebook to a lakehouse.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"243\" height=\"547\" class=\"wp-image-98278\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-teams-d.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo, Teams\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>On the <em>Add Lakehouse<\/em> window, select <em>Existing Lakehouse<\/em> and click the <em>Add<\/em> button.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"339\" height=\"240\" class=\"wp-image-98279\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-texto-aplicativo-c.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Texto, Aplicativo, chat ou mensagem de texto\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Select the demolake and click the <em>Add<\/em> button.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"794\" class=\"wp-image-98280\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-texto-aplicativo-e-1.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Texto, Aplicativo, Email\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<h2>The notebook code<\/h2>\n<p>Let\u2019s analyse the code in the notebook we just imported.<\/p>\n<h3>First Code Block<\/h3>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\" class=\"crayon:false\"><span style=\"color: #888888;\"># Copyright (c) Microsoft Corporation.<\/span>\r\n<span style=\"color: #888888;\"># Licensed under the MIT License.<\/span>\r\n\r\nspark<span style=\"color: #333333;\">.<\/span>conf<span style=\"color: #333333;\">.<\/span>set(<span style=\"background-color: #fff0f0;\">\"sprk.sql.parquet.vorder.enabled\"<\/span>, <span style=\"background-color: #fff0f0;\">\"true\"<\/span>)\r\n\r\nspark<span style=\"color: #333333;\">.<\/span>conf<span style=\"color: #333333;\">.<\/span>set(<span style=\"background-color: #fff0f0;\">\"spark.microsoft.delta.optimizeWrite.enabled\"<\/span>, <span style=\"background-color: #fff0f0;\">\"true\"<\/span>)\r\n\r\nspark<span style=\"color: #333333;\">.<\/span>conf<span style=\"color: #333333;\">.<\/span>set(<span style=\"background-color: #fff0f0;\">\"spark.microsoft.delta.optimizeWrite.binSize\"<\/span>, <span style=\"background-color: #fff0f0;\">\"1073741824\"<\/span>)\r\n<\/pre>\n<\/div>\n<p>This code block enables the <strong>V-Order<\/strong> optimization. This is a special optimization for delta tables. You can read more about the <strong>V-Order<\/strong> optimization on this link: <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/delta-optimization-and-v-order?tabs=sparksql\">https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/delta-optimization-and-v-order?tabs=sparksql<\/a><\/p>\n<p>The <strong>optimizeWrite<\/strong> option allow us to define what&#8217;s called <strong>binSize<\/strong>. This configuration defines the size for each parquet file used to store the data. Controlling the size of the files we can avoid too big files or too small files, either of these cases would cause a performance problem.<\/p>\n<p>I would not be surprised if in the future some of these configurations may become a default.<\/p>\n<h3>Second Code Block<\/h3>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\" class=\"crayon:false\"><span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pyspark.sql.functions<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> col, year, month, quarter\r\n\r\ntable_name <span style=\"color: #333333;\">=<\/span> <span style=\"background-color: #fff0f0;\">'fact_sale'<\/span>\r\n\r\ndf <span style=\"color: #333333;\">=<\/span> spark<span style=\"color: #333333;\">.<\/span>read<span style=\"color: #333333;\">.<\/span>format(<span style=\"background-color: #fff0f0;\">\"parquet\"<\/span>)<span style=\"color: #333333;\">.<\/span>load(<span style=\"background-color: #fff0f0;\">'Files\/fact_sale_1y_full'<\/span>)\r\n\r\ndf <span style=\"color: #333333;\">=<\/span> df<span style=\"color: #333333;\">.<\/span>withColumn(<span style=\"background-color: #fff0f0;\">'Year'<\/span>, year(col(<span style=\"background-color: #fff0f0;\">\"InvoiceDateKey\"<\/span>)))\r\n\r\ndf <span style=\"color: #333333;\">=<\/span> df<span style=\"color: #333333;\">.<\/span>withColumn(<span style=\"background-color: #fff0f0;\">'Quarter'<\/span>, quarter(col(<span style=\"background-color: #fff0f0;\">\"InvoiceDateKey\"<\/span>)))\r\n\r\ndf <span style=\"color: #333333;\">=<\/span> df<span style=\"color: #333333;\">.<\/span>withColumn(<span style=\"background-color: #fff0f0;\">'Month'<\/span>, month(col(<span style=\"background-color: #fff0f0;\">\"InvoiceDateKey\"<\/span>)))\r\n\r\ndf<span style=\"color: #333333;\">.<\/span>write<span style=\"color: #333333;\">.<\/span>mode(<span style=\"background-color: #fff0f0;\">\"overwrite\"<\/span>)<span style=\"color: #333333;\">.<\/span>format(<span style=\"background-color: #fff0f0;\">\"delta\"<\/span>)<span style=\"color: #333333;\">.<\/span>partitionBy(<span style=\"background-color: #fff0f0;\">\"Year\"<\/span>,<span style=\"background-color: #fff0f0;\">\"Quarter\"<\/span>)<span style=\"color: #333333;\">.<\/span>save(<span style=\"background-color: #fff0f0;\">\"Tables\/\"<\/span> <span style=\"color: #333333;\">+<\/span> table_name)\r\n<\/pre>\n<\/div>\n<p>This block executes the following tasks:<\/p>\n<ul>\n<li>Read the <em>fact_sale <\/em>table from the <strong>Files<\/strong> folder.<\/li>\n<li>Create a column called <em>Year.<\/em><\/li>\n<li>Create a column called <em>Quarter<\/em>.<\/li>\n<li>Create a column called <em>Month<\/em>.<\/li>\n<li>Save the data in the <strong>Tables<\/strong> area. The table is partitioned by <em>Year<\/em> and <em>Quarter,<\/em> and it will overwrite any existing table with the same name.<\/li>\n<\/ul>\n<h3>Third Code Block<\/h3>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\" class=\"crayon:false\"><span class=\"crayon:false\" style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pyspark.sql.types<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #333333;\">*<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">def<\/span> <span style=\"color: #0066bb; font-weight: bold;\">loadFullDataFromSource<\/span>(table_name):\r\n    df <span style=\"color: #333333;\">=<\/span> spark<span style=\"color: #333333;\">.<\/span>read<span style=\"color: #333333;\">.<\/span>format(<span style=\"background-color: #fff0f0;\">\"parquet\"<\/span>)<span style=\"color: #333333;\">.<\/span>load(<span style=\"background-color: #fff0f0;\">'Files\/'<\/span> <span style=\"color: #333333;\">+<\/span> table_name)\r\n    df<span style=\"color: #333333;\">.<\/span>write<span style=\"color: #333333;\">.<\/span>mode(<span style=\"background-color: #fff0f0;\">\"overwrite\"<\/span>)<span style=\"color: #333333;\">.<\/span>format(<span style=\"background-color: #fff0f0;\">\"delta\"<\/span>)<span style=\"color: #333333;\">.<\/span>save(<span style=\"background-color: #fff0f0;\">\"Tables\/\"<\/span> <span style=\"color: #333333;\">+<\/span> table_name)\r\n\r\nfull_tables <span style=\"color: #333333;\">=<\/span> [\r\n\r\n<span style=\"background-color: #fff0f0;\">'dimension_city'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'dimension_date'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'dimension_employee'<\/span>,\r\n<span style=\"background-color: #fff0f0;\">'dimension_stock_item'<\/span>\r\n ]\r\n\r\n<span style=\"color: #008800; font-weight: bold;\">for<\/span> table <span style=\"color: #000000; font-weight: bold;\">in<\/span> full_tables:\r\n    loadFullDataFromSource(table)\r\n<\/pre>\n<\/div>\n<p>This block has a function called <em>loadFullDataFromSource<\/em>. The function is defined in the beginning of the code, but it will only be executed when called.<\/p>\n<p>The execution itself starts on the <em>full_tables<\/em> variable definition. This variable is an array with the name of the dimensions, which are also the name of the folders under the <strong>Files<\/strong> area in the lakehouse.<\/p>\n<p>A <strong>FOR<\/strong> loop is executed over the table variable. For each dimension, the function <em>loadFullDataFromSource<\/em> is called receiving the name of the dimension as a parameter.<\/p>\n<p>The function loads the data from the <strong>Files<\/strong> area and saves it to the <strong>Tables<\/strong> area using the table name as parameter. As a result, the function will load all the dimensions, one by one.<\/p>\n<h2>Executing the Notebook and checking the result<\/h2>\n<ol>\n<li>Click the Run All button to execute all the code blocks in the notebook.<\/li>\n<\/ol>\n<p><strong>Microsoft Fabric<\/strong> uses a feature called <strong>Live Pool<\/strong>. We don&#8217;t need to configure a spark pool for the notebook execution. Every workspace linked with <strong>Fabric<\/strong> artifacts has the <strong>Live Pool<\/strong> enabled. Once we ask to execute a notebook or a single code block, the pool is enabled in very few seconds.<\/p>\n<ol>\n<li>On the lakehouse explorer, right click <strong>Tables<\/strong> and click <em>Refresh<\/em>.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"355\" height=\"208\" class=\"wp-image-98281\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-descrica.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>The tables are now available in the <strong>Tables<\/strong> area.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"307\" height=\"312\" class=\"wp-image-98282\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-descrica-1.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>On the left bar, click the demolake button.<\/li>\n<\/ol>\n<p>Let&#8217;s look on the storage behind the tables in the lakehouse.<\/p>\n<ol>\n<li>Right click the <em>fact_sale<\/em> table<\/li>\n<li>On the context menu, click View files menu item.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"571\" height=\"593\" class=\"wp-image-98283\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-descrica-2.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<p>Mind how the files are stored in delta format and partitioned by <em>Year<\/em> and <em>Quarter<\/em> according to the configuration defined in the notebook.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"260\" height=\"181\" class=\"wp-image-98284\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-descrica-3.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Click the <em>Year<\/em> folder.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"337\" height=\"285\" class=\"wp-image-98285\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-aplicativo-descrica-4.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<ol>\n<li>Click one of the <em>Quarters<\/em> folder.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"562\" height=\"164\" class=\"wp-image-98286\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/interface-grafica-do-usuario-texto-aplicativo-d-1.png\" alt=\"Interface gr\u00e1fica do usu\u00e1rio, Texto, Aplicativo\n\nDescri\u00e7\u00e3o gerada automaticamente\" \/><\/p>\n<h2>Conclusion<\/h2>\n<p>On this blog you discovered how you can make the transformation from <strong>Files<\/strong> to <strong>Tables<\/strong> in a <strong>lakehouse<\/strong> using <strong>pyspark<\/strong> code and in this way be able to partition tables and schedule the code to make this movement.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion. As I explained on the article about lakehouse and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,159164,159166],"tags":[123646,123648,158998,158997,159035,101611],"coauthors":[6810],"class_list":["post-98274","post","type-post","status-publish","format-standard","hentry","category-blogs","category-microsoft-fabric","category-powerbi","tag-data-lake","tag-data-platform","tag-lakehouse","tag-microsoft-fabric","tag-notebook","tag-power-bi"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98274","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=98274"}],"version-history":[{"count":3,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98274\/revisions"}],"predecessor-version":[{"id":98354,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98274\/revisions\/98354"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=98274"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=98274"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=98274"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=98274"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}