{"id":101692,"date":"2024-03-13T09:30:52","date_gmt":"2024-03-13T09:30:52","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=101692"},"modified":"2024-09-03T20:04:46","modified_gmt":"2024-09-03T20:04:46","slug":"5-secrets-for-dataflows-gen-2","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/5-secrets-for-dataflows-gen-2\/","title":{"rendered":"5 Secrets about Dataflows Gen 2"},"content":{"rendered":"<p>Dataflows Gen 2 are the new version of Power BI dataflows. There are so many changes compared to the previous version that they are considered a new feature.<\/p>\n<p>The main difference is the ability to set a destination for the result of each query in the dataflow. In this way, it can be used as an ingestion tool for lakehouses and warehouses.<\/p>\n<p>Let&#8217;s talk about some additional secrets of this tool.<\/p>\n<h2><strong>1 &#8211; Azure Blob Connector vs Parquet Connector<\/strong><\/h2>\n<p>If you need to read Parquet files stored in Azure Blob Storage, which connector is the best option: the Blob connector or the Parquet connector?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"600\" class=\"wp-image-101693\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-4.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Azure Blob Connector:<\/strong> Lists the files of an Azure blob container.<\/p>\n<p>This connector only accepts the address of the container as the starting connection. If you include any path or file information, you receive an error complaining about URL parameters. 
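<\/p>\n<p>To illustrate the combination this forces on you (a minimal M sketch, assuming you have container-level read permission; <em>AzureStorage.Blobs<\/em> and <em>Parquet.Document<\/em> are standard M functions, but the column names and folder filter below are illustrative), you can list the container and decode each parquet file yourself:<\/p>\n<pre>let\n    \/\/ Connect at container level (the only address the Blob connector accepts)\n    Source = AzureStorage.Blobs(\"https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata\"),\n    \/\/ Keep only the parquet files under the folder we need\n    ParquetFiles = Table.SelectRows(\n        Source,\n        each Text.StartsWith([Name], \"NYCTaxiLocations\/\") and Text.EndsWith([Name], \".parquet\")\n    ),\n    \/\/ Decode each binary blob into a table\n    Decoded = Table.AddColumn(ParquetFiles, \"Data\", each Parquet.Document([Content])),\n    \/\/ Combine everything into a single table\n    Combined = Table.Combine(Decoded[Data])\nin\n    Combined<\/pre>\n<p>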
The error message, by the way, is terrible.<\/p>\n<p>For example, if you consider the address <a href=\"https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata\/NYCTaxiLocations\/\">https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata\/NYCTaxiLocations\/<\/a>, this connector accepts <a href=\"https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata\">https:\/\/azuresynapsestorage.blob.core.windows.net\/sampledata<\/a> and lists the contents of the container, if you have permission. However, the full address will generate an error.<\/p>\n<p>This makes the connector a good choice for looping through files and folders, but beyond that, it may not be the right one.<\/p>\n<p>If you don&#8217;t have access to the root of the container, it becomes impossible to use it to load files.<\/p>\n<p><strong>Parquet Connector:<\/strong> This connector loads a single parquet file from the URL you provide. However, it&#8217;s not capable of loading an entire folder of parquet files.<\/p>\n<p><strong>In Summary:<\/strong> One connector can list the files but not load them, while the other loads them one file at a time. Three years ago, I wrote a blog explaining how <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/power-bi-reading-parquet-from-a-data-lake\/\">Power BI Desktop can read parquet files<\/a>: it builds an M script to load the data. That was already awkward. In Dataflows Gen 2, you need to build the script yourself, because the tool doesn&#8217;t build it for you.<\/p>\n<h2><strong>2 &#8211; Dataflow Staging<\/strong><\/h2>\n<p>When we create a Dataflow Gen 2, a lakehouse and a data warehouse are created in the same workspace. 
These are special lakehouses and data warehouses; we should never touch them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1011\" height=\"230\" class=\"wp-image-101694\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-5.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>By default, every query has its data loaded to staging, and the transformations are done in the staging lakehouse.<\/p>\n<p>There are some situations where we would rather not use the staging:<\/p>\n<ul>\n<li>When the entire processing uses query folding. The processing is done on the source server and has no need for the staging.<\/li>\n<li>When the data is of a reasonable size and the processing can be done entirely in memory.<\/li>\n<\/ul>\n<p>The staging significantly affects the performance of the ETL. Disabling it, when possible, will make the processing faster.<\/p>\n<p>Disabling the staging is simple: right-click the table and click the <strong>Enable Staging<\/strong> option in the context menu.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"296\" height=\"414\" class=\"wp-image-101695\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-6.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Of course, if you disable the staging on an ETL which needs it, the performance will get worse. It&#8217;s a matter of testing.<\/p>\n<h2><strong>3 &#8211; Dataflow Compute<\/strong><\/h2>\n<p>It&#8217;s interesting to notice that the staging is done only with the special lakehouse created for it. The special data warehouse has another purpose: executing what is called a compute operation.<\/p>\n<p>When your resulting table appears with a lightning sign, it means the table needs compute. 
It will be processed using not only the staging, but also the compute area.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"538\" height=\"161\" class=\"wp-image-101696\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-7.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>If you can avoid tables requiring compute, you will improve the ETL performance.<\/p>\n<p>You can avoid the compute requirement in a similar way as on the old dataflows: <em>&#8220;disabling the load&#8221;<\/em>. The only difference is that instead of <em>\u201cload\u201d,<\/em> it is now called <em>\u201cstaging\u201d.<\/em><\/p>\n<p>Considering the example above, the middle tables, <strong><em>dimension_city<\/em><\/strong> and <strong><em>fact_sale<\/em><\/strong>, can have the staging disabled. This will let the table <strong><em>SalesByState<\/em><\/strong> avoid the compute stage.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"508\" height=\"224\" class=\"wp-image-101697\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-8.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>This \u201creduces one level\u201d for the <strong><em>SalesByState<\/em><\/strong> table: instead of using compute, it will now use staging, because the first two tables are processed in memory.<\/p>\n<p>There are some additional options:<\/p>\n<ul>\n<li>We can also disable staging on <strong><em>SalesByState<\/em><\/strong> and process everything in memory.<\/li>\n<li>We can disable staging only on <strong><em>SalesByState<\/em><\/strong> but not on the other two. 
The other two tables would still use staging, but <strong><em>SalesByState<\/em><\/strong> would be processed in memory.<\/li>\n<\/ul>\n<h2><strong>4 &#8211; Data Warehouse Destination Requires Staging<\/strong><\/h2>\n<p>At the moment I&#8217;m writing, Dataflows Gen 2 require staging when using a data warehouse as the destination. On the other hand, if you choose a lakehouse destination, the staging is not needed.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"601\" height=\"185\" class=\"wp-image-101698\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-white-background-with-black-text-description-au.png\" alt=\"A white background with black text\n\nDescription automatically generated\" \/><\/p>\n<p>Unfortunately, this means the same processing will execute faster if the destination is a lakehouse rather than a data warehouse.<\/p>\n<h2><strong>5 &#8211; Workspace to Store the Dataflow<\/strong><\/h2>\n<p>If the dataflow uses a lakehouse as the destination, we can have a workspace only for dataflows, which is a good practice. I mentioned this in <a href=\"https:\/\/www.youtube.com\/watch?v=h8NKvx3aC7s\" target=\"_self\" rel=\"noopener\">Fabric Monday 9<\/a>.<\/p>\n<p>Using a lakehouse as the destination, we can choose the lakehouse from any accessible workspace.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"656\" height=\"720\" class=\"wp-image-101699\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-9.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>However, when the destination is a data warehouse, both need to be saved in the same workspace. 
This is an additional limitation of using a data warehouse as the destination of a Dataflow Gen 2.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"285\" height=\"363\" class=\"wp-image-101700\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/a-screenshot-of-a-computer-description-automatica-10.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dataflows Gen 2 are the new version of Power BI dataflows. There are so many changes in relation to the previous version they are considered a new feature. The main difference is the possibility to set a target for the result of each query in the dataflow. In this way, it can be used as&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,159164],"tags":[123645,4472,158998,158997],"coauthors":[6810],"class_list":["post-101692","post","type-post","status-publish","format-standard","hentry","category-blogs","category-microsoft-fabric","tag-data-warehouse","tag-dataflow","tag-lakehouse","tag-microsoft-fabric"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101692","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=101692"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101692\/revisions"}],"predecessor-version":[{"id":101786,"href
":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101692\/revisions\/101786"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=101692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=101692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=101692"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=101692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}