{"id":102229,"date":"2024-06-10T21:25:26","date_gmt":"2024-06-10T21:25:26","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=102229"},"modified":"2025-06-27T14:55:49","modified_gmt":"2025-06-27T14:55:49","slug":"choosing-between-the-lakehouse-and-warehouse-in-microsoft-fabric","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/choosing-between-the-lakehouse-and-warehouse-in-microsoft-fabric\/","title":{"rendered":"Choosing Between the Lakehouse and Warehouse in Microsoft Fabric"},"content":{"rendered":"<p><a href=\"https:\/\/www.microsoft.com\/en-us\/microsoft-fabric\">Microsoft Fabric<\/a> is a new centralized, AI-powered cloud data platform hosted by Microsoft. It combines several services of the existing Azure Data Platform \u2013 such as Azure Data Factory \u2013 with Power BI, while also introducing new services. It\u2019s tempting to compare Microsoft Fabric with <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/synapse-analytics\">Azure Synapse Analytics<\/a> since Synapse also bundles different services together, but Fabric is so much more. In contrast with Synapse, Fabric is a software-as-a-service (SaaS) platform providing you with many low-code to no-code data services.<\/p>\n<p>This doesn\u2019t mean no code has to be written. On the contrary, in this article we\u2019re going to focus on two services of Fabric: <strong>the lakehouse<\/strong> and <strong>the warehouse<\/strong>. The first one is part of the Data Engineering experience in Fabric, while the latter is part of the Data Warehousing experience. Both require code to be written to create any sort of artefact. In the warehouse we can use T-SQL to create tables, load data into them and do any kind of transformation. 
In the lakehouse, we use notebooks to work with data, typically in languages such as PySpark or Spark SQL.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"641\" height=\"294\" class=\"wp-image-102230\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-56.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 1: high-level overview of the services in Microsoft Fabric<\/p>\n<p><em>ALT: a diagram displaying all available services in Fabric, with the Onelake storage layer beneath them<\/em><\/p>\n<p>For an overview of all the available services\/experiences in Fabric, check out the article <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/7792\/microsoft-fabric-personas-filter-options-and-features\/?utm_source=KoenVerbeeck\">Microsoft Fabric Personas<\/a>.<\/p>\n<p>If we can create tables and work with data in both the lakehouse and the warehouse, how do we choose between the two? When would you use one or the other? The answer is \u2013 as always \u2013 <strong>it depends<\/strong>. In this article, we\u2019re going to present you with the advantages and disadvantages of each service and everything else you might take into consideration to make an informed decision about which service might be best suited for your use case.<\/p>\n<h2>Microsoft Fabric behind the Scenes<\/h2>\n<p>Before we start, let\u2019s first take a look at how Microsoft Fabric works. For example, how is data stored in Microsoft Fabric? The answer is quite simple: <strong>everything is stored in the OneLake storage layer<\/strong>. This is a logical data lake built on top of <em>Azure Data Lake Storage Gen2<\/em>. It acts as a wrapper around this storage. Microsoft commonly describes OneLake as <em>\u201cthe OneDrive for your data\u201d<\/em>. 
This means that in reality you might be using one or more Azure Data Lake storage accounts depending on your set-up and which regions you use, but you will perceive it as one single storage. As mentioned before, Fabric tries to centralize as much as it can. A benefit of this is that one service can read data stored by another service. For example, if you create a table in a lakehouse, you can query this table from a warehouse. This is known as the <em>OneCopy<\/em> feature.<\/p>\n<p>Another interesting fact about storage is that all tables you create in the warehouse and in the lakehouse are <strong>delta tables<\/strong>. <a href=\"https:\/\/delta.io\/\">Delta<\/a> is an open-source format that applies a transaction log on top of Parquet files. In other words, all table data is stored in Parquet files (these are well-compressed files where data is stored using a columnstore format). However, Parquet files are immutable, which means they cannot be updated after they are written to storage. Every time you insert, update or delete a record in a table, a new Parquet file has to be written out. The delta format manages a transaction log alongside all those Parquet files, guaranteeing ACID compliance just like in relational databases. In short, thanks to the delta format, we can use Parquet files to store our data but it all works similar to a regular relational database such as SQL Server.<\/p>\n<p>When we look at how a table is stored in OneLake, we can see two folders:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"441\" height=\"231\" class=\"wp-image-102231\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-57.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 2: Delta table with transaction log and Parquet files<\/p>\n<p><em>ALT: delta table in the storage layer. 
One folder for the transaction log, one folder for the Parquet files<\/em><\/p>\n<p>In the <em>_delta_log<\/em> folder we can see the whole transaction log of the table:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"269\" height=\"229\" class=\"wp-image-102232\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-58.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 3: Transaction Log of the Delta Table<\/p>\n<p><em>ALT: transaction log of a delta table, showing 8 different transactions<\/em><\/p>\n<p>In the other folder, we can see a whole bunch of Parquet files. They contain the data of the table at various points in time (new files can be created after each transaction).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"966\" height=\"385\" class=\"wp-image-102233\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-59.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 4: The different Parquet files of a Delta Table<\/p>\n<p><em>ALT: a folder with a bunch of Parquet files. Displayed is also a deletion vector file.<\/em><\/p>\n<p>The folder also contains a <a href=\"https:\/\/docs.delta.io\/latest\/delta-deletion-vectors.html\">deletion vector<\/a>, which is an optimization technique to handle deletes more efficiently.<\/p>\n<p>Now we know how data is stored, but how is <strong>the compute<\/strong> managed in Fabric? The driving force behind everything in Fabric is <strong>the capacities<\/strong>. A capacity is a compute resource that comes in different sizes. The lowest capacity is an F2, the next one is F4, then F8 all the way up to F2048. Every time something wants to use compute, the capacity is used. 
It doesn\u2019t matter if it is a warehouse, a data factory pipeline, or a real-time stream: it all uses the same capacity.<\/p>\n<p>In the <a href=\"https:\/\/app.powerbi.com\">Microsoft Fabric portal<\/a> (which is the same as the one used for Power BI), you can create workspaces and you can assign a capacity to that workspace.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"873\" class=\"wp-image-102234\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-survey-description-aut.png\" alt=\"A screenshot of a computer survey\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 5: Assigning a Fabric capacity to a workspace<\/p>\n<p><em>ALT: the list of options when assigning a capacity to a workspace<\/em><\/p>\n<p>At the time of writing, there\u2019s also a <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/get-started\/fabric-trial\">free trial of Fabric<\/a> available. This trial capacity is equivalent to an F64. There are three methods of purchasing Microsoft Fabric:<\/p>\n<ul>\n<li>You provision an F-SKU through the Azure Portal. This capacity can be paused. You will be billed for the time the capacity is running. The capacity can be scaled up and down when needed. This is <strong>pay-as-you-go<\/strong> pricing. You can learn more about this option in the article <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/7754\/microsoft-fabric-capacities-and-cost-structure\/?utm_source=KoenVerbeeck\">What are Capacities in Microsoft Fabric?<\/a>.<\/li>\n<li>You can also buy a <strong>reserved capacity<\/strong>. You pay for a reserved number of CU (capacity units) up front. Say, for example, that you buy an F4 with reserved pricing for the whole year, but the capacity you\u2019ve actually been running the whole time is an F8. This means that 4 capacity units will be subtracted from the total of 8, so you only pay for 4 capacity units with pay-as-you-go pricing. 
If you don\u2019t scale up and only use the F4 for the whole year, you\u2019d pay nothing extra. So think of the reserved capacity units as some sort of \u201cvoucher\u201d that you can use to get capacity units at a lower cost. More info can be found in <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cost-management-billing\/reservations\/fabric-capacity\">the documentation<\/a>.<\/li>\n<li>The last option is to buy <strong>Power BI Premium<\/strong>. With Power BI Premium, you get a fixed capacity that you can assign to workspaces, and you also get all of the Fabric capabilities with it. A P1 Premium capacity is equivalent to an F64, a P2 to an F128, and so on.<\/li>\n<\/ul>\n<p>The following table gives an overview of the SKUs and indicative prices for the pay-as-you-go options:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"398\" height=\"365\" class=\"wp-image-102235\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-table-with-numbers-and-a-few-black-text-descrip.png\" alt=\"A table with numbers and a few black text\n\nDescription automatically generated with medium confidence\" \/><\/p>\n<p>Figure 6: Price overview for Fabric<\/p>\n<p><em>ALT: a table with the capacities from F2 till F2048 along with prices<\/em><\/p>\n<p>A P1 Premium capacity is around $5,000, so it is considerably cheaper than an F64 if you were to keep it running all the time. Only consider the pay-as-you-go option if you can pause the capacity for longer periods of time.<\/p>\n<p>To conclude this section: every compute service in Microsoft Fabric uses the OneLake storage layer to store data (and in the case of the warehouse and the lakehouse, both use delta tables), and they also use the same capacity for compute resources. 
So if both compute and storage are the same, what is the difference between the lakehouse and the warehouse?<\/p>\n<h2>The Warehouse versus the Lakehouse<\/h2>\n<h3>The Warehouse<\/h3>\n<p>Even though the storage layer is completely different from SQL Server, a lot of effort has been made to give the warehouse the look and feel of the good old trustworthy relational database. When you create a warehouse in Fabric, you\u2019ll be taken to the following screen:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1253\" height=\"714\" class=\"wp-image-102236\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-60.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 7: Warehouse Editor in Fabric<\/p>\n<p><em>ALT: the Fabric warehouse editor in the browser. On the left there\u2019s an explorer, on the right a data preview pane.<\/em><\/p>\n<p>You can see familiar objects from SQL Server present: the <em>dbo<\/em> schema, tables, views, functions, and stored procedures. Working with the warehouse in Fabric is very similar to \u201ctraditional data warehousing\u201d using a database like SQL Server or Azure SQL Database. Once data has been loaded into tables, you can write (mostly) the same T-SQL we\u2019ve been writing for the past years\/decades.<\/p>\n<p>However, not everything is supported (yet). For example, the TRUNCATE TABLE statement isn\u2019t available, and neither are identity columns, temporary tables, or the MERGE statement. Hopefully many limitations will be lifted in the future. You can find an overview of the current limitations in <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/tsql-surface-area\">the documentation<\/a>. Primary keys, foreign keys and unique keys are supported, but at the moment they\u2019re not enforced. 
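<\/p>\n<p>As a sketch, creating a table with such a non-enforced primary key could look like this (the table and column names are invented for illustration):<\/p>\n<pre>CREATE TABLE dbo.DimCustomer\n(\n    CustomerID   INT          NOT NULL,\n    CustomerName VARCHAR(100) NOT NULL\n);\n\n-- In Fabric, constraints must be declared NONCLUSTERED and NOT ENFORCED\nALTER TABLE dbo.DimCustomer\nADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerID) NOT ENFORCED;<\/pre>\n<p>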
It can still be useful to create them, as it allows the query engine to create more optimal plans and because some BI tools use constraints to automatically create relationships. More info can be found <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/table-constraints\">here<\/a>.<\/p>\n<h4>Ingesting Data<\/h4>\n<p>Before we start writing SQL, we must ingest data first. There are several options available:<\/p>\n<ul>\n<li><strong>Data Pipelines<\/strong>. These are very similar <em>to Azure Data Factory pipelines<\/em>. They are a low-code to no-code experience for copying data to a destination (in this case the warehouse), or for orchestrating various other tasks. You can schedule pipelines for automatic data retrieval.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"815\" height=\"270\" class=\"wp-image-102237\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-61.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 8: Sample Pipeline in Fabric<\/p>\n<p><em>ALT: a sample pipeline in Fabric consisting of 4 activities<\/em><\/p>\n<ul>\n<li><strong>Dataflows<\/strong>. These are very similar to <em>Power BI dataflows<\/em>. Both use the Power Query engine, but dataflows in Fabric can write to several possible destinations. Unlike data pipelines, you can actually see the data you\u2019re working with, making them ideal for \u201ccitizen developers\u201d. 
You can define a refresh schedule for a dataflow, or you can trigger them from a pipeline.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"999\" height=\"594\" class=\"wp-image-102238\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-62.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 9: Fabric Dataflow with sample data<\/p>\n<p><em>ALT: an example of a Fabric Dataflow, with data in preview<\/em><\/p>\n<ul>\n<li><strong>COPY T-SQL statement<\/strong>. With the <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/t-sql\/statements\/copy-into-transact-sql?view=fabric&amp;preserve-view=true\">COPY INTO statement<\/a>, data stored in Parquet files or in CSV files can be loaded into a warehouse table. Currently only Azure Data Lake Storage or Blob accounts are supported as an external location.<\/li>\n<li><strong>Selecting data from other workspace sources<\/strong>. As mentioned before, thanks to the <em>OneCopy<\/em> principle, data can be read from other warehouses or lakehouses. You can use the 3-part naming convention <strong>database.schema.table<\/strong> to reference a table in another warehouse\/lakehouse. Using SQL statements such as CREATE TABLE AS SELECT (CTAS), INSERT INTO or SELECT INTO, we can retrieve data from other sources (or from within the same warehouse), transform it and then write the results to a table.<\/li>\n<\/ul>\n<p>People who have been working with the SQL Server platform or with the Azure Data Platform should be familiar with most of the options available to ingest data. 
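<\/p>\n<p>As a quick sketch, loading a Parquet file with COPY INTO and then transforming it with CTAS might look like this (the storage account, file path and table names are hypothetical):<\/p>\n<pre>-- Load raw data from a Parquet file in Azure Data Lake Storage\nCOPY INTO dbo.SalesStaging\nFROM 'https:\/\/mystorageaccount.dfs.core.windows.net\/raw\/sales.parquet'\nWITH (FILE_TYPE = 'PARQUET');\n\n-- Transform the staged data and persist the result with CTAS\nCREATE TABLE dbo.SalesByCustomer\nAS\nSELECT CustomerID, SUM(Amount) AS TotalAmount\nFROM dbo.SalesStaging\nGROUP BY CustomerID;<\/pre>\n<p>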
In fact, existing code can often be migrated to the Fabric ecosystem, keeping in mind that there are still some limitations to the T-SQL supported in Fabric.<\/p>\n<p>For more information about the data ingestion options, check out <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/ingest-data\">the documentation<\/a> or the article <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/7869\/automate-the-ingestion-of-data-into-microsoft-fabric\/?utm_source=KoenVerbeeck\">Microsoft Fabric Data Ingestion Methods and Tools<\/a>.<\/p>\n<h4>Transactions<\/h4>\n<p>Thanks to the transaction log of the delta tables, but also thanks to an internally managed transaction log, the warehouse is fully ACID-compliant. Only the snapshot isolation level is supported. <strong>Transactions that span different databases in the same workspace are supported<\/strong>.<\/p>\n<p>More information about transactions in the warehouse can be found <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/transactions\">here<\/a>.<\/p>\n<h4>Security<\/h4>\n<p>The same access controls that exist in SQL Server are available in the Fabric warehouse as well. You can use GRANT and DENY for granular object- or column-level security, for example. For more fine-grained access, you can use row-level security or data masking. The documentation page <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/security\">Security for data warehousing in Microsoft Fabric<\/a> has a good overview of all possible security measures. Another method is \u201csharing\u201d a warehouse, which is interesting when you want only a specific warehouse to be visible to someone. 
<img loading=\"lazy\" decoding=\"async\" width=\"315\" height=\"668\" class=\"wp-image-102239\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-63.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 10: Sharing a Fabric warehouse<\/p>\n<p><em>ALT: the sharing window for a warehouse in Fabric. You can change settings for additional permissions and enter e-mail addresses for adding people, just like in OneDrive.<\/em><\/p>\n<h4>Conclusion<\/h4>\n<p>Even though the storage layer is completely different from SQL Server, the Fabric warehouse is very similar to its on-premises counterpart. You can write the same T-SQL \u2013 albeit with some restrictions \u2013 and use the same security and transaction controls. The primary developer skill set is still T-SQL, perhaps combined with Pipelines or Dataflows for ingesting data or for orchestrating the ETL flow. People who have been building data warehouses with relational databases should have little trouble working with the Fabric warehouse.<\/p>\n<h3>The Lakehouse<\/h3>\n<p>Together with the rise (and fall?) of big data, the data lake made its introduction years ago. Because storage is cheap, you can put terabytes of data in your data lake and figure out what to do with it later (<a href=\"https:\/\/www.dremio.com\/wiki\/schema-on-read-vs-schema-on-write\/\" target=\"_blank\" rel=\"noopener\">schema-on-read vs schema-on-write<\/a>). Decoupling storage from compute (in traditional databases the storage is proprietary and included in the database engine) had the advantage that you could scale your storage separately and choose which compute service suits you best. It turns out that querying all those files \u2013 without any predefined structure \u2013 is quite hard. To bring together the strengths of the data lake and the traditional data warehouse, the <strong>data lakehouse<\/strong> was introduced. 
Thanks to the delta format, you can treat Parquet files like regular tables and have a transaction log associated with them. But they\u2019re still files on a file system, so you can keep on scaling that storage.<\/p>\n<p>Fabric also has a lakehouse service and as mentioned before, data is kept in the OneLake storage layer.<\/p>\n<h4>Tables &amp; Files<\/h4>\n<p>A big difference with the warehouse is that a lakehouse can have <strong>both tables and files<\/strong>. Tables can be delta tables, but other types of tables are supported as well (for example csv or Parquet; a delta table is indicated with a small black triangle). Tables can be either managed (Spark takes care of the metadata and the data) or unmanaged (Spark only takes care of the metadata). A good overview of the different types is given in the article <a href=\"https:\/\/data-mozart.com\/working-with-tables-in-microsoft-fabric-lakehouse-everything-you-need-to-know\/\">Working with tables in Microsoft Fabric Lakehouse \u2013 Everything you need to know!<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"847\" height=\"601\" class=\"wp-image-102240\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-64.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 11: A lakehouse with both tables and files<\/p>\n<p><em>ALT: a lakehouse with a table and files section. A table is selected, and its data is shown in a preview<\/em><\/p>\n<p>The files section shows the \u201cunmanaged\u201d section of the lakehouse. This can be any file type. 
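<\/p>\n<p>In a notebook, the distinction shows up in how you save a data frame: writing to a table creates a managed delta table in the tables section, while writing to a path under <em>Files<\/em> just stores files (a sketch; the names and paths are made up):<\/p>\n<pre># Managed delta table: shows up in the tables section\ndf.write.format(\"delta\").saveAsTable(\"sales\")\n\n# Plain files: show up in the files section\ndf.write.format(\"csv\").option(\"header\", \"true\").save(\"Files\/exports\/sales_csv\")<\/pre>\n<p>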
When you click on a folder in the explorer pane, you\u2019ll see all of the files listed in the preview pane on the right.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"408\" height=\"492\" class=\"wp-image-102241\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-65.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 12: File preview in the lakehouse<\/p>\n<p><em>ALT: a folder is selected in the files section of the explorer. The files in that folder are shown in the preview pane<\/em><\/p>\n<p>A nice feature of Fabric is the ability to create <strong>shortcuts<\/strong>. With a shortcut, you can link to data that is stored in an external location, such as an Azure Blob container, an S3 bucket, or files in another lakehouse.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"686\" height=\"477\" class=\"wp-image-102242\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-66.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 13: Dialog for creating a new shortcut<\/p>\n<p><em>ALT: the screen you get when creating a new shortcut. You can choose between OneLake, S3, or Azure Data Lake Storage<\/em><\/p>\n<p>A shortcut makes the data or file visible in the lakehouse, but it doesn\u2019t create a physical copy. The data remains at the source and is only loaded when the shortcut is queried. Shortcuts are a method of simplifying ETL and minimizing the amount of data that needs to be copied between systems. You can create shortcuts in the file section, or you can create them in the table section. 
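<\/p>\n<p>Once created, a shortcut behaves like any other table or folder. For example, a table shortcut can be queried from a notebook as if it were a local table (the shortcut name here is hypothetical):<\/p>\n<pre># A shortcut table is read exactly like a regular lakehouse table\ndf = spark.read.table(\"sales_from_s3\")\ndf.show(5)<\/pre>\n<p>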
When an object is a shortcut, a little paperclip icon will be shown.<\/p>\n<h4>Ingesting Data<\/h4>\n<p>Like in the warehouse, there are a couple of ways to get data into the lakehouse.<\/p>\n<ul>\n<li><strong>Pipelines and Dataflows. <\/strong>Exactly the same as in the warehouse, but now with a lakehouse as the destination.<\/li>\n<li><strong>The lakehouse interface.<\/strong> When you right-click on a file or folder, you can choose to load its data to a (delta) table. This is fine for a one-shot analysis or a demo\/proof-of-concept, but in real-life scenarios you might want to opt for an ingestion method that can be automated.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"489\" height=\"298\" class=\"wp-image-102243\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-67.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 14: Load data manually to a table.<\/p>\n<p><em>ALT: the context menu to load a file to a table in the lakehouse<\/em><\/p>\n<ul>\n<li><strong>Notebooks.<\/strong> The main method to work with a lakehouse is notebooks. They allow you to write Spark code (PySpark, Spark SQL, Scala or SparkR). You can connect to any data source that is supported by any of the languages (you can load modules for extra functionality) and load the data to the lakehouse. 
In a notebook, you can control if data ends up in a delta table or another type of table, or if the table is managed or unmanaged.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1008\" height=\"501\" class=\"wp-image-102244\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-code-description-autom.png\" alt=\"A screenshot of a computer code\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 15: a notebook reading data from Azure Open Datasets with PySpark into a data frame.<\/p>\n<p><em>ALT: a notebook with PySpark code that reads data from a blob container into a data frame<\/em><\/p>\n<ul>\n<li><strong>Real-time streaming with event streams<\/strong>. <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/real-time-analytics\/event-streams\/overview\">Event streams<\/a> are a low-code solution to fetch real-time data, apply rules to it and push it to one or more destinations. A lakehouse is one of the possible destinations.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"779\" height=\"103\" class=\"wp-image-102245\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-close-up-of-a-screen-description-automatically.png\" alt=\"A close up of a screen\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 16: A sample event stream in Fabric<\/p>\n<p><em>ALT: an event stream fetching data from a sample real-time data stream, pushing the data to a lakehouse<\/em><\/p>\n<h4>Querying Data<\/h4>\n<p>Besides ingesting data, with notebooks you can also pull data into data frames, do transformations on that data and even visualize it using many of the available <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/python-guide\/python-visualizations\">Python libraries<\/a>. 
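<\/p>\n<p>A minimal sketch of such a notebook workflow in PySpark (the table and column names are invented):<\/p>\n<pre># Read a lakehouse table, aggregate it and plot the top 10 customers\ndf = spark.read.table(\"sales\")\ntop10 = (df.groupBy(\"CustomerID\")\n           .sum(\"Amount\")\n           .orderBy(\"sum(Amount)\", ascending=False)\n           .limit(10))\n\n# Convert to a Pandas data frame for visualization\ntop10.toPandas().plot.bar(x=\"CustomerID\", y=\"sum(Amount)\")<\/pre>\n<p>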
You can use <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/getting_started\/quickstart_ps.html\">Pandas data frames<\/a> if you want, or use Spark SQL to write SQL queries on top of your data. Since you\u2019re using notebooks, you can even go further and build machine learning models and do <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/data-science-overview\">data science experiments<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"885\" height=\"267\" class=\"wp-image-102246\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-program-description-au.png\" alt=\"A screenshot of a computer program\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 17: notebook working with predictions<\/p>\n<p><em>ALT: a notebook cell containing pyspark code<\/em><\/p>\n<p>It\u2019s important to note that when you create a lakehouse, a <strong>SQL analytics endpoint<\/strong> is automatically created. This endpoint allows you to write T-SQL statements on top of the tables in the lakehouse, but it is <strong>read-only<\/strong>. This version of SQL is pretty much the same as the one you use in the warehouse, but it is a bit different from the Spark SQL you use in the notebooks. 
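<\/p>\n<p>Through this endpoint, lakehouse tables can also be joined to warehouse tables with the three-part naming convention mentioned earlier (a sketch; the database and table names are invented):<\/p>\n<pre>-- Join a lakehouse table to a warehouse table in the same workspace\nSELECT w.CustomerName, l.TotalAmount\nFROM MyWarehouse.dbo.DimCustomer AS w\nJOIN MyLakehouse.dbo.SalesByCustomer AS l\n  ON w.CustomerID = l.CustomerID;<\/pre>\n<p>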
When you\u2019re in the lakehouse, you can switch to the SQL endpoint in the top right corner:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"462\" height=\"166\" class=\"wp-image-102247\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-68.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 18: Switch from Lakehouse to SQL analytics endpoint<\/p>\n<p><em>ALT: Switch from Lakehouse to SQL analytics endpoint<\/em><\/p>\n<p>The SQL analytics endpoint resembles the warehouse explorer:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1209\" height=\"736\" class=\"wp-image-102248\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-69.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 19: the SQL analytics endpoint of a lakehouse<\/p>\n<p><em>ALT: the SQL analytics endpoint of a lakehouse. On the left there\u2019s an object explorer, similar to the warehouse. On the right is the data preview pane.<\/em><\/p>\n<p>You can write T-SQL statements just like in the warehouse (but only read-only queries):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"619\" height=\"441\" class=\"wp-image-102249\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/04\/a-screenshot-of-a-computer-description-automatica-70.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 20: A T-SQL Query executed on a lakehouse<\/p>\n<p><em>ALT: a T-SQL query executed on a lakehouse. 
At the bottom, the result grid is shown.<\/em><\/p>\n<p>The SQL endpoint is in fact treated similarly to the warehouse in the official documentation, as explained in the article <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/get-started-lakehouse-sql-analytics-endpoint\">Better together: the lakehouse and warehouse<\/a>.<\/p>\n<h4>Transactions<\/h4>\n<p>The SQL endpoint has the same transaction support as the warehouse, but for read-only queries. This means, for example, that if new rows are added to a table by another transaction (from a notebook), they are not visible to any transaction already open on the endpoint.<\/p>\n<p>In theory the lakehouse supports ACID transactions because there\u2019s a transaction log created by the delta layer. However, since it is completely open, no one is stopping you from connecting to the OneLake storage layer and randomly deleting Parquet files from the table. This means consistency cannot be guaranteed (an interesting discussion about this topic is presented in the blog post <a href=\"https:\/\/datamonkeysite.com\/2023\/06\/15\/acid-transaction-in-fabric\/\">ACID Transaction in Fabric?<\/a>, and also check out the accompanying <a href=\"https:\/\/twitter.com\/bogdanC_guid\/status\/1669337841409871885\">Twitter discussion<\/a>).<\/p>\n<h4>Security<\/h4>\n<p>Again, the SQL analytics endpoint follows the warehouse closely. When it comes to security, the endpoint supports the same features: object-level and row-level security are available. However, there are no such security features when working with Spark. 
Security is mainly defined through <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/workspace-roles-lakehouse\">workspace roles<\/a> and <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/lakehouse-sharing\">lakehouse sharing<\/a>.<\/p>\n<h4>Conclusion<\/h4>\n<p>The main point I\u2019m trying to make about the lakehouse is that it is a different beast than the warehouse. You can use notebooks to write code in various languages (PySpark, Spark SQL, Scala, SparkR and Java) and there are more possible use cases in the lakehouse, such as machine learning, data science and data visualization.<\/p>\n<p>Each lakehouse comes with a SQL analytics endpoint, which you can use to write T-SQL on the lakehouse tables. The SQL endpoint shares the same features as the warehouse but is read-only.<\/p>\n<h2>Choosing between the Two<\/h2>\n<p>We\u2019re at the end of the article and hopefully you now have a good idea which features the warehouse or the lakehouse brings. When you need to decide which service you want to use when building your data platform in Fabric, you can ask yourself any of the following questions:<\/p>\n<ul>\n<li><strong>What are the skills of my current team?<\/strong> If you have a seasoned data warehouse team that has been building data warehouses in databases like SQL Server for years (and thus they have good SQL skills), you might want to opt for the warehouse (typical job titles are BI Developer, Analytics Engineer, ETL\/data warehouse developer, data engineer). 
If, on the other hand, most people on the team have Python skills, the lakehouse might be a better option (typical job titles are data engineer and data scientist).<\/li>\n<li><strong>Do I need to migrate existing solutions?<\/strong> If you have data warehouses that need to be migrated (for example from SQL Server, Azure SQL DB or Synapse Analytics), the warehouse might be the easiest option, since most of the code can be migrated without too much hassle (don\u2019t forget to check the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/tsql-surface-area\">T-SQL limitations in Fabric<\/a>!). If you need to migrate solutions from Databricks, the lakehouse is the better option.<\/li>\n<li><strong>Do you need to work with unstructured data?<\/strong> If you need to work with unstructured or semi-structured data, such as JSON files, images, video, audio and so on, the lakehouse is your choice.<\/li>\n<li><strong>Do you need to support other use cases besides just data warehousing?<\/strong> For example, if you want real-time data or machine learning models, the lakehouse will be a better fit.<\/li>\n<li><strong>What are your transaction and security requirements?<\/strong> If you have strict requirements about permissions, such as row-level security, and strict transaction requirements, the warehouse is probably the best option. Skills from SQL Server can easily be reused. Keep in mind, however, that the SQL analytics endpoint of the lakehouse supports the same features for read-only access.<\/li>\n<\/ul>\n<p>You can find a good overview of the differences between the warehouse and the lakehouse (and other services such as Power BI Datamarts and KQL databases) in <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/get-started\/decision-guide-data-store#data-warehouse-and-lakehouse-properties\">the documentation<\/a>.<\/p>\n<p>However, the main question you might need to ask yourself is: <strong>\u201cDo I really need to choose?\u201d<\/strong>
The two services can perfectly well be combined in a single data platform. For example, you can use the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/onelake-medallion-lakehouse-architecture\">medallion architecture<\/a>, where the bronze and silver layers are implemented in a lakehouse (or multiple lakehouses) and the gold layer in a warehouse. This idea is explored in the blog post <a href=\"https:\/\/debruyn.dev\/2023\/fabric-lakehouse-or-data-warehouse\/\">Fabric: Lakehouse or Data Warehouse?<\/a> by Sam Debruyn.<\/p>\n<p>As we stated in the introduction: the answer is <strong>it depends<\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft Fabric is a new centralized, AI-powered cloud data platform hosted by Microsoft. It combines several services of the existing Azure Data Platform \u2013 such as Azure Data Factory \u2013 with Power BI, while also introducing new services. It\u2019s tempting to compare Microsoft Fabric with Azure Synapse Analytics since Synapse also bundles different services
together,&#8230;&hellip;<\/p>\n","protected":false},"author":110905,"featured_media":103101,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,53,159164],"tags":[123645,158998,158997,159085],"coauthors":[159081],"class_list":["post-102229","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bi-sql-server","category-featured","category-microsoft-fabric","tag-data-warehouse","tag-lakehouse","tag-microsoft-fabric","tag-warehouse"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102229","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/110905"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=102229"}],"version-history":[{"count":3,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102229\/revisions"}],"predecessor-version":[{"id":107312,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102229\/revisions\/107312"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/103101"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=102229"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=102229"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=102229"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp
\/v2\/coauthors?post=102229"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}