Loading Models from Source Data with dbt


The data transformation tool dbt (data build tool) has become increasingly popular over the past few years. It focuses heavily on SQL and adds a number of useful features into the mix, such as data lineage, automatic orchestration, and reusable macros. The first article in this series, A gentle introduction to dbt, explains how you can use the cloud version of dbt for free, set up an account, and create a connection to Microsoft Fabric. We also created our first models, where a model can eventually be persisted as a view or a table in your database. It’s recommended to go through that article first if you haven’t already, because we will build upon it.

In this article, we will dive a bit deeper into how you can specify the source tables for your dbt project, and we will create a couple of models that will load some fact and dimension tables.

How to Read Your Source Data

In an ELT pipeline, dbt is responsible for the “T”, meaning it transforms data that is already in the data store of choice using SQL. This means that some other tool – for example, pipelines in Azure Data Factory or in Fabric – needs to be responsible for extracting the source data and storing it in your database. Suppose this part of the process has already been completed and the data is available in staging tables. How can you process that data with dbt?

Get sample data in Fabric SQL DB

As a source, we’ll use sample data loaded into a SQL database in Microsoft Fabric (this feature has been in preview since the Microsoft Ignite 2024 conference). In a Fabric-enabled workspace, you can find the SQL database item under the “store data” category when you add a new item.

Give the new database a name. This name will be displayed when you connect to the database with the SQL Analytics Endpoint (which reads the mirrored Parquet files in the Delta Lake storage).

Once the database is provisioned, you’re presented with multiple options to load data into it. In our case, we want to load the sample AdventureWorksLT database.

It might take a little while to load the data.

When the database is ready, you can see that several tables have been loaded into the SalesLT schema.

We will now try to read and transform this data in dbt.

Specifying the Sources in dbt

dbt needs to know where it can find the source tables. The declaration of the source tables is done in a YAML file. Let us first add a new folder to the dbt project and name it “AdventureWorksDW”.

Using the same context menu, we can add a new YAML file to this folder.

The file doesn’t necessarily need to be called sources.yml, but it does need to contain a sources: key. In the file, we place the following contents:
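
A minimal sketch (the database entry should match whatever you named your Fabric SQL database; AdventureWorksLT is assumed here):

version: 2

sources:
  - name: adventureworkslt
    database: AdventureWorksLT  # assumed name of the Fabric SQL database
    schema: SalesLT
    tables:
      - name: Product
      - name: ProductModel
      - name: ProductCategory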

We gave our source a name – adventureworkslt – and specified the name of the database, the schema, and the names of the different tables. Because the warehouse can use three-part naming, we can access the tables in the SQL DB from the warehouse itself. This goes through the SQL Analytics Endpoint of the SQL DB. For example:
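
A query along these lines, run against the warehouse (with AdventureWorksLT again standing in for your database name), reads the mirrored data through the endpoint:

-- three-part naming: database.schema.table
SELECT TOP 10 *
FROM AdventureWorksLT.SalesLT.Product;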

In the folder we created earlier, we can now add a new file called DimProduct.sql. It’s important to include the .sql extension, so dbt can recognize it as a model. Inside the file, we add the following SQL, which includes a Jinja reference to one of the source tables we specified in the sources.yml file:
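
A simple first version of the model (the column selection here is just illustrative):

SELECT
    ProductID,
    Name AS ProductName,
    ProductNumber,
    Color
FROM {{ source('adventureworkslt', 'Product') }}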

The reference consists of the source name and the name of the table. At compile time, dbt will translate this to the correct three-part name:
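
For the model above, the compiled SQL would look something like this (again with AdventureWorksLT as the assumed database name):

SELECT
    ProductID,
    Name AS ProductName,
    ProductNumber,
    Color
FROM AdventureWorksLT.SalesLT.Product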

It’s also possible to preview the data, provided the SQL statement is valid.

Let’s replace our query with something more complex; we now read from several source tables in the same query:
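
A sketch of such a query, joining the product table to its model and category (columns again illustrative):

SELECT
    p.ProductID,
    p.Name AS ProductName,
    pm.Name AS ProductModel,
    pc.Name AS ProductCategory,
    p.Color,
    p.StandardCost,
    p.ListPrice
FROM {{ source('adventureworkslt', 'Product') }} p
JOIN {{ source('adventureworkslt', 'ProductModel') }} pm
    ON p.ProductModelID = pm.ProductModelID
JOIN {{ source('adventureworkslt', 'ProductCategory') }} pc
    ON p.ProductCategoryID = pc.ProductCategoryID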

When we save and compile the model, we get an updated lineage view, which shows the model’s dependencies on the source tables.

This only works if you use the source macro to reference the source tables; if you hardcode the table names, the lineage view will not show the dependencies. We can now create a similar model for the customers. First, we need to append the names of the extra source tables to the sources.yml file:
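
Assuming we join each customer to their address, three extra entries go under the tables: key of the existing source:

      - name: Customer
      - name: Address
      - name: CustomerAddress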

Create a new file called DimCustomer.sql and add the following SQL statement:
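
A possible sketch, joining the customer to its address through the CustomerAddress bridge table (column selection illustrative):

SELECT
    c.CustomerID,
    c.FirstName,
    c.LastName,
    c.CompanyName,
    a.City,
    a.CountryRegion
FROM {{ source('adventureworkslt', 'Customer') }} c
LEFT JOIN {{ source('adventureworkslt', 'CustomerAddress') }} ca
    ON c.CustomerID = ca.CustomerID
LEFT JOIN {{ source('adventureworkslt', 'Address') }} a
    ON ca.AddressID = a.AddressID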

This results in a lineage view showing the dependencies on the customer-related source tables.

For our final model, FactSales.sql, we can use the following SQL statement (don’t forget to add the two extra source tables to the sources.yml file!):
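
One possible version, joining the order header to its detail lines (columns illustrative):

SELECT
    sod.SalesOrderID,
    sod.SalesOrderDetailID,
    soh.OrderDate,
    soh.CustomerID,
    sod.ProductID,
    sod.OrderQty,
    sod.UnitPrice,
    sod.LineTotal
FROM {{ source('adventureworkslt', 'SalesOrderHeader') }} soh
JOIN {{ source('adventureworkslt', 'SalesOrderDetail') }} sod
    ON soh.SalesOrderID = sod.SalesOrderID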

This results in a lineage view with FactSales depending on both sales source tables.

Our (simple) data warehouse is finished, for now. We can build all the models and deploy them to the Fabric warehouse with the following command:
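
Assuming the models live in the AdventureWorksDW folder we created earlier, the command would be something like:

dbt run --select AdventureWorksDW

(dbt build would work as well; it additionally runs any tests defined on the models.)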

This command will take all compiled models within the path specified and execute them against the database.

However, the models are created as views, not as tables.

Creating views is the default behaviour in dbt, but we can override it by setting the materialization in the model itself.
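
In a single model, this is a config block at the top of the .sql file, for example:

{{ config(materialized='table') }}

If you want to do this for an entire folder instead, you can specify the behaviour in the dbt_project.yml file.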

Add the folder and the +materialized configuration and set it to table:
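
In dbt_project.yml this could look as follows, where my_dbt_project stands in for the project name defined at the top of that file:

models:
  my_dbt_project:
    AdventureWorksDW:
      +materialized: table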

When we run the command again, we can see that load times have increased, as data is now physically loaded into the tables.

The views have been removed and tables have been created instead.

Conclusion

In this article, we showed how you can load sample data into a Fabric SQL Database. We then used this data to load a couple of models in dbt. We specified the source tables in a YAML file, which enables metadata references inside the SQL statements and allows dbt to derive dependencies and build a lineage view.

Those models aren’t “best practice” yet when it comes to data warehouse modelling. In a follow-up article, we will look at how we can use macros to add surrogate keys, reuse code, and create a date dimension.

About the author

Koen Verbeeck

Koen Verbeeck is a data professional working at AE. He helps organizations get insight into their data and improve their analytics solutions. Koen has over a decade of experience in developing data warehouses, models, and reports using the Microsoft data platform, and he has been a Microsoft Data Platform MVP since 2017. He blogs at http://www.sqlkover.com, writes articles for MSSQLTips.com, and is a frequent speaker at Microsoft Data Platform events.