Microsoft Fabric: Loading On-Premises and External Data

Loading on-premises and external data is a fundamental need for a data platform such as Microsoft Fabric. However, during this preview stage, Microsoft Fabric brings many surprises (and, should I say, disappointments?) in this area.

This blog may explain a lot, but I’m not bringing conclusions, only a lot of questions. It will be a pleasure to hear your ideas in the comments; let’s talk about it.

How Azure Data Factory Works

In order to understand how everything got mixed together, it’s important to understand how Data Factory works in relation to external data.

Data Factory has dataflows and pipelines. The pipelines are the ones that go out to external resources to retrieve the data. The dataflows, on the other hand, have a limited set of data sources available; the idea is precisely to work on the data after it has been brought into an “internal” environment.

The images below compare the broad range of data sources available for a pipeline with the limited sources available for a dataflow:

Pipeline: all the sources are available for pipelines.

Dataflows: many sources appear disabled for dataflows.

If the data is located on-premises or in a network we don’t have direct access to, we can use what’s called a self-hosted integration runtime. The self-hosted integration runtime is software we install in the source network environment and use to extract data from the source to the data platform.


“The source”, in this case, can be an on-premises environment or a private network in a different cloud.

If this reminds you of the on-premises data gateway used by Power BI, it’s because they are mostly the same, with small differences related to their use with Power BI or Data Factory.

Data Gateway and Self-Hosted: How they work

Both the on-premises data gateway and the self-hosted integration runtime are installed in the source network. They are intended to work as a bridge to load data from the source networks.

The key security feature of these tools is to never allow a direct connection from the cloud to the private network. The connection is always reversed: the private network opens a connection to the cloud and keeps this connection open. Using the already opened connection, specific cloud services can request data from the private network.


The service responsible for this architecture is called Azure Relay. In a distant past, Azure Relay was born as part of the Service Bus technologies; that’s why some images illustrating this article still mention Service Bus. Nowadays, Azure Relay is an independent service, and it’s automatically used behind the scenes by both pieces of software, the data gateway and the self-hosted runtime.
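To make the reversed-connection idea concrete, here is a minimal sketch of the pattern in Python. This is not Azure Relay’s actual protocol: the host name, port, and message format below are invented for illustration; only the direction of the connection matters.

```python
# A minimal sketch of the reversed-connection pattern used by the gateway
# and the self-hosted runtime. NOT Azure Relay's real protocol: the host,
# port, and messages are hypothetical.
import socket

RELAY_HOST = "relay.example.com"  # hypothetical cloud relay endpoint
RELAY_PORT = 443                  # outbound port: no inbound firewall rule needed

def query_local_source(request: bytes) -> bytes:
    # Placeholder for the real work: run the query against the local source.
    return b"...result rows...\n"

def run_agent():
    # The agent inside the private network dials OUT to the cloud and keeps
    # the connection open. The cloud never opens a connection inward.
    with socket.create_connection((RELAY_HOST, RELAY_PORT)) as conn:
        conn.sendall(b"REGISTER gateway-01\n")   # announce ourselves to the relay
        while True:
            request = conn.recv(4096)            # the cloud pushes a data request
            if not request:
                break                            # relay closed the connection
            conn.sendall(query_local_source(request))  # reply over the same outbound link
```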

Landing Zone: How does it work

In a medallion architecture, Bronze, Silver, and Gold are the most well-known zones. However, we also need a landing zone. The landing zone is an area where we drop the source files in the exact format we receive them, before any transformation is made to integrate them into the data lake format. This article (https://piethein.medium.com/medallion-architecture-best-practices-for-managing-bronze-silver-and-gold-486de7c90055) and some others on the web explain the landing zone in more detail.

Data Lake Medallion Architecture to Maintain Data Integrity

In this way, we drop the files in the landing zone, and start the transformations from the landing zone to the Bronze layer and onwards.

The landing zone has an additional benefit: usually, we keep the ingestion in the landing zone in blocks. For example, let’s say you are ingesting data daily; you create a folder for each daily ingestion. In this way, you can track the lineage of the data, identifying exactly when one specific record or value was ingested.
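As a minimal sketch of what those daily blocks can look like, the hypothetical helper below builds one date-partitioned folder path per ingestion. The folder layout and names are illustrative assumptions, not a Fabric requirement.

```python
# Date-partitioned landing-zone paths: one folder per daily ingestion block.
from datetime import date
from pathlib import PurePosixPath

def landing_path(source: str, ingestion_date: date, file_name: str) -> str:
    # One folder per ingestion block keeps lineage traceable.
    return str(PurePosixPath(
        "Files", "landing", source,
        f"{ingestion_date:%Y}", f"{ingestion_date:%m}", f"{ingestion_date:%d}",
        file_name,
    ))

print(landing_path("erp", date(2023, 7, 15), "customers.csv"))
# Files/landing/erp/2023/07/15/customers.csv
```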

In my opinion, the lack of a landing zone is dangerous, precisely because it makes it difficult to keep track of the lineage. Even if you think your initial project is simple and doesn’t need it, the landing zone is still important, because it will be much more difficult to implement once the project grows.

 

Microsoft Fabric Pipelines

There are no integration runtimes in Microsoft Fabric; the execution of the pipelines is managed internally. On the other hand, Microsoft Fabric works inside the Power BI environment. As a result, it’s integrated with the Power BI gateways, both the on-premises data gateway and the VNet data gateway.

However, at the moment I’m writing this blog, Microsoft Fabric pipelines don’t support the use of gateways. They can only import data from directly accessible sources. In this way, the pipelines in Fabric can retrieve external data, but not on-premises data.

The image below illustrates how the connection configuration in pipelines lacks a data gateway option. The connections need to point directly to the resource. Any existing connection in the Power BI environment that uses a data gateway is not available for use in a pipeline.


There are two other limitations to consider:

The pipelines in Fabric are new: they are Data Factory pipelines transposed to the Power BI environment. They still lack many types of connectors to different sources; for example, there is no MySQL connector.

Lakehouse and data warehouse connectivity is limited: connectivity to or from data warehouses and lakehouses is restricted at the workspace level in Power BI. A Microsoft Fabric pipeline can’t connect to a lakehouse or data warehouse in a different workspace.

 

The image below illustrates how the activity configuration is limited to choosing lakehouses or data warehouses located in the same workspace:


 

This creates an annoying problem: pipelines need to be created in the same workspace as the data repository (either a lakehouse or a data warehouse). In my personal opinion, this is an organizational limitation: we could have a workspace only for the pipelines, but this technical limitation prevents it.

 

Microsoft Fabric Dataflows Gen 2

Dataflows Gen 2 have all the connectors already available to Power BI, so they are capable of accessing a broad range of sources. They also support gateways, so they can bring data from on-premises environments and private networks into the data lake being created.

The image below shows the gateway support when configuring a connection:


Initially, this may seem like an inversion: in Fabric, it’s the dataflow that goes outside to retrieve data, while the pipeline features to do the same are limited.

However, Dataflows Gen 2 can’t drop files into the “Files” area of a lakehouse. They can only drop content into the “Tables” area of a lakehouse. This makes it more difficult to establish a landing zone.

The image below illustrates how Dataflows Gen 2 can drop data in any workspace, but only in the Tables area of a lakehouse or data warehouse, never in the Files area.


 

It’s interesting to notice how Dataflows Gen 2 can read from the “Files” area but can’t use the “Files” area as a destination. This allows them to be used to convert content from the Files area to the Tables area of a lakehouse, but not to import data from external sources into the “Files” area.
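The same Files-to-Tables conversion can also be done in a Fabric notebook. Below is a minimal PySpark sketch under assumed names (the CSV path and the table name are illustrative); `spark` is the session object Fabric notebooks provide automatically.

```python
# Minimal Files-to-Tables sketch for a Fabric notebook (PySpark).
# The path and table name are illustrative assumptions; `spark` is the
# session that Fabric notebooks predefine.
df = (spark.read
      .option("header", True)
      .csv("Files/landing/erp/2023/07/15/customers.csv"))  # read from the Files area

(df.write
   .format("delta")                # lakehouse tables are Delta tables
   .mode("append")
   .saveAsTable("customers_raw"))  # lands in the Tables area
```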

 

Gateway Version

Old versions of the gateway don’t correctly support Dataflows Gen 2 and the lakehouse or data warehouse destinations. The resulting error messages are not easy to understand.

If you are using a gateway, you need to ensure you are using its most recent version. This is not only about the software installed in the on-premises environment, but also about the gateway registration in the Power BI portal.


 

This can be annoying: you will need to drop the gateway in the portal and create it again. If you have many connections already linked to the gateway, you will lose them. In this scenario, you may need to keep the old gateway and create a new one.


The Fabric Landing Zone dilemma

We can use pipelines to create landing zones, but they have limited connectors and can’t use the gateways to read from private networks.

We can use Dataflows Gen 2 to connect to a broader range of sources and use the gateways, but we can’t drop the content in the “Files” area to create a landing zone.

Our options become the following:

  • We could give up having a landing zone and use Dataflows Gen 2 to drop data directly into the Bronze layer, but this can bring limitations to our data solution.
  • We could use a data object (SQL database, lakehouse, or data warehouse) as a landing zone. However, the relational format doesn’t make them a good landing zone for keeping the blocks of ingested data together. There is plenty of room to be creative at this point.
  • We could use a data object (SQL database, lakehouse, or data warehouse) as a pre-landing zone. A Dataflow Gen 2 would load the data from external sources into this pre-landing zone, and a pipeline would move the data from the pre-landing zone to the “Files” area of a lakehouse, making the lakehouse “Files” area the landing zone (see the sketch after this list). It’s closer to the landing zone architecture: we keep the blocks of loaded data, but we lose the original format of the loaded data.
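As a minimal sketch of the second hop of that third option, assuming a notebook stands in for the pipeline: read the pre-landing table (loaded by the Dataflow Gen 2) and drop its content as files into the lakehouse “Files” area. The table and path names are illustrative assumptions.

```python
# Second hop of the pre-landing pattern, sketched in a Fabric notebook (PySpark).
# `prelanding_customers` and the landing path are hypothetical names; `spark`
# is the session that Fabric notebooks predefine.
from datetime import date

block = f"{date.today():%Y/%m/%d}"  # one folder per ingestion block

staged = spark.read.table("prelanding_customers")  # loaded by the Dataflow Gen 2

(staged.coalesce(1)                      # keep the block as a single file
       .write.mode("overwrite")
       .option("header", True)
       .csv(f"Files/landing/erp/{block}/customers"))  # the landing zone block
```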

Pipelines and Dataflows Gen 2: Summarizing the Differences

 

|               | Pipelines | Dataflows Gen 2 |
| ------------- | --------- | --------------- |
| Connectors    | Fewer connectors available | Broad range of connectors |
| Gateway       | Can’t use a gateway; the source needs to be directly accessible | Supports the on-premises data gateway |
| Target        | For lakehouses or data warehouses, can only target objects in the same workspace | Can target objects in different workspaces |
| Drop location | Can drop the result in both the Tables area and the Files area of a lakehouse | Can only drop the result in the Tables area of a lakehouse, a limitation for creating a landing zone |

 

Workaround: Uploading Files

We can build solutions that upload files to a lakehouse’s Files area manually or by using OneLake File Explorer.

OneLake File Explorer has many similarities with OneDrive; however, the data is stored in the Files area of a lakehouse. In this way, remote solutions can send files directly into a landing area in the lakehouse.
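Because OneLake exposes an ADLS Gen2-compatible endpoint, a remote process can also push files programmatically. Below is a minimal sketch using the azure-identity and azure-storage-file-datalake packages; the workspace, lakehouse, and file names are illustrative assumptions, and the feature set may change while Fabric is in preview.

```python
# Uploading a file into a lakehouse Files area through OneLake's
# ADLS Gen2-compatible endpoint. Workspace, lakehouse, and paths are
# hypothetical examples.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)

# The workspace acts as the file system; the lakehouse is the root folder.
fs = service.get_file_system_client("MyWorkspace")
file = fs.get_file_client(
    "MyLakehouse.Lakehouse/Files/landing/erp/2023/07/15/customers.csv"
)

with open("customers.csv", "rb") as data:
    file.upload_data(data, overwrite=True)  # drop the file into the landing zone
```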

These approaches, however, depend on a lot of additional technical components or manual work.

Conclusion

Microsoft Fabric is still in an early preview stage. I expect this blog to become outdated soon, as our architectural options for building an enterprise data platform using Fabric improve.