Discover the Microsoft Fabric Data Wrangler

The Data Wrangler is as interesting as it is hidden inside Microsoft Fabric. It's not easy to find and activate, but it's worth the trouble.

Before digging into the Data Wrangler, let’s analyze the data features in the query environment.

Data Features in Query Environment

The new query environment, which allows us to explore data with visual queries and SQL queries, is available with many Power BI features:

  • Datamarts
  • Lakehouses
  • Data Warehouses

And probably more to come, if I'm not already missing some.

Why are we starting with the query environment? Because it has features similar to the Data Wrangler's. Let's discover them first and then compare them with the features of the Data Wrangler.

This example starts on the SQL endpoint of a lakehouse.

  1. Create a Visual Query.

  2. Drag and drop the Fact_Sales table onto the Visual Query.

  3. On the top menu, under the Settings button, the Data View item has some interesting features for us to investigate. Let's analyze them one by one.

Enable Column Profile

When we enable this option, a subtle green line appears in the Display Area, just below the title. If we hover over this green line on a specific column, we see information about the data quality of the rows in that column.

The information tells us how many rows have valid values, error values, and empty values in that column.

Show Column Quality Details

Enabling this option expands the green line. The data quality information that was only visible when hovering over a column becomes visible in the expanded panel.

Show Value Distribution

This option adds information about the distribution of values in each column to the panel. It shows how many distinct and unique values each column has.

This information can be very useful for identifying primary keys in datasets you don't know. If you prefer to run this check with code instead of the visual query, a rough equivalent is sketched below.
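
The sketch assumes a notebook attached to the same lakehouse used later in this article (demolake) and uses customerkey as an illustrative candidate key column; adjust the names for your own data.

# Compare the total row count with the number of distinct values of a candidate key.
# If the two counts match, the column is a candidate primary key.
df_keys = spark.sql("""
    SELECT COUNT(*)                    AS total_rows,
           COUNT(DISTINCT customerkey) AS distinct_keys
    FROM   demolake.dimension_customer
""")
display(df_keys)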

 

Data Wrangler

The Data Wrangler offers information similar to what we just found in the query environment. However, while the query environment is visual, the Data Wrangler works inside Python notebooks.

Two additional details are worth mentioning: the Data Wrangler has more features, since it's linked to Spark notebooks, and it's difficult to locate if you don't know exactly where to look.

Opening the Data Wrangler

The secret to opening the Data Wrangler is simple once you discover it: the Data Wrangler requires a Pandas dataframe. Once we have a notebook open and a Pandas dataframe loaded into a variable, the Data Wrangler becomes available. A minimal sketch is shown below.
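
Any Pandas dataframe assigned to a variable in a notebook cell is enough to trigger the feature. The sample data in this sketch is made up purely for illustration:

import pandas as pd

# Hypothetical sample data; any Pandas dataframe stored in a variable will do.
df = pd.DataFrame({
    "CustomerKey": [1, 2, 3],
    "Category": ["Novelty Shop", "Gift Store", "Supermarket"]
})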

As an example, we will use the same lakehouse as in the previous example.

Let’s follow these steps:

  1. Click the Experience button and select the Data Engineering experience.

  2. Click the New Notebook button in the Data Engineering experience.

  3. On the top left, change the notebook name to WranglerTest.

  4. On the left, click the Add button to link the notebook to the lakehouse.

  5. In the Add lakehouse window, select the Existing lakehouse option.

  6. Choose your lakehouse and click the Add button.

  7. Drag and drop the dimension_customer table onto the code block in the notebook.

The code to load the table is created automatically. Did you know about this feature? The result looks roughly like the sketch below.
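
This is a sketch of the kind of code Fabric generates for the drag-and-drop (the lakehouse in this walkthrough is named demolake, as in the SQL used later; your generated code may differ slightly):

df = spark.sql("SELECT * FROM demolake.dimension_customer LIMIT 1000")
display(df)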

  8. Remove the display line from the code. It's not needed.
  9. Add toPandas() at the end of the spark.sql call, as in the sketch below.
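
After these two changes, the cell should look roughly like this:

df = spark.sql("SELECT * FROM demolake.dimension_customer LIMIT 1000").toPandas()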

  10. Execute the code block.

We will receive an error message that explains the problem. In the lakehouse environment, the conversion to Pandas tries to apply an optimization called Arrow optimization by default. This optimization may not work well with some fields, so we need to disable it.

 

  11. Add the following line to the start of the code:

spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'false')

  12. Run the code block again.

The Arrow optimization is now disabled, but the datetime field still causes problems. To keep things simple, let's remove the datetime field from the query.

 

  13. Change the SQL query in the code to the following:
SELECT customerkey,
       wwicustomerid,
       billtocustomer,
       category,
       buyinggroup,
       primarycontact,
       postalcode
FROM   demolake.dimension_customer
LIMIT  1000 
  14. Execute the code block again. The complete cell at this point is sketched below.
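
Assuming the lakehouse is named demolake, as in the query above, the whole cell now looks roughly like this:

# Disable Arrow optimization before converting to Pandas
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'false')

df = spark.sql("""
    SELECT customerkey,
           wwicustomerid,
           billtocustomer,
           category,
           buyinggroup,
           primarycontact,
           postalcode
    FROM   demolake.dimension_customer
    LIMIT  1000
""").toPandas()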

  15. On the top menu, click the Data menu item.
  16. Click the Launch Data Wrangler button.

The variable created in the code block is identified and appears in the menu.

  17. Click the variable name in the menu. The Data Wrangler will appear.

Data Wrangler: First Impressions

The initial Data Wrangler window resembles the information we saw in the query environment. It adds some extra detail about the data distribution, in some cases making it more visible when a column has only a few unique values.

 

If your purpose is only to see this additional information about the columns, both the Data Wrangler and the query environment work. It becomes a matter of preference, and you will likely choose according to your preferred environment: if you prefer to work visually, the query environment is better; if you prefer to work with Spark, the Data Wrangler is better.

However, the Data Wrangler can achieve much more. It can accept transformations over the data frame and implement these transformations as PySpark code.

In fact, the UI is somewhat similar to Power Query.

 

Transformations with Data Wrangler

Let’s implement some transformations over the data frame and check the result.

  1. Click the Expand button (“…”) beside the BillToCustomer field.

  2. Select the Filter option on the menu.
  3. On the left side, in the Select a Condition dropdown, select the Starts With option.

  4. In the third textbox, type WingTip.

  5. Click the Apply button.

Note the Cleaning steps panel, which registers the transformations.

  6. Click the Expand button (“…”) beside the PostalCode field.
  7. Select the Drop Columns option on the menu.

  8. Click the Apply button.

  9. On the top menu, click the Add code to notebook button.

The transformations created visually are included in the notebook as part of the code.
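
For the two transformations in this example, the generated code is roughly equivalent to the hand-written sketch below, expressed as Pandas operations on the df variable. The code the Wrangler actually emits may be organized differently (typically wrapped in a cleaning function), and the column names here follow what the Wrangler displays in this walkthrough:

# Keep only rows where BillToCustomer starts with 'WingTip'
df_clean = df[df['BillToCustomer'].str.startswith('WingTip')]

# Drop the PostalCode column
df_clean = df_clean.drop(columns=['PostalCode'])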

Summary

The Data Wrangler is a powerful tool, not only for data exploration but also for building PySpark code using visual methods.