High Concurrency Data Pipelines in Fabric

Data Pipelines can orchestrate many activities, creating a flow for data ingestion. One of these activities is the notebook execution activity.

However, by default, every time a Data Pipeline executes a notebook, it creates a completely new Spark session and Spark pool.

Starting everything from scratch for each execution makes the Data Pipeline slow and expensive.

How bad it can be

Imagine your pipeline runs a notebook inside a loop, and the loop executes the notebook many times.

Each execution means a completely new Spark pool, so the cost grows with every iteration.
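
A back-of-the-envelope estimate gives a feel for how this adds up. The sketch below assumes a hypothetical startup time per execution and a hypothetical run time for the notebook itself; both numbers are illustrative only and are not measurements from Fabric.

```python
def loop_cost(iterations: int, startup_min: float, work_min: float):
    """Split the duration of a sequential loop into startup time and real work.

    Every iteration is assumed to pay the full startup cost, because each
    execution gets a new session.
    """
    startup_total = iterations * startup_min
    work_total = iterations * work_min
    return startup_total, work_total


# Hypothetical numbers, chosen only for illustration.
ITERATIONS = 10
STARTUP_MIN = 1.0   # assumed time to start a new session
WORK_MIN = 0.25     # assumed time the notebook itself needs

startup_total, work_total = loop_cost(ITERATIONS, STARTUP_MIN, WORK_MIN)
overhead_share = startup_total / (startup_total + work_total)
print(f"startup: {startup_total} min, work: {work_total} min, "
      f"overhead share: {overhead_share:.0%}")
```

Under these assumed numbers, most of the loop's duration is spent starting sessions rather than running the notebook.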

Besides being expensive, the default configurations for a Spark session and a capacity will not support running all these executions in parallel. You will need to limit the number of parallel notebook executions in the ForEach activity, as in the image below.

[Screenshot: ForEach activity settings limiting the number of parallel notebook executions]
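
Conceptually, capping parallel executions in a loop works like a bounded worker pool. The sketch below is an analogy in plain Python and does not use any Fabric API: `MAX_PARALLEL`, `run_notebook`, and the item names are hypothetical, and the "notebook" only sleeps briefly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 2  # hypothetical cap on parallel notebook executions

_lock = threading.Lock()
_running = 0
peak_running = 0


def run_notebook(item: str) -> str:
    """Stand-in for one notebook execution; it only sleeps briefly."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.05)  # simulated work
    with _lock:
        _running -= 1
    return f"done:{item}"


items = ["a", "b", "c", "d", "e"]

# The pool size plays the role of the loop's concurrency limit.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_notebook, items))

print(results, "peak parallel runs:", peak_running)
```

All five items are processed, but never more than `MAX_PARALLEL` at the same time, which is the effect the limit on the loop is meant to achieve.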

High Concurrency to the Rescue

The solution is to enable High Concurrency for Data Pipelines running notebooks. This can be done in two steps:

  • Enable this configuration in the workspace settings
  • Configure the session tag in the notebook activity

In the workspace settings, you will find this option under Spark Settings, as in the image below:

[Screenshot: High Concurrency option in the workspace Spark Settings]

After that, the Session Tag configuration defines which notebook activities use this feature. You can create groups of notebook activities, with each group running in a different session. You can use any string as the Session Tag.

[Screenshot: Session Tag setting in the notebook activity]
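
The grouping idea can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not Fabric code: the activity names and tags are hypothetical, and the service may apply further conditions before it shares a session.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical notebook activities: (activity name, session tag or None).
activities: list[tuple[str, Optional[str]]] = [
    ("load_customers", "ingest"),
    ("load_orders", "ingest"),
    ("build_report", "reporting"),
    ("adhoc_check", None),
]


def plan_sessions(acts):
    """Group activities into sessions.

    Activities with the same tag share one session; an untagged
    activity is modelled as getting a session of its own.
    """
    sessions = defaultdict(list)
    for name, tag in acts:
        key = f"tag:{tag}" if tag is not None else f"own:{name}"
        sessions[key].append(name)
    return dict(sessions)


session_plan = plan_sessions(activities)
for key, names in session_plan.items():
    print(key, "->", names)
```

In this model, four activities need three sessions instead of four, because the two activities tagged `ingest` share one.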

The High Concurrency Results

The image below shows a comparison between the execution without high concurrency and with high concurrency.

The execution time dropped from almost 13 minutes to less than 3.

[Screenshot: execution time comparison without and with high concurrency]

References

Fabric Monday 55: Pipelines High Concurrency to Save Your Time and Money

Summary

If you plan to orchestrate notebooks using Data Pipelines, the High Concurrency configuration is essential.

About the author

Dennes Torres

Dennes Torres is a Data Platform MVP and Software Architect living in Malta who loves SQL Server and software development and has more than 20 years of experience. Dennes can improve Data Platform Architectures and transform data into knowledge. He moved to Malta after more than 10 years leading the devSQL PASS Chapter in Rio de Janeiro and is now a member of the leadership team of the MMDPUG PASS Chapter in Malta, organizing meetings, events, and webcasts about SQL Server. He is an MCT and an MCSE in Data Platforms and BI, with more titles in software development. You can get in touch on his blog https://dennestorres.com or at his work https://dtowersoftware.com