PySpark: The flexibility of a loop

PySpark code, being Python, supports several flexible syntax forms that are uncommon in other languages. One of them is the inline loop format, which Python calls a comprehension.

Loops in PySpark can be used to build objects or other values that are then assigned to variables. In other languages, this would usually involve a loop that concatenates or appends to the variable.

In PySpark, this can easily be done with a loop in a single line or very few lines.
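As a quick illustration of the difference, here is a minimal sketch in plain Python that builds the same list twice: once with an appending loop and once with an inline loop. The list `field_names` is a made-up sample.

```python
# Hypothetical sample data for illustration only.
field_names = ["id", "name", "email"]

# Loop-and-append style, as it would be written in many other languages.
upper_names = []
for name in field_names:
    upper_names.append(name.upper())

# The same result with an inline loop: the element expression comes first,
# followed by the for that repeats it.
upper_inline = [name.upper() for name in field_names]

print(upper_inline)
```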

Example 1 : Inline loop

You have a list of field names in a dictionary called fieldnames. The values of the dictionary are the actual field names. You need to select these fields from a dataframe.
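A minimal sketch of such a select, assuming `df` is an existing DataFrame that contains the listed columns. The function name `select_fields` and the sample dictionary contents are hypothetical; `fieldnames` and the loop variable `new_name` follow the description in this example.

```python
# Hypothetical sample: the dictionary values are the actual field names.
fieldnames = {"CustomerKey": "customer_id", "CustomerName": "customer_name"}


def select_fields(df, fieldnames):
    """Select the columns named by the dictionary values from df."""
    # df[new_name] is the single element; the for repeats it
    # once per value in the dictionary.
    return df.select([df[new_name] for new_name in fieldnames.values()])
```

With a real DataFrame, `select_fields(df, fieldnames)` would be expected to return only the `customer_id` and `customer_name` columns.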

Let’s understand the pieces of this code:

  • The .values() method extracts a collection with the values of a dictionary. Dictionaries also have a .keys() method, which returns the keys instead
  • The for loop sits in the middle of the expression and builds a list of columns for the select method of the dataframe
  • Before the for loop comes the syntax for one single element of the list, written using the loop variable (new_name)

Example 2 : Building Variables

You are joining 2 dataframes. However, you are building a reusable piece of code which may be executed for different tables.

The tables are in a dictionary where the key is the table name and the value is an array with the primary key names. How do you build the join condition?
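One way this could be sketched, assuming `left_df` and `right_df` are two DataFrames that both contain the key columns. The function names, the `tables` sample, and the inner join type are assumptions made for illustration.

```python
# Hypothetical sample: table name -> list of primary key column names.
tables = {"customers": ["id"]}


def build_join_condition(left_df, right_df, key_names):
    """Build one equality condition per primary key column."""
    return [left_df[key] == right_df[key] for key in key_names]


def join_on_primary_key(left_df, right_df, tables, table_name):
    """Join two DataFrames on the primary key registered for table_name."""
    condition = build_join_condition(left_df, right_df, tables[table_name])
    # A list of Column conditions passed to `on` is combined with AND.
    return left_df.join(right_df, on=condition, how="inner")
```

If the dictionary entry were a composite key such as `["order_id", "line_number"]`, the same code would produce two conditions without any change.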

The for works the same way as in the previous example, but this time it builds an array of conditions to be evaluated during the join of a dataframe.

There is no need for “and” between the conditions: each pass of the for produces one element of the array, and when the join receives a list of conditions it combines them with AND.

Although in this example there is only one key field (“id”), the for prepares the code for the presence of composite keys.

Example 3 : Object loop

You can loop over a dictionary with multiple variables. The for loop can extract multiple variables from each element of an object, which is especially useful for dictionaries.

We call the items() method on the dictionary. It returns key/value pairs, and the for unpacks each pair into two variables.
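A small sketch of this unpacking in plain Python. The `tables` dictionary and its contents are hypothetical samples shaped like the table/primary-key dictionary from Example 2.

```python
# Hypothetical sample: table name -> list of primary key column names.
tables = {
    "customers": ["id"],
    "order_lines": ["order_id", "line_number"],
}

# items() yields (key, value) pairs; the for unpacks each pair
# into table_name and key_names.
descriptions = [
    f"{table_name}: {', '.join(key_names)}"
    for table_name, key_names in tables.items()
]

print(descriptions)
# ['customers: id', 'order_lines: order_id, line_number']
```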

Example 4 : JSON Loop

We can use a loop to build a JSON-like structure (in Python, a dictionary containing a list) in a very interesting way. The example below builds a DAG for the runMultiple statement, but that is another subject. The interesting part for us is the way the for is used to build the DAG.
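A hedged sketch of how such a DAG could be assembled. The key names (`activities`, `name`, `path`, `args`, `concurrency`) reflect my understanding of the runMultiple DAG layout and should be checked against the documentation for your environment. The column names `notebook_name`, `notebook_path` and `table_name`, and the function name `build_dag`, are hypothetical.

```python
def build_dag(rows, concurrency=3):
    """Build a DAG dictionary with one activity per row.

    `rows` is any iterable of row-like objects that support row["column"],
    for example the result of config_df.collect() on a small DataFrame.
    """
    return {
        "activities": [
            {
                # One array element, written once and repeated by the for.
                "name": row["notebook_name"],
                "path": row["notebook_path"],
                "args": {"table_name": row["table_name"]},
            }
            for row in rows
        ],
        # Not part of the for: set once for the whole DAG.
        "concurrency": concurrency,
    }
```

The resulting dictionary would then be passed to the runMultiple call. Collecting the DataFrame first is reasonable here only because a configuration table like this is assumed to be small.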


  • The basic concept is the same: one element of an array is built, and the for comes immediately after it, “multiplying” the element.
  • This example loops over the rows of a dataframe.
  • The variable of the for is a row of the dataframe, and it is used to build each array element.
  • The JSON can have other elements besides the array built by the for, and they do not affect the for (the “concurrency” element is not part of the for).

Summary

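This is a very powerful language syntax. With it you can build lists of columns, join conditions, and nested structures in a single expression, which makes reusable PySpark code much easier to write.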

About the author

Dennes Torres

Dennes Torres is a Data Platform MVP and Software Architect living in Malta who loves SQL Server and software development and has more than 20 years of experience. Dennes can improve Data Platform Architectures and transform data into knowledge. He moved to Malta after more than 10 years leading the devSQL PASS Chapter in Rio de Janeiro and is now a member of the leadership team of the MMDPUG PASS Chapter in Malta, organizing meetings, events, and webcasts about SQL Server. He is an MCT and an MCSE in Data Platforms and BI, with more titles in software development. You can get in touch on his blog https://dennestorres.com or at his work https://dtowersoftware.com