{"id":103808,"date":"2024-09-04T17:00:00","date_gmt":"2024-09-04T17:00:00","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=103808"},"modified":"2024-09-03T20:04:42","modified_gmt":"2024-09-03T20:04:42","slug":"microsoft-fabric-and-pyspark-coding-tricks-to-improve-your-solutions","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/microsoft-fabric-and-pyspark-coding-tricks-to-improve-your-solutions\/","title":{"rendered":"Microsoft Fabric and PySpark: Coding tricks to improve your solutions"},"content":{"rendered":"<p><strong>PySpark<\/strong> has some unconventional syntaxes which provide power to the development process, making it easier. <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/pyspark-the-flexibility-of-a-loop\/\">We talked about loops before<\/a>.\u00a0Let\u2019s discover some more tricks to make our coding easier and more powerful.<\/p>\n<h2>The Power of the pyspark REDUCE function<\/h2>\n<p>Let\u2019s discover more about this powerful function<\/p>\n<p><strong>Definition:<\/strong> We can use this function to apply a given function to a sequence of elements and reduces it to a single value<\/p>\n<h3>Syntax<\/h3>\n<pre class=\"lang:python decode:true\">functools.reduce(function, iterable, initializer=None)<\/pre>\n<h2>Scenario<\/h2>\n<p>Imagine you need to make replaces in a dataframe&#8217;s string column. 
You need to make multiple replacements of different strings.<\/p>\n<p>This is the kind of scenario which produces a sequence of replace calls nested one inside another, with very ugly syntax.<\/p>\n<p>It would be something like this:<\/p>\n<pre class=\"lang:python decode:true \">replace(replace(replace(replace(replace(column,'xxx',''),'zz',''),'yy',''),'aa',''),'bb','')\n\n# this is an approximate syntax<\/pre>\n<h3>Using the reduce function<\/h3>\n<p>Let&#8217;s suppose the dataframe is called &#8220;df&#8221; and the column name is &#8220;Description&#8221;. This is how we can use the reduce function in this scenario:<\/p>\n<pre class=\"lang:python decode:true \">from pyspark.sql.functions import col, regexp_replace\nfrom functools import reduce\n\n# (original value, replacement value) pairs\nreplacements = [\n    (\"USD Monthly\", \"\"),\n    (\"USD-Monthly\", \"\"),\n    (\"USD Quarterly\", \"\"),\n    (\"USD\", \"\")\n]\n\n# Fold the replacement pairs into a single column expression\nnew_column = reduce(\n    lambda cl, replacement: regexp_replace(cl, replacement[0], replacement[1]),\n    replacements,\n    col(\"Description\")\n)\n\ndf = df.withColumn(\"Description\", new_column)<\/pre>\n<p>Let&#8217;s analyze the code above:<\/p>\n<p>The list &#8220;replacements&#8221; holds pairs of strings: the original value to search for and the replacement value.<\/p>\n<h4>Reduce function first parameter<\/h4>\n<p>This is a lambda expression. In this example, the expression receives a column and one of the string pairs. It executes the <strong>regexp_replace<\/strong> function to replace the string value in the column. Note that <strong>regexp_replace<\/strong> treats the search string as a regular expression.<\/p>\n<h4>Reduce function second parameter<\/h4>\n<p>This is the &#8220;replacements&#8221; list. The reduce function will execute the lambda expression for each element in the list. 
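<\/p>\n<p>To see the folding pattern outside Spark, here is a plain-Python sketch of the same idea, with the built-in str.replace standing in for regexp_replace (the sample text is hypothetical):<\/p>\n

```python
from functools import reduce

# (original value, replacement value) pairs, as in the Spark example
replacements = [
    ('USD Monthly', ''),
    ('USD-Monthly', ''),
    ('USD', '')
]

text = 'price USD Monthly'
# Each step feeds the accumulated string back in with one pair applied
clean = reduce(lambda s, pair: s.replace(pair[0], pair[1]), replacements, text)
print(clean)  # 'price '
```

\n<p>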
Each element is a pair of strings: the original string value and the replacement.<\/p>\n<h4>Reduce function third parameter<\/h4>\n<p>This is the initial value for the accumulation: in this case, the &#8220;Description&#8221; column. This column is passed as the first parameter to the lambda expression on the first iteration.<\/p>\n<h4>Final result<\/h4>\n<p>The reduce function results in a column expression. This expression is used together with the <strong>withColumn<\/strong> method to replace the original &#8220;Description&#8221; column with the new, processed value.<\/p>\n<h2>Clearing a dataframe &#8211; and similar tasks<\/h2>\n<p>This is an interesting coding trick. We can use it to clear a dataframe or to duplicate a dataframe, and you may find additional uses for it.<\/p>\n<p>The logic is simple: you get the schema from one dataframe and use it as the definition to create a new, empty dataframe.<\/p>\n<p>The code is as simple as this:<\/p>\n<pre class=\"lang:python decode:true \">ingestion_df = spark.createDataFrame([], ingestion_df.schema)<\/pre>\n<h2>Calling pyspark functions by their string name<\/h2>\n<p>In one way or another, most languages have this capability: we can hold the name of a function in a string variable, &#8220;functionName&#8221; for example, and execute the function from that information.<\/p>\n<p>This is a feature we usually reach for when creating generic, reusable, parameterized code.<\/p>\n<h3>Scenario<\/h3>\n<p>My scenario was exactly this. I was building reusable code that can be called with different sets of parameters. However, each situation required slightly different data transformations.<\/p>\n<p>The reusable code can contain the data transformations which are common to all sets of parameters. We can handle the specific transformations each set of parameters needs by calling a custom function, a different function for each set.<\/p>\n<p>In this way, we can include in each set of parameters the name of a custom function. 
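<\/p>\n<p>As a minimal, self-contained illustration of this name-to-function lookup (using Python&#8217;s standard math module rather than a real custom library):<\/p>\n

```python
import math

function_name = 'sqrt'  # the function name arrives as a string parameter

# getattr turns the string name into the actual callable;
# the third argument makes it return None instead of raising
func = getattr(math, function_name, None)

if func is not None:
    print(func(16.0))  # 4.0
```

\n<p>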
The function will complete the specific transformations.<\/p>\n<h3>Implementation<\/h3>\n<p>Considering a dataframe called &#8220;df&#8221; and a string variable called <em>function_name<\/em>, this is how the code can be implemented:<\/p>\n<pre class=\"lang:python decode:true\">import builtin.Lib.pycdc as pycdc\n\nfunc = None\n\ntry:\n    # Look up the function object by its string name\n    func = getattr(pycdc, function_name)\nexcept AttributeError:\n    print('no custom transformation available')\n\nif func is not None:\n    df = func(df)<\/pre>\n<p>Some details about the code:<\/p>\n<ul>\n<li><strong>getattr<\/strong> is the built-in function which retrieves the actual function object from the function name.<\/li>\n<li>The first parameter is the module where the function is located. <strong><em>&#8220;pycdc&#8221;<\/em><\/strong> is the name of the module, as used in the <em>import<\/em> statement.<\/li>\n<li><strong><em>getattr<\/em><\/strong> is used together with a try\/except block, catching <em>AttributeError<\/em>. This keeps the code safe in case the function doesn&#8217;t exist.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>These are only some of the available <strong>PySpark<\/strong> tricks. It is a powerful language available in Microsoft Fabric, an equally powerful environment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>PySpark has some unconventional syntaxes which provide power to the development process, making it easier. We talked about loops before.\u00a0Let\u2019s discover some more tricks to make our coding easier and more powerful. 
The Power of the pyspark REDUCE function Let\u2019s discover more about this powerful function Definition: We can use this function to apply a&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":103714,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,143526,159164,146042],"tags":[158997,159035,159146,5021],"coauthors":[6810],"class_list":["post-103808","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blogs","category-development","category-microsoft-fabric","category-python","tag-microsoft-fabric","tag-notebook","tag-pyspark","tag-python"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/103808","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=103808"}],"version-history":[{"count":1,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/103808\/revisions"}],"predecessor-version":[{"id":103809,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/103808\/revisions\/103809"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/103714"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=103808"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=103808"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/
wp-json\/wp\/v2\/tags?post=103808"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=103808"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}