{"id":102145,"date":"2024-04-30T17:00:49","date_gmt":"2024-04-30T17:00:49","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=102145"},"modified":"2024-09-03T20:04:45","modified_gmt":"2024-09-03T20:04:45","slug":"pyspark-secrets-to-use-with-fabric","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/pyspark-secrets-to-use-with-fabric\/","title":{"rendered":"PySpark Secrets to use with Fabric"},"content":{"rendered":"<p>PySpark is a powerful language for data manipulation and it\u2019s full of tricks. Let\u2019s discover some of them.<\/p>\n<h2>Control the Type of a NULL column<\/h2>\n<p>If you are creating a pysspark dataframe, but one of the columns contains only null values (None), how could you control the type of the column?<\/p>\n<p>There is an interesting expression you can build using <em>lit().cast()<\/em> to control the type of a column with null values.<\/p>\n<p>The following statement will add a new column or fix the type of an existing column containing only Null values.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #008800;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: bold\">pyspark.sql.functions<\/span> <span style=\"color: #008800;font-weight: bold\">import<\/span> lit\r\n\r\ndf<span style=\"color: #333333\">=<\/span>df<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">'myNewColumn'<\/span>,lit(<span style=\"color: #007020\">None<\/span>)<span style=\"color: #333333\">.<\/span>cast(<span style=\"background-color: #fff0f0\">'timestamp'<\/span>))\r\n<\/pre>\n<\/div>\n<h2>\u00a0<\/h2>\n<h2>Updating a single row\/column in a Dataframe<\/h2>\n<p>PySpark works very well with Big Data and blocks of data. Each operation is made to apply over an entire Dataframe.<\/p>\n<p>In this way, updating a single row\/column brings a syntax not much traditional. This is an example:<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\"><span style=\"color: #008800;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: bold\">pyspark.sql.functions<\/span> <span style=\"color: #008800;font-weight: bold\">import<\/span> lit, col, when\r\n\r\nDF <span style=\"color: #333333\">=<\/span> DF<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">'fieldToChange'<\/span>, when(\r\n     col(<span style=\"background-color: #fff0f0\">\"tableKey\"<\/span>) <span style=\"color: #333333\">==<\/span> <span style=\"background-color: #fff0f0\">'uniqueKey'<\/span>,\r\n     lit(<span style=\"background-color: #fff0f0\">'literalValue'<\/span>)\r\n     )<span style=\"color: #333333\">.<\/span>otherwise(col(<span style=\"background-color: #fff0f0\">\"fieldToChange\"<\/span>)))\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Let&#8217;s analyse the statement step-by-step:<\/p>\n<ul>\n<li><strong>withColumn<\/strong>: Adds or replaces a column. In this case, it will replace the column.<\/li>\n<li><strong>when\/otherwise<\/strong>: creates a condition, similar to a If statement<\/li>\n<li><strong>lit<\/strong>: literal value<\/li>\n<\/ul>\n<p>In this way, the <strong><em>when<\/em><\/strong> clause specifies a filter, selecting one single row to receive a new value in the column we are updating. All the remaining rows will continue with the original value.<\/p>\n<p>This example updates a single row. However, this syntax can be used as a SQL update\/where, updating multiple rows in different ways.<\/p>\n<h2>Creating a Hash for Dataframe rows<\/h2>\n<p>When you need to compare if a row was changed or not, one good method is to create a hash for the row and compare the hash.<\/p>\n<p>Using just a few statements you can create a hash for the rows of a dataframe. If these rows are kept immutable, the hash can also be used as surrogate key.<\/p>\n<p>The code below shows an example of how to create a hash for all the rows of a dataframe:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\" class=\"crayon:false\">df <span style=\"color: #333333\">=<\/span> spark<span style=\"color: #333333\">.<\/span>createDataFrame(\r\n   [(<span style=\"color: #0000dd;font-weight: bold\">1<\/span>,<span style=\"background-color: #fff0f0\">\"2\"<\/span>,<span style=\"color: #0000dd;font-weight: bold\">5<\/span>,<span style=\"color: #0000dd;font-weight: bold\">1<\/span>),(<span style=\"color: #0000dd;font-weight: bold\">3<\/span>,<span style=\"background-color: #fff0f0\">\"4\"<\/span>,<span style=\"color: #0000dd;font-weight: bold\">7<\/span>,<span style=\"color: #0000dd;font-weight: bold\">8<\/span>)],\r\n   (<span style=\"background-color: #fff0f0\">\"col1\"<\/span>,<span style=\"background-color: #fff0f0\">\"col2\"<\/span>,<span style=\"background-color: #fff0f0\">\"col3\"<\/span>,<span style=\"background-color: #fff0f0\">\"col4\"<\/span>)\r\n   )\r\n\r\ndf<span style=\"color: #333333\">.<\/span>withColumn(<span style=\"background-color: #fff0f0\">\"row_sha2\"<\/span>, \r\n   sha2(concat_ws(<span style=\"background-color: #fff0f0\">\"||\"<\/span>, <span style=\"color: #333333\">*<\/span>df<span style=\"color: #333333\">.<\/span>columns), <span style=\"color: #0000dd;font-weight: bold\">256<\/span>))\r\n\r\ndisplay(df)\r\n<\/pre>\n<\/div>\n<p>\nOne of the most interesting parts of the code is the unpacking operator, &#8216;*&#8217;. Used with the df.columns collection, it unpack the collection, converting it into its individual elements. In this way, the <em>concat_ws<\/em> receives all the columns as a parameter.<\/p>\n<p>You can also check more about the <a href=\"https:\/\/spark.apache.org\/docs\/3.1.2\/api\/python\/reference\/api\/pyspark.sql.functions.sha2.html\" target=\"_self\" rel=\"noopener\">sha2 function<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>PySpark is a powerful language for data manipulation and it\u2019s full of tricks. Let\u2019s discover some of them. Control the Type of a NULL column If you are creating a pysspark dataframe, but one of the columns contains only null values (None), how could you control the type of the column? There is an interesting&#8230;&hellip;<\/p>\n","protected":false},"author":50808,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,159164],"tags":[158997],"coauthors":[6810],"class_list":["post-102145","post","type-post","status-publish","format-standard","hentry","category-blogs","category-microsoft-fabric","tag-microsoft-fabric"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102145","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/50808"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=102145"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102145\/revisions"}],"predecessor-version":[{"id":104136,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102145\/revisions\/104136"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=102145"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=102145"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=102145"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=102145"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}