I wrote about lakehouse maintenance before, about multiple lakehouse maintenances, published videos about this subject and provided sample code about it.
However, there is one problem: All the maintenance execution should be avoided over shortcuts.
The tables require maintenance in their original place. According to our solution advances, we start using shortcuts, lots of them. Our maintenance code should always skip shortcuts and make the maintenance only on the tables.
In this way, the maintenance in each lakehouse manages the tables in that lakehouse and delegates the maintenance of the shortcuts to their source location.
The Problem: how to identify which objects in a lakehouse are shortcuts?
This is not a straightforward task. It’s not something like a boolean value nor a check we can make on the object.
We need to use the Fabric API to list all shortcuts from the lakehouse. Once we have a list of all shortcuts, we can change the maintenance code to skip the shortcuts.
This is the original code for lakehouse maintenance:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
<span style="color: #333333">#</span> Welcome <span style="color: #008800;font-weight: bold">to</span> your <span style="color: #008800;font-weight: bold">new</span> notebook <span style="color: #333333">#</span> <span style="color: #008800;font-weight: bold">Type</span> here <span style="color: #008800;font-weight: bold">in</span> the cell editor <span style="color: #008800;font-weight: bold">to</span> <span style="color: #008800;font-weight: bold">add</span> code<span style="color: #333333">!</span> import pyarrow.dataset <span style="color: #008800;font-weight: bold">as</span> pq import os import org.apache.spark.<span style="color: #008800;font-weight: bold">sql</span> def cleanTables(delta_file_path,delta_table_name): spark.<span style="color: #008800;font-weight: bold">sql</span>(f<span style="background-color: #fff0f0">'OPTIMIZE {delta_table_name} VORDER'</span>) spark.<span style="color: #008800;font-weight: bold">sql</span>(f<span style="background-color: #fff0f0">'VACUUM \'</span><span style="color: #ff0000;background-color: #ffaaaa">{</span>delta_file_path<span style="color: #ff0000;background-color: #ffaaaa">}\</span><span style="background-color: #fff0f0">' RETAIN 0 HOURS'</span>); print(f<span style="background-color: #fff0f0">'\nTable {delta_file_path} OPTIMIZE and VACUUM sucessfully'</span>) <span style="color: #333333">#</span> Test the <span style="color: #008800;font-weight: bold">function</span> <span style="color: #008800;font-weight: bold">with</span> a path <span style="color: #008800;font-weight: bold">to</span> your delta file<span style="color: #333333">#</span> spark.conf.<span style="color: #008800;font-weight: bold">set</span>(<span style="color: #aa6600">"spark.databricks.delta.retentionDurationCheck.enabled"</span>, <span style="color: #aa6600">"false"</span>) full_tables <span style="color: #333333">=</span> os.listdir(<span style="background-color: #fff0f0">'/lakehouse/default/Tables'</span>) <span style="color: #008800;font-weight: bold">for</span> <span style="color: #008800;font-weight: bold">table</span> <span style="color: #008800;font-weight: bold">in</span> full_tables: cleanTables(<span style="background-color: #fff0f0">'Tables/'</span> <span style="color: #333333">+</span> <span style="color: #008800;font-weight: bold">table</span>,<span style="color: #008800;font-weight: bold">table</span>) |
We need an additional function to retrieve the list of shortcuts from the lakehouse:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
<span style="color: #333333">#</span>shortcuts need <span style="color: #008800;font-weight: bold">to</span> be excluded <span style="color: #008800;font-weight: bold">from</span> the lakehouse maintenance import sempy.fabric <span style="color: #008800;font-weight: bold">as</span> fabric <span style="color: #008800;font-weight: bold">from</span> sempy.fabric.exceptions import FabricHTTPException def loadShortcuts(): client <span style="color: #333333">=</span> fabric.FabricRestClient() url <span style="color: #333333">=</span> f<span style="color: #aa6600">"v1/workspaces/{datausageWorkspaceId}/items/{lakehouseId}/shortcuts"</span> <span style="color: #008800;font-weight: bold">result</span> <span style="color: #333333">=</span> client.<span style="color: #008800;font-weight: bold">get</span>(url) <span style="color: #008800;font-weight: bold">data</span> <span style="color: #333333">=</span> <span style="color: #008800;font-weight: bold">result</span>.json().<span style="color: #008800;font-weight: bold">get</span>(<span style="background-color: #fff0f0">'value'</span>, []) <span style="color: #333333">#</span> Directly <span style="color: #008800;font-weight: bold">extract</span> the <span style="background-color: #fff0f0">'Name'</span> <span style="color: #008800;font-weight: bold">from</span> <span style="color: #008800;font-weight: bold">each</span> <span style="color: #008800;font-weight: bold">dictionary</span> shortcuts <span style="color: #333333">=</span> [item.<span style="color: #008800;font-weight: bold">get</span>(<span style="color: #aa6600">"Name"</span>) <span style="color: #008800;font-weight: bold">for</span> item <span style="color: #008800;font-weight: bold">in</span> <span style="color: #008800;font-weight: bold">data</span>] <span style="color: #008800;font-weight: bold">return</span> shortcuts |
The shortcuts are loaded as a collection because this makes it easier to filter the table names.
We need to change the main code to retrieve the shortcuts and skip them during the maintenance.
1 2 3 4 5 6 7 8 9 |
<span style="color: #333333">#</span> Test the <span style="color: #008800;font-weight: bold">function</span> <span style="color: #008800;font-weight: bold">with</span> a path <span style="color: #008800;font-weight: bold">to</span> your delta file<span style="color: #333333">#</span> spark.conf.<span style="color: #008800;font-weight: bold">set</span>(<span style="color: #aa6600">"spark.databricks.delta.retentionDurationCheck.enabled"</span>, <span style="color: #aa6600">"false"</span>) full_tables <span style="color: #333333">=</span> os.listdir(<span style="background-color: #fff0f0">'/lakehouse/default/Tables'</span>) shortcuts<span style="color: #333333">=</span>loadShortcuts() <span style="color: #008800;font-weight: bold">for</span> <span style="color: #008800;font-weight: bold">table</span> <span style="color: #008800;font-weight: bold">in</span> full_tables: if <span style="color: #008800;font-weight: bold">table</span> <span style="color: #008800;font-weight: bold">not</span> <span style="color: #008800;font-weight: bold">in</span> shortcuts: cleanTables(<span style="background-color: #fff0f0">'Tables/'</span> <span style="color: #333333">+</span> <span style="color: #008800;font-weight: bold">table</span>,<span style="color: #008800;font-weight: bold">table</span>) |
Conclusion
We need to be careful with shortcuts in our maintenance code
Load comments