Dynamic Partitioning and a Simple Incremental Load

Let’s consider a simple statement that partitions a table and saves it in a lakehouse:

df.write.mode("overwrite").format("delta").partitionBy("Year","Month","Day").save("Tables/" + table_name)

Suppose we load the data daily, with all the transactions from that day. The table saves the transactions for each day in a different partition. We can expect the table to keep the partitions from previous days, months and years, achieving a kind of incremental load, right?

Wrong.

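In “overwrite” mode, each write replaces the entire content of the table, not only the partitions present in the dataframe. The old files are neither overwritten nor deleted, but they are removed from the Delta log. The records are no longer directly accessible in the Delta table.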

The table removes the previous partitions from the Delta log
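Because the old files still exist on disk, an earlier version of the table can usually still be read through Delta time travel, as long as those files have not been cleaned up by VACUUM. The following is a minimal sketch, not part of the original load: the helper name read_previous_version and its parameters are hypothetical, and it assumes an active Spark session with Delta Lake available, as in a Fabric notebook.

```python
def read_previous_version(spark, table_path, version):
    """Read a Delta table as it was at an earlier version (time travel).

    spark      -- an active SparkSession (assumed to exist, as in a notebook)
    table_path -- path of the Delta table, e.g. "Tables/" + table_name
    version    -- Delta version number to read, e.g. 0 for the first commit
    """
    return (
        spark.read
        .format("delta")
        # versionAsOf selects the table state recorded at that log version
        .option("versionAsOf", version)
        .load(table_path)
    )
```

For example, read_previous_version(spark, "Tables/" + table_name, 0) would return the table as of its first commit, including the partitions that a later overwrite removed from the log.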

There are two possible solutions. If you load the data once a day, you can replace “overwrite” with “append”. This solution only works if an ingestion never brings a record that was already loaded before.

df.write.mode("append").format("delta").partitionBy("Year","Month","Day").save("Tables/" + table_name)

However, if we load the data multiple times a day, each time bringing all the transactions from the day and possibly including records already loaded into the table, the “append” mode doesn’t work, because it creates duplicate records.
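The duplication is easy to see in a toy example. The sketch below uses plain Python lists instead of a Delta table, and the record layout (id, amount) is invented for illustration.

```python
# Toy illustration in plain Python: a list stands in for the table,
# and "append" simply adds every incoming record to it.

def append(table, batch):
    """Return the table with all records of the batch added at the end."""
    return table + batch

# Morning load: the transactions of the day so far (hypothetical records).
morning = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]
# Afternoon load: the whole day again, so ids 1 and 2 come back along with id 3.
afternoon = morning + [{"id": 3, "amount": 7.5}]

table = append(append([], morning), afternoon)

# Collect the ids that appear more than once after both loads.
ids = [row["id"] for row in table]
duplicated_ids = sorted({i for i in ids if ids.count(i) > 1})

print(len(table), duplicated_ids)  # prints: 5 [1, 2]
```

Three distinct transactions end up as five rows, because the two records from the morning load are stored twice.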

The other solution is to overwrite specific partitions instead of all of them. In this case, we overwrite only the “Day” partitions present in the incoming data.

The Spark configuration “spark.sql.sources.partitionOverwriteMode” accepts the values static or dynamic. The default value is static: an overwrite removes all the old partitions from the Delta log. When we set this configuration to dynamic, the table keeps the old partitions and overwrites only the ones present in the dataframe.
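The difference between the two modes can be sketched with a toy model. This is plain Python, not Delta Lake's real implementation: the table is assumed to be a dictionary that maps a (Year, Month, Day) partition to its rows, and the function name overwrite is hypothetical.

```python
# Toy model in plain Python, not Delta Lake's real implementation:
# the table is a dict mapping a (Year, Month, Day) partition to its rows.

def overwrite(table, incoming, mode="static"):
    """Return the table content after an overwrite in the given mode."""
    if mode == "static":
        # Every existing partition is dropped; only the incoming ones remain.
        return dict(incoming)
    if mode == "dynamic":
        # Existing partitions are kept; those present in incoming are replaced.
        result = dict(table)
        result.update(incoming)
        return result
    raise ValueError(f"unknown mode: {mode}")

table = {
    (2024, 1, 15): ["t1", "t2"],
    (2024, 1, 16): ["t3"],
}
# A second load on January 16 brings all of that day's transactions again.
incoming = {(2024, 1, 16): ["t3", "t4"]}

static_result = overwrite(table, incoming, mode="static")
dynamic_result = overwrite(table, incoming, mode="dynamic")

print(sorted(static_result))          # [(2024, 1, 16)]
print(sorted(dynamic_result))         # [(2024, 1, 15), (2024, 1, 16)]
print(dynamic_result[(2024, 1, 16)])  # ['t3', 't4']
```

In the dynamic case the partition for January 15 survives untouched, while the partition for January 16 is replaced as a whole, so the reloaded records are not duplicated.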

After we set this configuration to “dynamic”, we can execute the same original statement to ensure the table will keep the old partitions:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").format("delta").partitionBy("Year","Month","Day").save("Tables/" + table_name)

The table keeps the existing partitions instead of overwriting them.
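If changing the setting for the whole session is not desirable, Delta Lake also accepts partitionOverwriteMode as an option on the write itself. The following is a minimal sketch, assuming Delta Lake 2.0 or later; the helper name overwrite_loaded_partitions is hypothetical.

```python
def overwrite_loaded_partitions(df, table_path):
    """Overwrite only the partitions present in df, keeping all the others.

    df         -- a Spark DataFrame with Year, Month and Day columns
    table_path -- path of the Delta table, e.g. "Tables/" + table_name
    """
    (
        df.write
        .mode("overwrite")
        .format("delta")
        # Per-write setting: applies to this write only and leaves the
        # session configuration untouched.
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy("Year", "Month", "Day")
        .save(table_path)
    )
```

A call such as overwrite_loaded_partitions(df, "Tables/" + table_name) should then behave like the statement above, without relying on spark.conf.set having been executed earlier in the session.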

Summary

Using dynamic partitioning and defining the partition key correctly is a simple and fast way to achieve an incremental load. Of course, this is only one among other options we will talk about later.

You may also like to take a look at how to set the configuration as the default using environments.

 

About the author

Dennes Torres

Dennes Torres is a Data Platform MVP and Software Architect living in Malta who loves SQL Server and software development and has more than 20 years of experience. Dennes can improve Data Platform Architectures and transform data into knowledge. He moved to Malta after more than 10 years leading the devSQL PASS Chapter in Rio de Janeiro and is now a member of the leadership team of the MMDPUG PASS Chapter in Malta, organizing meetings, events, and webcasts about SQL Server. He is an MCT and an MCSE in Data Platforms and BI, with more titles in software development. You can get in touch on his blog https://dennestorres.com or at his work https://dtowersoftware.com