Reading and Writing Parquet Files in SQL Server

When analytic data is stored directly in SQL Server, a clustered columnstore index is an ideal place for that data to reside. It is column-based, highly compressed, and supports a wide variety of optimizations for both reads and writes.

When analytic data is stored in files, such as for use in Azure Data Factory, Synapse, or Fabric, the Parquet file format will often be used. There are many similarities between these two columnstore formats and the methods they use to improve read/write speeds and storage space.

When writing data from SQL Server to files, though, uncompressed formats such as CSV are often used. While files can be written quickly, the lack of compression can be wasteful and slow as data and file sizes become large.

This article dives into the Apache Parquet file format, how it works, and how it can be used to export and import data directly to SQL Server, even when a data platform that supports Parquet files natively is unavailable to assist.

In the second part of this article, customizations and more advanced options will be highlighted, showing the flexibility of Python as a tool to solve analytic data movement challenges.

Background: Columnstore Indexes in SQL Server

Before diving into this article, there is value in briefly reviewing how columnstore indexes work. If in need of a refresher, check out my series on columnstore indexes here: https://www.red-gate.com/simple-talk/databases/sql-server/t-sql-programming-sql-server/hands-on-with-columnstore-indexes-part-1-architecture/

Background: Apache Parquet File Format

The Parquet file format is a columnar data file format that can be compressed using a variety of different algorithms. It is an open format, meaning that its underlying technology is not owned by any company or organization. This makes it portable and allows data to be stored and managed across different platforms without the typical vendor-lock-in challenges that many data formats pose.

While the Parquet file format can benefit a wide variety of data, it is most effective at storing analytic data. Because data intended for analytics tends to contain repetitive, easy-to-compress values within each individual column, a column-wise format is ideal for its storage. Most conventional file formats, such as CSV, XML, and JSON are all row-based. In these formats, values for all data elements with each row are stored consecutively. This is similar to how a clustered rowstore index stores data in SQL Server: Each column in sequence for each row in the table, one row after another until the end of the table.

The similarities between a columnstore index and a parquet file are no coincidence. Each is built to solve the same exact problem, but in different environments. The challenges of storing large amounts of analytic data in a relational database are quite different from those faced in file formats, and this article will bring them together as a way to improve the movement and storage of data to and from SQL Server.

How is a Parquet File Stored?

The key structural elements of the Apache Parquet format to note are:

  • Row Groups
  • Column Chunks
  • File Metadata
  • Row Group Metadata

Row groups are similar to those found in columnstore indexes. Their purpose is to break up a large data set into manageable groupings. Whereas columnstore indexes limit rowgroup size to 2^20 (1,048,576) rows, the size of row groups in the Parquet format is more variable and can be configured. As is often the case with storage-level configurations, there is value in accepting the default settings until use-cases and performance dictate otherwise.

For more information on the Parquet file format, configuration settings, and other details, check out the definitive source: File Format | Parquet (apache.org)

Column chunks are akin to segments in a columnstore index. Each column chunk stores data for a single column within a single row group. The column chunk may be split into any number of data pages, which then store that data sequentially. Like with row groups, data page size can be configured if there is a need to do so.
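
To make that concrete, both settings can be overridden when a file is written. The sketch below uses Pyarrow (the library used later in this article); the table contents and the sizes chosen are purely illustrative, not recommendations:

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny in-memory table, just to have something to write.
table = pa.table({"CustomerID": [1, 2, 3], "TotalDue": [10.50, 20.00, 7.25]})

# Override the defaults: cap each row group at 500,000 rows and target
# roughly 512KB of encoded data per data page within each column chunk.
pq.write_table(
    table,
    "example.parquet",
    row_group_size=500_000,
    data_page_size=512 * 1024,
)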

Each column chunk represents the minimum amount of data that can be read at a time from a parquet file. Compression is applied at the data page level, though. Therefore, it is possible for a single column to have different compression details, even within the same column chunk, via different data pages.

Minimal, high-level file metadata is stored in the file footer of each parquet file, allowing tools to quickly identify, in general terms, what the file contains.

Metadata for each row group is stored in the file footer as well. This includes details of data types, compression, size, and more.

The key benefit of metadata in parquet files is that it allows tools to quickly determine what is stored in the file without reading and analyzing all of the data first. Therefore, queries that only need a portion of the data can avoid reading the rest of it. This is quite similar to the role of metadata in columnstore indexes, enabling behavior akin to segment elimination: unneeded columns and data ranges can automatically be skipped when reading parquet files.
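
As a quick illustration, Pyarrow (used later in this article) can expose the footer metadata without scanning the data itself; the file name below is a placeholder for any existing parquet file:

import pyarrow.parquet as pq

# Read only the footer metadata; no column data is loaded.
metadata = pq.read_metadata("customer_totals.parquet")
print(metadata.num_rows, metadata.num_row_groups, metadata.num_columns)

# Per-column statistics (min/max, null count) for the first row group.
# These statistics are what allow readers to skip row groups and columns
# that cannot possibly satisfy a query's filters.
print(metadata.row_group(0).column(0).statistics)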

There are many additional optimizations available for parquet files. Some are open-source parts of the file format, whereas others are proprietary add-ons by software vendors. If you see references to V-Order Optimization, Z-Order Optimization, or Liquid Clustering, then recognize that they are performance optimizations intended to:

  • Improve compression by tweaking data order.
  • Improve performance for specific filters.
  • Behave like indexes, where needed, to add critical query/analytics support.

A more detailed discussion of the parquet file format is out of scope for this article, as is a review of Delta Tables/Delta Lake. No further knowledge is needed to understand the code in this article, and in fact, I’ve probably already over-shared quite a bit.

The Lego-Based Approach

Azure SQL Database added both Parquet and delta file support to OPENROWSET, though this functionality is currently in private preview. Azure SQL Managed Instance has this support currently available, as well (without the private preview limitation). If you have access to either of these options, then reading Parquet files into SQL Server can be accomplished using a relatively streamlined and familiar approach. Due to the limited scope and access to these features, we will not delve into them as part of this article.

To otherwise read and write parquet files from SQL Server requires a set of different processes to be pieced together. I like to refer to this as a “Lego-Based Approach”, with each process being a different Lego brick.

For this project, the components that will be used include:

  • SQL Server
  • SQL Server Agent
  • Python
    • Including some free, widely available libraries
  • Anaconda

There are many ways to accomplish the goals of this project, including other programming languages, task-scheduling agents, data platforms, and more. Feel free to customize the approach to meet the needs/limitations of your data platform and software development environments.

Why Python?

Python is the Swiss-army-knife of tools in the best possible way and is the glue that brings everything in this project together. It is extremely versatile, with a seemingly endless supply of well-maintained libraries, ensuring that further customization is rarely difficult.

A huge bonus: Most platforms support Python. Whether your code is in Azure, AWS, on-premises, or elsewhere, Python is a readily available tool that will continue to serve us well for many years to come.

Lastly, learning and using Python is not too challenging. Even for someone with little coding experience, the wealth of tutorials and example code available makes it easy to solve problems and understand how scripts work.

If you are a PowerShell enthusiast, then the answer to your next question is: “Yes”. Parquet files can be written and read using PowerShell, and if you are fluent in PowerShell scripting, then the task of replicating the contents of this article there will not be any more challenging than what is presented here.

Installing/Configuring the Tools

To avoid 30 long/boring pages on configuration and install settings, I will include links here to official documentation and instructions on how to set up each tool needed for this project. If you already have any/all installed, then feel free to skim/skip those steps.

Some of these instructions will be second nature to a database engineer/administrator, while others are likely to be easier for a developer. This project dives into a bit of both worlds!

Download and install the Anaconda distribution for Windows:

https://www.anaconda.com/download/success

Create an Environment in Anaconda Navigator for testing our Python scripts:

https://docs.anaconda.com/navigator/tutorials/manage-environments

Install the following Python libraries (and their dependencies) from the packages list: pandas, pyarrow, and sqlalchemy. If SQLAlchemy will connect to SQL Server through ODBC, as in the examples later in this article, the pyodbc package is needed as well.

For more details:
https://docs.anaconda.com/anacondaorg/user-guide/packages/installing-packages/

Create a 64-bit ODBC local data source for the SQL Server:

https://learn.microsoft.com/en-us/sql/integration-services/import-export-data/connect-to-an-odbc-data-source-sql-server-import-and-export-wizard?view=sql-server-ver16

Create a SQL Auth login to SQL Server for use by this project:

https://learn.microsoft.com/en-us/sql/relational-databases/security/authentication-access/create-a-login?view=sql-server-ver16

Create database users/access for the SQL Auth login created above:

https://learn.microsoft.com/en-us/sql/t-sql/statements/create-user-transact-sql?view=sql-server-ver16

If using Visual Studio Code to create/modify python files, then it can be downloaded for free here (or in the appropriate app store):

https://code.visualstudio.com/download

The Python Parquet Project

Python is the language of choice here and will be used as the glue to bind together SQL Server and the file system. The general logic that will be followed is:

  1. Import libraries.
  2. Define SQL Server connection properties and file system properties.
  3. Write a SQL query for the data that is to be written to the file system.
  4. Connect to SQL Server and read the query data.
  5. Write the data to a parquet file.

Each step will be explored in detail so there is no confusion. This will also allow for easy customization, making scripts like this more extensible. I’m using Visual Studio Code to edit the python file, but you are welcome to use whatever code/text editor you are most comfortable with.

Import Libraries

This is the shortest and simplest task. The following libraries are used in this project:

  • Pandas: For reading data from SQL Server into a data frame.
  • Pyarrow: To read data from the data frame.
  • Pyarrow.csv: Support for writing to the CSV format (for demo purposes).
  • Pyarrow.parquet: Support for writing to the parquet format.
  • Pyarrow.dataset: Used to demo writing a data set to a partitioned file.
  • Sqlalchemy: Used to create the SQL Server connection.

The following Python code will import the libraries listed above. Libraries that are only used for demo purposes may be omitted if you decide not to use them going forward:
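
A minimal version of that import block might look like the following; the aliases are common conventions rather than requirements (note that aliasing pyarrow.csv as csv shadows the standard library csv module within this script):

import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import sqlalchemy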

If any errors are encountered that indicate a library cannot be found, be sure that it is installed and available to python for use in this project.

SQL Server Connection & File System Properties

To connect to SQL Server, the server details need to be defined. The following variables will be used to store these details:
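
The values below are placeholders for a local instance; substitute your own server name, database, login, and installed ODBC driver:

sql_server = "localhost"
sql_database = "AdventureWorks2017"            # adjust to match your sample database name
sql_username = "parquet_demo"                  # the SQL Auth login created earlier
sql_password = "NotARealPassword123"           # store securely in a production environment
sql_driver = "ODBC Driver 17 for SQL Server"   # any installed 64-bit SQL Server ODBC driver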

To keep things simple, I am using my local SQL Server for demo purposes. The login credentials are provided explicitly here, as well. If you decide to use code like this in a production environment, be sure to stash passwords somewhere secure to reduce the potential for unauthorized data access.

(You can download the AdventureWorks sample database here and a copy of SQL Server here if you don’t already have them available to work through the examples.)

That handles where the data is coming from. A file path must also be defined where files will be written to:
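
A single variable can hold the destination folder; the path below is a placeholder and can point to any local, network, or synced location:

file_path = "C:/Users/YourUser/Dropbox/SQLDataExport/"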

For this demo, I'm writing directly to Dropbox (because some HA at home is a good thing!). In a production environment, this is more likely to be a path to a shared drive or cloud service, such as Azure Blob Storage or Amazon S3.

Create a SQL Query

For this demo, a data set will be created with some sales analytics:
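
Any query will do here; as a sketch, the aggregation below summarizes orders per customer from AdventureWorks's Sales.SalesOrderHeader table (the specific columns are illustrative):

sql_query = """
SELECT
    SalesOrderHeader.CustomerID,
    COUNT(*) AS OrderCount,
    SUM(SalesOrderHeader.SubTotal) AS SubTotal,
    SUM(SalesOrderHeader.TotalDue) AS TotalDue
FROM Sales.SalesOrderHeader
GROUP BY SalesOrderHeader.CustomerID
ORDER BY SalesOrderHeader.CustomerID;
"""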

Nothing fancy here. The goal is to be able to quickly and efficiently demo the process. Experimenting with different or larger data sets requires only altering the query defined above.

Connect to SQL Server and Read Data

This task is made easy by leaning on some of the libraries that were imported earlier in this article:
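
A sketch of that code, assuming SQLAlchemy connects through pyodbc (a DSN-based URL of the form mssql+pyodbc://user:password@YourDSN works just as well if you prefer the ODBC data source created earlier):

# Build the connection string from the variables defined earlier.
connection_string = (
    f"mssql+pyodbc://{sql_username}:{sql_password}"
    f"@{sql_server}/{sql_database}"
    f"?driver={sql_driver.replace(' ', '+')}"
)
engine = sqlalchemy.create_engine(connection_string)

# Read the query results into a pandas data frame, then hand the data
# frame to Pyarrow as an in-memory table.
data_frame = pd.read_sql(sql_query, engine)
customer_table = pa.Table.from_pandas(data_frame)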

The first line defines the connection string using variables declared earlier. If changes are needed, adjust the variables or create new ones to splice in here. Doing so is far more scalable than manually hacking the connection string whenever an adjustment has to be made.

A data frame is used to store the data in-memory, prior to writing it to a file. From there, the data is staged in a Pyarrow table, which is the ideal format for outputting the data to a file of any format. Note that no formatting of the data occurs as it is read. Ideally, data should be manipulated as little as possible here, as it is very easy to accidentally introduce bugs into data types. For example:

  • Accidentally rounding a decimal to the wrong place.
  • Truncating a string or number.
  • Storing a date or bit as a less obvious data type.
  • Adjusting formatting unintentionally.

Ideally an analytics or business application should format data, rather than data engineers knee-deep in code 😊

From here, code can be crafted to export the data stored in that Pyarrow table into whatever file format we wish. Before doing so, let’s quickly review how we will run this project.

Running the Project

From Anaconda Navigator, the project begins in an environment I named ParquetTesting, which was set up according to the instructions from earlier in this article:

Click the play icon on the environment and choose the “Open Terminal” option, which will open a command prompt:

Nothing exciting to see here, yet. Before starting, be sure to change to the directory where the Python script resides. In this example, the following command is executed:
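
For example (the folder below is a placeholder for wherever the script was saved):

cd C:\Users\YourUser\Dropbox\SQLDataExport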

We are ready to go! The following code can be added to the Python script, which will export the data in the Pyarrow table to a CSV file called customer_totals.csv:
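
A sketch of that export, using Pyarrow's CSV writer and the variables defined earlier (the header row is included explicitly here):

csv.write_csv(
    customer_table,
    file_path + "customer_totals.csv",
    write_options=csv.WriteOptions(include_header=True),
)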

All files are being maintained in the same directory where the Python script was created. Paths can be customized relatively easily if needed.

After saving, the Python file can be executed from the Anaconda terminal with the following command:
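
Assuming the script was saved as sql_to_parquet.py (the file name is arbitrary):

python sql_to_parquet.py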

The result is a new file in the working directory:

Opening the file shows a sample of the expected CSV output:

Note that the header column names were explicitly included via the Python script and can be removed by instead setting the property include_header=False. The file size for the CSV is 787KB, which will soon be relevant to our discussion of compression.

The file creation code can be adjusted to instead use the parquet library from Pyarrow:
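
A sketch of that change, swapping the CSV call for the Parquet writer with compression explicitly disabled:

pq.write_table(
    customer_table,
    file_path + "customer_totals.parquet",
    compression="none",
)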

For this first parquet test, no compression will be used. The result is a 552KB parquet file:

The file size decreased to 552KB! While no specific compression algorithm was applied, the columnstore nature of the parquet file format allows for files to generally be smaller, even with no additional efforts applied to further shrinking it.

Parquet supports a variety of compression algorithms. We can test a handful of them here using the following code:
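
One way to sketch this is a small loop over a few codecs; snappy, gzip, and brotli are used below as examples, and any codec supported by your Pyarrow build (such as zstd or lz4) can be swapped in:

for codec in ["snappy", "gzip", "brotli"]:
    pq.write_table(
        customer_table,
        f"{file_path}customer_totals_{codec}.parquet",
        compression=codec,
    )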

When executed, three new files will be created:

Each is smaller than the previous examples, with the resulting compression ratios being dependent on the data and the algorithm used. For this data set, the gzip algorithm provided the largest compression ratio, shrinking the file to less than 50% of the initial parquet file size. Experimentation can be used to determine the ideal compression algorithm, or a default can be chosen based on availability, comfort, or organizational standards.

The key takeaway at this point in the experiment is that the gzip compressed parquet file is about 30% of the size of the original CSV file, highlighting the immense storage savings provided here. Those savings will compound when files need to be copied, backed up, or stored/read in other systems.

Conclusion

The beauty of Python is that it is highly customizable. Libraries exist for so many tasks and more are being created/improved every day. This article provided a glimpse of how data can be exported from SQL Server into the Parquet file format using Python and some freely available libraries.

The data source could be Postgres, MySQL, or something else. Similarly, file formats, names, and the processes for ingesting these files are wholly customizable.

Your imagination is the limit in terms of how processes like this can be built, customized, and implemented.

We dove into the Parquet file format for this article because there is no out-of-the-box way to accomplish the tasks that were tackled here. Despite the lack of a native, one-tool process, solutions can be built, customized, and implemented to accomplish what we set out to do.

This is the first of two articles that investigate the use of Python to create, manage, move, and read Parquet files to and from SQL Server. The second article will introduce additional options, including how to adjust Parquet file format attributes and file properties.

About the author

Edward Pollack

Ed Pollack has 20+ years of experience in database and systems administration, which has developed his passion for performance optimization, database design, and making things go faster. He has spoken at many SQLSaturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQLSaturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, traveling, cooking exceptionally spicy foods, and hanging out with his amazing wife and sons.