{"id":90000,"date":"2021-02-22T16:32:40","date_gmt":"2021-02-22T16:32:40","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=90000"},"modified":"2021-06-03T16:48:14","modified_gmt":"2021-06-03T16:48:14","slug":"a-data-transformation-problem-in-sql-and-scala-dovetailing-declarative-solutions","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/a-data-transformation-problem-in-sql-and-scala-dovetailing-declarative-solutions\/","title":{"rendered":"A data transformation problem in SQL and Scala: Dovetailing declarative solutions"},"content":{"rendered":"<h2>Part I: Problem Space, SQL Server Simulation and Analysis<\/h2>\n<p>This article covers an interesting approach to a software problem. The software to be built must transform data in an operating system file. Would it make sense to include SQL Server in the development environment?<\/p>\n<p>Solving problems first in T-SQL provides result sets against which software results can be compared. Adding to that, query execution plans may serve as blueprints for software development.<\/p>\n<p>In this Part I of the series, I\u2019ll fashion business rules, requirements, and sample data into a denormalized table and query. The query\u2019s WHERE clause includes several logical operators, but its query plan is trivial. To get one that can serve as a software design model, I\u2019ll rewrite the query\u2019s logical connectives as their set operator equivalents. The new query plan is more complex but suggests a logical approach for the software effort.<\/p>\n<p>In Part II, you\u2019ll see T-SQL components realized as structures you know: lists, sets, and maps. What you may not know, though, is how to code algorithms declaratively \u2013 e.g. without long-winded loops. The software makes extensive use of map functions, which take functions as arguments and are<a id=\"post-90000-_Hlk62080423\"><\/a> supported to varying degrees in many Visual Studio languages.<\/p>\n<p>For this article, you\u2019ll need development experience in a relational database and some knowledge of relational theory. For the next, any experience in an imperative or declarative language suffices. Though I\u2019ve written the code in Scala and solely in the functional paradigm, I hope to present just enough background for you to grasp the core strategy of the solutions.<\/p>\n<p>You can download the\u00a0<a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/03\/Data-Transformation-Problem.zip\">complete files for both articles.\u00a0<\/a><\/p>\n<h2>Problem Space: Ice Cream<\/h2>\n<p>Ice cream manufacturers and retailers, whom I\u2019ll refer to as <em>makers<\/em>, purvey many flavors. Here are the business rules:<\/p>\n<ul>\n<li>There is a many-to-many relationship between makers and flavors.<\/li>\n<li>There is a many-to-one relationship from flavor to base flavor (functional dependency: flavor \u2192 base flavor).<\/li>\n<\/ul>\n<p>The term <em>base flavor<\/em> is an article device not an industry standard.<\/p>\n<p>Participation is mandatory in both. This illustration shows the relationships for five of twelve rows from the operating system file in entity-occurrence style:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"268\" class=\"wp-image-90001\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/diagram-venn-diagram-description-automatically-g.png\" alt=\"Diagram, venn diagram\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 1. Many-to-many and many-to-one relationships<\/strong><\/p>\n<p>This is the statement of the data manipulation requirement to be done on the operating system file in software:<\/p>\n<p><em>For rows in which maker names end in \u2018Creamery\u2019 or the base flavor is Vanilla, further restrict these to their flavors being Mint, Coffee, or Vanilla. Return result for all three columns.<\/em><\/p>\n<p>The reference to maker name of suffix \u2018Creamery\u2019 will result in a non-searchable argument (SARG) that will be handled transparently by the T-SQL but must be addressed directly in the software.<\/p>\n<p>There is a subtle incongruity between the business rules and the data transformation requirement. Do you see it? The file and table will have the same twelve mockup rows but imagine thousands in a real project and how this could impact performance. I\u2019ll point out the issue and resolve it later in a query and take what was learned when developing the software.<\/p>\n<h2>Problem Space: The Ice Cream Database<\/h2>\n<p>The goal of the table in the simulation database simply is to mirror the data in the file. The many-to-many relationship implies that (Maker, Flavor) is the sole candidate key. Since flavor \u2192 base flavor, the table would not be normalized; in fact, it is an intersection table in violation of 2NF. Further, there are no foreign keys or other tables as they would add nothing for the software design.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"348\" class=\"wp-image-90002\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/a-picture-containing-text-black-scoreboard-scre.png\" alt=\"A picture containing text, black, scoreboard, screenshot\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 2. File sample rows having redundancies \u2013 not in 2NF<\/strong><\/p>\n<p>Denormalization is visible in rows 5 and 6 and in 11 and 12. The non-clustered covering index (added post-population) will give a better query execution plan (query plan). More importantly, it will be matched by a key structure in the software \u2013 I\u2019m tuning the database only to the extent that it will be relevant later.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"447\" height=\"149\" class=\"wp-image-90003\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/graphical-user-interface-text-application-descr.png\" alt=\"Graphical user interface, text, application\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 3. Denormalized intersection table and covering index creation<\/strong><\/p>\n<p>All database work was done in Azure Data Studio connecting to a local copy of SQL Server 2019. The script is in the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/03\/Data-Transformation-Problem.zip\">download<\/a>.<\/p>\n<h2>Solution Space: T-SQL Queries<\/h2>\n<p>The first goal is to create a result set against which the software result can be verified. Adhering to the business rules and requirements, the query constrains on four logical (Boolean-valued) operators, LIKE, IN, AND, and OR:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"401\" height=\"85\" class=\"wp-image-90004\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/text-description-automatically-generated-with-med.png\" alt=\"Text\n\nDescription automatically generated with medium confidence\" \/><\/p>\n<p><strong>Figure 4. Original query that creates rowset for software verification<\/strong><\/p>\n<p>This is the query outcome over the sample rows:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"614\" height=\"145\" class=\"wp-image-90005\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/graphical-user-interface-description-automaticall.png\" alt=\"Graphical user interface\n\nDescription automatically generated with low confidence\" \/><\/p>\n<p><strong>Figure 5. Verification rowset with single-step query plan.<\/strong><\/p>\n<p>The Index Seek operator over the covering index expands the IN() operator, which implies logical OR, into Seek Keys (for a range scan count of 3), and further constrains on the Predicate as expected:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"349\" height=\"192\" class=\"wp-image-90006\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/text-description-automatically-generated.png\" alt=\"Text\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 6. Bottom part of expansion for Index Seek operator<\/strong><\/p>\n<p>All is correct to this point, and I can convert the index into a structure that will be used twice in software. However, the trivial query plan is not as useful as it could be. Relational theory is based on set theory, and the query rewrites next will more directly reflect this. This is important because the transformation algorithms in the software will all be set-based.<\/p>\n<h2>Query Rewrite I<\/h2>\n<p>The second goal, as you recall, is to fashion a query whose query plan will serve as a model for the software design. The trick I\u2019ll employ is to replace the logical connectives OR and AND with their T-SQL equivalent set operators UNION and INTERSECT.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"307\" height=\"184\" class=\"wp-image-90007\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/text-description-automatically-generated-1.png\" alt=\"Text\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 7. Query rewrite using set-based operators<\/strong><\/p>\n<p>It produces the detailed query plan I\u2019m looking for:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"820\" height=\"244\" class=\"wp-image-90008\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/a-screenshot-of-a-computer-description-automatica.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated with medium confidence\" \/><\/p>\n<p><strong>Figure 8. Query plan for Figure 7<\/strong><\/p>\n<p>The outer input uses the same non-clustered covering index and Seek Keys for range scans (scan count remains 3) as the original query but without the Predicate. Instead, each outer row directly joins via <em>Nested Loops (Inner Join)<\/em> on all columns to the rowset produced by the union of the Creamery and base flavor rows (discussed next). The Nested Loops operator, then, accomplishes a row-by-row set intersection.<\/p>\n<p>In the inner input to Nested Loops, the upper <em>Clustered Index Seek<\/em> on the primary key is for the LIKE(% Creamery) condition. Although the value to the LIKE logical operator is a non-SARG, a true index seek range scan (not table scan) is done on each outer row\u2019s (Maker, Flavor) prefix, during which the LIKE Predicate can be applied. The lower Clustered Index Seek parallels the upper, differing only in the predicate condition: BaseFlavor = N\u2018Vanilla.\u2019 Since the Inner Join is done on the superkey \u2013 i.e. all columns \u2013 and since flavor \u2192 base flavor, either lower input branch can produce at most one row. If both produce a row, they must, therefore, be the same row.<\/p>\n<p>The <em>Concatenation<\/em> simply appends one output to the other \u2013 think UNION ALL in T-SQL. To get the T-SQL UNION, then, the <em>Stream Aggregate (Aggregate)<\/em> operator collapses duplicates via grouping (and no aggregate expressions are computed). Specifically, both Concatenation and Stream Aggregate show only the same Output List, which consists of column header Union 1004, Union 1005, and Union 1006, and their rows are returned to the query.<\/p>\n<p>This query plan lends all the information needed to create index simulators and filter and map operations and algorithms in the software. An optimization, though, remains unaddressed \u2013 so almost all.<\/p>\n<h2>Query Rewrite II<\/h2>\n<p>There is no index on the BaseFlavor column, but in the middle subquery from Figure 7 above, BaseFlavor is constrained on value \u2018Vanilla.\u2019 This suggests that the lower-level software might need a data structure(s) and algorithm(s) to support the restriction. To start the analysis, I\u2019ll create this covering index, which parallels that on the Flavor column:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"353\" height=\"58\" class=\"wp-image-90009\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/text-description-automatically-generated-2.png\" alt=\"Text\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 9. Covering index on BaseFlavor key <\/strong><\/p>\n<p>The proposed index has no effect on the query plan in Figure 8 above, nor the original query plan in Figure 5. This is not proof, though, that separate resources are not needed in software to enforce the restriction.<\/p>\n<p>The next illustration shows the subquery run in isolation. A minor point is that the Query Optimizer chooses an Index Seek on the new index in place of an Index Scan using IX_MakerFlavor_Flavor. The central point is that subqueries that form part of a larger, single-statement T-SQL query will be translated into individual software resources that can and will be accessed in isolation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"390\" height=\"190\" class=\"wp-image-90010\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/a-screenshot-of-a-computer-description-automatica-1.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated with medium confidence\" \/><\/p>\n<p><strong>Figure 10. Query in lower half of UNION run in isolation<\/strong><\/p>\n<p>Now rerun the query but this time constrain the \u2018Vanilla\u2019 value by the Flavor column:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"342\" height=\"136\" class=\"wp-image-90011\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/table-description-automatically-generated.png\" alt=\"Table\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 11. Constraining on Flavor column yields two fewer rows<\/strong><\/p>\n<p>This is not an idle thought experiment. The former result set contains two extraneous rows \u2013 column Flavor values being \u2018French Vanilla\u2019 and \u2018Cherry Vanilla\u2019 \u2013 that are not in the result set from Figure 4. The software effort needs a decision as to which constraint to enforce. Whether in queries or software transformations, performance is better served when rows not needed are culled early. The question I posed early on regarding the mismatch between the business rules and requirement can now be restated: Can the restriction on BaseFlavor be replaced by that on Flavor and still be correct?<\/p>\n<p>Constraining on the one side of a many-to-one relationship as does the top query form (BaseFlavor) is not wrong per se but is the root of the performance issue here. A business rule, however, signals the solution. The second states flavor \u2192 base flavor. Given this, and since the query\u2019s set intersection operation removes rows not matching on the Flavor column, the T-SQL can constrain on the many side (Flavor) without affecting correctness. The implication for software is that it obviates the need for special structures \u2013 those for simulating the proposed index to start the section \u2013 and handling (algorithms), as will be made clear in Part II.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"308\" height=\"182\" class=\"wp-image-90012\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/02\/text-description-automatically-generated-3.png\" alt=\"Text\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 12. Query in final form<\/strong><\/p>\n<h2>Testing: A Note on the Sample Data<\/h2>\n<p>Say set A denotes the rows produced by the query that restricts on the three flavor names and set B the rows from the union of the two other queries. Given the sample data, it happens that A B and hence B \u2229 A = A. Therefore, the query on flavor names alone yields the same result as the full query and the knowledge of which can aid testing.<\/p>\n<p>It is important to realize that this outcome is entirely data-dependent. The proper subset relationship may not hold over an operating system file having different data, and the software will make no assumptions regarding this limited analysis. In general, it is never a good idea to deduce business rules from data that can change over time. With that caveat in mind, sample runs done here and in software over <em>these<\/em> <em>rows<\/em> can only boost confidence in results.<\/p>\n<h2><strong>In Conclusion<\/strong><\/h2>\n<p>You saw how business rules and software requirement were mapped to a single intersection relational table in which normalization and referential integrity were nonessential. The initial query on the table provided the rowset against which the software result could be verified, but its query plan was inadequate as a pattern from which the software design could proceed.<\/p>\n<p>Rewrites using T-SQL set operators did yield a query plan as software blueprint and provided other insights as well, chief among them the optimization of the base flavor column.<\/p>\n<h2><strong>Last Word<\/strong><\/h2>\n<p>Neither loops nor branches were used in any of the T-SQL work, but that doesn\u2019t mean they aren\u2019t there. They are \u2013 visible underneath in the query plans, which describe the process steps (operators and data flow) the SQL Server database engine follows to produce results. A salient example of looping is the Nested Loops operator used in the T-SQL rewrite query plan to match each outer row to an inner row, i.e. perform the INTERSECT. This style of coding, in which the logic of the computation is expressed rather than the lower-level control flow detail, is <em>declarative programming<\/em>, and is a major theme of this series. (The latter being imperative programming.)<\/p>\n<p>The next article, which uses functional programming to solve the series\u2019 problem, continues in this vein. The logic written to the map and filter functions encapsulated in the data structures is again compact and elegant, even as the compiler translates it to control flow statements.<\/p>\n<p>Translating insights gained here into functional programming will be straightforward but not paint-by-numbers simple. You may have found this discussion challenging \u2013 I hope stimulating \u2013 and may find <a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/sql-development\/a-data-transformation-problem-in-sql-and-scala-dovetailing-declarative-solutions-part-ii\/\">Part II<\/a> the same as well.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article is an interesting approach to solving a data transformation problem in SQL and Scala. Shel Burkow uses a SQL execution plan for software design.&hellip;<\/p>\n","protected":false},"author":230507,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143531],"tags":[5134],"coauthors":[20272],"class_list":["post-90000","post","type-post","status-publish","format-standard","hentry","category-featured","category-t-sql-programming-sql-server","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/90000","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/230507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=90000"}],"version-history":[{"count":8,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/90000\/revisions"}],"predecessor-version":[{"id":90432,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/90000\/revisions\/90432"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=90000"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=90000"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=90000"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=90000"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}