{"id":86978,"date":"2020-04-21T22:22:06","date_gmt":"2020-04-21T22:22:06","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=86978"},"modified":"2021-05-03T14:04:42","modified_gmt":"2021-05-03T14:04:42","slug":"translating-a-sql-server-schema-into-a-cassandra-table-iii-many-to-many-attribute-closure-and-solution-space","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/nosql\/translating-a-sql-server-schema-into-a-cassandra-table-iii-many-to-many-attribute-closure-and-solution-space\/","title":{"rendered":"Translating a SQL Server Schema into a Cassandra Table: Part III Many-to-Many, Attribute Closure and Solution Space"},"content":{"rendered":"<p><strong>The series so far:<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/nosql-databases\/translating-a-sql-server-schema-into-a-cassandra-table-part-i-problem-space-and-cassandra-primary-key\/\">Translating a SQL Server Schema into a Cassandra Table: Part I Problem Space and Cassandra Primary Key<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/nosql-databases\/translating-a-sql-server-schema-into-a-cassandra-table-part-ii-integrity-constraints\/\">Translating a SQL Server Schema into a Cassandra Table: Part II Integrity Constraints<\/a>\u00a0<\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/nosql-databases\/translating-a-sql-server-schema-into-a-cassandra-table-iii-many-to-many-attribute-closure-and-solution-space\/\">Translating a SQL Server Schema into a Cassandra Table: Part III Many-to-Many, Attribute Closure and Solution Space\u00a0<\/a><\/li>\n<\/ul>\n\n<p>This last in the series continues with an exploration of the many-to-many relationship. Then it returns to an analysis by functional dependency with attribute closure which is important in establishing \u201cuniqueness\u201d in the Cassandra key and other table properties. Finally, the four solutions to the original problem introduced in Part I are given. Note that code used in this article can be found <a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/Cassandra-Part-III-download.7z\">here<\/a>.<\/p>\n<p>First, though, is a basic definition of redundancy.<\/p>\n<h2>Redundancy<\/h2>\n<p>Previous articles gave examples of redundancy. This section is part review and part new information. Download all code used in this article <a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/Cassandra-Part-III-download-1.7z\">here<\/a>.<\/p>\n<p>Structuring in redundancy is inherent to the Cassandra design process. Redundancy between rows in the same partition or between partitions, though, can result in inconsistent data. Still, it is to be managed not avoided. Now an informal definition:<\/p>\n<p><em>Redundancy occurs whenever a determinant of a given value and any of its non-trivial dependents could be in more than one row.<\/em><\/p>\n<p>A trivial dependent set Y has the form X \u2192 Y where Y \u2286 X (reflexivity from Armstrong\u2019s Axioms.)<\/p>\n<p>Recall that form \u2018X \u2192 Y\u2019 establishes set \u2018X\u2019 as determinant, the \u2018\u2192\u2019 as symbol for \u2018functionally determines,\u2019 and set \u2018Y\u2019 as dependent. Whenever rows repeat the exact same values in X columns, the Y columns must be invariant.<\/p>\n<p>This definition is true regardless of whether the determinant and dependent sets are all in the key, all as non-key attributes, or as spanning key and non-key attributes.<\/p>\n<p>(Redundancy, as defined here, is not related to replication, in which a partition and its rows are duplicated and synched over more than one node in the cluster(s).)<\/p>\n<p>As an example, here is a query readout from the CQL shell over a sample table:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"564\" height=\"75\" class=\"wp-image-86979\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-close-up-of-a-sign-description-automatically-ge.png\" alt=\"A close up of a sign\n\nDescription automatically generated\" \/><\/p>\n<p>(Partition columns are red, clustering columns are aqua, and regular columns mauve.)<\/p>\n<p>Figure 5 below is the dependency diagram created for the article series problem (and discussed later). The sample table here is bound by three of its functional dependencies, SC \u2192 G, S \u2192 N, and CS \u2192 P where S: state, C: city, G: city rating, N: state region and P: city population. Of them, only data involving FD S \u2192 N can be duplicated between rows in the same partition. The design is viable and the two rows shown don\u2019t violate the FD \u2013 although they could.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"534\" height=\"60\" class=\"wp-image-86980\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-close-up-of-a-sign-description-automatically-ge-1.png\" alt=\"A close up of a sign\n\nDescription automatically generated\" \/><\/p>\n<p>Erroneous input into a different partition as above, however, causes all three FDs in the table to be violated. As you know from the previous article, there is no mechanism in the Cassandra system to prevent this and finding it with CQL only can be difficult.<\/p>\n<p>FDs can change as requirements are better understood. Say SC \u2192 G becomes ISC \u2192 G, where \u201cI\u201d means \u201crating type;\u201d prepend it to the clustering column list:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"402\" height=\"157\" class=\"wp-image-86981\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p>As with SC \u2192 G, dependency ISC \u2192 G can only be broken in rows spanning partitions, but S \u2192 N as before along with CS \u2192 P now can break in the same partition:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"695\" height=\"76\" class=\"wp-image-86982\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/word-image-59.png\" \/><\/p>\n<p>This sketch applies both to one-to-many and one-to-one relationships \u2013 the functional dependencies as they occur in the unified relation. It also applies, in a way, to the many-to-many relationship as well.<\/p>\n<h2>Many-to-Many<\/h2>\n<p>Not discussed in the series so far is the common case of the many-to-many (m:n) relationship. The restrictions placed on attribute sets for functional dependencies in the primary key \u2013 see section <em>Placement of Functional Dependencies<\/em> in Part II \u2013 don\u2019t apply to key attributes on either side. Consider this simplified ERD snippet:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"579\" height=\"312\" class=\"wp-image-86983\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-picture-containing-text-map-description-automa.png\" alt=\"A picture containing text, map\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 1. Conceptual trial model (Chen notation)<\/p>\n<p>In the unified relation, the \u201csite\u201d one-to-many relationship establishes an FD with trial ID as determinant and courthouse name as dependent.<\/p>\n<p>In contrast, the \u201cparticipation\u201d binary m:n relationship does not establish the m-side or n-side attributes as FDs in the unified relation. The relationship indeed combines two one-to-many relationships. However, they are not mutually independent and cannot be modeled in ERD as two distinct relationships. As a result, the lawyer identifier does not functionally determine trial attributes, nor vice-versa.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"453\" height=\"126\" class=\"wp-image-86984\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-picture-containing-map-description-automaticall.png\" alt=\"A picture containing map\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 2. Not equivalent in meaning to Figure 1 \u201cparticipation\u201d m:n relationship<\/p>\n<p>However, in relational and Cassandra modeling, the two one-to-many relationships are directly applicable. Say \u201clawyer name\u201d and \u201ctrial ID\u201d are both in the Cassandra key. If lawyer Stuart appears K times in combination with different trial IDs in \u201cparticipation\u201d, then pairings of Stuart and trial ID in the key yield K duplicates for each associated Lawyer entity attribute placed as a regular column. The n:1 relationship with the Trial entity is parallel.<\/p>\n<p>Placing identifiers for both lawyer and trial in any part of the key in either order is always correct. They can both be in the partition, and their dependents made static (if clustering columns exist). If one or both is a clustering attribute(s), the second identifier implies expansion in the conceptual row nesting disc layout (see Part I).<\/p>\n<p>For example, take C to mean courthouse, L lawyer and T trial identifiers. Given they are all in the key, restrictions are that C comes before T because T \u2192 C and T and C cannot both be in the partition for the same reason. Functional dependencies and m:n relationships mix easily in the primary key and table. Legal keys, then, include ((C)TL), ((C)LT), ((CL)T) and ((L)CT). Satisfy yourself that every successive clustering attribute in each key expands row nesting.<\/p>\n<p>Any n-ary degree m:n relationship can cause significant redundancy within a table. Say the key is ((S)LCT) where S is the state wherein the courthouse resides (C \u2192 S). If lawyer Rudy has appeared in three courthouses and argued in three trials in each, then needed attributes in a relationship with Rudy appear as regular columns in nine rows. Attributes related to each courthouse are duplicated three times. If Rudy has a role in trials in more than one state, this redundancy is spread over multiple partitions.<\/p>\n<p>This section concludes with a final example in logical and physical design based on Figure 1. Here is the access pattern for which to solve:<\/p>\n<p>Q<sub>n<\/sub>. <em>Find trials started in a time frame by courthouse and lawyer<\/em><\/p>\n<p>Notice first that courthouse and lawyer are in a m:n relationship.<\/p>\n<p>To satisfy the access pattern\u2019s implied range queries, Trial\u2019s non-key attribute \u201cstart date time\u201d \u2013 lower case \u201ct\u201d \u2013 must be a clustering attribute. It must follow lawyer and courthouse IDs in the key since by CQL rules, once a clustering column is sliced by an inequality operator in the WHERE clause, no clustering columns following it in the list may be restricted.<\/p>\n<p>The trial ID (T) could follow t in the key because T \u2192 t. still, it seems that given a lawyer and a point in time, the combination should yield a trial, i.e. the FD Lt \u2192 T. This cannot be deduced using Armstrong\u2019s Axioms. It requires a new business rule: <em>A lawyer can start only one trial at a time<\/em>. Now T must be removed from the key and made a regular column along with its related attributes (precisely because Lt \u2192 T).<\/p>\n<p>Logical key possibilities then are ((C)Lt), ((CL)t) \/ ((LC)t) and ((L)Ct). The first ((C)Lt) would create the largest number of rows on fewest partitions, and the second, ((CL)t), the smallest number on most partitions. These are considerations for the physical stage and somewhat in the logical as well. The decision is made to use key ((L)Ct).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"403\" height=\"232\" class=\"wp-image-86985\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-1.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 3. Table for intermediate-sized partition<\/p>\n<p>Access pattern Q<sub>n<\/sub> can now be satisfied with equality or range queries over time:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"705\" height=\"257\" class=\"wp-image-86986\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-2.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p>This table may be part of a two-pass query, the second retrieving documents and other trial details given the trial ID; refer to the application workflow.<\/p>\n<h2>Attribute Closure: Implications for Cassandra Design<\/h2>\n<p>This section returns to analysis by functional dependencies only.<\/p>\n<p>These next subsections dovetail with techniques for discovering further properties of Cassandra tables and primary key design. The first presents a diagram of functional dependency relationships inferred from the SQL Server normalized physical design in Part I. The second, a code module that references the diagram as input for computing attribute closures given sets of initial attributes.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"621\" height=\"435\" class=\"wp-image-86987\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-3.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 4. Normalized physical design (Part I)<\/p>\n<h3>The Functional Dependency Diagram<\/h3>\n<p>Functional dependencies placed into the unified relation come from the primary and alternate keys, the latter not shown by the SQL Server design tool. One FD not included (the dotted arrow) indicates that realtor location additional data is not relevant to the series problem\u2019s access pattern.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1182\" height=\"775\" class=\"wp-image-86988\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-close-up-of-a-map-description-automatically-gen.png\" alt=\"A close up of a map\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 5. Functional Dependency diagram from normalized schema<\/p>\n<p>I\u2019ve abbreviated attribute names to single letters, intuitive when possible. Herein I\u2019ll often refer to attribute names as their single-letter counterparts.<\/p>\n<p>The arrow recipients in the diagram are the dependents, and the arrows emanate from the determinants. Composite determinants are grouped in boxes; the double arrow implies a (1:1) relationship, I\u2019ve elided redundant arrows in such relationships and so on. Not surprisingly is the double arrow between Y and SC (City# and State\/City names}. (In the SQL Server logical design, the latter was key for the City relation, and in the physical design, the surrogate City# was added as primary key, and the composite was enforced as alternate key). Follow the arrows to see the partial and transitive dependencies.<\/p>\n<p>By visual inspection alone, you know that L, EZ and H functionally determine, directly or transitively, all other attributes. As each is irreducible, each is a candidate key. The diagram is an important artifact. What is needed, though, is an additional tool for proving certain properties of key and table design choices.<\/p>\n<h3>The Attribute Closure Module<\/h3>\n<p>The elementary code module below does a closure over an initial set of attributes in the unified relation given the relation\u2019s set of functional dependencies. Its purpose is to determine which key attribute additions make for \u201cuniqueness,\u201d and which attributes become static or regular attributes in different Cassandra key design choices. This analysis is essential for larger, more complex relations.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"803\" class=\"wp-image-86989\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-4.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 6. Attribute closure function<\/p>\n<p>I wrote the solution in Scala. If you\u2019ve done C# or C++ or Java or other imperative\/object-oriented language, the algorithm shouldn\u2019t be difficult to follow. The one twist is that Scala also fully supports functional programming. This style allows functions to appear wherever a variable could. Thus, I\u2019ve nested the \u201cgo\u201d function inside function \u201cattributeClosure\u201d rather than placing it separately as a private function in the wrapper class (or object) where it most certainly wouldn\u2019t be reused by other methods.<\/p>\n<p>Essentially, each recursive invocation of the inner function uses set operations to glean the next set of dependent attributes implied by two structures. The first structure is the invariant set of FDs (\u201cfdependencies\u201d) as passed into the outer function. Second, the inner function\u2019s parameter (\u201cattrClosureSet\u201d): the set of already captured attributes, acting as the source of possible determinant combinations. This set is initialized from parameter \u201cclosureAttrs\u201d.<\/p>\n<p>At each recursive step, if new (dependent) attributes are added to the closure, iteration continues. If not, the inner function returns the closure to the outer. The outer then returns a two-tuple consisting of the closure plus the Boolean decision as to whether the initial attribute set attrClosure is a superkey for the relation.<\/p>\n<p>The dependencies input, \u201cfdependencies\u201d, mirrors the dependency diagram (Figure 5):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"507\" height=\"361\" class=\"wp-image-86990\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-5.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 7. Sample structuring of functional dependencies in Scala<\/p>\n<p>The input dependencies for this implementation are irreducible, meaning it meets these criteria: a) right-irreducible: each dependent is a singleton; b) left-irreducible: each determinant set is minimal; and c) there are no trivial FDs (A \u2192 subset(A)) or FDs derivable from other FDs. In this implementation, c) is waivable. For any implantation, b) is required.<\/p>\n<p>Beyond the design stage, functional diagrams and this algorithm can be instrumental in tracking down FD data errors in tables, e.g. via a Spark application.<\/p>\n<p>Sample run readouts are deferred for the next section.<\/p>\n<h2>Solution Space<\/h2>\n<p>The series began with the stated goal of transforming a normalized SQL Server schema (Figure 4) into a Cassandra table designed for this access pattern:<\/p>\n<p>Q<sub>k<\/sub>\u00a0<em>Find listings for a realtor company in a city<\/em><\/p>\n<p>Realtors and cities are in a m:n relationship, so their ordering in the key doesn\u2019t matter.<\/p>\n<p>The solutions can reference the results of applying the FD set in Figure 5 and initial attribute sets to the closure function in Figure 4.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"649\" height=\"279\" class=\"wp-image-86991\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-6.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 8a. Sample closure runs (eclipse software)<\/p>\n<p>Incidentally, Scala runs in PowerShell; see readme.txt file included in the downloads for instructions on installing and using Scala.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"586\" height=\"371\" class=\"wp-image-86992\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-7.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 8b. Sample closure runs<\/p>\n<p>I\u2019ll repeat a central point. The closure is an important method for deciding whether and which attributes need be appended to the primary key for \u201cuniqueness\u201d: i.e. to cover all attributes using minimal collections. This is not the same as making the key unique; the Cassandra key is always unique. Collection types make this possible. If the key were R, e.g., its singleton row would have collections structured over combinations of CS-Y and L and their related attributes. You saw this in Part II.<\/p>\n<p>To satisfy the access pattern, RSC or RY \u2013 not both! \u2013 must be in the key as search attributes. There are now two decisions to make. The first, as you expect: is it RSC or RY? In consultation with the analytics team, secondary stakeholders to this OLTP database, they would prefer RSC as it simplifies querying over RS by itself.<\/p>\n<p>The second concerns selecting partition attributes. In the logical model, R by itself is sufficient; in the physical, assume the row count and size are manageable and performant, so the base key remains ((R)SC). Closure results indicate, among other conclusions, that attributes T, B and D must be static.<\/p>\n<p>This discussion ends what has been considerable background, except for one more tool.<\/p>\n<h3>Design A: The Minimal Superkey<\/h3>\n<p>As per the closure, L is a relational candidate key. Could it be the partition and full primary key (no clustering attributes) for its singleton row while satisfying the access pattern?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"181\" height=\"456\" class=\"wp-image-86993\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-8.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 9. Possible logical model from the KDM<\/p>\n<p>The Kashlev Data Modeler (KDM) is a tool for automating Apache Cassandra logical and physical designs (see references). It produced the logical design above given key L and an ERD. KDM uses Chebotko symbols to identify Cassandra column types. The \u201cK\u201d is for partition column.<\/p>\n<p>Here is the actual table definition in CQL I devised (not from KDM):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"429\" height=\"476\" class=\"wp-image-86994\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-text-description-automatically-ge.png\" alt=\"A screenshot of text\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 10. Physical design featuring minimal superkey and UDTs<\/p>\n<p>This is the query showing the first three rows returned (almost):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"601\" height=\"184\" class=\"wp-image-86995\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/word-image-60.png\" \/><\/p>\n<p>Well, the CQL shell tried. The error message with terms \u201cdata filtering\u201d and \u201cunpredictable performance\u201d means this: The system can\u2019t initially use the WHERE clause predicate conditions but must visit every singleton row in every partition \u2013 essentially, a table scan. Only then can It filter out rows not meeting the (RSC) search conditions. Depending on several factors, the query may involve many or most or even all cluster nodes, potentially vitiating the excellent performance of which Cassandra is capable. Then again, performance may be fine \u2013 just a warning and you can append the ALLOW FILTERING clause. Still, it\u2019s an unacceptable warning for the main access pattern for which a table was designed.<\/p>\n<p>Secondary indexes help in some cases but not this one. In short, the table is too highly transactional, and the ALLOW FILTERING clause would still be required. I placed indexes on all RSC columns and Cassandra did use the one with the highest selectivity \u2013 the one on the city name:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1040\" height=\"884\" class=\"wp-image-86996\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-9.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p>Among several examples of data duplication here is owing to dependency CS \u2192 GP (city to rating and population), or alternatively Y \u2192 GP (city UUID to same). Thus (San Francisco. California) yields (4.7, 884,363) for every listing row in this city in different partitions. This particular redundancy is better managed in the remaining solutions, which build from the base ((R)SC).<\/p>\n<p>I stress again that the major problem is the table scan. Regardless, you may have concluded that the candidate key approach given Q<sub>k<\/sub> would work better later in the application workflow with a different access pattern. There, its attributes can be more specific to the listing such as a video, floorplan, and information about schools.<\/p>\n<h3>Design B: Key Sans Uniqueness<\/h3>\n<p>In this solution, the key is ((R)SC). The redundancy owing to CS \u2192 GP is eliminated. However, as per closure results (Figure 8), it can\u2019t reach all attributes \u2013 specifically, those for the listing. Since the listing identifier functionally determines RSC, the solution requires placing the listing identifier and its directly reachable attributes coalesced into a collection:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"449\" height=\"398\" class=\"wp-image-86997\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-social-media-post-description-a.png\" alt=\"A screenshot of a social media post\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 11. Physical design where key lacks uniqueness column<\/p>\n<p>Read this to mean that a realtor company in a city has zero or more listings.<\/p>\n<p>Notice that given R \u2192 BDT and ((R)SC), Design B introduces static (UDT) column \u201caddress\u201d (Figure 10). The static designation ensures that address is not repeated into each row in a partition. It further enforces the first rule of functional dependencies as introduced in Part II: a determinant (R) and its dependents (BDT) can never appear together in the partition key.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"741\" height=\"296\" class=\"wp-image-86998\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-close-up-of-a-sign-description-automatically-ge-2.png\" alt=\"A close up of a sign\n\nDescription automatically generated\" \/><\/p>\n<p>The query result shows listings for a single row. For the testbed, the design collapses six individual listings into one row. A realistic example could have hundreds of listings. In querying, the client potentially can access numerous rows. The size of the return set (each row listing collection being indivisible), the highly transactional nature of the collection and other factors render this solution unsuitable.<\/p>\n<p>The recommendation is to use collections when data is smaller, bounded and need little updating. Think, as a simple example, book authors in a list.<\/p>\n<h3>Design C: The Non-Minimal Superkey<\/h3>\n<p>More to standard is appending the listing identifier as uniqueness attribute to make ((R)SCL); logical and physical designs are as follows:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"542\" height=\"457\" class=\"wp-image-86999\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-10.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p class=\"caption\">Figure 12. Recommended designs for access pattern Q<sub>k<\/sub><\/p>\n<p>Continuing Chebotko logical notation, the C\u2191 is clustering column ascending order, and the S is static.<\/p>\n<p>By constraining on the prefix of key columns short of listing ID, the query satisfies Q<sub>k<\/sub> as the first three rows returned from the sample query show:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"945\" height=\"889\" class=\"wp-image-87000\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-11.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p>Listings that were compressed in Design B are now each in their own row. No less information is retrieved \u2013 in fact, the design accounts for more redundancy as seen in regular columns for state and city. Unlike B, however, row modifications in this highly transactional table are simpler and safer, and clients can take advantage of paging of listings over realtor-city pairings.<\/p>\n<h3>Design D: The Non-Minimal Superkey with Additional Ordering<\/h3>\n<p>Sometimes a table is correctly designed but still falls short of needs. The problem, then, is the access pattern itself: it needs a fuller description.<\/p>\n<p>Perhaps searching on or organizing data by zip codes, or range querying over asking price is also desired. Say the access pattern is altered thus:<\/p>\n<p>Q<sub>k<\/sub>\u00a0<em>Find listings for a realtor company in a city sorted from least to highest asking price<\/em><\/p>\n<p>Starting from Design C, pull the asking price out of the listing standard UDT (Figure 10) and wedge it between the search and uniqueness columns: ((R)SCAL):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"602\" height=\"448\" class=\"wp-image-87001\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-12.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 13. Logical and physical designs with ordering on asking price<\/strong><\/p>\n<p>Client software can now page through the table already sorted by price, or do a range query as seen here:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"866\" height=\"632\" class=\"wp-image-87002\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/04\/a-screenshot-of-a-cell-phone-description-automati-13.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" \/><\/p>\n<p>Always think conceptual row nesting on disc. As a general rule, inserting a new attribute(s) into the clustering list must expand row possibilities for key attributes defined before it. You may want to review Part I for specific rules on clustering column type and placement.<\/p>\n<h2>Last Word<\/h2>\n<p>Relational SQL Server and NoSQL Cassandra are meant for different use cases. As such, there may be more differences than similarities between them. You\u2019ve seen a number of differences, including rules for key design and the ability to enforce integrity constraints. I won\u2019t list them or introduce others here, but I will leave you with one final thought.<\/p>\n<p>If conventional wisdom says that the relational model and theory has no valuable insights for NoSQL systems, do you (still) believe it?<\/p>\n<h2>References<\/h2>\n<p>J. King, \u201cDS220: Data Modeling,\u201d <a href=\"https:\/\/academy.datastax.com\/resources\/ds220-data-modeling\">https:\/\/academy.datastax.com\/resources\/ds220-data-modeling<\/a>, a DataStax Academy online video series course on Apache Cassandra.<\/p>\n<p>DataStax Documentation, \u201cCQL for DataStax Enterprise 6.7,\u201d <a href=\"https:\/\/docs.datastax.com\/en\/dse\/6.7\/cql\/cql\/cql_using\/queriesTOC.html\">https:\/\/docs.datastax.com\/en\/dse\/6.7\/cql\/cql\/cql_using\/queriesTOC.html<\/a>, primary key and querying.<\/p>\n<p>J. Carpenter, E. Hewitt, <em>Cassandra The Definitive Guide Distributed Data at We Scale<\/em>, 2nd ed. Sebastopol: O\u2019Reilly Media Inc., 2016.<\/p>\n<p>A. Kashlev, \u201cThe Kashlev Data Modeler,\u201d <a href=\"http:\/\/kdm.dataview.org\/\">http:\/\/kdm.dataview.org\/<\/a>, an online automated tool for developing Apache Cassandra logical and physical designs.<\/p>\n<p>DataStax, \u201cData Modeling in Apache Cassandra 5 Steps to an Awesome Data Model,\u201d <a href=\"https:\/\/www.datastax.com\/sites\/default\/files\/content\/whitepaper\/files\/2019-10\/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf?mkt_tok=eyJpIjoiWmpnMk4yWTVaR0l6WTJOaiIsInQiOiJWTFQyYW5LRkljSWt2Q28xSDR5RXBBNmNlM1pLbThFTFFwSHJLd0E1NXdYM2VmbHJaam1HelFSSThCYW1JRVZVSE9uM1YzTmYzbDVTUlB5c3RESlNyTTJYMSs3bE90QlNiVmRrZkg2ZDZINlYrS1MralJ1Zzh2K1pKNVZ0K0g2aSJ9\">https:\/\/www.datastax.com\/sites\/default\/files\/content\/whitepaper\/files\/2019-10\/CM2019236%20-%20Data%20Modeling%20in%20Apache%20Cassandra%20%E2%84%A2%20White%20Paper-4.pdf?mkt_tok=eyJpIjoiWmpnMk4yWTVaR0l6WTJOaiIsInQiOiJWTFQyYW5LRkljSWt2Q28xSDR5RXBBNmNlM1pLbThFTFFwSHJLd0E1NXdYM2VmbHJaam1HelFSSThCYW1JRVZVSE9uM1YzTmYzbDVTUlB5c3RESlNyTTJYMSs3bE90QlNiVmRrZkg2ZDZINlYrS1MralJ1Zzh2K1pKNVZ0K0g2aSJ9<\/a>, White paper .pdf.<\/p>\n<p>C.J. Date, <em>An Introduction to Database Systems<\/em>, 8th ed. Boston: Pearson Education Inc., 2004, esp. Chapter 11 Functional Dependencies.<\/p>\n<p>D.R. Howe, <em>Data Analysis for Database Design<\/em>, 3rd ed. Oxford: Butterworth-Heinemann, 2001, esp. Chapters 9 Properties of relationships and 10 Decomposition of many:many relationships.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the third and final article of the series, Shel Burkow demonstrates several designs that cover many-to-many relationships, attribute closure, and solution space with Cassandra tables.&hellip;<\/p>\n","protected":false},"author":230507,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143535],"tags":[95509],"coauthors":[20272],"class_list":["post-86978","post","type-post","status-publish","format-standard","hentry","category-featured","category-nosql","tag-standardize"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/86978","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/230507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=86978"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/86978\/revisions"}],"predecessor-version":[{"id":87008,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/86978\/revisions\/87008"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=86978"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=86978"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=86978"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=86978"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}