{"id":98869,"date":"2024-09-09T18:14:58","date_gmt":"2024-09-09T18:14:58","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=98869"},"modified":"2024-07-26T17:52:55","modified_gmt":"2024-07-26T17:52:55","slug":"uniqueness-keys-and-identity","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/theory-and-design\/uniqueness-keys-and-identity\/","title":{"rendered":"Uniqueness, Keys, and Identity"},"content":{"rendered":"<p>If you\u2019ve ever had a traditional logic course, you\u2019ll have run into \u201cThe Law of Identity\u201d as the founding principle of all Western thought. It says that:<\/p>\n<p>\u201cTo be is to be something in particular; to be nothing in particular or anything in general, is to be nothing at all.\u201d<\/p>\n<p>This usually gets expressed as \u201cA = A\u201d thanks to Ayn Rand\u2019s <a href=\"https:\/\/aynrandsociety.org\/resources\/overview-of-objectivism\/\">Objectivism<\/a> and credited to Aristotle. You want to dig a little deeper, however, it actually predates Aristotle\u2019s laws of logic.<\/p>\n<p>More to the point, this is why we have keys in the relational data model and why those keys give an identity to each row in a table which models an entity in a set.<\/p>\n<p>Later <a href=\"https:\/\/plato.stanford.edu\/entries\/leibniz\/\">Gottfried Leibniz<\/a>, whom you might remember as the \u201cother guy\u201d who invented calculus, held that Entity <strong>x<\/strong> is the same as Entity <strong>y<\/strong> if and only if every predicate true of <strong>x<\/strong> is true of <strong>y<\/strong> as well. What it means for practical purposes for us RDBMS people is that you should not have two or more tables with exactly the same structure. One set of entities should have one and only one table to model it.<\/p>\n<p>This can be a little more subtle than you might think. There is a design flaw called <strong>attribute splitting<\/strong>, in which has defined extra tables based on the values of an attribute. For example, you would never define your primary human resources table based on the type of phone they use: <code>iPhone_Using_Persons<\/code> and another table for <code>Android_Using_Persons<\/code>. The type of phone an employee uses is clearly an attribute and this design have splits what should be one table into two based on the value of an attribute, and not on attribute itself. And then what if another phone type emerges? New table?<\/p>\n<p>There are caveats of course. You might need a table such as this to record specific attributes of that person, but that is a different topic. It would not be the main Persons table.<\/p>\n<h2>Concept of Keys in RDBMS<\/h2>\n<p>In RDBMS, we establish uniqueness of the rows in the table by using a key. A key is a subset of the columns of that table which is unique in that table. Please note that by definition, a key cannot be a table property, like <code>IDENTITY<\/code> or a <code>GUID<\/code>. A key must be made up of attributes, which means actual table columns.<\/p>\n<p><code>IDENTITY<\/code> is a table property, much like the parking space number in a garage. It is not an attribute of the entity being modeled by the table (like the VIN number burned into the engine block), but a value derived from the<em> physical storage <\/em>of the data. Trust me, no automobile, employee, piece of merchandise, or anything else you\u2019re trying to represent with the table came with some <code>IDENTITY<\/code> tattooed on it or as an innate attribute. Just different parking garages can have spaces with the same numbers, so totally unrelated tables (even in different schemas!) can have the same value in an <code>IDENTITY<\/code> column.<\/p>\n<p>There\u2019s only one <code>IDENTITY<\/code> allowed per a table. If it were an attribute, it might be possible for an entity to have more than one of that same kind of attribute. For example, a table that models automobile trips can have ZIP codes for both the departure and arrival locations. Attributes also don\u2019t have to be all the same data type, but the <code>IDENTITY<\/code> column is always a whole number (typically using an integer type of some sort, but can also use a numeric type.) The Microsoft implementation is also issued sequentially. This sequence is based on the order in which the physical data is inserted into the physical table. This is why I like to use that parking garage analogy.<\/p>\n<p>The bad news is that I can take the same raw data, declare the same table structure to model the same entity on two different machines, shuffle the input a little bit, and then do one or more <code>INSERT INTO<\/code> statements. The <code>IDENTITY<\/code> columns which are supposed to represent attributes of the same entity in each copy of the same data will be different. <code>IDENTITY<\/code> is not an intrinsic property of the entity, but of the storage!<\/p>\n<p>The <code>IDENTITY<\/code> feature exists because the first SQL engines were built on top of existing hardware and software. UNIX file systems were based on magnetic tape files which have to have sequential access by their physical nature. Indexing came along a little bit later and was put on top of this file model. The world might have been different if file systems had started off with hashing, but that would be rather complicated.<\/p>\n<h2>Dr. Codd and The Primary Key<\/h2>\n<p>In the original papers of RDBMS, Dr. Codd introduced the concept of keys and in particular the concept of a \u201cprimary key\u201d which was dropped later. Keys were a part of the relational model from the start, but the primary key was supposed to be special in some way. And there was even a convention of underlying the column names that make up the primary key in the data declarations in the early papers. This was a holdover from sequential file days. Obviously, a physical file can only be sorted one way. File searches, particularly those on mag tape, depended on having a sorted order.<\/p>\n<p>By definition, keys had to be all non-<code>NULL<\/code> values; remember that a <code>NULL<\/code> is not technically a value. But someone looked around and realized that the relational model was based on sets, which have no ordering. All keys have equal \u201ckey-ness\u201d by their nature, and nobody is special.<\/p>\n<p>Another construct in SQL was the <code>UNIQUE<\/code> constraint. It explains itself, but it also permits a single <code>NULL<\/code> value in each column. Remember that SQL has two models for comparisons. The first method is regular equality which compares all the <code>NULL<\/code>s as <code>UNKNOWN<\/code> with any values or other <code>NULL<\/code>s.<\/p>\n<p>The second method puts all the <code>NULL<\/code>s Into a single group for purposes of the GROUP BY and a few other minor operators. We currently have added the infixed comparison operator:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">IS [NOT] DISTINCT FROM<\/pre>\n<p>which uses the grouping method, not equality. This means that it always returns <code>TRUE<\/code> or <code>FALSE<\/code>, so we don\u2019t have to worry about comparing a <code>NULL<\/code> to a value or another <code>NULL<\/code>, and getting an <code>UNKNOWN<\/code> result.<\/p>\n<p>Basically, <code>PRIMARY KEY (c1,c2,..cn) \u21d2 UNIQUE (c1,c2,..cn) NOT NULL<\/code>. In fact, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Backus%E2%80%93Naur_form\">BNF<\/a> for these two constructs are part of the same production in the SQL grammar.<\/p>\n<p>A foreign key is not a key in the referencing table but is it a key in a second (possibly the original) referenced table. The correct terms in SQL and RDBMS are <em>not \u201cparent\u201d and \u201cchild\u201d <\/em>those terms <em>are <\/em>how we used, to identify records in network databases. The correct terms in RDBMS are \u201creferenced\u201d and \u201creferencing\u201d tables. Typically, in network databases, such relationships were implemented with pointer chains. Find an old timer and ask him about IDMS, TOTAL, Image, IMS and other early database products<\/p>\n<h2>Key Subtypes<\/h2>\n<p>There are a couple of terms that that it can be helpful to understand when dealing with key as there are two typical subtypes that are commonly discussed: simple and compound keys.<\/p>\n<h3>Simple Keys<\/h3>\n<p>A simple key is simply one column. this makes life easier simply by making your coding shorter. In most commercial databases, that one column is probably going to be some kind of identifier, such as a Vehicle Identification Number (VIN), International Standard Book Number (ISBN), UPC or some other industry code. Within that identifier you\u2019re going to get worried about syntax, check digits, and verification procedures appropriate in that industry. Simple keys are not always that simple internally.<\/p>\n<h3>Compound Keys<\/h3>\n<p>A compound key is made up of two or more columns. Generally speaking, the relationship among the columns in the key can be as coordinates or as a hierarchy. The columns in a coordinate relationship are independent of each other; the most obvious example of this would be (Longitude, Latitude) pairs. This can be generalized to {c1, c2, .. cn}.<\/p>\n<p>Another type of compound key is hierarchical. Another obvious example would be cities within states {city_name, state_code}, so we might have things like (\u2018Montgomery\u2019, \u2018OH\u2019), (\u2018Springfield\u2019, \u2018OH\u2019) and (\u2018Springfield\u2019, \u2018IL\u2019). Please note that the <em>combination of columns<\/em> in a hierarchical compound key is what must be unique, but the components can have duplicates. Did you ever wonder why television shows use \u201cSpringfield\u201d as a town name so often? It\u2019s the most common town name in the United States, so you can leave the audience guessing in what state your show is taking place.<\/p>\n<p>It is also possible to have a compound key that has both hierarchical and coordinate relationships among its columns.<\/p>\n<h4>Overlapping Compound Keys<\/h4>\n<p>Overlapping keys can be either hierarchical, coordinates, or compound. Their defining characteristic is that the two or more overlapping keys share columns in their declarations. One example of this might be different political districts drawn on the same geography. For example, in Texas, school districts, water districts, political districts and other administrative districts all overlap in weird ways. People living across the street from each other and find that they have to vote for different school board members and other services.<\/p>\n<h2>FOREIGN KEY and REFERENCES Clauses<\/h2>\n<p>There is a data integrity feature that I wish more people would use. It is the <code>REFERENCES<\/code> clause. It says, in effect, that the columns mentioned in this are not really in this table and column but are <em>referenced<\/em> here. They are foreign keys which live somewhere else. It has to be a key so that a reference doesn\u2019t go to more than one value. Here is a skeleton:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">CREATE TABLE Persons\n(employee_id CHAR(10) NOT NULL PRIMARY KEY\n --,..\n);\n\nCREATE TABLE Bowling_Teams\n(bowling_team_name CHAR(20) NOT NULL,\n player_id CHAR(10) NOT NULL\n REFERENCES Personnel (employee_id)\n --,..\n);\n\nCREATE TABLE Softball_Teams\n(softball_team_name CHAR(20) NOT NULL,\n player_id CHAR(10) NOT NULL\n REFERENCES Persons (employee_id)\n --,..\n);<\/pre>\n<p>Note, please, that it\u2019s possible to also declare <code>PRIMARY KEY (bowling_team_name,  player_id)<\/code> or <code>UNIQUE (bowling_team_name,  player_id) <\/code> or completely skip having any \u201cunique-ifier\u201d on the <code>Persons<\/code> table (and likewise on the <code>Bowling_Teams<\/code> table too).<\/p>\n<p>In this skeleton example, we are trying to enforce business rules which say only one of our personnel can play on our company teams. You can also define a <code>FOREIGN KEY<\/code> constraint on multiple columns, with the following SQL syntax:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\"> CONSTRAINT &lt;constraint name&gt; \n      FOREIGN KEY (&lt;referencing column list&gt;\n      REFERENCES &lt;referenced table name&gt;\n                  (&lt;referenced column list&gt;).<\/pre>\n<p>This detaches the constraint from a single column in the referencing table. This syntax lets you name multi-column foreign keys so you can handle them as a single unit of work. All the references had to reference what was actually there, so you couldn\u2019t just drop or alter the referenced columns and break the relationship. You have to change the referencing columns and then you could drop or alter the referenced columns.<\/p>\n<p>As a note, the only way to do this kind of references before constraints was with <code>TRIGGER<\/code> objects.<\/p>\n<p>Relational people tend to hate <code>TRIGGER<\/code> objects because they are procedural code and we want a declarative language. We looked at what was actually being done with the references, we found that 95% or more of the procedural code did only a few tasks (often implementing relationships. This led us to add declarative referential integrity (DRI) clauses to the language which go at the end of the foreign key reference clause. Here\u2019s a quick summary. For more on triggers and my disdain for them, I wrote <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/triggers-threat-menace\/\">this piece<\/a> for Simple Talk back in 2016.<\/p>\n<h3>Declarative Referential Integrity (DRI) Actions<\/h3>\n<p>In this section, I will introduce the different DRI actions that you have to choose from when building your foreign key references objects. When you create a constraint, as shown, we have syntax as shown in the previous section with the one addition:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CONSTRAINT &lt;constraint name&gt; \n      FOREIGN KEY (&lt;referencing column list&gt;\n      REFERENCES &lt;referenced table name&gt;\n                  (&lt;referenced column list&gt;) [DRI ACTION].<\/pre>\n<p>The <code>DRI ACTION<\/code> clause which will tell you what will happen when references rows are removed. In the following sections I will cover the action types. It is optional and the default will be covered in the first subsection.<\/p>\n<h4>ON [DELETE | UPDATE] NO ACTION<\/h4>\n<p>This is the default behavior if <code>ON DELETE<\/code> or <code>ON UPDATE<\/code> is not specified. If an attempt is made to delete referenced rows that are currently referenced, the database will raise an error and the delete will be rolled back.<\/p>\n<h4>ON [DELETE | UPDATE] RESTRICT<\/h4>\n<p>For the majority of relational database systems, <code>RESTRICT<\/code> is equivalent to <code>NO ACTION<\/code>. Where <code>RESTRICT<\/code> is specifically implemented by a database system, the difference is that <code>RESTRICT<\/code> results in the constraint being checked at the start of processing instead of at the end, which is the case with <code>NO ACTION<\/code>.<\/p>\n<h4>ON [DELETE | UPDATE] SET NULL<\/h4>\n<p>If referenced row is deleted, the referencing values are updated to <code>NULL<\/code>. Obviously, the referencing column have to be <code>NULL<\/code>-able. This option is not used often in practice because the referenced columns are most often primary keys or unique constraints being used to define keys. People tend to avoid <code>NULLs<\/code> in such constructs.<\/p>\n<h4>ON [DELETE | UPDATE] SET DEFAULT<\/h4>\n<p>When referenced rows are deleted or updated, the foreign keys are changed to their <code>DEFAULT<\/code> value. Obviously, the referenced columns must either have a <code>DEFAULT<\/code> value specified, or be <code>NULL<\/code>-able, in which case <code>NULL<\/code> is considered the <code>DEFAULT<\/code> value. Any non-<code>NULL<\/code> <code>DEFAULT<\/code> value must be legal in the referencing table. You can think of <code>NULL<\/code> as the universal default default.<\/p>\n<h4>ON [DELETE | UPDATE] CASCADE<\/h4>\n<p>This is the workhorse. Let\u2019s break it down into both options starting with <code>ON DELETE CASCADE. <\/code>When a reference row is deleted, all the referencing rows are also deleted. This means that if we fire employee #123, then he or she is automatically removed from the bowling team and the softball team, without us having to invoke separate <code>DELETE FROM<\/code> statements. But more than that, cascade can keep going!<\/p>\n<p>I can build a network of cascades, so that when the employee is removed from the Personnel table, they are automatically removed from the <code>Softball_Teams<\/code> table, removing them from the softball team table then can cascade to the \u201c<code>Beer and Pizzas<\/code>\u201d table that was referenced by the <code>Softball_Teams<\/code>, and so forth for several levels. You might want to look at how many levels your SQL engine can handle before you get too deep in the nesting. The engine must remember all of this, just in case there\u2019s a <code>ROLLBACK<\/code>.<\/p>\n<p>The <code>ON UPDATE CASCADE<\/code> behaves pretty much the same way. An update to the key in the referenced table is propagated to the referencing tables automatically. Again, if you build chains of references, then the system has to remember the previous values in case of a <code>ROLLBACK<\/code>.<\/p>\n<h3>Self Referencing Constraints<\/h3>\n<p>If you want to create a table with a self-referencing <code>FOREIGN KEY<\/code> a column, you can do this in one <code>CREATE TABLE<\/code> statement in SQL Server.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE TABLE Personnel\n(emp_id CHAR(10) NOT NULL PRIMARY KEY,\n emp_name VARCHAR(20),\n manager_emp_id CHAR(10)  --this is NULL-able\n CONSTRAINT ref_mgr_emp_id\n    FOREIGN KEY (manager_emp_id) \n        REFERENCES Personnel(emp_id))<\/pre>\n<p>Please ignore the fact that this is a generally bad design. Being somebody\u2019s manager is not an attribute of the employee; it\u2019s a relationship among employees. But the important point is that before you can reference something it has to exist. In SQL, a <code>CREATE TABLE <\/code>statement occurs all at once as a transaction.<\/p>\n<p>That means its columns all appear the same time. If you\u2019re interested. Standard SQL has a <code>CREATE SCHEMA<\/code> statement that brings everything into existence all at once. This means that some circular references can be constructed and a lot of other things I will not bother with here.<\/p>\n<p>All of this can get a little tricky. Let\u2019s say I want to create a history table such that the start date of one event is guaranteed to be a termination date of a prior event. Basically, were building a temporal chain. The problem is getting this started in SQL. When I insert my initial event, the preceding event column has nothing to reference! Even the initial event doesn\u2019t exist yet because were trying to insert it.<\/p>\n<p>The solution is to turn off the constraints, insert the initial event, and then turn them back on again. So, the second event can reference back to the first event. You can do this in SQL Server with some techniques developed by Alex Kuznetsov or in ANSI\/ISO Standard SQL some ugly looking syntax features which control when a constraint is applied.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE TABLE Events\n(\n  event_id CHAR(10) NOT NULL,\n  previous_event_end_date DATE NOT NULL\n  CONSTRAINT FK_Chained_Dates\n    REFERENCES Events (event_end_date),\n  event_start_date DATE NOT NULL,\n  -- NULL means event ongoing\n  event_end_date DATE UNIQUE,\n  PRIMARY KEY (event_id, event_start_date),\n  --Start must be less or equal to end\n  CONSTRAINT Event_Order_Valid\n     CHECK (event_start_date &lt;= event_end_date),\n  --previous end must be one day before start date\n  CONSTRAINT CK_Chained_Dates\n     CHECK (DATEADD(DAY, 1, previous_event_end_date) \n                          = event_start_date)\n-- alternative solution\n-- CHECK (previous_event_end_date + INTERVAL '01' DAYS) \n-- = event_start_date)\n);\n\n-- disable the Chained_Dates constraint\nALTER TABLE Events NOCHECK CONSTRAINT FK_Chained_Dates;\n\n-- insert a starter row\nINSERT INTO Events(event_id, previous_event_end_date, \n                   event_start_date, event_end_date)\nVALUES ('Foo Fest', '2010-01-01', \n        '2010-01-02', '2010-01-05');\n\n-- enable the constraint in the table\nALTER TABLE Events CHECK CONSTRAINT FK_Chained_Dates;\n\n-- this works\nINSERT INTO Events(event_id, previous_event_end_date, \n                   event_start_date, event_end_date)\nVALUES ('Glob Week', '2010-01-05', \n        '2010-01-06', '2010-01-10');\n\n--Fails because the previous_event_end date is not in the table\n--AND fails because the previous_end_date is not 1 day before the \n--event_start_date\nINSERT INTO Events(event_id, previous_event_end_date, \n                   event_start_date, event_end_date)\nVALUES ('Snoob', '2010-01-09', \n        '2010-01-11', '2010-01-15');<\/pre>\n<p>Starting to see why I\u2019ve been saying that most of the hard work in SQL is done in the DDL and not in the DML?<\/p>\n<h2>Conclusion<\/h2>\n<p>Much like so much of my advice on employing DDL to protect data, using <code>PRIMARY KEY<\/code>, <code>UNIQUE<\/code>, <code>FOREIGN KEY<\/code>\/<code>REFERENCES<\/code> type objects is still a great way to protect the integrity of your data rather than attempting to do this on your own using SQL code, TRIGGER objects, or though application code.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019ve ever had a traditional logic course, you\u2019ll have run into \u201cThe Law of Identity\u201d as the founding principle of all Western thought. It says that: \u201cTo be is to be something in particular; to be nothing in particular or anything in general, is to be nothing at all.\u201d This usually gets expressed as&#8230;&hellip;<\/p>\n","protected":false},"author":96214,"featured_media":103278,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143539],"tags":[159072],"coauthors":[6781],"class_list":["post-98869","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-featured","category-theory-and-design","tag-database-theory"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98869","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/96214"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=98869"}],"version-history":[{"count":9,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98869\/revisions"}],"predecessor-version":[{"id":103279,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98869\/revisions\/103279"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/103278"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=98869"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=98869"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=98869"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=98869"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}