{"id":98080,"date":"2023-09-28T18:37:41","date_gmt":"2023-09-28T18:37:41","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=98080"},"modified":"2023-08-23T19:19:23","modified_gmt":"2023-08-23T19:19:23","slug":"t-sql-fundamentals-controlling-duplicates","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/t-sql-fundamentals-controlling-duplicates\/","title":{"rendered":"T-SQL Fundamentals: Controlling Duplicates"},"content":{"rendered":"<p>When people start learning a new field, for example T-SQL, it\u2019s tempting to spend very little time trying to understand the fundamentals of the field so that you can quickly get to the advanced parts. You might think that you already understand what certain ideas, concepts and terms mean, so you don\u2019t necessarily see the value in dwelling on them. That\u2019s often the case with newcomers to T-SQL, especially because soon after you start learning the language, you can already write queries that return results, giving you a false impression that it\u2019s a simple or easy language. However, without a good understanding of the foundations and roots of the language, you\u2019re bound to end up writing code that doesn\u2019t mean what you think it means. To be able to write robust and correct T-SQL code, you really want to spend a lot of energy on making sure that you have an in-depth understanding of the fundamentals.<\/p>\n<p>A great example for a fundamental concept in T-SQL that seems simple but actually involves many subtleties is: duplicates. The concept of duplicates is at the heart of relational database design and T-SQL. Understanding what duplicates are is consequential to many of its features. It\u2019s also important to understand the differences between how T-SQL handles duplicates versus how relational theory does.<\/p>\n<p>In this article I focus on duplicates and how to control them. In order to understand duplicates correctly, you need to understand the concepts of distinctness and equality, which I\u2019ll cover as well.<\/p>\n<p>In my examples I will use the sample database TSQLV6. You can download the script file to create and populate this database and its ER diagram in a .ZIP file <a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/ItzikBenGan_TSQLV6_SampleDatabase.zip\">here<\/a>. (You can find all of my downloads on my website at the following address: <a href=\"https:\/\/itziktsql.com\/r-downloads\">https:\/\/itziktsql.com\/r-downloads<\/a>)<\/p>\n<h2>What are duplicates in T-SQL?<\/h2>\n<p>T-SQL is a dialect of <a href=\"https:\/\/www.iso.org\/standard\/76584.html\">standard SQL<\/a> So, the obvious place to look for the definition of duplicates is in the standard\u2019s text, assuming you have access to it (you can purchase a copy using the link provided) and the stomach for it. If you do look for the definition of duplicates in the standard\u2019s text, you\u2019ll quickly start wondering how deep the rabbit hole goes.<\/p>\n<p>Here\u2019s the definition of duplicates in the standard:<\/p>\n<p><strong><em>Duplicates<\/em><\/strong><\/p>\n<p><em>Two or more members of a multiset that are not distinct<\/em><\/p>\n<p>So now you need to figure out what <em>multiset<\/em> and <em>distinct<\/em> mean.<\/p>\n<p>Let\u2019s start with multiset. This can get a bit confusing since there\u2019s both a mathematical concept called multiset and a specific feature in the SQL standard called <code>MULTISET<\/code> (one of two kinds of collection types: <code>ARRAY<\/code> and <code>MULTISET<\/code>). T-SQL doesn\u2019t support the standard collection types <code>ARRAY<\/code> and <code>MULTISET<\/code>. However, the mathematical concept of a multiset is quite important to understand for T-SQL practitioners.<\/p>\n<p>Here&#8217;s the description of a multiset from <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multiset\">Wikipedia<\/a>:<\/p>\n<p><strong><em>Multiset<\/em><\/strong><\/p>\n<p><em>In mathematics, a multiset (or bag, or mset) is a modification of the concept of a <\/em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Set_(mathematics)\"><em>set<\/em><\/a><em> that, unlike a set, allows for multiple instances for each of its elements.<\/em><\/p>\n<p>In other words, whereas a set is a collection of distinct elements, a multiset allows duplicates. Oops\u2026<\/p>\n<p>I\u2019m not sure why they chose in the standard to define duplicates via multisets. To me, it should be the other way around. I think that it\u2019s sufficient to understand the concept of duplicates via distinctness alone, and later talk about what multisets are.<\/p>\n<p>So, back to the definitions section in the SQL standard, here\u2019s the definition of distinctness:<\/p>\n<p><strong><em>Distinct (of a pair of comparable values)<\/em><\/strong><\/p>\n<p><em>Capable of being distinguished within a given context.<\/em><\/p>\n<p>Hmm\u2026 not very helpful. But wait, there\u2019s more! There\u2019s NOTE 8:<\/p>\n<p><strong><em>NOTE 8<\/em><\/strong><em> \u2014 Informally, two values are distinct if neither is null and the values are not equal. A null value and a nonnull value are distinct. Two null values are not distinct. See Subclause 4.1.5, \u201cProperties of distinct\u201d, and the General Rules of Subclause 8.15, \u201c&lt;distinct predicate&gt;\u201d.<\/em><\/p>\n<p>Let\u2019s start with the meaning of <em>a given context<\/em>. This has to do with the contexts in SQL where a pair of values can be compared. It could be a query filter, a join predicate, a set operator, the <code>DISTINCT<\/code> set quantifier in the <code>SELECT<\/code> list or an aggregate function, grouping, uniqueness, and so on.<\/p>\n<p>Next, let\u2019s talk about the meaning of <em>a pair of comparable values<\/em>. The concept of distinctness is relevant to a pair of values that can be compared. Later the standard contrasts this with values of a user-defined type that has no comparison type, in which case distinctness is undefined.<\/p>\n<p>I also need to mention here that when the standard uses the term <em>value<\/em>, it actually includes both non-<code>NULL<\/code> values and <code>NULLs<\/code>. To some, myself included, a <code>NULL<\/code> is a marker for a missing value, and therefore the term <em>NULL value<\/em> is actually incorrect. But what could be an inclusive term for both <code>NULL<\/code>s and non-<code>NULL<\/code> values? I don\u2019t know that there\u2019s a common industry term that is simple, intuitive and accurate. If you know one, let me know! Since our focus is things that can be compared, maybe we\u2019ll just use the term comparands. And by the way, in SQL, the concepts of duplicates and distinctness are of course relevant not just to scalar comparands, but also to row comparands.<\/p>\n<p>What\u2019s left to understand is what <em>capable of being distinguished<\/em> means. Note 8 goes into details trying to explain distinctness via equality, with exceptions when <code>NULL<\/code>s are involved. And as usual, it sends you elsewhere to read about <em>properties of distinct<\/em>, and the <em>distinct predicate<\/em>.<\/p>\n<p>What\u2019s crucial to understand here is that there\u2019s an important difference between equality-based comparison (or inequality) and distinctness-based comparison in SQL. You understand the difference using predicate logic. Given comparands <code>c1<\/code> and <code>c2<\/code>, here are the truth values for the inequality-based predicate <code>c1 &lt;&gt; c2<\/code> versus the distinctness-based counterpart <code>c1 is distinct from c2<\/code>:<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p><strong>c1<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c2<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c1 &lt;&gt; c2<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c1 is distinct from c2<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>non-NULL Y<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>any non-NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>any non-NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As you can see, with inequality, when both comparands are non-<code>NULL<\/code>, if they are the same the comparison evaluates to false and if they are different it evaluates to true. If any of the comparands is <code>NULL<\/code>, including both, the comparison evaluates to the truth value unknown.<\/p>\n<p>With distinctness, when both comparands are non-<code>NULL<\/code>, the behavior is the same as with inequality. When you have <code>NULL<\/code>s involved, they are basically treated just like non-<code>NULL<\/code> values. Meaning that when both comparands are <code>NULL<\/code>, <code>IS DISTINCT FROM<\/code> evaluates to false, otherwise to true.<\/p>\n<p>Similarly, given comparands <code>c1<\/code> and <code>c2<\/code>, here are the truth values for the equality-based predicate <code>c1 = c2<\/code> versus the distinctness-based counterpart <code>c1<\/code> is not distinct from <code>c2<\/code>:<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p><strong>c1<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c2<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c1 = c2<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>c1 is not distinct from c2<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>non-NULL X<\/p>\n<\/td>\n<td>\n<p>non-NULL Y<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>any non-NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>any non-NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>false<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>unknown<\/p>\n<\/td>\n<td>\n<p>true<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As you can see, with equality, when both comparands are non-<code>NULL<\/code>, if they are the same the comparison evaluates to true and if they are different it evaluates to false. If any of the comparands is <code>NULL<\/code>, including both, the comparison evaluates to the truth value unknown.<\/p>\n<p>With non-distinctness, when both comparands are non-<code>NULL<\/code>, the behavior is the same as with equality. When you have <code>NULL<\/code>s involved, they are basically treated just like non-<code>NULL<\/code> values. Meaning that when both comparands are <code>NULL<\/code>, <code>IS NOT DISTINCT FROM<\/code> evaluates to true, otherwise to false.<\/p>\n<p>In case you\u2019re not already aware of this, starting with SQL Server 2022, T-SQL supports the explicit form of the standard distinct predicate, using the syntax <code>&lt;comparand 1&gt; IS [NOT] DISTINCT FROM &lt;comparand 2&gt;<\/code>. You can find the details <a href=\"https:\/\/sqlperformance.com\/2022\/08\/sql-server-2022\/additional-t-sql-improvements-in-sql-server-2022\">here<\/a>.<\/p>\n<p>As you can gather, trying to understand things from the standard can be quite an adventure.<\/p>\n<p>So let\u2019s try to simplify our understanding of duplicates.<\/p>\n<p>First, you need to understand the concept of distinctness. This is done using predicate logic by understanding the difference between equality-based comparison and distinctness-based comparison, and familiarity with the distinct predicate and its rules.<\/p>\n<p>These concepts are not trivial to digest for the uninitiated, but they are critical for a correct understanding of the concept of duplicates.<\/p>\n<p>Assuming you have the concept of distinctness figured out, you can then understand duplicates.<\/p>\n<p><strong>Duplicates<\/strong><\/p>\n<p>Comparands c1 and c2 are duplicates if <code>c1<\/code> is not distinct from <code>c2<\/code>. That is, if the predicate <code>c1 IS NOT DISTINCT FROM c2<\/code> evaluates to true.<\/p>\n<p>The comparands <code>c1<\/code> and <code>c2<\/code> can be scalar comparands, in contexts like filter and join predicates, or row comparands, in contexts like the <code>DISTINCT<\/code> quantifier in the <code>SELECT<\/code> list and set operators.<\/p>\n<h2>T-SQL\/SQL Versus Relational Theory<\/h2>\n<p>Even though standard SQL and the T-SQL dialect, which is based on it, are founded in relational theory, they deviate from it in a number of ways, one of which is the handling of duplicates. The main structure in relational theory is a <em>relation<\/em>. Relational expressions operate on relations as inputs and emit a relation as output.<\/p>\n<p>The counterpart to a relation is a table in SQL. Similar to relational expressions, table expressions, such as ones defined by queries, operate on tables as inputs and emit a table as output.<\/p>\n<p>A relation has a heading and a body. The heading of a relation is a set of attributes. Similarly, the heading of a Table is a set of columns. There are interesting differences between the two, but that\u2019s not the focus of this article.<\/p>\n<p>The body of a relation is a <em>set<\/em> of tuples. Recall that a set has no duplicates. This means that by definition, a relation must have at least one candidate key.<\/p>\n<p>Unlike the body of a relation, the body of a table is a <em>multiset<\/em> of rows. Recall, a multiset is similar to a set, only it does allow duplicates. Indeed, you don\u2019t have to define a key in a table (no primary key or unique constraint), and if you don\u2019t, the table can have duplicate rows.<\/p>\n<p>Furthermore, even if you do have a key defined in a table; unlike a relational expression, a table expression that is based on a base table with a key, does not by default eliminate duplicates from the result table that it emits.<\/p>\n<p>Suppose that you need to project the countries where you have employees. In relational theory you formulate a relational expression doing so, and you don\u2019t need to be explicit about the fact that you don\u2019t want duplicate countries in the result. This is implied from the fact that the outcome of a relational expression is a relation. Suppose you try achieving the same in T-SQL using the following table expression:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT country FROM HR.Employees<\/pre>\n<p>As an aside, you might be wondering now why I didn\u2019t terminate this expression, despite the fact that I keep telling people how important it is to terminate all statements in T-SQL as a best practice. Well, you terminate <em>statements<\/em> in T-SQL. Statements do something. My focus now is the table expression returning a table with the cities where you have employees. The expression is the query part without the terminator, and is the part that can be nested in more elaborate table expressions.<\/p>\n<p>Back to our discussion about duplicates. Despite the fact that the underlying Employees table has a key and hence no duplicates, this table expression which is based on Employees, does have duplicates in its table result:<\/p>\n<p>Run the following code:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">USE TSQLV6;\r\n\r\nSELECT country FROM HR.Employees;<\/pre>\n<p>You get the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"135\" height=\"190\" class=\"wp-image-98081\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-screenshot-of-a-computer-description-automatica-45.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Of course, T-SQL does give you tools to eliminate duplicates in a table expression if you want to, it\u2019s just that in some cases it doesn\u2019t do so by default. In the above example, as you know well, you can use the <code>DISTINCT<\/code> set quantifier for this purpose. People often learn a dialect of SQL like T-SQL without learning the theory behind it. Many are so used to the fact that returning duplicates is the default behavior, that they don\u2019t realize that that&#8217;s not really normal in the underlying theory.<\/p>\n<h2>Controlling duplicates in T-SQL\/SQL<\/h2>\n<p>If I tried covering all aspects of controlling duplicates in T-SQL, I\u2019d probably easily end up with dozens of pages. To make this article more approachable, I\u2019ll focus on features that involve using quantifiers to allow or restrict duplicates.<\/p>\n<h2>SELECT List<\/h2>\n<p>T-SQL allows you to apply a set quantifier <code>ALL<\/code> | <code>DISTINCT<\/code> to the <code>SELECT<\/code> list of a query. The default is <code>ALL<\/code> if you don\u2019t specify a quantifier.<\/p>\n<p>The earlier query returning the countries where you have employees, without removal of duplicates, is equivalent to the following:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT ALL country FROM HR.Employees;<\/pre>\n<p>As you know, you need to explicitly use the DISTINCT quantifier to remove duplicates, like so:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT DISTINCT country FROM HR.Employees;<\/pre>\n<p>This code returns the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"168\" height=\"61\" class=\"wp-image-98082\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-blue-rectangle-with-black-text-description-auto.png\" alt=\"A blue rectangle with black text\n\nDescription automatically generated\" \/><\/p>\n<h2>Aggregate Functions<\/h2>\n<p>Similar to the <code>SELECT<\/code> list\u2019s quantifier, you can apply a set quantifier <code>ALL<\/code> | <code>DISTINCT<\/code> to the input of an aggregate function. Also here the <code>ALL<\/code> quantifier is the default if you don\u2019t specify one explicitly. With the <code>ALL<\/code> quantifier\u2014whether explicit or implied\u2014redundant duplicates are retained. The following two queries are logically equivalent:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">SELECT COUNT(country) AS cnt FROM HR.Employees;\r\n\r\nSELECT COUNT(ALL country) AS cnt FROM HR.Employees;<\/pre>\n<p>Both queries return a count of 9 since there are nine rows where country is not <code>NULL<\/code>.<\/p>\n<p>Again, you can use the <code>DISTINCT<\/code> quantifier if you want to remove redundant duplicates, like so:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT COUNT(DISTINCT country) AS cnt FROM HR.Employees;<\/pre>\n<p>This query returns 2 since there are two distinct countries where you have employees.<\/p>\n<p>Note that at the time of writing, T-SQL supports the <code>DISTINCT<\/code> quantifier with grouped aggregate functions, but not with windowed aggregate functions. There are <a href=\"https:\/\/www.itprotoday.com\/sql-server\/what-you-need-know-about-distinct-windowed-aggregate-calculations\">workarounds<\/a>, but they are far from being trivial.<\/p>\n<h2>Set operators<\/h2>\n<p>Set operators allow you to combine data from two input table expressions. The SQL standard supports three set operators <code>UNION<\/code>, <code>INTERSECT<\/code> and <code>EXCEPT<\/code>, each with two possible quantifiers <code>ALL<\/code> | <code>DISTINCT<\/code>. With set operators the <code>DISTINCT<\/code> quantifier is the default if you don\u2019t specify one explicitly.<\/p>\n<p>At the time of writing, T-SQL supports only a subset of the standard set operators. It supports <code>UNION<\/code> (implied <code>DISTINCT<\/code>), <code>UNION<\/code> <code>ALL<\/code>, <code>INTERSECT<\/code> (implied <code>DISTINCT<\/code>) and <code>EXCEPT<\/code> (implied <code>DISTINCT<\/code>). It doesn\u2019t allow you to be explicit with the <code>DISTINCT<\/code> quantifier, although that\u2019s the behavior that you get by default, and it supports the <code>ALL<\/code> option only with the <code>UNION ALL<\/code> operator. It currently does not support the <code>INTERSECT ALL<\/code> and <code>EXPECT ALL <\/code>operators. I\u2019ll explain all standard variants, and for the missing ones in T-SQL, I\u2019ll provide workarounds.<\/p>\n<p>You apply a set operator to two input table expressions, which I\u2019ll refer to as TE1 and TE2:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">TE1\r\n&lt;set operator&gt; \r\nTE2<\/pre>\n<p>The above represents a table expression.<\/p>\n<p>A statement based on a table expression with a set operator can have an optional <code>ORDER BY<\/code> clause applied to the result, using the following syntax:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">TE1 &lt;set operator&gt;  TE2 \r\n[ORDER BY &lt;order by list&gt;];<\/pre>\n<p>You\u2019re probably familiar with the set operators that T-SQL supports. Still, let me briefly explain what each operator does:<\/p>\n<ul>\n<li><code>UNION<\/code>: Returns distinct rows that appear in <code>TE1<\/code>, <code>TE2<\/code> or both. That is, if row <code>R<\/code> appears in <code>TE1<\/code>, <code>TE2<\/code> or both, irrespective of number of occurrences, it appears exactly once in the result.<\/li>\n<li><code>UNION ALL<\/code>: Returns all rows that appear in <code>TE1<\/code>, <code>TE2<\/code> or both. That is, if row R appears m times in <code>TE1<\/code> and n times in <code>TE2<\/code>, it appears <code>m + n<\/code> times in the result.<\/li>\n<li><code>INTERSECT<\/code>: Returns distinct rows that are common to both <code>TE1<\/code> and <code>TE2<\/code>. That is, if row <code>R<\/code> appears at least once in <code>TE1<\/code>, and at least once in <code>TE2<\/code>, it appears exactly once in the result.<\/li>\n<li><code>INTERSECT ALL<\/code>: Returns all rows that are common to both <code>TE1<\/code> and <code>TE2<\/code>. That is, if row R appears m times in <code>TE1<\/code> and n times in <code>TE2<\/code>, it appears <code>minimum(m, n)<\/code> times in the result. For example, if <code>R<\/code> appears 5 times in <code>TE1<\/code> and 3 times in <code>TE2<\/code>, it appears 3 times in the result.<\/li>\n<li><code>EXCEPT<\/code>: Returns distinct rows that appear in <code>TE1<\/code> but not in <code>TE2<\/code>. That is, if a row <code>R<\/code> appears in <code>TE1<\/code>, irrespective of the number of occurrences, and does not appear in <code>TE2<\/code>, it appears exactly once in the output.<\/li>\n<li><code>EXCEPT ALL<\/code>: Returns all rows that appear in <code>TE1<\/code> but don\u2019t have an occurrence match in <code>TE2<\/code>. That is, if a row <code>R<\/code> appears m times in <code>TE1<\/code>, and n times in <code>TE2<\/code>, it appears <code>maximum((m - n), 0)<\/code> times in the result. For example, if <code>R<\/code> appears 5 times in <code>TE1<\/code> and 3 times in <code>TE2<\/code>, it appears 2 times in the result. If <code>R<\/code> appears 3 times in <code>TE1<\/code> and 5 times in <code>TE2<\/code>, it doesn\u2019t appear in the result.<\/li>\n<\/ul>\n<p>An interesting question is why would you use a set operator to handle a given task as opposed to alternative tools to combine data from multiple tables, such as joins and subqueries? To me, one of the main benefits is the fact that when set operators compare rows, they implicitly use distinctness-based comparison and not equality-based comparison. Recall that distinctness-based comparison handles <code>NULL<\/code>s and non-<code>NULL<\/code> values the same way, essentially using two-valued logic instead of three-valued logic. That\u2019s often the desired behavior, and with set operators it simplifies the code a great deal. For example, the following code identifies distinct locations that are both customer locations and employee locations:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT country, region, city FROM Sales.Customers\r\nINTERSECT\r\nSELECT country, region, city FROM HR.Employees;<\/pre>\n<p>This code generates the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"284\" height=\"89\" class=\"wp-image-98083\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-screenshot-of-a-phone-description-automatically-2.png\" alt=\"A screenshot of a phone\n\nDescription automatically generated\" \/><\/p>\n<p>Since set operators implicitly use the distinct predicate to compare rows (not to confuse with the fact that without an explicit quantifier they use the <code>DISTINCT<\/code> quantifier by default), you didn\u2019t need to do anything special to get a match when comparing two <code>NULL<\/code>s and a nonmatch when comparing a <code>NULL<\/code> with a non-<code>NULL<\/code> value. The location UK, <code>NULL<\/code>, London is part of the result since it appears in both inputs.<\/p>\n<p>Also, with no explicit quantifier specified, a set operator uses an implicit <code>DISTINCT<\/code> quantifier by default. Remember that with <code>INTERSECT<\/code>, as long as at least one occurrence of a row appears in both sides, <code>INTERSECT<\/code> returns one occurrence of the row in the result.<\/p>\n<p>As an exercise, I urge you to write a logically equivalent solution to the above code, using either joins or subqueries. Of course it\u2019s doable, but not this concisely.<\/p>\n<p>As mentioned, the standard also supports an <code>ALL<\/code> version of <code>INTERSECT<\/code>. For example, the following standard query returns all occurrences of locations that are both customer locations and employee locations (don\u2019t run it against SQL Server since it\u2019s not supported in T-SQL):<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT country, region, city FROM Sales.Customers\r\nINTERSECT ALL\r\nSELECT country, region, city FROM HR.Employees;<\/pre>\n<p>If you want to use a solution that is supported in T-SQL, the trick is to compute row numbers in each input table expression to number the duplicates, and then apply the operation to the inputs including the row numbers. You can then exclude the row numbers from the result by using a named table expression like a CTE. Here\u2019s the complete code:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">WITH C AS\r\n(\r\n  SELECT country, region, city, \r\n    ROW_NUMBER() OVER(PARTITION BY country, region, city \r\n                              ORDER BY (SELECT NULL)) AS rownum\r\n  FROM Sales.Customers\r\n  \r\n  INTERSECT\r\n  \r\n  SELECT country, region, city, \r\n    ROW_NUMBER() OVER(PARTITION BY country, region, city \r\n                              ORDER BY (SELECT NULL)) AS rownum\r\n  FROM HR.Employees\r\n)\r\nSELECT country, region, city\r\nFROM C;<\/pre>\n<p>This code generates the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"307\" height=\"137\" class=\"wp-image-98084\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-screenshot-of-a-computer-description-automatica-46.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>The location <code>UK<\/code>, <code>NULL<\/code>, <code>London<\/code> appears 6 times in the first input table expression and 4 times in the second, therefore 4 occurrences intersect.<\/p>\n<p>You can handle an except need very similarly. If you\u2019re interested in an except distinct operation, you use the <code>EXCEPT<\/code> (implied <code>DISTINCT<\/code>) in T-SQL. For example, the following code returns distinct employee locations that are not customer locations:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT country, region, city FROM HR.Employees\r\nEXCEPT\r\nSELECT country, region, city FROM Sales.Customers;<\/pre>\n<p>This code generates the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"229\" height=\"63\" class=\"wp-image-98085\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-screenshot-of-a-cellphone-description-automatic.png\" alt=\"A screenshot of a cellphone\n\nDescription automatically generated\" \/><\/p>\n<p>If you wanted all employee locations that per occurrence don\u2019t have a matching customer location, the standard code for this looks like so:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT country, region, city FROM HR.Employees\r\nEXCEPT ALL\r\nSELECT country, region, city FROM Sales.Customers;<\/pre>\n<p>However, T-SQL doesn\u2019t support the <code>ALL<\/code> quantifier with the <code>EXCEPT<\/code> operator. You can use a similar trick to the one you used to achieve the equivalent of <code>INTERSECT ALL<\/code>. You can achieve the equivalent of <code>EXCEPT ALL<\/code> by applying <code>EXCEPT<\/code> (implied <code>DISTINCT<\/code>) to inputs that include row numbers that number duplicate, like so:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">WITH C AS\r\n(\r\n  SELECT country, region, city, \r\n    ROW_NUMBER() OVER(PARTITION BY country, region, city \r\n                            ORDER BY (SELECT NULL)) AS rownum\r\n  FROM HR.Employees\r\n  \r\n  EXCEPT\r\n  \r\n  SELECT country, region, city, \r\n    ROW_NUMBER() OVER(PARTITION BY country, region, city \r\n                            ORDER BY (SELECT NULL)) AS rownum\r\n  FROM Sales.Customers\r\n)\r\nSELECT country, region, city\r\nFROM C;<\/pre>\n<p>This code generates the following output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"88\" class=\"wp-image-98086\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/08\/a-screenshot-of-a-phone-description-automatically-3.png\" alt=\"A screenshot of a phone\n\nDescription automatically generated\" \/><\/p>\n<p>Curiously, Seattle appears once in this result of except all but didn\u2019t appear at all in the result of the except distinct version. You might initially think that there\u2019s a bug in the code. But think carefully what could explain this? There are two employees from Seattle and one customer from Seattle. The except distinct operation isn\u2019t supposed to return any occurrences in the result, yet the except all operation is indeed supposed to return one occurrence.<\/p>\n<h2>Conclusion<\/h2>\n<p>Without proper understanding of the foundations of T-SQL\u2014primarily relational theory and its own roots\u2014it\u2019s hard to truly understand what you\u2019re dealing with. In this article I focused on duplicates. A concept that to most seems trivial and intuitive. However, as it turns out, a true understanding of duplicates in T-SQL is far from being trivial.<\/p>\n<p>You need to understand the differences between how relational theory treats duplicates versus SQL\/T-SQL. A relation\u2019s body doesn\u2019t have duplicates whereas a table\u2019s body can have those.<\/p>\n<p>It\u2019s very important to understand the difference between distinctness-based comparison, such as when using the distinct predicate explicitly or implicitly, versus equality-based comparison. Then you realize that in T-SQL, two comparands are duplicates when one is not distinct from the other.<\/p>\n<p>You also need to understand the nuances of how T-SQL handles duplicates, the cases where it retains redundant ones versus cases where it removes those. You also need to understand the tools that you have to change the default behavior using quantifiers such as <code>DISTINCT<\/code> and <code>ALL<\/code>.<\/p>\n<p>I discussed controlling duplicates in a query\u2019s <code>SELECT<\/code> list, aggregate functions and set operators. But there are other language elements where you might need to control them such as handling ties with the <code>TOP<\/code> filter, window functions, and others.<\/p>\n<p>What I hope that you take away from this article is the significance of investing time and energy in learning the fundamentals. And if you haven\u2019t had enough of T-SQL Fundamentals, and I\u2019m allowed a shameless plug, check out my new book <a href=\"https:\/\/www.amazon.com\/s?k=T-SQL+Fundamentals\">T-SQL Fundamentals 4th Edition<\/a>.<\/p>\n<p>May the 4th be with you!<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When people start learning a new field, for example T-SQL, it\u2019s tempting to spend very little time trying to understand the fundamentals of the field so that you can quickly get to the advanced parts. You might think that you already understand what certain ideas, concepts and terms mean, so you don\u2019t necessarily see the&#8230;&hellip;<\/p>\n","protected":false},"author":341753,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143531,143539],"tags":[159009],"coauthors":[158990],"class_list":["post-98080","post","type-post","status-publish","format-standard","hentry","category-featured","category-t-sql-programming-sql-server","category-theory-and-design","tag-summitspeaker2023itzikbengan"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98080","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/341753"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=98080"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98080\/revisions"}],"predecessor-version":[{"id":98088,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98080\/revisions\/98088"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=98080"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=98080"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=98080"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=98080"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}