{"id":516,"date":"2009-02-11T00:00:00","date_gmt":"2009-02-11T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/removing-duplicates-from-a-table-in-sql-server\/"},"modified":"2021-09-29T16:22:07","modified_gmt":"2021-09-29T16:22:07","slug":"removing-duplicates-from-a-table-in-sql-server","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/removing-duplicates-from-a-table-in-sql-server\/","title":{"rendered":"Removing Duplicates from a Table in SQL Server"},"content":{"rendered":"<div id=\"pretty\">\n<p class=\"start\">Only rarely will you need to remove duplicate entries from a table on a production database. The tables in these databases should have a constraint, such as a primary key or unique constraint, to prevent these duplicate entries occurring in the first place. However, last year at SQL Bits 3 in Reading, I asked my audience how many of them needed to remove duplicate rows from a table, and almost eighty percent raised a hand.<\/p>\n<p>How is it that duplicates can get into a properly-designed table?&#160; Most commonly, this is due to changes in the business rules that define what constitutes a duplicate, especially after the merging of two different systems.&#160; In this article, I will look at some ways of removing duplicates from tables in SQL Server 2000 and later versions, and at some of the problems that may arise.<\/p>\n<h1>Checking for Duplicates<\/h1>\n<p>On any version of SQL Server, you can identify duplicates using a simple query, with <b>GROUP BY<\/b> and <b>HAVING<\/b>, as follows:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE (data VARCHAR(20))\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\n&#160;\nSELECT&#160; data\n&#160;&#160;&#160;&#160;&#160; , COUNT(data) nr\nFROM&#160;&#160;&#160; @table\nGROUP BY data\nHAVING&#160; COUNT(data) &gt; 1\n<\/pre>\n<p>The result indicates that there are two occurrences of the row containing the &#8220;duplicate row&#8221; text:<\/p>\n<pre>data&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; nr\n-------------------- -----------\nduplicate row&#160;&#160;&#160;&#160;&#160;&#160;&#160; 2\n<\/pre>\n<h1>Removing Duplicate Rows in SQL Server<\/h1>\n<p>The following sections present a variety of techniques for removing duplicates from SQL Server database tables, depending on the nature of the table design.<\/p>\n<h2>Tables with no primary key<\/h2>\n<p>When you have duplicates in a table that has no primary key defined, and you are using an older version of SQL Server, such as SQL Server 2000, you do not have an easy way to identify a single row. Therefore, you cannot simply delete this row by specifying a <b>WHERE<\/b> clause in a <b>DELETE<\/b> statement. <\/p>\n<p>You can, however, use the <b>SET ROWCOUNT 1<\/b> command, which will restrict the subsequent <b>DELETE<\/b> statement to removing only one row. For example:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE (data VARCHAR(20))\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\n&#160;\nSET ROWCOUNT 1\nDELETE FROM @table WHERE data = 'duplicate row'\nSET ROWCOUNT 0\n<\/pre>\n<p>In the above example, only one row is deleted. Consequently, there will be one remaining row with the content &#8220;duplicate row&#8221;. If you have more than one duplicate of a particular row, you would simply adjust the <b>ROWCOUNT<\/b> accordingly. Note that after the delete, you should reset the <b>ROWCOUNT<\/b> to 0 so that subsequent queries are not affected.<\/p>\n<p>To remove all duplicates in a single pass, the following code will work, but is likely to be horrendously slow if there are a large number of duplicates and table rows:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE (data VARCHAR(20))\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\n&#160;\nSET NOCOUNT ON\nSET ROWCOUNT 1\nWHILE 1 = 1\n&#160;&#160; BEGIN\n&#160;&#160;&#160;&#160;&#160; DELETE &#160;&#160;FROM @table\n&#160;&#160;&#160;&#160;&#160; WHERE&#160;&#160;&#160; data IN (SELECT&#160; data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160; @table\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; GROUP BY data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; HAVING&#160; COUNT(*) &gt; 1)\n&#160;&#160;&#160;&#160;&#160; IF @@Rowcount = 0 \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; BREAK ;\n&#160;&#160; END\nSET ROWCOUNT 0\n<\/pre>\n<p>When cleaning up a table that has a large number of duplicate rows, a better approach is to select just a distinct list of the duplicates, delete all occurrences of those duplicate entries from the original and then insert the list into the original table. <\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE(data VARCHAR(20))\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('second duplicate row')\nINSERT INTO @table VALUES ('second duplicate row')\n&#160;\nSELECT&#160;&#160; data\nINTO&#160;&#160;&#160;&#160; #duplicates\nFROM&#160;&#160;&#160;&#160; @table\nGROUP BY data\nHAVING&#160;&#160; COUNT(*) &gt; 1\n&#160;\n-- delete all rows that are duplicated\nDELETE&#160;&#160; FROM @table\nFROM&#160;&#160;&#160;&#160; @table o INNER JOIN #duplicates d\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ON d.data = o.data\n&#160;\n-- insert one row for every duplicate set\nINSERT&#160;&#160; INTO @table(data)\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; SELECT&#160;&#160; data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160;&#160; #duplicates\n<\/pre>\n<p>As a variation of this technique, you could select all the data, without duplicates, into a new table, delete the old table, and then rename the new table to match the name of the original table:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">CREATE TABLE duplicateTable3(data VARCHAR(20))\nINSERT INTO duplicateTable3 VALUES ('not duplicate row')\nINSERT INTO duplicateTable3 VALUES ('duplicate row')\nINSERT INTO duplicateTable3 VALUES ('duplicate row')\nINSERT INTO duplicateTable3 VALUES ('second duplicate row')\nINSERT INTO duplicateTable3 VALUES ('second duplicate row')\n&#160;\nSELECT DISTINCT data\nINTO&#160;&#160;&#160; tempTable\nFROM&#160;&#160;&#160; duplicateTable3 \nGO\nTRUNCATE TABLE duplicateTable3\nDROP TABLE duplicateTable3\nexec sp_rename 'tempTable', 'duplicateTable3'\n<\/pre>\n<p>In this solution, the <b>SELECT DISTINCT<\/b> will select all the rows from our table except for the duplicates. These rows are immediately inserted into a table named<b> tempTable<\/b>. This is a temporary table in the sense that we will use it to temporarily store the unique rows. However, it is not a true temporary table (i.e. one that lives in the temporary database), because we need the table to exist in the current database, so that it can later be renamed, using <b>sp_Rename<\/b>.<\/p>\n<p>The s<b>p_Rename <\/b>command<b> <\/b>is an absolutely horrible way of renaming textual objects, such as stored procedures, because it does not update all the system tables consistently. However, it works well for non-textual schema objects, such as tables.<\/p>\n<p>Note that this solution is usually used on table that has no primary key. If there is a key, and there &#160;are foreign keys referencing the rows that&#160; are identified as being &#160;duplicates, then the foreign key constraints need to be dropped and re-created again during the table swap. <\/p>\n<h2>Tables with a primary key, but no foreign key constraints<\/h2>\n<p>If your table has a primary key, but no foreign key constraints, then the following solution offers a way to remove duplicates that is much quicker, as it entails less iteration:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE(\n&#160;&#160;&#160;&#160;&#160; id INT IDENTITY(1, 1)\n&#160;&#160;&#160; , data VARCHAR(20)\n&#160;&#160;&#160; )\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\n&#160;\nWHILE 1 = 1\n&#160;&#160; BEGIN\n&#160;&#160;&#160;&#160;&#160; DELETE&#160;&#160; FROM @table\n&#160;&#160;&#160;&#160;&#160; WHERE&#160;&#160;&#160; id IN (SELECT&#160;&#160; MAX(id)\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160;&#160; @table\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; GROUP BY data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; HAVING&#160;&#160; COUNT(*) &gt; 1)\n&#160;&#160;&#160;&#160;&#160; IF @@Rowcount = 0 \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; BREAK ;\n&#160;&#160; END \n<\/pre>\n<p>Unfortunately, this sort of technique does not scale well. <\/p>\n<p>If your table has a reliable primary key, for example one that has an assigned a value that can be used in a comparison, such as a numeric value in a column with the IDENTITY property enabled, then the following approach is probably the neatest and best. Essentially, it deletes all the duplicates except for the one with the highest value for the primary key. If a table has a unique column such as a number or integer, that will reliably return just one value with&#160; MAX() or MIN(), then you can use this technique &#160;to identify the chosen survivor of the group of duplicates.<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE @table TABLE (\n&#160;&#160;&#160;&#160;&#160; id INT IDENTITY(1, 1)\n&#160;&#160;&#160; , data VARCHAR(20)\n&#160;&#160;&#160; )\nINSERT INTO @table VALUES ('not duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('duplicate row')\nINSERT INTO @table VALUES ('second duplicate row')\nINSERT INTO @table VALUES ('second duplicate row')\n&#160;\n&#160;\nDELETE&#160; FROM @table\nFROM&#160;&#160;&#160; @table o\n&#160;&#160;&#160;&#160;&#160;&#160;&#160; INNER JOIN ( SELECT data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160; @table\n&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;GROUP BY data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; HAVING COUNT(*) &gt; 1\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ) f ON o.data = f.data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160; LEFT OUTER JOIN ( SELECT&#160;&#160;&#160; [id] = MAX(id)\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160;&#160;&#160; @table\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; GROUP BY&#160; data\n&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;HAVING&#160;&#160;&#160; COUNT(*) &gt; 1\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ) g ON o.id = g.id\nWHERE&#160;&#160; g.id IS NULL\n<\/pre>\n<p>This can be simplified even further, though the logic is rather harder to follow.<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DELETE&#160;&#160; FROM f\nFROM&#160;&#160;&#160;&#160; @table AS f INNER JOIN @table AS g\n&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;ON g.data = f.data \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AND f.id &lt; g.id\n<\/pre>\n<h2>Tables that are referenced by a Foreign Key<\/h2>\n<p>If you&#8217;ve you&#8217;ve set up your constraints properly then you will be unable to delete duplicate rows from a table that is referenced by another table, using the above techniques unless you have specified cascading deletes in the foreign key constraints.<\/p>\n<p>You can alter existing foreign key constraints by adding a cascading delete&#160;on the foreign key&#160;constraint. This means that rows in other tables that refer to the duplicate row via a foreign key constraint will be deleted.&#160; Because you will lose the referenced data as well as the duplicate, you are more likely to wish to save the duplicate data in its entirety first in a holding table.&#160; When you are dealing with real data, you are likely to need to identify the duplicate rows that are being referred to, and delete the duplicates that are not referenced, or merge duplicates and update the references. This task will probably have to be done manually in order to ensure data integrity.<\/p>\n<h2>Tables with columns that cannot have a UNIQUE constraint<\/h2>\n<p>Sometimes, of course, you may have columns on which you cannot define a unique constraint, or you cannot even use the DISTINCT keyword. Large object types, like NTEXT, TEXT and IMAGE in SQL Server 2000 are good examples of this.&#160; These are data types that cannot be compared, and so the above solutions would not work.<\/p>\n<p>In these situations, you will need to add an extra column to the table that you could use as a surrogate key. Such a surrogate key is not derived from the application data. Its value may be automatically generated, similarly to the identity columns in our previous examples. Unfortunately, in SQL Server, you cannot add an identity column to a table as part of the ALTER TABLE command. The only way to add such a column is to rebuild the table, using SELECT INTO and the <b>IDENTITY()<\/b> function, as follows:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">CREATE TABLE duplicateTable4 (data NTEXT)\nINSERT INTO duplicateTable4 VALUES ('not duplicate row')\nINSERT INTO duplicateTable4 VALUES ('duplicate row')\nINSERT INTO duplicateTable4 VALUES ('duplicate row')\nINSERT INTO duplicateTable4 VALUES ('second duplicate row')\nINSERT INTO duplicateTable4 VALUES ('second duplicate row')\n&#160;\n&#160;\nSELECT&#160; IDENTITY( INT, 1,1 ) AS id,\n&#160;&#160;&#160;&#160;&#160;&#160;&#160; data\nINTO&#160;&#160;&#160; duplicateTable4_Copy\nFROM&#160;&#160;&#160; duplicateTable4\n<\/pre>\n<p>The above will create the <b>duplicateTable4_Copy<\/b> table. This table will have an identity column named <b>id<\/b>, which will already have unique numeric values set. Note that although we are creating an Identity column, uniqueness is not enforced in this case; you will need to add a unique index or define the <b>id<\/b> column as a primary key.<\/p>\n<h2>Using a cursor<\/h2>\n<p>People with application development background would consider using a cursor to try to eliminate duplicates. The basic idea is to order the contents of the table, iterate through the ordered rows, and check if the current row is equal to the previous row. If it does, then delete the row. This solution could look like the following in T-SQL:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">CREATE TABLE duplicateTable5 (data varchar(30))\nINSERT INTO duplicateTable5 VALUES ('not duplicate row')\nINSERT INTO duplicateTable5 VALUES ('duplicate row')\nINSERT INTO duplicateTable5 VALUES ('duplicate row')\nINSERT INTO duplicateTable5 VALUES ('second duplicate row')\nINSERT INTO duplicateTable5 VALUES ('second duplicate row')\nDECLARE @data VARCHAR(30),\n&#160;&#160;&#160; @previousData VARCHAR(30)\nDECLARE cursor1 CURSOR SCROLL_LOCKS\n&#160;&#160;&#160; FOR SELECT&#160; data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160; duplicateTable5\n&#160;&#160;&#160;&#160;&#160;&#160;&#160; ORDER BY data\n&#160;&#160;&#160; FOR UPDATE\nOPEN cursor1\n&#160;\nFETCH NEXT FROM cursor1 INTO @data\nWHILE @@FETCH_STATUS = 0\n&#160;&#160;&#160; BEGIN\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; IF @previousData = @data \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; DELETE&#160; FROM duplicateTable5\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; WHERE CURRENT OF cursor1\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; SET @previousData = @data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FETCH NEXT FROM cursor1 INTO @data\n&#160;&#160;&#160; END \nCLOSE cursor1\nDEALLOCATE cursor1\n<\/pre>\n<p>The above script will not work, because once you apply the&#160;<strong>ORDER BY<\/strong>&#160;clause in the cursor declaration the cursor will become read-only. If you remove the <b>ORDER BY<\/b> clause, then there will be no guarantee that the rows will be in order, and checking two subsequent rows would no longer be sufficient to identify duplicates. Interestingly, since the above example creates a small table where all the rows fit onto a single database page and duplicate rows are inserted in groups, removing the <b>ORDER BY<\/b> clause does make the cursor solution work. It will fail, however, with any table that is larger and has seen some modifications.<\/p>\n<h1>New Techniques for Removing Duplicate Rows in SQL Server 2005<\/h1>\n<p>SQL Server 2005 has introduced the <b>row_number()<\/b> function, which provides an alternative means of identifying duplicates. Rewriting the first example, for tables with no primary key, we can now assign a row number to each row in a duplicate group, with a command such as:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE&#160; @duplicateTable4 TABLE (data VARCHAR(20))\nINSERT INTO @duplicateTable4 VALUES ('not duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\n&#160;\nSELECT&#160; data\n&#160;&#160;&#160;&#160;&#160; , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr\nFROM&#160;&#160;&#160; @duplicateTable4\n<\/pre>\n<p>The result will show:<\/p>\n<pre>data&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;nr\n-------------------- --------------------\nduplicate row&#160;&#160;&#160;&#160;&#160;&#160;&#160; 1\nduplicate row&#160;&#160;&#160;&#160;&#160;&#160;&#160; 2\nnot duplicate row&#160;&#160;&#160; 1\nsecond duplicate row 1\nsecond duplicate row 2\n<\/pre>\n<p>In the above example, we specify an ordering and partitioning for the <b>row_number()<\/b> function. Note that the <b>row_number<\/b>() is a ranking window function, therefore the <b>ORDER BY<\/b> and the <b>PARTITION BY <\/b>in the <b>OVER<\/b> clause are used only to determine the value for the <b>nr<\/b> column, and they do not affect the row order of the query. Also, while the above is similar to our previous <b>GROUP BY<\/b> clause, there is a big difference concerning the returned rows. With <b>GROUP BY<\/b> you must use an aggregate on the columns that are not listed after the <b>GROUP BY<\/b>. With the <b>OVER<\/b> clause there is no such restriction, and you can get access to the individual rows in the groups specified by the <b>PARTITION BY<\/b> clause. This gives us access to the individual duplicate rows, so we can get not only the number of occurrences, but also a sequence number for the individual duplicates. To filter out the duplicate rows only, we could just put the above query into a CTE or a subquery. The CTE approach is as follows:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE&#160; @duplicateTable4 TABLE (data VARCHAR(20))\nINSERT INTO @duplicateTable4 VALUES ('not duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\n;\nWITH&#160;&#160;&#160; numbered\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AS ( SELECT&#160;&#160; data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160;&#160; @duplicateTable4\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; )\n&#160;&#160;&#160; SELECT&#160; data\n&#160;&#160;&#160; FROM&#160;&#160;&#160; numbered\n&#160;&#160;&#160; WHERE&#160;&#160; nr &gt; 1\n<\/pre>\n<p>This is not really any different from what we could do on SQL Server 2000.&#160; However, here comes an absolutely amazing feature in SQL Server 2005 and later: We can refer to, and identify, a duplicate row based on the <b>row_number()<\/b> column and then, with the above CTE expression, we can use a <b>DELETE<\/b> statement instead of a SELECT, and directly remove the duplicate entries from our table. <\/p>\n<p>We can demonstrate this technique with the following example:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DECLARE&#160; @duplicateTable4 TABLE (data VARCHAR(20))\nINSERT INTO @duplicateTable4 VALUES ('not duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\nINSERT INTO @duplicateTable4 VALUES ('second duplicate row')\n;\nWITH&#160;&#160;&#160; numbered\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AS ( SELECT&#160;&#160; data\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; , row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; FROM&#160;&#160;&#160;&#160; @duplicateTable4\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; )\n&#160;&#160;&#160; DELETE&#160; FROM numbered\n&#160;&#160;&#160; WHERE&#160;&#160; nr &gt; 1\n<\/pre>\n<p>This solution will even work with large objects, if you stick to the new large object types introduced in SQL Server 2005: i.e. use <b>VARCHAR(MAX)<\/b> instead of <b>TEXT<\/b>, <b>NVARCHAR(MAX) <\/b>instead of <b>NTEXT,<\/b> and <b>VARBINARY(MAX)<\/b> instead of<b> IMAGE.<\/b> These new types are comparable to the deprecated <b>TEXT,<\/b> <b>NTEXT<\/b> and I<b>MAGE,<\/b> and they have the advantage that you will be able to use them with both <b>DISTINCT <\/b>and <strong>row_number().<\/strong><\/p>\n<p>&#160;I find this last solution, using <b>CTE<\/b>, <b>ROW_NUMBER() <\/b>and <b>DELETE<\/b>, fascinating. Partly because now we can identify rows in a table when there is no other alternative way of doing it, and partly because it is a solution to a problem that should not, in theory, exist at all since production tables will have a unique constraint or a primary key to prevent duplicates getting into the table in the first place.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes, in SQL, it is the routine operations that turn out to be the trickiest for a DBA or developer. The cleaning up, or de-duplication, of data is one of those. Andr\u00e1s runs through a whole range of  methods and tricks, and ends with a a fascinating technique using CTE, ROW_NUMBER() and DELETE&hellip;<\/p>\n","protected":false},"author":9298,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143531],"tags":[4150,5134,4151,4252],"coauthors":[7547],"class_list":["post-516","post","type-post","status-publish","format-standard","hentry","category-t-sql-programming-sql-server","tag-sql","tag-sql-prompt","tag-sql-server","tag-t-sql-programming"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/516","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/9298"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=516"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/516\/revisions"}],"predecessor-version":[{"id":40049,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/516\/revisions\/40049"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=516"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=516"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=516"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=516"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}