In a recent article ‘Removing Duplicates from Strings in SQL Server’, Aaron Bertrand comes up with the interesting problem of removing duplicates from lists of strings. The example he gives is …
1 |
'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta' |
… and you want to make each word unique in the list…
1 |
‘Alpha/Bravo/Delta/Tango’. … You want just one occurrence of Bravo, for example. |
He goes on to describe the conventional ‘number-table’ solution.
It made me wonder if it would be possible to do an alternative solution that didn’t require functions but would also work on a whole set of lists. Meybe, even a single expression.
Normally, these lists will be quite long. There is no place in a relational database for lists. However, there are times when we are forced to resort to them, if only when importing data from other systems. There is a subtle problem with them. Sometimes, lists not only describe the members of a set but also their order in the set.
In this case, it is the order of the first occurrence of the item in the list. To do the latter, the solution is slightly more complicated.
Let’s start with a set of lists. Just for simplicity, we’ll use the same delimiter for each though you’ll probably spot how to specify the delimiter and even how to allow multi-character delimiters. I’m leaving all that stuff out to keep all this code as simple as possible
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
SELECT AsciiList FROM (VALUES ('Hickory/Dickory/Dock/dock'), ('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'), ('One/Two/Three/Three/Three/Four'), ('Baa/Baa/Black/Sheep'))ListsWithDuplicates(AsciiList); /* which gives … AsciiList ———————————————– Hickory/Dickory/Dock/dock Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta One/Two/Three/Three/Three/Four Baa/Baa/Black/Sheep All these strings have duplicate entries. 'Dock', in the first case, and everything bar 'Tango' in the second, 'Three' in the third one and 'Baa' in the fourth So we can test a simple expression that removes the duplicates. Note that the offending column in your ETL table would substitute for our samples in a real use-case. */ SELECT –the re-assembled list replace(replace(Replace( stuff( (SELECT DISTINCT '/'+x.y.value('.','varchar(20)') FROM XMLList.nodes('/list/i/text()') AS x ( y ) FOR XML PATH('')),1,1,'') ,'<','<'),'>','>') ,'&','&') as Deduplicated–using the XML trick FROM (–XML version of the original list SELECT convert(XML,'<list><i>' +replace( replace( replace(ASCIIList,'&','&') ,'<','<') ,'/','</i><i>')+'</i></list>') FROM (VALUES ('Hickory/Dickory/Dock/dock'), ('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'), ('One/Two/Three/Three/Three/Four'), ('Baa/Baa/Black/Sheep') )XMLlist(AsciiList) )lists(XMLlist); GO /* giving … Deduplicated ————————- Dickory/Dock/Hickory Alpha/Bravo/Delta/Tango Four/One/Three/Two Baa/Black/Sheep But this would be wrong if the order of the list is important. A list can just be a simple unordered list, or there could be an implied order there. Let's alter this query to return the deduplicated list in the order of the first occurence of each item. It is easier done by example than explained! */ SELECT /*the re-assembled list*/ replace(replace(Replace( stuff( (SELECT '/'+TheValue FROM (SELECT x.y.value('.','varchar(20)') AS [Thevalue!cdata], row_number() OVER (ORDER BY (SELECT 1)) AS TheOrder FROM XMLList.nodes('/list/i/node()') AS x ( y ) )Nodes(Thevalue,TheOrder) GROUP BY TheValue ORDER BY min(TheOrder) FOR XML PATH('') ),1,1,'') ,'<','<'),'>','>'),'&','&') as Deduplicated FROM (/*XML version of the original list*/ SELECT convert(XML,'<list><i>' +replace( replace( replace(ASCIIList,'&','&') ,'<','<') ,'/','</i><i>')+'</i></list>') FROM (VALUES ('Hickory/Dickory/Dock/dock'), ('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'), ('One/Two/Three/Three/Three/Four'), ('Baa/Baa/Black/Sheep'), ('XML characters<>&''/or html/can be/ awkward') )XMLlist(AsciiList) )lists(XMLlist); GO /* Deduplicated −−−−−−−−−−−−−−−−−−−−−−−−−−- Hickory/Dickory/Dock Bravo/Alpha/Tango/Delta One/Two/Three/Four Baa/Black/Sheep Sure, this sort of job will never be quick because there are still correlated subqueries in there to upset the CPU! Can this be improved, performance-wise? Well, an obvious improvement would be to use XQuery and flwor to do the 'distinctifying' operation. The only problem could be that the SQL Server version of fn:distinct-values doesn't allow collations so you get binary comparisons. It is a small price to pay if performance is your first consideration.*/ SELECT replace(replace(replace( stuff(replace(convert(VARCHAR(MAX), xmllist.query(' for $x in distinct-values((data(/list/y/@i))) return (concat("/",$x)) ')), ' /','/') ,1,1,'') ,'<','<'),'>','>'),'&','&') AS result FROM (/*XML version of the original list*/ SELECT convert(XML,'<list><y i="'+ replace(replace(replace(ASCIIList,'&','&') ,'<','<'),'/','" /><y i="')+'"/></list>') FROM (VALUES ('Hickory/Dickory/Dock/Dock'), ('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'), ('One/Two/Three/Three/Three/Four'), ('<some<>/awkward&/<>bits'), ('Baa/Baa/Black/Sheep') )XMLlist(AsciiList) )lists(XMLlist); /* The next refinement came from a reader ‘xcelsius’ who suggested a simpler way of doing the XML ‘distinct-values’ trick. It needed a bit of tweaking and didn’t display properly as a comment so I reproduce the version here. Although it was initially slower than the version above, a small change had it going faster.*/ select replace( Replace( Replace (Replace(cast(d.query('distinct-values(/i/text())') as varchar(50)),'&','&') ,'<','<') ,' ','/') ,' ',' ') as Deduplicated from (SELECT cast('<i>'+replace( replace( Replace( replace(AsciiList,' ',' '),'&','&') ,'<','<') ,'/','</i><i>') +'</i>' as xml ) as d FROM (VALUES ('Hickory/Dickory/Dock/dock'), ('Bra vo/Alpha/Bra vo/Tango/Delta/Bravo/Alpha/Delta'), ('One/Two/Three/Three/Three/Four'), ('Baa/Baa/Black/Sheep'), ('<some/awkward&/bits'), ('Case/Dickory/case') )XMLlist(AsciiList) )lists; /* |
So there you have it. With XML tricks and window functions, we have more opportunity for kicking out any need for functions. To use this code, you’d just swap out the select statement that supplied my samples to the routine, for the lists that you want to deduplicate. I am intrigued that there are such different ways of doing a solution for this task in SQL server. Are there yet other ways of doing it?
- 22-Jan-2016 Fixed a bug spotted by Aaron Bertrand)
- 28-Jan-2016 Fixed an XML escaping bug spotted by xcelsius
- 29-Jan-2016 Added fast version using XQuery and flwor
- 12-Feb-2016 Added a new faster solution based on that of xcelsius
Load comments