De-duplicating Delimited Lists in SQL Server

In a recent article ‘Removing Duplicates from Strings in SQL Server’, Aaron Bertrand comes up with the interesting problem of removing duplicates from lists of strings. The example he gives is …

1	'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'

… and you want to make each word unique in the list…

1	‘Alpha/Bravo/Delta/Tango’. … You want just one occurrence of Bravo, for example.

He goes on to describe the conventional ‘number-table’ solution.

It made me wonder if it would be possible to do an alternative solution that didn’t require functions but would also work on a whole set of lists. Meybe, even a single expression.

Normally, these lists will be quite long. There is no place in a relational database for lists. However, there are times when we are forced to resort to them, if only when importing data from other systems. There is a subtle problem with them. Sometimes, lists not only describe the members of a set but also their order in the set.

In this case, it is the order of the first occurrence of the item in the list. To do the latter, the solution is slightly more complicated.

Let’s start with a set of lists. Just for simplicity, we’ll use the same delimiter for each though you’ll probably spot how to specify the delimiter and even how to allow multi-character delimiters. I’m leaving all that stuff out to keep all this code as simple as possible

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

SELECT AsciiList FROM (VALUES

('Hickory/Dickory/Dock/dock'),

('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'),

('One/Two/Three/Three/Three/Four'),

('Baa/Baa/Black/Sheep'))ListsWithDuplicates(AsciiList);

/* which gives …

AsciiList

———————————————–

Hickory/Dickory/Dock/dock

Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta

One/Two/Three/Three/Three/Four

Baa/Baa/Black/Sheep

All these strings have duplicate entries. 'Dock', in the first case, and everything bar 'Tango'

in the second, 'Three' in the third one and 'Baa' in the fourth

So we can test a simple expression that removes the duplicates. Note that the offending column in your ETL table would

substitute for our samples in a real use-case.

SELECT –the re-assembled list

replace(replace(Replace(

stuff(

(SELECT DISTINCT '/'+x.y.value('.','varchar(20)')

FROM XMLList.nodes('/list/i/text()') AS x ( y )

FOR XML PATH('')),1,1,'')

,'<','<'),'>','>')

,'&','&') as Deduplicated–using the XML trick

FROM (–XML version of the original list

SELECT convert(XML,'<list>'

+replace(

replace(

replace(ASCIIList,'&','&')

,'<','<')

,'/','')+'</list>')

FROM (VALUES

('Hickory/Dickory/Dock/dock'),

('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'),

('One/Two/Three/Three/Three/Four'),

('Baa/Baa/Black/Sheep')

)XMLlist(AsciiList)

)lists(XMLlist);

/* giving …

Deduplicated

————————-

Dickory/Dock/Hickory

Alpha/Bravo/Delta/Tango

Four/One/Three/Two

Baa/Black/Sheep

But this would be wrong if the order of the list is important. A list can just be a simple unordered list, or there could be an implied order there.

Let's alter this query to return the deduplicated list in the order of the first occurence of each item. It is easier done by example than explained! */

SELECT /*the re-assembled list*/

replace(replace(Replace(

stuff(

(SELECT '/'+TheValue FROM

(SELECT

x.y.value('.','varchar(20)') AS [Thevalue!cdata],

row_number() OVER (ORDER BY (SELECT 1)) AS TheOrder

FROM XMLList.nodes('/list/i/node()') AS x ( y )

)Nodes(Thevalue,TheOrder)

GROUP BY TheValue

ORDER BY min(TheOrder)

FOR XML PATH('')

),1,1,'')

,'<','<'),'>','>'),'&','&')

as Deduplicated

FROM (/*XML version of the original list*/

SELECT convert(XML,'<list>'

+replace(

replace(

replace(ASCIIList,'&','&')

,'<','<')

,'/','')+'</list>')

FROM (VALUES

('Hickory/Dickory/Dock/dock'),

('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'),

('One/Two/Three/Three/Three/Four'),

('Baa/Baa/Black/Sheep'),

('XML characters<>&''/or html/can be/ awkward')

)XMLlist(AsciiList)

)lists(XMLlist);

Deduplicated

−−−−−−−−−−−−−−−−−−−−−−−−−−-

Hickory/Dickory/Dock

Bravo/Alpha/Tango/Delta

One/Two/Three/Four

Baa/Black/Sheep

Sure, this sort of job will never be quick because there are still correlated subqueries in there to upset the CPU! Can this be improved, performance-wise? Well, an obvious improvement would be to use XQuery and flwor to do the 'distinctifying' operation. The only problem could be that the SQL Server version of fn:distinct-values doesn't allow collations so you get binary comparisons. It is a small price to pay if performance is your first consideration.*/

SELECT

replace(replace(replace(

stuff(replace(convert(VARCHAR(MAX),

xmllist.query('

for $x in distinct-values((data(/list/y/@i)))

return (concat("/",$x))

')),

' /','/')

,1,1,'')

,'<','<'),'>','>'),'&','&') AS result

FROM (/*XML version of the original list*/

SELECT convert(XML,'<list><y i="'+

replace(replace(replace(ASCIIList,'&','&')

,'<','<'),'/','" /><y i="')+'"/></list>')

FROM (VALUES

('Hickory/Dickory/Dock/Dock'),

('Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta'),

('One/Two/Three/Three/Three/Four'),

('<some<>/awkward&/<>bits'),

('Baa/Baa/Black/Sheep')

)XMLlist(AsciiList)

)lists(XMLlist);

The next refinement came from a reader ‘xcelsius’ who suggested a simpler way of doing the XML ‘distinct-values’ trick. It needed a bit of tweaking and didn’t display properly as a comment so I reproduce the version here. Although it was initially slower than the version above, a small change had it going faster.*/

select

replace(

Replace(

Replace

(Replace(cast(d.query('distinct-values(/i/text())') as varchar(50)),'&','&')

,'<','<')

,' ','/')

,' ',' ') as Deduplicated

from

(SELECT

cast(''+replace(

replace(

Replace(

replace(AsciiList,' ',' '),'&','&')

,'<','<')

,'/','')

+'' as xml

) as d

FROM (VALUES

('Hickory/Dickory/Dock/dock'),

('Bra vo/Alpha/Bra vo/Tango/Delta/Bravo/Alpha/Delta'),

('One/Two/Three/Three/Three/Four'),

('Baa/Baa/Black/Sheep'),

('<some/awkward&/bits'),

('Case/Dickory/case')

)XMLlist(AsciiList)

)lists;

So there you have it. With XML tricks and window functions, we have more opportunity for kicking out any need for functions. To use this code, you’d just swap out the select statement that supplied my samples to the routine, for the lists that you want to deduplicate. I am intrigued that there are such different ways of doing a solution for this task in SQL server. Are there yet other ways of doing it?

22-Jan-2016 Fixed a bug spotted by Aaron Bertrand)
28-Jan-2016 Fixed an XML escaping bug spotted by xcelsius
29-Jan-2016 Added fast version using XQuery and flwor
12-Feb-2016 Added a new faster solution based on that of xcelsius

Register for Simple Talk

De-duplicating Delimited Lists in SQL Server

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

Recommended

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases