Have you ever fancied being able to put Simple-Talk’s content into a database? We haven’t either, but we’ll do it just for practice. The objective of this workbench is to show just how easy it is to implement an RSS-based datafeed in SQL Server 2005. The purpose is to try out downloading an XML file from the internet, checking it for validity, and if not, seeing if it is a redirect. We’ll show a cool SQL Server 2005 way of reading a text file into a VARCHAR(MAX)
variable. We’ll do some shredding of an XML file too. By the time we’ve finished we hope we will have illustrated some useful techniques and given you an application that you can turn into a production-quality newsreader.
It is very easy to read RSS feeds and OPML feeds into SQL Server 2005. This is because the files are XML and are easily eaten by the XML functions. RSS, which doesn’t actually stand for anything other than RDF Site Summary, is a simple way of publishing, or broadcasting, frequently updated Web content such as articles and blog entries. OPML, originally conceived for transferring outlines, is a format that has been used widely for distributing collections of RSS feeds. However, it has a large number of uses in transferring simple structured list-based information. As it is used for exchanging lists of RSS feeds, we’ll show how to read in an OPML list of feeds.
Please remember before we start that this code is for illustration and is not ‘production quality’. You will need to sort out the security issues (see the process delegation workbench) and add logging and more error-checking.
We won’t attempt to store all the information that RSS2 defines, simply because we can’t find any sites that use all this information. In fact, most feeds use only a small part of the information and few aggregators seem to display the extra information anyway. We’ll boil it down to two relational tables.
Firstly, create a new database called RSSFeedWorkbench. Make sure you have the rights to execute xp_cmdshell
too!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
USE RSSFeedWorkbench GO IF EXISTS ( SELECT * FROM sysobjects --delete it if it exists WHERE id = OBJECT_ID('RSSitem') ) DROP TABLE [RSSitem] /* We create two tables. RSSItem contains every news item that we collect. It is linked to an RSSFeed table which contains every feed that we want to read from. We have not included every parameter of a feed item because they are rarely used. */ CREATE TABLE dbo.RSSitem ( RSSitem_ID INT IDENTITY(1, 1) NOT NULL, RSSFeed_ID INT NULL, title VARCHAR(80) NOT NULL, link VARCHAR(200) NULL, [description] VARCHAR(MAX) NULL, PubDate VARCHAR(40) NULL, [GUID] VARCHAR(80) NULL, insertiondate DATETIME NOT NULL CONSTRAINT dfRssItemDate DEFAULT ( GETDATE() ), CONSTRAINT PK_RSSitem PRIMARY KEY CLUSTERED ( RSSitem_ID ASC ) ) ON [PRIMARY] /****** Object: Table dbo . RSSFeed Script Date: 07/02/2007 10:32:19 ******/ IF EXISTS ( SELECT * FROM sysobjects --delete it if it exists WHERE id = OBJECT_ID('RSSFeed') ) DROP TABLE [RSSFeed] |
The RSSFeed table contains every RSS feed that we want to read from. It is used to poll around the feeds. This routine checks to see whether the TYPE
field is set to RSS (or RSS2 etc). If it isn’t then it isn’t read.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
CREATE TABLE dbo.RSSFeed ( RSSFeed_ID INT IDENTITY(1, 1) NOT NULL, title VARCHAR(80) NOT NULL, link VARCHAR(200) NOT NULL, [description] VARCHAR(2000) NOT NULL DEFAULT '', [language] VARCHAR(10) NOT NULL DEFAULT 'en-US', Category VARCHAR(100) NOT NULL DEFAULT 'Generic', docs VARCHAR(80) NULL, Generator VARCHAR(80) NULL, ManagingEditor VARCHAR(80) NULL, Webmaster VARCHAR(80) NULL, actualURL VARCHAR(200) NULL, [type] VARCHAR(80) NOT NULL DEFAULT 'RSS',--is it RSS or ATOM? insertiondate DATETIME NOT NULL DEFAULT ( GETDATE() ), CONSTRAINT PK_RSSFeed PRIMARY KEY CLUSTERED ( RSSFeed_ID ASC ) ) ON [PRIMARY] GO ALTER TABLE dbo.RSSitem WITH CHECK ADD CONSTRAINT FK_RSSitem_RSSFeed FOREIGN KEY ( RSSFeed_ID ) REFERENCES dbo.RSSFeed ( RSSFeed_ID ) GO ALTER TABLE dbo.RSSitem CHECK CONSTRAINT FK_RSSitem_RSSFeed GO CREATE NONCLUSTERED INDEX [idxRSSFeedTitle] ON [dbo].[RSSFeed] ( [title] ASC ) CREATE NONCLUSTERED INDEX [idxTitle] ON [dbo].[RSSitem] ( [title] ASC ) GO |
Although there are more portable ways of getting the contents of the URL, we’ll use CURL.EXE, which is free, and very easy to install, and useful for a wide range of purposes. You’ll need to download and install this to get the Workbench to run. You’ll need to install the OpenSSL package too from here .
Some RSS feeds will try to stop non-browser software from reading an RSS feed. Curl has useful facilities that allow it to mimic a browser to the extent of allowing cookies and letting you specify the user agent.
This is the stored procedure that grabs the contents of the feed, checks to see if is one we know about and, if not, saves its attributes. then it saves any items it has not already saved. As it saves the url of the feed, you can use it to add the feed to your repository, and then it will automatically refresh it (if you install the spUpdateAllFeeds
stored procedure in the right place. This stored procedure is more complex than it might be because one can try to access a feed only to find it is a redirection to the actual feed. If this is the case then one has to read the file as an HTML file, get the anchor, and use the contents of the HREF as the source of the newsfeed, before trying again!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 |
IF EXISTS ( SELECT * FROM sysobjects WHERE id = OBJECT_ID('spUpdateRSSArchiveFrom') ) DROP PROCEDURE [spUpdateRSSArchiveFrom] GO CREATE PROCEDURE [dbo].[spUpdateRSSArchiveFrom] @url VARCHAR(200),--URL of the RSS datafeed to register and read from @tempfile VARCHAR(80) = 'c:\rssfeed.xml'--optional file to save as AS /* execute spUpdateRSSArchiveFrom 'http://www.simple-talk.com/community/forums/rss.aspx?ForumID=142&Mode=0', 'C:\forums.rss' */ SET nocount ON DECLARE @ErrorMessage VARCHAR(100) DECLARE @command NVARCHAR(255) --the command string used for spExecuteSQL DECLARE @ExitCode INT--the code returned by the xp_cmdShell procedure DECLARE @ExitCodeASCII VARCHAR(10)--the ascii version of above code DECLARE @Badfile VARCHAR(MAX)--we use this to pull in any HTML files to --examine if there was an error in reading in the XML file DECLARE @Retries INT--the number of retries we allow DECLARE @Ref VARCHAR(1000)--the reference supplied by the feed redirector SELECT @retries = 0--set up the retry count CREATE TABLE #lines ( line VARCHAR(2000) )--table used to remember what is --passed back from the command line when executing xp_cmdShell retry:--yes we use GOTOs. We sometimes do... RAISERROR ( 'processing %s', 0, 1, @url )--a way of getting an immediate --progress report /* now we make up the command line for CURL.EXE. You may want to change the parameters here for various reasons such as a proxy. If an error happens then whatever CURL prints out is returned to the procedure and printed out by the error processing */ SELECT @command = 'curl "' + @url + '" -o"' + @tempfile + '" -A"Mozilla/4.0 (compatible; MSIE 6.0;' +' Windows NT 5.0; .NET CLR 1.1.4322"' /*Firstly we read the RSSfeed into a file*/ INSERT INTO #lines ( line ) EXECUTE @exitcode= xp_cmdshell @command--and execute it SELECT @ErrorMessage = COALESCE(@errorMessage, '') + COALESCE(line, '') FROM #lines--put the entire result into one string IF EXISTS ( SELECT 1 FROM #lines WHERE line LIKE '%is not recognized as an%' ) BEGIN--the silly moo hasn't installed CURL! RAISERROR ( 'Sorry. You must have CURL installed first!', 16, 1 ) RETURN 1 END IF EXISTS ( SELECT 1--an error was returned from CURL FROM #lines WHERE line LIKE '%Could not resolve host%' ) OR EXISTS ( SELECT 1 FROM #lines WHERE line LIKE 'curl: (%' ) BEGIN RAISERROR ( 'Sorry. Could not get RSS feed because %s', 16, 1, @ErrorMessage ) RETURN 1 END IF @Exitcode <> 0 --ah CURL command line returned an error code BEGIN SELECT @ExitcodeASCII = CONVERT(VARCHAR(5), @Exitcode) SELECT @ErrorMessage = COALESCE(@errorMessage, '') + COALESCE(line, '') FROM #lines RAISERROR ( 'Sorry. Errorcode %s. Curl reports %s', 16, 1, @ExitcodeASCII, @ErrorMessage ) RETURN 1 END /* ...then we read it in...*/ DECLARE @RSSfeed XML DECLARE @Feed_ID INT BEGIN TRY--we will catch errors at this point as they are usually XML SELECT @Command = 'SELECT @RSSfeed = BulkColumn FROM OPENROWSET(BULK ''' + @TempFile + ''', SINGLE_BLOB) AS x ' EXEC sp_executesql @command, N'@RSSfeed xml output', @RSSfeed OUTPUT END TRY BEGIN CATCH IF @@Error IN (9413,9422)--if it was an XML parsing error BEGIN RAISERROR ( '%s was not valid XML. Was it a redirect?', 0, 1, @url ) SELECT @Command = 'SELECT @Badfile = BulkColumn FROM OPENROWSET(BULK ''' + @TempFile + ''', SINGLE_BLOB) AS x ' EXEC sp_executesql @command, --have a look at it --read the file as a TEXT file rather than an XML N'@BadFile varchar(MAX) output', @BadFile OUTPUT --we examine it to see if it is a redirect file SELECT @ref = SUBSTRING(@BadFille, CHARINDEX('<a href="', @BadFile + '<a href="', 1) + 9, 1000) SELECT @ref = LEFT(@Ref, CHARINDEX('"', @ref + '"') - 1) IF LEN(@Ref) < 10--check for obvious signs of problems OR @ref NOT LIKE 'http://%' BEGIN RAISERROR ( 'The url %s was not a valid redirect!', 16, 1, @URL ) RETURN 1 END UPDATE RSSFeed SET type='BAD' WHERE ActualURL=@URL SELECT @URL = @ref, @Retries = @Retries + 1 IF @retries < 3 GOTO retry END RAISERROR ( 'The url %s was not a valid RSS feed!', 16, 1, @URL ) RETURN 1 END CATCH /*Now we see if this is a new feed. If it is we add it to the feed database. */ DECLARE @RSSFeedAttributes TABLE ( title VARCHAR(80), link VARCHAR(200), [description] VARCHAR(2000), [language] VARCHAR(10) DEFAULT 'en-US', Category VARCHAR(100) DEFAULT 'Generic', docs VARCHAR(80), Generator VARCHAR(80), ManagingEditor VARCHAR(80), Webmaster VARCHAR(80) ) /* <title>My Title</title> <link>My Link</link> <description>My Description</description> <language>My Language e.g. en-us</language> <category>Newspapers</category><docs>My Docs URL</docs> <generator>My RSS Generator</generator> <managingEditor>My.Editorial@Email.Address</managingEditor> <webMaster>My.Webmaster@Email.Address</webMaster> */ INSERT INTO @RSSFeedAttributes ( title, link, [description], [language], Category, docs, Generator, ManagingEditor, Webmaster ) SELECT x.feed.value('title[1]', 'varchar(80)') AS title, x.feed.value('link[1]', 'varchar(200)') AS link, x.feed.value('description[1]', 'varchar(2000)') AS [description], COALESCE(x.feed.value('language[1]', 'Varchar(10)'), 'en-US') AS [language], COALESCE(x.feed.value('Category[1]', 'Varchar(100)'), 'Generic') AS [Category], x.feed.value('docs[1]', 'Varchar(80)') AS [docs], x.feed.value('generator[1]', 'Varchar(80)') AS [Generator], x.feed.value('managingeditor[1]', 'Varchar(80)') AS [ManagingEditor], x.feed.value('webmaster[1]', 'Varchar(80)') AS [Webmaster] FROM @RSSfeed.nodes('//rss/channel') AS x ( feed ) --insert the feed if it doesn't exist INSERT INTO RSSFeed ( title, link, [description], [language], Category, docs, Generator, ManagingEditor, Webmaster, ActualURL ) SELECT f.title, f.link, f.[description], f.[language], f.Category, f.docs, f.Generator, f.ManagingEditor, f.Webmaster, @URL FROM @RSSFeedAttributes f LEFT OUTER JOIN RSSFeed ON RSSfeed.title = f.title AND rssFeed.link = rssfeed.link WHERE rssFeed.rssFeed_ID IS NULL --..and get the ID of the feed SELECT TOP 1 @Feed_ID = rssFeed.rssFeed_ID FROM @RSSFeedAttributes f INNER JOIN RSSFeed ON RSSfeed.title = f.title AND rssFeed.link = rssfeed.link --and add any RSSfeed item that doesn't exist INSERT INTO RSSitem ( RSSFeed_ID, title, link, [description], PubDate, [GUID] ) SELECT @feed_ID, f.title, f.link, f.description, COALESCE(f.pubdate, '-'), COALESCE(f.[GUID], '-') FROM ( SELECT x.feed.value('title[1]', 'varchar(80)') AS title, x.feed.value('link[1]', 'varchar(200)') AS link, x.feed.value('description[1]', 'varchar(max)') AS [description], x.feed.value('pubdate[1]', 'Varchar(40)') AS [pubdate], CONVERT(VARCHAR(80), x.feed.value('guid[1]', 'Varchar(2000)')) AS [GUID] FROM @RSSfeed.nodes('//rss/channel/item') AS x ( feed ) ) f LEFT OUTER JOIN RSSitem ON COALESCE(f.title, '-') = RSSitem.title AND COALESCE(f.GUID, '-') = RSSitem.GUID AND COALESCE(f.PubDate, '-') = rssitem.pubdate WHERE rssitem.RSSitem_ID IS NULL GO IF EXISTS ( SELECT * FROM sysobjects WHERE id = OBJECT_ID('spReadOPMLFile') ) DROP PROCEDURE [spReadOPMLFile] GO /* Now we provide a routine for reading in any ompl FILE. This is just a list so it is relatively easy to read, but note that the list may be hierarchical. We are not too concerned about the OPML hierarchy so we won't bother to record the hierarchy. */ CREATE PROCEDURE spReadOPMLFile @Filename VARCHAR(100) /* spReadOPMLFile 'S:MyFavouriteRSSFeeds.opml' */ AS DECLARE @opmlfeed XML DECLARE @command NVARCHAR(255) SELECT @Command = 'SELECT @opmlfeed = BulkColumn FROM OPENROWSET(BULK ''' + @filename + ''', SINGLE_BLOB) AS x ' EXEC sp_executesql @command, N'@opmlfeed xml output', @opmlfeed OUTPUT INSERT INTO RSSfeed ( title, link, [description], [language], Category, actualURL, [type] ) SELECT f.* FROM ( SELECT x.opml.value('@title', 'nvarchar(80)') AS title, x.opml.value('@xmlUrl', 'nvarchar(200)') AS link, COALESCE(x.opml.value('@description', 'nvarchar(2000)'), '') AS [description], COALESCE(x.opml.value('@language', 'nvarchar(2000)'), 'Een-US') AS [language], 'Generic' AS [category], x.opml.value('@xmlUrl', 'nvarchar(200)') AS actualURL, COALESCE(x.opml.value('@version', 'nvarchar(80)'), 'RSS') AS [type] FROM @opmlfeed.nodes('/descendant::outline[@xmlUrl]') AS x ( opml ) ) f LEFT OUTER JOIN rssfeed ON f.link = rssfeed.link WHERE rssfeed.rssFeed_ID IS NULL GO IF EXISTS ( SELECT * FROM sysobjects WHERE id = OBJECT_ID('spUpdateAllFeeds') ) DROP PROCEDURE [spUpdateAllFeeds] GO /* this is the stored procedure that polls all your existing RSS feeds and reads in any new news entries*/ CREATE PROCEDURE spUpdateAllFeeds AS DECLARE @command VARCHAR(MAX) SELECT @Command = COALESCE(@command, '') + ' execute spUpdateRSSArchiveFrom ''' + actualURL + ''' ' FROM rssFeed WHERE type LIKE 'RSS%' EXECUTE ( @command )--execute them as a string GO |
In the list of files to download will be some sample OMPL files to try out. Now when you’ve loaded in a few RSS feeds, all you have to do is to put this stored procedure on the scheduler and this will keep your News up-to-date. Instant Blogrolls! The first person to write an OMPL exporter for this will get a Simple-Talk goodie bag!
Load comments