{"id":360,"date":"2008-03-27T00:00:00","date_gmt":"2008-03-27T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/getting-html-data-workbench\/"},"modified":"2021-09-29T16:22:12","modified_gmt":"2021-09-29T16:22:12","slug":"getting-html-data-workbench","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/getting-html-data-workbench\/","title":{"rendered":"Getting HTML Data: Workbench"},"content":{"rendered":"\n<p>There is a lot of data out there in the internet. There are times you want to get at it. You may , for example, want the latest exchange rates, or maybe commodity prices. The problem in getting it is that you want it in a form that can be easily consumed and assimilated  by SQL Server. We&#8217;ll show you how you might get an HTML table into a SQL Server table.<\/p>\n<p>There often comes a time when one wants to parse an XHTML or HTML document into its DOM in order to get data from it. Without thinking too hard, we can think of several times we&#8217;ve wanted it.<\/p>\n<ul>\n<li>to get data from a table on an HTML page on the internet (with CURL.EXE it is so easy!) <\/li>\n<li>to make sure that an XHTML snippet is valid for a blog comment (e.g. has it got a TH or TD without an enclosing TR or TABLE?) <\/li>\n<li>to check if a snippet has got <code>SCRIPT <\/code>in it? <\/li>\n<li>to convert HTML code into IFcodes or BBcodes. <\/li>\n<li>to find all the anchors on an HTML page <\/li>\n<li>to index up existing HTML pages for site searches.<\/li>\n<\/ul>\n<p>Occasionally, one can assign an XHTML snippet to an XML variable in SQL Server. We&#8217;ve never had much luck with this approach since perfectly valid XHTML can trigger an error in SQL Server&#8217;s XML.<\/p>\n<p>There are several sensible approaches to investigating the Document Object Model (DOM) of an HTML document. These will generally involve scripting, using one of the existing DOM parsers. Parsing it using a stored procedure or function isn&#8217;t generally included in the list of sensible options.<\/p>\n<p>You can, of course, use the Microsoft DOMDocument object, <a href=\"http:\/\/msdn2.microsoft.com\/en-us\/library\/aa290341(VS.71).aspx\">Microsoft.mshtml<\/a>, the mozilla DOMparser, or the XMLDocument class in .NET. In PHP5 you would use the DOM extension. There are several other ways.<\/p>\n<p>We&#8217;ll use TSQL.<\/p>\n<p>Why do this? It is because we have a fine control over the results. We can get a table containing the DOM, in all its hierarchical glory, each row being a single element, in such a way that we can use SQL Queries to get whatever data we are interested in extracted.<\/p>\n<p>One big problem with HTML is that browsers are deliberately tolerant of bad markup. This makes a robust way of analysing the structure of an HTML document extraordinarily elaborate. It is possible but the code is too long for a Workbench. Instead, we will provide the means to analyse the structure of an XHTML document or fragment. The difference here is that you are more likely to find that it is well-formed, and you can reject errors.<\/p>\n<p>We&#8217;ve chosen to show you this as a stored procedure because this code is really is the stuff of workbenches. We&#8217;d like you to improve upon it and see if you can bend it to other purposes. Using a stored procedure rather than a function means that you can add debug code more easily whilst experimenting.<\/p>\n<p>Don&#8217;t get us wrong. The code works well enough for the sort of jobs we have. XHTML without errors. (HTMLTidy is a wonderful app). We&#8217;d just be very pleased if you get interested in what we&#8217;ve done, and experiment with the ideas!<\/p>\n<p>First off, we need a table with all the valid markup.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">\r\n    --Creating the HTML Element table\r\n        \r\nIF OBJECT_ID(N'HTMLTags') IS NOT NULL \r\n     DROP TABLE HTMLTags--if it exists\r\n--don't try this technique in a live database with other users!\r\n--write the Tags\/Elements table\r\nSELECT [tag]='!DOCTYPE', [meaning]='Defines the document type',\r\n     [type]='Basic Tag',[HasclosingTag]=0 INTO  HTMLTags \r\nUNION SELECT '!|[CDATA|[', 'delimits a javascript area in XHTML','Basic Tag',0\r\n--note we have used the ESCAPE char | before the '[' character\r\n--as otherwise it would foul up the LIKE comparison\r\nUNION SELECT '?xml', 'flags an XML document','Basic Tag',0\r\nUNION SELECT 'html', 'Defines a html document','Basic Tag',1\r\nUNION SELECT 'body', 'Defines the body element','Basic Tag',1\r\nUNION SELECT 'h1', 'Defines header 1 ','Basic Tag',1\r\nUNION SELECT 'h2', 'Defines header 2 ','Basic Tag',1\r\nUNION SELECT 'h3', 'Defines header 3 ','Basic Tag',1\r\nUNION SELECT 'h4', 'Defines header 4 ','Basic Tag',1\r\nUNION SELECT 'h5', 'Definess header 5 ','Basic Tag',1\r\nUNION SELECT 'h6', 'Defines header 6 ','Basic Tag',1\r\nUNION SELECT 'p', 'Defines a paragraph','Basic Tag',1\r\nUNION SELECT 'br', 'Inserts a single line break','Basic Tag',0\r\nUNION SELECT 'hr', 'Defines a horizontal rule','Basic Tag',0\r\nUNION SELECT '!--', 'Defines a comment','Basic Tag',0\r\nUNION SELECT 'b', 'Defines bold text','Char Format',1\r\nUNION SELECT 'font', 'Defines the font face, size, and color of text',\r\n                                                   'Char Format',1\r\nUNION SELECT 'i', 'Defines italic text','Char Format',1\r\nUNION SELECT 'em', 'Defines emphasized text ','Char Format',1\r\nUNION SELECT 'big', 'Defines big text','Char Format',1\r\nUNION SELECT 'strong', 'Defines strong text','Char Format',1\r\nUNION SELECT 'small', 'Defines small text','Char Format',1\r\nUNION SELECT 'sup', 'Defines superscripted text','Char Format',1\r\nUNION SELECT 'sub', 'Defines subscripted text','Char Format',1\r\nUNION SELECT 'bdo', 'Defines the direction of text display',\r\n                                                   'Char Format',1\r\nUNION SELECT 'u', 'Defines underlined text','Char Format',1\r\nUNION SELECT 'pre', 'Defines preformatted text','Output',1\r\nUNION SELECT 'code', 'Defines computer code text','Output',1\r\nUNION SELECT 'tt', 'Defines teletype text','Output',1\r\nUNION SELECT 'kbd', 'Defines keyboard text','Output',1\r\nUNION SELECT 'dfn', 'Defines a definition term','Output',1\r\nUNION SELECT 'var', 'Defines a variable','Output',1\r\nUNION SELECT 'samp', 'Defines sample computer code','Output',1\r\nUNION SELECT 'xmp', 'Deprecated. Use &lt;pre&gt; instead','Output',1\r\nUNION SELECT 'acronym', 'Defines an acronym','Blocks',1\r\nUNION SELECT 'abbr', 'Defines an abbreviation','Blocks',1\r\nUNION SELECT 'address', 'Defines an address element','Blocks',1\r\nUNION SELECT 'blockquote', 'Defines an long quotation','Blocks',1\r\nUNION SELECT 'center', 'Defines centered text','Blocks',1\r\nUNION SELECT 'q', 'Defines a short quotation','Blocks',1\r\nUNION SELECT 'cite', 'Defines a citation','Blocks',1\r\nUNION SELECT 'ins', 'Defines inserted text','Blocks',1\r\nUNION SELECT 'del', 'Defines deleted text','Blocks',1\r\nUNION SELECT 's', 'Defines strikethrough text','Blocks',1\r\nUNION SELECT 'strike', 'Defines strikethrough text','Blocks',1\r\nUNION SELECT 'a', 'Defines an anchor','Links',1\r\nUNION SELECT 'link', 'Defines a resource reference','Links',0\r\nUNION SELECT 'frame', 'Defines a sub window (a frame)','Frames',1\r\nUNION SELECT 'frameset', 'Defines a set of frames','Frames',1\r\nUNION SELECT 'noframes', 'Defines a noframe section','Frames',1\r\nUNION SELECT 'iframe', 'Defines an inline sub window (frame)','Frames',1\r\nUNION SELECT 'form', 'Defines a form ','Input',1\r\nUNION SELECT 'input', 'Defines an input field','Input',0\r\nUNION SELECT 'textarea', 'Defines a text area','Input',1\r\nUNION SELECT 'button', 'Defines a push button','Input',1\r\nUNION SELECT 'select', 'Defines a selectable list','Input',1\r\nUNION SELECT 'optgroup', 'Defines an option group','Input',1\r\nUNION SELECT 'option', 'Defines an item in a list box','Input',1\r\nUNION SELECT 'label', 'Defines a label for a form control','Input',1\r\nUNION SELECT 'fieldset', 'Defines a fieldset','Input',1\r\nUNION SELECT 'legend', 'Defines a title in a fieldset','Input',1\r\nUNION SELECT 'isindex', 'Deprecated. Use &lt;input&gt; instead','Input',1\r\nUNION SELECT 'ul', 'Defines an unordered list','Lists',1\r\nUNION SELECT 'ol', 'Defines an ordered list','Lists',1\r\nUNION SELECT 'li', 'Defines a list item','Lists',1\r\nUNION SELECT 'dir', 'Defines a directory list','Lists',1\r\nUNION SELECT 'dl', 'Defines a definition list','Lists',1\r\nUNION SELECT 'dt', 'Defines a definition term','Lists',1\r\nUNION SELECT 'dd', 'Defines a definition description','Lists',1\r\nUNION SELECT 'menu', 'Defines a menu list','Lists',1\r\nUNION SELECT 'img', 'Defines an image','Images',0\r\nUNION SELECT 'map', 'Defines an image map ','Images',1\r\nUNION SELECT 'area', 'Defines an area inside an image map','Images',0\r\nUNION SELECT 'table', 'Defines a table','Tables',1\r\nUNION SELECT 'caption', 'Defines a table caption','Tables',1\r\nUNION SELECT 'th', 'Defines a table header','Tables',1\r\nUNION SELECT 'tr', 'Defines a table row','Tables',1\r\nUNION SELECT 'td', 'Defines a table cell','Tables',1\r\nUNION SELECT 'thead', 'Defines a table header','Tables',1\r\nUNION SELECT 'tbody', 'Defines a table body','Tables',1\r\nUNION SELECT 'tfoot', 'Defines a table footer','Tables',1\r\nUNION SELECT 'col', 'Defines attributes for table columns ','Tables',0\r\nUNION SELECT 'colgroup', 'Defines groups of table columns','Tables',1\r\nUNION SELECT 'style', 'Defines a style definition','Styles',1\r\nUNION SELECT 'div', 'Defines a section in a document','Styles',1\r\nUNION SELECT 'span', 'Defines a section in a document','Styles',1\r\nUNION SELECT 'head', 'Defines information about the document','Meta Info',1\r\nUNION SELECT 'title', 'Defines the document title','Meta Info',1\r\nUNION SELECT 'meta', 'Defines meta information','Meta Info',0\r\nUNION SELECT 'base', 'Defines base URL for all links in a page','Meta Info',0\r\nUNION SELECT 'basefont', 'Defines a base font','Meta Info',0\r\nUNION SELECT 'script', 'Defines a script','Programming',1\r\nUNION SELECT 'noscript', 'Defines a noscript section','Programming',1\r\nUNION SELECT 'applet', 'Defines an applet','Programming',1\r\nUNION SELECT 'object', 'Defines an embedded object','Programming',1\r\nUNION SELECT 'param', 'Defines a parameter for an object','Programming',0\r\n--we thought this was a rather useful list!\r\nGO\r\n<\/pre>\n<p>The stored procedure <code>spParseXHTML<\/code> takes an XHTML fragment or document and returns a result that gives the list of tags, their nesting level, the order in which they were retrieved, the parameters, the innerHTML, the index into the document where the start of the tag was, the index into the document where the end of the tag was, and where the parent tag started. Finally, it has the key of the parent tag. This should be sufficient to do a fair amount of data analysis.<\/p>\n<p>The stored procedure is iterative. It should be recursive, but SQL Server isn&#8217;t naturally recursive, so an iterative solution was used instead. It isn&#8217;t fast. A large XHTML page took, we found, around ten seconds to analyse &ndash; but then there was all that lovely data!<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">\r\n--Creating the Parsing procedure\r\n        \r\n\r\nIF OBJECT_ID(N'spParseXHTML') IS NOT NULL \r\n   DROP PROCEDURE spParseXHTML\r\nGO\r\nCREATE PROCEDURE [dbo].[spParseXHTML]\r\n   @Mydocument VARCHAR(MAX)--the XHTML document or fragment\r\nAS \r\n   SET NOCOUNT ON\r\n   DECLARE @TagParams VARCHAR(8000),\r\n           @originalDocument VARCHAR(MAX),--copy of document\r\n           @Beginning INT,--the point at which the search for tags starts\r\n          @StartOfTag INT-- the opening &lt; of the tag\r\n   DECLARE @StartOfParentTag INT,-- the opening &lt; of the parent tag\r\n          @EndOfTag INT,-- the closing &gt; of the tag\r\n          @BackOfTag INT,-- the end &gt; of the &lt;\/tag&gt;\r\n          @ClosingTag INT,--the start &lt; of the &lt;\/tag&gt;\r\n          @nested INT,--is this tag nested\r\n          @ii INT,--iteration counter (just in case!)\r\n          @innerHTML VARCHAR(MAX),--the innerHTML between the tags\r\n          @hAScLOSINGtAG INT,--does the tag has a closing tag?\r\n          @Nesting INT,--the nesting count\r\n          @Tag VARCHAR(20),--the current tag (e.g. input)\r\n          @Error VARCHAR(8000)--the current error if any\r\n\r\n   CREATE TABLE #dom--we create the table that holds our DOM info\r\n      (\r\n       Element_ID INT IDENTITY(1, 1),--the unique identifer of each tag\r\n       nesting INT,--the nesting level of the tag\r\n       tag VARCHAR(20),--the actual tag (e.g. DIV, P, SPAN)\r\n       tagparams VARCHAR(7800),--the parameters (e.g. class=\"blue\")\r\n       innerHTML VARCHAR(MAX),--the html code contained in the tag\r\n      startOfTag INT,--the start &lt; position of the &lt;tag&gt;\r\n      EndOfTag INT,-- the end position of the &gt; of the &lt;\/tag&gt;\r\n      StartOfParentTag INT,--the start &lt; position of the parent &lt;tag&gt;\r\n      parent INT-- the key (Element_ID) of the parent tag (null if none)\r\n      )\r\n\r\n   SELECT   @ii = 1, @beginning = 1, @nesting = 0,\r\n       @OriginalDocument=@MyDocument --initialise variables\r\n   IF @MyDocument IS NULL --nothing to do\r\n      OR LEN(@MyDocument) &lt; 5 OR CHARINDEX('&lt;', @MyDocument) = 0 \r\n      SELECT @ii=10000--nothing to do but we must show empty table\r\n\r\n   WHILE @Beginning &lt; LEN(@MyDocument) AND @ii &lt; 10000--for all the document...\r\n      BEGIN --lets search for the next element \r\n         SELECT  @ii = @ii + 1--tally of iterations, in case something goes wrong.\r\n        --so we find the next start of the initial tag and end of the initial tag\r\n         SELECT  @StartOfTag = CHARINDEX('&lt;', @MyDocument, @Beginning - 1), \r\n                @EndOfTag = CHARINDEX('&gt;', @MyDocument, @Beginning)\r\n         IF @EndOfTag = 0--all done, no end-tag found\r\n            OR @StartOfTag = 0 --or start tag either!\r\n            BREAK--because it has been finished\r\n         IF @EndOfTag &lt; @StartOfTag--probably random tag\r\n            BEGIN--we take out the offending end-tag\r\n               SELECT   @MyDocument = STUFF(@MyDocument, @EndOfTag, 1, ']')\r\n              SELECT @Error=COALESCE(@Error+', ','')+'Syntax anomaly at'\r\n                       +SUBSTRING(@MyDocument,@endOfTag+1,20)           \r\n               CONTINUE-- flag an error\r\n            END\r\n        --now we identify the tag from the table list we've prepared\r\n         SELECT TOP 1 @HasClosingTag = HasClosingTag, @Tag=tag,@innerHTML=''\r\n           FROM    htmltags\r\n           WHERE  SUBSTRING(@MyDocument,@StartOfTag+1,20) LIKE tag+'%' ESCAPE '|'\r\n          ORDER BY LEN (tag) DESC--the longest match \r\n         IF @@rowcount = 0---if unrecognised or spurious end\r\n            BEGIN --neutralise the tags. they seem to be in error!\r\n              IF (SUBSTRING(@MyDocument,@StartOfTag+1,1)='\/')--eek, end tag\r\n                    SELECT @Error=COALESCE(@Error+', ','')\r\n                   +'Missing opening tag for &lt;'+SUBSTRING(@MyDocument,@StartOfTag+1,15) \r\n              ELSE --a tag that isn't in our list\r\n                    SELECT @Error=COALESCE(@Error+', ','')\r\n                   +'Could not find tag at '+SUBSTRING(@MyDocument,@StartOfTag+1,20) \r\n               --neutralise it so we never re-read it\r\n               --and take out all the pesky '&lt;' and '&gt;' characters in the block\r\n               SELECT @MyDocument = STUFF(@MyDocument, @EndOfTag, 1, 'X'),\r\n                      @MyDocument = STUFF(@MyDocument, @StartOfTag, 1, 'X'),\r\n                      @Beginning = 1   \r\n               CONTINUE           \r\n            END\r\n       IF @tag='!--'--the comment tag is a special case, blast it\r\n           BEGIN\r\n               SELECT @EndOfTag = CHARINDEX('--&gt;', @MyDocument, @StartOfTag)+2\r\n               IF @EndOfTag=2--nothing found\r\n                   BEGIN\r\n                   SELECT @Error=COALESCE(@Error+', ','')\r\n                                     +'Could not find end of comment''--&gt;''',\r\n                           @EndOfTag=LEN(@MyDocument)\r\n                   END\r\n               --and take out all the pesky '&lt;' and '&gt;' characters in the block\r\n               SELECT @MyDocument = STUFF(@MyDocument, @StartOfTag, @EndOfTag-@StartOfTag, \r\n                   REPLACE(REPLACE(\r\n                       SUBSTRING(@MyDocument, @StartOfTag, @EndOfTag-@StartOfTag),\r\n                   '&gt;',']'),'&lt;','['))\r\n           END \r\n       IF @tag='!|[CDATA|['--the CDATA tag is another special case, damn it\r\n           BEGIN\r\n               SELECT @EndOfTag = CHARINDEX(']]&gt;', @MyDocument, @StartOfTag)+2\r\n               IF @EndOfTag=2\r\n                   BEGIN\r\n                   SELECT @Error=COALESCE(@Error+', ','')\r\n                                                +'Could not find end of CDATA block',\r\n                           @EndOfTag=LEN(@MyDocument)\r\n                   END\r\n               SELECT @MyDocument = STUFF(@MyDocument, @StartOfTag, @EndOfTag-@StartOfTag, \r\n                   REPLACE(REPLACE(\r\n                       SUBSTRING(@MyDocument, @StartOfTag, @EndOfTag-@StartOfTag)\r\n                   ,'&gt;',']'),'&lt;','['))\r\n           END         \r\n        SELECT   @tagparams = LTRIM(RTRIM(SUBSTRING(@MyDocument,@startOfTag+LEN(@Tag)+1,\r\n                       @EndOfTag-(@StartOfTag+LEN(@Tag)+1))))\r\n         IF @HasClosingTag &lt;&gt; 0 --it has a closing-tag (HTML often leaves this out)\r\n            BEGIN--now we find the closing tag\r\n               SELECT   @closingtag = CHARINDEX('&lt;\/' + @tag + '&gt;', @MyDocument, @EndofTag), \r\n                   @nested = CHARINDEX('&lt;', @MyDocument, @EndofTag)\r\n               IF @nested = 0 \r\n        --there is no closing tag so it cannot be legal\r\n            \r\n                  BREAK --nowt else to do!\r\n               IF @nested &lt; @closingTag --Aha! that was nested!\r\n                  BEGIN--increment the nesting counter, and look for the first\r\n                       --child tag\r\n                     SELECT   @Nesting = @nesting + 1, @beginning = @EndOfTag + 1,\r\n                           @StartOfParentTag=@StartOfTag\r\n                     CONTINUE\r\n                  END\r\n               IF @ClosingTag = 0 \r\n                  BEGIN\r\n                    SELECT @Error=COALESCE(@Error+', ','')\r\n                             +'Could not find a matching &lt;\/' + @tag + '&gt;'\r\n                     SELECT   @beginning = @endOfTag + 1\r\n                     CONTINUE\r\n                  END\r\n             --so now we neutralise all the &lt; &gt; characters in the tag so we\r\n             --dont find the same tag again\r\n             SELECT   @MyDocument = STUFF(@MyDocument, @closingTag, 1, '[')\r\n             SELECT   @MyDocument = STUFF(@MyDocument, @closingTag+LEN(@tag)+2, 1, ']')\r\n             SELECT   @MyDocument = STUFF(@MyDocument, @EndOfTag, 1, ']')\r\n             SELECT   @MyDocument = STUFF(@MyDocument, @StartOfTag, 1, '[')\r\n             --left as four select statements for clarity!\r\n              SELECT   @innerHTML = SUBSTRING(@OriginalDocument, @EndofTag + 1, \r\n                        @closingTag - @EndofTag - 1),--get the innerHTML\r\n                       @BackOfTag=@closingTag+LEN(@tag)+3\r\n           END\r\n           ELSE--it was a tag without a closing tag (e.g. hr, br, input)\r\n                   BEGIN\r\n                     IF LEN(@TagParams)&gt;0\r\n                       IF SUBSTRING(REVERSE(@tagparams),1,1)='\/'\r\n                           SELECT @TagParams=LTRIM(RTRIM(LEFT(@TagParams,\r\n                                                               LEN(@TagParams)-1)\r\n                                                   ))\r\n                     SELECT   @MyDocument = STUFF(@MyDocument, @EndOfTag, 1, ']'), \r\n                               @BackOfTag=@endOfTag\r\n                     SELECT   @MyDocument = STUFF(@MyDocument, @StartOfTag, 1, '[')\r\n                   END\r\n       --now all we need to do is to record the Element in the table\r\n         INSERT   INTO #dom\r\n                  (nesting, innerHTML, tag, tagparams, \r\n                                           StartOfParentTag, StartOfTag, EndOfTag)\r\n                  SELECT   @nesting, @innerHTML, @tag, @Tagparams,@StartOfParentTag, \r\n                               @StartOfTag,@BackOfTag\r\n         SELECT   @Nesting = 0,@beginning = 1\r\n      END --end of the WHILE loop\r\nUPDATE   #dom --now stitch in the keys to the parent tags and then all is done!\r\n   SET      parent = parentTag.Element_ID\r\n   FROM     #dom INNER JOIN #dom parentTag\r\n         ON #dom.StartOfParentTag = parentTag.StartOfTag \r\n        AND #dom.StartOfParentTag &lt;&gt; #dom.StartOfTag\r\n\r\nSELECT * FROM #dom\r\nIF @error IS NOT NULL--oops, an error. This should help clean up the XHTML\r\n   BEGIN\r\n   RAISERROR(@error,16,1)\r\n   RETURN(1)\r\n   END\r\nGO\r\n--Now, here is a small sample just to show you how to use it. We'll get data\r\n--out of a table (it is just as easy getting them from Anchors &lt;a&gt;  or\r\n--divs.\r\n            \r\n--Using the Parser\r\n            \r\n\r\n            \r\nCREATE TABLE #Ourdom\r\n      (\r\n       Element_ID INT,\r\n       nesting INT,\r\n       tag VARCHAR(20),\r\n       tagparams VARCHAR(7800),\r\n       innerHTML VARCHAR(MAX),\r\n      startOfTag INT,\r\n      EndOfTag INT,\r\n      StartOfParentTag INT,\r\n      parent INT\r\n      )\r\n\r\nINSERT INTO #OurDom\r\n   EXECUTE spParseXHTML '\r\n&lt;table class=\"data\"&gt;\r\n&lt;tr style=\"vertical-align: bottom !important\"&gt;\r\n&lt;th&gt;Currency&lt;\/th&gt;\r\n\r\n&lt;th&gt;Weights based on&lt;br&gt;\r\n1999-2001 trade data &lt;sup&gt;a&lt;\/sup&gt;&lt;\/th&gt;\r\n\r\n&lt;th&gt;Weights based on&lt;br&gt;\r\n1989-1991 trade data &lt;sup&gt;b&lt;\/sup&gt;\r\n&lt;\/th&gt;\r\n&lt;\/tr&gt;\r\n\r\n&lt;tr&gt;\r\n&lt;td&gt;U.S. dollar&lt;\/td&gt;\r\n&lt;td&gt;0.7618 &lt;\/td&gt;\r\n&lt;td&gt;0.5886&lt;\/td&gt;\r\n\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;Euro&lt;\/td&gt;\r\n&lt;td&gt;0.0931&lt;\/td&gt;\r\n&lt;td&gt;0.1943&lt;\/td&gt;\r\n\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;Japanese yen&lt;\/td&gt;\r\n&lt;td&gt;0.0527&lt;\/td&gt;\r\n&lt;td&gt;0.1279&lt;\/td&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;Chinese yuan&lt;\/td&gt;\r\n&lt;td&gt;0.0329&lt;\/td&gt;\r\n&lt;td&gt;&amp;mdash;&lt;\/td&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;Mexican peso&lt;\/td&gt;\r\n&lt;td&gt;0.0324&lt;\/td&gt;\r\n&lt;td&gt;0.0217&lt;\/td&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;U.K. pound&lt;\/td&gt;\r\n&lt;td&gt;0.0271&lt;\/td&gt;\r\n&lt;td&gt;0.0368&lt;\/td&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;South Korean won&lt;\/td&gt;\r\n&lt;td&gt;&amp;mdash;&lt;\/td&gt;\r\n&lt;td&gt;0.0307&lt;\/td&gt;\r\n&lt;\/tr&gt;\r\n&lt;\/table&gt;'\r\n\r\n<\/pre>\n<p>This will give the following table (some TRs are omitted so that the image size is not too great).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/489-domparsing.gif\" alt=\"489-domparsing.gif\" \/><\/p>\n<p> Now that we have our elements all in a table, we can take out just the data we want. Obviously, real data will usually have more information than this, but we didn&#8217;t want the workbench to become too unwieldy.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">\r\nSELECT \r\n  [Currency]=MAX(CASE WHEN col=1 THEN contents ELSE '' END),\r\n  [Weights (1999-2001)]=MAX(CASE WHEN col=2 THEN contents ELSE '' END),\r\n  [Weights (1989-1991)]=MAX(CASE WHEN col=3 THEN contents ELSE '' END)\r\n-- with a bit more trouble, and some Dynamic SQL, we can pick out the\r\n-- THs and dynamically build the column names. But the code would not\r\n-- be so easy to read, so we sat on our hands! (Phil, the old showoff\r\n-- was dying to show how to do it. R.P.) (I may add it in a comment P.F.)\r\nFROM\r\n   (SELECT  parent,[contents]=innerHTML, \r\n       [col]=(\r\n           SELECT COUNT(*) \r\n               FROM #ourdom siblings\r\n               WHERE siblings.parent=dom.parent\r\n               AND siblings.Element_ID&lt;dom.Element_ID\r\n           )+1\r\n   FROM #Ourdom dom\r\n   WHERE parent IN\r\n           (SELECT Element_ID\r\n            FROM   #Ourdom\r\n            WHERE  parent = \r\n               (SELECT TOP 1 Element_ID \r\n               FROM #Ourdom WHERE tag = 'table'\r\n                 ) \r\n           AND tag = 'tr'\r\n           ) \r\n     AND tag = 'td'\r\n    ) f\r\nGROUP BY parent ORDER BY parent\r\nGO\r\nDROP TABLE #OurDom\r\n\r\n\/*\r\nCurrency          Weights (1999-2001)  Weights (1989-1991)\r\n----------------- -------------------- -------------------- \r\nU.S. dollar       0.7618              0.5886\r\nEuro              0.0931               0.1943\r\nJapanese yen      0.0527               0.1279\r\nChinese yuan      0.0329               &amp;mdash;\r\nMexican peso      0.0324               0.0217\r\nU.K. pound        0.0271               0.0368\r\nSouth Korean won  &amp;mdash;              0.0307\r\n\r\n(7 row(s) affected)\r\n<\/pre>\n<p>So there we are. The next stage is to get the pages directly from the website. If your SQL Server has direct access to the internet (behind a NAT and Firewall I hope) you can do this quite easily and we&#8217;ve already shown you how to do that in the past! <\/p>\n<ul>\n<li>\n        <a href=\"http:\/\/www.simple-talk.com\/community\/blogs\/philfactor\/archive\/2006\/01\/20\/158.aspx\">Getting Stuff Into SQL Server<\/a>\n    <\/li>\n<li>\n        <a href=\"http:\/\/www.simple-talk.com\/community\/forums\/thread\/1564.aspx#1581\">Testing Links<\/a>\n    <\/li>\n<li>\n        <a href=\"http:\/\/www.xstandard.com\/en\/documentation\/xhttp\/#overview\">XStandard HTTP Component<\/a>\n    <\/li>\n<\/ul>\n<p>Maybe a future Workbench could give a few illustrations of ways of accessing pages on the internet.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Robyn and Phil start their investigation into XHTML by showing how to use TSQL to parse it to extract data, and demonstrate how to turn an XHTML table into a SQL Server Table!&hellip;<\/p>\n","protected":false},"author":172554,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143531],"tags":[4150,4151,4252,4190,4852,4460],"coauthors":[6812],"class_list":["post-360","post","type-post","status-publish","format-standard","hentry","category-t-sql-programming-sql-server","tag-sql","tag-sql-server","tag-t-sql-programming","tag-tsql","tag-tsql-parse-dom-example-free-code-sql-xhtml-html-table","tag-workbench"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/360","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/172554"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=360"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/360\/revisions"}],"predecessor-version":[{"id":92564,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/360\/revisions\/92564"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=360"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=360"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=360"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=360"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}