Getting HTML Data: Workbench

Robyn and Phil start their investigation into XHTML by showing how to use TSQL to parse it and extract data, and demonstrate how to turn an XHTML table into a SQL Server table!

There is a lot of data out there on the internet, and there are times when you want to get at it. You may, for example, want the latest exchange rates, or maybe commodity prices. The problem is getting it into a form that can be easily consumed and assimilated by SQL Server. We’ll show you how you might get an HTML table into a SQL Server table.

There often comes a time when one wants to parse an XHTML or HTML document into its DOM in order to get data from it. Without trying too hard, we can think of several occasions when we’ve wanted to do this:

  • to get data from a table on an HTML page on the internet (with CURL.EXE it is so easy!)
  • to make sure that an XHTML snippet is valid for a blog comment (e.g. has it got a TH or TD without an enclosing TR or TABLE?)
  • to check if a snippet has got SCRIPT in it
  • to convert HTML code into IFcodes or BBcodes.
  • to find all the anchors on an HTML page
  • to index existing HTML pages for site searches.

Occasionally, you can simply assign an XHTML snippet to an XML variable in SQL Server. We’ve never had much luck with this approach, since perfectly valid XHTML can trigger an error in SQL Server’s XML parser.
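Here is a minimal illustration of the sort of thing that trips it up. The first assignment succeeds because the fragment happens to be well-formed XML; the second fails because SQL Server’s XML parser only recognises the five built-in XML entities, whereas XHTML allows the full set of named HTML entities such as &nbsp;.

    -- This cast works: the fragment is well-formed XML as well as XHTML
    DECLARE @GoodSnippet XML
    SELECT  @GoodSnippet = '<p>Exchange rates for <b>Tuesday</b></p>'

    -- This one raises an XML parsing error about an undeclared entity,
    -- even though &nbsp; is perfectly legal XHTML
    DECLARE @BadSnippet XML
    SELECT  @BadSnippet = '<p>Exchange&nbsp;rates</p>'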

There are several sensible approaches to investigating the Document Object Model (DOM) of an HTML document. These will generally involve scripting, using one of the existing DOM parsers. Parsing it using a stored procedure or function isn’t generally included in the list of sensible options.

You can, of course, use the Microsoft DOMDocument object, Microsoft.mshtml, the Mozilla DOMParser, or the XmlDocument class in .NET. In PHP 5 you would use the DOM extension. There are several other ways.

We’ll use TSQL.

Why do this? Because it gives us fine control over the results. We can get a table containing the DOM, in all its hierarchical glory, with each row representing a single element, so that we can use SQL queries to extract whatever data we are interested in.

One big problem with HTML is that browsers are deliberately tolerant of bad markup. This makes a robust way of analysing the structure of an HTML document extraordinarily elaborate. It is possible, but the code is too long for a Workbench. Instead, we will provide the means to analyse the structure of an XHTML document or fragment. The difference is that an XHTML document is much more likely to be well-formed, and you can simply reject anything that isn’t.

We’ve chosen to show you this as a stored procedure because this code really is the stuff of workbenches. We’d like you to improve upon it and see if you can bend it to other purposes. Using a stored procedure rather than a function means that you can add debug code more easily whilst experimenting.

Don’t get us wrong: the code works well enough for the sort of jobs we have, which involve XHTML without errors (HTMLTidy is a wonderful app). We’d just be very pleased if you get interested in what we’ve done, and experiment with the ideas!

First off, we need a table with all the valid markup.
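By way of a sketch (the table and column names here are purely illustrative, and the tag list is abbreviated), something along these lines is all that is needed:

    -- A hypothetical lookup table of legal XHTML tags.
    -- The tag list is abbreviated; a real one would hold the full XHTML element set.
    CREATE TABLE ValidXHTMLMarkup
      (
      TheTag VARCHAR(20) NOT NULL PRIMARY KEY,
      IsEmptyTag BIT NOT NULL DEFAULT 0 -- tags such as <br /> or <img /> never enclose content
      )

    INSERT INTO ValidXHTMLMarkup (TheTag, IsEmptyTag)
      SELECT 'html', 0 UNION ALL SELECT 'head', 0 UNION ALL SELECT 'body', 0
      UNION ALL SELECT 'table', 0 UNION ALL SELECT 'tr', 0 UNION ALL SELECT 'th', 0
      UNION ALL SELECT 'td', 0 UNION ALL SELECT 'p', 0 UNION ALL SELECT 'a', 0
      UNION ALL SELECT 'br', 1 UNION ALL SELECT 'img', 1 UNION ALL SELECT 'hr', 1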

The stored procedure spParseXHTML takes an XHTML fragment or document and returns a result that gives, for each tag: the tag itself, its nesting level, the order in which it was retrieved, its parameters, its innerHTML, the index into the document where the tag starts, the index where it ends, the index where its parent tag starts and, finally, the key of the parent tag. This should be sufficient to do a fair amount of data analysis.
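Calling it is simple enough. Here we feed it a small fragment; the parameter name is ours for the purposes of illustration, and the real procedure may well declare it differently:

    -- A hypothetical call: pass in an XHTML fragment, get back one row per element.
    -- We assume a single VARCHAR(MAX) input parameter here.
    EXECUTE spParseXHTML
      @TheXHTML = '<table>
                     <tr><th>Currency</th><th>Rate</th></tr>
                     <tr><td>EUR</td><td>1.27</td></tr>
                     <tr><td>USD</td><td>1.96</td></tr>
                   </table>'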

The stored procedure is iterative. It should be recursive, but SQL Server isn’t naturally recursive, so an iterative solution was used instead. It isn’t fast. A large XHTML page took, we found, around ten seconds to analyse – but then there was all that lovely data!

This will give the following table (some TRs are omitted so that the image size is not too great).

[Image 489-domparsing.gif: the result set returned by spParseXHTML, one row per parsed element]

Now that we have our elements all in a table, we can take out just the data we want. Obviously, real data will usually have more information than this, but we didn’t want the workbench to become too unwieldy.
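For instance, if the output of spParseXHTML has been put into a temporary table, a self-join pulls out the contents of every table cell along with the row it belongs to. The table and column names below are our own illustration of the columns described above:

    -- Assume #TheDOM holds one row per parsed element, with (among others)
    -- TheKey, ParentKey, TheTag, Sequence and InnerHTML columns (names are illustrative).
    SELECT ParentRow.Sequence AS TheRow,
           ChildCell.Sequence AS TheCell,
           ChildCell.InnerHTML
      FROM #TheDOM AS ChildCell
        INNER JOIN #TheDOM AS ParentRow
          ON ChildCell.ParentKey = ParentRow.TheKey
     WHERE ChildCell.TheTag = 'td'
       AND ParentRow.TheTag = 'tr'
     ORDER BY ParentRow.Sequence, ChildCell.Sequence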

So there we are. The next stage is to get the pages directly from the website. If your SQL Server has direct access to the internet (behind a NAT and firewall, I hope), you can do this quite easily, and we’ve already shown you how to do that in the past!
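In case you haven’t got that to hand, here is a bare-bones sketch using the OLE Automation stored procedures and MSXML2.ServerXMLHTTP. OLE Automation must be enabled on the server, the URL is purely illustrative, and all error checking has been left out:

    -- Fetch the text of a web page into a variable (sketch only; no error handling).
    DECLARE @obj INT, @hr INT, @url VARCHAR(200), @PageText VARCHAR(MAX)
    SELECT  @url = 'http://www.example.com/rates.html' -- an illustrative URL

    EXECUTE @hr = sp_OACreate 'MSXML2.ServerXMLHTTP', @obj OUT
    EXECUTE @hr = sp_OAMethod @obj, 'open', NULL, 'GET', @url, 'false'
    EXECUTE @hr = sp_OAMethod @obj, 'send'

    -- INSERT...EXECUTE avoids truncating a long responseText
    CREATE TABLE #Response (ResponseText VARCHAR(MAX))
    INSERT INTO #Response EXECUTE sp_OAGetProperty @obj, 'responseText'
    SELECT @PageText = ResponseText FROM #Response

    EXECUTE @hr = sp_OADestroy @obj

From there, the text can be tidied into well-formed XHTML and fed straight to spParseXHTML.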

Maybe a future Workbench could give a few illustrations of ways of accessing pages on the internet.