Adding Public Domain Data to your Software Demos with R

With all the extensions that are available, it is astonishing how little R code is required to achieve impressive BI graphics. You can even use R to take the latest data from a table in a website and apply it to a map in just a few lines of code: and here they are, with Sergei's explanation..

Sometimes I want to add public domain data, e.g. from the Bureau of Statistics, to my business intelligence (BI) software demonstrations to support some ideas, and I always try to side-step this boring, mundane task. It was a way of life for quite a while until I found how R can make this task real fun.

Let’s assume that our BI application contains a Year-to-Date Sales by Australian (AU) State Report. The application gives us an option to visualise this information directly on the AU map, shows it as a bar graph, or as a table.

It is very likely that this information strongly correlates with the population by state, but we do not have the population by state data in our database. The answer is to create a live link to a public domain site that contains the required information and that allows this type of access, bring it directly into the application (this operation is often called web scraping), transform it, and use it for the sales analysis and potentially in other parts of the application. You have to take some precautions if you plan to use it in production versions of your applications as the data source page might change in the future.

Web Scraping a Data Source

The web site I am going to fetch the data from is a Wikipedia page Ranked list of states and territories of Australia. The first HTML table on this page is shown below.

1904-clip_image001.png

Figure 1. The web scraping data source.

Our next tasks are to

  •  Create a live link to this table from our application
  •  Extract 2nd and 3rd columns of the table
  •  Take only rows that correspond to the main Australian states (rows 1 to 8)
  •  Change names of the extracted columns to State and Population (for brevity) respectively
  •  Display state population on the map, bar graph, or as a table, e.g. as it is shown below

1904-clip_image002.png

Figure 2. Distribution of the Australian population by state (tooltip shows the NT population).

Here is a just 14 line software fairy that does the magic in a fraction of a second (on my PC over a good Internet connection) …

Web Scraping R Code Snippet

Listing 1. Web Scraping and plotting R snippet.

How does it work? The lines 1 to 4 install, if it is required, and load two libraries, i.e. XML and googleVis. The first one liberates us from that boring, mundane task whilst the second library elegantly plots the extracted data.

The line 6 extracts the entire web page that contains the required data and saves it in the variable doc. The next line searches within the saved page, selects only the HTML tables and saves this collection in the ndx variable. It is assumed that the HTML tables are well-formed.

After that the line 8 gets the first table out of the ndx collection, the one we need, and saves it in the tbl data frame variable. The following two lines do the above mentioned transformation of removing redundant columns and changing names of the remaining ones.

Although values in the second column look like numbers, they are string.  The line 11 replaces commas and dashes with an empty string and converts strings to integer numbers. Finally, the line 12 filters out unneeded rows. Here is the tbl data frame ready to be visualised.

The visualisation is done by the last two lines in the Listing 1. The first statement …

… creates an HTML fragment that contains the tbl data frame converted  -into a JSON object (then into a Google Fusion table) and googleVis Flash component that is  rendered by the last line, plot(AUP).

There is no need to provide the state longitudes and latitudes. The googleVis works well with any address type strings such as 10 George Street, Sydney, NSW, Australia or shorter versions, e.g. Tasmania (addresses / geo names should comply with the ISO standards; see googleVis documentation for more details).

As I have mentioned, the tbl data frame can be also presented as a bar graph and/or a table. Here is an updated version of the code in Listing 1.

Listing 2: updated R code – the software fairy.

It generates both the above map (html fragment) and a corresponding bar graph (an image) that were integrated into an ASP.NET MVC 5 application. A screenshot of its home page is displayed below. Technique of integrating code fragments and images produced by R procedures has been explained in great details in my recent articles that can be found here.

1904-clip_image003.png