{"id":2012,"date":"2015-06-02T00:00:00","date_gmt":"2015-06-02T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/using-r-to-explore-data-by-analysis-for-sql-professionals\/"},"modified":"2021-05-17T18:33:14","modified_gmt":"2021-05-17T18:33:14","slug":"using-r-to-explore-data-by-analysis-for-sql-professionals","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/using-r-to-explore-data-by-analysis-for-sql-professionals\/","title":{"rendered":"Using R to Explore Data by Analysis &#8211; for SQL Professionals"},"content":{"rendered":"<div id=\"pretty\">\n<p><i>Data  analysis is the process by which data becomes understanding, knowledge and insight<\/i><\/p>\n<p class=\"start\">In writing this article, I aim to show how you  can explore data by analysis, using R to investigate data stored in flat files. In doing so, I&#8217;ll compare how some of  these tasks are done in T-SQL and in R, as a way of introducing some of the basics of R for T-SQL developers.<\/p>\n<p>Let&#8217;s assume that we are doing some preliminary  investigation of some fairly structured data, and we want to perform a rudimentary analysis just to see if we can get  value from it. This can of course be done with the tools of the trade in any RDBMS system, however with the R  environment this is significantly more valuable because, even if we don&#8217;t take advantage of its unique ability to do  parametric statistical analysis, it is free, takes little time compared to RDBMS systems and the data modelling part is  fairly easy to do.<\/p>\n<p>Here is what the data exploration process looks  like: <\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-img7.jpg\" alt=\"2204-img7.jpg\" \/><\/p>\n<p>This concept is not new, and it is applied  similarly in the RDBMS world. The main contrast is that in ordinary RDBMS systems, the data gets persisted in several  stages whereas in R the operations are done in memory as soon as the data is read into it. You can, of course, persist  the data at any point with R, though without the resilience of a database. An additional contrast is that there is no  need for a physical model layer in R, whereas the RDBMS heavily relies on it. &#160;<\/p>\n<p>It is essential to tidy the data and have a  manageable convention for it: In an RDBMS, every column refers to an attribute and every row denotes an entity. In R  every column is called <b>a variable<\/b>, and every row is called <b>an observation<\/b>. The purpose of tidying the data is to form properly the  variables and the observations, so it is easier to model, transform and visualize later on. <\/p>\n<p>After the data is tidy, we get into the loop of  modelling, transformation and visualization. This is an iterative task, which has the purpose of modeling and  transforming the data until the best visualization can be produced. There are many ways to do the separate steps of  modelling, transformation and visualization steps: However, in this article I have chosen to use <b>dplyr<\/b> and <b>ggplot2<\/b> packages. I  tried to be flippant by writing that the modelling is done by &#8216;trial and error&#8217; &#8211; there are many packages which perform  modelling and the most suitable choice depends on what our goal is: Are we doing regression? Are we doing predictive  analysis? In this case I will do just a rudimentary exploration analysis, just to see if there is anything even worth  extracting from the data. In the real world, this would be done first, and later on a modelling tool would be used to do  a deeper analysis, depending on the findings of the exploratory analysis. <\/p>\n<p>For this article, we will be using aircraft  performance data as sample data. As a side note: I do not have deep knowledge about aircraft, but my curiosity about the  field started early, after my first flight on a Tupolev-134. <\/p>\n<p>The tidying process will be omitted, since the  data I found on the net is very well structured, it is in a csv format already. You can download the datasets from the  top of this article. <\/p>\n<h1> The tools of the trade<\/h1>\n<p>We will be using the R language environment for  the purpose of data exploration: &#160;RStudio and two packages: dplyr and ggplot2. <b>Dplyr<\/b> is a powerful package for managing and transforming datasets and <b>ggplot2<\/b> is a comprehensive package which is used to visualize the data.<\/p>\n<p>Here are the most important commands in <b>dplyr<\/b> which we will be using in this article:<\/p>\n<ul>\n<li> \t\t<b> \t\tFilter<\/b>  &#8211; filters rows that match certain criteria (in T-SQL this is achieved by the WHERE clause)<\/li>\n<li> \t\t<b> \t\tSelect<\/b>  &#8211; pick only certain columns from the dataset (in T-SQL this is done by selecting only certain columns to be returned)<\/li>\n<li> \t\t<b> \t\tArrange<\/b>  &#8211; reorder rows (in T-SQL this is done with the ORDER BY clause)<\/li>\n<li> \t\t<b> \t\tMutate<\/b>  &#8211; add new variables (columns, computed or not) (In DBMS this is done in variety of ways &#8211; either as computed columns in  the schema layer, or as T-SQL expressions &#160;&#8217;on the fly&#8217; in the SELECT statements)<\/li>\n<li> \t\t<b> \t\tSummarize<\/b>  &#8211; reduce variables to values (in T_SQL this is achieved with different aggregate functions called on the selected  dataset and by performing some GROUP BY, if needed)<\/li>\n<\/ul>\n<p>The basic construction of these commands is  that the first argument is a dataframe, and subsequent arguments say what to do with the dataframe. The output is always  a dataframe. <span class=\"msoIns\">Think of a dataframe as a  structured table in a database or as a spreadsheet in Excel. <\/span><\/p>\n<p>Here are some examples:<\/p>\n<h3> Create a dataset:<\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">data &lt;- data.frame(\nname = c(\"Alan\", \"Joe\", \"Tina\", \"Alan\", \"Suzan\"),\nvalue = 1:5)\n<\/pre>\n<h3> \t \t&#160;Filter:<\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">filter(data, name == \"Alan\")\n<\/pre>\n<p>Note: just like in  T-SQL there are many operators: !=, ==, %in%, etc <\/p>\n<h3>  &#160;Select:   <\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">    select(data, name)\n    select(data, -name)\n<\/pre>\n<h3> \tArrange:<\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">arrange(data, name)\narrange(data, desc(value))\n<\/pre>\n<h3> \t \t&#160;Mutate:<\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">mutate(data, double = 2 * value)\nmutate(data, double = 2 * value,quadruple = 2 * double)\n<\/pre>\n<h3> Summarize:<\/h3>\n<pre class=\"theme:ssms2012 lang:tsql\">summarise(data, total = sum(value))\nby_name &lt;- group_by(data, name)\nsummarise(by_name, total = sum(value))\n<\/pre>\n<h1> \tData pipelines<\/h1>\n<p>The downside of a functional interface is that  it is very hard to read multiple operations. Consider even the above example of the summarize script. It creates a new  object <b>by_name<\/b>, and then it runs the summarize function with it as a  parameter. <\/p>\n<p>With data pipelining it is much easier to  express all this in a single line:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">by_name &lt;- group_by(data, name) %&gt;% summarise(total = sum(value))\n<\/pre>\n<p>Data pipeline means that:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">x %&gt;% f(y) \n<\/pre>\n<p><b> &#8230;  is<\/b> the same <b>as<\/b>  \t&#8230;<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">f(x, y)\n<\/pre>\n<p>This means that  &#160;the left side of the <code>%&gt;% <\/code>expression becomes the first parameter of the function which is on the right side.  <\/p>\n<h4>  &#160;It is time for the real data  exploration <\/h4>\n<h2> Get the data &#8211; load the data in variables<\/h2>\n<p>Now that we have the basic concepts of the  modification of&#160; dataframes, let&#8217;s get into some data exploration.  <\/p>\n<p>I have been curious about airplanes and  aviation in general for many years. I found some airplane specifications on the net containing technical data about the  airplanes and their performance and would like to use the datasets to learn a bit more about the aircraft. Further, I  would like to dig into the data and see if there are any interesting potential findings.  &#160;<\/p>\n<p>I have the aircraft specification data in  several files:<\/p>\n<ul>\n<li> \t\t<a href=\"https:\/\/www.simple-talk.com\/content\/file.ashx?file=11227\">Aircraft_EnginesSpecs.csv<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/content\/file.ashx?file=11228\">Aircraft_PassengerFeaturesSpecs.csv<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/content\/file.ashx?file=11229\">Aircraft_SpeedAndAltitudeSpecs.csv<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/content\/file.ashx?file=11230\">Aircraft_WeightAndFuelSpecs.csv<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/content\/file.ashx?file=11231\">AircraftPerformance.csv<\/a><\/li>\n<\/ul>\n<p>In the case of a RDBMS, we would have to import  the data via one of the available tools &#8211; for SQL Server this might be SSIS, or T-SQL&#8217;s BULK IMPORT command. In either  way, we would have to persist the data to disk in database tables with specific data types. <\/p>\n<p>In R we just import the data into memory.&#160; <\/p>\n<p>Let&#8217;s load the data in R by using the following  commands (before we can run these commands, make sure the files are in the work directory of the R environment):<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">df.engineSpecs &lt;- read.csv(\"Aircraft_EnginesSpecs.csv\",header = TRUE, sep = \";\")\ndf.PassengerFeaturesSpecs &lt;- read.csv(\"Aircraft_PassengerFeaturesSpecs.csv\",header = TRUE, sep = \";\")\ndf.SpeedAndAltitudeSpecs &lt;- read.csv(\"Aircraft_SpeedAndAltitudeSpecs.csv\",header = TRUE, sep = \";\")\ndf.WeightAndFuelSpecs &lt;- read.csv(\"Aircraft_WeightAndFuelSpecs.csv\",header = TRUE, sep = \";\")\n<\/pre>\n<p>And after we import the data, let&#8217;s explore it:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">head(df.engineSpecs)\nhead(df.PassengerFeaturesSpecs)\nhead(df.SpeedAndAltitudeSpecs)\nhead(df.WeightAndFuelSpecs)\n<\/pre>\n<p>The head() function returns the first few rows  of the dataset:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&gt; head(df.engineSpecs)\n<\/pre>\n<p><\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&#160; Aircraft.Model &#160;&#160;&#160;&#160;&#160;&#160; Engine.Manufacturer&#160; Model...Type &#160;&#160;&#160; No..of.engines &#160; Static.thrust..kN.\n1 AIRBUSA300-600R [1974]&#160;&#160;&#160;&#160;&#160;P&amp;W&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;  4158&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 257\n2&#160; AIRBUSA310-300 [1983]&#160;&#160;&#160;&#160; P&amp;W&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;  4152&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;  &#160;2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;231\n3&#160; AIRBUSA319-100 [1995]&#160;&#160;&#160;&#160;&#160;CFMI&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;  CFM56-5A4&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160; 2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;99,7\n4&#160; AIRBUSA320-200 [1988]&#160;&#160;&#160;&#160; CFMI&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160; CFM56-5A3&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160; 2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;111,2\n5&#160; AIRBUSA321-200 [1993]&#160;&#160;&#160;&#160; CFMI&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;CFM56-5B3&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160; &#160;&#160;2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 142\n6&#160; AIRBUSA330-200 [1998]&#160;&#160;&#160;&#160;&#160;GE&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;  CF6-80E1A4&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160; 2&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 310\n<\/pre>\n<p>In T-SQL this will be done by a <code>SELECT TOP(N)  FROM<\/code>&#8230; statement. <\/p>\n<h2>  &#160;Examine the data<\/h2>\n<p>The next very important task is to check the  data for duplicates. If we have duplicates, this will create problems later on when we join the datasets.  <\/p>\n<p>In R we can check for duplicates easily with a  couple piped functions:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">df.engineSpecs&#160;&#160;%&gt;% group_by(Aircraft.Model) %&gt;% summarise(count=n()) %&gt;% filter(count!=1)\ndf.PassengerFeaturesSpecs&#160;&#160;%&gt;% group_by(Aircraft.Model) %&gt;% summarise(count=n()) %&gt;% filter(count!=1)\ndf.SpeedAndAltitudeSpecs %&gt;% group_by(Aircraft.Model) %&gt;% summarise(count=n()) %&gt;% filter(count!=1)\ndf.WeightAndFuelSpecs&#160;&#160;%&gt;% group_by(Aircraft.Model) %&gt;% summarise(count=n()) %&gt;% filter(count!=1)\n<\/pre>\n<p>In T-SQL checking for duplicates can be done  with a CTE by using the ranking partitioning function <\/p>\n<p> <code> ... row_number() over(partition col1... order by col1) as [rankNumber] ...<\/code><\/p>\n<p>The CTE will return the [rankNumber] column  with value 1 for each unique value in col1 and will increase the count for each subsequent duplicate. To return the  duplicate rows we would use a statement similar to SELECT * from CTE where [rankNumber] &gt; 1<\/p>\n<p>In this case there are duplicates in the  engineSpecs file, and the result in R looks like this:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&gt; df.engineSpecs&#160;&#160;%&gt;% group_by(Aircraft.Model) %&gt;% summarise(count=n()) %&gt;% filter(count!=1)Source: local data frame [2 x 2]\n<\/pre>\n<p><\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&#160;&#160;Aircraft.Model &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; count\n1 AIRBUSA319-100 [1995]&#160;&#160;   &#160;&#160;   2\n2&#160;&#160;TUPOLEVTu-334 [2000]&#160;&#160;   &#160;&#160;&#160;2\n<\/pre>\n<p>To remove the duplicates, in T-SQL we would use  the CTE mentioned above and a <code>DELETE<\/code> statement like <code><br \/><\/code> \t&#160;&#160;<code>&#160; <\/code> <code>DELETE from CTE where [rankNumber] &gt; 1<\/code><\/p>\n<p>In R we can remove the duplicate records by  using the following code:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">df.engineSpecs&#160; &lt;- distinct(df.engineSpecs ,Aircraft.Model)\ndf.PassengerFeaturesSpecs &lt;- distinct(df.PassengerFeaturesSpecs ,Aircraft.Model)\ndf.SpeedAndAltitudeSpecs&#160; &lt;- distinct(df.SpeedAndAltitudeSpecs ,Aircraft.Model)\ndf.WeightAndFuelSpecs&#160;&#160; &lt;- distinct(df.WeightAndFuelSpecs ,Aircraft.Model)\n<\/pre>\n<h2> Join datasets into a dataframe<\/h2>\n<p>As a next step we will join the datasets  together in one dataset. Joins in R are very similar to the concept of joins in T-SQL and I will not go in great detail  about them. The important part here is to make sure that we check for the row count after each join and make sure the  count does not increase significantly. <\/p>\n<p>The code we will use for the join in R is:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">df.aircraft_specs &lt;- left_join(df.engineSpecs, df.PassengerFeaturesSpecs, by=\"Aircraft.Model\")&#160;&#160;\ndf.aircraft_specs %&gt;% summarise(count=n())\ndf.aircraft_specs &lt;- left_join(df.aircraft_specs, df.SpeedAndAltitudeSpecs, by=\"Aircraft.Model\")\ndf.aircraft_specs %&gt;% summarise(count=n())\ndf.aircraft_specs &lt;- left_join(df.aircraft_specs, df.WeightAndFuelSpecs, by=\"Aircraft.Model\") \ndf.aircraft_specs %&gt;% summarise(count=n())\n<\/pre>\n<p>You might have noticed that I am using the &#160;<code>df.aircraft_specs %&gt;% summarise(count=n())<\/code> call to count the rows after each  join. <\/p>\n<h2> Add new variables<\/h2>\n<p>After we have done the join, we have a master  dataset, which looks like this:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"189\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image004.jpg\" width=\"604\" alt=\"2204-clip_image004.jpg\" \/><\/p>\n<p>The dataset has 72 observations of 41  variables. This, translated to a T-SQL way of thinking, means that we have a table with 41 columns which contain values  for 72 rows. <\/p>\n<p>As we can see in the dataset, year that the  aircraft was released held in the Aircraft.Model variable. The pattern is that the year is separated by [ and ]. It  seems that &#160;this would be a valuable part of the data analysis, and it would be  great if we could somehow extract this data into a separate variable. <\/p>\n<p>In T-SQL we would use a substring() function on  the Aircraft.Model column and load it with proper parameters for start and length of the substring.  <\/p>\n<p>In R we can use the <b>mutate()<\/b> function to create the extra variable and together with the <b>sub()<\/b> function and a regular expression to extract the release year.  <\/p>\n<pre>df.aircraft_specs &lt;- mutate(df.aircraft_specs, year=sub(\".*\\\\[([0-9]{4})\\\\]\",\"\\\\1\",Aircraft.Model))<\/pre>\n<p>The next idea comes to mind: it would be great  to extract variable which contains the decade when&#160;&#160; each aircraft was  released. <\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">df.aircraft_specs &lt;- mutate(df.aircraft_specs, \n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;decade = as.factor(\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='193','1930s',\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='194','1940s',\n         &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='195','1950s',\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='196','1960s',\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='197','1970s',\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='198','1980s',\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='199','1990s',\n &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='200','2000s',\n\t&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;ifelse(substring(df.aircraft_specs$year,1,3)=='201','2010s',\"ERROR\"\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; )))))))))\n&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;)\n    &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; )\n<\/pre>\n<p>In T-SQL this can be done with a CASE function  in a very similar manner as the code above. <\/p>\n<p>Let&#8217;s see what distinct decades we have and how  many aircraft per decade:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&gt; select(df.aircraft_specs,decade)&#160;&#160;%&gt;% group_by(decade) %&gt;% summarise(count=n())\nSource: local data frame [6 x 2]\n<\/pre>\n<p><\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&#160; decade count\n\t\t1&#160;&#160;1950s&#160;&#160;   &#160; 1\n2&#160;&#160;1960s&#160;&#160;   &#160;10\n3&#160;&#160;1970s&#160;&#160;   &#160; 9\n4&#160;&#160;1980s&#160;&#160;   &#160;20\n5&#160;&#160;1990s&#160;&#160;   &#160;24\n6&#160;&#160;2000s&#160;&#160;   &#160; 8\n<\/pre>\n<p>Now I am curious: which plane from the 50s do I  have in the dataset? In T-SQL this will be done with a simple SELECT &#8230; WHERE statement, and in R this can be done with  the piped select %&gt;% filter command:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">select(df.aircraft_specs,decade,Aircraft.Model) %&gt;% filter(decade==\"1950s\")\n<\/pre>\n<p>And the result is this:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">&#160;&#160; decade&#160;&#160;&#160;&#160; Aircraft.Model\n1&#160; 1950s DOUG.DC8-63 [1959]\n<\/pre>\n<h2> Explore the data<\/h2>\n<p>So far we have imported the data into memory,  we have deduplicated the data, we have sorted out a few extra variables from the data (built year and decade) and now it  is time to jump right in and see if we can get some value out of it. So far we have used the <b>dplyr <\/b>package (which covers most T-SQL operations) and now we will dive into  the <b>ggplot2 <\/b>package, which corresponds to the SSRS component of SQL Server  (or even Excel, if the data fits within the limitations of it). <\/p>\n<p>We are looking for interesting features in the  dataset: things that stand out, trends and relationships between variables. Of course, as it usually happens, the  findings will certainly bring up more questions, and this is why the process of exploration is one of iterative cycles &#160;between &#160;modelling, transformation and  visualization. <\/p>\n<p>Let&#8217;s look at the thrust (the propulsive force  of an engine) vs aircraft weight ratio first. We want to get an idea of what the distribution looks like. We can do this  in a simple plot graph (in T-SQL we would write a simple SELECT statement which returns the columns we need and then  feed the data to Excel or SSRS). In R this can be done with the <b>ggplot2<\/b>  package and the following code:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\"># static thrust vs weight\n&#160;\nfilter(df.aircraft_specs,Max..take.off.weight.kg!=0,Static.thrust..kN.!=0) %&gt;%\n&#160; ggplot( aes(x=Max..take.off.weight.kg\/1000, y=as.numeric(as.character(Static.thrust..kN.)))) +\n&#160;&#160;geom_point(alpha=.4, size=4, color=\"#112277\") +\n&#160;&#160;ggtitle(\"Thrust vs. weight\") +\n&#160;&#160;labs(x=\"Max Takeoff weight kg*1000\", y=\"Static Thrust\")\n<\/pre>\n<p>This returns the following graph:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"323\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image006.jpg\" width=\"605\" alt=\"2204-clip_image006.jpg\" \/><\/p>\n<p>Ok, on first glance it looks good, but on a  second thought there seems to be something strange: How come there are a couple relatively heavy planes that need way  less thrust, compared to some relatively lighter planes? Another question comes to mind: is this true or is there a  problem with the data? If <span class=\"msoDel\">it is<\/span><span class=\"msoIns\">the<\/span>  data really is true, that would mean that some planes have incredibly better design than others. If so, why would anyone  want a plane which has significantly poorer design; after all, the ideal case would be to lift off the heaviest plane  with the least amount of thrust.<\/p>\n<p>Then the solution dawns: After a closer look at  the data, we establish that the <b>Static.thrust..kN.<\/b> column indicates the  thrust of a single engine of the aircraft, and not the total thrust. This means that in order to get correct results, we  would need to multiply the thrust by the number of engines. <\/p>\n<pre class=\"theme:ssms2012 lang:tsql\"># static thrust vs weight and number of engines\n&#160;\nfilter(df.aircraft_specs,Max..take.off.weight.kg!=0,Static.thrust..kN.!=0,No..of.engines!=0) %&gt;%\n&#160;&#160;ggplot( aes(x=Max..take.off.weight.kg\/1000, y=Static.thrust..kN.*No..of.engines)) +\n&#160;&#160;geom_point(alpha=.4, size=4, color=\"#112277\") +\n&#160;&#160;ggtitle(\"Thrust vs. weight and number of engines\") +\n&#160;&#160;labs(x=\"Max Takeoff weight kg*1000\", y=\"Static Thrust\")\n<\/pre>\n<p>After multiplying the thrust value by the  number of engines (hopefully all engines are of the same kind and with the same specification!), we get the following  result:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"323\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image008.jpg\" width=\"605\" alt=\"2204-clip_image008.jpg\" \/><\/p>\n<p>So, in this case it seems a bit more  believable: the weight and thrust have an almost linear dependency. Even so, by looking at the graph we can see  something quite interesting: the heaviest plane requires only little bit more thrust than the one that is next heaviest.  Perhaps, we have seen the effect of some innovative design which might be worth researching. <\/p>\n<p>We learned from this exercise that we should  not trust the data blindly. When we plot a graph, we should look at it, and see if it seems reasonable enough for us to  make decisions based on it. <\/p>\n<p>Now let&#8217;s look at the top thrust vs speed:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\"># static thrust vs speed\n&#160;\nfilter(df.aircraft_specs,Max.Speed..kt.!=0,Static.thrust..kN.!=0) %&gt;%\n&#160;&#160;ggplot( aes(x=Static.thrust..kN.*No..of.engines, y=Max.Speed..kt.)) +\n&#160;&#160;geom_point(alpha=.4, size=4, color=\"#112277\") +\n&#160;&#160;ggtitle(\"Thrust vs. speed\") +\n&#160;&#160;labs(x=\"Static Thrust kN\", y=\"Max speed kt\")\n<\/pre>\n<p>The graph we get looks like this:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"323\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image010.jpg\" width=\"605\" alt=\"2204-clip_image010.jpg\" \/><\/p>\n<p>From here we can see that this data is  scattered and it is great candidate for a further segmentation. We get questions like: How has the thrust developed over  the decades and how does the thrust vs speed has developed over the decades? <\/p>\n<p>We will look at the thrust development per  decade:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\"># thrust over decades\nfilter(df.aircraft_specs,Static.thrust..kN.!=0,No..of.engines!=0) %&gt;%\nggplot(aes(x=Static.thrust..kN.*No..of.engines)) +\n&#160;&#160;geom_histogram(binwidth = 30,fill=\"#112277\") +\n&#160;&#160;ggtitle(\"Total thrust by decade\") +\n&#160;&#160;labs(x=\"Thrust, kN\", y=\"Count of Records\") +\n&#160;&#160;facet_wrap(~decade)\n<\/pre>\n<p>This time we will have a histogram, which looks  like this:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"414\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image012.jpg\" width=\"604\" alt=\"2204-clip_image012.jpg\" \/><\/p>\n<p>This histogram shows how many planes have what  amount of thrust per decade of production. In the 90s, there were plenty with thrust about 600, and most records we have  for the 2000&#8217;s have more than 750. This is not a big surprise, since plenty of planes produced in the 70s, 80s and 90s  were still in operation in the 2000s, and the gap on the market was for bigger planes with more thrust. <\/p>\n<p>Let&#8217;s explore how the speed and thrust  developed over the decades:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\"># static thrust vs speed per decade \n&#160;\nfilter(df.aircraft_specs,Max.Speed..kt.!=0,Static.thrust..kN.!=0) %&gt;%\nggplot(aes(x=Max.Speed..kt., y=Static.thrust..kN.*No..of.engines)) +\n&#160;&#160;geom_point(binwidth = 100,alpha=.9,color=\"#112277\") +\n&#160;&#160;facet_wrap(~decade) +\n&#160;&#160;ggtitle(\"Speed vs thrust by decade\") +\n&#160;&#160;labs(x=\"Speed, kt\", y=\"Thrust, kN\")\n<\/pre>\n<p>The histogram looks like this:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"414\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image014.jpg\" width=\"604\" alt=\"2204-clip_image014.jpg\" \/><\/p>\n<p>We can notice that there is one plane which has  a top speed significantly greater than the rest, and we would like to remove it from the graphs, so we can see the  distributions better. All we have to do is add another condition to the filter like this:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">filter(df.aircraft_specs,Max.Speed..kt.!=0,Static.thrust..kN.!=0,Max.Speed..kt. &lt; 700) %&gt;%\n&#160;&#160;ggplot(aes(x=Max.Speed..kt., y=Static.thrust..kN.*No..of.engines)) +\n&#160;&#160;geom_point(binwidth = 100,alpha=.9,color=\"#112277\") +\n&#160;&#160;facet_wrap(~decade) +\n&#160;&#160;ggtitle(\"Speed vs thrust by decade\") +\n&#160;&#160;labs(x=\"Speed, kt\", y=\"Thrust, kN\")\n<\/pre>\n<p>Then we get the following histogram:<\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"414\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2204-clip_image016.jpg\" width=\"604\" alt=\"2204-clip_image016.jpg\" \/><\/p>\n<p>And so, in this manner, the data exploration  can continue forever. <\/p>\n<h1>  &#160;Conclusion: <\/h1>\n<p>In this article we have traveled a long way:  from relational database design and data management, to in-memory data exploration. The article&#8217;s goal was to introduce  readers familiar with T-SQL and reporting tools to the possibilities of data exploration with R. Both methods of data  exploration have their pros and cons, and as long as the data explorer is paying attention to detail, a great value can  be potentially extracted from the data. <\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>What&#8217;s the best way for a SQL programmer to learn about R? It&#8217;s probably by trying out various experiments with an interesting set of data, and some helpful suggestions that point out the parallels with SQL. Feodor provides the data and the helpful suggestions. The rest is up to you.&hellip;<\/p>\n","protected":false},"author":221902,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528],"tags":[4379,4150],"coauthors":[],"class_list":["post-2012","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","tag-reporting-services","tag-sql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2012","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221902"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=2012"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2012\/revisions"}],"predecessor-version":[{"id":91009,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2012\/revisions\/91009"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=2012"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=2012"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=2012"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=2012"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}