{"id":2144,"date":"2016-01-13T00:00:00","date_gmt":"2016-01-13T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/data-manipulation-in-r-beyond-sql\/"},"modified":"2021-05-03T14:02:02","modified_gmt":"2021-05-03T14:02:02","slug":"data-manipulation-in-r-beyond-sql","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/development\/dotnet-development\/data-manipulation-in-r-beyond-sql\/","title":{"rendered":"Data Manipulation in R:  Beyond SQL"},"content":{"rendered":"<div id=\"pretty\">\n<p>    R is an extremely versatile statistical analysis platform.&#160; It is designed to perform complex calculations using only a few  interactive commands or a short script.&#160; Many R users initially use the language to create a statistical summary  or chart using a tiny subset of the available functionality.&#160;&#160; The <a href=\"https:\/\/www.simple-talk.com\/dotnet\/software-tools\/r-basics\/\"> R-Basics<\/a> and <a href=\"https:\/\/www.simple-talk.com\/dotnet\/software-tools\/visualizing-data-with-r\/\"> Visualizing Data with R<\/a> articles introduce these topics, but don&#8217;t go into  much detail about how to manipulate datasets within R.&#160;  <\/p>\n<p>&#160;A subsequent article describes the use of <a href=\"https:\/\/www.simple-talk.com\/dotnet\/software-tools\/sql-and-r-\/\">SQL and  R<\/a> for interacting with databases and manipulating data frames.&#160; The R community has created a number of SQL-related packages available  from <a href=\"https:\/\/cran.r-project.org\/web\/packages\/available_packages_by_name.html\"> CRAN<\/a> (the Comprehensive R Archive Network). Many of these allow R to query  and process data in relational databases, but others (like <a href=\"https:\/\/cran.r-project.org\/web\/packages\/sqldf\/index.html\">sqldf<\/a>)  can be used with data that does not explicitly originate in, or target, a  relational database.&#160; The  availability of SQL within R is incredibly valuable for many data problems,  particularly if you &#8220;think&#8221; in SQL.&#160;&#160; A dataset can be conveniently sorted, filtered, manipulated, re-ordered  and analyzed.&#160; These basic operations  are intrinsic to SQL and applicable to any dataset that conforms to the &#8220;shape&#8221;  of a standard database results set.<\/p>\n<p>SQL is a well-understood standard language that is familiar  to many developers and analysts.&#160; It  is used in most modern relational databases and has influenced similar query  languages on non-relational platforms.&#160;&#160; R syntax can be confusing when initially encountered, so if you are  comfortable using SQL, you can use it to jump-start your development on R.<\/p>\n<p>&#160;Although SQL is a popular way to process data frames, there  are situations where a different approach is warranted.&#160; Certain data transformations are notoriously difficult to express using  SQL.&#160; &#160;Such transformations require  complex, obscure syntax or multiple statements executed in sequence.&#160; The core R programming language and a number of publicly available  packages can be leveraged to easily accomplish such tasks.<\/p>\n<h2>Complex SQL Not Required<\/h2>\n<p>Operations that are conceptually simple can be difficult to  perform using SQL.&#160; Consider the  common requirements to pivot or transpose a dataset.&#160;  &#160;Each of these actions are conceptually straightforward but are complex  to implement using SQL.&#160; The examples  that follow are somewhat verbose, but the details are not significant. The main  point is to illustrate is that, by using specialized functions outside of SQL, &#160;R makes trivial some of those operations  that would otherwise require complex SQL statements.&#160; The contrast in the amount of code required is striking.&#160; The simpler approach allows you to focus attention on the scientific or  business problem at hand, rather than expending energy reading documentation or  laboriously testing complex statements.<\/p>\n<h3>Example 1: The Pivot<\/h3>\n<p>Pivoting a table (also known as cross tabulation) involves  transforming certain data elements into columns.&#160; This task will introduce the <span class=\"monospaced\">reshape2<\/span> package.&#160; If you are following along and have not installed the package, you will  have to do so using the RStudio GUI or at the command prompt by running.<\/p>\n<pre>install.packages('reshape2')<\/pre>\n<p>Once installed, the library needs to be included in the  current R session.<\/p>\n<pre>library(reshape2) &#160;&#160;&#160; <\/pre>\n<p>Note that an <a href=\"https:\/\/stat.ethz.ch\/pipermail\/r-packages\/2010\/001169.html\">earlier  less optimized package named &#8220;reshape&#8221;<\/a> provides similar functionality and a  corresponding API.&#160; It&#8217;s will not be  used in the examples that follow.<\/p>\n<p>Many relational databases have<span class=\"monospaced\"> PIVOT<\/span> and  <span class=\"monospaced\">UNPIVOT <\/span>functions  to create a crosstab report. &#160;Examples  using SQLServer are presented in other <a href=\"https:\/\/www.simple-talk.com\/sql\/t-sql-programming\/questions-about-pivoting-data-in-sql-server-you-were-too-shy-to-ask\/\"> posts on Simple-Talk<\/a>.&#160; This  syntax is consistent with other RDBMS implementations.&#160; Consider a results set returned by a query on Oracle&#8217;s HR demonstration  schema.<\/p>\n<pre>select job_id, department_name, sum(salary) salary \n&#160;&#160;&#160;&#160; from&#160; DEPARTMENTS d \n&#160;&#160;&#160;&#160; join EMPLOYEES e on e.DEPARTMENT_ID = d.DEPARTMENT_ID\n&#160;&#160;&#160;&#160; where job_id in ('ST_MAN','AD_ASST','PU_MAN','SH_CLERK')\n&#160;&#160;&#160;&#160; group by job_id, department_name;\n<\/pre>\n<p>The results represent the total salary for a given job  title within a department.&#160;  <b> <\/b><\/p>\n<pre>JOB_ID&#160;&#160;&#160;&#160;&#160; &#160;&#160;DEPARTMENT_NAME SALARY\nAD_ASST&#160;&#160;&#160;&#160; &#160; Administration&#160; 4400\nST_MAN&#160;&#160;&#160;&#160;&#160; &#160; Shipping&#160; &#160; &#160;&#160;&#160;&#160;36400\nPU_MAN&#160;&#160;&#160;&#160;&#160; &#160; Purchasing&#160;&#160;&#160;&#160;&#160; 11000\nSH_CLERK&#160;&#160;&#160; &#160; Shipping&#160; &#160; &#160;&#160;&#160;&#160;64300\n<\/pre>\n<p>You can create a dataframe in R to hold this data and view  it as follows:<\/p>\n<pre>df&lt;- data.frame(\n&#160; JOB_ID=c('AD_ASST','ST_MAN','PU_MAN', 'SH_CLERK'),\n&#160; DEPARTMENT_NAME = c('Administration','Shipping','Purchasing','Shipping'),\n&#160; SALARY=c(4400, 36400, 11000, 64300)\n)\n&#160;\nView(df)\n&#160;\n<\/pre>\n<p><b>  <img loading=\"lazy\" decoding=\"async\" height=\"127\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image002.jpg\" width=\"307\" alt=\"2341-clip_image002.jpg\" \/><\/b><b><\/b><\/p>\n<p>The <code>View()<\/code> command is used to provide the spreadsheet-like  view of the data.&#160; It will be omitted  in subsequent examples.&#160;&#160; Any  example that returns data can simply be wrapped in a <code>View()<\/code> method call to be  viewed in this manner.<\/p>\n<p>A pivot operation applied to the operation results in each  department name being used as a column.&#160;&#160; Essentially, the original query is wrapped in and a pivot clause is  appended.&#160; The pivot clause lists the  data value and column names to use.&#160;&#160; <\/p>\n<pre>select * from \n&#160; (select job_id, department_name, sum(salary) salary \n&#160;&#160;&#160;&#160; from&#160; DEPARTMENTS d \n&#160;&#160;&#160;&#160; join EMPLOYEES e on e.DEPARTMENT_ID = d.DEPARTMENT_ID\n&#160;&#160;&#160;&#160; where job_id in ('ST_MAN','AD_ASST','PU_MAN','SH_CLERK')\n&#160;&#160;&#160;&#160; group by job_id, department_name)\npivot (\n&#160; sum(salary) for (department_name) in (\n&#160;&#160;&#160; 'Administration',\n&#160;&#160;&#160; 'Purchasing',\n&#160;&#160;&#160; 'Shipping') \n) order by 1;\n<\/pre>\n<p>The results contain the same data presented in a pivot  format.<\/p>\n<pre>JOB_ID Administration Purchasing&#160;&#160;&#160; Shipping\nAD_ASST&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160; 4400&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; \nPU_MAN&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;11000&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; \nSH_CLERK&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 64300\nST_MAN&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;36400\n<\/pre>\n<p>Earlier SQL developers devised clever but notoriously  complicated solutions before the <span class=\"monospaced\">PIVOT <\/span>function was implemented.&#160;&#160; Some relied upon non-standard database specific functionality while  others required several queries to be executed to create intermediate results  before creating the final crosstab report.&#160; The <span class=\"monospaced\">PIVOT <\/span>function is an improvement, but remains a fairly verbose option  when compared with the ability to &#8220;cast&#8221; a database using a function from the  <span class=\"monospaced\">reshape2 <\/span>package.&#160; <\/p>\n<pre>df2&lt;- dcast(df, JOB_ID ~  DEPARTMENT_NAME)<\/pre>\n<p><b>  <img loading=\"lazy\" decoding=\"async\" height=\"123\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image004.jpg\" width=\"348\" alt=\"2341-clip_image004.jpg\" \/><\/b><b><\/b><\/p>\n<h3>Example 2: The Transpose<\/h3>\n<p>Transposing a dataset involves rotating a result set so  that the rows from the original result set are columns in the final result set.&#160; This relatively simple idea is surprisingly difficult to express in SQL  (though it is <a href=\"https:\/\/www.simple-talk.com\/sql\/t-sql-programming\/matrix-math-in-sql\/\"> certainly possible<\/a>).&#160;&#160; No  special packages are required.&#160; The R  base language contains a built in transpose function.&#160;&#160; We could just proceed to transpose, but the dataset as represented has an  inconsistency.&#160; The column names are  not part of the dataset, but the row names are.&#160; SQL has a concept of column names but not row names.&#160; Our results will be a bit neater if we assign row names based on  <span class=\"monospaced\">JOB_IDs <\/span>and remove the <span class=\"monospaced\">JOB_ID <\/span>column from the data.&#160; These two statements enact these changes.<\/p>\n<pre>rownames(df2) &lt;- df2$JOB_ID\ndf2&lt;-df2[-1]\n&#160;\n<\/pre>\n<p>We will pause here for a moment to consider a few aspects  of R that are a bit surprising.&#160; Most  programming languages do not support a function call on a variable that is the  recipient of an assignment, yet R allows you to return the row names of the data  frame and use the returned value as the recipient of an assignment.&#160; In addition, most languages require that each item in a collection be  assigned individually.&#160; This often  requires a looping construct to iterate through each available assignment target  and value.&#160; R, being vector based,  allows the set of row names associated with the data frame to be assigned all at  once.&#160; <\/p>\n<p>The second line shows one of several methods available to  remove a column.&#160; In this case, the  minus sign indicates a column should be deleted.&#160; R, being one-based rather than zero based like many other programming  languages, interprets this to mean &#8220;return a data frame like the current data  frame without the first column.&#8221;&#160;&#160; This value is then assigned back to the variable that held the original  reference to the data frame.&#160;  <\/p>\n<p>The net effect is that the first column disappears from the  dataset and row names are now associated with the data frame.<b><\/b><\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"109\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image006.jpg\" width=\"325\" alt=\"2341-clip_image006.jpg\" \/><\/p>\n<p>The idea of transposing a matrix is a well-understood  concept in linear algebra and is commonly used in statistical analysis.&#160; Since it has a long history of providing statistical functions, R  included matrix manipulation from its earliest days.&#160; All that is required to flip the data frame is to call the t function.<\/p>\n<pre>t(df2)<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"84\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image008.jpg\" width=\"351\" alt=\"2341-clip_image008.jpg\" \/><\/p>\n<p>That&#8217;s all there is to it!&#160; There are some limitations.&#160;  But in this case, with consistent numeric data and column and row names outside  of the dataset proper, the conversion works as expected.&#160;&#160; Lets look at a different example where the limitations become evident.<\/p>\n<p>Recall the &#8220;<span class=\"monospaced\">mtcars<\/span>&#8221; dataset introduced in a <a href=\"https:\/\/www.simple-talk.com\/dotnet\/software-tools\/sql-and-r-\/\">previous  post<\/a> that is derived from 1974 Motor Trend US magazine data (hence the &#8220;mt&#8221;  in the dataset name). The dataset describes fuel consumption and various aspects  of automobile design and performance for 32 automobiles.&#160; <\/p>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"385\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image010.jpg\" width=\"576\" alt=\"2341-clip_image010.jpg\" \/><\/p>\n<p>When this dataset is transposed, the columns names are car  names and the row names are features of a car.&#160; Values in individual cells of the data frame that were originally  integers are cast to decimals.<\/p>\n<pre>t(mtcars)<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"409\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image012.jpg\" width=\"577\" alt=\"2341-clip_image012.jpg\" \/><\/p>\n<p>The fundamental difference between data frames and matrices  in R is that matrices can only contain a single data type while data frames can  have a different data type in each column.&#160;&#160; The transpose function first converts the data frame to a matrix &#8211; so all  data is converted to the &#8220;least common denominator&#8221; data type.&#160; This is evident in the mtcars dataset in the gear column.<\/p>\n<p>In the worst case, you end up with a bunch of strings,  which is often not intended.&#160; This  can happen if row names are included as part of your dataset rather than being  part of the metadata of the data frame accessible using the rownames function.<\/p>\n<h2>More Reshaping of Data<\/h2>\n<p>The <span class=\"monospaced\">reshape2 <\/span>includes two methods (&#8220;<span class=\"monospaced\">melt<\/span>&#8221; and &#8220;<span class=\"monospaced\">cast<\/span>&#8220;) to  change the form, but not the content of a data frame. &#160;&#160;The base R language can perform  comparable transformations, but uses a number of different functions and  operators that are not particularly unified or consistent.&#160; The <span class=\"monospaced\">reshape2 <\/span>package provides a streamlined and consistent syntax for  these operations.<\/p>\n<p>Before we begin, we will manipulate row names using the  inverse of the operation shown earlier. &#160;&#160;Rather  than pulling a column out of the dataset and assigning the column values as row  names, the row names will be included inline in the dataset.<\/p>\n<pre>df&lt;-mtcars\ndf$car_name &lt;- rownames(mtcars)\n<\/pre>\n<p>The <span class=\"monospaced\">melt <\/span>function takes each column and represents the data  by creating a column to hold the (old) column name and a second column to hold  the actual value.&#160; Any columns listed  in the id vector are retained. <\/p>\n<pre>melt(df, id=\"c(\"\"car_name\"))<\/pre>\n<p><b> &#160;<\/b><\/p>\n<p><b>  <img loading=\"lazy\" decoding=\"async\" height=\"258\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image014.jpg\" width=\"256\" alt=\"2341-clip_image014.jpg\" \/><\/b><b><\/b><\/p>\n<p>The variable column contains entries for each column in the  original dataset (mpg, cyl, disp, hp, etc).&#160; A row is created for each value associated.&#160; This presents the data in a &#8220;long&#8221; rather than a &#8220;wide&#8221; format.<\/p>\n<p>The default version of the function selects an  <span class=\"monospaced\">id <\/span>column to  use based on data types available (and works as expected with  <span class=\"monospaced\">df<\/span>). &#160;&#160;When called in this manner, all columns  with numeric values are treated as variables with values and the  <span class=\"monospaced\">car_name <\/span>column  is assumed to be an id column.&#160; So  the results are the same as those shown above.<\/p>\n<pre>melt(df)<\/pre>\n<p>This function selects the &#8220;<span class=\"monospaced\">name<\/span>&#8221; column by default with the  current dataset.&#160; In this example, we  named the column car_name as it is more explicit, and it prevents problems with  other functions that treat columns named &#8220;<span class=\"monospaced\">name<\/span>&#8221; in a special way.&#160; <\/p>\n<p>Just to emphasize, this is not the only way to use the  <span class=\"monospaced\">melt <\/span>function, and different results are possible by modifying the call to this  function.&#160; The following example  retains the gear data in a column along with the <span class=\"monospaced\">name <\/span>column as previously  shown.<\/p>\n<pre>melt(df, id=\"c(\"\"car_name\",\"gear\"))<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"282\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image016.jpg\" width=\"336\" alt=\"2341-clip_image016.jpg\" \/><\/p>\n<p>The gear column is not included as a value in the variable  data, but instead is retained as a column.&#160;&#160; As a result, the total number of rows in the dataset is reduced.&#160; The data is a bit &#8220;wider&#8221; than the previous example, but it is not a  particularly practical arrangement.<\/p>\n<p>We will store the melted data in a second data frame.<\/p>\n<pre>melted &lt;- melt(df, id=\"c(\"\"car_name\"))<\/pre>\n<p>The complementary function for melt is cast.&#160; There are a few variations on this function that allow you to choose what  data structure is returned.&#160; We will  use the dcast variation that returns results as a data frame. We will store the  results in a variable and will look at them a bit closer in the next section.<\/p>\n<pre>df2 &lt;- dcast(melted, car_name ~ variable, value.var = c(\"value\")) <\/pre>\n<p>The tilde ~ operator is typically used in R to separate the  left and right hand sides of a formula.&#160; While the data has been converted back and forth without changing the  content of the data, the row and column order may have changed.&#160;&#160; <\/p>\n<p>At this point we have two data sets, the original, and a  second dataset that has been melted and recast.&#160; To do a final comparison that demonstrates these are identical, we need  to perform a few operations that order the columns and rows. &#160;If we don&#8217;t perform these steps, the two  data frames will be considered &#8220;not equal&#8221; simply because the rows or columns  are ordered differently.&#160; These steps  essentially &#8220;normalize&#8221; the datasets by sorting the columns and rows in  alphabetical order by row name.<\/p>\n<p>The row names is stored in alphabetical order in a vector.<\/p>\n<pre>cn &lt;- sort(colnames(df))<\/pre>\n<p>Two new dataframes are created containing the data with the  columns in alphabetical order.<\/p>\n<pre>original&lt;- df[cn]\nreshaped&lt;- df2[cn]\n<\/pre>\n<p>The data itself is reordered so that rows are in  alphabetical order by name.<\/p>\n<pre>original &lt;- original[order(original$car_name),]\nreshaped &lt;- reshaped[order(reshaped$car_name),]\n<\/pre>\n<p>Because rownames are metadata and we have included them as  the names column in the data frames, we will not consider them in the comparison  of the two data frames.<\/p>\n<pre>rownames(reshaped)&lt;-NULL\nrownames(original)&lt;-NULL\n<\/pre>\n<p>With these arrangements in place, the original dataset and  the melted-recast dataset contain the same data.<\/p>\n<pre>all.equal(original, reshaped)\n<\/pre>\n<p>The function call returns TRUE, indicating that the two  data frames contain identical values.&#160; If you wanted to retain this set of transformations for use in subsequent  tests, you can use the following function.<\/p>\n<pre>normalize_df &lt;- function(df, sort_column){\n&#160; cn &lt;- sort(colnames(df))\n&#160; df &lt;- df[cn]\n&#160; df &lt;- df[order(df[sort_column]),]\n&#160; rownames(df) &lt;- NULL\n&#160; return(df)\n}\n<\/pre>\n<p>This function can compare the data frames without the need  to run each normalization step manually.<\/p>\n<pre>all.equal(\n&#160; normalize_df(df, &#160;'car_name'), \n&#160; normalize_df(df2, 'car_name')\n)\n\t<\/pre>\n<p>R lends itself to simple interactive examples.&#160; Functions like this suggest one of the most useful techniques to learn  the R language.&#160;&#160; In a few lines  of code, a function of this sort can be created to verify an operation is  behaving as expected.&#160; Often the best  way to ensure you understand what R is doing is to create simple examples to  test and validate your expectations. <\/p>\n<h2>Reshaping and Tidy Data<\/h2>\n<p>Programming languages can be classified by basic philosophy  that drives the design of the language and impact subsequent community activity  such as package development.&#160;  <a href=\"https:\/\/en.wikipedia.org\/wiki\/There's_more_than_one_way_to_do_it\"> TIMTOWTDI<\/a> (pronounced &#8220;Tim Toady&#8221;) stands for &#8220;There is more than one way to  do it&#8221; and was popularized by the Perl community.&#160; While some programmers appreciate the flexibility afforded by such an  approach, others are irritated by the ambiguity introduced.&#160; Languages like Python take the opposite approach and are designed so that  there is only one way to do things whenever possible. Regardless of your opinion  on the best approach, it is important to realize that R packages have  significant overlap in terms of functionality.&#160; Often features available in the base language are packaged as a domain  specific language in a particular package for consistency or convenience.&#160; An awareness of this fact will prevent confusion and make you more  effective at selecting packages to rapidly solve problems.<\/p>\n<p>The <span class=\"monospaced\">tidyr <\/span>package includes functionality found in the <span class=\"monospaced\">reshape2 <\/span>package but the overall design has a different purpose.&#160; Load up the package to see the similarities demonstrated below.<\/p>\n<pre>library(tidyr)<\/pre>\n<p>The <span class=\"monospaced\">tidyr <\/span>package contains methods that correspond to  <span class=\"monospaced\">melt <\/span>and <span class=\"monospaced\">cast<\/span>:&#160; gather and spread.&#160; The <span class=\"monospaced\">gather <\/span>function is called along with the &#8220;bare&#8221; column name &#8211; the  name of the column without enclosing it in quotes.<\/p>\n<pre>melted_tidyr &lt;- gather(df, car_name)<\/pre>\n<p>The only difference between the dataframe in this case with  the one produced by <span class=\"monospaced\">reshape2 <\/span>is the rownames.&#160; The all.equal method includes an option to ignore the rownames.<\/p>\n<pre>all.equal(melted, melted_tidyr, check.names=FALSE)<\/pre>\n<p>The function call returns <span class=\"monospaced\">TRUE<\/span>.&#160; We will ensure that the column names are unique by explicitly naming them  and then call the <span class=\"monospaced\">spread <\/span>function to convert the data back to its original  format.<\/p>\n<pre>colnames(melted_tidyr ) &lt;- c('car_name', 'variable', 'value')\nspread_tidyr &lt;- spread(melted_tidyr, variable, value)\n<\/pre>\n<p>The final result is a data frame that matches the original  (once the data frames are normalized).<\/p>\n<pre>all.equal(\n&#160; normalize_df(df, 'car_name'), \n&#160; normalize_df(spread_tidyr, 'car_name')\n)\n\t<\/pre>\n<p>The <span class=\"monospaced\">reshape2 <\/span>package is designed for general purpose  reshaping of data.&#160; The  <span class=\"monospaced\">tidyr <\/span>package  is specifically designed for getting data into a &#8220;tidy&#8221; format.&#160; Hadley Wickham coined the term &#8220;<a href=\"http:\/\/vita.had.co.nz\/papers\/tidy-data.pdf\">Tidy  Data<\/a>&#8221; and is the author of the packages referenced in this article.&#160; It is a standardized form for data &#8211; closely related to the idea of  normal forms in databases &#8211; that simplifies subsequent analysis and processing.<\/p>\n<h2>SQL-like Manipulations in a Functional Style<\/h2>\n<p>SQL is the standard language for manipulating data in  relational databases and has influenced a variety of query languages in  non-relational data stores.&#160; It can  be used to manipulate data frames in R using the sqldf package.&#160; The dplyr package provides functions comparable to those available in  SQL.&#160; <\/p>\n<table class=\"MsoTableGrid\">\n<tbody>\n<tr>\n<td valign=\"top\">\n<p><b>SQL Keyword<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p><b>dplyr<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p>SELECT<\/p>\n<\/td>\n<td valign=\"top\">\n<p>select, mutate<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p>WHERE<\/p>\n<\/td>\n<td valign=\"top\">\n<p>filter<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p>ORDER BY<\/p>\n<\/td>\n<td valign=\"top\">\n<p>arrange<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p>GROUP BY<\/p>\n<\/td>\n<td valign=\"top\">\n<p>group_by<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p>HAVING<\/p>\n<\/td>\n<td valign=\"top\">\n<p>filter<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Besides its practical usefulness, the package is  interesting as it provides a different way of constructing data manipulation  that is based on functional programming and pipes rather than SQL statements.&#160; The value of this approach becomes more apparent when long complex  queries consisting of nested subqueries are replaced with a chain of piped  functions.<\/p>\n<p>The <span class=\"monospaced\">dplyr <\/span>package was introduced in the <a href=\"https:\/\/www.simple-talk.com\/dotnet\/software-tools\/r-basics\/\">R-Basics  post<\/a>.&#160;   The&#160;<b>dplyr<\/b>&#160;library uses the operator (%&gt;%) to stream the results of one  function to the next (like UNIX pipes).&#160;&#160; This operator is available as soon as the library is loaded and can be  used with functions that are not part of dplyr itself.<\/p>\n<pre>library(dplyr)<\/pre>\n<h3>SELECT<\/h3>\n<p>The functions that correspond to the SQL  <span class=\"monospaced\">SELECT <\/span>statement  are <span class=\"monospaced\">select<\/span>() and <span class=\"monospaced\">mutate<\/span>().&#160; We will  use a piping operator to chain together calls on the data frame based on the  <span class=\"monospaced\">mtcars <\/span>data frame.&#160; The  <span class=\"monospaced\">mutate <\/span>column  is used to add derived data frame columns.&#160; In this case, we will append the car name, the number of cylinders and  the string literal cylinder.&#160; This  new column will be named description.&#160; We then filter which columns will be viewable and restrict them to the  new description column and the gear column.&#160; Finally, we will pipe to the <span class=\"monospaced\">View <\/span>function (which is not part of  <span class=\"monospaced\">dplyr<\/span>)  which will display the final result. <\/p>\n<pre>df %&gt;% \n&#160; mutate(description=paste(car_name, cyl, 'cylinder')) %&gt;%\n&#160;&#160;select(description, gear) %&gt;% \n&#160; View()\n&#160;\n<\/pre>\n<p><b>  <img loading=\"lazy\" decoding=\"async\" height=\"298\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image018.jpg\" width=\"262\" alt=\"2341-clip_image018.jpg\" \/><\/b><b><\/b><\/p>\n<h3>ORDER BY<\/h3>\n<p>The ORDER BY function is used to sort the data based on one  or more columns.&#160; The arrange  function is the dplyr equivalent. &#160;&#160;<\/p>\n<pre>df %&gt;% select(car_name) %&gt;% arrange(car_name) %&gt;% View()&#160;<\/pre>\n<p><b>  <img loading=\"lazy\" decoding=\"async\" height=\"283\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image020.jpg\" width=\"151\" alt=\"2341-clip_image020.jpg\" \/><\/b><b><\/b><\/p>\n<h3>GROUP BY<\/h3>\n<p>The SQL <span class=\"monospaced\">GROUP BY<\/span> clause is used in close conjunction with  aggregate functions like <span class=\"monospaced\">SUM() <\/span>and  <span class=\"monospaced\">MEAN()<\/span>.&#160; The clause indicates the level of grouping at which a given summary  function is applied.&#160; When using &#160;the <span class=\"monospaced\">dplyr<\/span> package, it is common practice  to use the <span class=\"monospaced\">summarize <\/span>and &#160;<span class=\"monospaced\">group_by  <\/span>functions together.<\/p>\n<pre>df %&gt;% group_by(cyl) %&gt;% summarize(mean(hp)) %&gt;% View()<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"88\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image022.jpg\" width=\"180\" alt=\"2341-clip_image022.jpg\" \/><\/p>\n<p>In SQL, it is necessary to list the columns referenced in  the <span class=\"monospaced\">GROUP BY<\/span> clause in the<span class=\"monospaced\"> SELECT<\/span> clause &#8211; along with any aggregate functions.&#160; The <span class=\"monospaced\">dplyr<\/span> package just requires you to specify the function used to  aggregate in the summarize clause.&#160;  The <span class=\"monospaced\">dplyr<\/span> functions can be chained together in creative ways that would require  nested subqueries in SQL.&#160; For  instance, if you wanted to compare the average just calculated with the actual  value in each row, this can be accomplished using the <span class=\"monospaced\">mutate<\/span> function rather  than summarize.<\/p>\n<pre>df %&gt;% group_by(cyl) %&gt;% \n&#160;&#160; mutate(mhp=mean(hp)) %&gt;% \n&#160;&#160; select(cyl, hp, mhp) %&gt;%\n&#160;&#160; arrange(cyl, hp) %&gt;% View()\n<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"290\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image024.jpg\" width=\"220\" alt=\"2341-clip_image024.jpg\" \/><\/p>\n<h3>WHERE\/HAVING<\/h3>\n<p>The <span class=\"monospaced\">WHERE <\/span>clause in SQL is used to filter which rows are  returned.&#160; The  <span class=\"monospaced\">HAVING <\/span>clause is used  when evaluating summaries that rely upon a <span class=\"monospaced\">GROUP BY<\/span>.&#160; The simplest case is filtering using an equality operator.&#160; <\/p>\n<pre>df %&gt;% filter(car_name=='Volvo 142E') %&gt;% View()<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"39\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image026.jpg\" width=\"422\" alt=\"2341-clip_image026.jpg\" \/><\/p>\n<p>This example is essentially the same usage that you would  obtain using a <span class=\"monospaced\">WHERE <\/span>clause in SQL.&#160;  The next example uses an aggregate function <i>inside<\/i> of a filter function.&#160; This corresponds with the SQL <span class=\"monospaced\">HAVING <\/span>clause.&#160; Note that the results include records which have an actual horse power  (hp) that is less than 200.&#160; However,  they are part of the grouping (by cyl) that is greater than 200.<\/p>\n<pre>&#160;\ndf %&gt;% group_by(cyl) %&gt;% \n&#160;&#160;&#160;&#160;&#160;&#160; mutate(mhp=mean(hp)) %&gt;% \n&#160;&#160;&#160;&#160;&#160;&#160; filter(mhp &gt; 200 &amp; mpg &lt; 17) %&gt;% \n&#160;&#160;&#160;&#160;&#160;&#160; select(car_name, mhp, hp, mpg) %&gt;% \n&#160;&#160;&#160;&#160;&#160;&#160; View()\n<\/pre>\n<p> <img loading=\"lazy\" decoding=\"async\" height=\"267\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2341-clip_image028.jpg\" width=\"356\" alt=\"2341-clip_image028.jpg\" \/><\/p>\n<p>As this example illustrates, the arrangement of functions  in a chain is less restrictive than the structure of SQL queries.&#160; Each function is independent and discrete, whereas a unit of work in SQL  is a statement that must contain a <span class=\"monospaced\">SELECT <\/span>and may contain the other clauses  referenced.&#160;&#160; SQL statements can  be nested using subqueries to effectively &#8220;chain&#8221; together SQL statements.&#160;&#160; The dplyr function interface is less verbose and more focused on the  specific aspects of the data that are changing with each step processed.&#160; <\/p>\n<h2>R a Data Manipulation Platform<\/h2>\n<p>SQL is &#8211; by definition &#8211; a query language.&#160; It excels at retrieving data from a database and is in fact essential in  many situations where it is the only way to get data out of a database.<\/p>\n<p>However, SQL can be cumbersome when it is used to transform  data.&#160; SQL is very flexible and so  does support the ability to transform data in significant ways, but often with  the cost of requiring verbose, obscure, and difficult to support SQL statements.&#160; R includes a number of packages that can perform such operations in a  concise, clear, simple manner.&#160;&#160;  It is well worth the time to learn these packages so that the best aspects of  both SQL and R packages can be used to analyze data using a series of steps that  is easy to understand and comprehend.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Although SQL is an obvious choice for retrieving  the data for analysis, it strays outside its comfort zone when dealing with pivots and matrix manipulations. R includes a number of packages that can do these simply. By combining the two, you can prepare your data for analysis or visualisation in R more efficiently.&hellip;<\/p>\n","protected":false},"author":177915,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143538],"tags":[4143,4178,6011,4150],"coauthors":[],"class_list":["post-2144","post","type-post","status-publish","format-standard","hentry","category-dotnet-development","tag-net","tag-bi","tag-software-tools","tag-sql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2144","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/177915"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=2144"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2144\/revisions"}],"predecessor-version":[{"id":90788,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2144\/revisions\/90788"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=2144"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=2144"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=2144"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=2144"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}