{"id":72498,"date":"2017-09-06T16:40:21","date_gmt":"2017-09-06T16:40:21","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=72498"},"modified":"2021-09-15T13:22:18","modified_gmt":"2021-09-15T13:22:18","slug":"sql-server-r-services-working-data-frames","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/sql-server-r-services-working-data-frames\/","title":{"rendered":"SQL Server R Services: Working with Data Frames"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-basics\/\">SQL Server R Services: The Basics<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-digging-r-language\/\">SQL Server R Services: Digging into the R Language<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-working-ggplot2-statistical-graphics\/\">SQL Server R Services: Working with ggplot2 Statistical Graphics<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-working-data-frames\/\">SQL Server R Services: Working with Data Frames<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-generating-sparklines-types-spark-graphs\/\">SQL Server R Services: Generating Sparklines and Other Types of Spark Graphs<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-r-services-working-multiple-data-sets\/\">SQL Server R Services: Working with Multiple Data Sets<\/a><\/li>\n<\/ol>\n\n<p>In the first article in this series, I introduced you to the concept of the data frame, a type of object in the R language for working with a set of related data, similar to a table variable in SQL Server. The second and third articles also touched upon data frames, but only in passing. In this article, we dig into the data frame in much more detail.<\/p>\n<p>Data frames are an important concept in the R language and are integral to understanding how to write R scripts when working with SQL Server R Services. In fact, much of the work you do within an R script is related to creating data frames, restructuring them, or transforming their data in some way. You can, of course, incorporate other object types, such as vectors or lists, but you must still rely heavily on the data frame structure.<\/p>\n<p>For example, when you create an R script that leverages SQL Server data, you use the <strong>InputDataSet<\/strong> variable to import the data into the script. To return the script\u2019s output to a calling application, you use the <strong>OutputDataSet<\/strong> variable. Both variables come as part of the R Services package and are defined with the <strong>data.frame<\/strong> type, an R class used to structure an object as a data frame.<\/p>\n<p>Another way to look at this is that a typical R script in SQL Server starts with a data frame and ends with a data frame, with operations carried out in between. Even if you pass the data into a <strong>ggplot2<\/strong> visualization\u2014and don\u2019t return the data to an application\u2014you must get the data into a data frame.<\/p>\n<h2>Introducing the R data frame<\/h2>\n<p>The data frame is one of the most widely implemented data objects in R, providing table-like structures for manipulating and analyzing sets of data. Like a SQL Server table, a data frame is made up of rows and columns. The rows are sometimes referred to as observations, and the columns referred to as variables. In fact, you can think of a data frame as a collection of variables that have the same length (number of rows).<\/p>\n<p>As already noted, R Services includes two important data frame variables: <strong>InputDataSet<\/strong> and <strong>OutputDataSet<\/strong>. You can write a simple R script that uses the <strong>InputDataSet<\/strong> variable to access SQL Server data, and use the <strong>OutputDataSet<\/strong> variable to return that data to a calling application, without doing anything in between, as shown in the following T-SQL script:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">USE AdventureWorks2014;\r\n  GO\r\n  DECLARE @rscript NVARCHAR(MAX);\r\n  SET @rscript = N'\r\n  \u00a0 OutputDataSet &lt;- InputDataSet';\r\n  DECLARE @sqlscript NVARCHAR(MAX);\r\n  SET @sqlscript = N'\r\n  \u00a0 SELECT h.Subtotal, t.Name AS Territories\r\n  \u00a0 FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n  \u00a0 \u00a0 ON h.TerritoryID = t.TerritoryID;';\r\n  EXEC sp_execute_external_script\r\n  \u00a0 @language = N'R',\r\n  \u00a0 @script = @rscript,\r\n  \u00a0 @input_data_1 = @sqlscript;\u00a0\r\n  GO<\/pre>\n<p>The R script is about as basic as you can get in SQL Server. It assigns the data from the <strong>InputDataSet<\/strong> variable to the <strong>OutputDataSet<\/strong> variable, copying the data from one data frame object to another in order to return the data to the application that called the <strong>sp_execute_external_script<\/strong> stored procedure. (For information about the various T-SQL elements in the above example, refer to the previous articles in this series.)<\/p>\n<p>Normally, a SQL Server R script would contain far more logic between the <strong>InputDataSet<\/strong> and <strong>OutputDataSet<\/strong> variables, transforming the data along with way. In fact, this example is so simple that we can achieve the same results by running the <strong>SELECT<\/strong> statement alone, without all the other T-SQL trappings, but I wanted to approach this topic one step at a time. To this end, let\u2019s start by confirming the data type of the <strong>InputDataSet<\/strong> variable, which we can do by using the <strong>class<\/strong> function, as in the following <strong>SET<\/strong> statement:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 df_class &lt;- class(InputDataSet)\r\n  \u00a0 print(df_class)';<\/pre>\n<p>Note that in this example and the ones to follow, I show only the <strong>SET<\/strong> statement used to assign the R script to the <strong>@rscript<\/strong> variable. The other T-SQL elements are the same throughout all the examples, so I opted not to repeat them each time. Just know that you must run every T-SQL statement shown in the first example if trying out the other examples.<\/p>\n<p>Returning now to the above example, all we\u2019re doing here is specifying the <strong>InputDataSet<\/strong> variable as an argument to the <strong>class<\/strong> function and then assigning the results to the <strong>df_class<\/strong> variable. We then use the <strong>print<\/strong> function to return the <strong>df_class<\/strong> value. If we now run the T-SQL code, it will confirm that the <strong>InputDataSet<\/strong> variable uses the <strong>data.frame<\/strong> type.<\/p>\n<p>Now let\u2019s assign the <strong>InputDataSet<\/strong> variable to the <strong>sales<\/strong> variable and then retrieve the data type for the new variable:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 print(class(sales))';<\/pre>\n<p>Once again, the R script indicates that the variable uses the <strong>data.frame<\/strong> type. If we were to view the contents of the <strong>sales<\/strong> variable, we would see that it contains our SQL Server data.<\/p>\n<p>Next, we\u2019ll assign the <strong>sales<\/strong> variable to the <strong>OutputDataSet<\/strong> variable and use the <strong>class<\/strong> function to retrieve the data type for that variable:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 OutputDataSet &lt;- sales\r\n  \u00a0 print(class(OutputDataSet))';<\/pre>\n<p>This time the R script will return all rows in the data set, followed by a message that indicates the <strong>OutputDataSet<\/strong> variable uses the <strong>data.frame<\/strong> type. As the examples demonstrate, we start with a data frame and end with a data frame, bringing the SQL Server data along the way.<\/p>\n<h2>Creating a data frame<\/h2>\n<p>When developing R scripts in SQL Server, you\u2019ll likely want to construct data frames to help with your analyses. The data frames might use data from the <strong>InputDataSet<\/strong> variable, data brought in through another source, or a combination of both. The R language supports a wide range of possibilities when it comes to mixing and matching data.<\/p>\n<p>To demonstrate how to build a data frame, we\u2019ll create one with two columns, defining one column at a time. Although each column in a data frame must be the same length, you can add columns made up of different types of data, whether lists, vectors, factors, numeric matrices, or other data frames.<\/p>\n<p>We\u2019ll start by defining the data frame\u2019s first column, which is based on the values in the <strong>Territories<\/strong> column of our SQL Server data set:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 print(c1)';<\/pre>\n<p>The <strong>levels<\/strong> function retrieves a list of distinct values from the <strong>Territories<\/strong> column in the <strong>sales<\/strong> data frame. When referencing the column, we must first specify the <strong>sales<\/strong> variable, followed by a dollar sign (<strong>$<\/strong>), and then the column name. We then assign the output of the <strong>levels<\/strong> function to the <strong>c1<\/strong> variable, which gives us what we need to create the first column.<\/p>\n<p>For this example, I\u2019m also tagging on the <strong>print<\/strong> function so we can view the <strong>c1<\/strong> content. The R script returns the following results:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">STDOUT message(s) from external script:\u00a0\r\n  \u00a0[1] \"Australia\"\u00a0 \u00a0 \u00a0 \"Canada\" \u00a0 \u00a0 \u00a0 \u00a0 \"Central\"\u00a0 \u00a0 \u00a0 \u00a0 \"France\" \u00a0 \u00a0 \u00a0 \u00a0\r\n  \u00a0[5] \"Germany\"\u00a0 \u00a0 \u00a0 \u00a0 \"Northeast\"\u00a0 \u00a0 \u00a0 \"Northwest\"\u00a0 \u00a0 \u00a0 \"Southeast\"\u00a0 \u00a0 \u00a0\r\n  \u00a0[9] \"Southwest\"\u00a0 \u00a0 \u00a0 \"United Kingdom\"<\/pre>\n<p>In SQL Server, when we return data from an R script in this way, without using the <strong>OutputDataSet<\/strong> variable, we get the data in a format such as that shown above. This approach provides us with a quick way to confirm an object\u2019s values as we\u2019re building our data frame.<\/p>\n<p>Now let\u2019s define the second column. This time, we\u2019ll use the <strong>tapply<\/strong> function to return the aggregated means of the <strong>Subtotal<\/strong> values, grouped by sales territory:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 print(c2)';<\/pre>\n<p>The <strong>tapply<\/strong> function takes three arguments. The first is the <strong>Subtotal<\/strong> column in the <strong>sales<\/strong> data set. This column contains the values that will be aggregated.<\/p>\n<p>The second argument is the <strong>Territories<\/strong> column in the <strong>sales<\/strong> dataset. This argument determines the factors used in aggregating the data. In other words, the <strong>Territories<\/strong> column provides the basis for how the values in the <strong>Subtotal<\/strong> column will be grouped together and aggregated. The unique territory names also provide a mechanism for indexing the aggregated values.<\/p>\n<p>The third argument determines how the data will be aggregated. Because we\u2019ve specified the <strong>mean<\/strong> option, SQL Server will return the mean subtotal for each territory. Notice that we also use the <strong>round<\/strong> function to round the aggregated totals to integers.<\/p>\n<p>We can then use the <strong>print<\/strong> function to return the values in the <strong>C2<\/strong> column, which are shown in the following results:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">STDOUT message(s) from external script:\u00a0\r\n  \u00a0\u00a0 \u00a0 Australia \u00a0 \u00a0 \u00a0 \u00a0 Canada\u00a0 \u00a0 \u00a0 \u00a0 Central \u00a0 \u00a0 \u00a0 \u00a0 France\u00a0 \u00a0 \u00a0 \u00a0 Germany\u00a0\r\n  \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1557 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 4022\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 20543 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2714 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1874\u00a0\r\n  \u00a0\u00a0 \u00a0 Northeast\u00a0 \u00a0 \u00a0 Northwest\u00a0 \u00a0 \u00a0 Southeast\u00a0 \u00a0 \u00a0 Southwest United Kingdom\u00a0\r\n  \u00a0\u00a0 \u00a0 \u00a0 \u00a0 19714 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 3501\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 16213 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 3886 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2383\u00a0<\/pre>\n<p>Although the results might be a bit tricky to read, they confirm that we have the data we need in the <strong>C2<\/strong> column, with the territory names acting as an index for the <strong>Subtotal<\/strong> values. We can now create a data frame that uses the <strong>Subtotal<\/strong> values in conjunction with the <strong>Territories<\/strong> values in the <strong>C1<\/strong> column. To do so, we use the <strong>data.frame<\/strong> function to join the columns into a data frame:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>When we call the <strong>data.frame<\/strong> function, we pass in the <strong>c1<\/strong> and <strong>c2<\/strong> variables as arguments. We then save the function\u2019s output to the <strong>sales2<\/strong> variable, giving us a data frame that we can then assign to the <strong>OutputDataSet<\/strong> variable. When we run the R script, we get the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"247\" height=\"238\" class=\"wp-image-72499\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>We can assign names to the outputted columns when calling the <strong>sp_execute_external_script<\/strong> stored procedure, but we don\u2019t need to be concerned about that right now. However, if you want to add column names and are uncertain how to do that, refer to the first article in this series.<\/p>\n<h2>Retrieving data frame information<\/h2>\n<p>When working with data frames, we can retrieve information about them in order to verify data or troubleshoot a script. For example, we can retrieve a data frame\u2019s length, dimensions, number of columns, and number of rows, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 print(length(sales2))\r\n  \u00a0 print(dim(sales2))\r\n  \u00a0 print(ncol(sales2))\r\n  \u00a0 print(nrow(sales2))';<\/pre>\n<p>First, we use the <strong>length<\/strong> function to determine the object\u2019s length, which for data frames, translates to the number of variables (columns). We then call the <strong>dim<\/strong> function to return the data frame\u2019s dimensions, which gives us the number of rows and columns.<\/p>\n<p>The R script then calls the <strong>ncol<\/strong> function to return the data frame\u2019s number of columns (just like <strong>length<\/strong>), and the <strong>nrow<\/strong> function to return the number of rows. When we run the script, we get the following results:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">STDOUT message(s) from external script:\u00a0\r\n  [1] 2\r\n  [1] 10\u00a0 2\r\n  [1] 2\r\n  [1] 10<\/pre>\n<p>As you can see, the functions provide a simple way to return specific types of information. In addition to using them for verification and troubleshooting, we can also incorporate them within a script\u2019s logic to control statement flow based on the returned values.<\/p>\n<p>Another option we can use for getting information about a data frame is to call the <strong>summary<\/strong> function, passing in the data frame as an argument:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 print(summary(sales2))';<\/pre>\n<p>The <strong>summary<\/strong> function returns specifics about each column in the data frame, as shown in the following results:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">STDOUT message(s) from external script:\u00a0\r\n  \u00a0\u00a0 \u00a0 \u00a0 \u00a0 c1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 c2\u00a0 \u00a0 \u00a0 \u00a0\r\n  \u00a0Australia:1 \u00a0 Min. \u00a0 : 1557 \u00a0\r\n  \u00a0Canada \u00a0 :1 \u00a0 1st Qu.: 2466 \u00a0\r\n  \u00a0Central\u00a0 :1 \u00a0 Median : 3694 \u00a0\r\n  \u00a0France \u00a0 :1 \u00a0 Mean \u00a0 : 7641 \u00a0\r\n  \u00a0Germany\u00a0 :1 \u00a0 3rd Qu.:13165 \u00a0\r\n  \u00a0Northeast:1 \u00a0 Max. \u00a0 :20543 \u00a0\r\n  \u00a0(Other)\u00a0 :4\u00a0<\/pre>\n<p>The information returned by the <strong>summary<\/strong> function varies depending upon the type of data in a column. For example, the <strong>c1<\/strong> column includes unique character values, so the function simply lists several initial values, each followed by a <strong>1<\/strong>, which indicates that there that there is only one instance of this value. The column summary then ends with a grouping of the remaining values (<strong>Other<\/strong>), followed by the number of those values, in this case, <strong>4.<\/strong><\/p>\n<p>The <strong>c2<\/strong> column includes only numeric values, so the <strong>summary<\/strong> function instead returns the minimum, median, mean, and maximum values, along with the first and second quartiles. The first quartile shows that about 25% of the values fall below <strong>2466<\/strong>, and the third quartile shows that about 75% of the values fall below <strong>13165<\/strong>.<\/p>\n<p>We can also use the <strong>str<\/strong> function to return information about a data set, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 print(str(sales2))';<\/pre>\n<p>The <strong>str<\/strong> function returns an object\u2019s internal structure and provides an abbreviated list of its contents, as shown in the following results:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">STDOUT message(s) from external script:\u00a0\r\n  'data.frame':\t10 obs. of\u00a0 2 variables:\r\n  \u00a0$ c1: Factor w\/ 10 levels \"Australia\",\"Canada\",..: 1 2 3 4 5 6 7 8 9 10\r\n  \u00a0$ c2: num [1:10(1d)] 1557 4022 20543 2714 1874 ...\r\n  \u00a0 ..- attr(*, \"dimnames\")=List of 1\r\n  \u00a0 .. ..$ : chr\u00a0 \"Australia\" \"Canada\" \"Central\" \"France\" ...\r\n  NULL<\/pre>\n<p>The first line tells us that the object is indeed a data frame and that it contains 10 rows (observations) and 2 columns (variables). We then get details about each of the two columns. In this case, the results indicate that the <strong>c1<\/strong> column is a factor variable containing a vector with 10 levels, and the <strong>c2<\/strong> column is a numeric variable containing a vector with 10 elements. Each listing also includes a sample of the data. Plus, the <strong>c2<\/strong> listing, which is made up of three lines, includes additional information about its index.<\/p>\n<h2>Manipulating data frame columns<\/h2>\n<p>In addition to being able to view information about a data frame, the R language lets you manipulate its rows and columns. For example, you can add columns, reorder them, or delete them. (We\u2019ll get to rows in the next section.) Let\u2019s start with adding a column. To do so, we simply define a new variable and then use the <strong>cbine<\/strong> function to add that variable to the primary data frame, as in the following script:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 c3 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 sales3 &lt;- cbind(sales2, c3)\r\n  \u00a0 OutputDataSet &lt;- sales3';<\/pre>\n<p>The first part of the script defines the initial data frame (<strong>sales2<\/strong>), as we\u2019ve seen in preceding examples. Next, we define the <strong>c3<\/strong> variable by creating a vector containing 10 values that coincide with the data frame\u2019s existing rows and values. Notice that we must use the <strong>c<\/strong> function to concatenate the values before assigning them to the <strong>c3<\/strong> variable.<\/p>\n<p>In this example, we\u2019re taking a very simple and manual approach to defining our column, but know that the R language lets us construct columns in a variety of ways, as well as import data from different source types, such as .csv files. In this way, we can create data frames that include a wide range of data from both SQL Server and other sources.<\/p>\n<p>After defining the <strong>c3<\/strong> variable, we use the <strong>cbine<\/strong> function to append the variable to the <strong>sales2<\/strong> data frame. We then assign the function\u2019s output to the <strong>sales3<\/strong> variable. When calling the <strong>cbine<\/strong> function, we need only specify the <strong>sales2<\/strong> data frame and <strong>c3<\/strong> variable as arguments. From there, we assign the <strong>sales3<\/strong> variable to the <strong>OutputDataSet<\/strong> variable, giving us the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"347\" height=\"235\" class=\"wp-image-72500\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-1.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>In some cases, we might want to rearrange the columns in the data frame. For example, we can move the <strong>c3<\/strong> column to the second position by using the data frame\u2019s indexing:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 c3 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2, c3)\r\n  \u00a0 sales2 &lt;- sales2[c(1,3,2)]\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>This time around, we\u2019re creating the <strong>sales2<\/strong> data frame as a single operation. We then call the data frame, followed by square brackets that specify the column order. In the R language, the brackets act as a subsetting function that lets us filter our datasets. We can create filter conditions based on rows, columns, or both. If we want to create a condition based on columns, but not rows, we include only a single argument, which in this case, is the new column order, based on their internally generated index numbers. Note, however, we must also use the <strong>c<\/strong> function to concatenate the list of indexes when reshuffling them in this way. We then reassign the changes back to the <strong>sales2<\/strong> variable, giving us the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"351\" height=\"240\" class=\"wp-image-72501\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-2.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>Now suppose we want to delete the <strong>c3<\/strong> column from the data frame. To do so, we simply set the value of that column to <strong>NULL<\/strong>, as shown in the following script:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 c3 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2, c3)\r\n  \u00a0 sales2$c3 &lt;- NULL\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>When referencing the <strong>c3<\/strong> column, we specify the <strong>sales2<\/strong> data frame, followed by a dollar sign (<strong>$<\/strong>) and then the column name. From there, we need only assign <strong>NULL<\/strong> to the column reference. This will remove the <strong>c3<\/strong> column from the data set and leave us with only the <strong>c1<\/strong> and <strong>c2<\/strong> columns.<\/p>\n<h2>Manipulating data frame rows<\/h2>\n<p>We can also add rows to a data frame or remove them. For example, to add a row to our <strong>sales2<\/strong> data frame, we define a new data frame that contains two columns and one row, and then use the <strong>rbind<\/strong> function to insert the row into the original data frame:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 r1 &lt;- data.frame(Territories=\"New Zealand\", Sales=1048)\r\n  \u00a0 sales2 &lt;- rbind(sales2, r1)\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>The first thing to note here is that, after creating the <strong>sales2<\/strong> data frame, we name the data frame\u2019s columns. This provides us with more intuitive names for referencing the columns in subsequent statements. We could have instead provided different names for the original <strong>c1<\/strong> and <strong>c2<\/strong> variables or stuck with the <strong>c1<\/strong> and <strong>c2<\/strong> names throughout the entire script, but I wanted to demonstrate the ability to modify a data frame\u2019s column names.<\/p>\n<p>When assigning new columns names, we use the <strong>names<\/strong> function, passing in <strong>sales2<\/strong> as an argument. We then use the <strong>c<\/strong> function to concatenate the list of new column names and pass the results to the <strong>names<\/strong> function.<\/p>\n<p>With our data frame names in place, we\u2019re ready to create our new row. To do so, we call the <strong>data.frame<\/strong> function and pass in the new values as arguments, using the updated column names when delineating those values. We then assign the results to the <strong>r1<\/strong> variable.<\/p>\n<p>After defining the row, we use the <strong>rbind<\/strong> function to add the row to the <strong>sales2<\/strong> data frame, passing in the original data frame and new variable as arguments. We then reassign the output to the <strong>sales2<\/strong> variable, giving us the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"248\" height=\"254\" class=\"wp-image-72502\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-3.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>If we want to delete a row from a data set, we essentially filter it out, reassigning the results to the existing variable or assigning them to a new one. For example, we can use the <strong>sales2<\/strong> variable to create an expression that includes the logic necessary to filter out the row with a <strong>Territories<\/strong> value of <strong>New<\/strong> <strong>Zealand<\/strong>, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 r1 &lt;- data.frame(Territories=\"New Zealand\", Sales=1048)\r\n  \u00a0 sales2 &lt;- rbind(sales2, r1)\r\n  \u00a0 sales3 &lt;- sales2[sales2$Territories != \"New Zealand\",]\r\n  \u00a0 OutputDataSet &lt;- sales3';<\/pre>\n<p>We define our expression within a set of brackets that follow the <strong>sales2<\/strong> variable. The expression begins by referencing the column name, rather than index number. We then use the not equal (<strong>!=<\/strong>) comparison operator to compare the column value to the string <strong>New<\/strong> <strong>Zealand<\/strong>. The comma at the end of the expression indicates that we\u2019re subsetting the data by rows, not columns. (If we omit the comma, the R engine will assume we\u2019re indexing by columns). We then assign the results to the <strong>sales3<\/strong> data frame. The script will now return all rows except the one for New Zealand.<\/p>\n<p>We can achieve the same results by using the <strong>subset<\/strong> function, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 r1 &lt;- data.frame(Territories=\"New Zealand\", Sales=1048)\r\n  \u00a0 sales2 &lt;- rbind(sales2, r1)\r\n  \u00a0 sales3 &lt;- subset(sales2, Territories != \"New Zealand\")\u00a0\r\n  \u00a0 OutputDataSet &lt;- sales3';<\/pre>\n<p>The <strong>subset<\/strong> function allows us to return a subset or rows, columns, or both. When filtering out rows, we must pass in two arguments when calling the function. The first is the name of the data frame, and the second is the expression that defines the logic used for creating the filter. In this case, the script returns a subset of data that includes all rows except those that contain a <strong>Territories<\/strong> value of <strong>New<\/strong> <strong>Zealand<\/strong>. Because we\u2019ve already specified the data frame in the first argument, we do not need to include it in the second argument to qualify the column name.<\/p>\n<p>When subsetting row data, we can define a wide range of expressions. For example, the next script returns only those rows with a <strong>Sales<\/strong> value greater than <strong>3000<\/strong>:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales2 &lt;- subset(sales2, Sales &gt; 3000)\u00a0\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>The R script returns the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"248\" height=\"165\" class=\"wp-image-72503\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-4.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>We can also create a compound expression when subsetting row data:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales2 &lt;- subset(sales2, Sales &gt; 2000 &amp; Sales &lt; 4000)\u00a0\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>This time we\u2019re defining two conditions. The first specifies that <strong>Sales<\/strong> values must be greater than <strong>2000<\/strong>, and the second specifies that <strong>Sales<\/strong> values must be less than <strong>4000<\/strong>. The two conditions are connected by the and (<strong>&amp;<\/strong>) logical operator, which means both conditions must evaluate to true, giving us the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"250\" height=\"127\" class=\"wp-image-72504\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-5.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>The ability to use conditional logic provides us with a wide range of options for subsetting rows. However, there might be times when we want to subset the columns, so let\u2019s look at how to do that.<\/p>\n<h2>Creating Subsets of columns<\/h2>\n<p>The <strong>subset<\/strong> function also lets us filter out data based on specified columns. Again, we must pass in our data frame as the first argument, and then specify the columns we want to include in the results as the second argument:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales2 &lt;- subset(sales2, select=\"Territories\")\u00a0\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>To ensure that we\u2019re targeting columns, and not rows, we include the argument name (<strong>select<\/strong>) when specifying the columns we want to include. In this case, we\u2019re specifying only the <strong>Territories<\/strong> column, giving us the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"158\" height=\"241\" class=\"wp-image-72505\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-6.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>We can achieve the same results by specifying the data frame variable and then, in brackets, providing the column name:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 OutputDataSet &lt;- sales2[\"Territories\"]';<\/pre>\n<p>Rather than using the column name, we can use the column\u2019s assigned index number, which in this instance is <strong>1<\/strong>:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 OutputDataSet &lt;- sales2[1]';<\/pre>\n<p>It\u2019s not uncommon in the R language to have multiple options for achieving the same results. In this case, we\u2019re simply subsetting the data to filter out a column, but the same principle holds true if we want to filter out both rows and columns. Consider the following example, in which the <strong>subset<\/strong> function is used to return only those rows with a <strong>Sales<\/strong> value greater than <strong>10000<\/strong> and only the <strong>Territories<\/strong> column:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales2 &lt;- subset(sales2, Sales&gt;10000, select=\"Territories\")\u00a0\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>Again, when calling the <strong>subset<\/strong> function, we pass in the name of the data frame as the first argument. Next, we provide an expression to filter out row data, and finally we specify that only the <strong>Territories<\/strong> column be returned, giving us the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"160\" height=\"117\" class=\"wp-image-72506\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-7.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>We can also use indexing to return specific results, as we\u2019ve seen in previous examples. Only this time, we must specify two elements within the brackets. The first is the rows to return, and the second is the columns to returns, as in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales3 &lt;- data.frame(sales2[1:3, 1])\r\n  \u00a0 OutputDataSet &lt;- sales3';<\/pre>\n<p>Notice that we\u2019re using the <strong>data.frame<\/strong> function to format the filtered output as a data frame, passing in the <strong>sales2<\/strong> data frame as an argument to the function. We\u2019re also indexing the <strong>sales2<\/strong> data frame to narrow down the results, specifying that only rows 1 through 3 be returned and only the first column (<strong>Territories<\/strong>) be returned. The following figure shows what our results look like this time around.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"160\" height=\"117\" class=\"wp-image-72507\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-8.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>We can use a data frame\u2019s indexing capabilities in any way necessary. For example, the next R script specifies that the rows 2 and 4 be returned, along with the first two columns:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 sales3 &lt;- data.frame(sales2[c(2,4), 1:2])\r\n  \u00a0 OutputDataSet &lt;- sales3';<\/pre>\n<p>When specifying the rows, we must use the <strong>c<\/strong> function to concatenate them because we are providing individual values, rather than a range. The script now returns the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"255\" height=\"91\" class=\"wp-image-72508\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-9.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>Not surprisingly, there is a great deal more we can do when it comes to subsetting a data frame. The better you understand how to filter data, the more effectively you can put together R scripts.<\/p>\n<h2>Making head or tail of a data frame<\/h2>\n<p>The R language provides a couple very handy functions for returning only the first or last few rows of a data frame. One of these is the <strong>head<\/strong> function, shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 OutputDataSet &lt;- head(sales2, 3)';<\/pre>\n<p>The <strong>head<\/strong> function takes one or two arguments. The first argument is the target object, in this case, the <strong>sales2<\/strong> data frame, and the second argument, which is optional, is the number of rows to return. Because we\u2019ve specified <strong>3<\/strong>, the script returns the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"253\" height=\"111\" class=\"wp-image-72509\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-10.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>If we had not specified the number of rows, the script would have returned six rows (or fewer rows, if the dataset had been smaller).<\/p>\n<p>The R language also supports the <strong>tail<\/strong> function, which works just like the <strong>head<\/strong> function, except that it returns rows at the end of the data frame:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"Sales\")\r\n  \u00a0 OutputDataSet &lt;- tail(sales2, 2)';<\/pre>\n<p>This time, we\u2019ve specified that the last two rows be returned, giving us the following results:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"253\" height=\"94\" class=\"wp-image-72510\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-11.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>As with the <strong>head<\/strong> function, the <strong>tail<\/strong> function returns up to six rows if no row amount is specified.<\/p>\n<h2>Turning things around in a data frame<\/h2>\n<p>The R language also provides the ability to sort a data frame, based on the values in one or more columns. For this, we use the <strong>order<\/strong> function:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 c3 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2, c3)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"CountryCode\", \"Sales\")\r\n  \u00a0 OutputDataSet &lt;- sales2[order(sales2$CountryCode, sales2$Territories),]';<\/pre>\n<p>When sorting a data frame, we first specify the data frame and then, within brackets, call the <strong>order<\/strong> function, passing in the column names as arguments. In this example, we\u2019re sorting the data first by the <strong>CountryCode<\/strong> column and then by the <strong>Territories<\/strong> column, giving us the following results.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"355\" height=\"242\" class=\"wp-image-72511\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-12.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>The R language also lets us pivot a data frame, turning rows into columns and columns into rows. For this, we use the <strong>t<\/strong> function (short for transpose), as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 c3 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2, c3)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"CountryCode\", \"Sales\")\r\n  \u00a0 sales2 &lt;- data.frame(t(subset(sales2, Sales&lt;3000)))\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>Before pivoting the data frame, we use the <strong>subset<\/strong> function to filter out all rows except those with a <strong>Sales<\/strong> value greater than <strong>3000<\/strong>. (I included this only to make the data easier to view once pivoted.) We then use the <strong>t<\/strong> function to pivot the subset, passing the function results in as an argument to the <strong>data.frame<\/strong> function. We then reassign the transposed data frame to the <strong>sales2<\/strong> variable, giving us the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"455\" height=\"109\" class=\"wp-image-72512\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/09\/documents-datafiles-scr-13.png\" alt=\"..\/..\/..\/..\/..\/..\/..\/..\/..\/Documents\/DataFiles\/Screenshots\/stSqlSrvRS\" \/><\/p>\n<p>As you can see, we now have a column for each subsetted territory, rather than a row, with each column containing data relevant to its respective territory.<\/p>\n<h2>Updating data frame values<\/h2>\n<p>One other aspect of data frames worth looking at is how to update a specific value. To do so, we need only assign the new value to the specific column and row. For example, suppose we want to update the amount of France\u2019s sales. To do so, we must assign the new value to the <strong>Sales<\/strong> column, filtering by the <strong>France<\/strong> territory, as shown in the following R script:<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET @rscript = N'\r\n  \u00a0 sales &lt;- InputDataSet\r\n  \u00a0 c1 &lt;- levels(sales$Territories)\r\n  \u00a0 c2 &lt;- c(\"AU\", \"CA\", \"US\", \"FR\", \"DE\", \"US\", \"US\", \"US\", \"US\", \"GB\")\r\n  \u00a0 c3 &lt;- round(tapply(sales$Subtotal, sales$Territories, mean))\r\n  \u00a0 sales2 &lt;- data.frame(c1, c2, c3)\r\n  \u00a0 names(sales2) &lt;- c(\"Territories\", \"CountryCode\", \"Sales\")\r\n  \u00a0 sales2$Sales[sales2$Territories==\"France\"] &lt;- 2200\r\n  \u00a0 OutputDataSet &lt;- sales2';<\/pre>\n<p>First, we call the <strong>Sales<\/strong> column and then, in brackets, specify that the <strong>Territories<\/strong> column value should equal <strong>France<\/strong>. Once we have constructed the basic elements, we need only assign the <strong>2200<\/strong> value to the <strong>Sales<\/strong> column. When we run the script, the R engine will update the value in the data frame and return the new results.<\/p>\n<h2>The ubiquitous data frame<\/h2>\n<p>To say that there is much more we can do with data frames than what we covered here would be a gross understatement. As a key structure of the R language, the data frame supports a wide range of options for manipulating and analyzing data.<\/p>\n<p>There are, of course, many other important objects in R, and you will have to understand those as well to use R effectively in SQL Server. But data frames are key to working with the R language within R Services. You should know how to create them, update them, get information about them, and retrieve subsets of data from them. Only then will you be able to effectively use R services to manipulate and analyze data within the SQL Server environment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Although you can get started with R in SQL Server without understanding data frames, they are a key structure of the R language that are the equivalent of SQL Server table variables. They give you many ways of manipulating and analyzing data and passing it between R and SQL Server.  For a database professional, they provide a clear and familiar concept when getting to grips with integrating R into the database.&hellip;<\/p>\n","protected":false},"author":221841,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528],"tags":[5134],"coauthors":[6779],"class_list":["post-72498","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/72498","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221841"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=72498"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/72498\/revisions"}],"predecessor-version":[{"id":72556,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/72498\/revisions\/72556"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=72498"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=72498"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=72498"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=72498"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}