{"id":76391,"date":"2017-12-18T10:44:42","date_gmt":"2017-12-18T10:44:42","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=76391"},"modified":"2022-04-27T21:23:37","modified_gmt":"2022-04-27T21:23:37","slug":"sql-server-machine-learning-services-part-2-python-data-frames","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/sql-server-machine-learning-services-part-2-python-data-frames\/","title":{"rendered":"SQL Server Machine Learning Services \u2013 Part 2: Python Data Frames"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/machine-learning-services-part-1-python\/\">SQL Server Machine Learning Services \u2013 Part 1: Python Basics<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-2-python-data-frames\/\">SQL Server Machine Learning Services \u2013 Part 2: Python Data Frames<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-3-plotting-data-python\/\">SQL Server Machine Learning Services \u2013 Part 3: Plotting Data with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-4-finding-data-anomalies-python\/\">SQL Server Machine Learning Services \u2013 Part 4: Finding Data Anomalies with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-5-generating-multiple-plots-python\/\">SQL Server Machine Learning Services \u2013 Part 5: Generating multiple plots in Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-6-merging-data-frames-python\/\">SQL Server Machine Learning Services \u2013 Part 6: Merging Data Frames in Python<\/a><\/li>\n<\/ol>\n\n<p>\n  If you\u2019ve spent any time with the R language in SQL Server R Services or Machine Learning Services (MLS), you\u2019re no doubt aware of the important role that data frames can play in your scripts, whether working with data that comes from a SQL Server database or from another source. The same is true for Python. You use data frames when passing data sets into and out of a script as well as when manipulating and analyzing data within that script.\n<\/p>\n<p>\n  This article focuses on using data frames in Python. It is the second article in a series about MLS and Python. The first article introduced you briefly to data frames. This article continues that discussion, describing how to work with data frame objects and the data within those objects.\n<\/p>\n<p>\n  Data frames and the functions they support are available to MLS and Python through the pandas library. The library is available as a Python module that provides tools for analyzing and manipulating data, including the ability to generate data frame objects and work with data frame data. The pandas library is included by default in MLS, so the functions and data structures available to pandas are ready to use, without having to manually install pandas in the MLS library.\n<\/p>\n<p>\n  This article includes a number of examples that leverage the pandas library to demonstrate various data frame concepts, using data from the <strong>AdventureWorks2017<\/strong> sample database. As you\u2019ll recall from the first article, you can specify a T-SQL query when calling the <strong>sp_execute_external_script<\/strong> stored procedure and then incorporate the data returned by that query into your Python script. The examples that follow all use the same T-SQL query, allowing us to focus specifically on the Python scripts and data frame concepts.\n<\/p>\n<h2>Introducing the data frame<\/h2>\n<p>\n  The pandas library supports a variety of data structures. The two most commonly used for analytics are <strong>Series<\/strong> and <strong>DataFrame<\/strong>. The <strong>Series<\/strong> data structure is an indexed, one-dimensional array whose values can be any Python data type, including strings, integers, floating point numbers, and other Python objects. The <strong>DataFrame<\/strong> data structure is an indexed, two-dimensional array that includes one or more columns, similar to an Excel spreadsheet or SQL Server table. All columns in a data frame are <strong>Series<\/strong> objects of the same length.\n<\/p>\n<p>\n  To better understand the <strong>Series<\/strong> and <strong>DataFrame<\/strong> data structures, take a look at the following example, which calls the <strong>sp_execute_external_script<\/strong> stored procedure, passing in a Python script and T-SQL query:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nUse AdventureWorks2017;\r\nGO\r\n\r\n-- define Python script\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], as_index=False).sum()\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return aggregated data set\r\nprint(df2)\r\nOutputDataSet = df2';\r\n\r\n-- define T-SQL query\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\n-- run procedure, using Python script and T-SQL query\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS(\r\n    (Territories NVARCHAR(100), OrderMonths NCHAR(3), TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  If you reviewed the first article in this series, most of the elements should look familiar to you. The example first defines the Python script, saving it to the <strong>@pscript<\/strong> variable, and then defines the T-SQL query, saving it to the <strong>@sqlscript<\/strong> variable. Both variables are passed in as parameter values when calling the <strong>sp_execute_external_script<\/strong> stored procedure. The procedure call also specifies the Python language and the name and type of the columns in the output data set.\n<\/p>\n<p>\n  Refer to the first article in this series if you have questions about working with the <strong>sp_execute_external_script<\/strong> stored procedure.\n<\/p>\n<p>\n  With that in mind, let\u2019s return to the example above, starting with the first statement in the Python script, which imports the <strong>calendar<\/strong> module and assigns the <strong>cl<\/strong> alias to the module:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  import calendar as cl\r\n<\/pre>\n<p>\n  The <strong>calendar<\/strong> module provides tools for working with date values. When you import a module in this way, you can use its data structures and functions in your Python script by referencing the alias.\n<\/p>\n<p>\n  After importing the module, the Python script assigns the data in the <strong>InputDataSet<\/strong> variable to the <strong>df1<\/strong> variable:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  df1 = InputDataSet\r\n<\/pre>\n<p>\n  As you\u2019ll recall from the first article, the <strong>InputDataSet<\/strong> variable provides access to the SQL Server data returned by the T-SQL query specified when calling the <strong>sp_execute_external_script<\/strong> stored procedure. By default, the variable is defined as a <strong>DataFrame<\/strong> object, which means that the <strong>df1<\/strong> variable is also created as a <strong>DataFrame<\/strong> object, as is the <strong>df2<\/strong> variable.\n<\/p>\n<p>\n  After assigning the data to the <strong>df1<\/strong> variable, the script groups and aggregates the data in the <strong>df1<\/strong> data set and then assigns the data set to the <strong>df2<\/strong> variable:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  df2 = df1.groupby([\"Territories\", \"OrderMonths\"], as_index=False).sum()\r\n<\/pre>\n<p>\n  A <strong>DataFrame<\/strong> object such as <strong>df1<\/strong> or <strong>df2<\/strong> provides a number of functions for working with the data frame and its data. One of these functions is <strong>groupby<\/strong>, which you call by adding a period after the <strong>DataFrame<\/strong> object (in this case, <strong>df1<\/strong>) and then specifying the function name.\n<\/p>\n<p>\n  The first argument you pass to the function is the column or columns on which to base the grouping. When specifying multiple columns, you must enclose the column names in brackets and separate them with a comma. In this case, the data will be grouped first by the <strong>Territories<\/strong> column and then by the <strong>OrderMonths<\/strong> column.\n<\/p>\n<p>\n  The second argument you pass to the <strong>groupby<\/strong> function is <strong>as_index<\/strong>, which is set to <strong>False<\/strong>. This argument is related to how data frames are indexed and how Python returns data frame values. We\u2019ll get into these topics more in a bit. Just know you must set this option to <strong>False<\/strong> to ensure that the values from the <strong>Territories<\/strong> and <strong>OrderMonths<\/strong> columns are returned to the calling application along with the aggregated sales data. The setting also ensures that the data is returned in \u2018SQL-style,\u2019 according to the pandas document <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.DataFrame.groupby.html\">pandas.DataFrame.groupby<\/a>.\n<\/p>\n<p>\n  After specifying the <strong>groupby<\/strong> function, you must then tag on an aggregate function that specifies how the data should be summarized. In this case, the <strong>sum<\/strong> function is used to find the sales totals for each month for each territory.\n<\/p>\n<p>\n  One other note about the working with <strong>DataFrame<\/strong> functions: If calling a function outside of the context of a <strong>DataFrame<\/strong> object, you must first import the <strong>pandas<\/strong> module, using a statement such as the following:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  import pandas as pd\r\n<\/pre>\n<p>\n  You can then use the module\u2019s alias when calling the specific function. For example, if you want to create a series, you can use the <strong>Series<\/strong> function:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  data = [\"one\", 2, \"three\", 4.3456]\r\n ds = pd.Series(data)\r\n<\/pre>\n<p>\n  The function generates a <strong>Series<\/strong> object based on the <strong>data<\/strong> variable. The object is then assigned to the <strong>ds<\/strong> variable. Python will also automatically generate an integer-based index for the <strong>Series<\/strong> object, starting with 0.\n<\/p>\n<p>\n  As noted earlier, each column in the data frame is a <strong>Series<\/strong> object. You can test this out by including a statement similar to the following in your script:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  print(type(df2[\"Territories\"]))\r\n<\/pre>\n<p>\n  The <strong>type<\/strong> function returns the data type of the <strong>Territories<\/strong> column and the <strong>print<\/strong> function returns the data type to the console, confirming that the variable\u2019s type is <strong>pandas.core.series.Series<\/strong>.\n<\/p>\n<p>\n  After aggregating the data, the Python script converts the <strong>OrderMonths<\/strong> values to a more readable format. If you run the T-SQL query in the above example separately from the stored procedure, you\u2019ll see that the <strong>OrderMonths<\/strong> column returns numerical values to represent the months (the integers 1 through 12). To make the output more readable, the Python script creates a lambda function to convert these values into abbreviated month names:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  df2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n<\/pre>\n<p>\n  A lambda function is simply an anonymous, unnamed function. To use a lambda function to convert values in a column, you first specify the column by providing the data frame name followed by the column name, enclosed in brackets. Next, you tag on the pandas <strong>apply<\/strong> function, which tells Python to run the lambda function against every value in the specified column.\n<\/p>\n<p>\n  The lambda function is passed as an argument to the <strong>apply<\/strong> function. The function definition begins with the keyword <strong>lambda<\/strong>, followed by the <strong>x<\/strong> variable and a colon. The function definition then calls the <strong>month_abbr<\/strong> function from the <strong>calendar<\/strong> module (using the <strong>cl<\/strong> alias) and passes the <strong>x<\/strong> variable in as an argument, enclosed in brackets. In this way, the <strong>month_abbr<\/strong> function will be applied to each of the column\u2019s integers, changing them to month abbreviations.\n<\/p>\n<p>\n  The final part of the Python script includes two statements that return the final data frame. The first statement uses the <strong>print<\/strong> function to return the data frame to the console:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  print(df2)\r\n<\/pre>\n<p>\n  The following figure shows how the returned data looks in SQL Server Management Studio (SSMS).\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"405\" height=\"233\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig01b-png-2.png\" class=\"wp-image-76392\" alt=\"%20screenshots2\/stPython2_fig01b.png\" \/>\n<\/p>\n<p>\n  The first column, which is unnamed, is the data frame\u2019s index. As with <strong>Series<\/strong> objects, Python generates the index automatically, starting with 0; however, you can change the index, reset it, or designate one of the columns as the index. The important point to remember is that a data frame, by default, is automatically assigned an index, which can be a factor when manipulating and analyzing data.\n<\/p>\n<p>\n  The final statement in the Python script assigns the <strong>df2<\/strong> variable to the <strong>OutputDataSet<\/strong> variable:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  OutputDataSet = df2\r\n<\/pre>\n<p>\n  The <strong>OutputDataSet<\/strong> variable is the default variable used to return the script\u2019s data to a calling application. Like the <strong>InputDataSet<\/strong> variable, the <strong>OutputDataSet<\/strong> variable is defined as a <strong>DataFrame<\/strong> object. The implication of this is that, when working with Python in SQL Server, you must pass your data to the <strong>OutputDataSet<\/strong> variable as a <strong>DataFrame<\/strong> object in order to return that data to a calling application. The following figure shows part of the data set returned by the <strong>OutputDataSet<\/strong> variable.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"315\" height=\"247\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig01a-png-2.png\" class=\"wp-image-76393\" alt=\"%20screenshots2\/stPython2_fig01a.png\" \/>\n<\/p>\n<p>\n  The integers shown in the first column are the row number labels used by SSMS. These values are not the index. The index is not included in the results. To include the index in the <strong>OutputDataSet<\/strong> results, you should use the <strong>reset_index<\/strong> function to turn that index into a column. For example, you can include the following statement in your script:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  df2.reset_index(inplace=True)\r\n<\/pre>\n<p>\n  The <strong>reset_index<\/strong> function turns the existing index into a column and generates a new index. For this example, the function takes only the <strong>inplace<\/strong> argument, which is available to a number of functions. When set to <strong>True<\/strong>, the argument tells Python to update the data frame in place, without needing to reassign the data frame to itself or assign it to a new variable.\n<\/p>\n<p>\n  Also note that the results shown in the above figure include the <strong>Territories<\/strong> and <strong>OrderMonths<\/strong> columns. If you had set the <strong>as_index<\/strong> argument to <strong>True<\/strong> or had not included the argument when calling the <strong>groupby<\/strong> function, the results would include only the <strong>TotalSales<\/strong> column.\n<\/p>\n<p>\n  You can retrieve information about a data frame by calling its properties. For example, the following Python script retrieves the column names and dimensions of the <strong>df2<\/strong> data frame:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], as_index=False).sum()\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return information about data df2 data frame\r\nprint(list(df2.columns.values))\r\nprint(df2.shape)\r\nprint(len(df2.index))';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript;     \r\nGO\r\n<\/pre>\n<p>\n  The first <strong>print<\/strong> statement uses the <strong>list<\/strong> function and <strong>columns.values<\/strong> property to return a list of the columns in the <strong>df2<\/strong> data frame. The <strong>shape<\/strong> property returns the data frame\u2019s dimensions (number of rows and columns). The <strong>index<\/strong> property returns the data frame\u2019s index. If you use the <strong>index<\/strong> property in conjunction with the <strong>len<\/strong> function, you\u2019ll get the number of values in the index. (The <strong>len<\/strong> function returns the number of items in an object.) The Python scripts returns the following results:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n['Territories', 'OrderMonths', 'Sales']\r\n(120, 3)\r\n120\r\n<\/pre>\n<p>\n  The <strong>DataFrame<\/strong> class supports a wide range of properties (attributes) and functions (methods), making it possible to work with data in a variety of ways. For more information about the <strong>DataFrame<\/strong> class, see the topic <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.DataFrame.html\">pandas.DataFrame<\/a> in the pandas documentation.\n<\/p>\n<h2>Adding columns to a data frame<\/h2>\n<p>\n  Whether your data frames are based on SQL Server data or come from another source (or are a combination of both), there might be times when you\u2019ll want to add columns. One way to do this is to aggregate the same source column in two different ways when calling the <strong>groupby<\/strong> function, as shown in the following example:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return df2 data set\r\nOutputDataSet = df2';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), \r\n    OrderMonths NCHAR(3), AvgSales MONEY, TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  You should call the <strong>groupby<\/strong> function just like you did in the previous examples, but rather than tagging on the <strong>sum<\/strong> aggregate function at the end of the <strong>groupby<\/strong> function call, you specify the <strong>agg<\/strong> function, passing in a dictionary (<strong>dict<\/strong>) value as an argument. The dictionary value specifies the <strong>Sales<\/strong> column and two aggregate functions, <strong>mean<\/strong> and <strong>sum<\/strong>. The following figure shows part of the results that the Python script returns.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"375\" height=\"247\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig03-png-2.png\" class=\"wp-image-76394\" alt=\"%20screenshots2\/stPython2_fig03.png\" \/>\n<\/p>\n<p>\n  The data set now includes the <strong>AvgSales<\/strong> and <strong>TotalSales<\/strong> column, which are both based on the <strong>Sales<\/strong> column in the <strong>df1<\/strong> data frame.\n<\/p>\n<p>\n  You can also insert columns into a data frame more directly, without having to define another aggregation. For example, the following Python script generates <strong>TargetSales<\/strong> sales, a computed column that adds 18% to the <strong>TotalSales<\/strong> values:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return df2 data set\r\nOutputDataSet = df2';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3),\r\n    AvgSales MONEY, TotalSales MONEY, TargetSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  Before adding the column, I assigned names to the columns in the <strong>df2<\/strong> data frame. I took this step to make it easier to reference those columns later in the script, which is especially beneficial when working with aggregated columns. (The pandas library also provides methods for naming aggregated and non-aggregated columns individually, but we\u2019ll save that topic for a separate discussion.)\n<\/p>\n<p>\n  Instead of naming columns, you can reference columns by their internal index, but that can sometimes add more complexity than necessary. Personally, I prefer the readability that comes with using actual names.\n<\/p>\n<p>\n  To add column labels, first call the <strong>columns<\/strong> property on the <strong>df2<\/strong> data frame and then assign the appropriate values to the property, providing each column name in the order the columns appears. You should enclose the column names in brackets and separate them with commas.\n<\/p>\n<p>\n  After renaming the columns, you can add the new column. Start by specifying the <strong>df2<\/strong> data frame, followed by the new column name in brackets. Then set its value to an expression that calculates the values for the new column. In the example above, the expression simply multiples the <strong>TotalSales<\/strong> values by 1.18. The following figures shows part of the results now returned by the Python script.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"453\" height=\"249\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig04-png-2.png\" class=\"wp-image-76395\" alt=\"%20screenshots2\/stPython2_fig04.png\" \/>\n<\/p>\n<p>\n  Whenever you change your data frame in a way that will impact the columns retuned by the Python script, be sure to update the <strong>WITH<\/strong> <strong>RESULT<\/strong> <strong>SETS<\/strong> clause as necessary when calling the <strong>sp_execute_external_script<\/strong> stored procedure.\n<\/p>\n<h2>Filtering data in a data frame<\/h2>\n<p>\n  The pandas library provides a variety of methods for filtering your data frame. For example, the following Python script uses the <strong>head<\/strong> function to return the first five rows from the data set:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return top five rows\r\nOutputDataSet = df2.head(5)';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3),\r\n    AvgSales MONEY, TotalSales MONEY, TargetSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  To call the <strong>head<\/strong> function, you need only tag it onto the <strong>DataFrame<\/strong> object, optionally specifying the number of rows to return. If you don\u2019t specify a number, Python will return five rows. The following figures shows the results returned by the Python script.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"450\" height=\"161\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig05-png-2.png\" class=\"wp-image-76396\" alt=\"%20screenshots2\/stPython2_fig05.png\" \/>\n<\/p>\n<p>\n  The pandas library also includes the <strong>tail<\/strong> function, which works the same way as the <strong>head<\/strong> function, except that it filters out all but the last rows in the data frame, instead of the first.\n<\/p>\n<p>\n  You can also specify which columns to retain in a data set:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific columns\r\nOutputDataSet = df2[[\"Territories\", \"OrderMonths\", \"TotalSales\"]]';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3), TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  When including multiple columns, enclose them in a second set of brackets and separate the column names with commas. Python will then filter out all but the specified columns. The following figure shows part of the script\u2019s results.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"289\" height=\"249\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig06-png-2.png\" class=\"wp-image-76397\" alt=\"%20screenshots2\/stPython2_fig06.png\" \/>\n<\/p>\n<p>\n  You can also add the <strong>head<\/strong> or <strong>tail<\/strong> function after specifying other filter criteria:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific columns, last five rows\r\nOutputDataSet = df2[[\"Territories\", \"OrderMonths\", \"TotalSales\"]].tail(5)';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3), TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  Now the returned data set includes only three columns and last five rows, as shown in the following figure.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"325\" height=\"157\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig07-png-2.png\" class=\"wp-image-76398\" alt=\"%20screenshots2\/stPython2_fig07.png\" \/>\n<\/p>\n<p>\n  In addition to columns, you can filter rows out of a data frame, based on conditional expressions you specify when calling the columns. For example, the following statement filters out all rows except those with a <strong>Territories<\/strong> value of <strong>France<\/strong> and <strong>TotalSales<\/strong> value greater than 500000:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific rows\r\nOutputDataSet = df2[(df2[\"Territories\"] == \"France\") &amp; (df2[\"TotalSales\"] &gt; 500000)]';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3),\r\n    AvgSales MONEY, TotalSales MONEY, TargetSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  The first expression compares the <strong>Territories<\/strong> values to <strong>France<\/strong>, using the equals (<strong>==<\/strong>) comparison operator. The second expression compares the <strong>TotalSales<\/strong> values to 500000, using the greater than (<strong>&gt;<\/strong>) comparison operator. Each expression is enclosed in parentheses and separated with the and (<strong>&amp;<\/strong>) conditional operator, which indicates that both expressions must evaluate to true for a row to be included in the results, as shown in the following figure.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"458\" height=\"130\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig08-png-2.png\" class=\"wp-image-76399\" alt=\"%20screenshots2\/stPython2_fig08.png\" \/>\n<\/p>\n<p>\n  As an alternative to what we\u2019ve seen so far, the pandas library provides a set of functions that also make it possible to filter data in a data frame. The functions include <strong>at<\/strong>, <strong>iat<\/strong>, <strong>loc<\/strong>, and <strong>iloc<\/strong>. The pandas documentation recommends using these functions when possible because they are optimized for data access. For information about these functions, see the pandas document <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/indexing.html#indexing\">Indexing and Selecting Data<\/a>.\n<\/p>\n<p>\n  One of the most valuable of these functions is <strong>loc<\/strong>, which lets you filter out columns, rows, or both. The following syntax provides a simple overview of how to use the function:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\n  &lt;DataFrame&gt;.loc[&lt;rows&gt;, &lt;columns&gt;]\r\n<\/pre>\n<p>\n  You specify row information as the first argument and column information as the second argument, separating the two by a comma. If you specify column information without row information, you need only include a colon for the first argument:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific columns\r\nOutputDataSet = df2.loc[:,[\"Territories\", \"OrderMonths\", \"TotalSales\"]]';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3), TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  The statement filters out all but the <strong>Territories<\/strong>, <strong>OrderMonths<\/strong>, and <strong>TotalSales<\/strong> columns. The following figure shows part of the results returned by the Python script.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"290\" height=\"250\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig09-png-2.png\" class=\"wp-image-76400\" alt=\"%20screenshots2\/stPython2_fig09.png\" \/>\n<\/p>\n<p>\n  If you want to filter out only rows, you can use the colon placeholder for the second argument or leave it off altogether. The following example includes the placeholder:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific rows\r\nOutputDataSet = df2.loc[(df2[\"Territories\"] == \"France\") &amp;\r\n  (df2[\"TotalSales\"] &gt; 500000),:]';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3),\r\n    AvgSales MONEY, TotalSales MONEY, TargetSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  Now the script returns all columns, but only few rows, as shown in the following figure.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"456\" height=\"134\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig10-png-2.png\" class=\"wp-image-76401\" alt=\"%20screenshots2\/stPython2_fig10.png\" \/>\n<\/p>\n<p>\n  To filter out both rows and columns, you can include both sets of criteria in one statement when using <strong>loc<\/strong>:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# return specific columns and rows\r\nOutputDataSet = df2.loc[(df2[\"Territories\"] == \"France\") &amp; (df2[\"TotalSales\"] &gt; 500000),\r\n  [\"Territories\", \"OrderMonths\", \"TotalSales\"]]';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3), TotalSales MONEY));     \r\nGO\r\n<\/pre>\n<p>\n  As the following figure shows, the results are now leaner still.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"341\" height=\"154\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig11-png-2.png\" class=\"wp-image-76402\" alt=\"%20screenshots2\/stPython2_fig11.png\" \/>\n<\/p>\n<p>\n  Be sure to review the pandas documentation for more information about the different ways you can filter data in a data set. Consider what we\u2019ve covered here as only the beginning.\n<\/p>\n<h2>Adding rows to a data frame<\/h2>\n<p>\n  Earlier in this article, you saw how to add columns to your data frame. With pandas, you can also add rows. In the following example, the Python script generates a new data frame based on filtered data and then adds a row to the data frame:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# calculate sales averages and totals for each territory and month\r\ndf1 = InputDataSet\r\ndf2 = df1.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\n\r\n# assign column names and add computed column\r\ndf2.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf2[\"TargetSales\"] = df2[\"TotalSales\"] * 1.18\r\n\r\n# change month numbers to abbreviated month names\r\ndf2[\"OrderMonths\"] = df2[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\n\r\n# define the df3 data frame and add row\r\ndf3 = df2.loc[(df2[\"OrderMonths\"] == \"Jun\") &amp; (df2[\"TotalSales\"] &gt; 500000),\r\n  [\"Territories\", \"AvgSales\", \"TotalSales\"]]\r\nrow = [\"Australia\", 8257.1674, 830092.7536]\r\ndf3.loc[len(df3)] = row\r\ndf3 = df3.sort_values(by=\"Territories\")\r\n\r\n# return the df3 data set\r\nOutputDataSet = df3';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), AvgSales MONEY, TotalSales MONEY));\r\nGO\r\n<\/pre>\n<p>\n  To create the new data frame, the script uses the <strong>loc<\/strong> function to filter out all rows except those with an <strong>OrderMonths<\/strong> value that equals <strong>Jun<\/strong> and a <strong>TotalSales<\/strong> value greater than <strong>500000<\/strong>. The function also filters out all but the <strong>Territories<\/strong>, <strong>AvgSales<\/strong>, and <strong>TotalSales<\/strong> columns. The filtered results are then assigned to the <strong>df3<\/strong> variable to create the new data frame.\n<\/p>\n<p>\n  After defining the <strong>df3<\/strong> data frame, the script specifies the values for the new row and saves them to the <strong>row<\/strong> variable. The script then uses the <strong>loc<\/strong> and <strong>len<\/strong> functions to set the value of the new row. The <strong>len<\/strong> function returns the number of rows in the data frame, and the <strong>loc<\/strong> function interprets that number as an index value for the new row.\n<\/p>\n<p>\n  This works because the index starts with 0, so the highest index number before the row is added is 3. The new row is added with an index value of 4 because the index size is four. As a result, the new row is inserted after all the other rows, which means that the territory names are no longer in alphabetical order. To fix this, you can call the <strong>sort_values<\/strong> function on the <strong>df3<\/strong> data frame, specifying the name of the <strong>Territories<\/strong> column. The Python script now returns that data set shown in the following figure.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"305\" height=\"177\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig12-png-2.png\" class=\"wp-image-76403\" alt=\"%20screenshots2\/stPython2_fig12.png\" \/>\n<\/p>\n<p>\n  As you can see, the script returns a small data set that includes only five rows and three columns, just what we were after, with the values nicely sorted by territory name. The new row for Australia is included.\n<\/p>\n<h2>Removing a column from a data frame<\/h2>\n<p>\n  You might also run into situations when you want to remove a column from a data frame. One way to do this is to use the <strong>loc<\/strong> function to filter out the column and then reassign the data frame to itself or a different variable. Another approach is to use the <strong>drop<\/strong> function available to the <strong>DataFrame<\/strong> class, as shown in the following example:\n<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\r\nDECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\nimport calendar as cl\r\n\r\n# define the df1 data frame\r\ndf1 = InputDataSet.groupby([\"Territories\", \"OrderMonths\"], \r\n  as_index=False).agg({\"Sales\": [\"mean\", \"sum\"]})\r\ndf1.columns = [\"Territories\", \"OrderMonths\", \"AvgSales\", \"TotalSales\"]\r\ndf1[\"TargetSales\"] = df1[\"TotalSales\"] * 1.18\r\ndf1[\"OrderMonths\"] = df1[\"OrderMonths\"].apply(lambda x: cl.month_abbr[x])\r\ndf1 = df1.loc[df1[\"TotalSales\"] &gt; 800000]\r\ndf1 = df1.sort_values(by=\"TotalSales\", ascending=False)\r\n\r\n# remove AvgSales column\r\ndf1.drop(\"AvgSales\", axis=1, inplace=True)\r\n\r\n# return df1 data set\r\nOutputDataSet = df1';\r\n\r\nDECLARE @sqlscript NVARCHAR(MAX);\r\nSET @sqlscript = N'\r\n  SELECT t.Name AS Territories, MONTH(h.OrderDate) AS OrderMonths,\r\n    CAST(h.Subtotal AS FLOAT) AS Sales\r\n  FROM Sales.SalesOrderHeader h INNER JOIN Sales.SalesTerritory t\r\n    ON h.TerritoryID = t.TerritoryID\r\n  WHERE YEAR(h.OrderDate) = 2013;';\r\n\r\nEXEC sp_execute_external_script\r\n  @language = N'Python',\r\n  @script = @pscript,\r\n  @input_data_1 = @sqlscript\r\n  WITH RESULT SETS((Territories NVARCHAR(100), OrderMonths NCHAR(3),\r\n    TotalSales MONEY, TargetSales MONEY));   \r\nGO\r\n<\/pre>\n<p>\n  This time round, I consolidated some of the script, although at this point, you should be well familiar with most of these elements. The only new construct is the statement that uses the <strong>drop<\/strong> function to remove the <strong>AvgSales<\/strong> column.\n<\/p>\n<p>\n  When calling the function, first specify the name of the column that should be dropped. Next, add the <strong>axis<\/strong> argument, setting its value to 1. This indicates that you\u2019re dropping a column, not the index (which would be 0). The <strong>inplace<\/strong> argument comes next. As you saw earlier, it tells Python to update the data frame in place. The Python script returns the data set shown in the following figure.\n<\/p>\n<p>\n  <img loading=\"lazy\" decoding=\"async\" width=\"391\" height=\"247\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/12\/20screenshots2-stpython2_fig13-png-2.png\" class=\"wp-image-76404\" alt=\"%20screenshots2\/stPython2_fig13.png\" \/>\n<\/p>\n<p>\n  As the example demonstrates, the <strong>drop<\/strong> function provides yet another way to remove a column from a data frame. When working with pandas and Python, you\u2019ll find that there are often a number of different ways to carry out a task. You might have to do some digging to figure out the best approach, but sometimes it\u2019s just a matter of preference. The better you understand pandas and Python, the better you\u2019ll understand the options you have available to work with the data.\n<\/p>\n<h2>Making the most of the data frame<\/h2>\n<p>\n  There is, of course, a great deal more to data frames and the pandas library than what we\u2019ve covered here, but this should provide you with a foundation for getting started so you can strike out on your own. The best way to acquaint yourself with Python, pandas, and data frames is to play around with data sets in your Python scripts and experiment with the various ways you can manipulate and analyze data.\n<\/p>\n<p>\n  As we progress through this series, we\u2019ll dig more into the Python language and pandas library and some of ways you can work with data. Your understanding of data frames will be instrumental in your ability to move forward with MLS and Python and to perform advanced data analytics.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>SQL Server Machine Learning Services provides the ability to run Python scripts directly against data in SQL Server. To do so, you must understand how to work with the data frame object. Robert Sheldon explains how to get started using the data frame object, how to pass data from SQL Server to it, and how to manipulate it with Python and pandas commands. &hellip;<\/p>\n","protected":false},"author":221841,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,146042],"tags":[5134],"coauthors":[6779],"class_list":["post-76391","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-python","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76391","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221841"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=76391"}],"version-history":[{"count":3,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76391\/revisions"}],"predecessor-version":[{"id":76409,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76391\/revisions\/76409"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=76391"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=76391"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=76391"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=76391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}