{"id":77388,"date":"2018-02-28T09:11:34","date_gmt":"2018-02-28T09:11:34","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=77388"},"modified":"2026-04-15T19:19:11","modified_gmt":"2026-04-15T19:19:11","slug":"sql-server-machine-learning-services-part-6-merging-data-frames-python","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/sql-server-machine-learning-services-part-6-merging-data-frames-python\/","title":{"rendered":"Merge SQL Server Data with CSV Files in Python Using pandas (MLS Part 6)"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"h-executive-summary\">Executive Summary<\/h2>\n\n\n\n<p><strong>This article demonstrates merging data from two sources inside SQL Server Machine Learning Services using Python. The first DataFrame is created from SQL Server data using sp_execute_external_script. The second DataFrame is built from CSV files read with pandas read_csv(). The two DataFrames are merged using pandas merge() &#8211; equivalent to a SQL JOIN &#8211; on a common key column. The merged result is then used to create a bar chart with matplotlib, showing annual sales per sales representative. 
Requires SQL Server 2017+ with Machine Learning Services and Python enabled.<\/strong><\/p>\n\n\n<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/machine-learning-services-part-1-python\/\">SQL Server Machine Learning Services \u2013 Part 1: Python Basics<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-2-python-data-frames\/\">SQL Server Machine Learning Services \u2013 Part 2: Python Data Frames<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-3-plotting-data-python\/\">SQL Server Machine Learning Services \u2013 Part 3: Plotting Data with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-4-finding-data-anomalies-python\/\">SQL Server Machine Learning Services \u2013 Part 4: Finding Data Anomalies with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-5-generating-multiple-plots-python\/\">SQL Server Machine Learning Services \u2013 Part 5: Generating multiple plots in Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-6-merging-data-frames-python\/\">SQL Server Machine Learning Services \u2013 Part 6: Merging Data Frames in Python<\/a><\/li>\n<\/ol>\n\n\n\n\n<p>SQL Server Machine Learning Services (MLS), along with the Python language, offer a wide range of options for analyzing and visualizing data. I covered a number of these options in the previous articles in this series and provided examples of how to use them in your Python scripts. This article builds on those discussions by demonstrating how to merge two data frames, one created from SQL Server data and the other from multiple .csv files. 
The final output is a third data frame that provides the foundation for plotting the data in order to generate a visualization.<\/p>\n\n\n\n<p>The examples in this article are based on data from the <strong>AdventureWorks2017<\/strong> sample database and a small set of .csv files (download at the bottom of the article) that contain data mapping back to the SQL Server data. The goal is to create a single data frame that merges the two original data sets. The final data frame will contain six years of annual sales data for a group of sales reps. The data will then be used to generate a bar chart that visualizes each individual\u2019s sales. I created the examples in SQL Server Management Studio (SSMS), connecting to a local instance of SQL Server 2017.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-creating-a-data-frame-based-on-sql-server-data\">Creating a data frame based on SQL Server data<\/h2>\n\n\n\n<p>As you\u2019ve seen in the previous articles, the key to running a Python (or R) script within the context of a SQL Server database is to call the <strong>sp_execute_external_script<\/strong> stored procedure, passing in the Python script as a parameter value. If you have any questions about how this works, refer back to the first article in this series. In this article, we focus primarily on the Python script.<\/p>\n\n\n\n<p>If you plan to try out these examples for yourself, you\u2019ll ultimately create a merged data frame and use it to generate a visualization. 
To get to that point, you need to take four steps:<\/p>\n\n\n<div class=\"block-core-list\">\n<ol class=\"wp-block-list\">\n<li>Create the first data frame based on SQL Server data.<\/li>\n\n\n\n<li>Create the second data frame based on data from a set of .csv files.<\/li>\n\n\n\n<li>Create the final data frame by merging the first two data frames.<\/li>\n\n\n\n<li>Create a bar chart based on the data in the final data frame.<\/li>\n<\/ol>\n<\/div>\n\n\n<p>The following example carries out the first step:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Use AdventureWorks2017;\nGO\n-- define Python script\nDECLARE @pscript NVARCHAR(MAX);\nSET @pscript = N'\n# assign SQL Server dataset to the df1 variable\ndf1 = InputDataSet\ndf1[\"2012\"] = df1[\"2012\"].astype(int)\nprint(df1)';\n-- define T-SQL query\nDECLARE @sqlscript NVARCHAR(MAX);\nSET @sqlscript = N'\nSELECT TOP(6) BusinessEntityID AS RepID, \n  LastName + '', '' + FirstName AS FullName,\n  CAST(SalesLastYear AS FLOAT) AS [2012]\nFROM Sales.vSalesPerson\nWHERE JobTitle = ''Sales Representative''\nORDER BY SalesLastYear DESC;';\n \n-- run procedure, using Python script and T-SQL query\nEXEC sp_execute_external_script\n  @language = N'Python',\n  @script = @pscript,\n  @input_data_1 = @sqlscript;\nGO<\/pre>\n\n\n\n<p>The example uses the <strong>sp_execute_external_script<\/strong> stored procedure to retrieve the SQL Server data and create the <strong>df1<\/strong> data frame. As you\u2019ve seen in the previous articles, you start by declaring the <strong>@pscript<\/strong> variable and assigning the Python script as its value. Next, you declare the <strong>@sqlscript<\/strong> variable and assign the T-SQL statement as its value. The T-SQL statement returns sales information from the <strong>vSalesPerson<\/strong> view. However, the statement returns only the top six rows, based on values in the <strong>SalesLastYear<\/strong> column, sorted in descending order. 
I\u2019ve renamed the column to <strong>2012<\/strong> to indicate the year of sales. (I picked this year just for demonstration purposes. It is not meant to reflect the dates associated with the underlying data.)<\/p>\n\n\n\n<p>With the <strong>@pscript<\/strong> and <strong>@sqlscript<\/strong> variables defined, you can then pass them in as parameter values when calling the <strong>sp_execute_external_script<\/strong> stored procedure. Again, refer to the first article in this series if you have questions about how any of this works.<\/p>\n\n\n\n<p>Returning now to the Python script, you can see that it includes only a few statements. The first one assigns the data from the <strong>InputDataSet<\/strong> variable to the <strong>df1<\/strong> variable:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df1 = InputDataSet<\/pre>\n\n\n\n<p>The <strong>InputDataSet<\/strong> variable contains the data returned by the T-SQL statement, which is passed into the <strong>df1<\/strong> variable. The <strong>df1<\/strong> variable is created as a <strong>DataFrame<\/strong> object. You can make changes to the data frame by referencing the variable. For example, the next Python statement converts the <strong>2012<\/strong> column to the <strong>int<\/strong> data type:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df1[\"2012\"] = df1[\"2012\"].astype(int)<\/pre>\n\n\n\n<p>This is the first time we\u2019ve used the <strong>astype<\/strong> function in this series. You call the function by tagging it onto the column reference (using a period to separate the two) and then specifying the data type. Be aware, however, that when using the <strong>astype<\/strong> function to convert a column from <strong>float<\/strong> to <strong>int<\/strong>, Python merely truncates the values, without rounding them up or down. In this case, it\u2019s not a big deal, but if you need more precision, you should use the <strong>round<\/strong> function. 
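<\/p>\n\n\n\n<p>To see the difference outside of SQL Server, here is a minimal standalone sketch (the sales figures are made up for illustration):<\/p>

```python
import pandas as pd

# made-up sales figures with fractional amounts (illustrative only)
df = pd.DataFrame({"2012": [1234.99, 2500.25, 3999.60]})

# astype(int) simply drops the fractional part
truncated = df["2012"].astype(int)

# round() rounds to the nearest whole number before converting
rounded = df["2012"].round().astype(int)

print(truncated.tolist())  # [1234, 2500, 3999]
print(rounded.tolist())    # [1235, 2500, 4000]
```

<p>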
You can also round the values within the T-SQL statement.<\/p>\n\n\n\n<p>That\u2019s all you need to do to set up the first data frame. You can verify that the data frame has been created the way you want it by calling the <strong>print<\/strong> function and passing in <strong>df1<\/strong> as the parameter value:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">print(df1)<\/pre>\n\n\n\n<p>The following figure shows the data returned by the <strong>print<\/strong> statement, as it appears in the SSMS console on my system. Your results should look similar to this.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"388\" height=\"169\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-29.png\" alt=\"\" class=\"wp-image-77389\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>The results include the six rows returned by the T-SQL statement, with the <strong>2012<\/strong> values truncated to integers. Notice that Python automatically generates a 0-based index for the data frame, which is separate from the <strong>RepID<\/strong> column.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-creating-a-data-frame-based-on-data-from-csv-files\">Creating a data frame based on data from .csv files<\/h2>\n\n\n\n<p>With the first data frame in place, you\u2019re ready to create the second one. For this, you\u2019ll need to create six .csv files similar to those shown in the following figure. You can download the files from the link at the bottom of the article.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1586\" height=\"930\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-30.png\" alt=\"\" class=\"wp-image-77390\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Each file includes the sales rep ID in the file name. This is the same ID returned by the <strong>RepID<\/strong> column of the <strong>df1<\/strong> data set. 
Each file also contains the rep\u2019s annual sales for five years (2013 through 2017), with some files containing additional information:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Rep277.csv:<\/strong> Includes the five years, plus sales for 2012 and 2018.<\/li>\n\n\n\n<li><strong>Rep280.csv:<\/strong> Includes only the five years.<\/li>\n\n\n\n<li><strong>Rep281.csv:<\/strong> Includes the five years, plus the <strong>RepID<\/strong> column.<\/li>\n\n\n\n<li><strong>Rep282.csv:<\/strong> Includes only the five years, but out of order.<\/li>\n\n\n\n<li><strong>Rep286.csv:<\/strong> Includes only the five years.<\/li>\n\n\n\n<li><strong>Rep290.csv:<\/strong> Includes the five years, plus the <strong>RepID<\/strong> column.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The .csv files are varied in order to try out different logic within the code. The goal here is to use only the .csv files that contain the five years of sales, regardless of the order of the columns or whether they include the <strong>RepID<\/strong> column. If the files include extra sales columns or are in some other way inconsistent, they should be rejected, which in this case would be the <strong>Rep277.csv<\/strong> file.<\/p>\n\n\n\n<p>If you choose to create these files, try to follow the same logic used here, although you can use any numerical values for the sales totals. 
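<\/p>\n\n\n\n<p>If you would rather script the sample files than build them by hand, the following sketch writes six uniform five-year files, which you can then vary manually as described above. The target path and the random totals are assumptions, not values from the article download:<\/p>

```python
import csv
import os
import random

def write_sample_files(path, rep_ids=(277, 280, 281, 282, 286, 290)):
    """Write one five-column sales .csv per rep ID (made-up totals)."""
    years = ["2013", "2014", "2015", "2016", "2017"]
    os.makedirs(path, exist_ok=True)
    for rep in rep_ids:
        # file name carries the rep ID, e.g. rep277.csv
        with open(os.path.join(path, "rep%s.csv" % rep), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(years)
            writer.writerow([random.randint(1000000, 4000000) for _ in years])

# e.g. write_sample_files(r"c:\datafiles\sales")
```

<p>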
Once you have the .csv files in place, you\u2019re ready to build the second data frame, as shown in the following example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">-- define Python script\nDECLARE @pscript NVARCHAR(MAX);\nSET @pscript = N'\n# import python modules\nimport pandas as pd\nimport glob\n# assign SQL Server dataset to the df1 variable\ndf1 = InputDataSet\ndf1[\"2012\"] = df1[\"2012\"].astype(int)\n# retrieve a list of the .csv files\npath = r\"c:\\\\datafiles\\\\sales\\\\\"\nfiles = glob.glob(path + \"\/*.csv\")\n# define variables used to create a list for a data frame\nlst = []\nvals1 = [\"RepID\", \"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\nvals2 = [\"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\n# open an output file for unusable .csv files\noutfile = open(path + \"outfile.txt\", \"w\")\n# define the generate_list function\ndef generate_list():\n\tif sorted(cols) == sorted(vals1):\n\t\tlst.append(df)\n\telif sorted(cols) == sorted(vals2):\n\t\tstart = file.find(\"rep\")\n\t\tend = file.find(\".csv\")\n\t\trep = file[start + 3:end]\n\t\tdf.insert(0, \"RepID\", int(rep))\n\t\tlst.append(df)\n\telse:\n\t\tstart = file.find(\"rep\")\n\t\tfilename = file[start:]\n\t\toutfile.write(filename + \"\\n\")\n# loop through each file in files to add data to list\nfor file in files:\n\tdf = pd.read_csv(file)\n\tcols = list(df.columns.values)\n\tgenerate_list()\n# close the output file\noutfile.close()\n# generate a data frame based on the data list\ndf2 = pd.concat(lst, ignore_index=True)\nprint(df2)';\n-- define T-SQL query\nDECLARE @sqlscript NVARCHAR(MAX);\nSET @sqlscript = N'\nSELECT TOP(6) BusinessEntityID AS RepID, \n  LastName + '', '' + FirstName AS FullName,\n  CAST(SalesLastYear AS FLOAT) AS [2012]\nFROM Sales.vSalesPerson\nWHERE JobTitle = ''Sales Representative''\nORDER BY SalesLastYear DESC;';\n \n-- run procedure, using Python script and T-SQL query\nEXEC sp_execute_external_script\n  @language = N'Python',\n  @script = @pscript,\n 
 @input_data_1 = @sqlscript;    \nGO<\/pre>\n\n\n\n<p>The Python script in this example builds on the previous example by adding the code necessary to create the second data frame, starting with the following two <strong>import<\/strong> statements:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import pandas as pd\nimport glob<\/pre>\n\n\n\n<p>The <strong>pandas<\/strong> module includes the tools necessary to create and work with data frames. The <strong>glob<\/strong> module, which is new to this series, provides a simple way to retrieve a list of files, as shown in the following two lines of code, which have also been added to the script:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">path = r\"c:\\\\datafiles\\\\sales\\\\\"\nfiles = glob.glob(path + \"\/*.csv\")<\/pre>\n\n\n\n<p>The first statement specifies the full path where the files are located and assigns it to the <strong>path<\/strong> variable. On my system, the files are saved to the <strong>c:\\datafiles\\sales<\/strong> folder, but you can use any folder. Just be sure to update the code as necessary. The second statement uses the <strong>glob<\/strong> function in the <strong>glob<\/strong> module to retrieve a list of the files in the <strong>sales<\/strong> folder and assign that list to the <strong>files<\/strong> variable.<\/p>\n\n\n\n<p>The next step is to create a <strong>list<\/strong> object and save it to the <strong>lst<\/strong> variable:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">lst = []<\/pre>\n\n\n\n<p>The variable will be used later in the script, within a <strong>for<\/strong> loop and then after the <strong>for<\/strong> loop. 
By declaring the variable now, you don\u2019t have to be concerned about scope issues (although Python is very forgiving in this regard).<\/p>\n\n\n\n<p>The next step is to assign a list of column names to the <strong>vals1<\/strong> and <strong>vals2<\/strong> variables:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">vals1 = [\"RepID\", \"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\nvals2 = [\"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]<\/pre>\n\n\n\n<p>Later in the script, you\u2019ll be comparing the variable values to the column names from the .csv files. In this way, only .csv files with matching column structures will be included in the data frame. All other files will be rejected and listed in the <strong>outfile.txt<\/strong> file. To write to the file, call the <strong>open<\/strong> function, providing the path and filename:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">outfile = open(path + \"outfile.txt\", \"w\")<\/pre>\n\n\n\n<p>In this case, I\u2019ve used the <strong>path<\/strong> variable for specifying the file\u2019s location, but you can target any location. Just be sure to include the <strong>w<\/strong> argument to tell Python that you\u2019ll be writing to the file. 
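<\/p>\n\n\n\n<p>As an aside, a <strong>with<\/strong> block is a safer way to manage the output file, because Python then closes the file automatically even if an error occurs partway through. A minimal sketch, using a temporary folder as a stand-in for the sales folder:<\/p>

```python
import os
import tempfile

# stand-in location for the article's c:\datafiles\sales folder (assumption)
path = tempfile.gettempdir()
outname = os.path.join(path, "outfile.txt")

# "w" creates outfile.txt, or overwrites it if it already exists;
# the with block closes the file automatically when the block ends
with open(outname, "w") as outfile:
    outfile.write("rep277.csv\n")

print(outfile.closed)  # True
```

<p>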
The function will then create the file if it does not exist or overwrite the file if it does exist.<\/p>\n\n\n\n<p>The output from the <strong>open<\/strong> function is assigned to the <strong>outfile<\/strong> variable, which you\u2019ll use later in the script to access the file.<\/p>\n\n\n\n<p>The next step is to create a function named <strong>generate_list<\/strong>, which defines the logic necessary to process the .csv files:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">def generate_list():\n\tif sorted(cols) == sorted(vals1):\n\t\tlst.append(df)\n\telif sorted(cols) == sorted(vals2):\n\t\tstart = file.find(\"rep\")\n\t\tend = file.find(\".csv\")\n\t\trep = file[start + 3:end]\n\t\tdf.insert(0, \"RepID\", int(rep))\n\t\tlst.append(df)\n\telse:\n\t\tstart = file.find(\"rep\")\n\t\tfilename = file[start:]\n\t\toutfile.write(filename + \"\\n\")<\/pre>\n\n\n\n<p>The function compares each file\u2019s column names to the <strong>vals1<\/strong> or <strong>vals2<\/strong> variable and then takes the appropriate steps, depending on the results of the comparison. I\u2019ve included the function primarily to demonstrate how you can use user-defined Python functions even if you\u2019re working within the context of a SQL Server database. I\u2019ll explain the function\u2019s steps in more detail shortly. First, let\u2019s look at the following <strong>for<\/strong> loop, which calls the function as its last step:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">for file in files:\n\tdf = pd.read_csv(file)\n\tcols = list(df.columns.values)\n\tgenerate_list()<\/pre>\n\n\n\n<p>The <strong>for<\/strong> loop iterates through the list of files in the <strong>files<\/strong> variable. For each iteration, the individual file is assigned to the <strong>file<\/strong> variable. 
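<\/p>\n\n\n\n<p>You can try the slicing logic from <strong>generate_list<\/strong> on its own. The path below is a hypothetical example of what <strong>glob<\/strong> returns:<\/p>

```python
# a hypothetical path like those returned by glob.glob
file = r"c:\datafiles\sales\rep281.csv"

start = file.find("rep")   # index where "rep" begins
end = file.find(".csv")    # index where ".csv" begins
rep = file[start + 3:end]  # the characters between them

print(rep)       # 281
print(int(rep))  # 281
```

<p>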
The first statement in the <strong>for<\/strong> loop uses the <strong>read_csv<\/strong> function from the <strong>pandas<\/strong> module to retrieve the data from the file specified in the <strong>file<\/strong> variable. The function saves the data to the <strong>df<\/strong> variable as a <strong>DataFrame<\/strong> object.<\/p>\n\n\n\n<p>The second statement in the <strong>for<\/strong> loop retrieves a list of column names from the <strong>df<\/strong> data frame and saves the list to the <strong>cols<\/strong> variable. The third statement runs the <strong>generate_list<\/strong> function, which implements the following <strong>if\u2026elif\u2026else<\/strong> logic:<\/p>\n\n\n<div class=\"block-core-list\">\n<ol class=\"wp-block-list\">\n<li>If the <strong>cols<\/strong> values match the <strong>vals1<\/strong> values, the <strong>df<\/strong> data frame is added to the <strong>lst<\/strong> object.<\/li>\n\n\n\n<li>If the <strong>cols<\/strong> values do not match the <strong>vals1<\/strong> values, but match the <strong>vals2<\/strong> values, the script takes several steps:<div class=\"block-core-list\">\n<ol style=\"list-style-type:lower-alpha\" class=\"wp-block-list\">\n<li>Finds where \u201crep\u201d is located in the filename and saves the location to <strong>start<\/strong>.<\/li>\n\n\n\n<li>Finds where \u201c.csv\u201d is located in the filename and saves the location to <strong>end<\/strong>.<\/li>\n\n\n\n<li>Uses <strong>start<\/strong> (plus 3) and <strong>end<\/strong> to extract the rep ID from the filename and save it to <strong>rep<\/strong>.<\/li>\n\n\n\n<li>Inserts the <strong>RepID<\/strong> column in the <strong>df<\/strong> data set, sets its value to <strong>rep<\/strong>, and converts the value to the <strong>int<\/strong> data type.<\/li>\n\n\n\n<li>Adds the <strong>df<\/strong> data frame to the <strong>lst<\/strong> object.<\/li>\n<\/ol>\n<\/div><\/li>\n\n\n\n<li>If the <strong>cols<\/strong> values do not match either the 
<strong>vals1<\/strong> or <strong>vals2<\/strong> values, the script takes the following steps:<div class=\"block-core-list\">\n<ol style=\"list-style-type:lower-alpha\" class=\"wp-block-list\">\n<li>Finds where \u201crep\u201d is located in the filename and saves the location to <strong>start<\/strong>.<\/li>\n\n\n\n<li>Uses <strong>start<\/strong> to extract the filename and save it to <strong>filename<\/strong>.<\/li>\n\n\n\n<li>Uses the <strong>write<\/strong> function on the <strong>outfile<\/strong> object to save the filename to the <strong>outfile.txt<\/strong> file, along with a line return.<\/li>\n<\/ol>\n<\/div><\/li>\n<\/ol>\n<\/div>\n\n\n<p>Notice that the <strong>sorted<\/strong> function is used when calling the <strong>cols<\/strong>, <strong>vals1<\/strong>, and <strong>vals2<\/strong> variables. This ensures that the values are being properly compared regardless of their original order.<\/p>\n\n\n\n<p>When the <strong>for<\/strong> loop has completed, the <strong>lst<\/strong> object will contain one instance of the <strong>df<\/strong> data frame for each .csv file that matches one of the two conditions; the names of any unmatched files are instead saved to the <strong>outfile.txt<\/strong> file. The next step, then, is to close the file by calling the <strong>close<\/strong> function on the <strong>outfile<\/strong> variable:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">outfile.close()<\/pre>\n\n\n\n<p>Once you have the <strong>lst<\/strong> object populated with the individual data frames, you can concatenate them into a single data frame by using the <strong>concat<\/strong> function in the <strong>pandas<\/strong> module:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df2 = pd.concat(lst, ignore_index=True)<\/pre>\n\n\n\n<p>The <strong>concat<\/strong> function specifies two parameters. The first parameter takes the <strong>lst<\/strong> variable as the data source. 
The second parameter, <strong>ignore_index<\/strong>, is set to <strong>True<\/strong>. This tells Python to disregard the individual indexes when generating the new data frame. The data frame is then assigned to the <strong>df2<\/strong> variable. If you were to run a <strong>print<\/strong> statement against the new variable, you should see results similar to those shown in the following figure.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"491\" height=\"175\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-31.png\" alt=\"\" class=\"wp-image-77391\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Notice that the <strong>df2<\/strong> data frame contains data from five of the original files, even though two of those files include the <strong>RepID<\/strong> column and one includes columns that are out of order. The only file from the original set that was rejected is <strong>rep277.csv<\/strong>. The following figure shows how the rogue file is listed in the <strong>outfile.txt<\/strong> file.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"267\" height=\"166\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-32.png\" alt=\"\" class=\"wp-image-77392\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Being able to redirect the filenames in this way allows you to easily identify which files do not conform to a particular structure. You could have tested for other logic as well, such as verifying whether the files contain extra rows or whether the filenames follow a particular naming pattern. 
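<\/p>\n\n\n\n<p>The concatenation step can also be tried in isolation. This sketch uses two tiny stand-in data frames with made-up values to show what <strong>ignore_index<\/strong> changes:<\/p>

```python
import pandas as pd

# two single-row stand-ins for the per-file data frames (made-up values)
a = pd.DataFrame({"RepID": [277], "2013": [1500000]})
b = pd.DataFrame({"RepID": [280], "2013": [2250000]})

# without ignore_index, each row keeps its original label (both are 0);
# with ignore_index=True, pandas renumbers the combined frame 0, 1, ...
kept = pd.concat([a, b])
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())        # [0, 0]
print(renumbered.index.tolist())  # [0, 1]
```

<p>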
With Python, you can test for a wide variety of conditions, depending on your particular circumstances and requirements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-merging-two-data-frames\">Merging two data frames<\/h2>\n\n\n\n<p>Now that you have your two data frames in place, you can merge them into a new data frame, which you can then modify even further, as shown in the following example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">-- define Python script\nDECLARE @pscript NVARCHAR(MAX);\nSET @pscript = N'\n# import python modules\nimport pandas as pd\nimport glob\n# assign SQL Server dataset to the df1 variable\ndf1 = InputDataSet\ndf1[\"2012\"] = df1[\"2012\"].astype(int)\n# retrieve a list of the .csv files\npath = r\"c:\\\\datafiles\\\\sales\\\\\"\nfiles = glob.glob(path + \"\/*.csv\")\n# define variables used to create a list for a data frame\nlst = []\nvals1 = [\"RepID\", \"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\nvals2 = [\"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\n# open an output file for unusable .csv files\noutfile = open(path + \"outfile.txt\", \"w\")\n# define the generate_list function\ndef generate_list():\n\tif sorted(cols) == sorted(vals1):\n\t\tlst.append(df)\n\telif sorted(cols) == sorted(vals2):\n\t\tstart = file.find(\"rep\")\n\t\tend = file.find(\".csv\")\n\t\trep = file[start + 3:end]\n\t\tdf.insert(0, \"RepID\", int(rep))\n\t\tlst.append(df)\n\telse:\n\t\tstart = file.find(\"rep\")\n\t\tfilename = file[start:]\n\t\toutfile.write(filename + \"\\n\")\n# loop through each file in files to add data to list\nfor file in files:\n\tdf = pd.read_csv(file)\n\tcols = list(df.columns.values)\n\tgenerate_list()\n# close the output file\noutfile.close()\n# generate a data frame based on the data list\ndf2 = pd.concat(lst, ignore_index=True)\n# create the df_final data frame\ndf_final = pd.merge(df1, df2, on=\"RepID\")\ndf_final.drop(\"RepID\", axis=1, inplace=True)\ndf_final.sort_values(by=[\"FullName\"], 
inplace=True)\nprint(df_final)';\n-- define T-SQL query\nDECLARE @sqlscript NVARCHAR(MAX);\nSET @sqlscript = N'\nSELECT TOP(6) BusinessEntityID AS RepID, \n  LastName + '', '' + FirstName AS FullName,\n  CAST(SalesLastYear AS FLOAT) AS [2012]\nFROM Sales.vSalesPerson\nWHERE JobTitle = ''Sales Representative''\nORDER BY SalesLastYear DESC;';\n \n-- run procedure, using Python script and T-SQL query\nEXEC sp_execute_external_script\n  @language = N'Python',\n  @script = @pscript,\n  @input_data_1 = @sqlscript;    \nGO<\/pre>\n\n\n\n<p>As before, the script builds on the previous example by adding a few more lines of code. The first new statement merges the two data frames, based on the values in the <strong>RepID<\/strong> column:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df_final = pd.merge(df1, df2, on=\"RepID\")<\/pre>\n\n\n\n<p>To merge the data frames, you need only call the <strong>merge<\/strong> function from the <strong>pandas<\/strong> module, specifying the two data frames and the matching column (<strong>RepID<\/strong>). The results are assigned to the <strong>df_final<\/strong> data frame.<\/p>\n\n\n\n<p>Next, you can use the <strong>drop<\/strong> function available to the <strong>df_final<\/strong> object to remove the <strong>RepID<\/strong> column because it is not needed for the visualization:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df_final.drop(\"RepID\", axis=1, inplace=True)<\/pre>\n\n\n\n<p>When removing a column by name, you must include the <strong>axis<\/strong> parameter and set its value to <strong>1<\/strong>. This tells Python to reference the column names rather than the row index labels. 
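<\/p>\n\n\n\n<p>To see the merge and drop behavior in isolation, here is a sketch with two toy data frames (the names and totals are made up). Note that <strong>merge<\/strong> performs an inner join by default, so only <strong>RepID<\/strong> values present in both frames survive:<\/p>

```python
import pandas as pd

# toy stand-ins for df1 (SQL Server data) and df2 (.csv data); values are made up
df1 = pd.DataFrame({"RepID": [277, 280],
                    "FullName": ["Rep A", "Rep B"],
                    "2012": [3189418, 4251369]})
df2 = pd.DataFrame({"RepID": [280, 281], "2013": [2250000, 1750000]})

# inner join on the shared key: only RepID 280 appears in both frames
df_final = pd.merge(df1, df2, on="RepID")

# axis=1 targets the column axis; inplace=True modifies df_final directly
df_final.drop("RepID", axis=1, inplace=True)

print(df_final.columns.tolist())  # ['FullName', '2012', '2013']
print(len(df_final))              # 1
```

<p>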
Also include the <strong>inplace<\/strong> parameter to ensure that the object is updated in place, rather than generating a new object.<\/p>\n\n\n\n<p>The final statement added to the Python script sorts the <strong>df_final<\/strong> data frame based on the values in the <strong>FullName<\/strong> column:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df_final.sort_values(by=[\"FullName\"], inplace=True)<\/pre>\n\n\n\n<p>As with the <strong>drop<\/strong> function, be sure to include the <strong>inplace<\/strong> parameter when calling the <strong>sort_values<\/strong> function. You can then use a <strong>print<\/strong> statement to return the contents of the <strong>df_final<\/strong> data frame, which should give you results similar to those shown in the following figure.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"664\" height=\"278\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-33.png\" alt=\"\" class=\"wp-image-77393\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Notice that in the SSMS console, Python has wrapped the final column to the next row, which is indicated by the backslash to the right of the top row of column names. 
Even so, you can still see that the data set includes the names of the sales reps, along with their total sales for each of the six years.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-plotting-the-data-frame\">Plotting the data frame<\/h2>\n\n\n\n<p>With the final data frame in place, you\u2019re ready to create a bar chart that shows the annual sales for each sales rep:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">-- define Python script\nDECLARE @pscript NVARCHAR(MAX);\nSET @pscript = N'\n# import python modules\nimport matplotlib\nmatplotlib.use(\"PDF\")\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport glob\n# assign SQL Server dataset to the df1 variable\ndf1 = InputDataSet\ndf1[\"2012\"] = df1[\"2012\"].astype(int)\n# retrieve a list of the .csv files\npath = r\"c:\\\\datafiles\\\\sales\\\\\"\nfiles = glob.glob(path + \"\/*.csv\")\n# define variables used to create a list for a data frame\nlst = []\nvals1 = [\"RepID\", \"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\nvals2 = [\"2013\", \"2014\", \"2015\", \"2016\", \"2017\"]\n# open an output file for unusable .csv files\noutfile = open(path + \"outfile.txt\", \"w\")\n# define the generate_list function\ndef generate_list():\n\tif sorted(cols) == sorted(vals1):\n\t\tlst.append(df)\n\telif sorted(cols) == sorted(vals2):\n\t\tstart = file.find(\"rep\")\n\t\tend = file.find(\".csv\")\n\t\trep = file[start + 3:end]\n\t\tdf.insert(0, \"RepID\", int(rep))\n\t\tlst.append(df)\n\telse:\n\t\tstart = file.find(\"rep\")\n\t\tfilename = file[start:]\n\t\toutfile.write(filename + \"\\n\")\n# loop through each file in files to add data to list\nfor file in files:\n\tdf = pd.read_csv(file)\n\tcols = list(df.columns.values)\n\tgenerate_list()\n# close the output file\noutfile.close()\n# generate a data frame based on the data list\ndf2 = pd.concat(lst, ignore_index=True)\n# create the df_final data frame\ndf_final = pd.merge(df1, df2, on=\"RepID\")\ndf_final.drop(\"RepID\", axis=1, 
inplace=True)\ndf_final.sort_values(by=[\"FullName\"], inplace=True)\n# create a bar chart object\npt = df_final.plot.bar(x=\"FullName\", cmap=\"copper\")\n# configure the title, legend, and grid\npt.set_title(label=\"Annual Sales by Sales Rep\", y=1.04, fontsize=14)\npt.legend(loc=7, bbox_to_anchor=(1.2, .8))\npt.grid(color=\"slategray\", alpha=.5, linestyle=\"dotted\", linewidth=.5)\n# configure the Y-axis labels\npt.set_ylabel(\"Total Sales\", labelpad=20, fontsize=12)\npt.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter\n  (lambda x, p: format(int(x), \",\")))\n# configure the X-axis labels\npt.xaxis.label.set_visible(False)\npt.set_xticklabels(labels=df_final[\"FullName\"],\n  rotation=45, horizontalalignment=\"right\")\n# save the bar chart to a PDF file\nplt.savefig(path + \"AnnualSales.pdf\", \n  bbox_inches=\"tight\", pad_inches=.5)';\n-- define T-SQL query\nDECLARE @sqlscript NVARCHAR(MAX);\nSET @sqlscript = N'\nSELECT TOP(6) BusinessEntityID AS RepID, \n  LastName + '', '' + FirstName AS FullName,\n  CAST(SalesLastYear AS FLOAT) AS [2012]\nFROM Sales.vSalesPerson\nWHERE JobTitle = ''Sales Representative''\nORDER BY SalesLastYear DESC;';\n \n-- run procedure, using Python script and T-SQL query\nEXEC sp_execute_external_script\n  @language = N'Python',\n  @script = @pscript,\n  @input_data_1 = @sqlscript;    \nGO<\/pre>\n\n\n\n<p>Most of the code related to the visualization was covered in the third article, so we won\u2019t dig into it too deeply. As you\u2019ll recall, you must include the following <strong>import<\/strong> and <strong>use<\/strong> statements:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import matplotlib\nmatplotlib.use(\"PDF\")\nimport matplotlib.pyplot as plt<\/pre>\n\n\n\n<p>The <strong>matplotlib<\/strong> module provides a set of tools for generating different types of visualizations and saving them to files. 
The first statement imports the module, the second statement uses the module to set the backend to <strong>PDF<\/strong>, and the third statement imports the <strong>pyplot<\/strong> plotting framework. The three statements make it possible to output the visualization to a .pdf file.<\/p>\n\n\n\n<p>The next step is to define the actual bar chart, calling the <strong>plot.bar<\/strong> function on the <strong>df_final<\/strong> data frame:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">pt = df_final.plot.bar(x=\"FullName\", cmap=\"copper\")<\/pre>\n\n\n\n<p>For this visualization, the <strong>plot.bar<\/strong> function takes only two parameters. The first specifies that the <strong>FullName<\/strong> column should be used for the chart\u2019s X-axis. The second parameter, <strong>cmap<\/strong>, sets the chart\u2019s colormap. A colormap is essentially a monochromatic color palette made up of different shades of a common color. When creating a chart, you can choose from a number of different colormaps. In this example, I\u2019ve selected <strong>copper<\/strong>, but feel free to try different ones. (You can learn more about colormaps at <a href=\"https:\/\/matplotlib.org\/tutorials\/colors\/colormaps.html\">https:\/\/matplotlib.org\/tutorials\/colors\/colormaps.html<\/a>.)<\/p>\n\n\n\n<p>The chart generated by the <strong>plot.bar<\/strong> function is assigned to the <strong>pt<\/strong> variable. Most of the remaining Python code uses the variable to access the chart\u2019s properties in order to control how the chart is rendered. From there, the chart is saved to the <strong>AnnualSales.pdf<\/strong> file. Again, refer back to the third article if you have any questions about these settings.<\/p>\n\n\n\n<p>That\u2019s all there is to creating the bar chart. 
If you followed along with this example, your Python script should produce a chart similar to the one shown in the following figure.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"935\" height=\"745\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/02\/word-image-34.png\" alt=\"\" class=\"wp-image-77394\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>What\u2019s interesting about all this is how the <strong>plot.bar<\/strong> function intuitively handles the data, in this case, grouping together the sales for each rep and then applying the <strong>copper<\/strong> colormap to each group. Notice, too, that the legend automatically picks up the column names, while color-coding the keys to match the bars in each group.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-working-with-python-in-sql-server\">Working with Python in SQL Server<\/h2>\n\n\n\n<p>In this and the previous articles in this series, we\u2019ve covered a number of aspects of working with data frames, as they relate to both analytics and visualizations. Of course, there\u2019s much more you can do with any of these. And there\u2019s plenty more you can do with Python and MLS in general. The better you understand how to work with Python scripts in SQL Server, the more powerful a tool you have for analyzing and visualizing data.<\/p>\n\n\n\n<section id=\"faq\" class=\"faq-block my-5xl\">\n    <h2>FAQs: SQL Server Machine Learning Services \u2013 Part 6: Merging Data Frames in Python<\/h2>\n\n                        <h3 class=\"mt-4xl\">1. How do I merge two pandas DataFrames in Python?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Use pandas.merge(left, right, how=&#8217;inner&#8217;, on=&#8217;key_column&#8217;). 
The how parameter controls the join type: &#8216;inner&#8217; (matching rows only, equivalent to SQL INNER JOIN), &#8216;left&#8217; (all rows from left, matching from right &#8211; equivalent to LEFT JOIN), &#8216;right&#8217;, or &#8216;outer&#8217; (all rows from both). The on parameter specifies the column name used to match rows (or use left_on and right_on for different column names in each DataFrame). Example: merged_df = pd.merge(sql_df, csv_df, how=&#8216;inner&#8217;, on=&#8216;SalesPersonID&#8217;)<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">2. How do I read multiple CSV files into a single pandas DataFrame?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Use a list comprehension with glob to read all CSV files matching a pattern: import glob; import pandas as pd; all_files = glob.glob(&#8216;path\/*.csv&#8217;); df = pd.concat([pd.read_csv(f) for f in all_files]). For a fixed set of CSV files, call pd.read_csv() for each and concatenate: df = pd.concat([pd.read_csv(&#8216;file1.csv&#8217;), pd.read_csv(&#8216;file2.csv&#8217;)]). Ensure all files have the same column structure before concatenating.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">3. How do I create a bar chart in Python with matplotlib inside SQL Server MLS?<\/h3>\n            <div class=\"faq-answer\">\n                <p>After building your merged DataFrame, use matplotlib: import matplotlib.pyplot as plt; fig, ax = plt.subplots(); ax.bar(df[&#8216;Name&#8217;], df[&#8216;Amount&#8217;]); plt.savefig(&#8216;chart.png&#8217;). In SQL Server Machine Learning Services, ensure matplotlib&#8217;s Agg backend is used (plt.switch_backend(&#8216;Agg&#8217;)) before creating the figure &#8211; this prevents GUI-related errors when running in the SQL Server Python environment without a display. 
Return the chart as a binary column in the output DataFrame if you want to send it back as a result set.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">4. What is the equivalent of SQL JOIN in pandas?<\/h3>\n            <div class=\"faq-answer\">\n                <p>pandas.merge() is the equivalent of SQL JOINs. INNER JOIN \u2192 how=&#8217;inner&#8217;; LEFT JOIN \u2192 how=&#8217;left&#8217;; RIGHT JOIN \u2192 how=&#8217;right&#8217;; FULL OUTER JOIN \u2192 how=&#8217;outer&#8217;. The on= parameter specifies the join column (equivalent to the ON clause). For multi-column joins: on=[&#8216;col1&#8217;, &#8216;col2&#8217;]. For columns with different names: left_on=&#8217;LeftCol&#8217;, right_on=&#8217;RightCol&#8217;. pandas also has DataFrame.join() (index-based join) and pd.concat() (stacking DataFrames vertically, equivalent to SQL UNION ALL). For most SQL JOIN scenarios, pd.merge() is the clearest equivalent.<\/p>\n            <\/div>\n            <\/section>\n","protected":false},"excerpt":{"rendered":"<p>Merge a SQL Server data frame with a CSV file data frame in Python using pandas merge() inside SQL Server Machine Learning Services. 
Covers creating both data frames and plotting the merged result as a bar chart.&hellip;<\/p>\n","protected":false},"author":221841,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":true,"footnotes":""},"categories":[143528,146042],"tags":[95509],"coauthors":[6779],"class_list":["post-77388","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-python","tag-standardize"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/77388","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221841"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=77388"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/77388\/revisions"}],"predecessor-version":[{"id":109878,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/77388\/revisions\/109878"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=77388"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=77388"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=77388"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=77388"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}