{"id":76903,"date":"2018-01-16T14:53:23","date_gmt":"2018-01-16T14:53:23","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=76903"},"modified":"2022-04-27T21:22:51","modified_gmt":"2022-04-27T21:22:51","slug":"sql-server-machine-learning-services-part-4-finding-data-anomalies-python","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/sql-server-machine-learning-services-part-4-finding-data-anomalies-python\/","title":{"rendered":"SQL Server Machine Learning Services \u2013 Part 4: Finding Data Anomalies with Python"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/machine-learning-services-part-1-python\/\">SQL Server Machine Learning Services \u2013 Part 1: Python Basics<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-2-python-data-frames\/\">SQL Server Machine Learning Services \u2013 Part 2: Python Data Frames<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-3-plotting-data-python\/\">SQL Server Machine Learning Services \u2013 Part 3: Plotting Data with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-4-finding-data-anomalies-python\/\">SQL Server Machine Learning Services \u2013 Part 4: Finding Data Anomalies with Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-5-generating-multiple-plots-python\/\">SQL Server Machine Learning Services \u2013 Part 5: Generating multiple plots in Python<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sql-server-machine-learning-services-part-6-merging-data-frames-python\/\">SQL Server Machine Learning Services \u2013 Part 6: Merging Data Frames in 
Python<\/a><\/li>\n<\/ol>\n\n<p>The first three articles in this series focused on the fundamentals of working with Python in SQL Server Machine Learning Services (MLS). The articles demonstrated how to run Python scripts within the context of a SQL Server database, using data from that database. If you reviewed these articles, you should have a basic understanding of how to create Python scripts in SQL Server, generate and manage data frames within the scripts, and plot data within those data frames to create different types of visualizations.<\/p>\n<p>As important as these concepts are to working with Python and MLS, covering them was meant only to provide you with a foundation for doing what\u2019s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.<\/p>\n<p>The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results. Before running the script, make sure that the path <strong>C:\\DataFiles\\<\/strong> exists. 
For this example, you won\u2019t be using SQL Server data, but the principles are the same: you start with a data frame, end with a data frame, and use the data frame to generate the scatter plot, as shown in the following script:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">DECLARE @pscript NVARCHAR(MAX);\r\nSET @pscript = N'\r\n# import Python modules\r\nimport matplotlib\r\nmatplotlib.use(\"PDF\")\r\nimport matplotlib.pyplot as plt\r\nimport pandas as pd\r\nfrom sklearn.datasets import load_iris\r\nfrom sklearn.model_selection import train_test_split\r\nfrom microsoftml import rx_oneclass_svm, rx_predict\r\n# create data frame\r\niris = load_iris()\r\ndf = pd.DataFrame(iris.data)\r\ndf.reset_index(inplace=True)\r\ndf.columns = [\"Plant\", \"SepalLength\", \"SepalWidth\", \"PetalLength\", \"PetalWidth\"]\r\n# split data frame into training and test data\r\ntrain, test = train_test_split(df, test_size=.06)\r\n# generate model\r\nsvm = rx_oneclass_svm(\r\n\u00a0 formula=\"~ SepalLength + SepalWidth + PetalLength + PetalWidth\", data=train)\r\n# add anomalies to test data\r\ntest2 = pd.DataFrame(data=dict(\r\nPlant = [175, 200],\r\nSepalLength = [2.9, 3.1], \r\nSepalWidth = [.85, 1.1],\r\nPetalLength = [2.6, 2.5], \r\nPetalWidth = [2.7, 3.2]))\r\ntest = test.append(test2)\r\n# score test data\r\nscores = rx_predict(svm, data=test, extra_vars_to_write=\"Plant\")\r\n# create scatter plot\r\npt = scores.plot.scatter(x=\"Plant\", y=\"Score\", c=\"navy\", s=35, alpha=.8)\r\n# configure scatter plot title and grid\r\npt.set_title(label=\" Identifying Iris Anomalies\", y=1.04,\u00a0\r\n\u00a0 fontsize=14, weight=500, color=\"navy\")\r\npt.grid(color=\"slategray\", alpha=.5, linestyle=\"dotted\", linewidth=.5)\r\n# configure scatter plot Y-axis and X-axis labels\r\npt.set_ylabel(\"Species Scoring\", labelpad=20, fontsize=12, color=\"navy\")\r\npt.set_xlabel(\"Iris Species (by ID)\", labelpad=20, fontsize=12, color=\"navy\")\r\n# save scatter plot to .pdf 
file\r\nplt.savefig(\"c:\\\\datafiles\\\\IrisAnomalies.pdf\",\r\n\u00a0 bbox_inches=\"tight\", pad_inches=.8)';\r\nEXEC sp_execute_external_script\r\n\u00a0 @language = N'Python',\r\n\u00a0 @script = @pscript;\u00a0 \u00a0 \u00a0\r\nGO<\/pre>\n<p>When running a Python script in MLS, you must pass it in as a parameter value to the <strong>sp_execute_external_script<\/strong> stored procedure. If you\u2019re not familiar with how this works, refer to the first article in this series. In this article, we focus exclusively on the Python script, taking each section one step at a time, starting with importing the Python modules.<\/p>\n<h2>Importing Python modules<\/h2>\n<p>For most Python scripts, you will likely need to import one or more Python modules, packages that contain tools for carrying out various tasks and working with data in different ways. You were introduced to the first set of <strong>import<\/strong> statements in the previous article:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">import matplotlib\r\nmatplotlib.use(\"PDF\")\r\nimport matplotlib.pyplot as plt\r\n<\/pre>\n<p>As you\u2019ll recall, the <strong>matplotlib<\/strong> module provides a set of tools for creating visualizations and saving them to files. The first <strong>import<\/strong> statement calls the module, and the second statement uses the module to call the <strong>use<\/strong> function, which sets the backend to <strong>PDF<\/strong>. The third statement imports the <strong>pyplot<\/strong> plotting framework included in the <strong>matplotlib<\/strong> package. 
Together these three statements make it possible to create a scatter plot and output it to a .pdf file.<\/p>\n<p>The next step is to import the <strong>pandas<\/strong> module, which includes the tools necessary to create and work with data frames:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">import pandas as pd\r\n<\/pre>\n<p>The examples in the previous articles did not require you to import the <strong>pandas<\/strong> module because the <strong>InputDataSet<\/strong> variable pulled in the necessary <strong>DataFrame<\/strong> structure, along with the various functions it supports. However, because this example does not import SQL Server data into the script, and consequently does not use the <strong>InputDataSet<\/strong> variable, you must specifically import the <strong>pandas<\/strong> module to access its tools.<\/p>\n<p>The next two <strong>import<\/strong> statements retrieve specific functions from the <strong>sklearn<\/strong> module, which contains a variety of machine learning, cross-validation, and visualization algorithms:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">from sklearn.datasets import load_iris\r\nfrom sklearn.model_selection import train_test_split\r\n<\/pre>\n<p>In this case, you need to import only the <strong>load_iris<\/strong> function in <strong>sklearn.datasets<\/strong> and the <strong>train_test_split<\/strong> function in <strong>sklearn.model_selection<\/strong>, both of which are discussed in more detail below.<\/p>\n<p>The final import statement retrieves two functions from the <strong>microsoftml<\/strong> module, a collection of Python functions specific to MLS:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">from microsoftml import rx_oneclass_svm, rx_predict\r\n<\/pre>\n<p>As with the <strong>sklearn<\/strong> functions, we\u2019ll be digging into the <strong>rx_oneclass_svm<\/strong> and <strong>rx_predict<\/strong> functions shortly, but know that the <strong>microsoftml<\/strong> 
module includes a number of functions for training and scoring data and carrying out other analytical tasks.<\/p>\n<p>The <strong>microsoftml<\/strong> module is also tightly coupled with the <strong>revoscalepy<\/strong> module, another Microsoft package included with MLS. The <strong>revoscalepy<\/strong> module contains a collection of functions that you can use to import, transform, and analyze data at scale. The module also supports classifications, regression trees, descriptive statistics, and other data models. Data sources used in the <strong>microsoftml<\/strong> module are defined as <strong>revoscalepy<\/strong> objects.<\/p>\n<h2>Generating the iris data frame<\/h2>\n<p>After you\u2019ve imported the necessary components, you will create a data frame that provides the initial data structure necessary to perform the analytics. You can pull the data from a SQL Server database or from another source. In this case, you\u2019ll take the latter approach, creating a data frame based on the well-known Iris data set.<\/p>\n<p>The Iris data set is commonly used for demonstrating and trying out coding constructs. The data set originated in the 1930s and is considered one of the first modern examples of statistical classification. The data contains measurements for 150 iris flowers spanning three species. For each flower, the set includes the sepal length and width and the petal length and width.<\/p>\n<p>To pull the Iris data set into your script, you need only call the <strong>load_iris<\/strong> function available in the <strong>sklearn<\/strong> module:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">iris = load_iris()\r\n<\/pre>\n<p>The <strong>load_iris<\/strong> function returns the data as an <strong>sklearn<\/strong>.<strong>utils<\/strong>.<strong>Bunch<\/strong> object that includes the iris data. 
For this reason, the next step you should perform is to retrieve the data from the object and convert it to a <strong>pandas<\/strong> data frame:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">df = pd.DataFrame(iris.data)\r\n<\/pre>\n<p>If you were retrieving SQL Server data, you would not need to take this step because the data would be imported into the Python script as a data frame. Because that is not the case in this situation, you must explicitly convert the data. The important point to take from all this is that, regardless of where the data originates, you must end up with a data frame. From there, you can take whatever steps necessary to prepare the data frame for analysis. In this particular situation, you should convert the data frame\u2019s index to a column:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">df.reset_index(inplace=True)\r\n<\/pre>\n<p>If you reviewed the second article in this series, the <strong>reset_index<\/strong> function should be familiar to you. The function transforms the existing index into a column and generates a new index. As you\u2019ll see later in the article, this will provide you with a way to identify the individual species after scoring the data.<\/p>\n<p>Next, you should assign a name to each column, starting with the index column:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">df.columns = [\"Plant\", \"SepalLength\", \"SepalWidth\", \"PetalLength\", \"PetalWidth\"]\r\n<\/pre>\n<p>When you run the <strong>reset_index<\/strong> function, the new column is added at the beginning of the data set, so be sure to start with that column name first. You can then follow with the names for the four columns that contain the iris size data, specifying the names in the same order as in the original data set. For any of these columns, you can use whatever names you like. 
Just be sure to update the script accordingly.<\/p>\n<p>If you want to view the data at this point, you can add a <strong>print<\/strong> command to your Python script, passing in the <strong>df<\/strong> data set as the function\u2019s argument:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">print(df)\r\n<\/pre>\n<p>The following figure shows the first 14 rows returned by the <strong>print<\/strong> command when I ran it on my system in SQL Server Management Studio.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"912\" height=\"462\" class=\"wp-image-76904\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/01\/20screenshots4-stpython4_fig01-png.png\" alt=\"%20screenshots4\/stPython4_fig01.png\" \/><\/p>\n<p>As expected, the <strong>Plant<\/strong> column contains the same values as the newly generated index, followed by the four columns that contain the <strong>iris<\/strong> size data. With the data frame in place, you\u2019re ready to generate the sample training and test data.<\/p>\n<h2>Preparing the model and sample data<\/h2>\n<p>When analyzing data for anomalies (outliers), you will need data to train the analytics model and data to test for outliers. One way you can get this data is to split the <strong>df<\/strong> data frame into two data sets: a larger one for training and a smaller one for testing. An easy way to create the two data sets is to use the <strong>train_test_split<\/strong> function included with the <strong>sklearn<\/strong> module:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">train, test = train_test_split(df, test_size=.06)\r\n<\/pre>\n<p>The <strong>train_test_split<\/strong> function generates two random sets of data from the specified array or matrix, which in this case is the <strong>df<\/strong> data frame. By default, the function uses 75% of the data set for the training data and 25% for the test data. 
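If you want to see the splitting idea in isolation, an equivalent random split can be sketched with pandas alone (the 100-row frame and fixed random_state are assumptions for illustration; this is not the train_test_split implementation):

```python
import pandas as pd

# 100 dummy rows standing in for the 150-row iris frame.
df = pd.DataFrame({"x": range(100)})
# Sample 6% of the rows for testing; everything else becomes training data.
# random_state is fixed here only so the sketch is repeatable.
test = df.sample(frac=0.06, random_state=0)
train = df.drop(test.index)
print(len(train), len(test))  # 94 6
```

Like train_test_split, this produces two non-overlapping frames whose row counts sum to the original.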
(It appears that these proportions will be changing to a 79\/21 split at some point in the future.) For this example, I wanted an even smaller set of test data, so I included the <strong>test_size<\/strong> parameter and specified that only 6% of the data set be used. (You can specify any float between 0.0 and 1.0.)<\/p>\n<p>When you run the <strong>train_test_split<\/strong> function, it will assign the training data to the <strong>train<\/strong> variable and the test data to the <strong>test<\/strong> variable, resulting in two new data frames.<\/p>\n<p>The next step is to generate and train a one-class SVM data model that identifies outliers that do not belong to the target class. In this case, you want the model to identify iris-related data that is outside the normal size ranges. To create the trained model, you can use the <strong>rx_oneclass_svm<\/strong> function in the <strong>microsoftml<\/strong> module:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">svm = rx_oneclass_svm(\r\nformula=\"~ SepalLength + SepalWidth + PetalLength + PetalWidth\", data=train)\r\n<\/pre>\n<p>When you call the function, you need to specify the formula and data to use for training the model. When defining the formula, you will typically specify one or more variables (in this case, columns). For multiple columns, precede the list with the tilde (~) operator and use the plus (+) operator to combine the columns. The model will be based on all four size columns from the <strong>train<\/strong> data set.<\/p>\n<p>At this point, you\u2019re ready to run the <strong>svm<\/strong> model against the test data. But before you do that, you should add a couple of rows of outlier data to the <strong>test<\/strong> data set to ensure that it contains observable anomalies. 
One way to add the data is to create a second data frame and then append it to the <strong>test<\/strong> data frame:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">test2 = pd.DataFrame(data=dict(\r\nPlant = [175, 200],\r\nSepalLength = [2.9, 3.1], \r\nSepalWidth = [.85, 1.1],\r\nPetalLength = [2.6, 2.5], \r\nPetalWidth = [2.7, 3.2]))\r\ntest = test.append(test2)\r\n<\/pre>\n<p>To add the rows, first create a <strong>dict<\/strong> object that contains the set of new values, and then use the <strong>DataFrame<\/strong> function to convert the <strong>dict<\/strong> object to a <strong>DataFrame<\/strong> object and assign it to the <strong>test2<\/strong> variable. Next, use the <strong>append<\/strong> function on the <strong>test<\/strong> variable to add the <strong>test2<\/strong> data.<\/p>\n<p>You can use whatever values you like for the <strong>test2<\/strong> data (within reason). I came up with these amounts by looking at the mean values included with the original Iris data set:<\/p>\n<ul>\n<li>Sepal length: 5.84<\/li>\n<li>Sepal width: 3.05<\/li>\n<li>Petal length: 3.76<\/li>\n<li>Petal width: 1.20<\/li>\n<\/ul>\n<p>Notice that I also specified <strong>175<\/strong> and <strong>200<\/strong> for the <strong>Plant<\/strong> values in the <strong>test2<\/strong> data frame. The original Iris data set includes only 150 rows, so these two values will make it easier to verify that the outliers are indeed the rows you just added. This also points to why it was useful to turn the original index into a column. It is an easy way to identify the individual species. Of course, this implies that the index has meaning in terms of identifying species, which is not very likely, but for our purposes here, it\u2019s a useful enough strategy.<\/p>\n<h2>Scoring the data<\/h2>\n<p>The final step is to score the <strong>test<\/strong> data, using the <strong>svm<\/strong> model. 
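A side note for readers on current pandas versions: the <strong>append</strong> function used above was deprecated in pandas 1.4 and removed in 2.0; <strong>pd.concat</strong> achieves the same result (the values below are illustrative stand-ins for the article's test data):

```python
import pandas as pd

# Illustrative frames; the values stand in for the article's test rows.
test = pd.DataFrame({"Plant": [10, 11], "SepalLength": [5.1, 4.9]})
test2 = pd.DataFrame({"Plant": [175, 200], "SepalLength": [2.9, 3.1]})
# pd.concat replaces the removed DataFrame.append; ignore_index renumbers rows.
test = pd.concat([test, test2], ignore_index=True)
print(test["Plant"].tolist())  # [10, 11, 175, 200]
```

The concatenated frame contains the original rows followed by the appended outliers, just as the article's append call did.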
For this, you can use the <strong>rx_predict<\/strong> function in the <strong>microsoftml<\/strong> module:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">scores = rx_predict(svm, data=test, extra_vars_to_write=\"Plant\")\r\n<\/pre>\n<p>The function runs an advanced algorithm that calculates a relative score for each species, based on the four size columns. When calling the function, you must specify the <strong>svm<\/strong> model and <strong>test<\/strong> data frame. You should also include the <strong>extra_vars_to_write<\/strong> parameter and specify <strong>Plant<\/strong> as its value so the column is included in the results. Otherwise, you would end up with only the scores, one for each species, without being able to tie them back to the individual plants.<\/p>\n<p>The <strong>rx_predict<\/strong> function returns the results as a <strong>DataFrame<\/strong> object, which is assigned to the <strong>scores<\/strong> variable. If you run the <strong>print<\/strong> function against the variable, you should see results similar to those shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"386\" height=\"361\" class=\"wp-image-76905\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/01\/20screenshots4-stpython4_fig02-png.png\" alt=\"%20screenshots4\/stPython4_fig02.png\" \/><\/p>\n<p>Notice that the <strong>Plant<\/strong> values are turned into floats. You can convert those to integers if you like, but for our purposes here, it\u2019s not necessary. The important point is that plants 175 and 200 clearly fall outside the range of the other scores.<\/p>\n<p>When running your Python script, keep in mind that you\u2019re generating random test and training data, so your results will likely vary from what is shown here and from what you see each time you run the script. 
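If the <strong>microsoftml</strong> module is not available on your system, scikit-learn's <strong>OneClassSVM</strong> illustrates the same one-class train-then-predict idea (this is an approximation on synthetic data, not the rx_oneclass_svm algorithm or its parameters):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# 100 "normal" four-column rows drawn around a single center; these
# values are synthetic stand-ins, not the iris measurements.
rng = np.random.RandomState(0)
normal = rng.normal(loc=5.0, scale=0.5, size=(100, 4))

svm = OneClassSVM(nu=0.1, gamma="scale").fit(normal)
# predict returns 1 for inliers and -1 for outliers.
preds = svm.predict([[5.0, 5.0, 5.0, 5.0],       # near the training cloud
                     [20.0, 20.0, 20.0, 20.0]])  # far outside it
print(preds.tolist())  # [1, -1]
```

Where rx_predict returns continuous scores, OneClassSVM's <strong>predict</strong> returns inlier/outlier labels; its <strong>decision_function</strong> gives a continuous score if you want something closer to the article's plot.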
Even so, plants 175 and 200 should always show up as outliers.<\/p>\n<h2>Plotting the data scores<\/h2>\n<p>The rest of the Python script should be familiar to you if you reviewed the third article in this series. You start by creating a scatter plot chart based on the <strong>scores<\/strong> data frame, specifying the <strong>Plant<\/strong> column for the X-axis and the <strong>Score<\/strong> column for the Y-axis:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">pt = scores.plot.scatter(x=\"Plant\", y=\"Score\", c=\"navy\", s=35, alpha=.8)\r\n<\/pre>\n<p>The <strong>c<\/strong> parameter sets the color. The <strong>s<\/strong> parameter sets the marker size in points squared (points^2). The <strong>alpha<\/strong> parameter sets the opacity as a fraction between 0 (fully transparent) and 1 (fully opaque).<\/p>\n<p>The next step is to add a title, providing its text, location, font size, weight, and color:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">pt.set_title(label=\" Identifying Iris Anomalies \", y=1.04, \r\nfontsize=14, weight=500, color=\"navy\")\r\n<\/pre>\n<p>You can also add a background grid, setting its color, transparency, line type, and line width:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">pt.grid(color=\"slategray\", alpha=.5, linestyle=\"dotted\", linewidth=.5)\r\n<\/pre>\n<p>You can follow this by configuring labels for the X-axis and Y-axis:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">pt.set_ylabel(\"Species Scoring\", labelpad=20, fontsize=12, color=\"navy\")\r\npt.set_xlabel(\"Iris Species (by ID)\", labelpad=20, fontsize=12, color=\"navy\")\r\n<\/pre>\n<p>The final step is to use the <strong>savefig<\/strong> function from the <strong>matplotlib<\/strong> module to save the scatter plot to a .pdf file. 
You can use the following statement for this, but be sure to specify the correct path and file name:<\/p>\n<pre class=\"theme:vs2012 lang:python decode:true\">plt.savefig(\"c:\\\\datafiles\\\\IrisAnomalies.pdf\",\r\nbbox_inches=\"tight\", pad_inches=.8)\r\n<\/pre>\n<p>We\u2019ve skimmed over this part of the Python script because most of these concepts are covered in detail in the previous article. If you have any questions about these statements, refer to that article or to the <strong>matplotlib<\/strong> documentation.<\/p>\n<h2>Reviewing the results<\/h2>\n<p>When you run the script, the Python engine generates results similar to the following:<\/p>\n<pre>STDOUT message(s) from external script: \r\nAutomatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.\r\nBeginning processing data.\r\nRows Read: 141, Read Time: 0.001, Transform Time: 0\r\nBeginning processing data.\r\nBeginning processing data.\r\nRows Read: 141, Read Time: 0, Transform Time: 0\r\nBeginning processing data.\r\nUsing these libsvm parameters: svm_type=2, nu=0.1, cache_size=100, eps=0.001, shrinking=1, kernel_type=2, gamma=0.25, degree=0, coef0=0\r\nReconstructed gradient.\r\noptimization finished, #iter = 21\r\nobj = 83.959314, rho = 11.983749\r\nnSV = 15, nBSV = 13\r\nNot training a calibrator because it is not needed.\r\nElapsed time: 00:00:01.1979236\r\nElapsed time: 00:00:00.2649242\r\nBeginning processing data.\r\nRows Read: 11, Read Time: 0, Transform Time: 0\r\nBeginning processing data.\r\nElapsed time: 00:00:00.2414419\r\nFinished writing 11 rows.\r\nWriting completed.\r\n<\/pre>\n<p>The <strong>rx_oneclass_svm<\/strong> function outputs the majority of these results. The last six lines are generated by the <strong>rx_predict<\/strong> function. Mostly what you\u2019re looking for are error and warning messages. 
Fortunately, everything here is informational, and you can review it at your leisure.<\/p>\n<p>You\u2019ll likely be more interested in the scatter plot that you saved to the .pdf file. The following figure shows the one I generated on my system.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1394\" height=\"1176\" class=\"wp-image-76906\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/01\/20screenshots4-stpython4_fig03-png.png\" alt=\"%20screenshots4\/stPython4_fig03.png\" \/><\/p>\n<p>Again, your results will likely be a bit different from those shown here and different each time you run the Python script. Even so, the two outliers (plants 175 and 200) should still be clearly visible.<\/p>\n<h2>The world of SQL Server analytics<\/h2>\n<p>MLS includes both the <strong>microsoftml<\/strong> and <strong>revoscalepy<\/strong> modules as part of its installation. Together they provide a range of machine learning tools for transforming and analyzing data. The more you use Python to work with data in a SQL Server database, the more likely you\u2019ll run up against these modules.<\/p>\n<p>Python supports other data science modules as well, such as <strong>NumPy<\/strong> and <strong>SciPy<\/strong>, both of which are included in Anaconda, the Python distribution used by MLS. Anaconda also includes a number of other Python packages for working with SQL Server data. What we\u2019ve looked at so far in this series barely scratches the surface of what\u2019s available. The more you dig into Python for machine learning and analytics, the more you\u2019ll realize how much you can do, and all within the context of a SQL Server database.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python is widely used to analyze data. One such use is to find anomalies or outliers. In this article, Robert Sheldon demonstrates how to create a support vector machine (SVM) to score test data so that outliers can be viewed on a scatter plot. 
&hellip;<\/p>\n","protected":false},"author":221841,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,146042],"tags":[],"coauthors":[6779],"class_list":["post-76903","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-python"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76903","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221841"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=76903"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76903\/revisions"}],"predecessor-version":[{"id":77642,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/76903\/revisions\/77642"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=76903"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=76903"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=76903"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=76903"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}