{"id":85005,"date":"2019-08-19T14:15:04","date_gmt":"2019-08-19T14:15:04","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=85005"},"modified":"2026-04-14T16:38:09","modified_gmt":"2026-04-14T16:38:09","slug":"apache-spark-for-net-developers","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/development\/dotnet-development\/apache-spark-for-net-developers\/","title":{"rendered":"Apache Spark for .NET Developers: Setup and First Spark Program"},"content":{"rendered":"<h2>Executive Summary<\/h2>\n<p><strong>This article describes how to set up and write programs using .NET for Apache Spark (Microsoft.Spark) &#8211; the .NET driver that allowed C# and F# developers to use Apache Spark for distributed data processing. Note: this project was archived in 2024 and the instructions below reflect its original state. Apache Spark itself remains actively maintained and widely used &#8211; this article provides conceptual background on how .NET interfaced with Spark&#8217;s JVM-based execution model, how Spark&#8217;s DataFrame API works, and what kinds of big data processing Spark is suited for. That conceptual knowledge transfers to any Spark environment.<\/strong><\/p>\n<p>Apache Spark is a fast, scalable data processing engine for big data analytics. In some cases, it can be 100x faster than Hadoop. Ease of use is one of the primary benefits, and Spark lets you write queries in Java, Scala, Python, R, SQL, and now .NET. The execution engine doesn\u2019t care which language you write in, so you can use a mixture of languages or SQL to query data sets.<\/p>\n<p>The goal of .NET for Apache Spark is to make Spark accessible from C# or F#. 
You can bring Spark functionality into your apps using the skills you already have.<\/p>\n<p>The .NET implementation provides a full set of APIs that mirror the actual Spark API so that, excluding a few areas still under development, the complete set of Spark functionality is available from .NET.<\/p>\n<h2>Setting up Apache Spark on Windows<\/h2>\n<p>The .NET implementation still uses the Java VM, so it isn\u2019t a separate implementation that replaces Spark; it sits on top of the Java runtime and interacts with it. You still need to have Java installed.<\/p>\n<p>Spark is written in Scala and runs on a Java virtual machine, so it can run on any platform, including Windows. However, Windows does not have production support. The version of Java that Spark supports is 1.8 (version 8).<\/p>\n<p>Oracle has recently changed the way that they support their JDK in that you need to pay a license fee to run it in production. Oracle also released a version called OpenJDK that doesn\u2019t have a license fee to pay when running in production. Spark can only run on Java 8 today, and running it in a development environment doesn\u2019t cost anything, so you can use the Oracle JRE 8 for this article. If you will be using Spark in production, then licensing is something you should investigate.<\/p>\n<p>.NET for Apache Spark was released in April 2019 and is available as a <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.Spark\/\">download<\/a> on NuGet, or you can build and run the <a href=\"https:\/\/github.com\/dotnet\/spark\">source<\/a> from GitHub.<\/p>\n<h3>Install a Java 8 Runtime<\/h3>\n<p>You can <a href=\"https:\/\/www.oracle.com\/technetwork\/java\/javase\/downloads\/jre8-downloads-2133155.html\">download the JRE<\/a> from the Oracle site. You will need to create a free Oracle account to download.<\/p>\n<p>I would strongly suggest getting the 64-bit JRE because the 32-bit version is going to be very limited for Spark. 
The specific download is <em>jre-8u212-windows-x64.exe,<\/em> although this will change when there are any more releases.<\/p>\n<p>Install Java. My installation of Java was in <em>C:\\Program Files\\Java\\jre1.8.0_212<\/em>, but take note of where your version is because you will need it later.<\/p>\n<h3>Download and Extract a Version of Spark<\/h3>\n<p>You can download Spark <a href=\"https:\/\/spark.apache.org\/downloads.html\">here<\/a>. There are currently two versions of Spark that you can download, 2.3 or 2.4. The current .NET implementation supports both versions, but you do need to know which version you will be using. I would suggest downloading 2.4 at this point. The README for .NET Spark shows which versions of <a href=\"https:\/\/github.com\/dotnet\/spark\">Spark<\/a> are supported: currently, any 2.3.* version is supported, as is any of 2.4.0, 2.4.1 and 2.4.3, but note that 2.4.2 is not supported, so steer clear of that version.<\/p>\n<p>At the time of this writing, the version of Spark supported by the current Microsoft.Spark is this <a href=\"https:\/\/archive.apache.org\/dist\/spark\/spark-2.4.1\/spark-2.4.1-bin-hadoop2.7.tgz\">version<\/a>.<\/p>\n<p>Once you have chosen the Spark version, you can select the package type. Unless you want to compile Spark from source or use your own Hadoop implementation, select <em>Pre-built for Apache Hadoop 2.7 and later<\/em> and then download the tgz. Today, that is <em>spark-2.4.3-bin-hadoop2.7.tgz<\/em>.<\/p>\n<p>Once it has downloaded, use 7-zip to extract the folder to a known location, <em>c:\\spark-2.4.3-bin-hadoop2.7,<\/em> for example. Again, take note of where you extracted the Spark folder. 
My Spark folder looks like:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"779\" height=\"517\" class=\"wp-image-85027\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-59.png\" \/><\/p>\n<p>If you have something that looks like this, then you should be in good shape.<\/p>\n<h3>Download the Hadoop winutils.exe<\/h3>\n<p>The last step is to download <em>winutils<\/em>, which is a helper for Hadoop on Windows. You can download it from <a href=\"https:\/\/github.com\/steveloughran\/winutils\/blob\/master\/hadoop-2.7.1\/bin\/winutils.exe?raw=true\">GitHub<\/a>.<\/p>\n<p>When you have downloaded <em>winutils.exe<\/em>, you need to put it in a folder called <em>bin<\/em> inside another folder. I use <em>c:\\Hadoop\\bin<\/em>, but as long as <em>winutils.exe<\/em> is in a folder called <em>bin<\/em>, you can put it anywhere.<\/p>\n<h3>Configure Environment Variables<\/h3>\n<p>The final step in configuring Spark is to create some environment variables. I have a script I run from a cmd prompt when I need them, but you can also set system environment variables if you wish. My script looks like this:<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">SET SPARK_HOME=c:\\spark-2.4.1-bin-hadoop2.7\nSET HADOOP_HOME=c:\\Hadoop\nSET JAVA_HOME=C:\\Program Files\\Java\\jre1.8.0_212\nSET PATH=%SPARK_HOME%\\bin;%HADOOP_HOME%\\bin;%JAVA_HOME%\\bin;%PATH%<\/pre>\n<p>What this script does is set <code>SPARK_HOME<\/code> to the location of the extracted Spark directory (adjust the version in the path to match the one you extracted), set <code>JAVA_HOME<\/code> to the location of the JRE installation, and set <code>HADOOP_HOME<\/code> to the folder that contains the bin directory that <em>winutils.exe<\/em> is in. 
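<\/p>
<p>Mistyped paths are a common cause of setup failures, so it can help to sanity-check that each of these variables points at a real installation. The sketch below is a hypothetical plain-Python helper (<code>check_home<\/code> is my name, not part of Spark or the driver), demonstrated against a throwaway directory standing in for <em>c:\\Hadoop<\/em>:<\/p>

```python
import os
import tempfile

def check_home(path, required_file):
    # Hypothetical sanity check: does path contain bin/required_file?
    # Not part of Spark or the .NET driver - illustration only.
    return os.path.isfile(os.path.join(path, 'bin', required_file))

# Demo against a throwaway layout standing in for c:\Hadoop.
with tempfile.TemporaryDirectory() as hadoop_home:
    os.makedirs(os.path.join(hadoop_home, 'bin'))
    open(os.path.join(hadoop_home, 'bin', 'winutils.exe'), 'w').close()
    print(check_home(hadoop_home, 'winutils.exe'))  # True
```
<p>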
Once the environment variables have been set, I add the bin folder from each to the <code>PATH<\/code> environment variable.<\/p>\n<h2>Testing Apache Spark on Windows<\/h2>\n<p>To check everything is set up correctly, check that the JRE is available and is the correct version:<\/p>\n<p>In a command window, run <code>java -version<\/code> and then <code>spark-shell<\/code>. If you have set up all the environment variables correctly, you should see the spark-shell start. The spark-shell is a REPL that lets you run Scala commands to use Spark. Using the REPL is a great way to experiment with data as you can read, examine, and process files:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"941\" height=\"1161\" class=\"wp-image-85028\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-60.png\" \/><\/p>\n<p>When you are ready to continue, exit the spark-shell by typing <code>:q<\/code>. You use the spark-shell to check that Spark is working. To run a job later, you use something called <code>spark-submit<\/code>.<\/p>\n<p>If you can start the spark-shell, get a prompt and the cool Spark logo, then you should be ready to write a .NET application to use Spark.<\/p>\n<p>Note: you may see a warning that says<\/p>\n<p><em>NativeCodeLoader: Unable to load native-Hadoop library for your platform\u2026 using builtin-java classes where available<\/em><\/p>\n<p>It is safe to ignore this; it means that you don&#8217;t have Hadoop running on your system. If this is a Windows machine, then that is highly likely.<\/p>\n<h2>The .NET Driver<\/h2>\n<p>The .NET driver is made up of two parts. The first part is a Java JAR file which is loaded by Spark and then runs the .NET application. 
The second part of the .NET driver runs in the .NET process and acts as a proxy between the .NET code and the Java classes (from the JAR file), which then translate the requests into Java requests in the Java VM which hosts Spark.<\/p>\n<p>The .NET driver is added to a .NET program using NuGet and ships both the .NET library and two Java jars. One jar is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one for your installed version of Spark.<\/p>\n<p>There was a breaking change in version 0.4 of the .NET driver: if you are using version 0.4 or higher, then you need to use the package name <em>org.apache.spark.deploy.dotnet<\/em>, and if you are on version 0.3 or less, you should use <em>org.apache.spark.deploy<\/em> (note the extra <em>dotnet<\/em> at the end).<\/p>\n<h2>Your First Apache Spark Program<\/h2>\n<p>The .NET driver is compiled as .NET Standard, so you can use either the Windows .NET Framework or .NET Core to create a Spark program. In this example, you will create a new .NET Framework (4.6) console application:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1178\" height=\"817\" class=\"wp-image-85029\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-61.png\" \/><\/p>\n<p>You will then add the .NET Spark driver from NuGet:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"836\" height=\"816\" class=\"wp-image-85030\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-62.png\" \/><\/p>\n<p>Select <em>Microsoft.Spark<\/em>. There was also an older implementation from Microsoft called <em>Microsoft.SparkCLR<\/em>, but that has been superseded, so make sure you use the correct one. 
For this example, use Spark version 2.4.1 and the 0.2.0 NuGet package \u2013 these have been tested and work together.<\/p>\n<p>When you add the NuGet package to the project, you should see in the packages folder the two Java jars which you will need later:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"813\" height=\"206\" class=\"wp-image-85031\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-63.png\" \/><\/p>\n<h2>Execute Your First Program<\/h2>\n<p>For the first program, you will download a <a href=\"http:\/\/prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com\/pp-monthly-update-new-version.csv\">CSV from the UK government<\/a> website which has all of the prices for houses sold in the last year. If the file URL has changed, then you can get to it from <a href=\"https:\/\/www.gov.uk\/government\/statistical-data-sets\/price-paid-data-downloads#single-file\">here<\/a> by searching \u201ccurrent month as CSV file\u201d.<\/p>\n<p>The program will read this file, sum the total cost of houses sold this month, and then display the results:<\/p>\n<pre class=\"lang:c# theme:vs2012\">using System;\nusing System.Linq;\nusing Microsoft.Spark.Sql;\nnamespace HousePrices\n{\n    class Program\n    {\n        static void Main(string[] args)\n        {\n            var Spark = SparkSession\n                           .Builder()\n                           .GetOrCreate();\n            var dataFrame = Spark.Read().Csv(args[0]);\n            dataFrame.PrintSchema();\n            dataFrame.Show();\n            var sumDataFrame = dataFrame.Select(Functions.Sum(dataFrame.Col(\"_c1\")));\n            var sum = sumDataFrame.Collect().FirstOrDefault().GetAs&lt;Double&gt;(0);\n            Console.WriteLine($\"SUM: {sum}\");\n        }\n    }\n}<\/pre>\n<p>The first thing to do is to either use the <a href=\"https:\/\/github.com\/GoEddie\/dotnet-spark-article\">sample project<\/a> and build the project 
or create your own project and build it so you get an executable that you can call from Spark.<\/p>\n<p>First, take a look at this code:<\/p>\n<pre class=\"lang:c# theme:vs2012\">var Spark = SparkSession\n      .Builder()\n      .GetOrCreate();<\/pre>\n<p>Here you create the Spark session. The Spark session enables communication back with the .NET Java code and through to Spark.<\/p>\n<p>Next review:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            var dataFrame = Spark.Read().Csv(args[0]);\n            dataFrame.PrintSchema();\n            dataFrame.Show();<\/pre>\n<p>Here the Spark session created above reads from a CSV file. Pass in the path to the CSV on the command line (<code>args[0]<\/code>). (In real code, you should validate that it exists.) Once the file has been read, the code will print out the schema and show the first 20 records.<\/p>\n<p>Finally, look at this code:<\/p>\n<pre class=\"lang:c# theme:vs2012\">       var sumDataFrame = dataFrame.Select(Functions.Sum(dataFrame.Col(\"_c1\")));\n       var sum = sumDataFrame.Collect().FirstOrDefault().GetAs&lt;Double&gt;(0);\n       Console.WriteLine($\"SUM: {sum}\");<\/pre>\n<p>This uses the <code>Sum<\/code> function against the <code>_c1<\/code> column (the price column) and selects the result into a new <code>DataFrame<\/code> (<code>sumDataFrame<\/code>); <code>Collect<\/code> then brings the rows of the <code>DataFrame<\/code> back to the driver. 
It selects the first row and then retrieves the value of the 0th column and prints out the results.<\/p>\n<p>To run this, instead of just pushing F5 in Visual Studio, you need to first run Spark and tell it to load the .NET driver and pass onto the .NET driver the name of the program to execute.<\/p>\n<p>You will need these details to run the .NET app:<\/p>\n<table>\n<tbody>\n<tr>\n<td>Type<\/td>\n<td>Name<\/td>\n<td>Value<\/td>\n<\/tr>\n<tr>\n<td>Environment Variable<\/td>\n<td>JAVA_HOME<\/td>\n<td>Path to JRE install such as <em>C:\\Program Files\\Java\\jre1.8.0_212<\/em><\/td>\n<\/tr>\n<tr>\n<td>Environment Variable<\/td>\n<td>HADOOP_HOME<\/td>\n<td>Path to the folder that contains a <em>bin<\/em> folder with winutils.exe inside such as <em>c:\\Hadoop<\/em><\/td>\n<\/tr>\n<tr>\n<td>Environment Variable<\/td>\n<td>SPARK_HOME<\/td>\n<td>The folder where you extracted the contents of the downloaded Spark archive (note that the file downloaded is a tar then gzipped file, so you need to un-gzip then un-tar the file)<\/td>\n<\/tr>\n<tr>\n<td>The driver package name<\/td>\n<td><\/td>\n<td>For 0.3 and less the driver package is <em>org.apache.spark.deploy<\/em> and for 0.4 and greater it is <em>org.apache.spark.deploy.dotnet<\/em><\/td>\n<\/tr>\n<tr>\n<td>The full path to the built .NET executable<\/td>\n<td><\/td>\n<td>I created my project in <em>c:\\git\\simpletalk\\dotnet\\HousePrices<\/em> so my full path is <em>c:\\git\\simpletalk\\dotet-spark\\HousePrices\\HousePrices\\bin\\Debug\\HousePrices.exe<\/em><\/td>\n<\/tr>\n<tr>\n<td>The full path to the jars that are included in the Microsoft.Spark NuGet package<\/td>\n<td><\/td>\n<td>Because I created my solution in <em>c:\\git\\simpletalk\\dotnet,<\/em> my path is <em>C:\\git\\simpletalk\\dotet-spark\\HousePrices\\packages\\Microsoft.Spark.0.2.0\\jars\\Microsoft-spark-2.4.x-0.2.0.jar<\/em><br \/>\n(Note it is the full path including the name of the jar, not the path to where the jars are located) If your NuGet package is 
version 0.3.0 or something else, then the name of the jar will be more like: <em>packages\\Microsoft.Spark.0.3.0\\jars\\Microsoft-spark-2.4.x-0.3.0.jar<\/em> \u2013 every change to the NuGet package will cause this version to change.<\/td>\n<\/tr>\n<tr>\n<td>The full path to the downloaded house prices CSV<\/td>\n<td><\/td>\n<td>In my example it is <em>c:\\users\\ed\\Downloads\\pp-monthly-update-new-version.csv<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In a command prompt that has these environment variables set, run the next command. (If you still have the spark-shell session open in your command prompt, close it using <code>:q<\/code>.)<\/p>\n<pre class=\"lang:c# theme:vs2012\">spark-submit --class org.apache.spark.deploy.DotnetRunner --master local \"C:\\git\\simpletalk\\dotet-spark\\HousePrices\\packages\\Microsoft.Spark.0.2.0\\jars\\Microsoft-spark-2.4.x-0.2.0.jar\" \"c:\\git\\simpletalk\\dotet-spark\\HousePrices\\HousePrices\\bin\\Debug\\HousePrices.exe\" \"c:\\users\\ed\\Downloads\\pp-monthly-update-new-version.csv\"<\/pre>\n<p>If your executable isn\u2019t called <em>HousePrices.exe<\/em>, then replace that with the name of your program. When you build in Visual Studio, the output window should show the full path to your built executable. If you aren\u2019t called \u201ced\u201d then change the path to the CSV file, and if you decided to use Spark 2.3 rather than Spark 2.4, then change the version of the jar.<\/p>\n<p>The Scala code looks in the current working directory and any child directories underneath it to find <em>HousePrices.exe<\/em>. To see how it does that, you can look at <a href=\"https:\/\/github.com\/dotnet\/spark\/blob\/master\/src\/scala\/microsoft-spark-2.4.x\/src\/main\/scala\/org\/apache\/spark\/deploy\/dotnet\/DotnetRunner.scala\">the function<\/a> <code>resolveDotnetExecutable<\/code>. 
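<\/p>
<p>In essence, that search walks the working directory and every child directory until it finds a file with the right name. The sketch below is a plain-Python illustration of the idea, not the actual Scala implementation (<code>find_executable<\/code> is a hypothetical name):<\/p>

```python
import os
import tempfile

def find_executable(root, name):
    # Walk root and all child directories looking for a file called
    # name; a plain-Python illustration of the kind of search the
    # runner performs, not the real implementation.
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None  # not found anywhere under root

# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'bin', 'Debug')
    os.makedirs(target)
    open(os.path.join(target, 'HousePrices.exe'), 'w').close()
    found = find_executable(root, 'HousePrices.exe')
    print(found.endswith('HousePrices.exe'))  # True
```
<p>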
You can change the directory in your command prompt to your Visual Studio output directory and run it from there or be more specific in your command line.<\/p>\n<p>Note also that the version of the jar changes with each release of the NuGet package, and because the version is part of the filename, make sure the jar path matches the version you installed; new versions are released quite regularly:<\/p>\n<pre class=\"lang:c# theme:vs2012\">spark-submit --class org.apache.spark.deploy.DotnetRunner --master local PathToMicrosoftSparkJar PathToYourProgram.exe PathToYourCsvFile.CSV<\/pre>\n<p>If you run the command line successfully, you should see:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1726\" height=\"5866\" class=\"wp-image-85032\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-64.png\" \/><\/p>\n<p>The interesting parts are the schema from <code>dataFrame.PrintSchema()<\/code>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"621\" height=\"419\" class=\"wp-image-85033\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-65.png\" \/><\/p>\n<p>The first twenty rows from <code>dataFrame.Show()<\/code>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1159\" height=\"507\" class=\"wp-image-85034\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-66.png\" \/><\/p>\n<p>Finally, the results of the <code>Sum<\/code>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"26\" class=\"wp-image-85035\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-67.png\" \/><\/p>\n<p>You may get a lot of Java IO exceptions such as:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1116\" height=\"523\" class=\"wp-image-85036\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-68.png\" \/><\/p>\n<p>To stop these, in your <em>Spark<\/em> folder there is a 
<em>conf<\/em> directory. In the <em>conf<\/em> directory, you will have a <em>log4j.properties<\/em> file; add these lines to the end of the file:<\/p>\n<pre class=\"lang:c# theme:vs2012\">log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF\nlog4j.logger.org.apache.spark.SparkEnv=ERROR<\/pre>\n<p>If you don\u2019t have a <em>log4j.properties<\/em>, you should have a <em>log4j.properties.template<\/em>; copy it to <em>log4j.properties<\/em>.<\/p>\n<h2>A Larger Example<\/h2>\n<p>The first example was very basic, and the file doesn\u2019t contain column headers, so they are set to <code>_c0<\/code>, <code>_c1<\/code> etc., which isn\u2019t ideal. Also, the output from <code>PrintSchema<\/code> shows that every column is a string.<\/p>\n<p>The first thing to do is to get Spark to infer the schema from the CSV file, which you do by adding the option <code>inferSchema<\/code> when reading the CSV. Change the line (line 15 in my program):<\/p>\n<pre class=\"lang:c# theme:vs2012\">            var dataFrame = Spark.Read().Csv(args[0]);<\/pre>\n<p>into:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            var dataFrame = Spark.Read().Option(\"inferSchema\", true).Csv(args[0]);<\/pre>\n<p>Build your .NET application and re-run the spark-submit command line, which now causes <code>PrintSchema()<\/code> to show the actual data types:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"420\" class=\"wp-image-85037\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-69.png\" \/><\/p>\n<p>Because the price column is now inferred as a numeric type, the <code>GetAs&lt;Double&gt;(0)<\/code> call breaks with an <em>Unhandled Exception: System.InvalidCastException: Specified cast is not valid<\/em>, so you also need to change the <code>GetAs&lt;double&gt;<\/code> to <code>GetAs&lt;long&gt;<\/code>, from:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            var sum = 
sumDataFrame.Collect().FirstOrDefault().GetAs&lt;Double&gt;(0);<\/pre>\n<p>into:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            var sum = sumDataFrame.Collect().FirstOrDefault().GetAs&lt;long&gt;(0);<\/pre>\n<p>You can test that the program now completes by building in Visual Studio and re-running the spark-submit command line.<\/p>\n<p>It would be good to have the correct column headers rather than <code>_c0<\/code>. To do this, read the data frame and then convert it to a data frame with the headers passed in; this doesn\u2019t cause the data to be re-read or re-processed, so it is efficient. If you use this program, which reads the data frame, prints the schema, and then converts the data frame to a data frame with headers and re-prints the schema, you should see the original _c* column names and the corrected column names:<\/p>\n<pre class=\"lang:c# theme:vs2012\">using System;\nusing System.Linq;\nusing Microsoft.Spark.Sql;\nnamespace HousePrices\n{\n    class Program\n    {\n        static void Main(string[] args)\n        {\n            var Spark = SparkSession\n                           .Builder()\n                           .GetOrCreate();\n            var dataFrame = Spark.Read().Option(\"inferSchema\", true).Csv(args[0]);\n            dataFrame.PrintSchema();\n            dataFrame.Show();\n            dataFrame = dataFrame.ToDF(\"file_guid\", \"price\", \"date_str\", \"post_code\", \"property_type\", \"old_new\", \"duration\", \"paon\", \"saon\", \"street\", \"locality\", \"town\", \"district\", \"county\", \"ppd_Category_type\", \"record_type\");\n            dataFrame.PrintSchema();\n        }\n    }\n}<\/pre>\n<p>Build your .NET application and then re-run your spark-submit command line and you should see the correct column names:<\/p>\n<p>Going further, you can use the column names to filter the data. 
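<\/p>
<p>Conceptually, <code>Where<\/code> keeps only the rows for which the predicate holds, with the work done inside Spark\u2019s engine rather than in .NET. A plain-Python sketch over made-up rows shows the effect:<\/p>

```python
# Made-up sample rows shaped like the renamed DataFrame columns.
rows = [
    {'district': 'KENSINGTON AND CHELSEA', 'price': 1250000},
    {'district': 'CITY OF WESTMINSTER', 'price': 900000},
    {'district': 'KENSINGTON AND CHELSEA', 'price': 2100000},
]

# Rough equivalent of dataFrame.Where("district = 'KENSINGTON AND CHELSEA'")
filtered = [r for r in rows if r['district'] == 'KENSINGTON AND CHELSEA']

print(len(filtered))  # 2
```
<p>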
<em>KENSINGTON AND CHELSEA<\/em> is a beautiful part of London; see how much houses in that area cost to buy:<\/p>\n<pre class=\"lang:c# theme:vs2012\">using System;\nusing Microsoft.Spark.Sql;\nnamespace HousePrices\n{\n    class Program\n    {\n        static void Main(string[] args)\n        {\n            var Spark = SparkSession\n                           .Builder()\n                           .GetOrCreate();\n            var dataFrame = Spark.Read().Option(\"inferSchema\", true).Csv(args[0]);\n            \n            dataFrame = dataFrame.ToDF(\"file_guid\", \"price\", \"date_str\", \"post_code\", \"property_type\", \"old_new\", \"duration\", \"paon\", \"saon\", \"street\", \"locality\", \"town\", \"district\", \"county\", \"ppd_Category_type\", \"record_type\");\n            \n            dataFrame = dataFrame.Where(\"district = 'KENSINGTON AND CHELSEA'\");\n            Console.WriteLine($\"There are {dataFrame.Count()} properties in KENSINGTON AND CHELSEA\");\n            dataFrame.Show();\n            \n        }\n    }\n}<\/pre>\n<p>Build the .NET application and run the spark-submit command line and you should see something like:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1066\" height=\"165\" class=\"wp-image-85040\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-72.png\" \/><\/p>\n<p>In case you are struggling with the amount of output, you can hide the Info messages by going back to the <em>log4j.properties<\/em> file located in the <em>conf<\/em> folder inside the extracted Spark directory. Change the line:<\/p>\n<pre class=\"lang:c# theme:vs2012\">log4j.rootCategory=INFO, console<\/pre>\n<p>into:<\/p>\n<pre class=\"lang:c# theme:vs2012\">log4j.rootCategory=WARN, console<\/pre>\n<p>You will see warnings and output but not all the info messages. 
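<\/p>
<p>The behaviour is the standard logging-level hierarchy: setting the threshold to WARN discards anything below it. Python\u2019s built-in <code>logging<\/code> module works the same way, which makes for a quick illustration (an analogy only, not log4j itself):<\/p>

```python
import io
import logging

# Capture log output in a string instead of the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger('spark-analogy')
logger.addHandler(handler)
logger.propagate = False

logger.setLevel(logging.WARNING)    # like log4j.rootCategory=WARN, console
logger.info('starting job')         # suppressed: below the threshold
logger.warning('possible problem')  # emitted: at or above the threshold

print(stream.getvalue().strip())  # possible problem
```
<p>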
I would say it is generally better to leave the info messages on, so you get used to what is normal and learn some of the terminology that Spark uses.<\/p>\n<p>This new program runs quickly, but Spark is great for processing large files. It\u2019s time to do something a little bit more complicated. First, download the entire history of the <a href=\"https:\/\/www.gov.uk\/government\/statistical-data-sets\/price-paid-data-downloads\">price paid data<\/a>. Download the <em>Single File<\/em> or <em>the complete Price Paid Transaction Data as a CSV file<\/em>, currently <a href=\"http:\/\/prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com\/pp-complete.csv\">here<\/a>.<\/p>\n<p>You can then change the program so that, instead of just filtering, it filters and then groups by year, getting a count of how many properties sold per year and the average selling price. One of the features of Spark is that you can use the methods found in Scala, Python, R, or .NET, or you can write SQL.<\/p>\n<p>The date column needs to be an actual date, but even with the <code>inferSchema<\/code> option set to true, it is still read as a string. 
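<\/p>
<p>The cast has to turn date strings into real date values. In Python terms, it is doing something like the following (the sample string and format are illustrative, not the exact shape of the Land Registry data):<\/p>

```python
from datetime import datetime

# An illustrative date string of the kind a CSV column might hold.
date_str = '1995-01-31 00:00'

# Parse it into a real date; roughly what a cast to a date type does
# inside the engine.
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M').date()

print(date.year)  # 1995
```
<p>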
To correct this, add an extra column to the data set which is the date cast to an actual date:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            dataFrame = dataFrame.WithColumn(\"date\", dataFrame.Col(\"date_str\").Cast(\"date\"));<\/pre>\n<p>If you build this and then run the spark-submit command line, you should see the extra column:<\/p>\n<pre class=\"lang:c# theme:vs2012\">using System;\nusing Microsoft.Spark.Sql;\nnamespace HousePrices\n{\n    class Program\n    {\n        static void Main(string[] args)\n        {\n            var Spark = SparkSession\n                           .Builder()\n                           .GetOrCreate();\n            var dataFrame = Spark.Read().Option(\"inferSchema\", true).Csv(args[0]);\n            dataFrame = dataFrame.ToDF(\"file_guid\", \"price\", \"date_str\", \"post_code\", \"property_type\", \"old_new\", \"duration\", \"paon\", \"saon\", \"street\", \"locality\", \"town\", \"district\", \"county\", \"ppd_Category_type\", \"record_type\");\n            dataFrame = dataFrame.WithColumn(\"date\", dataFrame.Col(\"date_str\").Cast(\"date\"));\n            dataFrame.Show();\n        }\n    }\n}<\/pre>\n<p>To query the data using SQL syntax rather than just using .NET methods as shown until now, you can save the <code>DataFrame<\/code> as a view. 
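<\/p>
<p>A view behaves like a named table as far as SQL is concerned. If you want to preview the kind of grouped query used later in this article without a Spark installation, Python\u2019s built-in <code>sqlite3<\/code> module can run a close equivalent over a few invented rows (SQLite spells <code>year(date)<\/code> as <code>strftime('%Y', date)<\/code>):<\/p>

```python
import sqlite3

# An in-memory table standing in for the 'ppd' view; data invented.
conn = sqlite3.connect(':memory:')
conn.execute('create table ppd (price integer, date text)')
conn.executemany('insert into ppd values (?, ?)', [
    (100000, '1995-03-01'),
    (140000, '1995-09-15'),
    (250000, '1996-06-30'),
])

# Roughly the article's grouped query, in SQLite dialect.
rows = conn.execute(
    "select strftime('%Y', date), avg(price), count(*) "
    'from ppd group by 1 order by 1 desc'
).fetchall()

for year, avg_price, n in rows:
    print(year, avg_price, n)
```
<p>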
This makes it available to be queried:<\/p>\n<pre class=\"lang:c# theme:vs2012\">dataFrame.CreateTempView(\"ppd\");<\/pre>\n<p>You can then query the view from SQL:<\/p>\n<pre class=\"lang:c# theme:vs2012\">            Spark.Sql(\"select year(date), avg(price), count(*) from ppd group by year(date)\").OrderBy(Functions.Year(dataFrame.Col(\"date\")).Desc()).Show(100);<\/pre>\n<p>This runs the SQL query<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">select year(date), avg(price), count(*) from ppd group by year(date)<\/pre>\n<p>It then orders the results by year descending and shows up to 100 rows (the data only goes back to 1995, so you won\u2019t see 100 years of data).<\/p>\n<pre class=\"lang:c# theme:vs2012\">using System;\nusing Microsoft.Spark.Sql;\nnamespace HousePrices\n{\n    class Program\n    {\n        static void Main(string[] args)\n        {\n            var Spark = SparkSession\n                           .Builder()\n                           .GetOrCreate();\n            var dataFrame = Spark.Read().Option(\"inferSchema\", true).Csv(args[0]);\n            dataFrame = dataFrame.ToDF(\"file_guid\", \"price\", \"date_str\", \"post_code\", \"property_type\", \"old_new\", \"duration\", \"paon\", \"saon\", \"street\", \"locality\", \"town\", \"district\", \"county\", \"ppd_Category_type\", \"record_type\");\n            dataFrame = dataFrame.WithColumn(\"date\", dataFrame.Col(\"date_str\").Cast(\"date\"));\n            dataFrame.CreateTempView(\"ppd\");\n            \n            var result = Spark.Sql(\"select year(date), avg(price), count(*) from ppd group by year(date)\").OrderBy(Functions.Year(dataFrame.Col(\"date\")).Desc());\n            result.Show(100);\n        }\n    }\n}<\/pre>\n<p>You can then run this against the full dataset:<\/p>\n<pre class=\"lang:c# theme:vs2012\">spark-submit --class org.apache.spark.deploy.DotnetRunner --master local[8] Microsoft-spark-2.4.x-0.2.0.jar HousePrices.exe 
c:\\users\\ed\\Downloads\\pp-complete.csv<\/pre>\n<p>You can also change how many cores the processing uses. Instead of <code>--master local<\/code>, which uses a single core, use <code>--master local[8]<\/code> or whatever number of cores you have on your machine. If you have lots of cores, use them.<\/p>\n<p>When I ran this on my laptop with eight cores, it took 1 minute 45 seconds to complete, and the average house price in that area is about 2.5 million pounds:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"629\" height=\"599\" class=\"wp-image-85041\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/08\/word-image-73.png\" \/><\/p>\n<h2>Conclusion<\/h2>\n<p>Using .NET for Apache Spark brings the full power of Spark to .NET developers who are more comfortable writing C# or F# than Scala, Python, R or Java. It also doesn\u2019t matter whether you are running Linux or Windows for your development.<\/p>\n<h2>Source Code<\/h2>\n<p>I have included a working copy of the <a href=\"https:\/\/github.com\/GoEddie\/dotnet-spark-article\">final version of the application<\/a> on GitHub. In the git repo, there are .NET Framework and .NET Core versions of the solution. If you use the .NET Core version, then executing the program is the same except that, instead of <em>HousePrices.exe<\/em>, you need to have <em>dotnet HousePrices-core.dll<\/em> before the path to the CSV file:<\/p>\n<pre class=\"lang:c# theme:vs2012\">spark-submit --class org.apache.spark.deploy.DotnetRunner --master local PathTo\\Microsoft-spark-2.4.x-0.2.0.jar dotnet PathTo\\HousePrices-Core.dll c:\\users\\ed\\Downloads\\pp-monthly-update-new-version.csv<\/pre>\n<h2>References<\/h2>\n<p><a href=\"https:\/\/spark.apache.org\/\">https:\/\/spark.apache.org\/<\/a><\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/spark\">https:\/\/github.com\/dotnet\/spark<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Note: The .NET for Apache Spark project was archived in 2024. 
This article covers the original setup and usage of Microsoft.Spark for .NET developers &#8211; a useful reference for understanding .NET\/Spark integration concepts.&hellip;<\/p>\n","protected":false},"author":59464,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143538],"tags":[],"coauthors":[11314],"class_list":["post-85005","post","type-post","status-publish","format-standard","hentry","category-dotnet-development"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/85005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/59464"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=85005"}],"version-history":[{"count":10,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/85005\/revisions"}],"predecessor-version":[{"id":109670,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/85005\/revisions\/109670"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=85005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=85005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=85005"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=85005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}