The post 10 reasons why Python is better than C# (or almost any other programming language) appeared first on Simple Talk.
These are just beautiful! Here’s how you can write a loop and a condition to play FizzBuzz in Python:
for integer in range(1, 11):  # for first 10 numbers
    if integer % 3 == 0:
        print(integer, "Fizz")
    elif integer % 5 == 0:
        print(integer, "Buzz")
Notice anything? There are no silly { } brackets to denote when a condition or loop block starts and ends because Python can infer this from the indentation. Why type characters when you don’t have to?
Here’s how you can declare variables in four programming languages:
In Python, you don’t – and can’t – declare variables. Instead, you just assign values to them:
# create a variable to hold the meaning of life
meaning_of_life = 42
Python then creates a variable of the appropriate type (you can prove this by printing out its value):
# show the type of this, and its value
print(type(meaning_of_life), meaning_of_life)
This code would show <class 'int'> 42. Another nice thing about Python variables is how simple the data types are: there are only integers, floating point numbers, strings, and Boolean values, which is pretty much all a programming language needs (dates being just formatted numbers when all’s said and done).
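You can see this type inference at work across all four basic types (a minimal sketch; the variable names are my own):

```python
# Python infers a type from whatever value you assign
answer = 42
pi = 3.14159
greeting = "hello"
is_valid = True

# type() reports the inferred class for each variable
print(type(answer).__name__)    # int
print(type(pi).__name__)        # float
print(type(greeting).__name__)  # str
print(type(is_valid).__name__)  # bool
```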
Now C# programmers will – like I did – initially throw their arms up in horror at this, but haven’t we all been writing lines of code like this for years?
// store life's meaning
var meaningOfLife = 42;
This statement does exactly the same thing: it sets the type of the variable from the context.
I’ve found not having to declare the type of variables strangely liberating, to the extent that I now resent having to declare variables when I go back to writing code in C# or SQL.
Python is Python because of its modules: literally thousands of add-ins, covering everything from scraping websites to advanced AI tasks.
But wait a bit, you may say – C# (at least as it’s written in .NET, which is surely the most common implementation) has “modules” too. Here are just a few of the ones in the System namespace:
The answer to which is: yes, it does, but … Python modules are so much easier to use. To illustrate this, let’s take an example: looping over the subfolders in a folder. Here’s how you could do this in Python:
import os

# get a list of all the files in a folder
files = os.listdir(r"C:\__work")

# loop over these files
for file in files:
    # show its name
    print(file)
Here’s the approximate equivalent C# code:
using System.IO;

var folder = new DirectoryInfo(@"C:\__work\");
foreach (DirectoryInfo dir in folder.GetDirectories())
{
    Debug.WriteLine(dir.Name);
}
I’m aware that it’s subjective, but I find the Python modules to be almost without exception more logically constructed and easier to learn than the equivalent .NET ones, for example.
C# has (deep breath now) the following data structures: Array, ArrayList, Collection, List, Hashtable, Dictionary, SortedDictionary, and SortedList. These implement the IEnumerable or IDictionary interfaces. If you’re thinking of learning C#, I’ve probably just put you off for life!
Python has four structures: lists, tuples, dictionaries, and sets, although the first two are virtually identical from the programming point of view. Purists will say that you need all the different data types that C# provides, but trust me, you don’t.
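For reference, here are all four structures in a few lines (a minimal sketch; the fruit data is made up):

```python
# the four core Python data structures
fruits_list = ["apple", "banana"]         # list: ordered, mutable
fruits_tuple = ("apple", "banana")        # tuple: ordered, immutable
fruit_colours = {"apple": "red",
                 "banana": "yellow"}      # dictionary: key/value pairs
fruit_set = {"apple", "banana", "apple"}  # set: unique items only

# the duplicate "apple" is silently dropped from the set
print(len(fruit_set))  # 2
print(fruit_colours["banana"])  # yellow
```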
It’s not just that Python only has a few data structures; the ones it does have are so easy to use. Here’s how you create a sorted list of fruit, for example:
# create an empty list
fruit = []

# add some fruit
fruit.append("Pears")
fruit.append("Apples")
fruit.append("Bananas")

# sort this
fruit.sort()

# print
print(fruit)
Programming doesn’t get any more straightforward than this!
I’m not generally a fan of languages that try to cram as much logic into as few characters as possible (that’s why I don’t like regular expressions and didn’t get on with Perl).
However, slicing sequences has the big advantage that it’s used everywhere in Python, so once you’ve learnt a few simple rules, you can pick out anything you want. And there’s a certain beauty in code like this:
# the seven deadly sins
sins = ["pride", "envy", "gluttony", "greed", "lust", "sloth", "wrath"]

# every other one, but missing the first and last
# ie ['envy', 'greed', 'sloth']
selected_sins = sins[1:-1:2]
print(selected_sins)
The great thing about slicing is that it carries through to everything. Once you’ve learnt how to slice a tuple, for example, you’ve also learnt how to slice a multidimensional array (in numpy) or an Excel-style dataframe (in pandas) or any other sequence of items.
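A few illustrative slices show how far the same rules stretch (a minimal sketch using a string and a list):

```python
# the same slice syntax works on any sequence type
word = "SimpleTalk"
numbers = list(range(10))  # [0, 1, ..., 9]

print(word[:6])      # 'Simple' - the first six characters
print(word[6:])      # 'Talk'   - everything from index 6 onwards
print(numbers[::2])  # [0, 2, 4, 6, 8] - every other item
print(numbers[::-1]) # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] - a reversed copy
```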
C# has two separate loop structures, depending on whether you’re looping over numbers or objects, as shown by these two examples:
// first 5 numbers
for (int i = 1; i <= 5; i++)
{
    Debug.Print(i.ToString());
}

// words in an array
string[] words = {"Simple", "Talk"};
foreach (string word in words)
{
    Debug.Print(word);
}
Here is the same code in Python:
# first 5 numbers
for i in range(1, 6):
    print(i)

# words in a list
words = ["Simple", "Talk"]
for word in words:
    print(word)
Notice that not only is the Python for loop syntax simpler and easier to understand, but there’s only one version of it (unlike in C#, where sometimes you use for and sometimes foreach).
A list comprehension is one of the most beautiful programming constructs I’ve seen. You can use it as a substitute for C# lambda or anonymous functions, and the result is transparent.
Here’s an example listing out the cubes of all the even numbers up to 10:
# cubes of even numbers up to and including 10
# (would give [8, 64, 216, 512, 1000] as output)
print([n ** 3 for n in range(1, 11) if n % 2 == 0])
You could of course do this the long way round:
for n in range(1, 11):
    if n % 2 == 0:
        print(n ** 3)
It’s hard to think how you could improve the syntax of the first method: show me this for these numbers where this condition is true.
I said just now that list comprehensions are beautiful, but so too are sets.
For example, the way to remove duplicates from a list of items in Python is just to convert the list to a set (since sets can’t contain duplicate items, any duplicates will be removed), then convert the set back to a list. Like this, say:
# this contains some duplicates
languages = ["C#", "Python", "VB", "Java", "C#", "Java", "C#"]

# this set can't
language_set = set(languages)

# convert back to a list
languages = list(language_set)

# this will give: ['C#', 'Java', 'Python', 'VB']
# (although the order of items in a set isn't guaranteed)
print(languages)
However, sets don’t just provide an elegant way to remove duplicates from lists; they also allow you to find the intersection or union of two groups of items (those long-ago maths lessons learning Venn diagrams weren’t wasted after all!).
Pretty much everything you need to know about sets in Python is covered by these few lines of code:
# create two sets: Friends characters and large Antarctic ice shelves
friends = {"Rachel", "Phoebe", "Chandler", "Joey", "Monica", "Ross"}
ice_shelves = {"Ronne-Filchner", "Ross", "McMurdo"}

# show the intersection (elements in both sets)
print(friends & ice_shelves)

# show the union (elements in either set)
print(friends | ice_shelves)

# show the friends who aren't ice shelves
print(friends - ice_shelves)

# elements in either set but not both
print(friends ^ ice_shelves)
This code would give this output (“Ross” is the only item in both sets, for example):
Virtually everything to do with files and folders is easier than you think it’s going to be. Want to write out the contents of a list? No need to import modules – just do this:
# list of 3 people
people = ["Shadrach", "Meshach", "Abednego"]

# write them to a file
with open(r"c:\wiseowl\people.txt", "w") as people_file:
    people_file.writelines("\n".join(people))
Want to export this as a CSV file? Just use the built-in csv module:
import csv

# write student names to a file
with open(r"c:\wiseowl\students.csv", "w", newline="") as people_file:
    potter_file = csv.writer(people_file)

    # use this to write out 3 rows
    potter_file.writerow(["Harry", "Gryffindor"])
    potter_file.writerow(["Draco", "Slytherin"])
    potter_file.writerow(["Hermione", "Gryffindor"])
I could go on to show reading from or writing to JSON files, Excel files, any files using pandas … Python modules make coding as easy as it can be!
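As an illustration of the JSON side of this, here is a hedged sketch of round-tripping data through the built-in json module (the student data is made up; json.dump and json.load work the same way against open files rather than strings):

```python
import json

# hypothetical data to save
students = [
    {"name": "Harry", "house": "Gryffindor"},
    {"name": "Draco", "house": "Slytherin"},
]

# serialise to a JSON string (json.dump would write to a file instead)
text = json.dumps(students, indent=2)

# read it back (json.load would read from a file instead)
loaded = json.loads(text)
print(loaded[0]["name"])  # Harry
```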
Here are the results of a search using a well-known search engine (!) for the phrase C# tutorial:
Here are the results for the same search using the phrase Python tutorial:
But it’s not just that Python has nearly four times as many tutorial page results: the tutorials themselves are much better, IMHO.
It would be disingenuous to end this article without giving two areas in which Python doesn’t compare well with other languages like C#.
The first way in which Python underperforms is that it’s so hard to get started. In C#, you’re probably going to choose Visual Studio as your development environment, and while you will have teething problems, at least everything is integrated. In Python, you have to choose whether to use Visual Studio Code, PyCharm, Jupyter Notebook, or any of a dozen other candidates for your IDE (Integrated Development Environment). Even when you’ve chosen your IDE, you’ll still have to learn how to set up “virtual environments” in which you can install modules for different applications that you’re creating so that they don’t interfere with each other. None of this is straightforward, and it’s a serious impediment to getting started with Python.
The second way in which Python underperforms is that it isn’t always strongly typed. This limitation is best illustrated by an example. Consider this segment of Python code:
def add_numbers(first: int, second: int) -> int:
    # add the two numbers together
    return first + second

# test this out
print(add_numbers(3, 5))

# now test this with some text
print(add_numbers("Simple", "Talk"))
It creates a function to add two numbers together, then calls it twice. The first call will give 8 (the sum of 3 and 5), while the second will give “SimpleTalk” (treating the “+” in the function as a concatenation symbol).
The problem is this line:
def add_numbers(first: int, second: int) -> int:
This is sometimes referred to as duck typing: if it looks like a duck and quacks like a duck, it’s probably a duck. Except that this looks like an integer and is passed as an integer – and yet is happily accepting a string into the function. The data types are just hints, it turns out: a serious limitation.
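If you do want the function to reject strings, the hints alone won’t help; you have to check at run time yourself. A minimal sketch (the explicit isinstance check and the error message are my own, not part of the original example):

```python
def add_numbers(first: int, second: int) -> int:
    # hints aren't enforced, so validate explicitly at run time
    if not isinstance(first, int) or not isinstance(second, int):
        raise TypeError("add_numbers expects integers")
    return first + second

print(add_numbers(3, 5))  # 8
# add_numbers("Simple", "Talk") would now raise TypeError
```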
One other big difference between Python and other languages is that the former is interpreted rather than compiled. What this means is that when you run a Python program, the instructions are read as text in sequential order (no executable file is constructed first by translating the human-readable statements into machine-readable language). However, I still can’t decide if this is a good or bad thing, so I have left it off both lists above!
Python was created by Dutch programmer Guido van Rossum to be as easy to use as possible; he succeeded in his aim! Although the Monty Python references in documentation get a bit tedious (even for this die-hard fan), if you’re an experienced C# programmer, you will enjoy the simplicity of programming in Python.
If you like this article, you might also like 10 Reasons Why Visual Basic is Better Than C#
The post Creating Time-Intelligence Functions in DAX appeared first on Simple Talk.
Want to compare this year’s sales with the same period in the previous year? Chart year-to-date costs? Or perhaps you want to create a twelve-month moving average of profitability? In this, the last article in this series, you’ll learn how to use the time-intelligence functions built into DAX and understand why they work.
To work through the examples in this article, you’ll need to download the worksheets from this workbook. Tick the following worksheets to load data from this workbook into a new Power BI report:
Now create a relationship between the Calendar and Sales tables by the SalesDate column, as follows (note that initially the Balance and Weight tables aren’t linked to any others):
NOTE: This article assumes that you are only concerned with the date when sales are made. If you wanted to be able to choose between the sales date and the payment date when analysing data, you’d either have to create multiple versions of the Calendar table or multiple relationships, as described in the previous article in this series.
Finally, switch to the Modeling ribbon and choose to sort the month name by the month number:
And make the YearNumber a text column:
Again, the reason for both of these changes is covered in the previous article in this series.
To accommodate all the wonderful measures that you’re going to bring into being, create a matrix visual to show total amount sold by year and month:
I think it’ll be easier to work with if you have all 12 months appearing, so choose to show months even when they have no corresponding data:
Here’s what the start of the matrix should look like:
I’ve decided to hide the row subtotals for the matrix. To turn these off, go to the formatting tab and look under the Subtotals section for Row subtotals.
If for any reason you don’t get the +/- icons to expand/collapse rows, enable this setting in the matrix’s formatting properties:
To begin with, create a new measure on the Sales table to show total sales to date for each year (don’t worry too much yet about how this works):
Year-to-date =
CALCULATE(
    SUM(Sales[Amount]),
    DATESYTD('Calendar'[DateKey])
)
I’ll come back to the DATESYTD function in more detail later in this article, but for the moment I want to use it as an example to explain how any DAX time-intelligence function is calculated. Here’s what this measure should show for the matrix:
What I want to do is to focus on one figure – the year-to-date sales for March 2018. Start by looking at the sales amount for March 2018, which is 4.50:
The filter context for this figure is all of the dates in March 2018:
There’s only one sale in March 2018, so that’s why you get 4.50 as the sales for the cell:
By contrast, the measure to calculate yeartodate sales first destroys the existing filter context for the sales date. Without any additional change, you would get the same figure for each month:
However, the measure then replaces the filter context with one which picks out all of the dates from the calendar table which are on or before 31st March in 2018:
Year-to-date =
CALCULATE(
    SUM(Sales[Amount]),
    DATESYTD('Calendar'[DateKey])
)
The filter context is now as follows for the cell:
Power BI knows when days, months, quarters, and years start and end (they’re called time-intelligence functions for a reason!). Curiously, Power BI doesn’t know about weeks, so if you want to do weekly reporting, you’ll have to create new aggregator columns yourself in your calendar table (again, the previous article in this series gives a guide for how to do this sort of thing).
Here are the sales figures for the start of 2018:
Adding 9.49 and 4.5 gives the year-to-date figure of 13.99! This is what every time-intelligence function does: it destroys the previous filter context for each value in a visual’s underlying data and replaces it with a different one according to the combination of DAX functions you’ve chosen.
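The arithmetic here is just a running total that resets each year. As a rough sketch of the same idea in Python (the sales figures echo the worked example above; everything else is illustrative):

```python
# (year, month, amount) - figures from the worked example
sales = [(2018, 1, 9.49), (2018, 2, 0.00), (2018, 3, 4.50)]

running = {}  # running total per year
ytd = {}      # year-to-date figure per (year, month)
for year, month, amount in sales:
    # "replace this month's filter with all dates so far this year"
    running[year] = round(running.get(year, 0) + amount, 2)
    ytd[(year, month)] = running[year]

print(ytd[(2018, 3)])  # 13.99, matching the figure above
```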
Now let’s look in detail at how to do specific things, beginning with calculating yearly, quarterly or monthly cumulative figures. You can do this using one of these functions:

- DATESYTD or TOTALYTD (year-to-date)
- DATESQTD or TOTALQTD (quarter-to-date)
- DATESMTD or TOTALMTD (month-to-date)

This functionality is typical of time-intelligence functions in DAX: there are often two or three ways to do the same thing, and which one you use is a matter of personal preference. Here’s a measure giving the year-to-date figures using the DATESYTD function shown earlier:
Year-to-date =
CALCULATE(
    // you could use a different aggregation function
    SUM(Sales[Amount]),
    // calculate over the dates for the year-to-date
    DATESYTD('Calendar'[DateKey])
)
Here’s a measure to do exactly the same thing using TOTALYTD:
Year-to-date 2 =
TOTALYTD(
    // again, we could use MAX, MIN, COUNT, etc. here
    SUM(Sales[Amount]),
    'Calendar'[DateKey]
)
These two measures should give the same figures because they are, after all, doing exactly the same thing:
I prefer the first measure since it’s clearer what it’s doing (destroying the existing filter context, and replacing it with one which uses the dates for the current year up to and including the last day in the current period). The second measure using TOTALYTD is just a convenient shorthand for this.
Not everyone’s financial years end conveniently on 31st December, so you can specify a second argument for the DATESYTD function, giving your year-end date in the format DD-MM:
As an example, suppose that your year ends on 31st March. Then you could use this measure:
Year-to-date =
CALCULATE(
    // you could use a different aggregation function
    SUM(Sales[Amount]),
    // calculate over the dates for the year-to-date,
    // but with year ending on 31st March
    DATESYTD('Calendar'[DateKey], "31-03")
)
This measure would give this report (the box shows how the figures from 1st April 2018 to 31st March 2019 are calculated):
This result would look much better if you created a new aggregator column to give the financial year and reported by that. An idea of how to do this is shown in the previous article in this series. Since I’m feeling charitable, here is the formula that you could use to create a new calculated column in your Calendar table for the financial year:
Financial year =
IF(
    [MonthNumber] <= 3,
    // for dates up to and including the end of March
    [YearNumber] - 1 & "-" & [YearNumber],
    // for dates from April to December
    [YearNumber] & "-" & [YearNumber] + 1
)
You’ll also need a formula to determine how to sort months, so that April comes first and March last:
Financial month sort order =
IF(
    [MonthNumber] <= 3,
    // for dates up to and including the end of March
    [MonthNumber] + 12,
    // for dates from April to December
    [MonthNumber]
)
You can then choose to sort your months by the Financial month sort order column you’ve created:
If you then display the financial year column you’ve created in your matrix instead of the year, like this:
You should now get something a lot less confusing:
This may seem like a lot of faff, but remember that you’ll only have to set up the calculated columns in your calendar once and once only.
Finally, on the subject of changing your year-end dates, I’ve only shown so far how to change the financial year-end using the DATESYTD function. The process for the TOTALYTD function is similar, but there is a catch. Here’s the syntax:
So it looks for all the world as if the argument to set a new year-end date is the fourth one, and that if you’re not setting any additional filter, you will need to find some way to omit the third argument. However, this measure works:
Year-to-date 2 =
TOTALYTD(
    // again, we could use MAX, MIN, COUNT, etc. here
    SUM(Sales[Amount]),
    'Calendar'[DateKey],
    // change the year-end date
    "03-31"
)
Somehow Power BI works out that you’ve missed out the third Filter argument. In every other Microsoft product I’ve used, you need to use a comma placeholder to show that you’re omitting an argument to a function, but if you try to do this in the formula above you get an error!
If you’re wondering which functions support the additional year-end argument, the answer is all those for which this would be relevant – here’s the list for reference:
- STARTOFYEAR, ENDOFYEAR: return the first or last date in the year for the current filter context.
- PREVIOUSYEAR, NEXTYEAR: return a table of the dates in the previous or next year, based on the current filter context’s latest date.
- DATESYTD, TOTALYTD: as covered above!
- OPENINGBALANCEYEAR, CLOSINGBALANCEYEAR: return the opening or closing balance on the first/last day of the year for the current filter context.
Suppose that you want to show for each day, month, quarter or year what your sales were in the same period twelve months ago. Again, there are two ways (at least) to do this:

- the DATEADD function
- the SAMEPERIODLASTYEAR function

The DATEADD function is more useful, as it can also show sales 13 months ago, or four years ago, or indeed any number of periods of any type ago, but I’ll explain both functions in the interest of fairness. Start by showing for each month the sales in that month against the sales in the same month in the previous quarter, which should give this:
Note that I’ve reverted to using the typical calendar month and year in the matrix, and I’m sorting the calendar months by the MonthNumber column again.
So, for example, the measure should show previous quarter sales for February 2019 as 26.68, since these were what the sales were three months earlier for the same period. You can’t use the SAMEPERIODLASTYEAR function to do this for obvious reasons (the clue’s in the name), and for some strange reason there isn’t an equivalent SAMEPERIODLASTMONTH or SAMEPERIODLASTQUARTER function, so instead, you’ll use the versatile DATEADD function. This takes three arguments: the column of dates to work from, the number of intervals to move (negative numbers go back in time), and the type of interval to use.
Here’s what you’ll see for the third argument when typing it in:
For this example, you can either go three months back in time or one quarter; it makes no difference which you choose. I’ve gone for three months, to produce this measure:
Previous quarter =
CALCULATE(
    SUM(Sales[Amount]),
    // go back 3 months (could have used -1 QUARTER)
    DATEADD('Calendar'[DateKey], -3, MONTH)
)
Since the DATEADD function is so powerful (you can go forwards or backwards in time, using days, months, quarters or years as the time interval), it seems pointless having the SAMEPERIODLASTYEAR function as a shortcut for it. Nevertheless, it exists! To illustrate it, switch to showing for each period what sales were 12 months (i.e. one year) previously. You could solve this using this measure:
Previous year 1 =
CALCULATE(
    SUM(Sales[Amount]),
    SAMEPERIODLASTYEAR('Calendar'[DateKey])
)
Or this measure:
Previous year 2 =
CALCULATE(
    SUM(Sales[Amount]),
    DATEADD('Calendar'[DateKey], -1, YEAR)
)
To prove this, here are both measures in the matrix, showing that they both return 9.49 against February 2019, since that’s what sales were for the corresponding period 12 months previously:
The functions above return a figure for a corresponding period, but what happens if you want to get sales for the whole of the previous period? To see what this means, consider this example (it shows year-to-date figures as a fraction of total sales for the whole of the previous year):
The figure shown for May 2019 (boxed above) is 33.40%, since year-to-date sales at this point are 39.19, and sales for the whole of the previous 12-month period were 117.35. 39.19 divided by 117.35 gives 33.40%. The fact that by the end of 2019, the sales have exceeded 100% of the sales for the whole of the previous year is presumably a good sign!
To calculate this measure, use the PARALLELPERIOD function. This has the same format as the DATEADD function but returns the aggregate figure for the whole of a previous period, rather than for the period corresponding to the one you’re viewing. The syntax of the function is as follows:
The interval can be any one of MONTH, QUARTER or YEAR.
Note that, unlike the DATEADD function, you can’t use the PARALLELPERIOD function to return an aggregate for the whole of the previous day, presumably because this is at too low a level of granularity.
Putting all this together, here’s how the measure to give the figures above might read:
Cumulative % previous year =
DIVIDE(
    // divide the year-to-date figure ...
    CALCULATE(
        SUM(Sales[Amount]),
        DATESYTD('Calendar'[DateKey])
    ),
    // ... by sales for the whole of the previous year
    CALCULATE(
        SUM(Sales[Amount]),
        PARALLELPERIOD('Calendar'[DateKey], -1, YEAR)
    )
)
That is: divide year-to-date sales by sales for the whole of the previous year.
To format the measure, select it in the list of fields. Then, on the Modeling ribbon, format as a percentage with two decimal places.
Moving averages are one of the most useful ways to show trends, since they iron out any seasonal effects. Sadly, there is no MOVINGAVERAGE function in DAX, but you can create your own expression in a couple of different ways, of which I’ve shown the one I believe to be the more useful below.
To illustrate moving averages, first create a relationship between the Weight and the Calendar tables:
The Weight table is a bit of an anomaly – it doesn’t have anything to do with the others. I’ve been recording my weight roughly every week for the last couple of years, to see if it’s going up or down. Although this is slightly obsessive behaviour, it does provide a perfect example of the use of moving averages.
Add a calculated column to the Calendar table:
The new column should use this formula:
YearMonth =
[YearNumber] & "-" & IF([MonthNumber] < 10, "0", "") & [MonthNumber]
This column should return the year number and month number for each calendar date, which you can then use to display as labels on a chart:
Now create a line chart showing the average of the Kilos field from the Weight table against the YearMonth field from the Calendar table:
Make sure that you choose to show the average kilos, not the default sum. Also, you will need to sort your chart by the YearMonth field, not by the average kilos (the default):
You should get something like this (I’ve formatted my chart a bit, but it’s the underlying trend which interests us here):
The question is this – is my weight going up or down? To answer this, you have to take account of seasonality – I eat way too much at Christmas but tend to go on family cycling holidays in July during which my weight falls. To show what’s happening, create a 12-month moving average. If this works, the figure for February 2019, for example, should return the average for the previous 12 months (that is, for the period March 2018 through to February 2019).
Here’s a measure you could create to show the 12-month moving average:
Moving average weight =
CALCULATE(
    -- average weight in kilos ...
    AVERAGE('Weight'[Kilos]),
    -- ... over the period between two dates,
    -- as specified in the arguments
    DATESBETWEEN(
        'Calendar'[DateKey],
        -- the first date takes the last date
        -- for the filter context, works out
        -- what the corresponding period would
        -- have been for the previous year and
        -- adds one day to the last date of it
        NEXTDAY(
            SAMEPERIODLASTYEAR(
                LASTDATE('Calendar'[DateKey])
            )
        ),
        -- the last date is just the end date
        -- for the filter context
        LASTDATE('Calendar'[DateKey])
    )
)
Here’s the chart this would give, and it looks like good news – my weight may be going up and down, but on a seasonally adjusted basis it’s falling steadily, if slowly:
To understand how the measure works, create a table to include these fields:
This table should show the following data (I’ve added the red box separately – you obviously won’t be able to create this in Power BI):
The 12-month moving average for February 2019 is 82.62 and is shown selected above. This is the average of the figures for the 12 months shown in the red box. To see how the measure arrives at this figure, start in the middle of it:
LASTDATE('Calendar'[DateKey])
This expression would return 28th February 2019 for the above example (being the last date in the date filter context period for the month under consideration). Now add the next bit of the measure:
SAMEPERIODLASTYEAR(
    LASTDATE('Calendar'[DateKey])
)
This expression will give 28th February 2018, being the corresponding date in the previous calendar year. Expand the measure a bit more and you get:
NEXTDAY(
    SAMEPERIODLASTYEAR(
        LASTDATE('Calendar'[DateKey])
    )
)
This expression will give the day following 28th February 2018 (that is, 1st March 2018). So the whole date range across which you’re averaging my weight is given by:
DATESBETWEEN(
    'Calendar'[DateKey],
    -- the first date takes the last date
    -- for the filter context, works out
    -- what the corresponding period would
    -- have been for the previous year and
    -- adds one day to the last date of it
    NEXTDAY(
        SAMEPERIODLASTYEAR(
            LASTDATE('Calendar'[DateKey])
        )
    ),
    -- the last date is just the end date
    -- for the filter context
    LASTDATE('Calendar'[DateKey])
)
This gives the dates between 1st March 2018 and 28th February 2019, which was the goal!
For the last part of this tutorial on using time-intelligence functions, I’ll discuss semi-additive measures (that is, measures which sometimes aggregate data and sometimes don’t). To start, create a new layout in your report, and create this relationship:
Suppose that the Balance table contains your bank statement for the period July to September 2019. You want to get the closing bank balance at the end of each month, so you create a report based on this Balance table:
To do this, you add a table with these fields:
You should now see this table (if you include a slicer as shown to look at 2019 data only):
This result is clearly wrong (it would be nice if your bank added your daily balances to get your monthly balance, assuming that you aren’t overdrawn, but life doesn’t work that way). The goal is to pick out the last amount in each month. Fortunately, there is a family of semi-additive DAX functions to draw on:
- CLOSINGBALANCEYEAR, CLOSINGBALANCEQUARTER, CLOSINGBALANCEMONTH, OPENINGBALANCEYEAR, OPENINGBALANCEQUARTER, OPENINGBALANCEMONTH: calculate the value of an expression at the first or last date of the year, quarter or month for the current filter context.
- FIRSTDATE, LASTDATE: return the first or last date for the filter context.
- FIRSTNONBLANK, LASTNONBLANK: return the first or last date for the filter context for which a given expression has a value.
For this example, you could try using the LASTDATE function, with this measure:
Attempt at closing balance =
CALCULATE(
    -- work out the total balance ...
    SUM(Balance[Balance]),
    -- for the last date in the current
    -- filter context
    LASTDATE('Calendar'[DateKey])
)
This formula would give this column in the table:
This formula is a bit better, but it only shows figures for July. This is because in the table of balances, there weren’t any transactions on the last dates of August or September in 2019, so the measure is returning blank for these two months. You could get around this by using the clever LASTNONBLANK function, which will return the balance on the last date for which a transaction exists:
Closing balance =
CALCULATE(
    // work out the total balance ...
    SUM(Balance[Balance]),
    LASTNONBLANK(
        // ... for the last date in the current filter context ...
        'Calendar'[DateKey],
        // ... for which there are rows in the table of balances
        COUNTROWS(RELATEDTABLE(Balance))
    )
)
This measure will give the correct closing balances. The function is called semi-additive because you could then aggregate these figures if you chose, although normally you won’t want to do this:
The measure needs a bit of explanation. The syntax of the LASTNONBLANK function is as follows:
The measure returns the total balance for each month on the last day for which there are any corresponding rows in the Balance table. The reason for the RELATEDTABLE function is that at this point in the measure it’s slipped from filter to row context. The LASTNONBLANK function is an iterator which goes down the rows in the current filter context (for this example, the dates in each month), evaluating for each whether it could be included. Relationships between tables aren’t automatically supported within row context, so you need to use the RELATEDTABLE function to bring information in from another table.
It seems appropriate to end with this paragraph about row and filter context, since understanding these two concepts is key to understanding DAX. Thank you for reading through this article, and (perhaps) the other ones in this series, and happy DAXing! If you’ve enjoyed the series, you may like to know that the author’s company Wise Owl Training provides classroom training in Power BI and DAX, although currently only in the UK.
In this article, you’ve learnt that you can override the default filter context for any measure referencing a calendar date column. The replacement filter context could, for example, allow you to return year-to-date figures, show data from the same period in a previous month, quarter or year, or even show totals from prior periods or moving averages. You’ve also learnt how to use semi-additive measures to show closing (and, by analogy, opening) balances. You should also now have a feel for the fact that time-intelligence functions have this name because DAX has built-in knowledge of how days, months, quarters and years behave.
The post Creating Time-Intelligence Functions in DAX appeared first on Simple Talk.
Creating and using a calendar table is pretty straightforward, but this article will explain not just how to create a table, but also why you should want to do this. The article will also answer questions such as: what happens if you have two or more dates in the same table that you want to reference? Or if you have another table which holds information at a different level of granularity? Or if you want to report sales by bank holidays? Read on for how to create a robust data model for handling time-based data!
To work through the examples in this article, you’ll need to download the worksheets from this workbook. Tick all of the worksheets when you’re loading data into a new Power BI report:
These tables will give you the following data model:
In addition, you should have the following four tables which aren’t linked:
You’ll be using some of these tables in what follows, and some in the next article in the series.
When I was first learning Power BI (actually, it was PowerPivot in those days, those many moons ago), I didn’t initially see the point of calendars. After all, Power BI allows you to include fields from date hierarchies which are created automatically for you:
However, having a calendar table gives two big advantages:
You can use DAX’s time-intelligence functions, such as TOTALYTD and CLOSINGBALANCE; without a calendar table these won’t work.

Given that most people’s main interest in creating measures in DAX is to compare numbers across time periods, the second point is a bit of a clincher!
What should a calendar table in Power BI look like? Here’s an example:
Thus, a calendar table should include one row for each date in your model in which you might be interested. In the example above, the table consists of all the dates in 2018, 2019 and 2020, since this is the lifespan of the transactions in the Sales worksheet. In addition, each date row should have a primary key (a unique field which tells you the date you’re looking at). This doesn’t have to be a date; you could use a separate numeric field instead. However, since dates are stored internally as numbers, I can’t see any reason not to use a date column as your primary key, as above.
DAX contains a couple of functions which will auto-generate a calendar table for you (this sounds like a good idea, but probably isn’t – read on). One of these is the CALENDARAUTO
function. To use this, click on the following tool found on the Modeling tab to create a new table:
Type in a name for your table. Here I’ve called mine My calendar. Then use the CALENDARAUTO
function to say what it will contain:
In this case, just assume that the fiscal year ends in December. You could leave out the argument 12, since December – month 12 – is the default anyway:
My calendar = CALENDARAUTO(12)
When you confirm this formula, you’ll see that Power BI creates a set of dates going from the first date it finds in your data model to the last. In this example the first date is 10th February 2018 and the last date is 8th January 2020. Because the financial year ends in December, the function will generate a table containing all the dates for the months January through to December for the years 2018, 2019 and 2020:
This was all very quick, but also not that useful, as you’re now going to have to add columns giving the year, month, quarter and so on for each date AND then do the same thing for each model that you create.
A variation of the above is the CALENDAR
function, which lets you specify a start and end date:
This works in exactly the same way (and suffers from the same drawbacks), but it does at least give you more control over which dates are generated.
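What CALENDAR does – generate one row per date between two endpoints, inclusive – is easy to sketch in Python (an illustration of the idea, not the DAX implementation):

```python
from datetime import date, timedelta

def calendar_table(start, end):
    """Generate one row per date from start to end inclusive,
    like DAX's CALENDAR(start, end)."""
    days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(days)]

dates = calendar_table(date(2018, 1, 1), date(2020, 12, 31))
print(len(dates))  # 1096: 365 + 365 + 366 (2020 is a leap year)
```

Once the column of dates exists, the year, month, day and quarter columns are all derivable from it, which is why the date column makes a natural primary key.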
Another way to create a calendar is using an Excel spreadsheet. To do this, type in a column heading and the first couple of dates, and click and drag down using the black cross shown:
You can now add columns giving the year number, month number, etc. For example, for the year number:
You could then doubleclick on this black cross to copy this formula down:
Here are some functions that you could use:
=YEAR(A2) – to get the year number, as above
=MONTH(A2) – to get the month number, as above
=TEXT(A2,"mmmm") – to get the month name
=TEXT(A2,"mm - mmmm") – to get the month number/name
=DAY(A2) – to get the day number
=TEXT(A2,"dddd") – to get the day name
="Q" & INT((MONTH(A2)+2)/3) – to get the quarter number

You could then save the Excel workbook (possibly pasting the formulae as values first) and use this as a source for your Power BI calendar table.
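The quarter formula relies on a small piece of integer arithmetic: adding 2 to the month number and taking the integer part after dividing by 3 maps months 1–3 to quarter 1, 4–6 to quarter 2, and so on. A quick Python check of the same arithmetic:

```python
def quarter(month):
    """Mirror the Excel formula ="Q" & INT((MONTH(A2)+2)/3)."""
    return "Q" + str((month + 2) // 3)

print([quarter(m) for m in range(1, 13)])
# ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3', 'Q4', 'Q4', 'Q4']
```

The same trick appears again in the SQL procedure below, so it’s worth convincing yourself it works for all twelve months.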
If you’re using SQL Server, this is probably the best option. Here’s a sample procedure which you can adapt to generate one row for every date in a given range. It doesn’t claim to be that efficient (it inserts one row at a time in a loop), but you’re only going to have to run it once.
CREATE PROC spCreateCalendarTable (
    @StartDate datetime = '20180101',
    @EndDate datetime = '20201231'
)
AS
-- create a table of dates for use in Power BI
-- first get rid of any old versions of table
DROP TABLE IF EXISTS tblCalendar
-- create the table of dates
CREATE TABLE tblCalendar(
    DateKey date PRIMARY KEY,
    YearNumber int,
    MonthNumber int,
    [MonthName] varchar(10),
    DayNumber int,
    [DayName] varchar(10),
    [Quarter] char(2)
)
-- now add one date at a time
DECLARE @i int = 0
DECLARE @curdate datetime = @StartDate
WHILE @curdate <= @EndDate
BEGIN
    -- add a record for this date (could use FORMAT function if SQL Server 2012 or later)
    INSERT INTO tblCalendar (
        DateKey, YearNumber, MonthNumber, [MonthName],
        DayNumber, [DayName], [Quarter]
    )
    VALUES (
        @curdate,
        Year(@curdate),
        Month(@curdate),
        DateName(m,@curdate),
        Day(@curdate),
        DateName(weekday,@curdate),
        -- the quarter number
        'Q' + CAST(floor((month(@curdate)+2)/3) AS char(1))
    )
    -- increase iteration count and current date
    SET @i += 1
    SET @curdate = DateAdd(day,1,@curdate)
END
Once you’ve created a calendar, here’s how to use it. First load it into your data model (in this case, it was loaded with the Excel data), then link it to a date column. Here I’ve assumed that you want to analyse sales by the sale date and not the payment date. Later in this article you’ll learn how to cope with the situation where you have two or more dates in a table.
You now need to tell Power BI that the Calendar table is … a calendar table! To do this, make sure you’re looking at the calendar table in Data view:
On the Modeling tab, choose to mark this as a calendar table:
Choose the column which uniquely identifies each date, then choose OK:
The only problem with all of this is that I’m not convinced it’s necessary! It certainly can’t do any harm, but my understanding is that if you’ve chosen a date column as your primary key, DAX timeintelligence functions will work even if you omit this step.
To see the Calendar table in action, create a matrix based upon your calendar, using these fields:
You’ll get something like this:
There are two problems here: Power BI has assumed that the year number is an integer which needs summing, and it has also assumed that the month name is text which can be sorted alphabetically. There are other solutions to both problems, but the simplest ones are as follows. First, change the year number to a text column:
Secondly, with the month name column selected, choose to sort it by month number:
You can now create a matrix with these fields:
To get this visual:
Note that instead of the visual shown above, you may get something like this one:
In this case, try finding and setting the +/– icons setting for your matrix in the Row headers card, to enable you to expand and collapse rows:
If you don’t have this property, it may be that you’re using an older version of Power BI, in which case, drill down to show all the levels of detail for your matrix:
As one final touch, it would be nice to have all of the months appearing, so choose to show items with no date by clicking on the drop arrow next to the MonthName column:
And finally, you’ll see the perfect matrix!
Suppose now that you want to compare actual and forecast sales which is a common enough requirement. This should be easy – you already have a table of monthly forecasts for sales:
However, these forecasts are by month and the Calendar table is by date. The easiest solution is to create another column which arbitrarily assigns each forecast to the first day of the month in which it occurs:
Here’s the formula used:
ForecastDate = DATE([ForecastYear],[ForecastMonth],1)
It’s probably a good idea to narrow the data type for this column from Date/Time to just Date:
You can now create a relationship between this new forecast date column and the calendar’s date key column:
This will enable you to compare actual and forecast data at any level of granularity down to month:
Note that you’ll obviously have to be careful not to drill down to day level, since the forecast sales have been arbitrarily assigned to the first day in each calendar month and the results would be misleading.
Readers outside the UK will need to know that a bank holiday is a day which is treated like a Saturday or Sunday (that is, you don’t have to go to work); there are about 10 of them each year, including Christmas Day, Boxing Day, New Year’s Day, etc. The solution divides sales into working and non-working days, where a non-working day is either a Saturday, a Sunday or a bank holiday. To do this, create some new columns in the Calendar table.
Note that you can use this principle to report by any type of date: examples could include periods when you’re offering a discount to customers, timesheet weeks, times of the day when shops are open, etc.
Although you could just create a complicated single column, to make things easier to understand – and to work with – you’ll create three:
To do the first, use the WEEKDAY
function, but tweak the second argument so that it returns 6 for Saturday or 7 for Sunday:
So the full calculated column will be:
If weekend = IF(WEEKDAY([DateKey],2)>5,TRUE(),FALSE())
You could use the shorter form, if you’re comfortable with Boolean algebra!
If weekend = (WEEKDAY([DateKey],2)>5)
This shows, for example, that Christmas Day 2019 was on a Wednesday (that is, 3 days before the next weekend started):
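The same weekend test is easy to reproduce in Python, where isoweekday() already numbers Monday as 1 through Sunday as 7 – the same numbering that WEEKDAY(date, 2) produces in DAX:

```python
from datetime import date

def is_weekend(d):
    # isoweekday(): Monday=1 ... Sunday=7, matching WEEKDAY(d, 2)
    return d.isoweekday() > 5

# Christmas Day 2019 fell on a Wednesday (isoweekday 3),
# three days before the Saturday of 28 December
print(date(2019, 12, 25).isoweekday())  # 3
print(is_weekend(date(2019, 12, 28)))   # True
```

The `> 5` comparison is doing all the work in both languages: only Saturday (6) and Sunday (7) pass it.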
To say whether a day is a bank holiday, for example, you should first create and load a table of bank holidays (or use the one supplied in this article’s Excel workbook):
Now create a one-to-one relationship between the two tables:

It’s created as one-to-one automatically because the DateKey is unique in the Calendar table, and the BankHolidayDate column is unique in the bank holidays table too.
You can now create another calculated column in the Calendar table:
Here’s the code used, for copying:
If bank holiday =
IF(
    // if for this date there's no corresponding row
    // in the bank holidays table ...
    ISBLANK(RELATED(BankHoliday[BankHolidayDate])),
    // ... then it ISN'T a bank holiday
    FALSE(),
    // otherwise it is
    TRUE()
)
This shows that Christmas Day 2019 was a bank holiday as expected:
You could now combine the two conditions to get the status of any day:
This method will allow you to create reports dividing sales into working and non-working days, although the results aren’t that exciting because, as it happens, no sales were made on a bank holiday:
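Combining the two columns boils down to a single rule: a date is non-working if it’s a weekend or appears in the bank holidays table. Sketched in Python (the holiday dates below are illustrative, not the workbook’s full table):

```python
from datetime import date

# hypothetical subset of the bank holidays table
bank_holidays = {date(2019, 12, 25), date(2019, 12, 26), date(2020, 1, 1)}

def day_status(d):
    """Classify a date, combining the weekend and bank holiday tests."""
    if d.isoweekday() > 5 or d in bank_holidays:
        return "Non-working day"
    return "Working day"

print(day_status(date(2019, 12, 25)))  # Non-working day (a bank holiday on a Wednesday)
print(day_status(date(2019, 12, 27)))  # Working day (an ordinary Friday)
```

In the model, the set membership test is what the ISBLANK(RELATED(...)) pattern achieves via the one-to-one relationship.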
How do you cope when (as is nearly always the case in the real world) a table has two or more dates? For this example, how could you create a visual comparing the month of purchase with the month of payment? There are two ways. One way is to have multiple relationships between the same tables and specify in your measures which one you want to reference:
The second method is to use multiple versions of the calendar table:
Which solution you prefer will tell you a bit about what sort of person you are – think of it as a simple personality test. If you’re the sort of person who likes technology for its own sake, you’ll probably prefer the first solution: you’ll like the fact that you’re not storing the calendar table more than once, and you’ll be prepared to sacrifice a bit of ease of use. If on the other hand, you’re the sort of person who likes technology solely as a means to an end, you’ll probably prefer the second solution. Even though it involves holding multiple copies of the calendar table, the resulting model is easier to work with.
The Wise Owl Recommendation
If you’re interested, I prefer the second method (but then I do work for a training company, so by temperament am likely to want to make things as easy to use as possible). However, it doesn’t make you a bad person if you prefer the first method – just different to me.
Both methods are shown under separate headings below. I’ll begin with the multiple table approach since it’s easier to understand and is probably the one that most people will use.
For this method, start by renaming the first calendar table that you’ve imported. For the model below, the calendar table is linked to the SalesDate column (the date on which a purchase took place), so I’ve renamed the Calendar table to PurchaseCalendar to make it clear what’s going on. I’ve also removed some of the calendar columns to keep the table simple:
If you have the energy, it would probably be a good idea to rename each of the columns in this table too:
Now choose to load another version of the calendar table using your recent sources (the workbook or database from which you loaded the calendar table will be listed here):
Choose to import another version of the calendar table:
Drag this onto the same layout diagram, and create a relationship between the SalesData table and your recently loaded calendar table, but this time using the PaymentDate column as the link field:
Once again, you could now rename this version of the calendar table (and also rename the columns it contains) to make it clearer what’s what. In this example, I’ve also deleted some columns I don’t want:
There are two arguments you could make against this approach: that it wastes memory, and that it clutters up your model. Both are true. But it doesn’t waste that much memory. Each table stores about 1,000 dates, which is peanuts in today’s memory terms. It doesn’t have to clutter up your model if you use different tabs like this (one tab for each table containing multiple dates):
Having loaded all of your calendar tables and created the necessary relationships, you could use your model to create a matrix like this, showing the lag between purchases made and payments received:
Here’s what the fields for this visual look like:
Although the extra table is a bit messy, it does mean that you don’t have to create any additional measures: you can just drag fields into the field well as usual.
The alternative approach is to load one version of the calendar table, but create a second relationship:
There are now two relationships:
You can change which is the active relationship by right-clicking on it and choosing to show its properties:
You can then tick the box to make a relationship the active one, but only after making all of the other relationships inactive first, as the following screenshot explains:
If you’re going to use multiple relationships, it may be a good idea to make all of them inactive. Then you won’t inadvertently create a measure referring to the wrong relationship by mistake:
What you now have to do is to create measures saying for any calculation to which table it should refer. For example, suppose you want to show for each year:
To do this, create two measures using the USERELATIONSHIP
function in each case. Here’s the first one:
Sales by date made =
CALCULATE(
    SUM(Sales[Amount]),
    USERELATIONSHIP(
        'Calendar'[DateKey],
        Sales[SalesDate]
    )
)
And here’s the second:
Sales by date paid =
CALCULATE(
    SUM(Sales[Amount]),
    USERELATIONSHIP(
        'Calendar'[DateKey],
        Sales[PaymentDate]
    )
)
These measures will allow you to show the required figures:
However, there’s no way that I can see to display an aged debtor matrix like the one created for the multiple table approach. It’s also a bit irritating that in the USERELATIONSHIP
function you have to specify the two columns that you’re joining together. It would be better if you could just specify which relationship you’re using. Something like this in fact:
You might also expect that because you are referencing the start column and end column of the relationship, you don’t need the relationship to actually exist, but you’d be wrong, as this error message which appears if you delete the above relationships shows:
And with that mild bit of whingeing, that’s the end of this article!
In this article, you’ve learned how and why you might want to create a calendar table in Power BI, how to use it to report on figures at different levels of granularity, how to add additional aggregator columns to the table and two different ways to cope with the situation where you have more than one date column in the same table. In the next and final article of this series, I’ll show how to use the calendar(s) that you’ve created to show things like year-to-date figures, cross-period comparisons and moving averages.
The post Using Calendars and Dates in Power BI appeared first on Simple Talk.
The post Cracking DAX – the EARLIER and RANKX Functions appeared first on Simple Talk.
If you really want to impress people at DAX dinner parties (you know the sort: where people stand around discussing row and filter context over glasses of wine and vol-au-vents?), you’ll need to learn about the EARLIER
function (and to a lesser extent about the RANKX
function). This article explains what the EARLIER
function is and also gives examples of how to use it. Above all, the article provides further insight into how row and filter context work and the difference between the two concepts.
As for the other articles in this series, the sample dataset consists of four simple tables which you can download from this Excel workbook:
After importing the data, you should see this model diagram (I’ve used Power BI, but the formulae in this article would work equally well in PowerPivot or SSAS Tabular):
One more thing to do – add a calculated column to the country table, giving the total sales for each country:
TotalSales = CALCULATE(SUMX(Sales, [Price]*[Quantity]))
This formula iterates over the Sales table, calculating the sales for each row (the quantity of goods sold, multiplied by the price) and summing the figures obtained. The CALCULATE
function is needed because you must create a filter context for each row so that you’re only considering sales for that country – without it, you would get this:
The first use case is very simple. The aim is to rank the countries by the total sales for each:
DAX doesn’t have an equivalent of the SQL ORDER BY
clause, but you can rank data either by using the RANKX
function (covered later in this article) or by using the EARLIER
function creatively. Here’s what the function needs to do, using country number 3 – India – as an example. Firstly, it creates a row context for this row:
The sales for India are 34.5. What it now must do is to count how many rows there are in the table which have countries whose sales are greater than or equal to India’s sales:
This shows the filtered table of countries, including only those whose sales are at least equal to India’s. There are four rows in this filtered table.
If you perform the same calculations for each of the other countries in the table, you’ll get the ranking order for each (that is, the number of countries whose sales match or exceed each country’s). For those who know SQL, this works in the same way as a correlated subquery. This would imply that it might run slowly, however, here’s what Microsoft have to say on the subject:
Anyone who has driven much in France will have seen this sign at a level-crossing:
What it means is “one train can hide another” … and so it is with row contexts. The measure will open two row contexts, as the diagram below shows:
The problem is that when DAX gets to the inner row context for the FILTER
function, it will no longer be able to see the outer row context (the country you’re currently considering) as this row context will be hidden unless you use the EARLIER
function to refer to the first row context created in the formula.
I’ve explained that the EARLIER
function refers to the original row context created (nearly always for a calculated column). What would you have called this function? I’d be tempted by one of these names:
The multidimensional (cubes) version of Analysis Services gets it about right, using CURRENTMEMBER
and PREVMEMBER
depending on context. It’s interesting to see that you can’t use these words as names of variables in DAX as they are reserved words:
This is true even though they aren’t official DAX function names!
What I definitely wouldn’t use for the function name is something which implied the passage of time. To me, EARLIER
means something which occurred chronologically before something else. I think this is one of the reasons it took me so long to understand the EARLIER
function: it’s just got such an odd name.
Having got that off my chest, here’s the final formula:
SalesOrder =
COUNTROWS(
    FILTER(
        Country,
        [TotalSales] >= EARLIER([TotalSales])
    )
)
Here’s the English (!) translation of this …
“For each country in the countries table, create a row context (this happens automatically for any calculated column). For this row/country being considered, count the number of rows in a separate virtual copy of the table for which the total sales for the country are greater than or equal to the total sales for the country being considered in the original row context”.
What could possibly be confusing about that?
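Stripped of the context mechanics, the calculation is just a correlated count, which is easy to express in Python (the country figures below are hypothetical, apart from India’s 34.5 from the example):

```python
# hypothetical total sales per country (India's 34.5 matches the example)
totals = {"Brazil": 50.0, "China": 40.0, "India": 34.5,
          "France": 60.0, "Ghana": 20.0}

def sales_order(country_totals):
    """For each country, count the countries whose sales are >= its own --
    the same correlated comparison the EARLIER formula performs."""
    return {c: sum(1 for t in country_totals.values() if t >= s)
            for c, s in country_totals.items()}

print(sales_order(totals)["India"])  # 4: France, Brazil, China and India itself
```

Each country is compared against a complete second copy of the table, just as the FILTER call iterates over a virtual copy of Country while EARLIER pins down the outer row.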
Readers will be delighted to know that you don’t have to limit yourself to going back to the previous row context – you can specify how many previous row contexts to return to by varying the number used in the second argument to the EARLIER
function:
You can even use the EARLIEST
function to go to the earliest row context created for a formula. The formula could alternatively have been written like this:
SalesOrder =
COUNTROWS(
    FILTER(
        Country,
        [TotalSales] >= EARLIEST([TotalSales])
    )
)
I find it very hard to believe that anyone would need this function! It could only be useful where you: (1) create a row context; (2) open a nested row context inside it; and then (3) open a third row context inside that, each nested context created by an iterator function (such as FILTER
or SUMX
). In this complicated case, the EARLIER
function would refer to the row context created at step 3, but the EARLIEST
function would refer to the original row context created at step 1.
One of the surprising things about the EARLIER
function is that you often don’t need it. For this example, you could store the total sales for each country in a variable, and reference this instead. The calculated column would read instead:
Sales order using variables =
// create a variable to hold each country's sales
VAR TotalSalesThisCountry = [TotalSales]
// now count how many countries have sales
// which match or exceed this
RETURN
COUNTROWS(
    FILTER(
        Country,
        [TotalSales] >= TotalSalesThisCountry
    )
)
There’s not much doubt that this is easier to understand, but it won’t give you the same insight into how row context works!
The EARLIER
function refers to the previous row context created in a DAX formula. But what happens if there isn’t a previous row context? This is why you will so rarely see the EARLIER
function used in a measure. Here’s what you’ll get if you put the formula in a measure:
The yellow error message explains precisely what the problem is: measures use filter context, not row context, so no outer row context is created.
Another common requirement is calculating the cumulative total sales for countries in alphabetical order:
The above screenshot shows that this calculates the answer separately for each country, but it looks more sensible when you view it in alphabetical order by country name:
The formula for this column calculates for each country the total sales for all countries coming on or before it in alphabetical order:
Running total =
SUMX(
    FILTER(
        Country,
        Country[CountryName] <= EARLIER(Country[CountryName])
    ),
    [TotalSales]
)
Again, you could have alternatively used a variable to hold each country’s name:
Running total using variables =
// store this row's country name
VAR ThisCountry = Country[CountryName]
// return the sum of sales for all countries up to or
// including this one
RETURN
SUMX(
    FILTER(
        Country,
        Country[CountryName] <= ThisCountry
    ),
    [TotalSales]
)
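The running-total pattern is easy to check outside DAX: for each country, sum the sales of every country at or before it alphabetically. A Python sketch with hypothetical figures:

```python
# hypothetical total sales per country, keyed by name
totals = {"Brazil": 50.0, "China": 40.0, "France": 60.0, "India": 34.5}

def running_total(country_totals):
    """Cumulative sales by country name, mirroring the
    FILTER(Country, CountryName <= EARLIER(...)) pattern."""
    return {c: sum(t for name, t in country_totals.items() if name <= c)
            for c in country_totals}

rt = running_total(totals)
print(rt["France"])  # 150.0 = Brazil (50) + China (40) + France (60)
```

As in the DAX version, the comparison is on the country name, so the "total" for each row depends only on alphabetical position, not on the order the rows happen to be stored in.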
Here’s another use of the EARLIER
function – to create group totals. To follow this example, first, add a calculated column in the Sales table to show the name of each product bought (you don’t have to do this, but it will make the example clearer if you use the product name rather than the product id to match sales rows to the product bought):
Product = RELATED('Product'[ProductName])
Here’s the column this should give:
You can now create a second calculated column:
Average product sales =
AVERAGEX(
    // average over the table of sales records for the
    // same product as this one ...
    FILTER(Sales, [Product] = EARLIER(Sales[Product])),
    // ... the total value of the sale
    [Price]*[Quantity]
)
Here’s what this will give for the first few rows in the Sales table. You may want to format the column as shown by modifying the decimal places on the Modeling tab.
The group average for Olly Owl products is 5.40. For this example, for each sales row, the calculated column formula is averaging the value of all sales where the product matches the one for the current row.
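A group average like this is a filter-then-average over rows sharing a key. Sketched in Python with made-up sales values chosen so the Olly Owl average comes out at 5.40, as in the example:

```python
# hypothetical sales rows: (product name, price * quantity)
sales = [("Olly Owl", 5.0), ("Olly Owl", 5.8),
         ("Kevin Koala", 8.0), ("Olly Owl", 5.4)]

def group_average(rows, product):
    """Average the sales value over all rows for the same product,
    like AVERAGEX over FILTER(Sales, [Product] = EARLIER(...))."""
    values = [v for p, v in rows if p == product]
    return sum(values) / len(values)

print(round(group_average(sales, "Olly Owl"), 2))  # 5.4
```

Every Olly Owl row gets the same 5.40, because the filter condition depends only on the product name, not on which row the calculation started from.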
So far, you have seen three relatively simple uses of the EARLIER
function – for the fourth example, I’ll demonstrate a more sophisticated one. Begin by adding a calculated column to the Sales table, giving the value of each transaction (which should equal the number of items bought, multiplied by the price paid for each):
It would be good to categorise the sales into different bands, to allow reporting on different customer bands separately. For example:
One way to do this would be to create a calculated column using a SWITCH
function like this:
Customer type =
SWITCH(
    // try to find an expression which is true
    TRUE(),
    // first test - low value
    [SalesValue] <= 10, "Low value",
    // second test - medium value
    [SalesValue] <= 15, "Medium value",
    // third test - high value
    [SalesValue] <= 20, "High value",
    // otherwise, they are a premium customer
    "Premium customer"
)
This would allow reporting on the different customer types (for example, using this table):
However, this hardcodes the thresholds into the calculated column. What would be much more helpful is if you could import the thresholds from another table, to make a dynamic system (similar to a lookup table in Excel, for those who know these things).
To create a suitable table, in Power BI Desktop choose to enter data in a new table (you could also type it into Excel and load the resulting workbook):
Type in the thresholds that you want to create, and give your table the name Categories:
Note that in your model’s relationship diagram, you will now have a data island (a table not linked to any other). This is intentional:
You can now go to the Sales table and create a calculated column assigning each row to the correct category:
For each row, what you want to do is find the set of rows in the Categories table where two conditions are true:
The lower band for the category is less than the sales for the row; and
The upper band for the category is greater than or equal to the sales for the row.
It should be reasonably obvious that this set of rows will always exist, and will always have exactly one row in it. Because of this, you can use the VALUES
function to return the value of this single row, giving:
Category =
CALCULATE(
    // return the only category which satisfies both of
    // the conditions given
    VALUES(Categories[CategoryName]),
    // this sale must be more than the lower band ...
    Categories[Low] < EARLIER(Sales[SalesValue]),
    // ... and less than or equal to the higher band
    Categories[High] >= EARLIER(Sales[SalesValue])
)
You might at this point wonder why you need the EARLIER
function when you haven’t created a second row context. The answer is that the CALCULATE
function creates a filter context, so you need to tell your formula to refer back to the original row context you first created in the formula.
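The banding itself is just an interval lookup: find the single category whose band contains the sales value. A Python sketch of the same logic, with threshold values shaped like the Categories table (the numbers are illustrative):

```python
# hypothetical (name, low, high) thresholds, matching the Categories table's shape
categories = [
    ("Low value", 0, 10),
    ("Medium value", 10, 15),
    ("High value", 15, 20),
    ("Premium customer", 20, float("inf")),
]

def categorise(sales_value):
    """Return the one category where low < value <= high --
    the pair of conditions the CALCULATE/VALUES formula applies."""
    for name, low, high in categories:
        if low < sales_value <= high:
            return name

print(categorise(12.5))  # Medium value
```

Because the bands don’t overlap and cover every positive value, exactly one row ever matches, which is why VALUES can safely return "the" single category in the DAX version.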
One problem remains: the categories aren’t in the right order (I’m well aware that one way to solve this would be to click on the column heading to change the sort order, but I’m after something more dynamic!):
To get around this, it may at first appear that all you need to do is to designate a sort column:
You could do this by selecting the CategoryName column in the Categories table as above, and on the Modeling tab, setting this to be sorted by the SortOrder column. This looks promising – but doesn’t work (it actually creates an error if you’ve typed in the data for the categories table, which you’ll need to refresh your model to clear). The reason this doesn’t work is that the alternative sort column needs to be in the same table as the Category column.
A solution is to create another calculated column in the Sales table:
Category SortOrder =
CALCULATE(
    // return the sort order for the two conditions
    VALUES(Categories[SortOrder]),
    // this sale must be more than the lower band ...
    Categories[Low] < EARLIER(Sales[SalesValue]),
    // ... and less than or equal to the higher band
    Categories[High] >= EARLIER(Sales[SalesValue])
)
You now have two columns for each sale row – one giving the category, and one giving its corresponding sort order number. You can select the category column and choose to sort it using the sort order column instead (choosing the option shown below on the Power BI Modeling tab):
Bingo! The categories appear in the required order:
If after reading this far you’re still a bit fuzzy on the difference between row and filter context, practice makes perfect! I’ve found reading and re-reading articles like this helps, as each time you get closer to perfecting your understanding (for those in the UK, you could also consider booking onto one of Wise Owl’s Power BI courses).
Out of all the DAX functions, the one which trips me up most often is RANKX
– which is odd, because what it does isn’t particularly complicated. Here’s the syntax of the function:
So the function ranks a table by a given expression. Here is what the arguments mean:
You’ll notice I’ve missed out the Value argument. This is for the good reason that it is a) bizarre and b) not useful. It allows you to rank by one expression, substituting in the value for another for each row in a table. If this sounds like a strange thing to want to do, then I’d agree with you.
To see how the RANKX
function works, return to the Country table in which you created a calculated column at the beginning of this article:
Suppose you now want to order the countries by sales (clearly one way to do this would just be to click on the drop arrow next to the column and choose Sort ascending!). Here’s a calculated column which would do this:
What you should notice is the unusual default ranking order: unlike in SQL (and many other languages), the default order is descending, not ascending. To reverse this, you could specify ASC as a value for the fourth Order argument:
Sales order =
RANKX(
    // rank the rows in the country table ...
    Country,
    // ... by the total sales column ...
    [TotalSales],
    // omitting the third argument
    ,
    // and ranking in ascending order
    ASC
)
So far, so straightforward!
Where things get more complicated is when you use RANKX
in a measure, rather than in a calculated column. Suppose that you have created a table visual like this in a Power BI report:
Suppose further that you want to create and display this measure to show the ranking order:
Sort order measure =
RANKX(
    // order the countries by sales
    Country,
    SUMX(
        Sales,
        [Price]*[Quantity]
    )
)
However, when you add this measure to your table, you get this:
The reason for this is that Power BI evaluates the measure within the filter context of the table. So for the first row (Brazil), for example, here’s what Power BI does:
To get around this, you first need to remove filter context before calculating the ranking order, so that you rank the total sales for each country against all of the other countries' total sales:
However, even this doesn’t solve the problem. The increasingly subtle problem is that the RANKX
function – being an iterator function – creates a row context for each country, but then evaluates the sales to be the same for each country because it doesn’t create a filter context within this row context. To get around this, you need to add a CALCULATE
function to perform context transition from row to filter context within the row context within the measure:
Sort order measure =
RANKX(
    // order across ALL the countries ...
    ALL(Country),
    // by the total sales
    CALCULATE(
        SUMX(
            Sales,
            [Price]*[Quantity]
        )
    )
)
This – finally – produces the correct ranking (note that there is a blank country at the top because some sales don’t have a matching country parent – although this isn’t important for this example):
It’s worth making two points about this. Firstly, this isn’t a formula you’ll often need to use; and secondly, this is about as complicated a DAX formula as you will ever see!
If your ranking produces ties, the default is to SKIP numbers, but you can also use the DENSE
keyword to override this. To see what each keyword means, add this calculated column to the Sales table:
QtyRanking = RANKX(Sales,[Quantity])
What it will do is to order the sales rows by the quantity sold. Here’s what you’ll get if you use Skip, or omit the Ties keyword altogether:
Here, by contrast, is what you’ll get if you use Dense
(I’m not quite sure why you’d ever want to do this):
In the second screenshot, there are no gaps in the numbers.
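The difference between the two tie-handling modes can be sketched in Python (this is an analogy, not DAX; the quantity values below are invented for illustration):

```python
# A minimal sketch of RANKX's Skip vs Dense tie handling.
quantities = [5, 3, 3, 2, 1]  # one (invented) value per sales row

def rank(values, ties="Skip"):
    """Rank values in descending order (RANKX's default direction)."""
    distinct = sorted(set(values), reverse=True)
    if ties == "Dense":
        # Dense: tied values share a rank, and no numbers are skipped
        lookup = {v: i + 1 for i, v in enumerate(distinct)}
    else:
        # Skip: tied values share a rank, then numbering jumps past the tie
        lookup = {v: 1 + sum(1 for x in values if x > v) for v in distinct}
    return [lookup[v] for v in values]

print(rank(quantities))            # Skip:  [1, 2, 2, 4, 5]
print(rank(quantities, "Dense"))   # Dense: [1, 2, 2, 3, 4]
```

With Skip, the two tied rows both rank 2 and the next rank is 4; with Dense, the next rank is 3, leaving no gaps.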
You can apply the EARLIER
function in various contexts to solve modelling problems in DAX, and it should be an essential part of every DAX programmer’s arsenal. The RANKX
function, by contrast, solves a very specific problem – ranking data. What both functions have in common is that they are impossible to understand without a deep understanding of row and filter context, so if you’ve got this far, you can consider yourself a DAX guru!
The post Cracking DAX – the EARLIER and RANKX Functions appeared first on Simple Talk.
The FILTER
function works quite differently from the CALCULATE
function explained in the previous article. It returns a table of the filtered rows, and sometimes it is the better approach to take.
I’ll spend most of this article explaining how to create the following measures:
The columns above show, respectively:
First, I’ll show you how to set up the example. Then I’ll dive into the syntax of the FILTER
function. I’ll finish by highlighting the differences between the FILTER
and CALCULATE
functions.
To work through the examples in this article, you'll need to create a simple Power BI report containing a single table and then create and show a series of measures. Here's a quick run-through of how to get started.
First, create a Power BI report based on the tables used in the previous articles. You can load them either from the SQL Server database given or the Excel workbook. You should now have something like this (if your diagram looks a bit different, you may not have updated your instance of Power BI to include the March 2019 update, which included the new Model View):
Now create a table in Report view to list out the city names. Make sure the Table visualization is selected and click CityName in the Fields list.
Switch to the Home ribbon and select Enter Data. This will add a new table to your report to contain your measures. To understand why you might want to do this, see this previous article in this series:
Give this table a name. Here I’ve called mine All measures:
Click Load. Now add the following measure to your All measures table. You can right-click on the table and choose New measure to do this:
Sales =
SUMX(
    // multiply the price of each transaction by
    // its quantity, and sum the result
    Sales,
    [Price]*[Quantity]
)
Choose to display this measure in your table:
You should now be able to see the total value of sales for each city:
What happens if you want to show the sales for American cities only? Or sales taking place in 2018 only?
The measure for the sales column shown above, giving total sales for each city, is as follows:
Sales = SUMX( Sales, [Price]*[Quantity] )
What this does (as readers of this series of articles will know) is to iterate down the rows in the Sales table, calculating the price multiplied by the quantity for each and summing the result for each city to get this:
To get the sales in 2018, you could use a CALCULATE
function so that this measure would work:
2018 sales using CALCULATE =
CALCULATE(
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    YEAR(Sales[SalesDate]) = 2018
)
This takes the filter context for each city, and further reduces it to consider only those rows where the sales occurred in 2018 to get this:
Another way to solve the problem, however, is to treat the sales for the current filter context as a table and filter it accordingly. Consider the example of sales for New York. Here’s the underlying data for this city:
The total figure for sales for New York for 2018 is 25.98 (18.98 + 7.00). One way to get to this would be to follow these steps when compiling the data for New York. Firstly, get the data for the filter context:
Secondly, filter this data to include only those sales for 2018, by iterating down each row deciding whether to include it. This will include only the shaded area below:
This leaves this table, which is the one whose sales Power BI will sum:
Here’s the formula to accomplish this:
2018 sales =
SUMX(
    FILTER(
        Sales,
        YEAR(Sales[SalesDate])=2018
    ),
    [Price]*[Quantity]
)
This will give exactly the same results as the formula using CALCULATE
above. The CALCULATE
function will run more quickly because it doesn’t have to iterate down each row in the table testing a condition. At this point, you may be asking yourself what the point of the FILTER
function is. I’ll return to this later in this article.
Take a look at how to show total sales for the USA for each city. The Sales, City and Country tables are related as follows:
What’s needed is to iterate down the rows in the sales table, calculating the sales (price times quantity) for each but only where the country name is USA. Here’s the formula to do this:
American sales =
SUMX(
    FILTER(
        Sales,
        RELATED(Country[CountryName])="USA"
    ),
    [Price]*[Quantity]
)
The expression gives these results:
The question is – why use the RELATED
function when the DAX formulae using filter context automatically link tables together? The answer is that within this formula, row context, not filter context, is used. The shaded lines in the formula below iterate over each row in the Sales table returned for the filter context, creating a row context for each:
Because within the shaded bit of the formula DAX has to create a row context for each row in the sales table, it then has to use the RELATED
function to bring in the country name from the Country table.
It’s now time to look at how to combine criteria: how to show sales which happened in 2018 and which took place in the USA. I’ll show in a bit how to do this by nesting one FILTER
function within another, but for now, I’ll show ways to combine criteria. There are two basic ways to do this in DAX – either by using &&
or the AND
function (or if either of two conditions can be true, using ||
or the OR
function).
Here’s a version of the measure using the AND
function:
2018 American sales =
SUMX(
    FILTER(
        Sales,
        AND(
            RELATED(Country[CountryName])="USA",
            YEAR(Sales[SalesDate])=2018
        )
    ),
    [Price]*[Quantity]
)
Here’s the same measure, but using the &&
symbols:
2018 American sales using && =
SUMX(
    FILTER(
        Sales,
        RELATED(Country[CountryName])="USA" &&
        YEAR(Sales[SalesDate])=2018
    ),
    [Price]*[Quantity]
)
Personally, I’d use the AND
(or OR
) functions any time, as they work in the same way as their Excel counterparts, and it’s easier to indent and comment formulae. However, you should use whichever floats your particular boat.
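Whichever syntax you choose, both forms reduce to a single combined row test, as this Python sketch shows (an analogy with invented rows, not DAX):

```python
# Sketch of combining FILTER criteria: AND(...) and the && operator
# both mean "keep the row only if every condition holds".
rows = [
    {"country": "USA", "year": 2018},
    {"country": "USA", "year": 2017},
    {"country": "UK",  "year": 2018},
]

kept = [r for r in rows if r["country"] == "USA" and r["year"] == 2018]
print(len(kept))  # 1
```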
Even more sales have dropped from the figures:
The other way to solve this would have been to nest one function within another:
2018 American sales using nesting =
SUMX(
    FILTER(
        FILTER(
            Sales,
            RELATED(Country[CountryName])="USA"
        ),
        YEAR(Sales[SalesDate])=2018
    ),
    [Price]*[Quantity]
)
Consider what this does for the New York row in the table:
Filter context restricts the data to sales for the current city in question.
The inner FILTER
function iterates over each row in the table of data for the filter context, picking out only the rows where the country is the USA.
The outer FILTER
function then iterates over each row in the table of sales for the USA for the filter context and applies a further constraint that the sales year must be 2018.
Depending on your data, nesting FILTER
functions could speed up processing. If the vast majority of sales were outside the USA, the inner condition could eliminate nearly all rows for each city in the filter context, with the result that Power BI would only need to test the sales date for the few remaining rows.
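The two-sieve behaviour described above can be sketched in Python (an analogy with invented rows, not DAX): the inner filter shrinks the row set before the outer condition is ever tested.

```python
# Sketch of nested FILTERs as successive sieves: the outer condition
# is only evaluated against rows that survive the inner one.
rows = [
    {"country": "USA", "year": 2018, "amount": 18.98},
    {"country": "USA", "year": 2017, "amount": 12.50},
    {"country": "UK",  "year": 2018, "amount": 30.00},
    {"country": "UK",  "year": 2017, "amount": 25.00},
]

inner = [r for r in rows if r["country"] == "USA"]   # 2 rows survive
outer = [r for r in inner if r["year"] == 2018]      # 1 row survives
print(round(sum(r["amount"] for r in outer), 2))  # 18.98
```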
Every example shown so far has taken the set of rows for the current filter context and applied additional constraints to pick out only certain rows. However, you can use the ALL
function when filtering to work with the entire table, rather than just the data for the current filter context. You could use this, for example, to show the percentage contribution of each city’s sales to the grand total:
Here’s the formula for the above measure:
% of all sales =
DIVIDE(
    // calculate the total sales for the current filter context
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    // divide this by total sales for all cities
    SUMX(
        ALL(Sales),
        [Price]*[Quantity]
    )
)
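The numerator/denominator split can be sketched in Python (an analogy, not DAX; the sample rows are invented): the numerator sums only the current city's rows, while the denominator, like ALL(Sales), ignores the filter context entirely.

```python
# Sketch of "% of all sales" for one city.
all_sales = [
    {"city": "New York", "price": 9.49,  "quantity": 2},
    {"city": "New York", "price": 7.00,  "quantity": 1},
    {"city": "London",   "price": 12.00, "quantity": 3},
]

def total(rows):
    return sum(r["price"] * r["quantity"] for r in rows)

city_rows = [r for r in all_sales if r["city"] == "New York"]  # filter context
pct = total(city_rows) / total(all_sales)   # denominator ignores the context
print(f"{pct:.1%}")
```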
Incidentally, if you’re wondering how to get the nice percentage format, just select the measure you’ve created:
You can then set the formatting in the Modeling tab on the menu:
To see the difference between the way in which CALCULATE
and FILTER
filter data, consider this example:
The first measure applies the filter context (so it only calculates sales for the city in question), and applies an additional constraint that the city should be New York:
New York FILTERED =
CALCULATE(
    // work out total sales
    // for the filter context
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    // but whittle the filter context
    // down to show only those cities
    // within it called New York
    FILTER(
        City,
        City[CityName]="New York"
    )
)
The second measure replaces the filter context with a new constraint that the city should be New York, which results in the same figure appearing in every row:
New York CALCULATED =
CALCULATE(
    // work out total sales
    // for the filter context
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    // changing the city criteria
    // so it is New York, not
    // whatever the filter context
    // originally said
    City[CityName]="New York"
)
To make debugging easier, first add a couple of calculated columns to the Sales table, to give the city name and sales year. The formulae are shown below:
Here’s the formula for the City column. It just looks up the name of each city in which sales took place:
Here’s the formula for the Sales year column. It gives the year in which each sale took place:
These two columns will make it easier to check what’s going on when debugging.
The FILTER
function creates virtual tables which, under normal circumstances, you never see, but you can use a tool like DAX Studio to show the rows these virtual tables contain. I’ve covered how to download and use DAX Studio in a previous article in this series, but here’s a quick refresher. When you run DAX Studio, choose to connect to an open Power BI report:
Type in a DAX query in the top right-hand window and press the F5 key to run this. The results will appear beneath it. For the example below, I'm just listing out the contents of the Sales table:
Incidentally, if you’re wondering what those long date table names are, you’re not the only one. I presume they are created behind the scenes to provide the builtin date hierarchy included in the March 2019 update of Power BI.
You can evaluate any table, including one which is returned from a filter function. A good thing to ask might be: which sales were in the United States? You can do this by copying this part of the measure you created earlier:
Precede this with the word EVALUATE
in DAX Studio, and you’ll get this:
Run this to get the following output:
That’s looking good, so now you can repeat this technique with the outer bit of the FILTER
function:
This gives only 3 rows:
From this, it’s easy to see why you get the figures for this measure.
Another way to debug a DAX formula using the FILTER
function (or any other DAX formula, for that matter) is to use variables. I’ve already covered this for scalar variables (ones holding a single value) in the previous article in this series on measures, but did you know a variable can hold an entire table?
Here’s another way to write the nested FILTER
function:
US sales in 2018 =
// create a variable to hold the sales in the USA
VAR UsaSalesTable =
    FILTER(
        Sales,
        RELATED(Country[CountryName])="USA"
    )
// create another variable to filter this to show
// only sales in 2018
VAR UsaSales2018 =
    FILTER(
        UsaSalesTable,
        YEAR(Sales[SalesDate])=2018
    )
// finally, calculate sales for these figures
RETURN
    SUMX(
        UsaSales2018,
        [Price]*[Quantity]
    )
The advantage of breaking the complicated formula down into different parts is that you could then test each in isolation.
I promised I would return to this question: why would you use the FILTER
function when the CALCULATE
function seems to offer a better alternative? There are at least four advantages:
I’ve already shown that it’s easier to debug DAX expressions that use the FILTER
function.
I think expressions using the FILTER
function are easier to understand than equivalent expressions just using CALCULATE
.
Learning the FILTER
function will help you to understand the EARLIER
function, which will be the subject of the next article in this series.
There are some problems which the CALCULATE
function won’t solve (an example follows).
To illustrate the last point, suppose that you want to create a measure showing total sales for cities having two or more purchases. Here are the figures that this should return:
There are no sales recorded for Chicago, LA and Rio in the new measure because they each only witnessed a single sale.
Assume in all of the following that [Number of purchases]
is a measure with this formula:
Number of purchases = COUNTROWS(Sales)
Here’s a measure which you could use to try to solve this problem (although it won’t work):
Sales for multiple purchases =
CALCULATE(
    // calculate total sales where ...
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    // ... the number of purchases is more than 1
    [Number of purchases] > 1
)
If you type in this measure, you’ll see the following error message:
This isn’t a brilliant description of the problem, which is that you can’t use a measure in the filtering part of a CALCULATE
function; you can only refer to columns. You can, however, solve this problem by rewriting it to incorporate a FILTER
function:
Sales for multiple purchases =
CALCULATE(
    // calculate total sales but ...
    SUMX(
        Sales,
        [Price]*[Quantity]
    ),
    // only where the number of
    // purchases is more than 1
    FILTER(
        City,
        [Number of purchases] > 1
    )
)
This will calculate total sales, but only for those cities where the number of purchases was more than 1.
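The per-city test can be sketched in Python (an analogy with invented rows, not DAX): for each city, count its sales rows, keep the city only if the count exceeds 1, then sum sales over the surviving cities.

```python
# Sketch of FILTER(City, [Number of purchases] > 1) followed by SUMX.
sales = [
    {"city": "New York", "amount": 18.98},
    {"city": "New York", "amount": 7.00},
    {"city": "Chicago",  "amount": 5.00},   # single purchase -> excluded
]
cities = {r["city"] for r in sales}

qualifying = {
    c for c in cities
    if sum(1 for r in sales if r["city"] == c) > 1   # the row-count "measure"
}
total = sum(r["amount"] for r in sales if r["city"] in qualifying)
print(qualifying, round(total, 2))  # {'New York'} 25.98
```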
The FILTER function in DAX allows you to iterate down the rows of any table, creating a row context for each and testing whether the row should be included in your calculation. You can combine filters using keywords like AND and OR and also nest one filter within another. The FILTER function allows you to perform some tasks which the CALCULATE function can’t reach, and also (in my opinion) lets you create formulae which are easier to understand.
The post Using the FILTER Function in DAX appeared first on Simple Talk.
The post Using the DAX Calculate and Values Functions appeared first on Simple Talk.
If you should ever start reading a book on DAX, you will quickly reach a chapter on the CALCULATE
function. The book will tell you that the CALCULATE
function is at the heart of everything that you do in DAX and is the key to understanding the language. A delegate on one of my courses adopted the policy of starting every formula with =CALCULATE
, and it’s not such a bad approach! This article explains how to use the CALCULATE
function and also how to use the (almost) equally important VALUES
function.
This article uses the same simple database as its two predecessors. This database shows sales of three toys for different cities around the world:
You can import this data into your own Power BI data model by first downloading this Excel workbook, or by running this SQL script in SQL Server Management Studio.
As for the previous articles in this series, everything I describe below will work just as well in Power BI, PowerPivot or Analysis Services (Tabular Model), each of which Wise Owl train.
To understand the CALCULATE
function, you must understand filter context, so that’s where I’ll begin for this article.
Suppose you have the following pivot table in Excel, showing the number of sales for each country, city, and product in the database. The figure selected shows that there were three sales of Timmy Tortoise products in London (UK):
The filter context for the shaded cell containing the number 3 is therefore as follows:
Country dimension: UK
City dimension: London
Product dimension: Timmy Tortoise
If you were to doubleclick on this cell in Excel, you would see the underlying rows:
These are the three sales which took place for this product in this country and city.
Now suppose that you change your pivot table to show the number of sales as a percentage of the total for each column. This would give:
The figure for Timmy Tortoise for London is 75%, which is:
Total sales for London for Timmy Tortoise / Total sales for London for all products
This gives 75% because this is the result you get when you divide 3 (the number of sales in London for Timmy Tortoise) by 4 (the number of sales in London for all products).
Note that I’ll often refer in this article to the numerator and denominator. In any fraction A / B, the numerator is A and the denominator is B (but you knew that from school maths, didn’t you?).
Now suppose that you want to recreate this pivot table using a matrix and slicer in Power BI:
The figures are exactly the same, and for Timmy Tortoise for London you’ll see 75% because this is the ratio between the number of sales for this product and city (3) against the number of sales for all products and this city (4).
To solve this problem, you’ll use the CALCULATE
function which is the answer to most questions in DAX. The syntax of the function is as follows:
The measure you should create (and show) is this:
% of all products =
DIVIDE(
    // the numerator: number of sales for the current filter context
    COUNT(Sales[SalesId]),
    // the denominator: number of sales for the current filter
    // context, but for ALL products
    CALCULATE(
        COUNT(Sales[SalesId]),
        ALL('Product'[ProductName])
    )
)
I’ve put my measure in a separate table – if you’re not sure how to create this table or how to create measures, see the previous article in this series. What the measure does is to calculate the numerator (the number of sales for the current product and city) and divide this by the denominator (the number of sales for the current city only, with any product constraint removed). Here’s what this calculates:
Total sales for the current filter context / Total sales for the current filter context, but removing any product constraint
If you display row and column totals for this measure, you get this:
The figures in the bottom row make sense: total sales for London for all products divided by total sales for London for all products will always give 100%!
Suppose that you now want to display the number of sales as a percentage of the total for all cities and for all products, to get this:
In this case, the numerator is the total number of sales in the UK in London for Timmy Tortoise, and the denominator is the total number of sales in the UK; the other two constraints have been removed from the denominator. Here is a DAX measure to calculate these figures:
% of all products and cities =
DIVIDE(
    // divide the number of sales ...
    COUNT(Sales[SalesId]),
    // ... by the number of sales for all products and
    // cities
    CALCULATE(
        COUNT(Sales[SalesId]),
        ALL('Product'[ProductName]),
        ALL(City[CityName])
    )
)
You can use the ALL
function as many times as you like – each time it will remove one dimension from the filter context.
An alternative solution to the above problem would be to calculate this ratio:
Total sales for the current filter context / Total sales for the current filter context, but removing every constraint apart from the country one
Here’s a quick comparison of the two approaches:
Here’s a measure which would show each product/city’s contribution to the grand total for each country:
% relaxing everything but country =
DIVIDE(
    // divide the number of sales ...
    COUNT(Sales[SalesId]),
    // ... by the number of sales, keeping only the
    // country constraint
    CALCULATE(
        COUNT(Sales[SalesId]),
        ALLEXCEPT(
            Sales,
            Country[CountryName]
        )
    )
)
It’s up to you whether you think it’s more elegant to remove constraints from the filter context individually using ALL
, or to remove all constraints apart from one using ALLEXCEPT
.
The previous examples have all involved removing the filter context in whole or in part. What if you wanted to change it to show the ratio for each matrix cell between the number of sales for that cell and the number of sales for the same filter context, but for the product Timmy Tortoise? That is, you want to calculate:
Total sales for the current filter context / Total sales for the current filter context, but ignoring any product constraint and using the Timmy Tortoise product instead
For this example, it’s inevitable that the figures for Timmy Tortoise should be 100%, because for each cell in this row you’re dividing a figure by itself. The matrix above shows that sales of Olly Owl were only a third of those for Timmy Tortoise in London but were twice those for Timmy Tortoise in Manchester.
A formula that you could use might be:
% of Timmy =
DIVIDE(
    // divide the number of sales for the filter context by ...
    COUNT(Sales[SalesId]),
    // ... the number of sales for the filter context, but
    // removing any product constraint and replacing this
    // with a constraint that the product should equal Timmy Tortoise
    CALCULATE(
        COUNT(Sales[SalesId]),
        'Product'[ProductName] = "Timmy Tortoise"
    )
)
What this does is to calculate the number of sales for a particular country, city and product, and divide this by the number of sales for the same country and city, but for Timmy Tortoise. The extra filter you add in the CALCULATE
formula doesn’t build on the filter context for the product, but instead replaces it.
Sometimes you’ll want to reference just the selected items in a dimension, in a slicer, for example, rather than include all of the items in your formula. Here’s an example of a matrix where you might want to do this:
The measure shown initially is as follows:
% of all sales =
DIVIDE(
    // divide number of sales for filter context ...
    COUNT(Sales[SalesId]),
    // ... number of sales for all countries
    CALCULATE(
        COUNT(Sales[SalesId]),
        ALL(Country[CountryName])
    )
)
The figures don’t add up to 100% because for each country the statistic shown equals:
the number of sales for that country / the number of sales for all countries
In this example the USA is included in the denominator but not in the numerator. To get the statistic to work, you need to reference only the selected countries in the denominator:
% of selected country sales =
DIVIDE(
    // take the number of sales for each country
    COUNT(Sales[SalesId]),
    // divide this by the number of sales for all
    // currently selected countries
    CALCULATE(
        COUNT(Sales[SalesId]),
        ALLSELECTED(Country[CountryName])
    )
)
This gives the required 100% total, regardless of the combination of countries you select in the slicer:
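The difference between the two denominators can be sketched in Python (an analogy with invented sales counts, not DAX): ALL counts every country's sales, while ALLSELECTED counts only the countries the slicer lets through.

```python
# Sketch of ALL vs ALLSELECTED in the denominator.
sales_count = {"UK": 4, "France": 3, "USA": 5}
selected = ["UK", "France"]           # what the slicer allows through

def pct(country, denominator_countries):
    denom = sum(sales_count[c] for c in denominator_countries)
    return sales_count[country] / denom

# ALL: the visible percentages fall short of 100%
print(round(sum(pct(c, sales_count) for c in selected), 3))
# ALLSELECTED: they sum to exactly 100%
print(round(sum(pct(c, selected) for c in selected), 3))
```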
Before moving on from the CALCULATE
function, it has one more string to its bow. Consider the following two formulae:
Total sales A = SUMX(Sales,[Price]*[Quantity])
Total sales B = CALCULATE(SUMX(Sales,[Price]*[Quantity]))
If you’ve been following up to now, you’ll realise that these two formulae must give the same result under all circumstances:
The first formula gives the total sales value for the current filter context;
The second formula gives the total sales value for the current filter context, with no extra modifications to it.
However … what happens if there isn’t a filter context to begin with? In this case the second formula will create a filter context, and hence return a different answer than the first. How can you not have a filter context? By creating a calculated column in a table:
The first formula gives the same result for each row. Because calculated columns don’t have a filter context by default, the formula sums sales over all of the rows in the sales table, giving the same answer (238.32) for each.
Remember that the second formula is as follows:
Total sales B = CALCULATE(SUMX(Sales,[Price]*[Quantity]))
The CALCULATE
function doesn’t just allow you to change the filter context, it can create it, too. For each country, this creates a filter context, limiting the rows in the sales table to those for the country in question, and hence giving a different answer for each row of the above table. The process of changing row context into filter context in this way is called context transition.
Learning the CALCULATE
function is key to understanding how to create measures in DAX, but the VALUES
function runs it a close second. The rest of this article shows what this function does, and how to use it to create a range of effects in your Power BI reports.
The VALUES
function returns the table of data for the current filter context. To explain what this sentence means, here’s an example. Suppose you create this table in a Power BI report:
Note for this example that I've used a filter (not shown here) on the report to avoid showing any blank countries. The Number of cities column shows the number of cities for each country, using the following measure:
Number of cities = COUNTROWS(VALUES(City[CityName]))
If you could look at the filter context, this is what you would see:
The VALUES
function allows you to return a table containing one or more of the columns in the current filter context’s underlying table. For example, you could create this measure:
Cities = VALUES(City[CityName])
If you display this measure in your Power BI report, you’ll get this error message:
The problem is that you’re trying to display a column of values in a single cell. This would work for Brazil and China, each of which only has one city, but wouldn’t work for the other three countries.
What you could do, however, is to test whether there is only one city for a country, and in this event show its name; otherwise, you could show a message saying that there are multiple cities. Here’s a measure to do this:
Cities =
IF(
    // if there is one city for the current
    // filter context ...
    COUNTROWS(VALUES(City[CityName])) = 1,
    // ... show the city's name
    VALUES(City[CityName]),
    // otherwise, show a message
    "More than one city"
)
Displaying this measure in our report would give:
For the total row there are lots of cities in the current filter context, so naturally you get the More than one city message.
The above example shows two important features of the VALUES
function. The first is that it returns a table of data. In the measure above, the COUNTROWS
function expects to receive a table:
Fortunately, that’s what’s supplied:
The VALUES
function in this case returns a singlecolumn table which looks like this for each of the 5 countries:
The second important point to understand about the VALUES
function is that you can’t put a table into a cell without performing some sort of aggregation on it first, since a table can potentially contain multiple values. What this means is that the selected part of the measure below shouldn’t work:
This is because the VALUES
function returns a column of data, and even though you know that there is only one row in this column, and hence only one value, you would normally still need to apply some aggregation function (e.g., MAX
, MIN
, SUM
) to the data.
Happily, there is one exception to this rule. If a call to the VALUES
function returns a table with one column and one row, you can automatically treat this as a single scalar value without any additional work. This is why this measure works!
Checking whether the filter context only contains one value for a particular column is a common thing to do. It’s so common, in fact, that DAX has a dedicated function called HASONEVALUE
to do this.
You could rewrite the measure like this:
Cities =
IF(
    // if there is one city for the current
    // filter context ...
    HASONEVALUE(City[CityName]),
    // ... show the city's name
    VALUES(City[CityName]),
    // otherwise, show a message
    "More than one city"
)
Another solution would be to count how many distinct city names there are in the current filter context:
Cities =
IF(
    // if there is one city for the current
    // filter context ...
    DISTINCTCOUNT(City[CityName]) = 1,
    // ... show the city's name
    VALUES(City[CityName]),
    // otherwise, show a message
    "More than one city"
)
These three methods, using VALUES
, HASONEVALUE
or DISTINCTCOUNT
, are interchangeable, and I don’t think there’s any clear reason to favour one over another.
For this example, you might want to list out the names of the cities for each country. You can do this using the CONCATENATEX
function, which has this syntax:
The arguments to this function are thus:
For this example, you could modify the measure to read like this:
Cities =
IF(
    // if there is one city for the current
    // filter context ...
    DISTINCTCOUNT(City[CityName]) = 1,
    // ... show the city's name
    VALUES(City[CityName]),
    // otherwise, list all city names
    CONCATENATEX(
        VALUES(City[CityName]),
        City[CityName],
        ",",
        City[CityName],
        ASC
    )
)
This more or less works, since it gives this table:
The only remaining problem is that the total row now looks odd. Technically it is correct, because, for this row, the filter context contains all of the cities for all countries. A better solution would be to check whether there is more than one country in the filter context:
Cities =
IF(
    // if there's only one country in the filter context ...
    HASONEVALUE(Country[CountryName]),
    // ... show the city name or names ...
    IF(
        // if there is one city for the current
        // filter context ...
        DISTINCTCOUNT(City[CityName]) = 1,
        // ... show the city's name
        VALUES(City[CityName]),
        // otherwise, list all city names
        CONCATENATEX(
            VALUES(City[CityName]),
            City[CityName],
            ",",
            City[CityName],
            ASC
        )
    ),
    // ... or otherwise show nothing
    BLANK()
)
This is what you should now see when using this measure in the table:
All of this illustrates an important point about DAX measures. You can create a measure which gives sensible results for one particular visual, but can you be sure that it will give sensible results in another? Or in a totals row? Or a totals column, or grand total? You'll often be faced with a trade-off in DAX between checking that a measure works under all possible circumstances and keeping things simple.
Suppose that you now want to display the total sales for each country apart from the UK. The obvious way to do this is to sum total sales, but using the CALCULATE function to amend the filter context to omit the UK:
Value of sales = CALCULATE(
    // calculate total sales value
    SUMX(
        Sales,
        [Price] * [Quantity]
    ),
    // country not UK
    Country[CountryName] <> "UK"
)
This suffers from one major problem – it doesn’t work! Displaying this measure in a table would show the same value for every country:
To understand why this measure is showing 166.57 for every country, remember what I said earlier in this article: when you apply a filter, it replaces the current filter context for a dimension. For this example you’re adding this filter to the CALCULATE function:
What this does is to lose any existing filter on the country dimension and replace it with one where the country is not the UK. Here’s what’s going on for each country:
The total sales value for all of the countries apart from the UK is 166.57, so that’s what gets displayed in every row. What you want to do is to keep the existing filter context constraints for the country dimension, but then add to them. One way to do this is to use the VALUES function, making the new measure read like this:
Value of sales = CALCULATE(
    // calculate total sales value
    SUMX(
        Sales,
        [Price] * [Quantity]
    ),
    // keep the country filter as it is
    VALUES(Country[CountryName]),
    // and add that the country should not be the UK
    Country[CountryName] <> "UK"
)
This would give the following results:
If you’re wondering why there is a discrepancy between the 166.57 shown in the first table and the 150.37 shown in the second, it’s explained by the fact that I had filtered the table to remove any sales taking place with no assigned country. If you remove this filter you get:
Add this 16.20 of sales back in and you get the required figure.
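As an aside, a more recent alternative to the VALUES trick is the KEEPFILTERS function, which tells CALCULATE to intersect its filter with the existing filter context rather than replacing it. A sketch of the same measure using this approach:

```dax
Value of sales (keep filters) = CALCULATE(
    // calculate total sales value
    SUMX( Sales, [Price] * [Quantity] ),
    // intersect with, rather than replace,
    // any existing country filter
    KEEPFILTERS( Country[CountryName] <> "UK" )
)
```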
This is a clever idea, which allows you to make reports dynamic. The idea is to create a slicer which allows you to choose which measure you want to show. In the example below, someone has chosen to show the average price of sales:
To make this work, first create a table to hold the statistics that you might want to report:
However, don’t link this table to any other. That’s why this technique is often called a “disconnected slicer”. Now create a slicer based upon this table:
The idea is that when you select a statistic in the slicer, the bottom table will show its value. All that you now need to do is to create and show a measure which will yield:
Here’s what this measure might look like:
Statistic =
// first find out what user wants to see (assume one thing chosen)
VAR Choice = SELECTEDVALUE('What to show'[Statistic])
// return different measure according to choice
RETURN IF(
    HASONEVALUE('What to show'[Statistic]),
    SWITCH(
        Choice,
        "Average price", AVERAGE(Sales[Price]),
        "Number of sales", COUNTROWS(Sales),
        "Total sales", SUMX(Sales, [Price] * [Quantity])
    ),
    BLANK()
)
One final question: is it possible to display different statistics using different number formatting? I can’t think of any way to do this except to use the FORMAT function:
Statistic =
// first find out what user wants to see (assume one thing chosen)
VAR Choice = SELECTEDVALUE('What to show'[Statistic])
// return different measure according to choice
RETURN IF(
    HASONEVALUE('What to show'[Statistic]),
    SWITCH(
        Choice,
        "Average price", FORMAT(AVERAGE(Sales[Price]), "0.00"),
        "Number of sales", FORMAT(COUNTROWS(Sales), "#,##0"),
        "Total sales", FORMAT(SUMX(Sales, [Price] * [Quantity]), "#,##0.00")
    ),
    BLANK()
)
The problem with this is that the FORMAT function turns numbers into text, although because it does so only after the calculation is complete for each filter context, this shouldn’t cause too much of a problem. Here’s what you’d see for the number of sales for the above measure, for example:
And just in case you’re wondering, I can’t think of any way to change the column title dynamically!
There’s one more thing to demonstrate with the creative use of the VALUES function: how to show the choices made in a slicer. For the report page below, you’d like a card visual (shown selected) to display a measure listing the countries chosen:
Here are some examples of what the card should display:
If you’ve been following the article so far, there’s nothing new with this – it just combines lots of the ideas you’ve already seen. Here’s a measure which would fit the bill:
Title = IF(
    // if there are countries selected ...
    ISFILTERED(Country[CountryName]),
    // test to see if one, or more than one
    IF(
        HASONEVALUE(Country[CountryName]),
        // there's one country selected; show it (but must use
        // the VALUES function to convert the single column,
        // single row table into a scalar)
        VALUES(Country[CountryName]),
        // otherwise, join the country names together
        CONCATENATEX(
            Country,
            Country[CountryName],
            ",",
            Country[CountryName],
            ASC
        )
    ),
    // if we get here, then user didn't select
    // any countries
    "All countries"
)
Here’s what this would show if you have one country selected assuming that you attach the measure to your card:
If you have multiple countries selected, you’ll see this:
And finally, if you have no countries selected, you’ll see this:
If you want to be even fancier, you could use a quick measure to display only the first 3 countries in any list, which I covered in the previous article in this series.
This article has shown how you can use two of the most important DAX functions: CALCULATE and VALUES. The article began by showing how you can use the CALCULATE function to amend the default filter context, mainly in order to create ratios. I then showed how you can use functions like VALUES, HASONEVALUE and ISFILTERED to produce a variety of clever effects in DAX. The next article in this series will look at the FILTER and EARLIER functions. You should make sure that you understand clearly how you can use the CALCULATE function to change filter context before progressing, since the DAX formulae won’t get any easier!
The post Using the DAX Calculate and Values Functions appeared first on Simple Talk.
A measure is any formula which aggregates data, whether it be counting the number of transactions in a database, summing the sales figures for a region, or working out the highest-earning salesperson for each division of a company. A measure always involves aggregating data in some way. In Power BI (and PowerPivot and SSAS Tabular) you create measures using the DAX formula language.
It takes a while – and a few bites of the cherry – to understand DAX properly (I’ve been teaching Power BI for some years now, and still haven’t come up with a way to make it intuitive to understand). The crucial thing is to comprehend two concepts called filter context and row context. In this article, I’ll explain what measures are, where and how to create them, and I’ll explain the difference between filter and row context in DAX.
This article uses the same simple database as its predecessor. This database shows sales of three toys for different cities around the world:
You can import this data into your own Power BI data model by first downloading this Excel workbook, or by running this SQL script in SQL Server Management Studio.
The easiest way to think of a measure is by reference to a pivot table in Excel – like this one:
The value 4 shown in the coloured box represents the total quantity of sales of the Olly Owl product in Cape Town. If you double-click on the cell, you’ll see the underlying data:
The selected values sum to 4 and represent all the sales for this product (Olly Owl) and this city (Cape Town). This represents the filter context for this cell. The pivot table doesn’t sum all of the sales in this cell – just the ones which are for the product and city for this particular row and column of the pivot table.
The formula in the red box below – summing the quantity of goods sold for each cell in the pivot table – is a measure (Microsoft flirted with the idea of calling it a calculated field instead in Excel 2013, but wisely reverted to using the term measure for Excel 2016):
The underlying formula for this implicit measure – if you could but see it – would be =SUM(Sales[Quantity]).
If you’ve been using Power BI at all, you’ll already have created measures. When you drag a field onto a table, matrix or chart, you create a hidden measure. In the diagram below, someone is about to drag the Quantity column into the Values section of a matrix, which will by default sum the quantity for each product and country:
Adding the Quantity field to the Values section of the matrix would show this ‘measure’:
In Power BI there is no way to see the DAX formula underlying this measure, but believe me, it exists, somewhere hidden away behind the scenes. PowerPivot has an advanced option allowing you to view implicit measures like this, but the Power BI elves don’t want to confuse you by letting you do this.
Before you create a measure, you need somewhere to put it. You can add measures to any table in your data model, but the best solution is to create a separate table to hold only measures. One way to do this is to create a new table by clicking on the Enter Data tool in Power BI:
Leave everything as it is but overtype the name Table1 with your own name. All measures is a good choice to ensure the table appears high up in the list alphabetically:
After clicking Load, you’ll now have a nearly-empty table, in which you can create measures:
Note that you should resist the temptation to remove the useless column Column1 at this stage, since Power BI would then helpfully remove the now-empty table too. As soon as you’ve created at least one measure in your new table, you can delete Column1.
To avoid being too ambitious to start with, begin by creating a basic measure to sum the quantity of goods sold. This example is completely pointless, because I’ve just shown that you can get this figure by dragging the existing [Quantity] column onto a visual, but it is a nice simple example to get you started. First, right-click on your new All measures table, and choose to add a new measure:
There are many other ways to do the same thing, but this seems the easiest. I’ve covered creating new calculated columns in the previous article in this series, and I’ll show what quick measures are towards the end of this article. Here’s what you’ll see after choosing to create a new measure:
You can now type in any measure name and any valid formula. It’s up to you whether you include blank lines and comments. Here is the formula for Total quantity sold:
Total quantity sold =
// sum the quantity column
SUM(Sales[Quantity])
Your new measure will look like this in the formula bar.
After pressing Enter, you can now choose to display your squeaky-clean new measure in your visual:
It would be worrying if this didn’t give the same results as the implicit measure I showed earlier since it’s doing the same thing:
Finally, it would be a good idea to set default formatting for your measure, so that it will look good wherever you show it. To do this first select the measure:
You can now choose an appropriate default format:
Here I’ve said that a comma will appear for sales of more than 999 items.
Here are the main aggregation functions that you can use in DAX:
Function – what it does for a column’s values

AVERAGE, AVERAGEA – Returns the average (arithmetic mean), ignoring or taking account of text entries (see hint below)
COUNT – Returns the number of cells that contain numbers
COUNTA – Returns the number of non-empty cells
COUNTBLANK – Returns the number of blank cells
COUNTROWS – Returns the number of rows in a table
DISTINCTCOUNT – Returns the number of distinct values
GEOMEAN – Returns the geometric mean
MAX, MAXA – Returns the largest value
MEDIAN – Returns the median (the halfway point)
MIN, MINA – Returns the smallest value
PERCENTILE.INC, PERCENTILE.EXC – Two similar functions which return the nth percentile in a set of values
STDEV.P – Returns the standard deviation of the entire population
STDEV.S – Returns the sample standard deviation
SUM – Adds the cells’ values
VAR.P – Returns the variance of the entire population for a column
VAR.S – Returns the sample variance for a column
Suppose now that you want to sum sales, not quantities. One way to do this is to create a calculated column in the underlying table, and sum this column:
Summing the [Sales value] column for the above table would give the correct result. However, the method above has two main disadvantages – it will slow loading data, and it will consume more memory. I’ll explain each of these disadvantages in turn.
When you click on a button to refresh your data, Power BI will do this in two stages (although the nitty-gritty of this is hidden from you):
Processing — Power BI reloads the data for each of the tables in your data model.
Calculation — Power BI builds any calculated columns that you’ve added to tables, among other things.
It’s this second stage which will run more slowly since Power BI will have to reconstruct the [Sales value] column in the [Sales] table, even though you may never use it.
The second disadvantage is that the calculated column will take up more memory – probably much more memory. To see why, suppose you have data like this:
The granularity of the columns is as follows:
[Price] – 3 unique values (2.50, 3.00 and 5.00)
[Quantity] – 3 unique values (1, 2 and 3)
[Sales value] – 7 unique values (2.50, 3.00, 5.00, 7.50, 9.00, 10.00 and 15.00)
Thus, the dictionaries for the calculated column will consume more memory than the original two columns’ dictionaries combined. You can see much more discussion about how DAX uses column storage rather than row storage in the first article in this series.
In summary, there’s a tradeoff between aggregating the values in a calculated column (using functions like SUM) and aggregating the underlying expression (using a function like SUMX). The first method uses more memory but will then run more quickly, while the second method consumes less memory but may run more slowly. There doesn’t seem to be a clear consensus as to which method is better, so I would (as they say in the UK) “suck it and see”.
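The two halves of this tradeoff look like this side by side (the first sketch assumes you have created the [Sales value] calculated column described above):

```dax
// aggregate a pre-computed calculated column:
// uses more memory, but runs quickly
Total sales (column) = SUM( Sales[Sales value] )

// aggregate the expression row by row:
// uses less memory, but may run more slowly
Total sales (expression) = SUMX( Sales, [Price] * [Quantity] )
```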
A solution to the above problem is to sum the expression [Price] * [Quantity] from the Sales table, but the humble SUM function won’t do this, as the IntelliSense below shows (the function is expecting a single column, not an expression):
Instead, you need something which will sum an expression, and for that, you just add an X onto the end of your function name:
Here are the common functions that you can use to aggregate an expression across a table:
Function – what it does for an expression evaluated for each row of a table

AVERAGEX – Returns the average (arithmetic mean) of an expression
COUNTX – Returns the number of values that are numbers
COUNTAX – Returns the number of non-empty values
GEOMEANX – Returns the geometric mean
MAXX – Returns the largest value
MEDIANX – Returns the median (the halfway point)
MINX – Returns the smallest value (named after Minnie the Minx from The Beano)
PERCENTILEX.INC, PERCENTILEX.EXC – Two similar functions which return the nth percentile in a set of values
STDEVX.P – Returns the standard deviation of the entire population
STDEVX.S – Returns the sample standard deviation
SUMX – Adds the expression’s values
VARX.P – Returns the variance of the entire population for a column
VARX.S – Returns the sample variance for a column
In this example, you could create a new measure in the [All measures] table like this:
Total sales =
// sum the product of price and quantity
SUMX(
    Sales,
    [Price] * [Quantity]
)
Why does Power BI have a different library of functions in order to accomplish something which is essentially the same? The answer is that, from the point of view of the DAX database, the two measures are completely different. The first function was this:
SUM(Sales[Quantity])
Consider what this does for this cell containing the number 4:
To calculate this, Power BI works out the filter context for this cell:
It then sums the numbers in the Quantity column for the filter context. Because these numbers are all stored in one place, the calculation is very quick: 1 + 1 + 2 = 4.
Now consider the second measure – the one which sums an expression:
SUMX(Sales, [Price]*[Quantity])
Here’s the equivalent figure for UK sales of Olly Owl products:
To calculate this figure, DAX can’t just sum the value of a column. Instead, it must work its way down the rows for the filter context, multiplying the price for each product by the quantity sold:
This will take much longer. The calculation is:
4.10 * 1 = 4.1
4.40 * 1 = 4.4
4.00 * 2 = 8.0
DAX then sums the results to get 4.10 + 4.40 + 8.00 = 16.50. To do this, it iterates over the rows, creating a row context for each. This is such an important statement that I’m going to spell it out in detail. First DAX creates a row context for the first row in the filter context:
Here’s what DAX can now see:
It multiplies the price by the quantity, stores the total and ends the row context, moving on to the next row in the filter context.
A function which behaves like this, which iterates down all the rows of a table, creating a row context for each row and performing some calculation before going on to the next row, is called an iterator function.
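One way to see the relationship between the two families of functions is that the plain aggregates are effectively shorthand for their iterator equivalents applied to a single column; these two measures return the same result:

```dax
Total quantity sold (plain) = SUM( Sales[Quantity] )
Total quantity sold (iterator) = SUMX( Sales, Sales[Quantity] )
```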
The database for this example contains two price fields. There’s the price at which goods are actually sold in the Sales table:
However, there’s also the list price for each product in the Product table:
A reasonable question to ask is this: for any given cell in a table or matrix, or any given data point in a chart, what is the ratio between the actual sales values of all the goods sold and what the sales value would have been if everything had sold at its list price? The answer should look like this:
The top matrix shows the value of the sales that actually took place; the middle matrix shows what the value of these sales would have been if the list price had been charged for each product in each case; and the bottom matrix shows the ratio between the two values.
To calculate this ratio, first create a measure to sum actual sales:
Discounted sales values = SUMX(
    // from the sales table, sum the
    // price * the quantity for each row
    Sales,
    [Price] * [Quantity]
)
Now create another measure which will sum the product of two figures:
The sales quantity from the Sales table; and
The product’s list price from the Product table.
Here’s what this formula could look like:
Undiscounted sales value = SUMX(
    Sales,
    // multiply the product's list price times
    // the quantity sold
    RELATED('Product'[ListPrice]) * [Quantity]
)
Why is the RELATED function needed to look up the list price for each product from a separate table? You’ve already seen that measures create filter context – you don’t have to specify how the underlying tables are linked together, as this is done automatically in DAX. However, the SUMX function is an iterator function which creates a row context for each row of the specified table (in this case Sales). Although the [Undiscounted sales value] measure above doesn’t need to cross-reference different tables, the SUMX function within it does, since all each row knows about by default is the columns within that table:
To find out for any row what the list price was, you need to pull in a value from another table for this row context, and for that, you need to use the RELATED function.
The final measure just divides one measure by another:
% of full value = DIVIDE( [Discounted sales values], [Undiscounted sales value] )
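DIVIDE is worth using here rather than the / operator because it copes with division by zero: an optional third argument specifies the value to return in that case (BLANK() if omitted). A sketch:

```dax
% of full value (safe) = DIVIDE(
    [Discounted sales values],    // numerator
    [Undiscounted sales value],   // denominator
    0                             // optional: returned if the denominator is zero
)
```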
Or, if you prefer not to use any intermediate measures, you could do everything in one formula:
% of full value (version 2) = DIVIDE(
    SUMX(
        // from the sales table, sum
        // the price * the quantity for each row
        Sales,
        [Price] * [Quantity]
    ),
    SUMX(
        Sales,
        // multiply the product's list price times
        // the quantity sold
        RELATED('Product'[ListPrice]) * [Quantity]
    )
)
The above allinone formula is getting a bit complicated. Here it is again in colour:
What it’s doing for each cell of a matrix or table, or for each data point in a visual, is as follows:
Calculating one number (total sales for the filter context) – call this A;
Calculating a second number (total sales for the filter context, but using the product’s list price) – call this B;
Dividing the first number by the second number.
You could make this formula easier to read by dividing this into separate stages, using variables to hold the value of the numbers calculated along the way. For this example, you’ll create the following two variables:
DiscountPriceSales – to hold the value of A; and
ListPriceSales – to hold the value of B.
Then divide one variable by the other to get the answer. The syntax for creating variables in a DAX measure is as follows:
MeasureName =
VAR Variable1 = expression
…
VAR VariableN = expression
RETURN expression
You can declare as few or as many variables as you like in a measure, but you must finish up by returning a value (every measure must calculate a single value for each filter context).
Given the above, here’s what the measure could look like using variables:
% of full value 3 =
VAR DiscountPriceSales =
    SUMX(
        // from the sales table, sum
        // the price * the quantity for each row
        Sales,
        [Price] * [Quantity]
    )
VAR ListPriceSales =
    SUMX(
        Sales,
        // multiply the product's list price times
        // the quantity sold
        RELATED('Product'[ListPrice]) * [Quantity]
    )
RETURN
    DiscountPriceSales / ListPriceSales
It’s important to realise that the two variables retain their value only while each measure is being calculated. A matrix displaying the above two measures might show this after formatting as a percent:
DAX will generate different values of the DiscountPriceSales and ListPriceSales variables for each cell in the filter context. Those who are experienced in programming in other languages should note that there’s no such thing as a public, static or global variable in DAX – at least, not yet.
Variables are worth using in their own right since they break complicated formulae up into smaller, more manageable chunks. However, they also have another advantage – they allow you to debug code. Suppose you think you have a problem with the formula above (the version using variables). You could comment out some lines to experiment:
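A sketch of what this might look like, with the second variable and the original RETURN expression commented out:

```dax
% of full value 3 =
VAR DiscountPriceSales =
    SUMX(
        Sales,
        [Price] * [Quantity]
    )
// VAR ListPriceSales =
//     SUMX(
//         Sales,
//         RELATED('Product'[ListPrice]) * [Quantity]
//     )
RETURN
    DiscountPriceSales
    // DiscountPriceSales / ListPriceSales
```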
The above example will just show the value of the first variable, but when you’re happy that this is working OK, you could change things to show the value of the second variable:
When you’re happy about this too, you could reinstate the original formula.
I’d like to show how I managed to produce many of the code diagrams above quickly, using a free standalone DAX editing tool called DAX Studio. You can download DAX Studio here (one of a few possible download sites, actually), and then install it in the standard Windows way.
To use DAX Studio, first create a Power BI Desktop report – I’ve saved the one I’m working on as Variables example.pbix:
When you run DAX Studio, you can choose to connect to this data model:
As the above diagram shows, you can link directly to a Power BI or SSAS Tabular data model. To link to a model in PowerPivot, you should install the Excel addin option for DAX Studio which is out of scope for this article.
Here’s a quick rundown of some of the things you can do in DAX Studio. Firstly, it gives you great colourcoding:
Secondly, you can zoom in and out by holding down the Ctrl key and using your mouse wheel, or by using this dropdown (although Power BI has just introduced this feature, at long last):
Thirdly, you can drag table and column names from the model’s Metadata on the left into the formula window:
This example would give the following:
And fourthly, you can easily comment out or comment back in blocks of code which is how I produced my variable measure so quickly. To do this just select the block of text that you want to comment out, or back in, then click on the appropriate tool in the DAX Studio ribbon:
Here’s what this would give:
However, the single best thing about DAX Studio – and the reason I use it extensively – is a very simple one. When you press the ENTER key in DAX Studio, it adds a new line rather than assuming that you’ve finished creating your formula and hence trying to validate it.
DAX Studio does have one big drawback, however – you can’t use it to test a measure. After having painstakingly created a formula, it’s then up to you to select the text which comprises the formula, and then copy this back into Power BI Desktop. It’s worth noting that when you’re writing DAX queries, as shown in this article by Robert Sheldon, the exact opposite is true – you can run the DAX queries in DAX Studio, but not within Power BI Desktop.
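For the avoidance of doubt, a DAX query is a statement beginning with EVALUATE which returns a table. Either of these sketches could be run in DAX Studio against the example model, but not typed into Power BI Desktop’s formula bar:

```dax
// return the whole Sales table
EVALUATE Sales

// return a one-row table showing a measure's value
EVALUATE ROW( "Total sales", [Total sales] )
```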
Future articles in this series are going to take you a long way down the murky rabbithole that is DAX, but it’s worth mentioning that you don’t actually have to write a single formula yourself – you could let Power BI do it for you, using a builtin wizard called Quick Measures. Opinion in Wise Owl is deeply divided on this subject (it’s the Marmite of the DAX world, although I’m not sure that a UK cultural reference like this will travel well!). Some of our trainers love quick measures; I confess I don’t. On the plus side, they allow you to create complicated formulae very quickly, and without any typing. My objections to them are threefold:
They won’t help you to learn DAX; in fact, they might do exactly the opposite, as the formulae that they create can be quite off-putting;
Like all wizards, they generate overcomplicated solutions; and
It’s not always obvious which calculation you should choose to use.
Using a quick measure is thus very similar to recording a macro in VBA: you’ll save yourself lots of typing, but the resulting code may be hard to understand, and it may well be written in a way which is often more complicated than a human would choose.
However, why not judge all this for yourself? Take a look at a couple of reasonably typical case studies of using quick measures: one to show the difference between each city’s sales and the sales figure for London, and one to show the chosen values for a slicer in a card.
For the first one, suppose you have a column chart showing the quantity of goods sold for each city:
You now want to show this figure relative to the figure for London, to produce this:
To do this, first create a quick measure by rightclicking on any table and choosing to create a new quick measure (although it makes sense to put your quick measures – just like your normal measures – into a dedicated table such as the All measures table we’ve been using):
Choose the calculation you want to perform – in this case, it’s to show the difference from a filtered value, although I don’t think this is that obvious!
Now drag the field in that you want to aggregate, summing the quantity of goods sold:
Finally, drag the field that you want to filter by from the fields on the right onto the Filter section of the dialog box to get this:
Power BI creates a new measure, automatically giving it a reasonably sensible name:
The measure generated – like many quick measures – makes copious use of variables. It probably won’t make a great deal of sense to you at the moment, as you haven’t yet seen the all-important CALCULATE function in this series of articles:
Quantity difference from London =
VAR __BASELINE_VALUE =
    CALCULATE(
        SUM('Sales'[Quantity]),
        'City'[CityName] IN { "London" }
    )
VAR __MEASURE_VALUE = SUM('Sales'[Quantity])
RETURN
    IF(
        NOT ISBLANK(__MEASURE_VALUE),
        __MEASURE_VALUE - __BASELINE_VALUE
    )
For the second example, suppose you have a slicer by country:
You want to show the countries selected in a card:
To do this, you could create a quick measure using the same method as above and choose to concatenate the country values selected. You have to scroll right down to the bottom of the list of calculation options to find this quick measure:
You could now drag the CountryName field from the list of fields on the right onto your formula:
The number of values before truncation determines how many countries you’ll need to select before Power BI stops listing them, showing “etc.” instead. For example, if you leave this as the default value 3 – as above – and choose all of the countries, here’s what you’ll see:
Here’s what the measure generated for this example looks like:
List of CountryName values =
VAR __DISTINCT_VALUES_COUNT = DISTINCTCOUNT('Country'[CountryName])
VAR __MAX_VALUES_TO_SHOW = 3
RETURN
    IF(
        __DISTINCT_VALUES_COUNT > __MAX_VALUES_TO_SHOW,
        CONCATENATE(
            CONCATENATEX(
                TOPN(
                    __MAX_VALUES_TO_SHOW,
                    VALUES('Country'[CountryName]),
                    'Country'[CountryName],
                    ASC
                ),
                'Country'[CountryName],
                ", ",
                'Country'[CountryName],
                ASC
            ),
            ", etc."
        ),
        CONCATENATEX(
            VALUES('Country'[CountryName]),
            'Country'[CountryName],
            ", ",
            'Country'[CountryName],
            ASC
        )
    )
This is pretty hardcore DAX and probably won’t make any sense at all at the moment. It’s basically doing the same thing twice – once for the case where the number of countries chosen is more than 3, and once for when it’s 3 or less.
In this article, you’ve seen that the best place to put measures that you create is in a separate table. You’ve seen that measures always involve aggregation, whether this be for a single column using functions like SUM, AVERAGE and COUNT, or for an expression using the same functions with an X suffix. These last functions are called iterator functions and create a row context for every row in the table to which they refer. I then showed how you can create and use variables, and how you can use DAX Studio to edit your measures. Finally, I finished with two case studies of how to create quick measures to avoid typing any DAX in at all. In the next article in the series, I’ll show how you can use the CALCULATE function to change the filter context, and I’ll even explain what that sentence means!
The post Creating Measures Using DAX appeared first on Simple Talk.
The post Creating Calculated Columns Using DAX appeared first on Simple Talk.
DAX is Microsoft’s new(ish) language which allows you to return results from data stored using the xVelocity database engine, which, unlike most databases, stores data in columns rather than rows. You can program in DAX within Power BI (Microsoft’s flagship BI tool), PowerPivot (an Excel add-in which allows you to create pivot tables based on multiple tables) and Analysis Services Tabular Model (the successor to SSAS Multidimensional, which allows you to share data models and implement security).
As the demand for these technologies increases, I have been teaching these topics (Power BI, PowerPivot and SSAS Tabular, as well as DAX) to countless students in the UK. In the hope that I can reach even more students, I decided to write this series of articles for the great readers of Simple Talk.
This article shows how to create a calculated column using DAX. If you’re not sure what I mean by this, here’s an example from Power BI:
I’ve used Power BI for all my examples, but the methods and formulae used would be identical in PowerPivot or SSAS Tabular.
The examples in this article are based on this simple database, containing sales of fluffy toys:
You can follow all the examples on your computer – just download this Excel workbook containing these four tables or run this SQL Server script to generate the database. You’ll then need to import the tables into a Power BI report, PowerPivot data model, or SSAS Tabular model.
Let’s start with the basics. Suppose you want to calculate the value of each sale in the database by multiplying the price by the quantity. For the first item in the Sales table, this should give 9.49, since someone bought 1 item on 10th February 2017 at a price of 9.49.
To do this, create a formula which multiplies the price by the quantity for each row. Start by adding a column. Here’s how to do this in Power BI:
Now rename the column. Call it Sales value. You don’t need to be afraid of using spaces in column names in Power BI; they work perfectly:
You can now either click on the first column to which you want to refer or type in a square bracket ( [ ) symbol to bring up a list of columns:
It’s enough to type in a single P in this instance:
You can now press the TAB key to add the selected item into your formula, then continue typing to complete your formula:
Calculated columns aren’t the only way to show the sales value for each transaction – there are at least three alternatives. Firstly, you could add the column to the underlying data source, for example by creating a view in SQL like the one below:
A second way to avoid using calculated columns would be to do the calculation using the M formula language in the Query Editor (for SSAS Tabular this is only possible for SQL Server 2017 and later):
And a third way is not to add the column at all, but to create it on the fly in a measure. Here’s an example of a measure to do this. I’ll cover measures in a later article in this series, but the comments should make it reasonably obvious what this is doing:
The obvious question is: which of these three methods is best? I think it’s easiest to create the calculated column in DAX, but doing so will slow down data processing and eat up memory. How so? When you refresh your data in a data model, Power BI, PowerPivot or SSAS Tabular divides the process (forgive the pun) into two steps:
The more calculated columns you have, the slower processing may be. The column will also have a significant effect on the memory you use since it has much higher cardinality than either of the two columns it references. Here are the values for the price, quantity and sales amount columns for every sales transaction in the database:
The number of unique values in each of the three columns is as follows:
All programs using DAX store information in columns, not rows – an important point which I’ll keep coming back to in this series of articles. When importing the data, Power BI will construct a dictionary of unique values for each column, and the above table shows that the dictionary for the sales value column will occupy more space than the dictionaries for the other two columns put together. This won’t be a problem with 25 rows, or even with 25,000, but with 25 million or even 25 billion rows, it might start using up valuable memory. Weighed against this is the fact that it will be convenient to have access to a ready-calculated sales column.
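To make the trade-off concrete, here is a small Python sketch using hypothetical figures (three distinct prices and five distinct quantities – not the actual sample data): the derived sales-value column needs a bigger dictionary than the two columns it is built from put together.

```python
from itertools import product

# hypothetical price and quantity columns (not the sample database)
prices = [9.49, 12.99, 14.50]
quantities = [1, 2, 3, 4, 5]

# one sales row for every price/quantity combination
rows = list(product(prices, quantities))
sales_values = [round(p * q, 2) for p, q in rows]

# dictionary sizes the engine would need for each column
print(len(set(prices)))        # 3 unique prices
print(len(set(quantities)))    # 5 unique quantities
print(len(set(sales_values)))  # 15 unique sales values
```

With 15 rows there are 15 distinct sales values, against only 3 + 5 = 8 dictionary entries for the two source columns – and the gap only grows as the data does.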
If you’re interested in how DAX stores data in columns I’ve included some more detailed examples at the end of this article.
In general, it’s best practice in DAX to refer to columns using the TableName[ColumnName] syntax – so the formula would become:

Sales value = Sales[Price] * Sales[Quantity]
It’s helpful sometimes to type in a single apostrophe character to bring up a list of all the table names in a data model:
The apostrophes are essential for any table which has a name which is also a reserved word in Power BI, PowerPivot or SSAS Tabular. The following calculated column wouldn’t work without the apostrophes, since Product is a reserved word in Power BI:

Discounted list price = 0.90 * 'Product'[ListPrice]
The Power BI, PowerPivot, and SSAS Tabular editors have all improved greatly over the years and versions (they needed to), but they can still be annoying at times. I have two particular bugbears:
Because of these inconveniences, I often write my DAX formulae in DAX Studio, a standalone editor which you can download here. It’s reasonably intuitive to use: the main thing to do when you run the application is to connect to a data model:
You can choose the Power BI or SSAS Tabular model you want to connect to, provided that you have a Power BI file or SSAS Tabular model open. To use DAX Studio with PowerPivot, you must install the DAX Studio add-in, and then run this from within Excel.
You can now type in your DAX formulae, including dragging field or table names from the metadata shown on the left into your formula:
Eventually, you’ll have to copy this formula into Power BI, but at least it gives you a nice editing environment.
DAX contains the same IF function as Excel, which allows you to test whether a condition is true or not, and return different values in either case:
Suppose that you want to deliver a verdict on each sale – if it costs more than 10 units (let’s say they’re dollars, for the sake of argument), you’ll call it expensive. Otherwise you’ll call it cheap. Here’s a formula which would do this:
Verdict = IF(
    // if the sales amount is more than 10 ...
    [Sales value] > 10,
    // ... then show EXPENSIVE ...
    "Expensive",
    // ... otherwise, show CHEAP
    "Cheap"
)
This new column would correctly distinguish between sales of more than or less than 10 dollars:
It’s time now to complicate things – what happens if you want to show a verdict on each sale with the following criteria?
You could do this by nesting IF functions within each other, but the results would be difficult to read. A better solution is to use the SWITCH function, a version of which also exists in the latest versions of Excel, although many people aren’t aware of this.
The first argument (the first bit of information the function asks for) is an expression. This can be anything, but it is typically TRUE(), meaning that SWITCH will keep evaluating the possibilities which follow until it finds one which is true, at which point it will immediately stop. This is easier to understand if you look at the suggested formula:
Better Verdict = SWITCH(
    // we're looking for something which is true
    TRUE(),
    // start at the bottom - is this item's sales value less
    // than or equal to 5? If so, it's cheap
    [Sales value] <= 5, "Cheap",
    // if we reach here, it wasn't less than or equal to 5, so
    // maybe it was greater than 5 but less than 10?
    [Sales value] < 10, "Cheapish",
    // OK, that failed too - maybe it was between 10 and 15?
    [Sales value] < 15, "Middling",
    // if we get here, the sales value must be 15 or more
    "Expensive"
)
You can have as many pairs of conditions and values as you like. You don’t have to include the ELSE value at the end, although it’s highly recommended, since you may have inadvertently omitted some values from your tests.
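For anyone more at home in a general-purpose language, the SWITCH(TRUE(), …) pattern is simply a first-match chain. This Python sketch mirrors the Better Verdict logic (purely illustrative – DAX doesn’t generate anything like this):

```python
def verdict(sales_value):
    # mirror SWITCH(TRUE(), ...): test each condition in order and
    # return the result paired with the first condition that is true
    bands = [
        (sales_value <= 5, "Cheap"),
        (sales_value < 10, "Cheapish"),
        (sales_value < 15, "Middling"),
    ]
    for condition_is_true, result in bands:
        if condition_is_true:
            return result
    return "Expensive"  # the ELSE value, reached when every test fails

print(verdict(9.49))  # Cheapish
```

The final return is the ELSE value: it only fires when every earlier test has failed, which is exactly how the last argument of SWITCH behaves.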
So far Excel users are probably wondering what’s different about DAX. You are about to find out! Suppose that you import all the worksheets in the supplied Excel workbook apart from the last one called TempSales:
Your data model should now look like this:
This makes for a messy interface, as a user must figure out from which table to get fields:
It would be better for the user if you could combine the fields into a single table like this:
There are two steps to accomplish this. The first is to hide tables and columns from report view, which you can do by right-clicking on them. This doesn’t remove them from the data, however.
The other part of the magic is to add calculated columns to show the city, country, and product for each sale:
To do this, use the RELATED function:
The RELATED function will look up the value of any column in any other table, provided that there is a direct path between the two tables and that the question makes sense. For example, you could look up, for any sale, the name of the country in which it took place, because each sale belongs to a single city, and each city belongs to a single country. It doesn’t make sense to go in the other direction and look up a sale for each country, because this wouldn’t be uniquely defined – the question doesn’t make sense.
Another way to look at this is to check, for each relationship, whether it is one-to-many or many-to-one. I’ll cover the special case of one-to-one relationships later in this series of articles.
Using the above relationship diagram, you can see that one country has many cities and one city has many sales, so it’s legitimate to show the country where each sales transaction took place. The RELATED function is like the VLOOKUP function in Excel, with the difference being that you can daisy-chain between tables, provided that they are joined by the correct type of relationship.
Here, for example, is the formula to show the country for each sale:
Country = RELATED(
    // look up the country name from
    // the country table (indirectly
    // linked via the city table)
    Country[CountryName]
)
And here’s what the final sales table might look like after adding the Product and City columns as well with the same technique:
Note that there’s a problem with sales in Cape Town, which don’t seem to belong to any country – I’ll show how to resolve this further down in this article.
The previous calculated columns show the parent for any child; the RELATEDTABLE function allows you to show the children for any parent. The difference is that whereas the RELATED function will always return a single value, the RELATEDTABLE function will always return a table of data.
Here’s an example to show, for each country, how many sales took place within it:
Number of sales = COUNTROWS(
    // count how many rows there are in the
    // table of sales for this country
    RELATEDTABLE(Sales)
)
The formula would return these values in this example:
If you’re a SQL programmer, you’ll be familiar with the perils of including null values in your formulae, since any formula which has a null value as one of its inputs tends to spit out a null value as its result. The equivalent of a null value in DAX is BLANK. You can test whether a column equals blank in one of two ways – either by comparing it directly with BLANK or by using the ISBLANK function.
Review the example showing the country in which each sale took place. The results show a blank next to any sale in Cape Town:
Cape Town belongs to country number 6:
There is no country number 6 in the database, which is what has caused the problem with Cape Town:
It’s not very good database design practice, but it is a very convenient way to illustrate how blanks work. The problem is that the RELATED function looks across to the country table and returns the country in which each sale takes place. For sales in Cape Town, the function can’t find a related country, and so just shows a blank.
This looks suspicious, so change it to show the message “Country not found.” One way to do this is by testing to see whether the value returned from the RELATED function is a blank:
Country = IF(
    // if there's no country found for this sale ...
    RELATED(Country[CountryName]) = BLANK(),
    // ... show a suitable message ...
    "Country not found",
    // ... otherwise, show the country found
    RELATED(Country[CountryName])
)
Alternatively, you could use the ISBLANK function:
CountryISBLANK = IF(
    // if there's no country found for this sale ...
    ISBLANK(RELATED(Country[CountryName])),
    // ... show a suitable message ...
    "Country not found",
    // ... otherwise, show the country found
    RELATED(Country[CountryName])
)
In either case, the results are much better!
Perhaps an even better way to solve this problem is to create an intermediate column which you can then hide from view:
This will make the formula simpler since you won’t need to repeat the RELATED function. The formula for the intermediate country column above could be:
Intermediate country =
    // ... show the country found (may be blank)
    RELATED(Country[CountryName])
The formula for the final country column could then refer to the intermediate column:
Country = IF(
    // if the country returned is blank ...
    ISBLANK([Intermediate country]),
    // ... show suitable message ...
    "Country not found",
    // ... otherwise, show country name
    [Intermediate country]
)
The files referenced at the top of this article also include a separate table called TempSales which looks like this:
For some reason, Los Angeles has managed to record sales despite having no stores. How this happened is beyond the scope of this article (although it looks suspiciously like an attempt to create an example illustrating error-handling in DAX).
If you create a calculated column showing sales-per-store within this table, you will get an error:
The reason there’s an infinity sign next to Los Angeles is that this row contains a divide-by-zero error. If you divide total sales for Los Angeles of 4.5 by the number of stores, 0, you get infinity. There are several possible solutions to this: don’t let the error happen in the first place, let it happen but trap it, or specifically watch out for divide-by-zero errors and handle them where they occur.
You could use the IF function to test if the denominator is zero, thereby preventing the error in the first place. For those who have forgotten their schooldays:
In the fraction 22/7, 22 is the numerator and 7 the denominator.
Using this method would give a calculated column like this:
Sales per store = IF(
    // if no stores, show blank
    [NumberStores] = 0,
    BLANK(),
    // otherwise, do the division
    [TotalSales] / [NumberStores]
)
This method would give the following results (as would the other two methods used below):
The second method for solving this problem is to use the IFERROR function, which works in the same way as it does in Excel. The first argument is the thing which may contain an error, and the second one is what you want to display if it does:
For this case, you could enter the following DAX formula for the calculated column:
Errortrapping = IFERROR(
    // try dividing sales by number of stores
    [TotalSales] / [NumberStores],
    // if this fails, show blank
    BLANK()
)
The final method you could use is the DIVIDE function, which, as its name suggests, divides one number by another. The difference is that it automatically returns a value that you specify if the division gives a divide-by-zero error. The syntax of the function is:

DIVIDE(<numerator>, <denominator> [, <alternateresult>])
In this case, you could use the following formula:
Divide by zero trap = DIVIDE(
    // divide the sales by the number of stores
    [TotalSales],
    [NumberStores]
    // could specify what to return if denominator
    // is zero, but this will default to blank,
    // which is what we want
)
Which of the three methods is best? I think a purist would answer the first one, because it prevents the error from happening in the first place, but I’d go with the last one. If you’re going to trap an error, it’s good to be as specific as possible about the nature of the error you’re trapping; IFERROR is a bit vague.
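The point about specificity is easy to demonstrate outside DAX. In this Python analogy (the function names are my own and have nothing to do with DAX internals), the vague handler also swallows an unrelated bug – a text value where a number was expected – while the specific handler only hides genuine division errors:

```python
def divide_specific(numerator, denominator):
    # analogous to DIVIDE: trap only the divide-by-zero case
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return None  # stand-in for DAX's BLANK()

def divide_vague(numerator, denominator):
    # analogous to IFERROR: trap any error at all
    try:
        return numerator / denominator
    except Exception:
        return None

print(divide_specific(4.5, 0))  # None - the error we meant to trap
print(divide_vague("4.5", 3))   # None - an unrelated type bug, silently hidden
```

With divide_specific, passing the text value raises an error immediately, so the real bug is surfaced rather than masked.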
To end this article, I’d like to expand a bit on how DAX stores data in columns, not rows. Understanding this will help you make sense of DAX formulae when they get more complicated (which they certainly will!).
To explain column storage, I’m going to take a quick digression into image compression. As my youngest daughter says: “Bear with!”.
Here’s a picture of a house (and a very nice one it is too):
Notice how the sky is the same shade of blue nearly everywhere. To store this image, you wouldn’t have to store every pixel; instead, you could use an image compression algorithm to condense everything down, using the fact that lots of adjacent pixels share the same colour. An image which compresses down to a small size tends to be one which would be hard to do as a jigsaw!
The above example is a bit complicated, so here’s an easier one:
Not quite as nice a house, but much easier to use for an explanation! Here’s how to store this as a compressed image. First, build a dictionary of colours:
Now (starting from the top left) create a table showing how the colours are used:
By my calculation, this table will have 34 rows. Instead of storing 100 different cell/colour combinations, I’m storing three colours and 34 rows, reducing the size of the image by a factor of three. Real image compression algorithms are much more sophisticated than this, but the principle is the same.
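The run-length idea described above can be sketched in a few lines of Python (a toy version for illustration only – real image codecs are far more sophisticated, as noted):

```python
def rle_encode(pixels):
    # collapse consecutive repeats into [value, run_length] pairs
    runs = []
    for pixel in pixels:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs

# a 10-pixel row of the toy house image: mostly sky
row = ["blue"] * 7 + ["red"] * 3
print(rle_encode(row))  # [['blue', 7], ['red', 3]]
```

Ten cell/colour combinations become two short runs – and the more uniform the image, the fewer runs you need.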
What has all this got to do with column storage? Well, the principle is identical. Here’s what you might think the sales table would look like when storing each row as a separate entity:
Here’s what the table actually looks like to DAX. Each column is stored separately, although not necessarily using different colours!
The engine which stores data for PowerPivot, Power BI and SSAS Tabular is called the xVelocity engine (at least, it has been called that since SQL Server 2012). This database engine would store each column above separately. To show how this works, consider the ProductId column, which starts like this:
The first thing the database engine would do is to build a dictionary of unique values:
There are only three rows because, although there are 25 sales, these are all for one or another of the same three products. The engine would then store, for each sale, the relevant dictionary entry rather than the product id itself.
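Those two steps – build a dictionary of unique values, then replace each value with its dictionary index – can be sketched in Python. This is purely illustrative; the real xVelocity engine’s encoding is far more sophisticated:

```python
def dict_encode(column):
    # step 1: dictionary of unique values, in order of first appearance
    dictionary = []
    positions = {}
    codes = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        # step 2: store the dictionary index instead of the value
        codes.append(positions[value])
    return dictionary, codes

# 25 hypothetical sales, but only 3 distinct products
product_ids = [1, 2, 1, 3, 2, 1, 1, 3, 2, 1, 3, 3, 1,
               2, 1, 1, 2, 3, 1, 2, 1, 3, 2, 1, 1]
dictionary, codes = dict_encode(product_ids)
print(dictionary)  # [1, 2, 3] - three dictionary entries
print(len(codes))  # 25 small integer codes, one per sale
```

Decoding is just a lookup of each code in the dictionary, which is why scanning a whole column stays fast.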
This storage method allows the xVelocity engine to compress data so that it takes up less memory. The storage method also allows DAX to access all the values in a column more quickly. It’s one reason why aggregating data over a column is so quick in DAX: all the figures you’re aggregating are stored in the same place. It’s worth bearing all of this in mind when you’re writing DAX formulae and reading the rest of this series.
One of the consequences of the above storage algorithm is that the database engine will store columns with low cardinality (i.e., with lots of repeated values) much more efficiently. In the diagram below, I’ve changed the colours to reflect how expensive each column is to store:
The SalesId column is the most expensive item to store because it has no duplicate values at all. It’s the primary key in the Sales table, so that’s unique by definition. The best thing to do to the data model would be to delete this column since it’s not used in any reports, and it’s not needed for any of the relationships.
Unfortunately, you probably do need all the other columns in this table. The product id and city id fields are needed to link tables together, the price and quantity are needed to create measures, and the sales date is used to report figures by month, quarter and year.
This article has shown that you can create formulae to calculate columns within a table in Power BI, PowerPivot or SSAS Tabular. Although these formulae use the DAX language, they look remarkably similar to formulae used within Excel. You’ve also seen that DAX stores data in columns rather than rows, which has important implications. Every calculated column that you create will need to be stored separately in memory. The next article in this series will look at the flipside to calculated columns – measures – and explain the two most important concepts in DAX, which are row context and filter context.
The post Creating Calculated Columns Using DAX appeared first on Simple Talk.
]]>The post Moving Data From Excel to SQL Server – 10 Steps to Follow appeared first on Simple Talk.
]]>You need to create a package to import data from an Excel workbook into a SQL Server table, using SQL Server Integration Services as a tool. What could possibly go wrong? Well … plenty, as it turns out. The aim of this article is to help people to avoid all the pitfalls that I fell into when first learning SSIS.
This article uses SSIS 2012, the latest version at the time of writing, but you won’t notice many differences if you’re using 2008 or 2005. The article assumes that you’re using SQL Server Data Tools – Business Intelligence within Visual Studio 2012 to create SSIS packages: Business Intelligence Development Studio (BIDS) was used until SQL Server 2012. Do Microsoft have a whole department devoted to thinking up misleading names for software?
Let’s suppose that you want to import an Excel workbook of purchases into a SQL Server table (you can download this workbook at the bottom of the article):
You could use Excel to manually delete the top two title rows and bottom two blank rows to make life easier, but this would be kind of cheating (and also kind of pointless, since whatever application produced the Excel workbook of purchases would just recreate the unwanted rows next time you ran it). To quote the words of Caiaphas in the musical Jesus Christ Superstar: “we need a more permanent solution to our problem”.
Before you begin, make sure that you’ve closed your Excel workbook down. If you run any SSIS package to import data from an Excel workbook which is open, you will get some horribly misleading error messages.
Before you can play about with data (sorry: extract, transform and load it), you need a project to do it in. Go into SQL Server Data Tools or Visual Studio, then choose to create a new project:
You may be able to miss out this step if you’ve just gone into SSIS for the first time.
At the top of the dialog box which appears, make sure you create the right sort of project:
Choose to create a business intelligence SSIS project.
You can then give your project an exciting name (at least, more exciting than Integration Services Project1, which is what I’ve used!):
Choose a name and location for your new package.
SSIS will now create a new project, and also (by default) a new package too, imaginatively called Package.dtsx. There are two ways you can see that this is what’s happened. One is that you can see the package in Solution Explorer:
The other clue that SSIS has created a package for you is that it’s staring you in the face!
By default you are put in Control Flow view, which is like a flow diagram showing the sequence in which tasks that you create will execute.
Before you continue, you need to make sure that you’ve created a connection to your SQL Server database. To do this, first right-click on the ‘Connection Managers’ part of Solution Explorer:
Right-click to create a new connection manager.
The most efficient way to link to SQL Server is using an OLE DB connection manager:
Choose to add an OLE DB connection manager.
Now click on the ‘New…‘ button to create your new connection manager:
Create a new connection manager as above.
Choose your server, authentication method and database on the next screen, then select ‘OK’ twice to see your new connection manager listed:
It makes sense to create this connection manager for the entire project, since it’s likely you’ll use the same connection in other packages within the same project.
I haven’t shown any more details about this here for two reasons: the settings will be different on your machine, and anyone reading this article is likely to have created connections many other times in many other software applications!
You can’t import data into a nonexistent table, so the next thing we’ll do is to create the table shown below. We could do this manually within SQL Server Management Studio, but we’re aiming for an automated solution which we can run time and time again, so instead we’ll create the table as part of our SSIS package.
Our table will look something like this: we’ll import the item name, price and quantity, but the purchase id will be generated automatically. As for the total in column E of our spreadsheet – we’ll just choose not to import that, since it can be recreated by multiplying the Price and Quantity columns at any time.
To create the table, first double-click (or click and drag) on the ‘Execute SQL’ task to add a task to the control flow which should be visible on screen (we want to create the shell table within this task):
This task will run some SQL to remove any existing purchases table, and create a new one.
I tend to give my tasks long, descriptive names (geeks may prefer to use shorter meaningless names!):
You can also add sort-of comments to packages using something called annotations:
You can right-click to add an annotation to your package – they appear like Post-it notes:
Anyway, returning to the main story, you can now edit your Execute SQL task:
The easiest way to edit any SSIS task is to double-click on its icon, although you can also right-click on the task and choose ‘Edit…’ as above.
In the dialog box which appears, choose to connect to your database, using the connection manager that you’ve just created at project level:
You can use a projectlevel connection manager in any package.
You can now enter the SQLStatement property, specifying the SQL that SSIS should run for this task. Here’s what I’ve used for this article.
IF EXISTS (SELECT 1 FROM information_schema.tables
           WHERE table_name LIKE 'tblPurchase')
    DROP TABLE tblPurchase

-- create a table to hold purchase ledger items
CREATE TABLE tblPurchase(
    PurchaseId int PRIMARY KEY IDENTITY(1,1),
    ItemName varchar(50),
    Price float,
    Quantity int
)
This will first delete any table called tblPurchase which already exists, and then create a new, empty one. The PurchaseId column is an identity column, which will automatically take the values 1, 2, 3, etc. Here’s what the Execute SQL task dialog box now looks like:
It’s time now to test that this works by running your single-task package:
Right-click on the package name in Solution Explorer and choose to execute it, as shown here. SSIS will save your package automatically before executing it.
If all goes well, you should see this:
The green tick means things went well!
If your package doesn’t run at this point, you may be trying to run it on a 64-bit computer. The default mode in SSIS on a 64-bit SQL Server installation is 64-bit, in which case you have to change the mode specifically to run a package. I don’t want to clutter this article up with an explanation of how to do this, so please refer to this article instead.
You should now have a table, which you can view in SQL Server Management Studio if you should so wish:
You now need to stop the package running:
Select the menu option (or press the keystroke above) to stop your package running, and wave goodbye to the green ticks for now!
It’s now time to create the data flow tasks – although first we need to create an Excel connection.
Before you can import data from an Excel workbook, you will need to create a connection to it. You should probably create this connection within your package, as it’s likely to be a one-off (you won’t need to use the same connection in any other package):
Right-click in the ‘Connection Managers’ section of your package, and choose to create a new connection.
Note that you could alternatively use the ‘Source Assistant‘ to do this, but I always like to do things explicitly:
You can now choose to create an Excel connection:
Browse to your Excel workbook and choose it:
Leave the ‘First row has column names’ option ticked.
When you select ‘OK‘, you should see your Excel connection:
You could rename this connection manager, but we’ll leave it as it is.
It’s time now to start the real work! We want to add a data flow task to the control flow tab of your package. This new data flow task should import data from the Excel workbook into the SQL Server table (although as we’ll see, things can go awry at this stage).
Add a ‘Data Flow‘ task to your package, and rename it to say ‘Import data‘ (as above).
You now need to get the two tasks shown to run in sequence: first create a new table to hold your purchases, and then import the data into it. To do this, click on the first task and drag its arrow onto the second. This arrow is called a precedence constraint.
You can now double-click on the data flow task to edit what it does – we’ll spend the rest of this article in the data flow tab of SSIS:
Data has to come from somewhere, and in our case it’s from Excel:
Drag an Excel Source from the SSIS toolbox onto your empty data flow window (here we’ve also then renamed it).
You can now double-click on this source to edit it, and tell SSIS where it should get its data from:
SSIS will automatically guess the connection manager if you only have one Excel connection manager for this package/project, but you’ll still need to choose the worksheet name (as shown above).
It’s a good idea now to preview your data, by clicking on the ‘Preview…‘ button:
We’ve got obvious problems with our first two and last two rows, but we can solve these by losing any rows for which the first column is null – which we’ll do shortly, using a conditional split transform.
It’s a good idea now to rename all of the columns, so that you know what they refer to:
Click on the ‘Columns‘ tab (as shown above), then give the output columns better names, as we’ve done here.
When you select ‘OK‘, you should have an Excel source with no errors shown for it:
Now to do something with this data!
The next thing we want to do is to divert all of the purchases with nulls into… nowhere, really! To do this, add a conditional split transform to your data flow:
Add a Conditional Split as above (here we’ve also renamed it, to ‘Lose the nulls’), and direct the output (or “data flow path”, if you want the technically correct name) from the Excel source into it.
You can now double-click on the ‘Conditional Split’ task to configure it. We’ll set up two flows out of it:
Data where the id column is null will go down a pipe called ‘Case 1‘ (which we won’t actually connect to anything); while
All other data will flow down a pipe called ‘OK Data’.
Here’s how to set this up:
Set up an output (called ‘Case 1‘ by default) which tests the condition that the Id column is null. You can drag the ISNULL function and Id column down into the ‘Condition‘ box to avoid having to type them in.
At the bottom of this dialog box you can type in a name for your default output:
Here we’ve called the default output ‘OK Data’.
We should be getting near the end of our journey now – all that we should need to do is to send the good data into our purchases table. Here’s how to do this:
Add an OLE DB destination (as shown above) – here we’ve renamed ours as‘Purchases table‘.
You can now drag the green arrow from the ‘Lose the nulls‘ transform into the Purchases table destination:
When you release the arrow, you’ll be asked which output you’re choosing: ‘Case 1‘ or ‘OK Data’ (the two outputs from the conditional split). Choose ‘OK Data’.
Having mapped data into the purchases table, it’s now time to configure it. Double-click on the ‘Purchases table’ destination to edit it:
Firstly, choose the connection manager to use (although you probably won’t have to do this, as SSIS will assign it automatically if you’ve only got the one), and the table to target.
You can now choose which columns from Excel to map onto which columns in the SQL Server table:
Be warned – the Item column will cause a problem soon… !
Here’s what you’ll be left looking at when you choose ‘OK‘:
There’s a problem with the ‘Purchases table’ destination.
If you mouse over the red circle, you’ll see what the problem is:
The problem is that Excel uses Unicode data, and we’ve created a varchar column in SQL Server.
When you’re creating columns in SQL Server, you can use either nvarchar or varchar for variable length strings:
The varchar data type uses half the number of bytes that nvarchar does, because it can’t store extended characters. We could have used nvarchar and avoided this problem!
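You can see the size difference directly. SQL Server stores nvarchar as UTF-16 (two bytes per character for ordinary Latin text), while varchar uses a single-byte code page; the Python check below uses cp1252, a typical ANSI code page, as a stand-in:

```python
text = "Teddy bear"  # a typical item name

utf16_bytes = text.encode("utf-16-le")  # how an nvarchar column stores it
ansi_bytes = text.encode("cp1252")      # how a varchar column stores it

print(len(utf16_bytes))  # 20 bytes
print(len(ansi_bytes))   # 10 bytes
```

For plain Latin text, the nvarchar representation is exactly twice the size of the varchar one.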
The benefit of using Unicode is that it allows you to store international characters: currently over 110,000 different characters from over 100 scripts, according to Wikipedia.
However, we used varchar, so we need to convert our Excel Unicode characters into normal ones. To do this we can use a ‘data conversion‘ task. First, however, we need to break the link we’ve created:
Right-click on the link between the transformation and the destination and delete it.
You can now add in a ‘data conversion‘ task:
Here we’ve added a ‘Data Conversion‘ task (shown selected on the left), and renamed it to ‘Turn Unicode into varchar‘. The next thing is to pipe our data into it:
Pipe the ‘OK Data’ from the conditional split transform into this further data conversion transform.
You can now double-click on the ‘Turn Unicode into varchar’ data conversion task, and say what it should do:
Here we’ve chosen to create a new column called ItemVarchar, which takes the Item column and turns it into a non-Unicode string using the default ANSI code page.
I’ve also changed the length to 50 characters at this point. This means that strings longer than 50 characters will be truncated, giving rise to truncation errors. Dealing with these is beyond the scope of this article – for now, it’s sufficient to note that none of the purchase descriptions in our example is long enough to cause us any worries.
Nearly there! You can now take the output from this data conversion task and feed it into the Purchases table destination:
We’ve still got an error, as we haven’t redone the column mappings for the destination.
You can now double-click on the Purchases table destination to configure the column mappings:
Choose to map the newly derived ItemVarchar column onto the ItemName column in the SQL Server table.
All of your errors should now have disappeared, and you can run your package!
The final step is to import your data by executing the package:
Right-click on the package in Solution Explorer to execute it (wish we’d renamed it …).
Here’s what the data flow should look like:
You should now have 5 purchases in your tblPurchase table:
OK, it would have been quicker to type them in on this occasion, but you’ve now got a package which you can run every month-end, and which will work whether there are 5 purchases or 500,000.
Integration Services is just one of those software applications which is a joy to use. I hope this has encouraged you to use it to automate moving data around in your company. There’s nothing quite so satisfying as seeing the green ticks appear next to all of the tasks in your packages when you run them!
The post Moving Data From Excel to SQL Server – 10 Steps to Follow appeared first on Simple Talk.
Visual Basic is a better programming language than Visual C#. Who says so? This article! Here are 10 reasons why you should always choose VB over C#.
This is a quotation from Gertrude Stein’s 1922 play Geography and Plays. However, the poetry wouldn’t work in C#, because – unforgivably – it’s a cASeSeNSitIvE language. This is madness!
Before I start ranting, let me just acknowledge that case-sensitivity confers one (and only one) advantage – it makes it easier to name private and public properties:
// private version of variable
private string name = null;

public string Name
{
    get { return name; }
    set { name = value; }
}

Writing properties like this means that you can refer to the public Name property, and it’s obvious what the private equivalent will be called (name).
So now we’ve got that out of the way: case-sensitive programming languages make everything else harder. Why, you ask?
The only possible benefit is that you can use more combinations of variable names, that is, you can use more of one of the few infinite resources in this universe…
It doesn’t matter if you disagree with everything else in this article: case-sensitivity alone is sufficient reason to ditch C#!
Both VB and C# contain a way of testing mutually exclusive possibilities, the Select Case and Switch clauses respectively. Only one of them works properly.
A Visual Basic Select Case clause, returning a description of how old someone is. The age range for a young person is a tad generous, reflecting the age of the author of this article.
Select Case AgeEntered
    Case Is < 18
        txtVerdict.Text = "child"
    Case Is < 50
        txtVerdict.Text = "young person"
    Case Is < 65
        txtVerdict.Text = "middle-aged"
    Case Else
        txtVerdict.Text = "elderly"
End Select
You can’t do this using Switch in C#, as – astonishingly – it can’t handle relational operators. You have to use an If / Else If clause instead. But even if you could, you’d still have to type in lots of unnecessary Break statements:
switch (AgeThreshold)
{
    case 18:
        txtVerdict.Text = "child";
        break;
    case 50:
        txtVerdict.Text = "young person";
        break;
    case 65:
        txtVerdict.Text = "middle-aged";
        break;
    default:
        txtVerdict.Text = "elderly";
        break;
}

It’s easy to forget to type in each of these Break statements! 
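The If / Else If workaround is easy enough. Here it is sketched in Python (standing in for the C# if / else if chain – the cascading relational tests are the point, not the language):

```python
def age_verdict(age):
    # the cascading relational tests that C#'s switch (circa 2010) can't express
    if age < 18:
        return "child"
    elif age < 50:
        return "young person"
    elif age < 65:
        return "middle-aged"
    else:
        return "elderly"

print(age_verdict(40))  # young person
```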
This is specific to Visual Studio (I’m using 2010, the latest version). Suppose I want to attach code to anything but the default Click event of a typical button:
Let’s suppose that I want to attach code to the MouseHover event of this button. 
I can do this in Visual Basic without leaving the code window:
a) First choose the object from the drop-down list.
b) Then choose the event you want to code.
In C# you can’t do this – you have to return to the button’s properties window and choose to show its events:
You can double-click to attach code to this event for the selected button – but that’s the only simple way to create it in C#.
But it’s even worse than that. If you then rename a control (in this case btnApply), you have to re-associate the event handler with the renamed control in the properties window (or in the initialisation code, if you can find it). In Visual Basic, of course, you can do all of this in code:
Private Sub btnApply_Click(ByVal sender As System.Object,
        ByVal e As System.EventArgs) Handles btnApply.Click
    MessageBox.Show("Hello")
End Sub
Globally change btnApply to the new button’s name in code, and everything will work as before. 
C# was written by academics. It shows. Consider this table of C# symbols and their VB equivalents:
What you’re trying to do  C# Symbol  VB Equivalent 
Test if two conditions are both true  &&  And
Test if one or other condition is true  ||  Or
Test if a condition is not true  !  Not
Concatenate two strings of text  +  &
Test if a condition is true within an if statement  ==  =
Which column looks like it was designed by a real person?
IntelliSense works much better for Visual Basic than for Visual C#. Take just one example – creating a write-only property. Let’s start with Visual Basic:
When you press return at the line end… 

WriteOnly Property PersonName As String
    Set(value As String)

    End Set
End Property
… You get this fully-completed clause.
For C#, the same thing doesn’t happen:
When you press return here, nothing happens (other than a blank line appearing). 
This is just one example. I’ve just spent ages transcribing our VB courses into C#, and believe me, there are many, many more!
Here are a couple of functions I use from time to time in VB:
Function  What it does 
IsNumeric  Tests if a value can be converted to a number 
Pmt  Calculates the periodic payment for a loan (a mortgage, say)
Great functions, but they don’t exist in C#.
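Neither is hard to recreate, of course. Here’s a rough Python sketch of both (my own approximations, not the VB implementations – in particular, VB’s Pmt takes extra optional arguments this version omits):

```python
def is_numeric(value):
    """Roughly VB's IsNumeric: can the value be converted to a number?"""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def pmt(rate, nper, pv):
    """Roughly VB's Pmt: the payment per period for a loan of pv,
    at interest `rate` per period, over nper periods (negative = money out)."""
    if rate == 0:
        return -pv / nper
    return -pv * rate / (1 - (1 + rate) ** -nper)

print(is_numeric("42.5"), is_numeric("forty-two"))
# monthly payment on a 200,000 loan at 5% over 30 years
print(round(pmt(0.05 / 12, 360, 200_000), 2))
```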
Why do I have to end every statement in C# with a semicolon? The argument used to be that it avoided the need to use continuation characters in Visual Basic:
MessageBox.Show( _
    text:="This article is a bit opinionated", _
    caption:="Message")
You used to have to use an underscore as a continuation character to show incomplete lines of code in VB. 
However, as of Visual Basic 2010 you rarely need to do this any more. Come on, C#: Visual Basic has ditched its line-continuation character; why can’t you ditch your line-ending one? (;)
Someone commented on my original (much shorter) blog about this:
“In a short amount of time you’ll type those semicolons without thinking about it (I even type them when programming in visual basic).”
That’s like saying that you’ll soon get used to not having any matches to light your fire, and you’ll even start rubbing sticks together to start a fire when you finally manage to buy a box!
The order of words in a C# variable declaration is wrong. When you introduce someone, you wouldn’t say, “This is my friend who’s a solicitor; he’s called Bob”. So why do you say:
string PersonName = "Bob";
To me:
Dim PersonName As String = "Bob"
…is much more logical. I also find the C# method of having to prefix arguments with the word out confusing, particularly as you have to do it both in the called and calling routine.
C# is a much fussier language than Visual Basic (even if you turn Option Strict on in Visual Basic, this is still true). “And a good thing, too!”, I hear you cry. Well, maybe. Consider this Visual Basic code:
Enum AgeBand
    Child = 18
    Young = 30
    MiddleAged = 60
    SeniorCitizen = 90
End Enum

Select Case Age
    Case Is < AgeBand.Child
        MessageBox.Show("Child")
    Case Else
        MessageBox.Show("Adult")
End Select
With Option Strict turned on this shouldn’t really work, as it’s comparing an integer with an enumeration – but VB has the common sense to realise what you want to do. 
The equivalent in Visual C# doesn’t work:
A less forgiving language…
What this means is that you end up having to fill your code with messy type conversions:
The simplest way of converting an enumeration to an integer; but why should you have to? 
// find out the age entered
int Age = Convert.ToInt32(txtAge.Text);

if (Age < (int) AgeBand.Child)
{
    MessageBox.Show("Child");
}
else
{
    MessageBox.Show("Adult");
}
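For what it’s worth, VB isn’t alone in being forgiving here. Python’s IntEnum (shown purely as a comparison, not a C# fix) lets an enumeration member compare directly with a plain integer:

```python
from enum import IntEnum

class AgeBand(IntEnum):
    CHILD = 18
    YOUNG = 30
    MIDDLE_AGED = 60
    SENIOR_CITIZEN = 90

age = 12

# an IntEnum member compares with a plain int - no cast required
verdict = "Child" if age < AgeBand.CHILD else "Adult"
print(verdict)  # Child
```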
If you want to dynamically change the length of an array in Visual Basic, you can use ReDim Preserve. To do the same thing in Visual C#, you have to copy the array, add a value and then copy it back:
The vile, clunky C# method of extending an array. 
string[] PartyGuests = new string[2];
PartyGuests[0] = "Sarah Palin";
PartyGuests[1] = "Richard Dawkins";

// whoops! forgot to invite Mitt
// create a new extended array
string[] tempGuests = new string[PartyGuests.Length + 1];

// copy all of the elements from the old array into the new one
Array.Copy(PartyGuests, tempGuests, PartyGuests.Length);

// add Mitt as the last element
tempGuests[PartyGuests.Length] = "Mitt Romney";

// restore the full list into the original array
PartyGuests = tempGuests;

// check it works
foreach (string Guest in PartyGuests)
{
    System.Diagnostics.Debug.Write(Guest);
}

This epitomises Visual C# for me. Critics will tell me that I could have used Array.Resize, or a List<string> instead of a plain array.
That’s hardly the point! The point is that – as so often – Visual Basic makes something easier to code than C# does.
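Most languages make growing a collection a single call. In Python, for instance (again just a comparison sketch, not a C# fix):

```python
party_guests = ["Sarah Palin", "Richard Dawkins"]

# whoops! forgot to invite Mitt - one call fixes it
party_guests.append("Mitt Romney")

print(party_guests)
```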
So those are my 10 reasons to code in Visual Basic. What are you waiting for, all you C# code-monkeys? Convert all of your code to VB – you have nothing to lose but your semicolons!
The post 10 Reasons Why Visual Basic is Better Than C# appeared first on Simple Talk.