Monday, January 17, 2011

DAX Time Intelligence Functions

Time intelligence functions are an advanced feature of DAX. They are typically composite functions built on top of other more basic DAX features and have more stringent requirements for the structure of the underlying database. But since time related calculations are fundamental to almost all BI projects, early adopters of DAX started using the time intelligence functions almost immediately after the first release of DAX. Due to the advanced nature of these functions, questions naturally arise about how they work and why they work that way even when people managed to get the desired results.
Marius Dumitru, architect of AS engine team, gave the following recipe to users of time intelligence functions.
1.            Never use the datetime column from the fact table in time functions.
2.            Always create a separate Time table with contiguous dates (i.e. without missing day gaps in the date values).
3.            Create relationships between fact tables and the Time table.
4.            Make sure that relationships are based on a datetime column (and NOT based on another artificial key column).
5.            Make sure you have full years’ worth of data in the Time table. For instance, even though your transaction table may only have sales data up to May 2010, the Time table should have dates up to December 2010 (same goes for fiscal years).
6.            The datetime column in the Time table should be at day granularity (without fractions of a day).
In this post, I’ll expose more details behind the implementation of time intelligence functions to shed light on the rationale behind Marius’ advice. This post is meant to supplement online documentation of SQL Server 2008 R2. I will not cover basic usage examples of time intelligence functions in typical BI applications. DAX is a young functional language that evolves rapidly. What is covered here applies to SQL Server 2008 R2.
I assume you are already familiar with advanced DAX concepts such as row context, filter context, Calculate function, Values function, measures, etc. I’ll start with some common features across all time intelligence functions before I delve into individual functions.
The <dates> argument
All time intelligence functions have a special <dates> argument. FirstNonBlank and LastNonBlank operate on non-date/time columns but the conversion rules descibed below still apply. With the exception of DatesBetween and DatesInPeriod, the <dates> argument can be one of three forms:
1.       A reference to a date/time column.
2.       A table expression that returns a single column of date/time values.
3.       A boolean expression that defines a single-column table of date/time values.
Internally, all three forms are converted to a DAX table expression in the following fashion:
Format number
<dates> format
Internal table expression
1
T[C]
CalculateTable(Values(T[C]))
2
Single column table expression
As is
3
Well-formed scalar expression
Filter(All(T[C]), <scalar expression>)


What is a well-formed scalar expression? Vaguely speaking, it is a scalar expression that DAX engine can analyze and extract a fully qualified column reference to a date/time column. T[C] = Date(2011, 1, 17) is a well-formed expression as DAX engine can extract T[C] from it. T[C] * 2 is a well-formed expression for the same reason even though it is not a boolean expression DAX engine will cast it to boolean data type implicitly. Date(2011, 1, 17) is not a well-formed scalar expression as no column reference can be extracted from it.
Note that internally inserted CalculateTable in the first format automatically converts current row contexts to filter contexts. So
MaxX(Distinct(DimDate[CalendarYear]), OpeningBalanceYear([m], DimDate[Datekey]))
will give the maximum opening balance among all years, but
                MaxX(Distinct(DimDate[CalendarYear]), OpeningBalanceYear([m], Values(DimDate[Datekey])))
will not as the <dates> argument is in the second format hence not correlated to the current calendar year on the row context.
A special rule regarding datetime filter inside Calculate/CalculateTable
If a Calculate filter has a unique column that is of data type date/time, all previous filters on all columns from the table which contains this date/time column are removed. This hacky feature implies that
                Calculate(<expression>, <TI function>) = Calculate(<expression>, <TI function>, All(DateTable)).
This is the reason behind Marius’ recommendation #4. Let’s say you have some years on the pivot-table row and then you drag a measure which uses time-intelligence function DateAdd to show sales from the previous year. The author of the measure formula may not realize that DateAdd function only returns a single column of Datekey which overwrites existing filter on the same column. This special rule makes sure that the filter on the CalendarYear column, which comes from the pivot-table, is also removed so you get back the expected result. Without this special rule, Calculate(<expression>, <TI function>) would set days of the previous year on the Datekey column but leave the previous year as the filter on the CalendarYear column. The conflicting filters would have produced a blank result.
In PowerPivot v1, the only way to mark a column as unique is to create an incoming relationship to this column, hence Marius’ recommendation #3. You obviously also need a separate Date table in order to create a relationship.
In practice, people want to use integer keys to create relationship between Fact table and Date table. If you want to use time intelligence functions in that case, you must add All(DateTable) filter yourself to the measure expression.
In future versions of PowerPivot, users may be able to mark a date/time column as unique without creating an incoming relationship to the column. When that happens, recommendation #3 would no longer be needed.
Calendar
DAX engine creates an internal calendar data structure for each date/time column used in the time intelligence functions. The calendar contains a hierarchy of year-quarter-month-day. The minimum year and the maximum year in the calendar are determined by the actual dates in the date/time column. Internally, intermediate results are represented as arrays of days. So to represent January 2011, intermediate result would hold the 31 days in that month. Although a calendar contains all days in the years involved, it marks which days actually exist in the column. Calendar member functions return result days only if they are marked as exists. So if you have missing dates in the date/time column, they cannot be returned as a result of time intelligence functions.
When a calendar data structure is initially populated, it raises an error if two values from the date/time column correspond to the same day. So if you have a date/time column below the day granularity, it cannot be used in time intelligence functions.
Obviously calendar is an internal implementation detail which is subject to change in future releases of DAX. I mention it here to explain some of the limitations imposed on the current version of time intelligence functions. I’ll refer to calendar again when I discuss individual functions next.
Primitive time intelligence functions
As I mentioned at the beginning of the post, time intelligence functions are an advanced feature of DAX. They are built on top of other DAX features and functions, so there is nothing primitive about them. But many time-intelligence functions are composed from more basic time-intelligence functions which have their native implementations, I call the latter primitive ones.
FirstDate/LastDate/StartOfMonth/EndOfMonth/StartOfQuarter/EndOfQuarter/StartOfYear/EndOfYear return a table of a single column and a single row. They can be used anywhere a table expression or a scalar expression is needed due to implicit table to scalar cast.
As I described in the section on the <dates> argument, no matter which format the user uses, internally they are all converted to a table expression. From now on I will use <dates table> to denote the table expression equivalent to <dates>. I will use DimDate[Datekey] or simply [Datekey] to denote the date/time column extracted from <dates> expression.
FirstDate(<dates>)
LastDate(<dates>)
These two functions return a single column, single row table with values equivalent to
                MinX(<dates table>, [Datekey]), or
                MaxX(<dates table>, [Datekey]).
FirstNonBlank(<column>, <expression>)
LastNonBlank(<column>, <expression>)
These two functions are not pure time-intelligence functions as the first argument is not limited to <dates> as in all other time-intelligence functions. They are internally rewritten as
                Top1(Filter(<column table>, Not(IsBlank(<expression>)), [column]), and
                Bottom1(Filter(<column table>, Not(IsBlank(<expression>)), [column]) respectively.
Top1 and Bottom1 are internal functions similar to MinX and MaxX functions but support all DAX data types include boolean and string data types.
Note that <expression> is not automatically wrapped in Calculate, therefore,
                FirstNonBlank(DimDate[Datekey], Sum(FactSales[SalesAmount]))
does not give you what you want, but
                FirstNonBlank(DimDate[Datekey], Calculate(Sum(FactSales[SalesAmount])))
will produce the expected result.
StartOfMonth(<dates>)
StartOfQuarter(<dates>)
StartOfYear(<dates>, <year_end_date>)
Although not implemented this way, these functions first find FirstDate(<dates>), then jump to first day that exists in the same month/quarter/year.
EndOfMonth(<dates>)
EndOfQuarter(<dates>)
EndOfYear(<dates>, <year_end_date>)
Although not implemented this way, these functions first find LastDate(<dates>), then jump to the last day that exists in the same month/quarter/year.
DateAdd(<dates>, <number_of_intervals>, <interval>)
ParallelPeriod(<dates>, <number_of_intervals>, <interval>)
SamePeriodLastYear(<dates>)
SamePeriodLastYear(<dates>) is identical to DateAdd(<dates>, -1, Year).
Both DateAdd and ParallelPeriod invoke a function, Move, on the calendar object. The only difference is that ParallelPeriod requires the result days to fill an entire month/quarter/year, while DateAdd does not have this requirement.
Currently, DateAdd has a limitation that <dates table> must contain continuous days so that the result days can be continuous too.  We have this limitation because when you move all 28 days in February one month forward, like below,
                DateAdd(filter(All(DimDate[Datekey]), Year([Datekey]) = 2006 && Month([Datekey]) = 2), 1, Month)
you get back 31 days in March! As we mentioned in the section about Calendar, all intermediate results of time intelligence functions are represented as an array of days. So there is no difference between all days in February and February itself. But what if you have one day missing in the middle of February? Should we return one day missing in March or 27 days in March? This seems to be a tricky but less important question, so we left it undecided and raised an error instead.
The calendar object’s Move function is the most intelligent part of all time intelligence functions. It knows that moving both 3/30 and 3/31 one month forward end up on the same day 4/30. Internally it only moves by number of days or months, number of quarters or years is simply translated to corresponding number of months. Sometimes the Move function can be too smart for the user. When it moves a continuous range of days forward N months, it checks the start day and the end day of the range to see if you are moving whole months or just individual days. This works well when all days exist in the date/time column. But if you have missing days, the Move logic still tries to guess whether you are moving whole months based on days that exist, this may not be what you would expect, hence Marius’ recommendation #2.
DatesBetween(<dates>, <start_date>, <end_date>)
DatesInPeriod(<dates>, <start_date>, <number_of_intervals>, <interval>)
These two functions are different from the other time intelligence functions in that the <dates> argument can only be a fully-qualified column reference to a date/time column. Note that online document incorrectly states that the <dates> argument in DatesInPeriod can be any of the three formats.
Internally, both functions are rewritten as
                Filter(All(DimDate[Datekey]), DatesRange([Datekey], Earlier(<start_date>), Earlier(<end_date>))) or
                Filter(All(DimDate[Datekey]), DatesRange([Datekey], Earlier(<start_date>, Earlier(<number_of_intervals>), <interval>))
In the formulas above, Earlier is an internal extended version of the public Earlier function that evaluates an entire DAX expression by skipping the last row context. DatesRange is an internal boolean function that returns true if the value of the first argument falls within an interval. In case of DatesRange(<date_value>, <start_date>, <end_date>), the interval is defined as [start_date, end_date]. In case of DatesRange(<date_value>, <start_date>, <number_of_intervals>, <interval>), the interval is defined as [start_date, end_date) where end_date is calculated by calling the Move function on the calendar object, so it is a smart moveJ Note that since <number_of_intervals> can be a negative number, like when you move backwards in time, the interval specified by DatesInPeriod can be in the reverse order. DatesBetween does not allow reversed end points.
Also note that All(DimDate[Datekey]) implies that existing filter context does not apply to the date/time column, unlike when all other time intelligence functions use the name-pair syntax.
PreviousDay(<dates>)
PreviousMonth(<dates>)
PreviousQuarter(<dates>)
PreviousYear(<dates>, <year_end_date>)
Although not implemented this way, these function are equivalent to
FirstDate(DateAdd(<dates>, -1, Day)) in case of PreviousDay, or
DatesInPeriod(
<dates>,
StartOfMonth/StartOfQuarter/StartOfYear(
DateAdd(<dates>, -1, Month/Quarter/Year)
),
1,
Month/Quarter/Year
) in case of PreviousMonth/PreviousQuarter/PreviousYear.
NextDay(<dates>)
NextMonth(<dates>)
NextQuarter(<dates>)
NextYear(<dates>, <year_end_date>)
Although not implemented this way, these functions are equivalent to
LastDate(DateAdd(<dates>, 1, Day)) in case of NextDay, or
DatesInPeriod(
<dates>,
EndOfMonth/EndOfQuarter/EndOfYear(
DateAdd(<dates>, 1, Month/Quarter/Year)
),
-1,
Month/Quarter/Year
) in case NextMonth/NextQuarter/NextYear.

Composite time intelligence functions
The remaining time intelligence functions listed below are always internally rewritten using other time intelligence function.
DatesMTD(<dates>)
                DatesBetween(<dates>, StartOfMonth(LastDate(<dates>)), LastDate(<dates>))
DatesQTD(<dates>)
                DatesBetween(<dates>, StartOfQuarter(LastDate(<dates>)), LastDate(<dates>))
DatesYTD(<dates>, <year_end_date>)
                DatesBetween(<dates>, StartOfYear(LastDate(<dates>), <year_end_date>), LastDate(<dates>))
TotalMTD(<expression>, <dates>, <filter>)
                Calculate(<expression>, DatesMTD(<dates>), <filter>)
TotalQTD(<expression>, <dates>, <filter>)
                Calculate(<expression>, DatesQTD(<dates>), <filter>)
TotalYTD(<expression>, <dates>, <filter>, <year_end_date>)
                Calculate(<expression>, DatesYTD(<dates>, <year_end_date>), <filter>)           

OpeningBalanceMonth(<expression>, <dates>, <filter>)
                Calculate(<expression>, PreviousDay(StartOfMonth(<dates>)), filter)
OpeningBalanceQuarter(<expression>, <dates>, <filter>)
                Calculate(<expression>, PreviousDay(StartOfQuarter(<dates>)), filter)
OpeningBalanceYear(<expression>, <dates>, <filter>, <year_end_date>)
                Calculate(<expression>, PreviousDay(StartOfYear(<dates>, <year_end_date>)), filter)

ClosingBalanceMonth(<expression>, <dates>, <filter>)
                Calculate(<expression>, EndOfMonth(<dates>), <filter>)
ClosingBalanceQuarter(<expression>, <dates>, <filter>)
                Calculate(<expression>, EndOfQuarter(<dates>), <filter>)
ClosingBalanceYear(<expression>, <dates>, <filter>, <year_end_date>)
                Calculate(<expression>, EndOfYear(<dates>, <year_end_date>), <filter>)

Thursday, January 6, 2011

Execution Plans and Plan Hints for MDX IIF Function and CASE Statement

MDX IIF function, and similarly MDX CASE statement, is often a cause of performance issues during execution of MDX formulas.  As a developer who worked on this feature, I'll explain the execution plans considered by the MDX formula engine (FE) in this post.

Introduction to some FE concepts

Before I delve into IIF function, I first need to introduce some basic concepts used by MDX FE. I admit that these concepts can be very complex and are worth several dedicated blog posts. The cursory description here can only give you some general ideas to help you understand execution plans for IIF. You can skip this section directly to the discussion about IIF plans and come back here for references when you need to.
  • Subspace
Every MDX expression is evaluated in a current context which includes a subspace.  A subspace is a subset of cells within the cube space. The original subspace is defined by the MDX query, for example, evaluate a measure expression for all product categories and all years. For a given MDX expression, FE will produce a result, like a value for scalar expressions, for every cell in the current subspace.
  • Rectangular subspace vs. arbitrary-shaped subspace
FE represents most subspaces in a very efficient rectangular shape which is a crossjoin of all attributes with non-trivial slices. Many common operations are fast and efficient when performed on rectangular subspaces. For example, search for a cached execution plan; determine whether a cell belongs to a subspace; decide whether an MDX scoped calculation applies to a subspace. But sometimes FE has to create arbitrary-shaped subspaces which you can imagine as rectangular subspaces plus additional constraints such as an enumerated set of tuples. { food, drink } x { 2009, 2010 } is an example of rectangular subspace. { (food, 2009), (drink, 2010) } is an example of arbitrary-shaped subspace. In general, Arbitrary-shaped subspaces are bad for performance.
  • Cell-by-cell mode vs. block mode
FE can evaluate a given MDX formula in cell-by-cell mode, in which case FE first builds a context for each cell in the subspace and then calculates the expression in each single-cell context, or in block mode, in which case FE builds a context which contains all cells in the subspace and evaluates the expression in a single scoop. Microsoft introduced MDX block mode execution in SQL Server 2005 and extended it to much more functions in SQL Server 2008. Typically block mode execution is orders of magnitude faster than cell-by-cell execution.
  • Logical plan
An MDX expression is evaluated in multiple stages, at one stage, a logical plan tree is built. If you have any SQL background, don't confuse the logical plan for an MDX expression with the logical plan of relational engine. In particular, MDX FE may need to execute some subtrees in the middle of building a large logical plan tree.
  • Varying attributes
The varying attributes of an MDX expression in a given subspace are those attributes that have more than one member in the subspace and on which the MDX expression depends. Varying attributes of an expression is derived after FE has built a logical plan tree for the expression.
  • Sparse expression vs. dense expression
An MDX scalar expression is called sparse in a subspace if most of the time it produces a single value, typically NULL. For example, physical measures are sparse in typical query subspaces consisting of crossjoin of dimensional attributes. If an MDX expression is not sparse, it is called dense.
  • Default value
The common single value of a sparse MDX scalar expression is called the default value of that expression. The default value of an MDX scalar expression is determined after FE has built a logical plan for the expression.

Execution plans for IIF function


Now we are ready to explore the execution plans of IIF. Implementation of IIF underwent a major overhaul in SQL Server 2008. In SQL Server 2005, there is limited support of block mode for most MDX functions, therefore IIF branches are more often evaluated in cell-by-cell mode. When SQL Server 2005 is able to evaluate IIF in block mode, it evaluates both branches in the original subspace with some special flag to indicate that the subspace may be too large, see eager mode below.

In this post I focus on the algorithm adopted by SQL Server 2008 and above. As a first step, FE transforms a single IIF function call into nested IIF function calls by eliminating AND operators and OR operators in the predicate expression. Notice that one branch expression is duplicated as a result of this rewrite.

        IIF(a AND b, x, y) => IIF(a, IIF(b, x, y), y)
        IIF(a OR b, x, y) => IIF(a, x, IIF(b, x, y))

Next, FE builds a logical plan for the simplified predicate expression. Note that FE may execute some subtrees while building a logical plan tree, this is a big topic which I have to discuss in a seperate blog post. After FE has built a logical plan tree for the predicate expression, FE performs constant folding by checking whether the simplified predicate is constant in the current subspace, and if so, reducing IIF function to one of its branch expressions.

The next step is to build logical plans for the then branch expression and the else branch expression. FE can evaluate each branch expression in one of two modes: eager mode and strict mode. As mentioned before, the IIF expression is evaluated in a subspace, let's denote it as S0. FE can either evaluate the then branch expression in S0 or FE can build a new subspace, S1, by restricting S0 to those cells that the predicate is true. Similarly, FE can either evaluate the else branch expression in S0 or FE can build another new subspace, S2, by restricting S0 to those cells that the predicate is false. If FE evaluates a branch expression in the original subspace S0, we call it in eager mode. If FE evaluates a branch expression in the restricted subspace S1 or S2, we call it in strict mode.

Strict mode has the obvious benefit of reducing the number of cells an expression is evaluated, so why doesn't FE choose it all the time? As it turns out, constructing a new subspace by evaluating the predicate will produce an arbitrary-shaped subspace if the varying attributes of the predicate expression contain more than one attribute. In general, arbitrary-shaped subspace can lead to bad performance due to various reasons such as cache misses. Moreover, if a branch expression is simple enough, like when it is a constant value, it doesn't make sense to build a new subspace at all.

MDX gives user direct control over whether a branch is executed in eager mode or strict mode through branch hints.  For example, IIF(p, x HINT EAGER, y HINT STRICT) tells FE to build logical plan in eager mode for the then branch and in strict mode for the else branch.

When there is no direct hint from user, FE takes several steps to come up with a final logical plan for each branch expression. It first tries to build a logical plan for each branch in eager mode if it is cheap. It is always cheap to build a plan for a constant value expression. When branch expressions are more complex, FE looks at the varying attributes of the predicate logical plan. If the varying attributes contains more than one attribute, hence will produce an arbitrary-shaped subspace which makes an eager plan more appealing, FE first tries to build an eager logical plan tree for a branch expression if it can be cheaply done, in other words, if it can build a plan tree without executing any subtree. In the cheap mode, if at any point in the process of building a logical plan tree for a branch expression, FE finds out that it needs to execute a subtree before it can proceed, FE will abandon the process and roll back all the way to the root of the current tree. I have seen bad performance caused by this trial-and-error process when a deeply nested IIF tree recursively tried to build a cheap plan, aborted the effort, built a real plan, and then repeated this process at the next nesting level. FE may abandon a cheaply built plan when the predicate expression is sparse and a branch with a cheaply built eager plan is on the minority side. For example, if the predicate is mostly true, the else branch is on the minority side. If a branch is on the minority side and if it is dense or if its default value doesn't match the default value of the IIF expression, its cheaply built eager plan will be abandoned.

At this stage, if FE still doesn't have a logical plan for a branch expression, FE will build a logical plan for that branch in strict mode. It first produces a new subspace for the branch by actually executing the predicate expression, producing a materialized list of tuples of the varying attributes of the predicate that satisfy the condition for that branch, and then applying the list of tuples to the original subspace as a new constraint. FE then generates a logical plan of the branch expression in the new subspace.

As you can see, MDX FE employes a rule-based approach when building execution plans for IIF function.  FE makes decisions based on sparsity and complexity of sub-expressons. Since sparisity of MDX expression is based on heuristics rather than statistics, FE may make the wrong decision which is why we need user hints.

IIF functions are often deeply nested, although this may not be obvious at first glance. As mentioned earlier in the post, AND and OR operators in IIF predicate are rewritten into nested IIF calls. Searched CASE statement is internally rewritten as nested IIF calls as well. Moreover, MDX script IF statement is also internally mapped to an IIF function call. All these internal tree rewrites make deeply nested IIF trees a common phenomenon in many cube.

Although MDX FE employes a complex algorithm to build a final plan for the IIF function, we can still learn some basic lessons that may help us understand performance issues associated with the function.
  • Strict mode helps when a branch expression is very expensive. It can be a lot of extra data has to be read from disk for those extra cells in eager mode. It can also be a very expensive MDX scoped calculation has to be included for one unwanted cell in eager mode. Try plan HINT STRICT if you think AS fetches too much data from disk as a result of evaluating a branch in eager mode.
  • All the trial-and-errors when building the eager plan can be a source of bad performance when the same IIF function is evaluated many times, typically since the parent function chooses cell-by-cell mode, or when deeply nested IIF calls recursively try eager mode and then fall back to strict mode.  Adding plan hints in this case will force FE to make up its mind right away.
  • Order of predicates in nested IIF calls can be important. A wrong condition at the top level may produce an expensive arbitrary-shaped subspace which makes it expensive to evaluate a child subtree. Or a right condition at the top may save a lot of time by eliminating the need to evaluate a more expensive condition at lower level.  Keep in mind that AND and OR operators are rewritten as nested IIF so the order in AND and OR operators is important for the same reason.
  • Sometimes cell-by-cell mode can be fast especially when a top level function is in cell-by-cell mode. I have seen cases where MDX FE spends a lot of time building plans for a branch expression but never execute the plans as the IIF condition was never met. When the IIF function itself is called many times since its parent is in cell-by-cell mode, all the time FE spent on building unused plans is wasted. Here you may try an undocumented HINT LAZY which applies to all MDX expressions but only takes effect when a server config flag LazyEnabled is set to 1.  Using this hint will force FE to delay evaluating the MDX expression until execution time.  No logical plan is built by traversing the expression tree.  WARNING: You will not get support from Microsoft if you use this hint on your own.  You must get approval from Microsoft Customer Service and Support if you want to use this hint in your production system.
Execution plans for CASE statement

There are two types of CASE statements: simple case statement and search case statement.

Search case statement is internally rewritten as nested IIF function calls in the following way:

CASE
WHEN p1 THEN e1
WHEN p2 THEN e2
...
WHEN pN THEN eN
ELSE eL
END

=>

IIF (p1, e1, IIF(p2, e2, ... IIF(pN, eN, eL)...))

After the rewrite, FE executes all IIF functions according to the algorithm outlined in the previous section.

SQL Server 2008 SP2 CTP1 introduced a hint for search case statement. The syntax of hint for CASE is:

CASE HINT <EAGER|STRICT>
WHEN ... THEN ...
...
ELSE ...
END

Internally, the provided hint is propagated to all branches of all IIF functions generated during the rewrite. Afterwards, FE will use the hint in the same way as hints provided directly to an explicit IIF function.

The simple case statement has an execution plan similar to that of IIF function. The main difference is that IIF function only has two branches while simple case statement has N branches. As of this writing, users cannot specify eager or strict hints to simple case statements.

In summary, since MDX IIF function and CASE statement are frequently a source of performance problems, using plan hints may help you solve your problem. Note that plan hints for IIF function are only available in SQL Server 2008 and later and plan hints for search case statement are introduced after SQL Server 2008 SP2 CTP1.  I have to admit trying out plan hints is like shooting in the dark due to the lack of end user tools to display MDX execution plan trees.