Saturday, October 22, 2011

Three Strategies of Evaluating the MDX Aggregate Function

In the past two months, the Analysis Services development team was preoccupied with wrapping up SQL Server Denali development. I was racing against time to put finishing touches on some exciting new DAX features that I will write in more details once Denali goes public. It didn’t help that PASS 2011 fell on the same week that developers were allowed to make a final batch of code changes before the checkin bar would be lifted prohibitively high. All the flurry of activity forced me to postpone writing blogs on a couple of topics queued up over the past year. Now that things have quieted down a lot on the Denali front, the development team immediately switched gear to work on the next batch of important features requested by many customers, like supporting DAX queries against multidimensional cubes. Over the next few months before Denali hits the stores, I’ll try to finish up several MDX topics that I had put on the backburner due to Denali crunch time.

Before I get to the main subject of today’s post, I’d like to say a few words on PASS Summit 2011 that happened here at Seattle about a week ago. AS team, and Microsoft at large, always encourages product developers to interact with customers through venues like PASS so that we get a chance to see firsthand how the products we have built impact people’s lives. I was gratified to see that Kasper received a round of applause when he gave the audience a glimpse of DAX query plans, a feature I personally fought very hard to be included in Denali. Obviously the room was filled with MDX users who have waited for years an MDX query plan feature. One MVP told me that she thinks MDX is a beautiful language. That was the first time I heard the word beautiful to be associated with MDX, what I had previoulsy heard were all in line with hard or difficult. I have also heard touching stories about how brothers both turned into MDX experts and that a mother passed her MDX knowledge to her daughter.  Who could have thought that MDX can promote family bonding? I could use stories like this to lift my spirit after finishing another grueling product release cycle at Microsoft.
Without further delay, let’s jump into the MDX Aggregate function which is the focus of this post. Many users think of Aggregate as a smart function that dynamically chooses an aggregation type based on the current measure. But this way of thinking only works when the current measure is a physical measure that has an additive aggregation type like Sum, Min, Max, or Count. What should MDX formula engine do when the current measure is a calculated measure which doesn’t have a default aggregation type? Often times the formula engine simply gives up and returns an error, as seen in this blog. Even when the formula engine is able to pick a calculation strategy, it often has limitations that may surprise you when you move beyond basic scenarios. In this blog post I’ll describe three execution plans the formula engine uses to evaluate the Aggregate function and the decision logic employed to pick the winning strategy. The information provided here is valid as of SQL Server 2008R2.
Here are the three possible execution plans for calculating the value V of
Aggregate(«Set»[, «Numeric Expression»])
in the context of the current cell Cwhose subspace is S0

1.      The basic plan.
Many people think of Aggregate as a generic form of Sum, Min, Max, or Count with a similar execution strategy.
Let Agg be the aggregation function derived from the current measure in S0
Set V = NULL
For each tuple t in «Set»
Apply t to S0 to build a new subspace S1
If «Numeric Expression» is present
Set V = Agg(V, value of «Numeric Expression» in S1)
Set V = Agg(V, value property of S1)

The formula engine chooses this strategy when the current measure in S0 is a basic physical measure or is an alias to a basic physical measure, as shown below.

2.      The switching-solve-order plan.
The basic plan stops working when the current measure is a non-trivial calculated measure since the formula engine cannot extract the Agg function any more. One way to work around the problem is to switch the solving order of the AGGREGATE function and the calculated measure. Mosha alluded to this strategy in his comment to Thomas Ivarsson’s post.

Let «Measure Expression» be the expression of the current measure in S0
Set V = Calculate the value of «Measure Expression» in S0

The thinking behind this strategy is that by evaluating «Measure Expression» first, it will eventually lead to a physical measure that has a good Agg function. Afterwards, the Aggregate calculation can be evaluated using the basic plan. This strategy essentially moves an otherwise higher-priority Aggregate calculation behind lower priority calculated measures.
For example, when I slightly change the expression for calculated measure [x] to be more than just a physical measure name, the formula engine would switch the solving orders of [x] and [y] to produce the same result as in the basic plan.

But this strategy falls apart when the calculated measure is more than just a single expression.
If you create [x] and [y] in the cube script in the following fashion,
Measures.[x] = [Measures].[Internet Sales Amount];
[Date].[Calendar Year].[y] =
AGGREGATE({[Date].[Calendar Year].&[2007], [Date].[Calendar Year].&[2008]});

you will get the following surprising query result.
This is because the formula engine can only retrieve the original expression of the calculated measure but is unable to take into account scope assignments which also affect the value of the calculated measure. In my opinion, this strategy is more or less a hack as it doesn’t provide a coherent solution under all circumstances.
3.      The set-in-the-where-clause plan.
Neither of the above strategies works when the current measure is a physical measure with a non-additive function, such as semi-additive measures or distinct count measures. To make the function return meaningful result, we borrowed a page from another hacky feature of MDX: multi-select through query-scope calculated member using Aggregate function. The formula engine rewrites the incoming multi-select query into an equivalent query by putting the set argument extracted from the Aggregate function into the where clause.

As it turned out, the same strategy can be used as a generic solution for evaluating the Aggregate function. In 2008R2, this is used for non-additive physical measures.
Construct a new subspace S1 by adding «Set» as a slice to S0
If «Numeric Expression» is present
Set V = Calculate the value of «Numeric Expression» in S1
Set V = Calculate the value property of S1

If you are familiar with DAX, this strategy is similar to the Calculate function in DAX: it transforms the current subspace by applying the set argument as a filter and then evaluates the «Numeric Expression» in the new subspace.
As you can see from today’s discussion, current implementation of the Aggregate function in the face of a calculated measure has limitations and inconsistencies. It may cause confusions among users who saw that basic scenarios worked and then ventured into more advanced usages. Therefore we are considering to give users an option to switch to the set-in-the-where-clause plan where currently the switching-solve-order plan is used.