MDX and DAX topics
Sunday, September 27, 2015
Use Calculated Table to Figure out Monthly Subscriber Churn
With the recent GA of Power BI and the release of Excel 2016, Microsoft has unleashed to the public years of enhancements to its BI products. The rapid development enabled by the cloud service model means detailed documentation can be lagging behind product advancement. The freemium pricing strategy allows millions of business users easy access to Power BI products and services. I plan to resume blogging for at least the next 12 months to help spread the knowledge of the rich new capabilities of the data engine and the DAX programming language that serve as the foundation of all the exciting new features popping up rapidly in all Microsoft BI products. I have moved to a better blogging site http://pbidax.wordpress.com and here is my first post there Use Calculated Table to Figure out Monthly Subscriber Churn.
Saturday, June 15, 2013
The Currency Data Type and VertiPaq Caching
It’s been a while since I published my last blog post. That’s
not for lack of topics to write, but rather for lack of spare time. I switched
from a developer role to a development lead role a year and a half ago, while
life has always been busy at Microsoft, the extra responsibilities in the new
role had put too much on my plate. Moreover there have been a lot of changes
here at Microsoft to get the company transition into a “devices and services”
company, the flurry of initiatives and refocuses has kept many people extra busy
in the past year.
SELECT [m] ON 0,
[Date].[Calendar Year].[Calendar Year].MEMBERS ON 1
FROM [Model]
I am writing today’s post since recently there was a long discussion
among some MVPs regarding a query performance concern in the Tabular database.
They have discovered that when a VertiPaq query performs certain arithmetic
operations on a column of Currency
data type, the calculation is not done directly inside the VertiPaq calculation
engine but rather done through a callback to the DAX calculation engine. For
example, if you issue the following MDX query against AdventureWorks
Tabular Model SQL Server 2012 and capture the VertiPaq SE Query Begin/End events inside SQL Server Profiler, you
would see CallbackDataID in the trace event to indicate a callback from the
VertiPaq calculation engine to the DAX calculation engine.
WITH MEASURE 'Internet Sales'[m] =
SUMX('Internet Sales',
[Unit Price] * [Unit Price Discount Pct])SELECT [m] ON 0,
[Date].[Calendar Year].[Calendar Year].MEMBERS ON 1
FROM [Model]
Some MVPs went on to investigate which combinations of data
types and operations would fall into this category. To save people a lot of
research time and unnecessary guesswork, I’ll list the cases right here.
First let me give you some background information on why
MVPs are interested in this behavior. The VertiPaq calculation engine has a set
of built-in operations it can perform in a very efficient fashion. If an
operation is not natively supported, it calls back to the DAX calculation
engine to perform the operation on its behalf. A callback operation is not as
efficient as a native operation, but more importantly, a VertiPaq query
containing callback operations is not added to the VertiPaq cache. If you run
the above query again, you will not see the VertiPaq SE Query Cache Match event for the VertiPaq query going to
the ‘Internet Sales’ table.
The following table lists cases when VertiPaq calculation
engine calls back to DAX calculation engine to perform arithmetic operations.
Arithmetic Operator
|
Left Operand Data Type
|
Right Operand Data Type
|
+
|
Currency
|
Non-Currency
|
+
|
Non-Currency
|
Currency
|
-
|
Currency
|
Non-Currency
|
-
|
Non-Currency
|
Currency
|
*
|
Currency
|
Currency
|
*
|
Currency
|
Real
|
*
|
Real
|
Currency
|
*
|
Currency
|
Date/Time
|
*
|
Date
|
Real
|
/
|
Non-Currency
|
Currency
|
/
|
Currency
|
Real
|
As you can see, all but one row in the table has at least
one operand of Currency data type and
you may wonder why Currency data type
is out of favor with the VertiPaq calculation engine. As it turns out, VertiPaq
engine stores Currency data as Integers, e.g. $1234.5678 is stored as
12345678. Now you may wonder why Integer
* Real is not on the list but Currency * Real is. The answer lies in DAX rules governing the data types of
the results of arithmetic operations. In DAX, the result of Integer * Real is Real, but the
result of Currency * Real is Currency. So when VertiPaq calculates $12.3456 * 7.8, it does 123456
* 7.8 = 962956.8 and then it has to round the result to 962957 that represents
$96.2957. Currently, VertiPaq calculation engine only supports a subset of
conversions between basic data types and delegates the rest of the type
conversions to DAX calculation engine. The following table lists the basic DAX
data type conversions supported natively in VertiPaq calculation engine.
From
|
To
|
Boolean
|
Integer
|
Boolean
|
Real
|
Integer
|
Real
|
Date/Time
|
Real
|
Boolean
|
Date/Time
|
Integer
|
Date/Time
|
Real
|
Date/Time
|
Integer
|
Boolean
|
Currency
|
Boolean
|
Date/Time
|
Boolean
|
Real
|
Boolean
|
Before we conclude today’s post, I want to add that this is
the behavior of SQL Server 2012 SP1. The Analysis Services team is busy adding
new capabilities to the product. Many of the limitations, such as VertiPaq
queries with callbacks missing VertiPaq cache, is likely to be lifted in a
future release.
Sunday, March 4, 2012
DAX Query Plan, Part 3, Vertipaq Operators
The Vertipaq operators are an important subset of leaf-level
operators which are responsible for preparing and sending queries to the
Vertipaq Engine for execution and receiving query results. Table 1 is a list of
all Vertipaq operators. You can see them in DAX Query Plan trace events by running
the sample queries in column 3 against the tabular
AdventureWorks database.
sumx(
filter(
'Internet Sales',
Related(Product[List Price]) < 10
&& Related(Customer[Yearly Income]) < 50000
),
[Sales Amount]
)
)
evaluate
calculatetable(
addcolumns(
crossjoin(values('Date'[Month]), distinct('Product Category'[Product Category Name])),
"YTD",
calculate([Total Sales Amount],
filter(
All('Date'[Month]),
'Date'[Month]
<=
earlier('Date'[Month])
)
)
),
'Date'[Calendar Year] = 2003
)
addcolumns(
'Date',
"x",
sumx(
filter('Internet Sales', [Order Date] <= [Date]),
[Sales Amount]
)
)
addcolumns(
'Date',
"x",
calculate(
sum('Internet Sales'[Sales Amount]),
'Internet Sales'[Order Date] <= earlier([Date]),
all('Date')
)
)
Table 1. Vertipaq
Operators
Logical
Operators
|
Description
|
Example
|
Scan_Vertipaq
|
This RelLogOp is the foundation of all other Vertipaq
logical operators. It represents a basic Vertipaq query that joins a root
table with related tables following many-to-one relationships, determines
which rows are retrieved by testing against Vertiscan predicates, and groups the
resultset by output columns.
|
evaluate ‘Product’
|
GroupBy_Vertipaq
|
This RelLogOp renames columns and adds rollup columns to a
Vertipaq query.
|
evaluate summarize(‘Product’, rollup(‘Product Category’[Product
Category Name], ‘Product’[Product Name]))
|
Filter_Vertipaq
|
This RelLogOp adds a Verticalc predicate to a Vertipaq
query.
|
evaluate filter('Product', right([Product Name], 4) =
"Tire")
|
Sum_Vertipaq
|
This ScaLogOp adds a SUM aggregation to a Vertipaq query.
|
evaluate row("x", sum('Internet Sales'[Sales
Amount]))
|
Min_Vertipaq
|
This ScaLogOp adds a MIN aggregation to a Vertipaq query.
|
evaluate row("x", min('Internet Sales'[Sales
Amount]))
|
Max_Vertipaq
|
This ScaLogOp adds a MAX aggregation to a Vertipaq query.
|
evaluate row("x", max('Internet Sales'[Sales
Amount]))
|
Count_Vertipaq
|
This ScaLogOp adds a COUNT aggregation to a Vertipaq query.
|
evaluate row("x", countrows('Internet Sales'))
|
DistinctCount_Vertipaq
|
This ScaLogOp adds a DISTINCTCOUNT aggregation to a
Vertipaq query.
|
evaluate row("x", distinctcount('Internet Sales'[Due
Date]))
|
Average_Vertipaq
|
This ScaLogOp adds an AVERAGE aggregation to a Vertipaq
query.
|
evaluate row("x", average('Internet Sales'[Sales
Amount]))
|
Stdev.S_Vertipaq
|
This ScaLogOp adds a STDEV.S aggregation to a Vertipaq
query.
|
evaluate row("x", stdev.s('Internet Sales'[Sales
Amount]))
|
Stdev.P_Vertipaq
|
This ScaLogOp adds a STDEV.P aggregation to a Vertipaq
query.
|
evaluate row("x", stdev.p('Internet Sales'[Sales
Amount]))
|
Var.S_Vertipaq
|
This ScaLogOp adds a VAR.S aggregation to a Vertipaq query.
|
evaluate row("x", var.s('Internet Sales'[Sales
Amount]))
|
Var.P_Vertipaq
|
This ScaLogOp adds a VAR.P aggregation to a Vertipaq query.
|
evaluate row("x", var.p('Internet Sales'[Sales
Amount]))
|
Physical
Operators
|
|
|
VertipaqResult
|
This IterPhyOp iterates over the resultset returned by a Vertipaq
query.
|
You can find this operator in the physical plan tree after
running any of the above queries.
|
The Vertipaq Engine runs the fastest when it executes a query
in pure Vertiscan mode, which is when the query only contains simple
aggregations and simple predicates. A simple aggregation aggregates a single
column or counts rows of the table. A simple predicate, called Vertiscan
predicate in Table 1, typically looks like [Column] = 5 or [Column] IN { 5, 6,
7, … } or ([Column1], [Column2]) IN { (5, 6), (7, 8), … } where the numbers are
data IDs. Note that all column values are encoded as integer data IDs inside
the Vertipaq Engine. The Vertipaq Engine extends its basic Vertiscan capabilities
through its own calculation engine, called Verticalc. When aggregation
expressions or predicate expressions become more complex, the Vertipaq Engine builds
calculation trees to evaluate those expressions and can even call back to the DAX
Formula Engine for help on those operators it doesn’t support natively. For
example, when the aggregation expression is maxx('Internet Sales', year([Due
Date])), the Vertipaq Engine calls back to the DAX Formula Engine to evaluate
the Year function. Similarly, complex
predicate expressions are also evaluated using the Verticalc technology
therefore are called Verticalc predicates in Table 1. The Vertipaq Engine typically
runs slower when it has to invoke Verticalc evaluations during query execution
therefore the DAX Formula Engine converts filter predicates into Vertiscan slices
whenever possible.
Ever since the MDX days, users have been complaining that the
Formula Engine is single-threaded when it executes a single query. The problem
is partially solved when the DAX Formula Engine pushes calculations down to the
Vertipaq Engine. Not only is the Vertipaq Engine multi-threaded, it is also
able to greatly reduce the number of calculations by taking advantage of large
blocks of continuous rows with identical values for those columns referenced by the calculation. Therefore, the DAX Formula Engine
tries very hard to push operations down to the Vertipaq Engine whenever
possible and it is highly desirable to see mostly Vertipaq operators in DAX
Query Plan events.
Caveats of
Understanding Vertipaq Operators in DAX Query Plan Trees
As a brand new feature in Denali, the DAX Query Plan has
plenty of room for future improvements. Here I am going to highlight three
aspects of the current implementation which might confuse users who are trying
to read DAX Query Plans for the first time.
1. The display
of Vertipaq operators is too closely tied to the underlying implementation, as
a result, users might see multiple nested Vertipaq operators in a plan tree but
only a single Vertipaq query is issued. For example, if you run Query 1, you
will see four Vertipaq operators, Sum_Vertipaq,
Filter_Vertipaq, Filter_Vertipaq, Scan_Vertipaq,
in the DAX Query Plan/DAX Vertipaq
Logical Plan event, as shown in Figure 1, but only a single Vertipaq SE Query Begin/Vertipaq Scan
event. As you can see in Figure 1, a Scan_Vertipaq
operator serves as the foundation for all other Vertipaq operators, each aggregation
or Verticalc filter has its own operator, and the cascading operators are
displayed in a chain of parent-child relationships even though aggregations and
Verticalc filters are eventually folded into a single Vertipaq query.
// Query 1
evaluate
row("x",sumx(
filter(
'Internet Sales',
Related(Product[List Price]) < 10
&& Related(Customer[Yearly Income]) < 50000
),
[Sales Amount]
)
)
2. As
mentioned in the
second installment of the DAX Query Plan blog series, spools are not
first-class citizens in Denali DAX Query Plans, but users can still detect the
presence of spools in Denali DAX Query Plan trees indirectly through the
presence of physical operators which sit directly on top of spools such as the Spool LookupPhyOp or the Spool_IterOnly, Spool_LookupOnly, Spool_SliceIndex
IterPhyOps. Let’s call them spool operators in this post to make it easier to
refer to them. If you run Query 2 and
examine the physical plan tree shown in Figure 2, you can see a Spool_IterOnly operator with a child VertipaqResult operator. Since the real
child operator of Spool_IterOnly is a
spool, what’s with the VertipaqResult
IterPhyOp? As it turns out, some iterators supply the rows needed to fill a
spool when it is materialized and DAX Query Plan shows iterator subtrees which are
used to populate the spool as child operators of the spool operator. In Denali,
a spool is always constructed to receive the resultset of a Vertipaq query,
hence VertipaqResult operator is always
a child of a spool operator. An important property of a spool operator is #Records, highlighted in Figure 2, which
tells you how many rows of data are in the underlying spool and is currently
the most important property to help identify performance problems of a query. When
VertipaqResult is the sourcing
iterator, this property tells you how many records are returned by the Vertipaq
Engine.
// Query 2
evaluate
values(Product[List Price])
3. As stated
earlier, Verticalc evaluations may call back to the DAX Formula Engine for
unsupported logical operators which in turn would construct the corresponding
physical operator trees just as it would when the calculation happens entirely
in the Formula Engine. Sometimes the callback functions can be very expensive
themselves but those physical operators are not shown in the plan tree. If you
run Query 3 you will see in the Vertipaq
SE Query Begin event, shown in Figure 3, that the Vertipaq Engine calls
back to the DAX Formula Engine for its help with the Year function, but the corresponding physical operator does not
show up in the DAX Query Plan/DAX
Vertipaq Physical Plan event, shown in Figure 4.
// Query 3
evaluate row("x", maxx('Internet Sales', year([Due
Date])))
Special
Properties of Scan_Vertipaq RelLogOp
As you have seen in the previous examples, the Scan_Vertipaq
operator is at the core of every Vertipaq query and has several special
properties worth mentioning here. To show you all the properties of Scan_Vertipaq, run Query 4 and then look
at its logical plan.
// Query 4
define
measure 'Internet Sales'[Total Sales Amount] = Sum([Sales Amount])evaluate
calculatetable(
addcolumns(
crossjoin(values('Date'[Month]), distinct('Product Category'[Product Category Name])),
"YTD",
calculate([Total Sales Amount],
filter(
All('Date'[Month]),
'Date'[Month]
<=
earlier('Date'[Month])
)
)
),
'Date'[Calendar Year] = 2003
)
Below is
the longest line I extracted from the logical plan with operator specific
properties in bold face.
Scan_Vertipaq: RelLogOp
DependOnCols(1, 2)('Date'[Month], 'Product Category'[Product Category Name])
4-141 RequiredCols(1, 2, 133)('Date'[Month], 'Product Category'[Product
Category Name], 'Internet Sales'[Sales Amount]) Table='Internet Sales_78de3956-70d9-429f-9857-c407f7902f1e' -BlankRow
JoinCols(2)('Product Category'[Product Category Name]) SemijoinCols(3)('Date'[Month])
As you can see, in addition to the common properties DependOnCols, range of column numbers,
and RequiredCols, Scan_Vertipaq also has a couple of extra
properties:
·
Table
Displays the internal ID of the
root table in Denali. I think this should change to user friendly table name in
the future.
·
Whether the blank row is requested
DAX introduced an optional
blank row to an IMBI table in order to deal with any incoming referential
integrity violations. When users want to include this blank row in the
resultset of a Vertipaq query, query plan shows +BlankRow; when users don’t
want to include the blank row in the resultset, query plan displays –BlankRow.
In Query 4 I deliberately used both the Values
function and the Distinct function
inside the CrossJoin function to
demonstrate the difference, see Figure 5 which was an excerpt from the logical
plan.
·
JoinCols
The columns from this table which
are needed for natural join with other tables. JoinCols are a subset of DependOnCols,
the latter is actually a union of the former and the DependOnCols of all semi-join filter operators.
·
SemijoinCols
The columns from this table which
are needed for natural semi-join with filter tables. The DAX Formula Engine
simplifies the logical operator tree by converting most explicit semi-join filters
to Vertiscan slices or Verticalc slices so that the filtering operations happen
inside the Vertipaq Engine, and since DAX
Query Plan/DAX Vertipaq Logical Plan events are fired after the
simplification stage, users typically don’t see the SemijoinCols property. But when SemijoinCols
do show up in the plan, the DAX Formula Engine may have to fetch more columns
back, join with the filter tables, and then remove unwanted SemijoinCols by grouping on the
desired output columns. When this happens, the Vertipaq operator can be quite expensive
as a lot of post-processing happens in the Formula Engine before the final
resultset can be returned.
Understand
Performance Differences between Equivalent Queries
To conclude today’s post, let’s run two queries which are
written in different ways but return the same results and use DAX Query Plans
to figure out why they perform differently. Both Query 5 and Query 6 use the AddColumns function to simulate adding a
calculated column to the ‘Date’ table. The calculated column calculates the sum
of [Sales Amount] after filters the ‘Internet Sales’ table based on a predicate that depends on a column value from the current
row. While it is natural to write the calculation as sum of filter as done in
Query 5, to get much better performance, you should split the sum and the
filter into separate functions and then calculate the sum by adding the filter to
the filter context, see Query 6. Compare their logical plans, shown in Figure 6
and Figure 7 respectively, you can see that Query 5 executes both the SumX function and the Filter function in the Formula Engine
but Query 6 only executes the Filter function
in the Formula Engine but pushes the SumX
down to the Vertipaq Engine through the Sum_Vertipaq
operator. Since today we have learned that it is always good if calculations can
be pushed down to the Vertipaq Engine, Query 6 runs a lot faster than Query 5. So write your DAX expressions to take
advantage of Vertipaq operators.
// Query 5
evaluate addcolumns(
'Date',
"x",
sumx(
filter('Internet Sales', [Order Date] <= [Date]),
[Sales Amount]
)
)
// Query 6
evaluate addcolumns(
'Date',
"x",
calculate(
sum('Internet Sales'[Sales Amount]),
'Internet Sales'[Order Date] <= earlier([Date]),
all('Date')
)
)
Subscribe to:
Posts (Atom)