Tuesday, November 15, 2011

Avoid Pitfalls of Fact Data Prefetching

As mentioned in Chris Webb’s article Query Performance Tuning in Microsoft Analysis Services and SQL Server CAT team’s white paper SQLServer Best Practices Article: Identifying and Resolving MDX Query PerformanceBottlenecks in SQL Server 2005 Analysis Services, MDX Formula Engine (FE) may request more data from the Storage Engine (SE) upfront so that following SE queries can be answered directly from the SE cache. In order to retrieve the right amount of data, FE uses some clever heuristics to construct the prefetch query. While the underlying design delivers great performance most of the time, sometimes the resulting query prefetches too much data. In this blog post I am going to show you a pathological case where a prefetch query seems to completely ignore the slices in the original MDX query thereby scan all partitions instead of just the needed ones. I will explain why it happens so you understand what’s going on in the engine if you ever run into such cases and deal with it accordingly.

To illustrate the problem, I am going to run a series of MDX queries against the AdventureWorks sample database for Microsoft SQL Server 2008R2. In between queries, I send ClearCache commands to eliminate inter-query interference. For each MDX query, we are going to observe the sequence of SE queries through the Query Subcube Verbose profiler trace events. Instead of showing the complete output of Query Subcube Verbose events, I am going to extract only those attributes in the [Date] dimension which are relevant to our discussion. As well known in the MDX community, Analysis Services engine internally uses subcubes to represent queries, therefore I will use the terms query and subcube interchangeably throughout the post.
Now, let’s run our first MDX query to retrieve the values of measure [Internet Sales Amount] for all days in January 2007.
Query 1:
// January 2007
select [Internet Sales Amount] on 0,
head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 31) on 1
from [Adventure Works]

Since measure [Internet Sales Amount] is defined as a measure expression, there are three Query Subcube Verbose events, one for the [Exchange Rates] measure group, two for the [Internet Sales] measure group, although only one of them hits the partitions, the other is just an artifact of measure expression evaluation.

Here are the attributes corresponding to levels in the [Calendar] hierarchy extracted from the trace events:
Dimension 7 [Date] (0 * 8 0 6 0 0 0 0 0 0 20 4 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:[January 2007]
[Calendar Year]:[CY 2007]

The SE query asks for all days in January 2007, which match exactly the days listed on Axis 1 of the MDX query, with the added benefit of not listing each day individually. There is no prefetch query in this case which makes sense as the original query is asking for all days in January 2007 that matches exactly the SE query.
Now let’s add one more day to the query.
Query 2:

// January 2007 + February 1 2007
select [Internet Sales Amount] on 0,
head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 32) on 1
from [Adventure Works]

From the captured profiler trace below

  
You can extract the following information form the three queries sent to the [Internet Sales] measure group:

Non-cache data
Dimension 7 [Date] (0 * 8 0 6 0 0 0 0 0 0 20 4 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:[January 2007]
[Calendar Year]:[CY 2007]

Non-cache data
Dimension 7 [Date] (0 * 8 0 6 0 0 0 0 0 0 21 4 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:[February 2007]
[Calendar Year]:[CY 2007]

Cache data
Dimension 7 [Date] (0 582 8 0 6 0 0 0 0 0 0 21 4 0 0 0 0 0 0 0 0)
[Date]:[February 1, 2007]
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:[February 2007]
[Calendar Year]:[CY 2007]

Again there is no prefetch query for January 2007 but FE does prefetch all days in February 2007 first before getting the value of February 1, 2007 from cached data.
Now let’s add one more day again to the query.
Query 3:
// January 2007 + February 1 2007 + February 2 2007
select [Internet Sales Amount] on 0,
head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 33) on 1
from [Adventure Works]

Below is the profiler trace events and relevant information extracted from them:

Non-cache data:
Dimension 7 [Date] (0 * * 0 * 0 0 0 0 0 0 * * 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:*
[Calendar Semester]:*
[Month Name]:*
[Calendar Year]:*

Cache data:
Dimension 7 [Date] (0 * 8 0 6 0 0 0 0 0 0 20 4 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:[January 2007]
[Calendar Year]:[CY 2007]

Cache data:
Dimension 7 [Date] (0 + * 0 * 0 0 0 0 0 0 * * 0 0 0 0 0 0 0 0)
[Date]:+
[Calendar Quarter]:*
[Calendar Semester]:*
[Month Name]:*
[Calendar Year]:*

One EventSubclass  = Non-Cache data query prefetches fact data into SE cache which is used to answer the subsequent two SE queries shown as EventSubclass  = Cache data, one for all days in January 2007, the other for the first two days of February 2007. But why did the prefetch query ask for data for all dates? Why did we lose all slices in the [Date] dimension? This is really bad as it would have caused all partitions to be scanned. The answer lies in the imprecise nature of the algorithm used to construct the prefetch subcube.

In order to build the prefetch SE subcube, FE keeps track of all relevant MDX sets in a stack-like data structure called Sonar registry. Later on each MDX set in the Sonar registry is used to build a collection of Sonar subcubes. Perfmon counter MDX\Total Sonar subcubes displays the total number of Sonar subcubes generated inside FE.


By default, the conversion from an MDX set to its collection of Sonar subcubes is not a precise process. The resulting Sonar subcubes will cover the original MDX set but may include additional members on some attributes. The Cache Ratio connection string property controls how a Sonar subcube attribute is expanded to include more members than what the original set contains. The actual algorithm is very complex. I am only going to show here a greatly simplified version that illustrates the main idea of the algorithm.

Algorithm 1: Build Sonar subcubes from an MDX set
Input: an MDX set S, for simplicity, assume S is single-grained

For each hierarchy H in S
Since S is single-grained, let Ls be the level of H where members of S reside
For each level L in H starting from the topmost level down to Ls
Identify all members on L with good coverage of S
A member M has good coverage of S if the ratio of the number of its descendants in S to the number of all its descendants on level Ls is greater than or equal to the Cache Ratio
Add M to the list of slicers on level L
For each level with a nonempty list of slicers
Create a Sonar subcube with granularities on attributes corresponding to the levels of H from the topmost level down to Ls
Apply the list of slicers to the subcube

Output: the collection of generated Sonar subcubes

First, apply Algorithm 1 to the [Calendar] set in Query 1

head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 31)

you get the following sequence of slices encoded in internal data IDs and eventually FE finds a single good cover slice on the month level.

Sonar [Calendar] [1 * * * * *] ratio: 31/1188
  Sonar [Calendar] [1 4 * * * *] ratio: 31/365
   Sonar [Calendar] [1 4 6 * * *] ratio: 31/181
    Sonar [Calendar] [1 4 6 8 * *] ratio: 31/90
     Sonar [Calendar] [1 4 6 8 20 *] ratio: 31/31 ß good cover

The resulting Sonar subcube is

Sonar Subcube 1
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
2007
H1 CY 2007
Q1 CY 2007
January 2007
*

Here and in the following I only show the important attributes in the Sonar subcube which correspond to levels of the [Calendar] hierarchy. Next, apply Algorithm 1 to the [Calendar] set in Query 2

head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 32)

FE finds two good cover slices, one on the month level and the other on the date level.

Sonar [Calendar] [1 * * * * *] ratio: 32/1188
  Sonar [Calendar] [1 4 * * * *] ratio: 32/365
   Sonar [Calendar] [1 4 6 * * *] ratio: 32/181
    Sonar [Calendar] [1 4 6 8 * *] ratio: 32/90
     Sonar [Calendar] [1 4 6 8 20 *] ratio: 31/31 ß good cover

     Sonar [Calendar] [1 4 6 8 21 *] ratio: 1/28
      Sonar [Calendar] [1 4 6 8 21 582] ratio: 1/1 ß good cover

The two Sonar subcubes created are:

Sonar Subcube 2.1
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
2007
H1 CY 2007
Q1 CY 2007
January 2007
*

Sonar Subcube 2.2
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
2007
H1 CY 2007
Q1 CY 2007
February 2007
February 1, 2007


Finally, apply Algorithm 1 to the [Calendar] set in Query 3

head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 33)

FE finds three good cover slices, one on the month level and two on the date level.

Sonar [Calendar] [1 * * * * *] ratio: 33/1188
  Sonar [Calendar] [1 4 * * * *] ratio: 33/365
   Sonar [Calendar] [1 4 6 * * *] ratio: 33/181
    Sonar [Calendar] [1 4 6 8 * *] ratio: 33/90
     Sonar [Calendar] [1 4 6 8 20 *] ratio: 31/31 ß good cover

     Sonar [Calendar] [1 4 6 8 21 *] ratio: 2/28
      Sonar [Calendar] [1 4 6 8 21 582] ratio: 1/1 ß good cover

      Sonar [Calendar] [1 4 6 8 21 583] ratio: 1/1 ß good cover

Again, FE created two Sonar subcubes.

Sonar Subcube 3.1
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
2007
H1 CY 2007
Q1 CY 2007
January 2007
*

Sonar Subcube 3.2
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
*
*
*
*
+

Note that Sonar Subcube 2.2 retains slices on all levels of the [Calendar] hierarchy while Sonar Subcube 3.2 only has a slice on the lowest level of the [Calendar] hierarchy. This is an artifact of the current implementation. When multiple covering members of a hierarchy are added to a Sonar subcube, only a single slice is created at the lowest level of the hierarchy.

FE uses Sonar subcubes in several different places, including constructing a larger evaluation node when FE is executing in cell-by-cell mode. In the simple case where an MDX query can be answered directly from SE, Sonar subcubes are used to construct SE queries.

Algorithm 2: Evaluate cell values
Input: a cell iterator and the current Sonar registry
For each cell C
Find the Sonar subcube S that contains C
Build an evaluation node E for S
Execute E. In the simplest case, create SE subcube SSE from S and send SSE to SE.
Lookup value of C from cached result in E
Output: cell values

Another usage of Sonar subcubes is to build a prefetch subcube from a given SE query subcube. The full algorithm is again very complex with exceptions and special handling of arbitrary shapes or unary operators, etc. Here I include a simplified version to illustrate the essence of the algorithm:

Algorithm 3: Build a prefetch subcube for a query subcube
Input: a SE query subcube Squery and the current Sonar registry
Initialize the prefetch subcube Sprefetch as a clone of Squery,
For each Sonar subcube Ssonar in the Sonar registry
For each attribute Aprefetch in Sprefetch
Find the corresponding attribute Asonar in Ssonar
Skip Asonar if it is not a granularity attribute
If Aprefetch is a granularity attribute
If the slice of Asonar is a superset of the slice of Aprefetch
Copy the slice of Asonar to Aprefetch
Else
If the slice of Asonar includes all members
Make Aprefetch a granularity attribute
Output: Sprefetch if it is different from Squery

In Query 1, Sonar Subcube 1 is equivalent to the SE query subcube therefore no prefetch subcube is needed. In Query 2, the first SE query subcube matches Sonar Subcube 2.1 so again no prefetch subcube is needed. But the second SE query would copy the * from [Date] in Sonar Subcube 2.1 therefore produce a prefetch subcube that fetches all days in February. The * in Sonar Subcube 2.1 really means all days in January 2007 but Algorithm 3 treats each attribute independently and is oblivious to the fact that [Date]:* is cross-filtered by slices on other attributes. But the end result of prefetching all days in February 2007 isn’t too bad.

Prefetch Subcube 1
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
2007
H1 CY 2007
Q1 CY 2007
February 2007
*

But when we come to Query 3, there is a * on each attribute from either Sonar Subcube 3.1 or Sonar Subcube 3.2, we end up with a prefetch subcube that has * on all its attributes.

Prefetch Subcube 2
Calendar Year
Calendar Semester
Calendar Quarter
Month Name
Date
*
*
*
*
*

Nowadays we are seeing more and more implementation of systems with multiple terabytes of data in Analysis Services. The above example of overly aggressive prefetching of fact data is unacceptable in such environments. You can disable prefetching altogether by setting connection string property Disable Prefetch Facts=true or you can play with the Cache Ratio connection string property to influence the Sonar subcubes generated by FE. But either approach is likely to have negative performance impact that would require a lot of testing to confirm their overall benefit against a certain query set. If you can control the types of queries in your deployment, it is better to replace queries that fall into this pitfall with ones that produce good Sonar subcubes in the default settings. Using our example, since we know that there are 90 days in the first quarter and the default Cache Ratio is 0.5, we can simply ask for 45 days instead of 33 days to produce a single good cover at quarter level.

Query 4:

// January 2007 + February 1-14 2007
select [Internet Sales Amount] on 0,
head(descendants([Date].[Calendar].[Calendar Year].&[2007], [Date].[Calendar].[Date]), 45) on 1
from [Adventure Works]

 

The captured profiler trace confirmed our conjecture as it shows a single SE query to the [Internet Sales] measure group to retrieve all days in the first quarter of 2007.

Dimension 7 [Date] (0 * 8 0 6 0 0 0 0 0 0 * 4 0 0 0 0 0 0 0 0)
[Date]:*
[Calendar Quarter]:[Q1 CY 2007]
[Calendar Semester]:[H1 CY 2007]
[Month Name]:*
[Calendar Year]:[CY 2007]

1 comment:

  1. Brilliant! Simply Mind-blowing. Very very helpful article

    ReplyDelete