Sampling Problem with Filtered Statistics

October 12, 2009 · Posted in SQL Server 

I was writing a blog post to show how to solve the problem of bad estimations due to uneven data distribution with filtered statistics. After an hour or so of banging my head against the wall, I discovered that I ran into a bug feature. It appears that if the predicate on the filtered stats is very selective then a non-null histogram will not be generated unless they are created\updated with full scan. This is true for automatically generated stats for filtered indexes as well but they are created with full scan when the index is created.

Let’s look at demo data created with this code:


CREATE TABLE filtered_stats
(
c1 INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
c2 DATETIME DEFAULT GETDATE(),
c3 CHAR(500),
c4 INT
)

--lets at 1 million rows with c4=1
--lets add 1 row with c4 = 2
--Lets add a million more with c4 =3
--Finally, lets add 50 more with c4 =3 so the estimates are skewed.
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
   INSERT INTO
filtered_stats(c3, c4)
  
VALUES
  
('blah', 1)
  
SELECT @ctr = @ctr-1
END
GO
INSERT INTO filtered_stats(c3, c4)
  
VALUES
  
('blah', 2)
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
   INSERT INTO
filtered_stats(c3, c4)
  
VALUES
  
('blah', 3)
  
SELECT @ctr = @ctr-1
END
GO
DECLARE @ctr INT = 50
WHILE @ctr > 0
BEGIN
   INSERT INTO
filtered_stats(c3, c4)
  
VALUES
  
('blah', 2)
  
SELECT @ctr = @ctr-1
END

So now we have 1 million rows with c4 = 1 and 3. We have 51 rows with c4=2.

If we create this filtered stats, we have a null histogram: CREATE statistics fstats2 on filtered_stats(c4) where c4 =2

image

Updating or creating with fullscan give use the following data from dbcc show_statistics.

image

Bug or Feature? This is what BOL says:

“When only a few values are returned, however, sampling might not provide accurate statistics. For highly selective predicates, which return relatively few values, using the WITH FULLSCAN option will improve accuracy because it will scan all of the values instead of sampling.”

So they “recommend” that you use full scan on highly selective stats. However, I think that is a bug since we do not even get unfiltered rows or rows sampled back. Not to mention that inaccurate stats means to me that numbers are returned but they are most likely less than what is actually there. I will submit it as a bug unless one of you have thoughts otherwise.

So in a nut shell, ALWAYS update filtered stats with full scan or they may not actually be doing you any good. There are also some other considerations in regards to filtered stats that Kimberly Tripp has recently written about. There is some good conversion about it as well. Conor also posted about it here.

I will have more on filtered stats coming up in the next post.

This content is published under the Attribution-Share Alike 3.0 Unported license.

Comments

blog comments powered by Disqus