Filtered Stats to Counter Data Skew Issues
There are a handful of things that can make SQL Server choose a “bad plan”. They usually revolve around variable sized result sets. Some problematic scenarios are parameter sniffing, data skew, local variables, table variables, UDF’s or complex calculations in the where clause. All of these and more are detailed in my favorite and recently updated white paper.
We are going to be talking about data skew in this post. Picture a table where you have 1 billion rows where IsShipped = 1 but 50k rows where IsShipped=NULL. Now, let’s say you have a query that returns orders that are not shipped(NULL). During compilation, the optimizer asks the stats how many rows are IsShipped=NULL. If the stats sampling touched, few, if any of the pages, where IsShipped = NULL, then SQL Server will guess that only 1 row will be returned. This may surpass the key lookup threshold(Part 1, Part 2) and a scan would not be optimal but better than a key lookup. Get it? We will look at an example soon if not.
In the past, your options have been limited. Update stats with fullscan was usually the best option unless the table was just too large and volatile. If it was, you probably had to use index hints or OPTIMIZE FOR the lesser of the two evils. You have covering indexes and is probably a good solution for the example above unless the result set has to be very wide.
Filtered statistics are a significant improvement and can be used nicely when dealing with this problem. They are smaller so cheaper to maintain. Their scope can be very narrow so they are more accurate.
There are a couple of caveats. I wrote about the need to use full scan on very selective stats here and Kimberly Tripp wrote about the fact that they will not auto update until the entire table crosses the threshold.
Let’s look at an example.
CREATE TABLE filtered_stats
(
c1 INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
c2 datetime DEFAULT GETDATE(),
c3 CHAR(500),
c4 INT
)
--lets add 1 million rows with c4=1
--lets add 1 row with c4 = 2
--Lets add a million more with c4 =3
--Finally, lets add 50 more with c4 =2 so the estimates are skewed.
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 1)
SELECT @ctr = @ctr-1
END
GO
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 2)
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 3)
SELECT @ctr = @ctr-1
END
GO
DECLARE @ctr INT = 50
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 2)
SELECT @ctr = @ctr-1
END
--Let's add a clustered index
CREATE CLUSTERED INDEX ix1 ON filtered_stats(c2)
--This index will be used for the key lookups
CREATE INDEX ix2 ON filtered_stats(c4)
--The estimated rows returned are 1 but the actual rows are 51.
--This is a bad estimate due to uneven data distribution.
SELECT c1, c2
FROM filtered_stats
WHERE c4=2
--Lets create narrow and more accurate filtered stats
--You still have to do full scan but you are touching just a fraction of the data – NOTE: Full scan is needed based on my observations here.
CREATE STATISTICS fstats ON filtered_stats(c4) WHERE c4 = 2 WITH fullscan
--This should estimate 51 no matter how many times you update stats since they are more accurate
SELECT c1, c2
FROM filtered_stats
WHERE c4=2
This is not bullet proof but I say it is an improvement. The downside include the fact that filtered stats are manually created. Manual stats updates will need to be done with a full scan. You may also say that a covering index(or filtered covering index) would kill this. That is true but there may be an actual need to return all columns of a table.
A quick wrap up: Put it in the tool box. It will be like that $50 wrench that you use once a year but it saves you 4 hours every time you touch it.
Sampling Problem with Filtered Statistics
I was writing a blog post to show how to solve the problem of bad estimations due to uneven data distribution with filtered statistics. After an hour or so of banging my head against the wall, I discovered that I ran into a bug feature. It appears that if the predicate on the filtered stats is very selective then a non-null histogram will not be generated unless they are created\updated with full scan. This is true for automatically generated stats for filtered indexes as well but they are created with full scan when the index is created.
Let’s look at demo data created with this code:
CREATE TABLE filtered_stats
(
c1 INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
c2 DATETIME DEFAULT GETDATE(),
c3 CHAR(500),
c4 INT
)
--lets at 1 million rows with c4=1
--lets add 1 row with c4 = 2
--Lets add a million more with c4 =3
--Finally, lets add 50 more with c4 =3 so the estimates are skewed.
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 1)
SELECT @ctr = @ctr-1
END
GO
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 2)
DECLARE @ctr INT = 1000000
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 3)
SELECT @ctr = @ctr-1
END
GO
DECLARE @ctr INT = 50
WHILE @ctr > 0
BEGIN
INSERT INTO filtered_stats(c3, c4)
VALUES
('blah', 2)
SELECT @ctr = @ctr-1
END
So now we have 1 million rows with c4 = 1 and 3. We have 51 rows with c4=2.
If we create this filtered stats, we have a null histogram: CREATE statistics fstats2 on filtered_stats(c4) where c4 =2
Updating or creating with fullscan give use the following data from dbcc show_statistics.
Bug or Feature? This is what BOL says:
“When only a few values are returned, however, sampling might not provide accurate statistics. For highly selective predicates, which return relatively few values, using the WITH FULLSCAN option will improve accuracy because it will scan all of the values instead of sampling.”
So they “recommend” that you use full scan on highly selective stats. However, I think that is a bug since we do not even get unfiltered rows or rows sampled back. Not to mention that inaccurate stats means to me that numbers are returned but they are most likely less than what is actually there. I will submit it as a bug unless one of you have thoughts otherwise.
So in a nut shell, ALWAYS update filtered stats with full scan or they may not actually be doing you any good. There are also some other considerations in regards to filtered stats that Kimberly Tripp has recently written about. There is some good conversion about it as well. Conor also posted about it here.
I will have more on filtered stats coming up in the next post.
Resource_semaphore_query_compile
This wait brings back nightmares from SQL Server 2005 x86 pre SP2. However, I have been troubleshooting this wait type for a customer recently so I thought I would write about it. Loosely, it means that new plans cannot be put into the procedure cache fast enough so they wait. There can be, in my experience, be two causes(post SQL Server 2005 sp2) of this:
- Plan reuse issues
- Large memory grants
Plan Reuse Issues
This is the easy one or hard one depending on your perspective. It is easy to identify but hard to fix. In an ideal world, you rewrite all of your adhoc sql as stored procedures following best practices for plan reuse. Easier said than done for large application that did not begin this way. Here is how you diagnose it.
SELECT TEXT ,cp.*
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
WHERE cacheobjtype = 'Compiled Plan' and usecounts = 1
ORDER BY cp.size_in_bytes DESC
The key columns are the size_in_bytes and usecounts. You want low size_in_bytes and high usecounts. Rows with 1 usecount are not being reused.
Aside from rewriting the application, there are a few setting that you can use to help. In SQL Server 2005+, you can turn on forced parameterization. This will auto-parameterize a lot of queries. It sounds good but it is usually not a silver bullet. SQL Server 2008 introduces the “Optimize for ADHOC Workloads” server setting. This creates small stub plans for plans that will not be reused. This further reduces the footprint of the procedure cache.
Here are some additional resources:
Compilation and recompilation whitepaper
Parameters and Execution Plan Reuse
Queries take a longer time to finish running when the size of the TokenAndPermUserStore cache grows in SQL Server 2005 (See workaround section)
Large Memory Grants
This is because the more memory used for grants, the less that will be available for query compilation. The symptoms are easy to identify. A high number in the perfmon counter memory manager : memory grant outstanding is a dead giveaway. Once you see that, run this query.
SELECT TEXT, query_plan, requested_memory_kb, granted_memory_kb, used_memory_kb
FROM sys.dm_exec_query_memory_grants
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
CROSS APPLY sys.dm_exec_query_plan(plan_handle)
This will give you queries currently waiting on a memory memory grant along with the query text and their xml showplan. Typically, when I see a server in a problem state, this query returns more than 10 rows(sometimes a lot more rows) and the queries are requesting fifty to several hundred megabytes of memory each.
I usually look for a common query that is waiting many times and requesting a large amount of memory. Look at its query plan. You want to find iterators that are large memory consumers like hash joins\aggregates, sorts and merge joins with sorts.
Sometimes the resolution is just recompiling the query or updating stats. If that doesn’t work, you can look for missing indexes. I would keep covering indexes in mind as well as filtered indexes. If the query still does a hash join(and index scan) on the filtered index, it will be cheaper. If there are no good choices for indexes, look for places to tune the query. Specifically, I have found help by moving the filters to the on clause but that might not meet requirements so be careful.
If there are cardinality overestimation problems and the iterator is actually only touching a few rows, you can try LOOP join hint. However, this should be rare or there may be parameter sniffing issues due uneven data distribution. If this is the case, simply recompiling the query should fix it most of the time. If that works, a statement level recompile, plan guides or OPTIMIZE FOR hint.
This is a tough wait state to solve. If it is not bad statistics, then you probably have some application design issues that need to be addressed in the long run. Hopefully this post will help you identify the problem and at least put some short term Band-Aids in place.
Key Lookup Threshold Part 2
Grant Fritchey(@gfritchey) posed this question to me on twitter after my posted on the key lookup threshold post.
@statisticsio Interesting post. Do you think that threshold is dependent on the data involved? Or maybe on the size of the key?8:28 AM Feb 5th from TwitterGadget in reply to statisticsio
I kind of assumed so which is why I did char(1000). However, I have just tested with char(1). The numbers are interesting. Please refer to the original post for repro code.
| IO | IO | |
| Rows | CHAR(1000) | CHAR(1) |
| Scan | 422 | 9 |
| 1 row | 4 | 4 |
| 5 rows | 12 | 12 |
| 10 rows | 22 | 22 |
| 25 rows | 52 | 52 |
| 50 rows | 114 | 102 |
| 100 rows | 217 | 202 |
| 250 Rows | 526 | 502 |
| 500 Rows | 1043 | 1003 |
| 1000 Rows | 4141 | 4007 |
The threshold is crossed much earlier because the scan is smaller. This is a small table especially with CHAR(1) so take this test with a grain of salt. Run your own tests when you are working with large production data.
I would also venture to guess that as the complexity of the query increases the threshold gets lower especially on more complex queries like when grouping especially when grouping and outer joining.
A covering index is probably the best solution in most cases unless you need to touch all rows even then, it might be better,.
Key Lookup Threshold
Gail talks about bookmark lookups…. err.. key lookups in this post. So are they good or bad? Well, like many things in SQL, it depends. The main factor is the number of rows returned. A few rows are fine but the cost rises sharply with larger result sets. There comes a point where the threshold is crossed and a scan is more efficient. This is because a scan leverages sequential IO while a lookup does random IO.
Here are the results of the code at the end of the post.
| Rows | IO | |
| Scan | 422 | |
| 1 row | 4 | |
| 5 rows | 12 | |
| 10 rows | 22 | |
| 25 rows | 52 | |
| 50 rows | 114 | |
| 100 rows | 217 | |
| 250 Rows | 526 | <–Threshold |
| 500 Rows | 1043 | |
| 1000 Rows | 4141 |
As you can see, at 250 rows, we have crossed the threshold and it is cheaper do a scan. If you are passing in a literal, the optimizer can detect this and switch to a scan. If it is a stored proc or parameterized SQL, a plan is cached the first time it is run. Problems happen when the result size greatly varies depending on the parameter. There are ways around this all with their pro’s and con’s. Here are some:
- A covering index.
- A statement level recompile hint
- Plan guides and hints
Here is the sample code that can repro these numbers on SQL Server 2008.
CREATE TABLE #temp
(
id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
c1 CHAR(1000) DEFAULT( 'blah'),
c2 INT
)
INSERT INTO #temp(c2)
VALUES
(1)
GO
INSERT INTO #temp(c2)
VALUES
(5)
GO 5
INSERT INTO #temp(c2)
VALUES
(10)
GO 10
INSERT INTO #temp(c2)
VALUES
(25)
GO 25
INSERT INTO #temp(c2)
VALUES
(50)
GO 50
INSERT INTO #temp(c2)
VALUES
(100)
GO 100
INSERT INTO #temp(c2)
VALUES
(250)
GO 250
INSERT INTO #temp(c2)
VALUES
(500)
GO 500
INSERT INTO #temp(c2)
VALUES
(1000)
GO 1000
INSERT INTO #temp(c2)
VALUES
(1000)
GO 1000
CREATE INDEX ix ON #temp(c2
)
--The baseline
SET STATISTICS io ON
SELECT *
FROM #temp WITH (INDEX=1)
WHERE c2 = 1
--1 row returned
SELECT *
FROM #temp
WHERE c2 = 1
--5 rows returned
SELECT *
FROM #temp
WHERE c2 = 5
--10 rows returned
SELECT *
FROM #temp
WHERE c2 = 10
--25 rows returned
SELECT *
FROM #temp
WHERE c2 = 25
--50 rows returned
SELECT *
FROM #temp
WHERE c2 = 50
--100 rows returned
SELECT *
FROM #temp
WHERE c2 = 100
--250 rows returned
--Must begin using hints because the optimizer can tell that a scan is better
SELECT *
FROM #temp WITH (INDEX=ix, forceseek)
WHERE c2 = 250
--1000 rows returned
--Must begin using hints because the optimizer can tell that a scan is better
SELECT *
FROM #temp WITH (INDEX=ix, forceseek)
WHERE c2 = 1000
Finding Index Scans due to Implicit Conversions
It is fairly widely known that implicit conversions of data types can cause scans. However, you may not be able to control what goes in your database, you may have inherited an app and we just are not perfect.
On top of that some can be tricky. For example, did you know that scope_identity and @@identity both return numeric data types. An AVG of a SMALLINT column returns an INT.
Here is a simple repro which the same query does an index scan instead of an index seek
CREATE TABLE t1
(
id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
, c1 VARCHAR(100)
)
GO
INSERT INTO t1(c1)
SELECT 1
GO 10000
INSERT INTO t1(c1)
SELECT 0
GO 10
CREATE INDEX ix ON t1 (c1)
CREATE PROC p1
@blah NVARCHAR(100)
AS
SELECT id FROM t1
WHERE c1 = @blah
EXEC p1 N'0'
CREATE PROC p2
@blah VARCHAR(100)
AS
SELECT id FROM t1
WHERE c1 = @blah
EXEC p2 '0'
DROP TABLE t1
DROP PROC p1
DROP PROC p2
I found a sample query from Umachandar Jayachandran and made some mod’s to it. It basically scans your procedure cache for index scans containing implicit conversions.
I have posted it at the SQLServerPedia Wiki here. It is the 2nd query listed.
Note: Implicit conversions may not always result in a scan.
Never Index a BIT?
Never say never, right? For example, you have an orders table. It has an IsShipped bit column which is what marks an order as done to the business. You might have a more complex version of this example so employees can see the outstanding order count in the app:
CREATE TABLE #temp
(
c1 INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
c2 VARCHAR(10) DEFAULT 'bob',
c3 bit
)
INSERT INTO #temp(c3)
VALUES
(1)
GO 5000000
INSERT INTO #temp(c3)
VALUES
(0)
GO 500
--13,000 IO's
SET STATISTICS io ON
SELECT COUNT(*)
FROM #temp
WHERE c3 = 0
CREATE INDEX ix ON #temp(c3) include (c2)
--6 IO's
SET STATISTICS io ON
SELECT COUNT(*)
FROM #temp
WHERE c3 = 0
In SQL 2008, we can get a little better with filtered indexes but really all we are saving is space.
CREATE INDEX ix2 ON #temp(c3) include (c2)
WHERE c3 = 0
--4 IO's
SET STATISTICS io ON
SELECT COUNT(*)
FROM #temp
WHERE c3 = 0
Why do I have to use a MERGE\HASH JOIN Hint?
I have run into this several times lately so I thought I would whip out a quick post.
The scenario
A query runs much faster with a hash join hint than a nested loop but that is what the optimizer is choosing.
High level background
Merge and hash are good when you are working with lots of rows because it can leverage sequential IO. If you do a nested loop join with lookups on 100-1000+ rows, the random IOs can be 10-1000% more expensive than a sequential scan.
Why?
So why does the optimizer choose a nested loop? I have seen it is because the optimizer had to make a best guess at how make rows were being returned. It most cases, the optimizer guesses 1 row will be returned so a nested loop is the best choice as far as the optimizer can tell.
Identifying the problem
Identifying the problem is pretty easy. All you need to do is look at the query plan. It can be text, xml or graphical. Look actual and estimated rows returned. If the actual rows are 11teen,000,000 but the estimated rows is 1 then this may be the problem
Causes
I most often see this when large table variables, local variables or modified stored procedure parameters are involved. There can be other causes. See my favorite white paper.
Fix?
Easy. Don’t use table variables and local variables. If you have to modify sp parameters, pass them to a nested stored proc. Make sure stats are up to date. If the WHERE cause is complex, try putting the logic in a persisted computed column or indexed view. Another option is to use a covering index. Of course, the last resort is the JOIN hint.
Need more?
Want to get real deep on stuff like this? Check out Craig Freedman’s blog or his Inside SQL Server 2005 chapters.
5 Quick Tips for the Query using the “Wrong” Index
I quote “wrong” because with the info the optimizer has, it thinks it has the right index.
1. Look for cardinality problems. You can do this by looking at the actual rows versus estimated rows in the query plan. If they are off, chances are you are not following the best practices listed here.
2. Conversions in the WHERE clause can cause issues. Be sure to look at query plan for implicit conversions
3. Complicated logic in the WHERE clause can also cause problems. Consider indexed persisted computed columns or indexed views.
4. Constraints give the optimizer more info when choosing a plan with the “right” index. It can take a PK or unique constraint into consideration but it does not know about similar logic in a trigger.
5. Large datasets are being returned but the index is not covering. An index with just the columns in the WHERE clause is ok if just a few rows are returned but it does not take much for an index scan to be more efficient.
#1 is the most important one because the link gives 5 or 7 more reason why the “wrong” index is chosen.
P.S. If all else fails, update stats and if that does not work, make the index covering.
This Post Needs More Stored Procedures
New in SQL Server 2008 is a server setting called “Optimize for adhoc workloads”. I was happy to see this. You should be too especially if you have ever had arm wrestle an app that causes a bloated proc cache on an x86 box. Ugghh… Adam blogged on it here, Bob blogged on it here and here is the documentation.
I was wondering how this setting would play with the forced parameterization database setting. It looks like forced parameterization trumps the new server wide setting. If my simple tests below are right, it could present an interesting problem. What if your app generates a lot of totally unique adhoc queries that have no benefit from parameterization and you have queries that would benefit from parameterization? Well, the answer is still the same. Write stored procedures.
In the real world, sometimes you get a pig and all you can do is put lipstick on it. If you have to choose, “Optimize for Adhoc” would most benefit memoryIO bound servers while “forced parameterization” could help both the CPU and memory usage. However, that is a very general statement. Other things should be considered so it will depend.
--So what happens when you turn on "Optimize for adhoc" and forced parameterization exec sp_configure 'show advanced options', 1reconfigure with overridegoexec sp_configure 'optimize', 1reconfigure with override alter database master set parameterization forceddbcc freeproccache–Let’s run a query will not generate a trivial plan.
select name, object_id, create_datefrom sys.all_objects
where object_id = 3 and create_date = ‘2008-07-09 16:19:59.943′
go
–The CacheObjType is a compiled plan and 57344 bytes
select p.cacheobjtype, p.size_in_bytes , s.*from sys.dm_exec_query_stats s
join sys.dm_exec_cached_plans p
on s.plan_handle = p.plan_handle
cross apply sys.dm_exec_sql_text(sql_handle)
where text like ‘%select name , object_id , create_date from%’
–Now, let’s turn off forced parameterization
alter database master set parameterization simple
dbcc freeproccacheselect name, object_id, create_date
from sys.all_objectswhere object_id = 3 and create_date = ‘2008-07-09 16:19:59.943′
go
–The CacheObjType is a Compiled Plan Stub and 320 bytes
select p.cacheobjtype, p.size_in_bytes , s.*from sys.dm_exec_query_stats s
join sys.dm_exec_cached_plans p
on s.plan_handle = p.plan_handle
cross apply sys.dm_exec_sql_text(sql_handle)
where text like ‘%select name , object_id , create_date from%’
