More To The Matter
In the year 950 B.C., Craig Freedman write a post about subqueries in CASE expressions. It’s amazing how relevant so much of this stuff stays.
In today’s post, we’re going to look at a slightly different example than the one given, and how you can avoid performance problems with them by using APPLY.
Like most query tuning tricks, this isn’t something you always need to employ, and it’s not a best practice. It’s just something you can use when a scalar subquery doesn’t perform as you’d like it to.
How Much Wood
Our starting query looks like this. The point of it is to determine the percentage of answered questions per month.
SELECT x.YearPeriod, MonthPeriod = RIGHT('00' + RTRIM(x.MonthPeriod), 2), PercentAnswered = CONVERT(DECIMAL(18, 2), (SUM(x.AnsweredQuestion * 1.) / (COUNT_BIG(*) * 1.)) * 100.) FROM ( SELECT YearPeriod = YEAR(p.CreationDate), MonthPeriod = MONTH(p.CreationDate), CASE WHEN EXISTS ( SELECT 1/0 FROM dbo.Votes AS v WHERE v.PostId = p.AcceptedAnswerId AND v.VoteTypeId = 1 ) THEN 1 ELSE 0 END AS AnsweredQuestion FROM dbo.Posts AS p WHERE p.PostTypeId = 1 ) AS x GROUP BY x.YearPeriod, x.MonthPeriod ORDER BY x.YearPeriod ASC, x.MonthPeriod ASC;
Smack in the middle of it, we have a case expression that goes looking for rows in the Votes table where a question has an answer that’s been voted as the answer.
To start with, we’re going to give it this index.
CREATE INDEX p ON dbo.Posts(PostTypeId, AcceptedAnswerId) INCLUDE(CreationDate) WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON);
In all, this query will run for about 18 seconds. The majority of it is spent in a bad neighborhood.
Why does this suck? Boy oh boy. Where do we start?
- Sorting the Votes table to support a Merge Join?
- Choosing Parallel Merge Joins ever?
- Choosing a Many To Many Merge Join ever?
- All of the above?
If we change the way the query is structured to use OUTER APPLY instead, we can get much better performance in this case.
SELECT x.YearPeriod, MonthPeriod = RIGHT('00' + RTRIM(x.MonthPeriod), 2), PercentAnswered = CONVERT(DECIMAL(18, 2), (SUM(x.AnsweredQuestion * 1.) / (COUNT_BIG(*) * 1.)) * 100.) FROM ( SELECT YearPeriod = YEAR(p.CreationDate), MonthPeriod = MONTH(p.CreationDate), oa.AnsweredQuestion FROM dbo.Posts AS p OUTER APPLY ( SELECT AnsweredQuestion = CASE WHEN v.Id IS NOT NULL THEN 1 ELSE 0 END FROM dbo.Votes AS v WHERE v.PostId = p.AcceptedAnswerId AND v.VoteTypeId = 1 ) oa WHERE p.PostTypeId = 1 ) AS x GROUP BY x.YearPeriod, x.MonthPeriod ORDER BY x.YearPeriod ASC, x.MonthPeriod ASC;
This changes the type of join chosen, and runs for about 3 seconds total.
We avoid all of the problems that the parallel many-to-many Merge Join brought us.
Thanks, Hash Join.
It’s also worth noting that the OUTER APPLY plan asks for an index that would help us a bit, though like most missing index requests it’s a bit half-baked.
USE [StackOverflow2013] GO CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>] ON [dbo].[Votes] ([VoteTypeId]) INCLUDE ([PostId]) GO
Any human could look at this query and realize that having the PostId in the key of the index would be helpful, since we’d have it in secondary order to the VoteTypeId column
CREATE INDEX v ON dbo.Votes(VoteTypeId, PostId) WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON);
If we add that index, we can make the subquery fairly competitive, at about 4.5 seconds total.
But the issue here is now rather than poorly choosing a Sort > Merge Join, we go into a Nested Loops join for ~6 million rows. That’s probably not a good idea.
This index doesn’t leave as profound a mark on the APPLY version of the query. It does improve overall runtime by about half a second, but I don’t think I’d create an index just to get a half second better.
But hey, who knows? Maybe it’d help some other queries, too.
Indexes are cool like that.
Back On The Map
If you’ve got subqueries in your select list that lead to poor plan choices, you do have options. Making sure you have the right indexes in place can go a long way.
You may be able to get competitive performance gains by rewriting them as OUTER APPLY. You really do need to use OUTER here though, because it won’t restrict rows and matches the logic of the subquery. CROSS APPLY would act like an inner join and remove any rows that don’t have a match. That would break the results.
Thanks for reading!
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount on to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
- SQL Server 2017 CU 30: The Real Story With SelOnSeqPrj Fixes
- Getting Parameter Values From A SQL Server Query Plan For Performance Tuning
- Getting The Top Value Per Group With Multiple Conditions In SQL Server: Row Number vs. Cross Apply With MAX
- Getting The Top Value Per Group In SQL Server: Row Number vs. Cross Apply Performance