Erik Darling – Darling Data

Join me In Boston May 10 For A Full Day Of SQL Server Performance Tuning Training

Posted on May 6, 2024April 10, 2024 by Erik Darling

Spring Training

This May, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, May 10th 2024 at the Microsoft Offices in Burlington.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Indexing SQL Server Queries For Performance: Equality vs. Inequality Searches

Posted on May 3, 2024May 3, 2024 by Erik Darling

Big, Bold Flavor

Since I first started reading about indexes, general wisdom has been to design the key of your indexes to support the most restrictive search predicates first.

I do think that it’s a good starting place, especially for beginners, to get acceptable query performance. The problem is that many databases end up designed with some very non-selective columns that are required for just about every query:

Soft deletes, where most rows are not deleted
Status columns, with only a handful of potential entries

Leaving the filtered index question out for the moment, I see many tables indexed with the “required” columns as the first key column, and then other (usually) more selective columns further along in the key. While this by itself isn’t necessarily a bad arrangement, I’ve seen many local factors lead to it contributing to bad performance across the board, with no one being quite sure how to fix it.

In this post, we’ll look at both an index change and a query change that can help you out in these situations.

Schema Stability

We’re going to start with two indexes, and one constraint.

CREATE INDEX
    not_posts
ON dbo.Badges
    (Name, UserId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    not_badges
ON dbo.Posts
    (PostTypeId, OwnerUserId)
INCLUDE
    (Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

ALTER TABLE
    dbo.Posts
ADD CONSTRAINT
    c_PostTypeId
CHECK
(
      PostTypeId > 0 
  AND PostTypeId < 9
);
GO

The index and constraint on the Posts table are the most important. In this case, the PostTypeId column is going to play the role of our non-selective leading column that all queries “require” be filtered to some values.

You can think of it mentally like an account status, or payment status column. All queries need to find a particular type of “thing”, but what else the search is for is up to the whims and fancies of the developers.

A Reasonable Query?

Let’s say this is our starting query:

SELECT
    DisplayName =
        (
            SELECT
                u.DisplayName
            FROM dbo.Users AS u
            WHERE u.Id = b.UserId
        ),
    ScoreSum = 
        SUM(p.Score)
FROM dbo.Badges AS b
CROSS APPLY
(
    SELECT
        p.Score,
        n =
            ROW_NUMBER() OVER
            (
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p 
    WHERE p.OwnerUserId = b.UserId
    AND   p.PostTypeId < 3
) AS p
WHERE p.n = 0
AND   b.Name IN (N'Popular Question')
GROUP BY
    b.UserId;

Focusing in on the CROSS APPLY section where the Posts table is queried, our developer has chosen to look for PostTypeIds 1 and 2 with an inequality predicate. Doing so yields the following plan, featuring an Eager Index Spool as the villain.

sql server query plan — i came to drop crumbs

SQL Server decided to scan our initial index and create a new one on the fly, putting the OwnerUserId column first, and the Score column second in the key of the index. That’s the reverse of what we did.

Leaving aside all the icky internals of Eager Index Spools, one can visually account for about 20 full seconds of duration spent on the effort.

Query Hints To The Rescue?

I’ve often found that SQL Server’s query optimizer is just out to lunch when it chooses to build an Eager Index Spool, but in this case it was the right choice.

If we change the query slightly to use a hint (FROM dbo.Posts AS p WITH(FORCESEEK)) we can see what happens when we use our index the way Codd intended.

It is unpleasant. I allowed the query to execute for an hour before killing it, not wanting to run afoul of my laptop’s extended warranty.

The big problem of course is that for each “seek” into the index, we have to read the majority of the rows across two boundaries (PostTypeId 1 and PostTypeId 2). We can see that using the estimated plan:

Because our seek crosses range boundaries, the predicate on OwnerUserId can’t be applied as an additional Seek predicate. We’re left applying it as a residual predicate, once for PostTypeId 2, and once for PostTypeId 1.

The main problem is, of course, that those two ranges encompass quite a bit of data.

+------------+------------+
| PostTypeId |    count   |
+------------+------------+
|          2 | 11,091,349 |
|          1 |  6,000,223 |
|          4 |     25,129 |
|          5 |     25,129 |
|          3 |        167 |
|          6 |        166 |
|          7 |          4 |
|          8 |          2 |
+------------+------------+

11 million rows for 2, and 6 million rows for 1.

Changing The Index

If you have many ill-performing queries, you may want to consider changing the order of key columns in your index to match what would have been spooled:

CREATE INDEX
    not_badges_x
ON dbo.Posts
    (OwnerUserId, PostTypeId)
INCLUDE
    (Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
GO

This gets rid of the Eager Index Spool, and also the requirement for a FORCESEEK hint.

At this point, we may need to contend with the Lazy Table Spool in order to get across the finish line, but we may also consider getting a query from ~30 seconds down to ~4 seconds adequate.

Of course, you may just have one query suffering this malady, so let’s look at a query rewrite that also solves the issue.

Optimizer Inflexibility

SQL Server’s query optimizer, for all its decades of doctors and other geniuses working on it, heavily laden with intelligent query processing features, still lacks some basic capabilities.

With a value constraint on the table telling the optimizer that all data in the column falls between the number 1 and 8, it still can’t make quite a reasonable deduction: Less than 3 is the same thing as 1 and 2.

Why does it lack this sort of simple knowledge that could have saved us so much trouble? I don’t know. I don’t even know who to ask anymore.

But we can do it! Can’t we? Yes! We’re basically optimizer doctors, too.

With everything set back to the original two indexes and check constraint, we can rewrite the where clause from PostTypeId < 3 to PostTypeId IN (1, 2).

If we needed to take extraordinary measures, we could even use UNION ALL two query against the Posts table, with a single equality predicate for 1 and 2.

Doing this brings query performance to just about equivalent with the index change:

The main upside here is the ability for the optimizer to provide us a query plan where there are two individual seeks into the Posts table, one for PostTypeId 1, with an additional seek to match OwnerUserId, and then one additional seek for PostTypeId 2, with an additional seek to match OwnerUserId.

This isn’t always ideal, of course, but in this case it gets the job fairly well done.

Plan Examiner

Understanding execution plans is sometimes quite a difficult task, but learning what patterns to look for can save you a lot of standing about gawping at irrelevancies.

The more difficult challenge is often taking what you see in an execution plan, and knowing what options you have available to adjust them for better performance.

In some cases, it’s all about establishing better communication with the optimizer. In this post, I used a small range (less than 3) as an example. Many dear and constant readers might find the idea that someone would write that over a two value IN clause ridiculous, but I’ve seen it. I’ve also seen it in more reasonable cases for much larger ranges.

It’s good to understand that the optimizer doesn’t have infinite tricks available to interpret your query logic into the perfect plan. Today we saw that it was unable to change < 3 to = 1 OR = 2, and you can bet there are many more such reasonable simplifications that it can’t apply, either.

Anyway, good luck out there. If you need help with these things, the links in the below section can help you get it from me.

Thanks for reading!

Going Further

Why Read Committed Queries Can Still Return Bad Results In SQL Server

Posted on May 2, 2024April 11, 2024 by Erik Darling

Why Read Committed Queries Can Still Return Bad Results In SQL Server

Thanks for watching!

Going Further

Join Me And Kendra Little At PASS For Two Days Of SQL Server Performance Tuning Precons

Posted on May 1, 2024May 1, 2024 by Erik Darling

Last Year

Kendra and I both taught solo precons, and got to talking about how much easier it is to manage large crowds when you have a little helper with you, and decided to submit two precons this year that we’d co-present.

Amazingly, they both got accepted. Cheers and applause. So this year, we’ll be double-teaming Monday and Tuesday with a couple pretty cool precons.

You can register for PASS Summit here, taking place live and in-person November 4-8 in Seattle.

Here are the details!

Day One: A Practical Guide to Performance Tuning Internals

Whether you’re aiming to be the next great query tuning wizard or you simply need to tackle tough business problems at work, you need to understand what makes a workload run fast– and especially what makes it run slowly.

Erik Darling and Kendra Little will show you the practical way forward, and will introduce you to the internal subsystems of SQL Server with a practical guide to their capabilities, weaknesses, and most importantly what you need to know to troubleshoot them as a developer or DBA.

They’ll teach you how to use your understanding of the database engine, the storage engine, and the query optimizer to analyze problems and identify what is a nothingburger best practice and what changes will pay off with measurable improvements.

With a blend of bad jokes, expertise, and proven strategies, Erik and Kendra will set you up with practical skills and a clear understanding of how to apply these lessons to see immediate improvements in your own environments.

Day Two: Query Quest: Conquer SQL Server Performance Monsters

Picture this: a day crammed with fun, fascinating demonstrations for SQL Server and Azure SQL.

This isn’t your typical training day; this session follows the mantra of “learning by doing,” with a good dose of the unexpected. Think of this as a SQL Server video game, where Erik Darling and Kendra Little guide you through levels of weird query monsters and performance tuning obstacles.

By the time we reach the final boss, you’ll have developed an appetite for exploring the unknown and leveled up your confidence to tackle even the most daunting of database dilemmas.

It’s SQL Server, but not as you know it—more fun, more fascinating, and more scalable than you thought possible.

Going Further

We’re both really excited to deliver these, and have BIG PLANS to have these sessions build on each other so folks who attend both days have a real sense of continuity.

Of course, you’re welcome to pick and choose, but who’d wanna miss out on either of these with accolades like this?

twitter — pretty, pretty, pretty, pretty good

You can register for PASS Summit here, taking place live and in-person November 4-8 in Seattle.

See you there!

Going Further

Updates To sp_QuickieStore, sp_HumanEventsBlockViewer, and sp_PressureDetector

Posted on April 30, 2024April 8, 2024 by Erik Darling

Chasing Perfection

We here at Darling Data strive to get things right the first time, but sometimes late nights and tired eyes conspire against us.

The nice thing about using these on a wide variety of SQL Servers in various states of disrepair is that bugs get spotted and sorted pretty quickly.

You can download all of the main SQL Server troubleshooting procedures I use in one convenient file.

Here’s a breakdown of changes you can find in the most recent releases!

sp_QuickieStore

Here’s what got fixed and added in this round of changes:

Fixed a big where execution count and other metrics were being underreported
Fixed a bug when checking for AG databases would throw an error in Azure SQLDB and Managed Instance (Thanks to AbhinavTiwariDBA for reporting)
Added the IGNORE_DUP_KEY option to a couple temp table primary keys that could potentially see duplicate values when certain parameter combinations are used (Thanks to ReeceGoding for reporting)
Added support for displaying plan XML when plans have > 128 nested nodes of XML in them (you can’t open them directly, but you can save and reopen them as graphical plans)
Added underscores to the “quotable” search characters, so they can be properly escaped

So now we don’t have to worry about any of that stuff. How nice. How nice for us.

sp_PressureDetector

Here’s what got fixed and added in this round of changes:

Fixed an issue in the disk metrics diffing where some data types weren’t explicit
Fixed a bunch of math issues in the disk diff, too (it turns out I was missing a useful column, doh!)
Fixed a bug in the “low memory” XML where I had left a test value in the final query
Added information about memory grant caps from Resource Governor (with a small hack for Standard Edition)

Turns out I’m not great at math, and sometimes I need to think a wee bit harder. Not at 4am, though.

sp_HumanEventsBlockViewer

Here’s what got fixed and added in this round of changes:

Added a check for COMPILE locks to the analysis output
Added a check for APPLICATION locks to the analysis output
Improved the help section to give blocked process report and extended event commands
Improved indexing for the blocking tree code recursive CTE
Moved contentious object name resolution to an update after the initial insert

The final one was done because when there’s a lot of data in the blocked process report, this query could end up being pretty slow. Why, you might ask? Because calling OBJECT_ID() in the query forces a serial plan.

Fun stuff.

Issues and Contributions

If you’d like to report an issue, request or contribute a feature, or ask a question about any of these procedures, please use my GitHub repo. Specifically, check out the contributing guide.

As happy as I am to get emails about things, it makes support difficult for the one-man party that I am. Personal support costs money. Public support is free. Take your pick, there!

Thanks for reading!

Going Further

Join me In Boston May 10 For A Full Day Of SQL Server Performance Tuning Training

Posted on April 29, 2024February 16, 2024 by Erik Darling

Spring Training

This May, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, May 10th 2024 at the Microsoft Offices in Burlington.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further

Lookup Costing Is Really Weird In SQL Server

Posted on April 26, 2024April 8, 2024 by Erik Darling

Lookup Costing Is Really Weird In SQL Server

Thanks for watching!

Going Further

The How To Write SQL Server Queries Correctly Cheat Sheet: Conditional Join and Where Clauses

Posted on April 25, 2024April 25, 2024 by Erik Darling

One Way Or Another

The OR operator is a perfectly valid one to use in SQL statements. If you use an IN clause, there’s a reasonable chance that the optimizer will convert it to a series of OR statements.

For example, IN(1, 2, 3) could end up being = 1 OR = 2 OR = 3 without you doing a darn thing. Optimizers are funny like that. Funny little bunnies.

The problem generally isn’t when asking for IN or OR for a single column, with a list of literal values, the problem is usually when you:

Use OR across multiple where clause columns
Use OR in a join clause of any variety
Use OR to handle NULL parameters or variables

Throw in some additional complexity by joining two tables together, and asking for something like:

SELECT
    *
FROM t1 AS t1
JOIN t2 AS t2
  ON t1.id = t2.id
WHERE t1.thing = 'something'
OR    t2.thing = 'something';

Of course, sufficiently complex filtering clauses will likely turn into case expressions, which I’ve seen cause more performance headaches that I’m happy to recall.

The main issue is, of course, performance, and figuring out what a reasonable index, or set of indexes might be.

To some, the idea of needing multiple indexes on a single table to make one query perform well might seem counterintuitive overkill. To anyone who has had to deal with interval queries, it’s a fact of life.

You might hate to hear this, but proper indexing isn’t enough in every case, and the point of this post is to show you how to rewrite queries so that SQL Server can use your indexes well.

The main problem is that certain query transformations, like index intersection and index union, which may be appropriate, might be costed much higher than a clustered index scan, or a seek/scan plus lookup.

Anyway, let’s get a move on. I can only be a young and good looking consultant for so long, here.

Not So Fast

If you have perfect indexes because you’re a perfect indexer, I’m happy for you.

Most people I do work for are not in that camp, nor even in that forest. Some may be blissfully unaware of camps and forests.

That’s okay though, because that’s what I get paid for.

Let’s say you have this index. It’s a good attempt at a good index.

CREATE INDEX 
    p1 
ON dbo.Posts
    (OwnerUserId, LastEditorUserId, Score) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

It’s a good attempt at a good index because this is the query it’s for.

SELECT
    p.*
FROM dbo.Posts AS p
WHERE 
(
     p.OwnerUserId = 22656
  OR p.LastEditorUserId = 22656
)
AND p.Score > 0;

You might look at this arrangement and think: This is my greatest work. I’ll be able to seek to OwnerUserId, and then apply residual predicates on LastEditorUserId and Score.

I mean, maybe not those exact words. You might have other things you’re proud of. Hopefully those turn out better.

Zero Seeks

If you’re the type of person — and I like this type of person — who looks at the query plan, you’ll be just as sore as a sunburn after looking at this one.

If you’re the type of person — and I like this type of person — who tries query hints when they don’t get their way, you might try a FORCESEEK hint here.

Unfortunately, you’ll remain in your sore state.

Msg 8622, Level 16, State 1, Line 28

Query processor could not produce a query plan because of the hints defined in this query.

Resubmit the query without specifying any hints and without using SET FORCEPLAN.

If you’re keen to play along at home and create and index with Score first, you’re going to get a seek, but a very disappointing one. There is no improvement in performance.

Okay, so what happened?

Breakdown

It’s really tough to get a seek with an OR on the two leading columns of an index. Sure, you could put Score first, but Score > 0 is about as selective as you at last call.

Of course, I mean that you’ll take whatever drink your friend comes back from the bar with. Don’t be lewd.

SQL Server done flipped that all upside down, didn’t it? Score is first, the OR clause is second. What a nuisance.

The problem of course is that we have two distinct sets of predicates:

OwnerUserId = 22656 and Score > 0
LastEditorUserId = 22656 and Score > 0

And the index we have doesn’t allow us to access the data in that way. No single index would, but two separate indexes would.

If we express out query in a slightly different way, that’s easier to see.

SELECT
    p.*
FROM 
(
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = 22656
    AND   p.Score > 1000
    
    UNION ALL
    
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.LastEditorUserId = 0
    AND p.Score > 1000
) AS p;

Here’s the query plan for that:

It may make more sense seeing this query plan, because you can see the optimizer’s dilemma a little better now.

While it’s possible to seek to OwnerUserId, and then evaluate Score, it’s not possible to seek to LastEditorUserId (regardless of a predicate on Score).

LastEditorUserId is the second key column, which means it’s not ordered in a helpful way for seeks to happen.

Different Drum

To get both sets of predicates to behave, you’d need two indexes like these:

CREATE INDEX 
    p2 
ON dbo.Posts
    (LastEditorUserId, Score) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    p3 
ON dbo.Posts
    (OwnerUserId, Score) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

This allows for each set of predicates to be evaluated efficiently, even using the original query pattern with an OR predicate.

And we have the elusive index union plan that I mentioned earlier.

The two new indexes are used, sought into, union all-ed together, aggregated, and finally there’s a lookup in the plan to retrieve all the columns in Posts that aren’t covered by them.

The optimizer doesn’t often choose plans like this, and it’s not always good when it does, but these indexes do pretty well here because the predicate combinations are selective.

Of Join And Where And Or

SQL Server’s cost based optimizer is good at implying many things from the clutches of your calamitous queries, or so I’m told.

It’s quite an amazing piece of work, but it has a lot of problems.

I suppose that’s common with models, whether they’re cost models or runway models. Perhaps I should spend the next 15 years of my life studying runway models to be sure.

For example, if I have a query like this:

SELECT TOP (1)
    c.Id
FROM dbo.Posts AS p
JOIN dbo.Comments AS c
  ON p.OwnerUserId = c.UserId
WHERE p.OwnerUserId = 22656;

SQL Server is accurately able to figure out that the only number I care about is 22656, because one of the join columns is in the where clause. The query plan looks about like so:

There’s a seek on both sides of the join specifically to a value in each index, which means only rows with matching values will be projected from either side into the join.

And yet, we have a warning that there’s no join predicate. Curse of the summer intern.

Where the optimizer is somewhat less good at such inferences is, of course, when you get OR clauses involved.

Here’s an example! I love examples. Grand things, those.

SELECT 
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
  ON c.PostId = v.PostId
WHERE c.PostId = 138
OR    v.UserId = 831;

You might laugh and wonder who would ever write a query like this, but it’s far from uncommon to see a game of mix and match across tables and columns in the where clause.

The unbearable weirdness of OR strikes again.

The entire join and where clause is evaluated at the hash join. There’s probably a good reason for it, but it’s late and I’m not feeling overly charitable in researching it.

If we rewrite the where clause a bit, here’s what we end up with:

SELECT 
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
  ON c.PostId = v.PostId
WHERE v.PostId = 138
OR    v.UserId = 831;

The only change is that I’m using the PostId column from the Votes tables instead of the PostId column from the Comments table here.

And would it surprise you that the query plan ends up not being a miserable sofa? I hope not.

The secret of course, is having some good-enough indexes hanging around to help you locate all your precious data. I’m sure I’ve mentioned they’re important a time or two.

Here’s what I have:

CREATE INDEX 
    meh
ON dbo.Votes
    (PostId, UserId)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    eh
ON dbo.Votes
    (UserId, PostId)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    ennui
ON dbo.Comments
    (PostId)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

And yet, even with all those pretty little indexes, the first query stunk it up worse than the easter egg I’m probably gonna find in a couple months behind a radiator.

Pitting good indexes against poorly written queries is just cruel.

Won’t someone think of the indexes?

Of Join And Or

Joins with OR clauses fare no better. It’s an unfortunate convenience for those writing queries, because it’s incredible inconvenient for the optimizer.

There are likely some shortcomings that the optimizer team could address in this space, but some join logic is probably more than they’d like to take a chance on.

A simple two-table join with an OR clause would be easy enough, but let’s look at one that isn’t so straightforward.

First, some very generous indexes:

CREATE INDEX
    v
ON dbo.Votes
    (PostId)
INCLUDE
    (BountyAmount)
WHERE
    (BountyAmount IS NOT NULL)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    p
ON dbo.Posts
    (AcceptedAnswerId, Id) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    p2
ON dbo.Posts
    (PostTypeId, Id) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Look how generous I am with my indexes. The most generous and most humble indexer, they call me.

Query Time, Excellent

Here’s what we’re working with from a query perspective!

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
JOIN dbo.Posts AS p2
  ON  p.Id = p2.AcceptedAnswerId
JOIN dbo.Votes AS v
  ON p.Id = v.PostId
  OR p2.AcceptedAnswerId = v.PostId
WHERE v.BountyAmount IS NOT NULL
AND   p.PostTypeId = 2
AND   p2.AcceptedAnswerId > 0;

What makes this like more of a challenge is that Posts is joined to itself, and columns from each make up the join predicate to Votes.

What makes this kind of stupid is that this is all inner joins, even taking potential row duplication into account from a many to many relationship, it could be simplified from:

FROM dbo.Posts AS p 
JOIN dbo.Posts AS p2 
  ON p.Id = p2.AcceptedAnswerId 
JOIN dbo.Votes AS v 
  ON p.Id = v.PostId 
  OR p2.AcceptedAnswerId = v.PostId

FROM dbo.Posts AS p 
JOIN dbo.Posts AS p2 
  ON p.Id = p2.AcceptedAnswerId 
JOIN dbo.Votes AS v 
  ON p.Id = v.PostId

FROM dbo.Posts AS p 
JOIN dbo.Posts AS p2 
  ON p.Id = p2.AcceptedAnswerId 
JOIN dbo.Votes AS v 
  ON p2.AcceptedAnswerId = v.PostId

What’s particularly interesting, but somewhat difficult to describe well to the optimizer, is that any given question can only have one accepted answer, so there’s no many to many relationship with that join.

If we were to use the ParentId column instead of AcceptedAnswerId, it’d be a different story. One question can have many answers (but again, only one can be accepted as the answer).

This is a great reason to normalize your data, and not cram them full of parent/child relationships.

Fine Time

The query plan is quite typical of one with a join that contains an OR predicate. It takes just about six seconds.

If you’re keen on understanding the pattern, and it’s a good pattern to understand:

3.8 million rows hit the hash join
One constant scan passes out rows for the Id column
Another constant scan passes out rows for the AcceptedAnswerId column
Merge Interval attempts to group overlapping values

It is a bit confusing at first glance, because of the two nested loops joins.

The outermost nested loops (next to the hash match) is of the apply variety, but SQL Server does all the work described above in order to reduce the number of times the Votes table would have to be accessed.

Since this is another OR condition, let’s think back to the first query we talked about, where a FORCESEEK hint produce an optimizer error because there was no readily seekable predicate.

If you’re the type of person — and I like this type of person — who tries query hints when they don’t get their way, you might try an OPTION(HASH JOIN) hint here.

But since hash joins require at least one equality predicate, you’ll get a rather familiar optimizer error. The OR condition produces a query plan with CompareOp="IS" in it, which is not the same as "EQ".

Fun With Rewrites

Part of the reason why Microsoft likely avoids this specific optimizer search space is that the logic required to get this right is somewhat involved.

Here’s the query I used to get a matching count. For this query, it’s safe to use EXISTS because you don’t need to project any columns out from the Posts table:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.BountyAmount IS NOT NULL
AND   EXISTS
(
    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.AcceptedAnswerId = v.PostId
    AND   p.AcceptedAnswerId > 0
    AND EXISTS
    (
        SELECT
            1/0
        FROM dbo.Posts AS p2
        WHERE p.Id = p2.AcceptedAnswerId
    )

    UNION ALL

    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.Id = v.PostId
    AND   p.PostTypeId = 2
    AND EXISTS
    (
        SELECT
            1/0
        FROM dbo.Posts AS p2
        WHERE p.Id = p2.AcceptedAnswerId
    )
);

If you needed to do that, you would use CROSS APPLY.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
CROSS APPLY
(
    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.AcceptedAnswerId = v.PostId
    AND   p.AcceptedAnswerId > 0
    AND EXISTS
    (
        SELECT
            1/0
        FROM dbo.Posts AS p2
        WHERE p.Id = p2.AcceptedAnswerId
    )

    UNION ALL

    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.Id = v.PostId
    AND   p.PostTypeId = 2
    AND EXISTS
    (
        SELECT
            1/0
        FROM dbo.Posts AS p2
        WHERE p.Id = p2.AcceptedAnswerId
    )

) AS p
WHERE v.BountyAmount IS NOT NULL;

The query plans are close enough to identical to not show you twice.

Is it bigger? Yes.

Was the query longer? Yes.

But if you’ve been following this series, or my blog posts in general, you’ll know that it’s quite rare that shortcuts and conveniences get you anywhere with the optimizer.

Case Closed

Something I mentioned early on in the post, with an element of obvious foreshadowing, is the use of case expressions in join and where clauses.

While they’re not precisely the same thing as OR clauses, they are conditional logic, which can do a decent job of mangling up performance.

Here’s our setup, index-wise:

CREATE INDEX 
    p
ON dbo.Posts 
    (Score, PostTypeId, OwnerUserId)
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX 
    u
ON dbo.Users 
    (CreationDate, Reputation, AccountId)
INCLUDE 
    (DisplayName)
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

And here’s the query we’ll be looking at:

SELECT
    u.Id,
    u.DisplayName,
    s = SUM(p.Score)
FROM dbo.Posts AS p
JOIN dbo.Users AS u
  ON 1 = CASE     
             WHEN p.PostTypeId IN (1, 2)
             AND  p.OwnerUserId = u.Id
             THEN 1
             WHEN p.PostTypeId > 2
             AND  p.OwnerUserId = u.AccountId
             THEN 1
             ELSE 0
         END
WHERE u.CreationDate >= '20131231 23:00'
AND   1 = CASE 
              WHEN p.Score >= 1
              AND  u.Reputation >= 10
              THEN 1
              ELSE 0
          END
GROUP BY
    u.Id,
    u.DisplayName
ORDER BY 
    s DESC;

I’ve seen enough nightmare-fuel queries like this to include it in the mix, because I never want to see one of these again.

Just kidding! I want to see yours and fix them for you. Show me all of them. Bring out the whole gang. My soul is prepared.

Avert Your Eyes

Here’s the query plan we get from this query.

The only thing that isn’t terrible is the one part of the where clause that isn’t wrapped up in some derived dilemma, which is the predicate on CreationDate.

The TL;DR here is that:

We can seek to CreationDate because it’s the leading index key and not part of the case expression
We can’t seek to anything in the Posts table at all
The entire filtering process occurs at the Nested Loops join

If you’re paying careful attention to row counts, this filthy animal runs for 45 seconds to return a single row.

That’s a terrible row to seconds ratio.

Rewrites for this will be similar to others, so hopefully you’re not sick of those just yet.

Rewrite Your Eyes

Whenever I’m rewriting small, compact, sorta stubby looking queries, I like to think of myself as Willy Wonka stretching out Mike Teavee.

Queries like that remind me a bit of the character himself. Rewriting them into more sensible forms with better attitudes does take a bit of stretching out.

SELECT
    u.Id,
    u.DisplayName,
    s = SUM(p.Score)
FROM
(
    SELECT
        u.Id,
        u.AccountId,
        u.DisplayName
    FROM dbo.Users AS u
    WHERE u.Reputation >= 10 
    AND   u.CreationDate >= '20131231 23:00'
) AS u
CROSS APPLY
(
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId IN (1, 2)
    AND   p.OwnerUserId = u.Id
    AND   p.Score >= 1

    UNION ALL

    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId > 2
    AND   p.OwnerUserId = u.AccountId
    AND   p.Score >= 1
) AS p
GROUP BY
    u.Id,
    u.DisplayName
ORDER BY 
    s DESC;

And of course, when you do better, SQL Server does better. And when SQL Server does better, everyone’s happy.

Why don’t you want to be happy?

The more you do these types of things, the more intuitive they become.

A Shorter Note

At the beginning of the post, I also mentioned a different type of query, using OR to safeguard optional parameters from NULL values.

While I’m not going to delve into that here, because I’d like to do an entire post in this series on dynamic SQL, I’d like to add a single quick note: Don’t be afraid of OPTION(RECOMPILE).

It’s a huge problem solver for so many different types of queries having so many different kinds of problems:

Parameter sensitivity
Local variables
Optional parameters

If you’re ever unsure about query performance, and you’re not quite sure what’s wrong with it, you can do a whole lot worse than seeing if this at least gets you a different execution plan.

While it’s not perfect or practical as a long term solution for some situations, don’t dismiss it as a tool of exploration when you’re working on a problem query.

I’ve apparently tipped my hat a bit that dynamic SQL is a future subject for this series. After that, I’m not sure what else might come up. If there’s anything you’d like to see that you haven’t seen so far, leave a comment.

Thanks for reading!

Going Further

Why Logical Reads Are A Bad Metric For Query Tuning In SQL Server

Posted on April 24, 2024April 5, 2024 by Erik Darling

Why Logical Reads Are A Bad Metric For Query Tuning In SQL Server

To summarize the video a little bit:

High average or total logical reads isn’t a guarantee that a query is slow
Looking for high average CPU and duration queries is a better metric
You may see logical reads go up or down as you make queries faster
For I/O bound workloads, you’re better off looking for queries with a lot of physical reads

The more query tuning work I did, the more I came to realize that logical reads fall into the SQL Server 2008 mindset, like PLE, fragmentation, page splits, and other metrics that don’t necessarily indicate a performance problem.

Thanks for watching!

Going Further

What Else Happens When Queries Try To Compile In SQL Server: COMPILE LOCKS!

Posted on April 23, 2024April 4, 2024 by Erik Darling

What Else Happens When Queries Try To Compile In SQL Server: COMPILE LOCKS!

Thanks for watching!