Common Query Plan Patterns For Joins: Semi Join Problems

But You Should


This post isn’t meant to dissuade you from using EXISTS or NOT EXISTS when writing queries. In fact, most of the time I think they make a lot of sense.

But weird things can happen along the way, especially if you don’t have supporting indexes, or if supporting indexes aren’t chosen by the optimizer for various reasons.

In this post, I’m going to show you a query plan pattern that can occur in semi-join plans, and what you can do about it.

Row Goalies


This can happen without a TOP in the query, I’m only using it because it gets me the plan shape I care about quickly, and returns results “quickly” enough.

SELECT TOP (1)
    c.*
FROM dbo.Comments AS c
WHERE NOT EXISTS
(
    SELECT
        1/0
    FROM dbo.Votes AS v
    WHERE c.PostId = v.PostId
)
ORDER BY c.CreationDate DESC;

The part of the plan that we care about is on the inner side of the semi join, where the TOP is applied in a funny way.

just party

This is a very good example of where query costing can be quite off. The part of the plan that is responsible for most of the runtime is estimated to cost 0% of the total cost.

Closer


Here’s what’s going on in those two operators!

rock show

The optimizer thinks it can find a matching row in three executions, but it takes 70.

It also takes reading 2,630,801,081 rows to return 49 rows. There are 52,928,700 rows in the table.

Yes, if you multiply 52928700 * 49 = 2630801081. Technically it’s 49.7, but the plan shows rounded numbers. Close enough for the optimizer, so they say.

Index Not Exists


Without an index, a hash join plan would have been a much better option, which we can get by adding an OPTION(HASH JOIN) hint to the query.

freeloader

While ~12 seconds is great, we can do better. I’ve heard about these things called indexes.

How do we indexes?

An Index, You Say


It looks straight forward to create an index that should get us the plan we want, if we create it on the PostId column on the Votes table.

CREATE INDEX v ON dbo.Votes(PostId);

 

Assuming that the plan shape otherwise stays the same, we’ll scan the Comments table, Sort by PostId, and then do a Seek into the Votes table on the inner side of the join.

Right?

wrong

While it is quite a bit faster, it’s not at all what we want. And hey, look, the costs and times here are all screwy again.

Stop looking at costs.

Adding a FORCESEEK hint gets us the plan shape we care about, of course:

SELECT TOP (1)
    c.*
FROM dbo.Comments AS c
WHERE NOT EXISTS
(
    SELECT
        1/0
    FROM dbo.Votes AS v WITH(FORCESEEK)
    WHERE c.PostId = v.PostId
)
ORDER BY c.CreationDate DESC;

Along with a much faster runtime:

premium tears

Gosh Wrong


If you’re keeping score at home, the optimizer chose:

  • A TOP > Scan that took 30 seconds instead of a Hash Join that took 11 seconds
  • A Hash Join that took 5.2 seconds instead of a Top > Seek that took 2.2 seconds

But I bet I know what you’re thinking. We need an index on the Comments table. And I bet you’re thinking it should look something like this:

CREATE INDEX c ON dbo.Comments
    (PostId, CreationDate)
INCLUDE
    (Score, UserId, Text);

But, prepare you biggest internet sigh, that just won’t work out.

and tears

Much like above, if we add a FORCESEEK hint, we get a better/faster plan that looks a lot like the plan above.

Flip That Script


The only index that gets us a respectable execution plan without a bunch of hints is this one:

CREATE INDEX c ON dbo.Comments
    (CreationDate, PostId)
INCLUDE
    (Score, UserId, Text);

Which gets us this execution plan:

what i want

Thanks for reading!



3 thoughts on “Common Query Plan Patterns For Joins: Semi Join Problems

Leave a Reply

Your email address will not be published. Required fields are marked *