Common Query Plan Patterns For Joins: No Equality Predicate

Data Dribbling


The longer you work with data, the more weird stuff you see. I remember the first time I saw a join on a LIKE I immediately felt revolted.

But not every query has an available equality join predicate. For various reasons, like poor normalization choices, or just the requirements of the query, you may run into things like I’m about to show you.

If there’s a headline point to this post, it’s that joins that don’t have an equality predicate in them only have one join choice: Nested Loops. You cannot have a Hash or Merge join with this type of query. Even with the best possible indexes, some of these query patterns just never seem to pan out performance-wise.

There Are Pains


First up, some patterns are less awful than others, and can do just fine with a useful index. Sounds like most other queries, right?

SELECT
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.LastActivityDate BETWEEN u.CreationDate 
                              AND u.LastAccessDate
WHERE u.Id = 22656;

Say we want to count all the posts that were active during Jon Skeet’s active period. This is a simplified version of some interval-type queries.

Even without an index, this does okay because we’re only passing a single row through the Nested Loops Join.

SQL Server Query Plan
fussy

If our query needs are different, and we threaten more rows going across the Nested Loops Join, things deteriorate quickly.

SELECT
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.LastActivityDate BETWEEN u.CreationDate 
                              AND u.LastAccessDate
WHERE u.Id BETWEEN 22656 AND 22657;

The funny thing here is that there is still only one row actually going through. We’re just threatening additional rows.

SELECT
    u.*
FROM dbo.Users AS u
WHERE u.Id BETWEEN 22656 AND 22666;

How about those identity values, kids?

specify

Here’s what happens, though:

SQL Server Query Plan
suspect device

There Are Indexes


Indexing only LastActivityDate “fixes” the multi-row query, but…

CREATE INDEX p ON dbo.Posts(LastActivityDate);
SQL Server Query Plan
regressed

Parallelism being deemed unnecessary slows the access of the Posts table down, which is something we’ve seen before in other entries in this series. That costs us roughly 500 milliseconds in the first query, but saves us about 30 seconds in the second query. I’d probably take that trade.

And Then There Are Pains


Where things get more painful is with correlations like this.

SELECT
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON u.DisplayName LIKE '%' + p.LastEditorDisplayName + '%'
WHERE  p.PostTypeId = 1;

This query runs basically forever. I gave up on it after several minutes, so you’re only getting the estimated plan.

SQL Server Query Plan
ehhh

The query plan asks for an index on the Users table! Surely this will solve all our problems.

CREATE INDEX ud ON dbo.Users(DisplayName);

This query, likewise, doesn’t seem too keen on finishing. But we get a new index request along with it.

SQL Server Query Plans
helipad

This never finishes either. By “never” I mean “I’d like to write about more things so I’m not sitting around waiting for this.”

Sorry.

And Then There Are More Indexes


An index on PostTypeId that includes LastEditorDisplayName. I’m going to flex a little on SQL Server and create this index instead:

CREATE INDEX pl ON dbo.Posts(LastEditorDisplayName)
WHERE PostTypeId = 1

Unfortunately, all this does is change which index is used. It doesn’t improve the performance of the query at all.

But this series is about query plan patterns, so let’s talk about some of those.

Amoral


Like I mentioned at the start of the post, the optimizer only has a single join strategy for queries with no equality predicates on the join condition. This can lead to very bad performance in a number of cases.

In cases where a single row is being passed across, and the query plan shows a single seek or scan on the inner side of the Nested Loops Join, performance will likely be… acceptable.

When you see Spools or Constant Scans on the inner side of the Nested Loops Join, this could be cause for concern. For some details on that, have a look at these links:

After all, even the LIKE predicate is tolerable with appropriate indexes and a single row from the outer side.

SELECT
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON u.DisplayName LIKE '%' + p.LastEditorDisplayName + '%'
WHERE u.Id = 22656
AND   p.PostTypeId = 1;
SQL Server Query Plan
wild

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.