The longer you work with data, the more weird stuff you see. I remember the first time I saw a join on a
LIKE I immediately felt revolted.
But not every query has an available equality join predicate. For various reasons, like poor normalization choices, or just the requirements of the query, you may run into things like I’m about to show you.
If there’s a headline point to this post, it’s that joins that don’t have an equality predicate in them only have one join choice: Nested Loops. You cannot have a Hash or Merge join with this type of query. Even with the best possible indexes, some of these query patterns just never seem to pan out performance-wise.
There Are Pains
First up, some patterns are less awful than others, and can do just fine with a useful index. Sounds like most other queries, right?
SELECT COUNT_BIG(*) AS records FROM dbo.Users AS u JOIN dbo.Posts AS p ON p.LastActivityDate BETWEEN u.CreationDate AND u.LastAccessDate WHERE u.Id = 22656;
Say we want to count all the posts that were active during Jon Skeet’s active period. This is a simplified version of some interval-type queries.
Even without an index, this does okay because we’re only passing a single row through the Nested Loops Join.
If our query needs are different, and we threaten more rows going across the Nested Loops Join, things deteriorate quickly.
SELECT COUNT_BIG(*) AS records FROM dbo.Users AS u JOIN dbo.Posts AS p ON p.LastActivityDate BETWEEN u.CreationDate AND u.LastAccessDate WHERE u.Id BETWEEN 22656 AND 22657;
The funny thing here is that there is still only one row actually going through. We’re just threatening additional rows.
SELECT u.* FROM dbo.Users AS u WHERE u.Id BETWEEN 22656 AND 22666;
How about those identity values, kids?
Here’s what happens, though:
There Are Indexes
Indexing only LastActivityDate “fixes” the multi-row query, but…
CREATE INDEX p ON dbo.Posts(LastActivityDate);
Parallelism being deemed unnecessary slows the access of the Posts table down, which is something we’ve seen before in other entries in this series. That costs us roughly 500 milliseconds in the first query, but saves us about 30 seconds in the second query. I’d probably take that trade.
And Then There Are Pains
Where things get more painful is with correlations like this.
SELECT COUNT_BIG(*) AS records FROM dbo.Users AS u JOIN dbo.Posts AS p ON u.DisplayName LIKE '%' + p.LastEditorDisplayName + '%' WHERE p.PostTypeId = 1;
This query runs basically forever. I gave up on it after several minutes, so you’re only getting the estimated plan.
The query plan asks for an index on the Users table! Surely this will solve all our problems.
CREATE INDEX ud ON dbo.Users(DisplayName);
This query, likewise, doesn’t seem too keen on finishing. But we get a new index request along with it.
This never finishes either. By “never” I mean “I’d like to write about more things so I’m not sitting around waiting for this.”
And Then There Are More Indexes
An index on PostTypeId that includes LastEditorDisplayName. I’m going to flex a little on SQL Server and create this index instead:
CREATE INDEX pl ON dbo.Posts(LastEditorDisplayName) WHERE PostTypeId = 1
Unfortunately, all this does is change which index is used. It doesn’t improve the performance of the query at all.
But this series is about query plan patterns, so let’s talk about some of those.
Like I mentioned at the start of the post, the optimizer only has a single join strategy for queries with no equality predicates on the join condition. This can lead to very bad performance in a number of cases.
In cases where a single row is being passed across, and the query plan shows a single seek or scan on the inner side of the Nested Loops Join, performance will likely be… acceptable.
When you see Spools or Constant Scans on the inner side of the Nested Loops Join, this could be cause for concern. For some details on that, have a look at these links:
- Query Trace Column Values in SQL Server
- Dynamic Seeks and Hidden Implicit Conversions
- Can you explain this execution plan?
After all, even the LIKE predicate is tolerable with appropriate indexes and a single row from the outer side.
SELECT COUNT_BIG(*) AS records FROM dbo.Users AS u JOIN dbo.Posts AS p ON u.DisplayName LIKE '%' + p.LastEditorDisplayName + '%' WHERE u.Id = 22656 AND p.PostTypeId = 1;
Thanks for reading!