I’m going to start this post off sort of like Friday’s post:
Look, I really like EXISTS and NOT EXISTS. I do. They solve a lot of problems.
This post isn’t a criticism of them at all, nor do I want you to stop using them. I would encourage you to use them more, probably.
If you keep your head about you, you’ll do just fine.
The difference here is specific to NOT EXISTS, though, and it has to do with join reordering.
Or rather, the lack of join reordering.
Let’s get after it.
When we write our query like so, things are fine.
The Users and Badges tables are relatively small, and a parallel hash join query makes short work of the situation.
SELECT c = COUNT_BIG(*) FROM dbo.Users AS u WHERE u.Reputation > 1000 AND ( NOT EXISTS ( SELECT 1/0 FROM dbo.Badges AS b WHERE b.UserId = u.Id ) AND NOT EXISTS ( SELECT 1/0 FROM dbo.Comments AS c WHERE c.UserId = u.Id ) );
This query finishes in a shart under a second.
Notice that since no rows pass the first join, the Comments table is left unscathed.
If we write the query like this, the optimizer leaves things alone, and we get a much worse-performing query.
SELECT c = COUNT_BIG(*) FROM dbo.Users AS u WHERE u.Reputation > 1000 AND ( NOT EXISTS ( SELECT 1/0 FROM dbo.Comments AS c WHERE c.UserId = u.Id ) AND NOT EXISTS ( SELECT 1/0 FROM dbo.Badges AS b WHERE b.UserId = u.Id ) );
This one clocks in around 6 seconds, and complains of an excessive memory grant.
The big time suck here is spent hitting the Comments table, which is significantly larger than the Badges table.
The order that you write joins and where clause elements in generally doesn’t matter much, but in the case of NOT EXISTS, it can make a huge difference.
I realize that there are only two NOT EXISTS clauses in these examples, and that hardly makes for a compelling “always” statement. But I did a lot of experimenting with more tables involved, and it really doesn’t seem like the optimizer does any reordering of anti-semi joins.
Thanks for reading!
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount on to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
- Getting Parameter Values From A SQL Server Query Plan For Performance Tuning
- Getting The Top Value Per Group With Multiple Conditions In SQL Server: Row Number vs. Cross Apply With MAX
- Getting The Top Value Per Group In SQL Server: Row Number vs. Cross Apply Performance
- Performance Issues With EXISTS Queries In SQL Server