Software Vendor Mistakes With SQL Server: Joining On OR Conditions

Choose One Once


Everyone has that person in their life. I’m not gonna call them a “friend”, but they might be. Whoever they are, you know them by their indecisiveness. Because when you interact with them, you know you’re gonna have a whole bunch of decisions to make before you come up with a plan.

Those decisions are usually proposed to you in a series of “or we could” statements that you barely have time to assess before the next one hits. It makes life difficult.

Don’t be that person to SQL Server’s optimizer. I mean, don’t be that person in general, but especially don’t be that person to an expensive piece of software that makes business critical transactions and decisions possible for you.

Be Kind, Rewind


Remember a couple things that have been mentioned at various points in this series:

  • Anything that makes your job easier makes the optimizer’s job harder
  • Query data the way it’s stored, and store data the way you query it

What’s a real shame is that most of the things that fall into the first category involve you typing fewer characters once. Isn’t that a funny thing? Normally you type a bunch of code once, and after that it’s a matter of executing it.

Sometimes typing more once really pays off.

Let’s look at an example of that!

Orrible


In this query, we’re just going to use one OR predicate in a join, and as the saying goes, “this is Jackass”.

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
JOIN dbo.Comments AS C
    ON  C.UserId = U.Id
    OR  C.UserId = U.AccountId;

Since I’d like to keep your attention here, I’m just gonna show you the end result query plan, and how long it takes.

You’re not misreading that it takes around 15 seconds.

SQL Server Query Plan
beretta

Most of that time is spent in ridiculous Nested Loops Joins. And the really nice part is that you can’t hint hash or merge joins; you get a query processor error.

Union Dues


Let’s compare that to a query where we keep things simple(r). Sure, we type more, and part of our brain had to wake up and do some thinking.

But think of the optimizer. If we save just one optimizer, people, it’s worth it.

SELECT
    c = SUM(x.c)
FROM 
(
    SELECT
       c = 1
    FROM dbo.Users AS U
    JOIN dbo.Comments AS C
        ON  C.UserId = U.Id

    UNION ALL 

    SELECT
       c = 1
    FROM dbo.Users AS U
    JOIN dbo.Comments AS C
        ON  C.UserId =  U.AccountId
    	AND C.UserId <> U.Id
) AS x;

To shortcut to the important part of the plan, this version runs in a little under two seconds.

SQL Server Query Plan
little under

More typing. Fewer problems. Good and good. You might be okay with accepting this rewrite and explanation, and be off to test how a similar rewrite might improve your queries. If so, great! Get to it, Dear Reader.

If you want to dig further into why, follow me along to the next section.

Why OR Conditions Turn Our Poorly


Here’s the “full” query plan for the OR condition join query. There are a couple operators at the far left side that I’ve omitted because they don’t help with the explanation.

SQL Server Query Plan
by the numbers
  1. We scan the Users table. There nothing we can Seek to, so that’s fine, but note the number of r0ws (2,465,713) that come out
  2. All rows from the Users table are emitted via a constant scan operator for the Id column
  3. All rows from the Users table are emitted via a constant scan operator for the AccountId column

I don’t think the optimizer makes better guesses for constant scans, so even though all rows in the table are escorted forth, none of the estimates reflect anything close to that.

The full set of these rows combined (4,931,426) get sorted together rather than as two separate sets, and then a meager attempt to reduce overlapping values is applied at the merge interval. That set is only reduced to 4,931,419; a whopping 7 rows removed.

All of those rows are fed into an Apply Nested Loops join that hits the Comments table and searches the UserId column for values that fall between lower and upper bounds.

SQL Server Query Plan
surely

Even with a good index to seek to these values in, the sheer number of executions gets us stuck pretty badly. In all, ~14 of the ~15 seconds of execution time is spent in this exercise in futility. This query pattern becomes even worse with less useful indexes on the join columns.

At the moment, SQL Server’s query optimizer isn’t able to unwind OR predicates like this to come up with a better execution plan.

A Stronger Union


In the UNION ALL query, you’ll see far less futility, and a more reasonable set of operators presented for the number of rows being processed.

SQL Server Query Plan
trees

There are of course circumstances where this might be less efficient, like if one or more tables isn’t able to fit in the buffer pool and disk I/O sucks (you’re in the cloud, for example). But in general, splitting queries into factions that express precise logic is a good thing.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.



4 thoughts on “Software Vendor Mistakes With SQL Server: Joining On OR Conditions

Comments are closed.