Why MAX Data Types Are Usually A Bad Choice For SQL Server Columns

Easy Rider


When you’re trying to figure out how to store string data, it often seems easiest to just choose an extra long — even MAX — data type to avoid future truncation errors.

Even if you’re storing strings with a known, absolute length, developers may choose not to enforce it in the application, whether via a drop-down menu or some other form of validation.

And so to avoid errors when users try to put their oh-so-important data in their oh-so-expensive database, we get columns added to tables that can fit a galaxy of data in them, when we only need to store an ashtray’s worth of data.

While getting data into those columns is relatively easy — most application inserts are single rows — getting data out of those columns can be quite painful, whether it’s searching or just presenting in the select portion of a query.

Let’s look at a couple simple examples of how that happens.

Search Engine


Let’s take a query like this one:

SELECT TOP (20)
    p.Id,
    p.Title,
    p.Body
FROM dbo.Posts AS p
WHERE p.Body LIKE N'SQL Server%';

The Body column in the Posts table is nvarchar(max), but the same thing would happen with varchar(max).

If you need a simple way to remember how to pronounce those data types, just remember to Pahk yah (n)vahcah in Hahvahd Yahd.

Moving on – while much has been written about leading wildcard searches (that start with a % sign), we don’t do that here. Also, in general, using charindex or patindex instead of leading wildcard like searching won’t buy you all that much (if anything at all).

Anyway, since you can’t put a MAX datatype in the key of an index, part of the problem with them is that there’s no way to efficiently organize the data for searching. Included columns don’t do that, and so we end up with a query plan that looks some-such like this:

SQL Server Query Plan
ouch in effect

We spend ~13.5 seconds scanning the clustered index on the Posts table, then about two minutes and twenty seven seconds (minus the original 13.5) applying the predicate looking for posts that start with SQL Server.

That’s a pretty long time to track down and return 19 rows.
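By the way, if you want to see the index key restriction mentioned above for yourself, here’s a rough sketch. The index names are made up, and the error text is paraphrased from memory:

/*This fails outright, with an error along the lines of
  "Column 'Body' ... is of a type that is invalid for use as a key column in an index"*/
CREATE INDEX body_key
    ON dbo.Posts (Body);

/*The closest you can get is stuffing it into the included columns,
  which stores the data but doesn't order it for searching*/
CREATE INDEX body_include
    ON dbo.Posts (Title)
    INCLUDE (Body);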

Let’s change the query a little bit and look at how else big string columns can cause problems.

Memory Bank


Rather than search on the Body column, let’s select some values from it ordered by the Score column.

Since Score isn’t indexed, it’s not sorted in the database. That means SQL Server needs to ask for memory to put the data we’re selecting in the order we’re asking for.

SELECT TOP (200)
    p.Body
FROM dbo.Posts AS p
ORDER BY p.Score DESC;

The plan for this query asks for a 5GB memory grant:

SQL Server Query Plan
quietly

I know what you’re thinking: the Body column probably has some pretty big data in it, and you’re right. In this case, it’s the right data type to use.

The bad news is that SQL Server makes the same memory grant estimate based on the size of the data we need to sort, whether or not the data type was a good choice.

I talk more about that in this Q&A on Stack Exchange.
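If you want to see the declared size driving the grant, here’s a quick sketch. The CONVERT would truncate real Body values, so it’s purely a demo of the estimate, not a fix:

/*Demo only: shrinking the declared type at query time shrinks the sort's
  memory grant, because the grant is sized from the data type, not from the rows*/
SELECT TOP (200)
    Body_small = CONVERT(nvarchar(100), p.Body)
FROM dbo.Posts AS p
ORDER BY p.Score DESC;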

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server 2022: Cardinality Estimation Feedback

Quiet As Kept


I’ve been trying to take the general temperature when it comes to SQL Server 2022. At least from a performance perspective, some interesting things have been introduced so far.

There have been a few neat things:

  • Parameter Sensitive Plan optimizations
  • Query Store Hints
  • Memory Grant Feedback improvements
  • DOP Feedback
  • Cardinality Estimation Feedback

I’m not seeing a whole lot out there. I’m not sure why. I follow quite a few SQL bloggers via Feedly.

Perhaps it’s just too new. Maybe everyone is waiting for CTP SP1.

Well, anyway. In this post I want to talk a little bit about what Cardinality Estimation Feedback can do, and what it can’t do.

What It Do


First, you need Query Store enabled to get this to work. It relies on the Query Store Plan hints also introduced for SQL Server 2022.
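As a quick sketch, the prerequisites on a test database look something like this. CE Feedback also has its own database scoped configuration, which I believe defaults to on in SQL Server 2022:

ALTER DATABASE StackOverflow2010 SET QUERY_STORE = ON;
ALTER DATABASE StackOverflow2010 SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO

/*Applies to the current database; on by default in 2022,
  shown here just so you know where the knob lives*/
ALTER DATABASE SCOPED CONFIGURATION SET CE_FEEDBACK = ON;
GO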

For queries that execute frequently and retain cached plans, the optimizer will look at some of the assumptions that get made under different Cardinality Estimation models.

Things like:

  • Row Goals
  • Predicate independence/correlation
  • Join containment being simple or base

What each of those things means isn’t terribly important to the post, but all of them are things that are influenced by using the legacy or default cardinality estimators.

As I understand it, this is a bit like Memory Grant Feedback. If estimation issues are detected, a different plan will be attempted. If that plan corrects a performance issue, then the hint will get persisted in Query Store.

Pretty cool, but…

What It Don’t Do


It doesn’t fix things while they’re running, like Adaptive Joins can do. That’s sort of unfortunate! Hear me out on why.

Often, when the model’s assumptions are incorrect, queries run for a long time. Particularly when row goals are introduced, query plans are quite sensitive to those goals not being met quickly.

It’d be really unfortunate for you to sit around waiting for 15 or 16 executions of a poorly performing query to finish before an intervention happens.

I would have either:

  • Reduced this threshold, or made it configurable
  • Been more aggressive about introducing Adaptive Joins when CE models influence plan choices

After all, Adaptive Joins help queries at runtime rather than waiting for an arbitrary number of executions and then stepping in.

Perhaps there was a good reason for not doing this, but those were the first two things to cross my mind when looking into the feature.

How It Do


I was able to get the feature to kick in using a familiar query.

Here’s the setup script:

DBCC FREEPROCCACHE;
ALTER DATABASE 
    StackOverflow2010 
SET 
    QUERY_STORE CLEAR;
GO

    CREATE INDEX whatever 
        ON dbo.Votes(CreationDate, VoteTypeId, PostId);
    
    CREATE INDEX apathy
        ON dbo.Posts (PostTypeId)
            INCLUDE (OwnerUserId, Score, Title);
GO

    SELECT TOP (2500) 
        p.OwnerUserId, 
        p.Score, 
        p.Title, 
        v.CreationDate,
        ISNULL(v.BountyAmount, 0) AS BountyAmount
    FROM dbo.Posts AS p
    JOIN dbo.Votes AS v
        ON  p.Id = v.PostId
    WHERE v.VoteTypeId = 1
    AND   p.PostTypeId = 1
    ORDER BY v.CreationDate DESC;
    GO 17

SELECT qspf.* FROM sys.query_store_plan_feedback AS qspf;

SELECT qsqh.* FROM sys.query_store_query_hints AS qsqh;

For the first 16 runs, we get the same query plan that takes about 2 seconds.

SQL Server Query Plan
if you got a problem

Then, magically, on run #17, we get a different query plan!

SQL Server Query Plan
yo i’ll solve it

Pretty cool! The plan totally changed, and clearly got better. I am happy about this. Not so happy that it would have taken 16 executions of a Potentially Painful© query to get here, but you know.

Here we are.

In Query Store


There are a couple views that will detail where hints came from and which were applied:

SQL Server Query Results
clowny clown clown

Since I just cleared out query store prior to this running, we can infer some things:

  • CE Feedback kicked in and gave us a new plan with a hint to disable row goals
  • The second plan generated was identified by the engine as needing memory grant feedback

I suppose this is a good reason to do some work on sp_QuickieStore so y’all can see this stuff in action.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server Performance Mysteries: Why Does Production Get A Bad Plan That Development Doesn’t?

Statistical Legacy


A client question that I get quite a bit is why queries in production get a bad query plan that queries in dev, QA, or staging don’t. The answer is typically found by looking at statistics.

Primarily, it’s because of the cardinality estimates that queries get around ascending keys. It usually gets called the ascending key problem, but the gist is that:

  • You have a pretty big table
  • You’re using the legacy cardinality estimator
  • A whole bunch of rows get inserted, but not enough to trigger an auto stats update
  • You’re not using compatibility level >= 130 or trace flag 2371
  • Queries that look for values not yet present in the histogram get a one-row estimate using the legacy Cardinality Estimator, or a 30% estimate using the default Cardinality Estimator

Which is a recipe for potentially bad query plans.

Reproductive Script


Here’s the full repro script. If you’re using a different Stack Overflow database, you’ll need to adjust the numbers.

USE StackOverflow2013;

/*Figure out the 20% mark for stats updates using legacy compat levels*/
SELECT
    c = COUNT_BIG(*),
    c20 = CEILING(COUNT_BIG(*) * .20)
FROM dbo.Users AS u;

/*Stick that number of rows into a new table*/
SELECT TOP (493143)
    u.*
INTO dbo.Users_Holder
FROM dbo.Users AS u
ORDER BY u.Id DESC;


/*Delete that number of rows from Users*/
WITH 
    del AS
(
SELECT TOP (493143)
    u.*
FROM dbo.Users AS u
ORDER BY u.Id DESC
)
DELETE
FROM del;

/*I'm using this as a shortcut to turn off auto stats updates*/
UPDATE STATISTICS dbo.Users WITH NORECOMPUTE;

/*Put the rows back into the Users Table*/
SET IDENTITY_INSERT dbo.Users ON;

INSERT
    dbo.Users
(
    Id,
    AboutMe,
    Age,
    CreationDate,
    DisplayName,
    DownVotes,
    EmailHash,
    LastAccessDate,
    Location,
    Reputation,
    UpVotes,
    Views,
    WebsiteUrl,
    AccountId
)
SELECT
    uh.Id,
    uh.AboutMe,
    uh.Age,
    uh.CreationDate,
    uh.DisplayName,
    uh.DownVotes,
    uh.EmailHash,
    uh.LastAccessDate,
    uh.Location,
    uh.Reputation,
    uh.UpVotes,
    uh.Views,
    uh.WebsiteUrl,
    uh.AccountId
FROM dbo.Users_Holder AS uh;

SET IDENTITY_INSERT dbo.Users OFF;

/*Figure out the minimum Id we put into the holder table*/
SELECT
    m = MIN(uh.Id)
FROM dbo.Users_Holder AS uh;

/*Compare estimates*/
SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Id > 2623772
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));


SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Id > 2623772
OPTION(USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));

/*Cleanup*/
UPDATE STATISTICS dbo.Users;

TRUNCATE TABLE dbo.Users_Holder;

Query Plans


Here are the plans for the stars of our show:

SQL Server Query Plan
king push

In these query plans, you can see the legacy cardinality estimator gets a one row estimate, and the default cardinality estimator gets a 30% estimate.

There isn’t necessarily a one-is-better-than-the-other answer here, either. There are times when both can cause poor plan choices.

You can think of this scenario as being fairly similar to parameter sniffing, where one plan choice does not fit all executions well.

Checkout


There are a lot of ways that you can go about addressing this.

In some cases, you might be better off using trace flag 2371 to trigger more frequent auto stats updates on larger tables where the ~20% modification counter doesn’t get hit quickly enough. In others, you may want to force one estimator over the other, depending on which gets you a better plan for most cases.

Another option is to add hints to the query in question to use the default cardinality estimator (FORCE_DEFAULT_CARDINALITY_ESTIMATION), or to generate quick stats for the index/statistics being used (ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS). Documentation for both of those hints is available here. Along these lines, trace flags 2389, 2390, or 4139 may be useful as well.
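As a sketch, applying those hints to the count query from the repro script looks like this; which one (if either) helps depends on the plan shape you’re fighting:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Id > 2623772
OPTION(USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Id > 2623772
OPTION(USE HINT('ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS'));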

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Local Variables vs Forced Parameterization In SQL Server Queries

Questions, Arise!


I think it was sometime in the last century that I mentioned I often recommend folks turn on Forced Parameterization in order to deal with poorly formed application queries that send literal rather than parameterized values to SQL Server.

And then, just like magick, I recommended it to someone who also has a lot of problems with Local Variables in their stored procedures.

They were curious whether Forced Parameterization would fix that, and the answer is no.

But here’s proofs. We love the proofs.

Especially when they’re over 40.

A Poorly Written Stored Procedure


Here’s this thing. Don’t do this thing. Even the index is pretty dumb, because it’s on a single column.

CREATE INDEX
    i
ON dbo.Users
    (Reputation)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
GO 

CREATE PROCEDURE
    dbo.humpback
(
    @Reputation int
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

    /*i mean don't really do this c'mon*/
    DECLARE 
        @ReputationCopy int = ISNULL(@Reputation, 0);
    
    SELECT
        u.DisplayName,
        u.Reputation,
        u.CreationDate,
        u.LastAccessDate
    FROM dbo.Users AS u
    WHERE u.Reputation = @ReputationCopy;

END;

If we look at the statement parameterization type, available with Query Store enabled, it returns 0. That’s true whether we use simple or forced parameterization:

ALTER DATABASE StackOverflow2013 SET PARAMETERIZATION FORCED;
GO 
EXEC dbo.humpback 
    @Reputation = 11;
GO 

ALTER DATABASE StackOverflow2013 SET PARAMETERIZATION SIMPLE;
GO 

EXEC dbo.humpback 
    @Reputation = 11;
GO 
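If you want to check the statement parameterization type yourself, it lives on sys.query_store_query. Here’s a rough sketch of the lookup; the LIKE filter is just one lazy way to find the statement:

SELECT
    qsqt.query_sql_text,
    qsq.query_parameterization_type,
    qsq.query_parameterization_type_desc
FROM sys.query_store_query AS qsq
JOIN sys.query_store_query_text AS qsqt
    ON qsqt.query_text_id = qsq.query_text_id
WHERE qsqt.query_sql_text LIKE N'%@ReputationCopy%';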

For now, you’ll have to do a little more work to fix local variable problems.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

One Thing The “New” Cardinality Estimator Does Better In SQL Server

Or “Default”, If That’s Your Kink


Look, I’m not saying there’s only one thing that the “Default” cardinality estimator does better than the “Legacy” cardinality estimator. All I’m saying is that this is one thing that I think it does better.

What’s that one thing? Ascending keys. In particular, when queries search for values that haven’t quite made it to the histogram yet because a stats update hasn’t occurred since they landed in the mix.

I know what you’re thinking, too! On older versions of SQL Server, I’ve got trace flag 2371, and on 2016+ that became the default behavior.

Sure it did — only if you’re using compat level 130 or better — which a lot of people aren’t because of all the other strings attached.

And that’s before you go and get 2389 and 2390 involved, too. Unless you’re on compatibility level 120 or higher! Then you need 4139.

Arduous


Anyway, look, it’s all documented.

2371 Changes the fixed update statistics threshold to a linear update statistics threshold. For more information, see this AUTO_UPDATE_STATISTICS Option.

Note: Starting with SQL Server 2016 (13.x) and under the database compatibility level 130 or above, this behavior is controlled by the engine and trace flag 2371 has no effect.

Scope: global only

2389 Enable automatically generated quick statistics for ascending keys (histogram amendment). If trace flag 2389 is set, and a leading statistics column is marked as ascending, then the histogram used to estimate cardinality will be adjusted at query compile time.

Note: Please ensure that you thoroughly test this option, before rolling it into a production environment.

Note: This trace flag does not apply to CE version 120 or above. Use trace flag 4139 instead.

Scope: global or session or query (QUERYTRACEON)

2390 Enable automatically generated quick statistics for ascending or unknown keys (histogram amendment). If trace flag 2390 is set, and a leading statistics column is marked as ascending or unknown, then the histogram used to estimate cardinality will be adjusted at query compile time. For more information, see this Microsoft Support article.

Note: Please ensure that you thoroughly test this option, before rolling it into a production environment.

Note: This trace flag does not apply to CE version 120 or above. Use trace flag 4139 instead.

Scope: global or session or query (QUERYTRACEON)

4139 Enable automatically generated quick statistics (histogram amendment) regardless of key column status. If trace flag 4139 is set, regardless of the leading statistics column status (ascending, descending, or stationary), the histogram used to estimate cardinality will be adjusted at query compile time. For more information, see this Microsoft Support article.

Starting with SQL Server 2016 (13.x) SP1, to accomplish this at the query level, add the USE HINT ‘ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS’ query hint instead of using this trace flag.

Note: Please ensure that you thoroughly test this option, before rolling it into a production environment.

Note: This trace flag does not apply to CE version 70. Use trace flags 2389 and 2390 instead.

Scope: global or session or query (QUERYTRACEON)

I uh. I guess. ?

Why Not Just Get Cardinality Estimation Right The First Time?


Great question! Hopefully someone knows the answer. In the meantime, let’s look at what I think this new-fangled cardinality estimator does better.

The first thing we need is an index with literally any sort of statistics.

CREATE INDEX v ON dbo.Votes_Beater(PostId);

Next is a query to help us figure out how many rows we can modify before an auto stats update will kick in, specifically for this index, though it’s left as an exercise to the reader to determine which one they’ve got in effect.

There are a lot of possible places this can kick in. Trace Flags, database settings, query hints, and more.

SELECT TOP (1)
    OBJECT_NAME(s.object_id) AS table_name,
    s.name AS stats_name,
    p.modification_counter,
    p.rows,
    CONVERT(bigint, SQRT(1000 * p.rows)) AS [new_auto_stats_threshold],
    ((p.rows * 20) / 100) + CASE WHEN p.rows > 499 THEN 500 ELSE 0 END AS [old_auto_stats_threshold]
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS p
WHERE s.name = 'v'
ORDER BY p.modification_counter DESC;

Edge cases aside, those calculations should get you Mostly Accurate™ numbers.

We’re going to need those for what we do next.

Mods Mods Mods


This script will allow us to delete and re-insert a bunch of rows back into a table, without messing up identity values.

--Create a temp table to hold rows we're deleting
DROP TABLE IF EXISTS #Votes;
CREATE TABLE #Votes (Id int, PostId int, UserId int, BountyAmount int, VoteTypeId int, CreationDate datetime);

--Get the current high PostId, for sanity checking
SELECT MAX(vb.PostId) AS BeforeDeleteTopPostId FROM dbo.Votes_Beater AS vb;

--Delete only as many rows as we can to not trigger auto-stats
WITH v AS 
(
    SELECT TOP (229562 - 1) vb.*
    FROM dbo.Votes_Beater AS vb
    ORDER BY vb.PostId DESC
)
DELETE v
--Output deleted rows into a temp table
OUTPUT Deleted.Id, Deleted.PostId, Deleted.UserId, 
       Deleted.BountyAmount, Deleted.VoteTypeId, Deleted.CreationDate
INTO #Votes;

--Get the current max PostId, for safe keeping
SELECT MAX(vb.PostId) AS AfterDeleteTopPostId FROM dbo.Votes_Beater AS vb;

--Update stats here, so we don't trigger auto stats when we re-insert
UPDATE STATISTICS dbo.Votes_Beater;

--Put all the deleted rows back into the table
SET IDENTITY_INSERT dbo.Votes_Beater ON;

INSERT dbo.Votes_Beater WITH(TABLOCK)
        (Id, PostId, UserId, BountyAmount, VoteTypeId, CreationDate)
SELECT v.Id, v.PostId, v.UserId, v.BountyAmount, v.VoteTypeId, v.CreationDate
FROM #Votes AS v;

SET IDENTITY_INSERT dbo.Votes_Beater OFF;

--Make sure this matches with the one before the delete
SELECT MAX(vb.PostId) AS AfterInsertTopPostId FROM dbo.Votes_Beater AS vb;

What we’re left with is a statistics object that’ll be just shy of auto-updating:

WE DID IT

Query Time


Let’s look at how the optimizer treats queries that touch those re-inserted values! That’ll be fun, eh?

--Inequality, default CE
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId > 20671101
OPTION(RECOMPILE);

--Inequality, legacy CE
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId > 20671101
OPTION(RECOMPILE, USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

--Equality, default CE
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId = 20671101
OPTION(RECOMPILE);

--Equality, legacy CE
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId = 20671101
OPTION(RECOMPILE, USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

For the record, > and >= produced the same guesses. Less than wouldn’t make sense here, since it’d mostly hit values that are already in the histogram.

SQL Server Query Plan
hoodsy

Inside Intel


For the legacy CE, there’s not much of an estimate. You get a stock guess of 1 row, no matter what.

For the default CE, there’s a little more to it.

inequality
SELECT (0.00130115 * 5.29287e+07) AS inequality_computation;

 

equality
SELECT (1.06162e-06 * 5.29287e+07) AS equality_computation;

And of course, the CARD for both is the number of rows in the table:

SELECT CONVERT(bigint, 5.29287e+07) AS table_rows;

I’m not sure why the scientific notation is preferred here.

A Little Strange


Adding in the USE HINT mentioned earlier in the post (USE HINT ('ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS')) only seems to help with estimation for the inequality predicate. The guess for the equality predicate remains the same.

SQL Server Query Plan
well okay

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: What’s The Difference Between Parameters And Local Variables In SQL Server?

Parameter Positive


There are many good reasons to parameterize a query, but there are also trade-offs. There’s no such thing as a free parameter, as they say.

In this post, we’re going to discuss what is and isn’t a parameter, and some of the pros and cons.

What’s important to keep in mind is that good indexing can help avoid many of the cons, but not all. Bad indexing, of course, causes endless problems.

There are many good reasons to parameterize your queries, too. Avoiding SQL injection is a very good reason.

But then!

What’s Not A Parameter


It can be confusing to people who are getting started with SQL Server, because parameters and variables look exactly the same.

They both start with @, and feel pretty interchangeable. They behave the same in many ways, too, except when it comes to cardinality estimation.

To generalize a bit, though, something is a parameter if it belongs to an object. An object can be an instance of:

  • A stored procedure
  • A function
  • Dynamic SQL

Things that aren’t parameters are things that come into existence when you DECLARE them. Of course, you can pass things you declare to one of the objects above as parameters. For example, there’s a very big difference between these two blocks of code:

DECLARE @VoteTypeId INT = 7;

SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
'
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;

But it’s not obvious until you look at the query plans, where the guess for the declared variable is god awful.

Then again, if you read the post I linked to up there, you already knew that. Nice how that works.

If you’re too lazy to click, I’m too lazy to repeat myself.

SQL Server Query Plan
thanks

What’s the point? Variables, things you declare, are treated differently from parameters, things that belong to a stored procedure, function, or dynamic SQL.

Parameter Problems


The problem with parameterization is one of familiarity. It not only breeds contempt, but… sometimes data just grows apart.

Really far apart.

SELECT 
    v.VoteTypeId, 
    COUNT_BIG(*) AS records
FROM dbo.Votes AS v
GROUP BY v.VoteTypeId
ORDER BY records;
SQL Server Query Results
pattern forming

Natural Selection


When you parameterize queries, you give SQL Server permission to remember, and more importantly, to re-use.

What it re-uses is the execution plan, and what it remembers are cardinality estimates. If we do something like this, we don’t get two different execution plans, or even two different sets of guesses, even though the values that we’re feeding to each query have quite different distributions in our data.
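The code for that isn’t shown above, but the shape of it is just the parameterized dynamic SQL from earlier, run with two different values back to back. The specific values below are my stand-ins for a very common one and a rare one:

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';

/*Whichever value runs first hands its plan and its guesses to the second one*/
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId = 2;
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId = 4;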

The result is two query plans that look quite alike, but behave quite differently.

SQL Server Query Plan
wist

One takes 23 milliseconds. The other takes 1.5 seconds. Would anyone complain about this in real life?

Probably not, but it helps to illustrate the issue.

Leading Miss


Where this can get confusing is when you’re trying to diagnose a performance problem.

If you look in the plan cache, or in query store, you’ll see the plan that gets cached for the very first parameter. It’ll look simple and innocent, sure. But the problem is with a totally different parameter that isn’t logged anywhere.
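For what it’s worth, the value a plan was compiled for is stored in the plan XML as ParameterCompiledValue. Here’s a rough sketch of digging it out of the plan cache; the filters are just for this demo’s query text:

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    query_text = st.text,
    parameter_name = qp.query_plan.value('(//ParameterList/ColumnReference/@Column)[1]', 'nvarchar(128)'),
    compiled_value = qp.query_plan.value('(//ParameterList/ColumnReference/@ParameterCompiledValue)[1]', 'nvarchar(200)')
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE N'%VoteTypeId = @VoteTypeId%'
AND   st.text NOT LIKE N'%dm_exec_cached_plans%';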

You might also face a different problem, where the query recompiles because you restarted the server, updated stats, rebuilt indexes, or enough rows in the table changed to trigger an automatic stats update. If any of those things happen, the optimizer will wanna come up with a new plan based on whatever value goes in first.

If the roles get reversed, the plan will change, but they’ll both take the same amount of time now.

DECLARE @VoteTypeId INT;

SET @VoteTypeId = 16

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;

SET @VoteTypeId = 7;

SET @sql = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;
SQL Server Query Plan
totin’

Deal With It ?


In the next few posts, we’ll talk about what happens when you don’t parameterize queries, and different ways to deal with parameter sniffing.

  • A recompile hint can help, but it might not always be appropriate depending on execution frequency and plan complexity
  • Optimize for unknown hints will give you the bad variable guess we saw at the very beginning of this post

We’re going to need more clever and current ways to fix the issue. If you’re stuck on those things recompiling or unknown-ing, you’re stuck not only on bad ideas, but outdated bad ideas.

Like duck l’orange and Canadian whiskey.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

How Are Table Variables Different In SQL Server 2019?

WallaWallaBingBang


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

How Bad Cardinality Estimates Lead To Bad Query Plan Choices

Let’s Run A Cruddy Query


We’ve got no supporting indexes right now. That’s fine.

The optimizer is used to not having helpful indexes. It can figure things out.

    SELECT   p.*
    FROM     dbo.Posts AS p
    JOIN     dbo.Votes AS v
        ON p.Id = v.PostId
    WHERE    p.PostTypeId = 2
    AND      p.CreationDate >= '20131225'
    ORDER BY p.Id;
SQL Server Query Plan
Snap City.

So uh. We got a merge join here. For some reason. And a query that runs for 27 seconds.

The optimizer was all “no, don’t worry, we’re good to sort 52 million rows. We got this.”

SQL Server Query Plan
YOU’VE GOT THIS.

[You don’t got this — ED]

Choices, Choices


Since we have an order by on the Id column of the Posts table, and that column is the Primary Key and Clustered index, it’s already in order.

The optimizer chose to order the PostId column from the Votes table, and preserve the index order of the Id column.

Merge Joins expect ordered input on both sides, don’tcha know?

It could have chosen a Hash Join, but then the order of the Id column from the Posts table wouldn’t have been preserved on the other side.

Merge Joins are order preserving, Hash Joins aren’t. If we use a Hash Join, we’re looking at ordering the results of the join after it’s done.

But why?

SQL Server Query Plan
Good Guess, Bad Guess

Going into the Merge Join, we have a Good Guess™

Coming out of the Merge Join, we have a Bad Guess™

Thinking back to the Sort operator, it only has to order the PostId column from the Votes table.

That matters.

Hash It Up


To compare, we need to see what happens with a Hash Join.
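The post doesn’t show how the Hash Join plan was produced; one way to coax it out purely for comparison is a query hint like the one below (hints like this change more than just the physical join, so treat it as a demo):

    SELECT   p.*
    FROM     dbo.Posts AS p
    JOIN     dbo.Votes AS v
        ON p.Id = v.PostId
    WHERE    p.PostTypeId = 2
    AND      p.CreationDate >= '20131225'
    ORDER BY p.Id
    OPTION (HASH JOIN);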

SQL Server Query Plan
smh

Okay, ignore the fact that this one runs for 2.6 seconds, and the other one ran for 27 seconds.

Just, like, put that aside.

Here’s why:

SQL Server Query Plan
Neither Nor

This Sort operator is different. We need to sort all of the columns in the Posts table by the Id column.

Remember that the Id column is now out of order after the Hash Join.

Needing to sort all those columns, including a bunch of string columns, along with an NVARCHAR(MAX) column — Body — inflates the ever-weeping-Jesus out of the memory grant.

SQL Server Query Plan
I see.

The Hash Join plan is not only judged to be more than twice as expensive, but it also asks for a memory grant that’s ~3x the size of the Merge Join plan.

Finish Strong


Let’s tally up where we’re at.

Both queries have identical estimated rows.

The optimizer chooses the Merge Join plan because it’s cheaper.

  • The Merge Join plan runs for 27 seconds, asks for 3.3GB of RAM, and spills to disk.
  • The Hash Join plan runs for 3 seconds, asks for 9.7GB of RAM and doesn’t spill, but it only uses 188MB of the memory grant.

That has impacted the reliability.

In a world where memory grants adjust between executions, I’ll take the Hash Join plan any day of the week.

But this is SQL Server 2017, and we don’t get that without Batch Mode, and we don’t get Batch Mode without playing some tricks.

There are lots of solutions if you’re allowed to tune queries or indexes, but not so much otherwise.

In the next couple posts, I’ll look at different ways to approach this.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why Some SQL Server Date Functions Get Better Cardinality Estimates

Date Debate


Searching dates is a common enough task. There are, of course, good and bad ways to do this.

Aaron Bertrand’s article, Bad habits to kick: mis-handling date / range queries, is a good place to start to learn about that subject.

This isn’t quite about the same thing, just about some behavior that I thought was interesting, and how it changes between cardinality estimator versions.

Bad Robot


If you’ve been query tuning for a while, you probably know about SARGability, and that wrapping columns in functions is generally a bad idea.

But just like there are slightly different rules for CAST and CONVERT with dates, the repercussions of the function also vary.

The examples I’m going to look at are for YEAR() and MONTH().

If you want a TL;DR, here you go.

Reality Bites

If you wanna keep going, follow me!

USING


The takeaway here isn’t that doing either of these is okay. You should fully avoid wrapping columns in functions in general.

One of the main problems with issuing queries with non-SARGable predicates is that the people who most often do it are the people who rely on missing index requests to direct tuning efforts, and non-SARGable queries can prevent those requests from surfacing, or ask for an even more sub-optimal index than usual.

If you have a copy of the StackOverflow2013 database, you can replicate the results pretty easily on SQL Server 2017.

They may be slightly different depending on how the histogram is generated, but the overarching theme is the same.

Yarly


If you run these queries, and look at the estimated and actual rows in the Clustered Index scan tooltip, you’ll see they change for every query.

DECLARE @blob_eater DATETIME;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2008;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2009;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2010;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2011;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2012;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2013;
    GO

Here’s a sample from the 2008 and 2009 queries.

Wild For The Night

ED: I took a break from writing this and “went to brunch”.

Any logical inconsistencies will work themselves out eventually.

Cash Your Checks And Come Up


Alright, let’s try that again, this time by month.

If you hit yourself in the head with a hammer and forgot the TL;DR, here’s what happens:

DECLARE @blob_eater DATETIME;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 1;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 2;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 3;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 4;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 5;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 6;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 7;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 8;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 9;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 10;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 11;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 12;

If you run these, they’ll all have the same guess on the clustered index scan.

To keep things simple, let’s look at the first couple:

BADLY

The difference here is that now every single row estimate will be 205,476.

Lesson learned: The optimizer can make a decent statistical guess at the year portion of a date, but not the month portion.

In a way, you can think of this like a LIKE query.

The optimizer can make a decent guess at ‘YEAR%’, but not at ‘%MONTH%’.

Actual Facts To Snack On And Chew


The same thing happens for both new and old cardinality estimators.

DECLARE @blob_eater DATETIME;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2008
    OPTION(USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE YEAR(u.CreationDate) = 2008
    OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
    GO 

DECLARE @blob_eater DATETIME;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 12
    OPTION(USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));


    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 12
    OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
    GO

Wouldn’t Get Far


But if we combine predicates, something really different happens between Linda Cardellini estimators.

DECLARE @blob_eater DATETIME;

    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 12
    AND YEAR(u.CreationDate) = 2012
    OPTION(USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));


    SELECT @blob_eater = u.CreationDate
    FROM dbo.Users AS u
    WHERE MONTH(u.CreationDate) = 12
    AND YEAR(u.CreationDate) = 2012
    OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
    GO
WRONG

In this case, the old CE (on the right) makes a very bad guess of 1 row.

The new CE (on the left) makes a slightly better, but still not great guess.

Ended


Neither of these is a good way to query date or time data.

You can see in every tooltip that, behind the scenes, the queries used the DATEPART function, which means that also doesn’t help.

The point of this post is that someone may use a function to query the year portion of a date and assume that SQL Server does a good job on any other portion, which isn’t the case.

None of these queries are SARGable, and at no point is a missing index request raised on the CreationDate column, even though if you add one it gets used and reduces reads.
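For completeness, here’s a sketch of the SARGable way to ask for the combined December 2012 example, using a half-open date range instead of functions on the column:

/*December 2012, expressed as a range the optimizer can line up
  against the histogram and any index on CreationDate*/
SELECT u.CreationDate
FROM dbo.Users AS u
WHERE u.CreationDate >= '20121201'
AND   u.CreationDate < '20130101';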

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server Join Containment for the Common Man

Lots of smart people have written about join containment, but none of the explanations really made sense to me. I felt like a student memorizing definitions for a test. Sure, I could tell you the definitions of base and simple containment, but what practical difference does it make when it comes to cardinality estimation? The concept finally clicked when working on an Oracle query of all things, and as a result I wrote this blog post. All testing was done on SQL Server 2017 with a CE version of 140.

A Note on Join Cardinality


Join cardinality calculations are incredibly complex in SQL Server. You can get a small taste of that complexity here. I’ve chosen the example data in this blog post to avoid most of the complexity. The formulas and concepts described in this post can’t be used to model join cardinality generally, but I hope that they serve as a good illustration of containment.

Demo Tables


All of the demo tables have identical structures with similar data. The first column, UNIQUE_ID, stores unique integers in the range specified in the table name. For example, TA_1_TO_1000000 is a table that stores integers from 1 to 1000000. The second column, MOD_FILTER, stores integers from 1 to 100 cycling through all rows. The purpose of this column is to make filtering cardinality estimates simple to calculate and predict. For example, MOD_FILTER BETWEEN 1 AND 50 will return 50% of the rows from the table. Full statistics are gathered on all columns, and there are four tables in all.

DROP TABLE IF EXISTS dbo.TA_1_TO_1000000;

CREATE TABLE dbo.TA_1_TO_1000000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TA_1_TO_1000000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TA_1_TO_1000000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TA_1_TO_1000000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TB_1_TO_1000000;

CREATE TABLE dbo.TB_1_TO_1000000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TB_1_TO_1000000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TB_1_TO_1000000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TB_1_TO_1000000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TC_1_TO_100000;

CREATE TABLE dbo.TC_1_TO_100000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TC_1_TO_100000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (100000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TC_1_TO_100000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TC_1_TO_100000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TD_500001_TO_1500000;

CREATE TABLE dbo.TD_500001_TO_1500000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TD_500001_TO_1500000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) 500000 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TD_500001_TO_1500000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TD_500001_TO_1500000 (MOD_FILTER)
	WITH FULLSCAN;

The statistics objects are perfect in that they fully describe the data. Here’s the statistics output for the UNIQUE_ID column:

a21_T2_perfect_stats_1

And here’s the output for the MOD_FILTER column:

a21_T2_perfect_stats_2

This only happened because the table was populated with very simple data that fits well within the framework for generating histograms in SQL Server. Gathering statistics, even with FULLSCAN, will often not perfectly represent the data in the column.

A Simple Model of Join Cardinality Estimation


Consider the following simple query:

SELECT *
FROM TB_1_TO_1000000 b
INNER JOIN dbo.TD_500001_TO_1500000 d
	ON b.UNIQUE_ID = d.UNIQUE_ID;

We know that exactly 500000 rows will be returned, but how might SQL Server estimate the number of rows to be returned? Let’s look at the histograms and try to align their steps:

a21_ex1_not_aligned

That doesn’t exactly work, but we can split up the histogram steps so they align. The assumption of uniformity within the step isn’t even needed here because we know that there aren’t any missing integer values. The histograms below are equivalent to the original ones:

a21_ex1_aligned

Now the RANGE_HI_KEY values align. For the step with a high value of 500001 we can expect only one row to match between tables. For the step with a high value of 1000000 we can expect 499998 + 1 rows to match. This brings the total row estimate to 500000, which happens to match what I get in SQL Server 2017 with the new CE. Remember, what we’re doing here isn’t how the query optimizer does the calculation. This is just a simple model that will be useful later.

Now consider the two queries below:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 51 AND 100;

We know that the first query will return 500k rows and the second query will return 0 rows. However, can SQL Server know that? Each statistics object only contains information about its own column. There’s no correlation between the UNIQUE_ID and MOD_FILTER columns, so there isn’t a way for SQL Server to know that the queries will return different estimates. The query optimizer can create an estimate based on the filters on the WHERE clause and on the histograms of the join columns, but there’s no foolproof way to do that calculation. The presence of the filters introduces uncertainty into the estimate, even with statistics that perfectly describe the data for each column. The containment assumption is all about the modeling assumption that SQL Server has to make to resolve that uncertainty.

Base Containment


Base containment is the assumption that the filter predicates are independent from the join selectivity. The estimate for the join should be obtained by multiplying together the selectivity from both filters and the join. The query optimizer uses base containment starting with CE model version 120, also known as the new CE introduced in SQL Server 2014. It can be used with the legacy CE if trace flag 2301 is turned on. The best reference for trace flag 2301 is a blog post from 2006 which is no longer published.

Let’s go back to this example query:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

The selectivity for the filter on MOD_FILTER is 0.5 for both tables. This is because there are 100 unique values for MOD_FILTER between 1 and 100 and each value matches 1% of the table. We can see this by getting an estimated query plan on just TA_1_TO_1000000:

a21_ex2_filter_selectivity

The table has 1 million rows, so the estimate is 500000 = 0.5 * 1000000.
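The filter-only query behind that estimate isn’t shown, but it would be something along these lines:

SELECT *
FROM dbo.TA_1_TO_1000000 a
WHERE a.MOD_FILTER BETWEEN 1 AND 50;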

That leaves the join selectivity. We put the same data into both tables:

a21_ex2_same_histograms

We don’t need highlighters to see that the join selectivity is 1.0.

Putting it all together, the cardinality estimate under base containment for this query should be 1000000 * 1.0 * 0.5 * 0.5 = 250000. This is indeed the estimate:

a21_ex2_base_estimate

Of course, this doesn’t match the actual number of rows which is 500000. But it’s easy to change the filter predicates so that the estimated number of rows and the actual number of rows match.

Simple Containment


Simple containment is the assumption that the filter predicates are not independent. The estimate for the join should be obtained by applying the filter selectivities to the join histograms and joining based on the adjusted histograms. The query optimizer uses simple containment within the legacy CE. Simple containment can be used in the new CE via trace flag or USE HINT.

Let’s go back to the same example query:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

We know that the filter selectivity for both tables is 0.5. How can that be used to adjust the histograms? The simplest method would be to just multiply RANGE_ROWS, EQ_ROWS, and DISTINCT_RANGE_ROWS by the filter selectivity. After doing so we’re left with two still identical histograms:

a21_ex2_simple_histograms

It might seem odd to work with fractions of a row, but as long as everything is rounded at the end why should it matter? With two identical, aligned histograms it seems reasonable to expect a cardinality estimate of 0.5 + 499999 + 0.5 = 500000. This is exactly what we get in SQL Server:

a21_ex2_simple_estimate

The actual number of rows matches the estimate because the filters are perfectly correlated. Every row left after filtering still has a matching row in the other table.

Just One Filter


What happens if we filter on just a single table? For example:

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 30;

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 30
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

For base containment, we know that the filter selectivity is 0.3 and the join selectivity is 1.0. We can expect a cardinality estimate of 1000000 * 1.0 * 0.3 = 300000 rows.

For simple containment we need to multiply the histogram for TA_1_TO_1000000 by 0.3. Here’s what the two histograms look like after factoring in filter selectivity:

a21_ex3_simple_histograms

What should the estimate be? One approach would be to assume that everything matches between the aligned steps. So we end up with 0.3 rows from the step with a RANGE_HI_KEY of 1 and 299999.4 + 0.3 rows from the step with a RANGE_HI_KEY of 1000000. The combined estimate is 300000 rows, which matches the base containment estimate. Why shouldn’t they match? Without filters on both tables there’s no concept of correlation. If it helps you can imagine a filter of 1 = 1 on TB_1_TO_1000000. For base containment multiplying by 1.0 won’t change the estimate and for simple containment multiplying by 1 won’t change the histogram. That just leaves a single filter selectivity of 0.3 for TA_1_TO_1000000 and both estimates should be the same.

For both queries the estimated number of rows in SQL Server is 300000. Our calculations match the SQL Server query optimizer exactly for this query.

Filtering on the Join Column


What happens if we filter on the join columns of both tables? For example:

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.UNIQUE_ID BETWEEN 1 AND 200000
AND b.UNIQUE_ID BETWEEN 1 AND 200000;

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.UNIQUE_ID BETWEEN 1 AND 200000
AND b.UNIQUE_ID BETWEEN 1 AND 200000
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

Think back to why we need containment in the first place. When there are filters on columns that aren’t the join columns then we need to make an assumption as to how the selectivities all interact with each other. With a filter on the join column we can just adjust the histogram of the join column directly. There isn’t any uncertainty. Here’s what the histograms could look like:

a21_ex4_histograms

In which case, it seems obvious that the final estimate should be 200000 rows. Simple containment does not result in a different estimate here.

Removing Rows


So far the examples have been very simple. We’ve joined tables that contain the exact same data. What if one table has fewer rows than the other table? Consider the following pair of queries:

SELECT *
FROM dbo.TC_1_TO_100000 c
INNER JOIN dbo.TB_1_TO_1000000 b
	ON c.UNIQUE_ID = b.UNIQUE_ID
WHERE c.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

SELECT *
FROM dbo.TC_1_TO_100000 c
INNER JOIN dbo.TB_1_TO_1000000 b
	ON c.UNIQUE_ID = b.UNIQUE_ID
WHERE c.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

It’s important to call out here that TC_1_TO_100000 has just 100000 rows instead of one million. For base containment, we know that the selectivity will be 0.5 for both tables. What about join selectivity? The histogram steps of course aren’t aligned:

a21_ex5_initial_histograms

The data is densely packed, so we can use the same trick as before to split the histogram for the larger table:

a21_ex5_base_aligned_histograms

Every row in the histogram for the smaller table has a match in the histogram of the larger table. From the point of view of the smaller table the join selectivity is 1.0. Multiplying together all three selectivities gives a final row estimate of 100000 * 1.0 * 0.5 * 0.5 = 25000. This matches the row estimate within SQL Server exactly.

For simple containment we need to apply the filter selectivities of 0.5 to both tables. We also need to align the histograms by splitting the larger histogram. Both will be done in one step:

a21_ex5_simple_histograms

Every row in the smaller histogram once again matches. Our final estimate is 0.5 + 49999 + 0.5 = 50000 which exactly matches the SQL Server query optimizer.

Unmatched Rows


What happens if the tables have the same number of rows but they clearly don’t contain the same data? Consider the following pair of queries:

SELECT *
FROM dbo.TD_500001_TO_1500000 d
INNER JOIN dbo.TB_1_TO_1000000 b
	ON d.UNIQUE_ID = b.UNIQUE_ID
WHERE d.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 10;

SELECT *
FROM dbo.TD_500001_TO_1500000 d
INNER JOIN dbo.TB_1_TO_1000000 b
	ON d.UNIQUE_ID = b.UNIQUE_ID
WHERE d.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 10
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

The filter selectivity for TB_1_TO_1000000 is 0.1 and the filter selectivity for TD_500001_TO_1500000 is 0.5. Here are our starting histograms:

a21_ex6_base_initial_histograms

The little man who lives inside the cardinality estimator needs to slice them up so they align. His work is complete:

a21_ex6_base_sliced_histograms

The top histogram has 500000 unmatched rows in the step with a RANGE_HI_KEY of 1500000, so the join selectivity is 500000 / 1000000 = 0.5. Putting all three selectivities together, the cardinality estimate with base containment should be 1000000 * 0.5 * 0.1 * 0.5 = 25000. This exactly matches SQL Server.

You know the drill for simple containment. We need to multiply each sliced histogram by its filter selectivity:

a21_ex6_simple_sliced_histograms

That’s pretty messy. I’m going to assume that every row has a match between the two shared steps, so the estimate should be 0.1 + 49999.8 + 0.1 = 50000. The number of estimated rows reported by SQL Server is 50000.4:

a21_ex6_simple_estimate

What happened? Did the little man only measure once before cutting? This is one of those examples where there’s other complicated stuff going on under the hood, so the predicted row estimate doesn’t match up exactly. Interestingly, the estimate with the legacy cardinality estimator is exactly 50000.

An Approximate Formula


  • Define T1_CARDINALITY as the number of rows in the first joined table.
  • Define T1_FILTER_SELECTIVITY as the filter selectivity of the filter predicates of the first table. This number ranges from 0.0 to 1.0, with 1.0 for filters that remove no rows.
  • Define T2_CARDINALITY as the number of rows in the second joined table.
  • Define T2_FILTER_SELECTIVITY as the filter selectivity of the filter predicates of the second table. This number ranges from 0.0 to 1.0, with 1.0 for filters that remove no rows.
  • Define JOIN_SELECTIVITY as the selectivity of the two histograms of the joined columns from the point of view of the smaller table. This number ranges from 0.0 to 1.0, with 1.0 meaning that all rows in the smaller table have a match in the larger table.

Based on the tests above, we can model the cardinality estimates for base and simple containment as follows:

Base containment = JOIN_SELECTIVITY * LEAST(T1_CARDINALITY, T2_CARDINALITY) * T1_FILTER_SELECTIVITY * T2_FILTER_SELECTIVITY
Simple containment = JOIN_SELECTIVITY * LEAST(T1_FILTER_SELECTIVITY * T1_CARDINALITY, T2_FILTER_SELECTIVITY * T2_CARDINALITY)

Remember that this isn’t how SQL Server actually does it. However, I think that it shows the difference between base containment and simple containment quite well. For simple containment the filters are applied to the histograms and for base containment all of the selectivities are independent.
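As a sanity check, here’s a small sketch that plugs the numbers from the earlier filtered example (two one million row tables, filter selectivity of 0.5 on each, join selectivity of 1.0) into those formulas:

DECLARE
    @JS decimal(10, 4) = 1.0,   /*JOIN_SELECTIVITY*/
    @C1 bigint = 1000000,       /*T1_CARDINALITY*/
    @F1 decimal(10, 4) = 0.5,   /*T1_FILTER_SELECTIVITY*/
    @C2 bigint = 1000000,       /*T2_CARDINALITY*/
    @F2 decimal(10, 4) = 0.5;   /*T2_FILTER_SELECTIVITY*/

SELECT
    base_containment   = @JS * IIF(@C1 < @C2, @C1, @C2) * @F1 * @F2,             /*250000*/
    simple_containment = @JS * IIF(@F1 * @C1 < @F2 * @C2, @F1 * @C1, @F2 * @C2); /*500000*/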

A Mathematical Proof?


So far simple containment has always had a higher cardinality estimate than base containment. Looking at the formulas it certainly feels like simple should have a higher estimate. Can we prove that the estimate will always be higher using the above formulas? It’s been quite a few years so I apologize for the proof below, but I believe that it gets the job done.

Definitions:

JS = JOIN_SELECTIVITY
C1 = T1_CARDINALITY
F1 = T1_FILTER_SELECTIVITY
C2 = T2_CARDINALITY
F2 = T2_FILTER_SELECTIVITY

Attempt a proof by contradiction, so assume the opposite of what we want to prove:

JS * LEAST(C1, C2) * F1 * F2 > JS * LEAST(F1 * C1, F2 * C2)

We know that JS > 0, F1 > 0, and F2 > 0, so:

LEAST(C1, C2) > LEAST(C1 / F2, C2 / F1)

The left hand expression can only evaluate to C1 or C2. Let's assume that it evaluates to C1, so C1 <= C2. We know that F1 <= 1, so C2 <= C2 / F1. C1 / F2 >= C1, so the only hope of the inequality above being true is if C1 > C2 / F1. Putting it all together:

C1 <= C2 <= C2 / F1 < C1

That is clearly impossible. Very similar logic holds if the left hand expression evaluates to C2 (just swap the 1s and 2s above), so we know that the inequality we started out with cannot hold. Therefore:

JS * LEAST(C1, C2) * F1 * F2 <= JS * LEAST(F1 * C1, F2 * C2)

In other words:

BASE CONTAINMENT <= SIMPLE CONTAINMENT

Here’s my public domain celebration picture:

a21_anniversary-157248_960_720

The details of this stuff within SQL Server are very complicated, so this doesn’t mean that there doesn’t exist a query that has a larger cardinality estimate with base containment. However, it seems to be a safe assumption that in general simple containment will result in a larger or equal estimate compared to base containment.

Why Does Any of This Matter?


I almost created a kind of real life example here, but I ran out of time so you’re eating Zs for dinner again as usual. Let’s introduce a table to cause some trouble:

DROP TABLE IF EXISTS dbo.ROWGOAL_TROUBLES;

CREATE TABLE dbo.ROWGOAL_TROUBLES (
	UNIQUE_EVEN_ID BIGINT NOT NULL,
	PAGE_FILLER VARCHAR(1000) NOT NULL
);

INSERT INTO dbo.ROWGOAL_TROUBLES
	WITH (TABLOCK)
SELECT 2 * t.RN
, REPLICATE('Z', 1000)
FROM
(
	SELECT TOP (50000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) / 100 RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

Consider the following business critical query that I run all the time:

SELECT *
FROM dbo.TA_1_TO_1000000 t1
INNER JOIN dbo.TB_1_TO_1000000 t2
	ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 1
AND t2.MOD_FILTER = 1
AND NOT EXISTS (
	SELECT 1
	FROM dbo.ROWGOAL_TROUBLES rt
	WHERE rt.UNIQUE_EVEN_ID = t1.UNIQUE_ID
)
OPTION (MAXDOP 1);

The plan doesn’t look so hot:

a21_bad_row_goal

There are unmatched rows in the ROWGOAL_TROUBLES table, so we know that the scan on the inner side of the nested loop is going to read a lot of rows. The query took about 60 seconds to finish on my machine and read 499775000 rows from the ROWGOAL_TROUBLES table. Why did this plan seem attractive to SQL Server? The query optimizer thought that only 100 rows would be returned after the join of TA_1_TO_1000000 to TB_1_TO_1000000. The filters are perfectly correlated so 10000 rows will be returned in reality. With perfectly correlated filters we can expect a better estimate if we use simple containment:

SELECT *
FROM dbo.TA_1_TO_1000000 t1
INNER JOIN dbo.TB_1_TO_1000000 t2
	ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 1
AND t2.MOD_FILTER = 1
AND NOT EXISTS (
	SELECT 1
	FROM dbo.ROWGOAL_TROUBLES rt
	WHERE rt.UNIQUE_EVEN_ID = t1.UNIQUE_ID
)
OPTION (
MAXDOP 1,
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

With a better estimate of 10000 rows comes a better query plan:

a21_no_row_goal

The query finishes in under a second on my machine.

Final Thoughts


Hopefully this blog post gives you a better understanding of the difference between base and simple containment. Read some of the other explanations out there if this wasn’t helpful. Containment is a tricky subject and you never know what it’ll take for it to make sense to you.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.