Software Vendor Mistakes With SQL Server: Not Indexing For Evolving Queries

Fledgling


Like many things, applications tend to evolve over time. New and improved functionality, Bug Fixes And Performance Improvements™, and an ever-expanding schema usually lead to new queries, or tweaks to existing ones.

When designing new queries, creating or tweaking indexes to help them along is perhaps a bit more intuitive, depending on how comfortable you are with such things. If you’re starting from near-zero there, check the link at the end of my post for 75% off of my video training. I’ll teach you how to design effective indexes.

One of the more common issues I see is that someone tweaked a query to support new functionality, and it just happened to use indexes well enough in a development environment that’s much smaller than real customer deployments.

In these cases, a better index may not be recommended by SQL Server. If it’s not obvious to the optimizer, it may also not be obvious to you, either. No offense.

All Devils


Let’s say we have this query. It’s nothing magnificent, but it’s enough to prove a couple points.

CREATE INDEX c ON dbo.Comments(CreationDate, UserId);

SELECT TOP (1000)
    C.CreationDate,
    C.Score,
    C.Text,
    DisplayName = 
        (
            SELECT 
                U.DisplayName 
            FROM dbo.Users AS U 
            WHERE U.Id = C.UserId
        )
FROM dbo.Comments AS C
WHERE C.CreationDate >= '20131215'
ORDER BY C.CreationDate DESC;

Our index serves three purposes:

  • The predicate on CreationDate
  • The order by on CreationDate
  • The correlated subquery on UserId

It’s important to keep things like this in mind: sorted data is useful for making all three of those operations more efficient.

Let’s Go


The query plan for this is quite simple and efficient.

[SQL Server query plan: no problems]

It’s hard to ask for anything faster, here, even if I am running on a VM. Two seeks, a small lookup, and everything done in 5 milliseconds.

Wolves At


But then one day a pull request comes along that changes the query slightly, to let us also filter and order by the Score column.

It looks like this now:

SELECT TOP (1000)
    C.CreationDate,
    C.Score,
    C.Text,
    DisplayName = 
        (
            SELECT 
                U.DisplayName 
            FROM dbo.Users AS U 
            WHERE U.Id = C.UserId
        )
FROM dbo.Comments AS C
WHERE C.CreationDate >= '20131215'
AND   C.Score >= 8
ORDER BY
    C.CreationDate DESC,
    C.Score DESC;

In real life, if you’re smart, your queries will be parameterized. In this blog post, the literal values are just to show you what can happen even with small changes to a query. The query plan looks like this now:

[SQL Server query plan: MPR]

Arf Arf


We still seek into our nonclustered index to search for CreationDates that pass our predicate, but now we need to evaluate the Score predicate when we do our key lookup.

Rather than just getting 1000 rows out immediately, we need to keep finding rows that pass the CreationDate predicate, but that also pass the Score predicate.

Since that’s judged by the optimizer to be a much more “expensive” task, a parallel plan is chosen. Despite that, it still takes 231 milliseconds of duration, with 844 milliseconds of CPU time.

This could have many effects on the workload, depending on how frequently the query executes. Parallel queries use more CPU threads, which can get tricky under high concurrency, since worker threads are a finite resource based on CPU count.

We can save a lot of the problems here with a slightly adjusted index, like this:

CREATE INDEX c ON dbo.Comments(CreationDate, Score, UserId)
    WITH (DROP_EXISTING = ON); -- index c already exists from earlier, so replace it in place

Delicious


With that index in place, we get a much more efficient plan back, that doesn’t need to go parallel to stay competitively fast. It’s not quite as fast as the original query, but it’s Good Enough™.

[SQL Server query plan: get bomb]

Cheese Wants


I know you’re all smart folks out there, and you’re going to have questions about the index I created, and why the columns are in the order they’re in.

If you have a copy of the StackOverflow2013 database, you might do some poking around and find that there are 374,865 rows that pass our CreationDate predicate, and only 122,402 that pass our Score filter, making Score more selective for this version of the query.

But that’s just this one execution, and things could be a lot different depending on what users filter on. The big advantage to keeping the columns in this order is that the order by remains supported. Without that, the optimizer goes back to choosing a parallel plan, and asking for a memory grant.
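If you’re curious what the other column order would cost us, here’s a sketch (the index name is made up) with Score leading instead:

CREATE INDEX c_score_first
    ON dbo.Comments(Score, CreationDate, UserId);

Both predicates are range predicates, so whichever column leads gets the seek, and the other becomes a residual. The difference is that with Score leading, the data is no longer sorted by CreationDate first, so the order by has to be satisfied with an explicit Sort operator, and that’s where the memory grant comes from.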

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Not Deduplicating Overlapping Indexes

Pete And Repeat


Building on yesterday’s post about cleaning up old indexes: Once you’ve gotten rid of indexes that aren’t used anymore, you’re gonna have some more work to do. I know, it sucks, but hopefully it won’t be too difficult or confusing. If it is, hit the link at the end of the post to drop me a line for some consulting help.

The next thing I usually do is look for nonclustered indexes that have overlapping columns in the key to merge together.

Here are four basic patterns to look at, in order of how useful they are to us currently:

  • Key columns are an exact match
  • Key columns are super/sub-sets of other indexes
  • Key columns match to a point, but then differ
  • Key columns are the same, but in a different order

For the last two, I put those aside at first. Remember that we already got rid of indexes that aren’t used at all, so now we’re left with indexes that do get used to make queries go faster (though how much will vary dramatically from database to database).

Order Is Everything


Key column order matters to us, because it defines how queries can access data in the index. Let’s take an imaginary index keyed on columns (A, B, C).

If you want to search on:

  • A: Fast
  • A, B: Fast
  • A, B, C: Fast
  • B: Slower
  • B, C: Slower
  • C: Slower

Column A being the leading key column in the index means that index data is sorted first by column A. If there are any duplicate values in there, then column B will be sorted within each group.

One way to visualize that is like this:

[Diagram: toughen up]

Finding any value(s) in column A is easy, because they’re in sorted order. But finding values in any combination of B/C means we have to scan through all the values to find ones we care about, if we’re not also searching on A.
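In query terms, that plays out something like this, against a hypothetical table dbo.t with an index keyed on (A, B, C):

SELECT t.A FROM dbo.t AS t WHERE t.A = 1;             -- seek
SELECT t.A FROM dbo.t AS t WHERE t.A = 1 AND t.B = 2; -- seek
SELECT t.A FROM dbo.t AS t WHERE t.B = 2;             -- scan, with B applied as a residual predicate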

If we have indexes on

  • A, B, C
  • B, A, C

Are they identical? Maybe sorta kinda. This is where domain knowledge about your application comes into play: knowing whether queries most often filter on A or B, and which queries are more important to the workload. If you’re not sure, leave ’em both alone for now.

Definitional


Let’s say you have a bunch of indexes from the first two categories, where the key columns might look something like this:

Duplicates:

  • Key: A, B Includes (D, E)
  • Key: A, B Includes (D, E, F, G)

Super/sub-sets:

  • Key: D, E Includes (A, B)
  • Key: D, E, F Includes (A, B, C)
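For pairs like those, the merge itself is mechanical (table and index names here are hypothetical): keep or create the widest version, then drop the narrower one.

CREATE INDEX ix_keeper
    ON dbo.t (A, B)
    INCLUDE (D, E, F, G); -- covers everything the narrower duplicate did

DROP INDEX ix_narrow ON dbo.t; -- was keyed on A, B with includes (D, E)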

There are some other things we need to consider about the indexes:

  • Included columns: Can be merged safely; order doesn’t matter
    • Only stored at the leaf level, not ordered
  • Uniqueness: Can only be preserved for exact key matches
    • A unique index on D, E is different from one on D, E, F
  • Filters: Look at usage metrics to figure these out
    • A filtered index may not be useful to all queries, especially parameterized ones

Practical


In theory, wider indexes are better indexes, because they’re more useful to more queries.

With indexes that fully cover all the columns our queries reference, we don’t need to worry about the optimizer sometimes choosing our index and sometimes choosing the clustered index depending on how many rows it thinks it’s going to have to deal with.

That’s generally a positive, but there are some caveats:

  • Indexing columns that are updated frequently can exacerbate locking and deadlocking
  • Indexing max columns over and over again can really bloat out our database
  • Indexing to include every column in the table creates whole copies of the table

That being said, you can run into certain locking problems with wider indexes if you’re using a garbage isolation level like Read Committed.

You have to strike a careful balance with indexes. Enough, but not too many. Covering, but not in a counterproductive way.

To find indexes that can be removed because they’re overlapping, a good, free tool is sp_BlitzIndex. It’s part of an open source project that I’ve spent a lot of time on. It’ll warn you about indexes that are exact and borderline duplicates, so that you can start looking at which are safe to merge together.
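If you’ve got it installed, a minimal run against one database looks like this:

EXEC dbo.sp_BlitzIndex
    @DatabaseName = N'StackOverflow2013';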

In some posts to follow, we’ll cover index design strategies that work, and how you can improve on SQL Server’s missing index requests.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Not Letting Customers Add Indexes

I have ESPN


If your developers could intuit every possible way that your application could be used, and were really good at performance tuning SQL Server, you probably wouldn’t be reading things like this.

But here I am, and here you are, and here we are, staring at each other desperately searching for a reason to go on.

Many times when I’m trying to help people achieve better performance, indexes are a big part of my analysis. Unless you have a brood of queries that are just so awful they’d thwart any and every index you could possibly throw under them, you can’t avoid this inevitability.

At the end of these calls, what I often get met with is: This is great, we just have to run it by our vendor.

Lemme explain why this is wrong, because I’ve seen that end result. Things don’t get better when you make them wait.

Sycamore Tree


You are a vendor. You vend software. The way clients use that software is up to them. They may customize it in some weird way, or they may have way more data stuck in their local site than other people.

It’s sort of like being a parent: you vend life to your kids, and you can teach them how you think they should use it, but there’s a really good chance they’re gonna make different choices.

When this happens, you need to be okay with the fact that your definition of how the software gets used is no longer applicable. This goes double for vendors who don’t bring much SQL Server performance tuning expertise to the table to begin with, because the indexes you had for your ideal usage pattern probably weren’t great either.

Running the new indexes they need by you for their usage of the software is a lot like you running what you wanna have for breakfast by me. I have no idea how you plan to spend your day. A feedbag of eggs and bacon might be totally reasonable.

The one exception is that the vendor might be able to take my suggestions and apply them to other installations. But that stinks of a different problem than hyper-specific customization: it means you have a lot of unhappy customers out there, and you just got lucky that one was willing to pay for real help.

Plastic Surgeon


I’m going to spend a number of posts talking about index follies I see all the time when looking at software vendor design patterns: not cleaning up old indexes, not adding sufficient new indexes, general index design patterns, clustered indexes, and all the silly index settings I see people change hoping to make a difference.

For now, watch this video from my paid training to learn more about why you need nonclustered indexes.

This post is to prepare you for the fact that indexes need to change over time, as your application grows. Index tuning is something you need to stay actively engaged with, otherwise you’re leaving a lot of performance on the table.

Especially for folks in the cloud, where hardware size is a monthly bill, this can be an expensive situation to end up in.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using MAX Datatypes Unnecessarily

Max For The Minimum


In an earlier post, we talked about how strings change the way SQL Server’s optimizer gives memory to queries under certain conditions. In that post, I talked about why MAX datatypes have the MAX problems.

In this post we’re going to look at a couple other issues with them:

  • You can’t put them in the key of an index
  • You can’t push predicates down to them

I know what you’re thinking, here. You’d never do that; you’re much smarter. But someday you might have to explain to someone all the reasons why they shouldn’t do that, and you might want some more in-depth reasons other than “it’s bad”.

Trust me, I have to explain this to people all the time, and I wish I had a few great resources for it.

Like these posts, I guess.

Maxamonium


First, we have this Very Useful™ query.

SELECT c = COUNT_BIG(*) FROM dbo.Posts AS P WHERE P.Body LIKE N'SQL%';
[SQL Server query plan: grouch]

The plan stinks and it’s slow as all get out, so we try to create an index.

CREATE INDEX not_gonna_happen
    ON dbo.Posts(Body);

But SQL Server denies us, because the Body column is nvarchar(max).

Msg 1919, Level 16, State 1, Line 7
Column 'Body' in table 'dbo.Posts' is of a type that is invalid for use as a key column in an index.

Second Chances


Our backup idea is to create this index, which still won’t make things much better:

CREATE INDEX not_so_helpful
    ON dbo.Posts(Id) INCLUDE(Body);

MAX columns can be in the includes list, but includes aren’t very effective for searching, unless they’re part of a filtered index. Since we don’t know what people will search for, we can’t create an explicit filter on the index either.

[SQL Server query plan: ehh no]

Even with a smaller index to read from, we spend a full two minutes filtering data out, because searching for N'SQL%' in our where clause can’t be pushed down to the index scan.

And Sensibility


Let’s contrast that with a similar index and search of a column that’s only nvarchar(250). Sure, it’s not gonna find the same things. I just want you to see the difference in the query plan and time when we’re not hitting a (max) column.

This isn’t gonna help you if you genuinely do need to store data up to ~2GB in size in a single column, but it might help people who used a max length “just to be safe”.

CREATE INDEX different_world
    ON dbo.Posts(Id) INCLUDE(Title);

SELECT c = COUNT_BIG(*) FROM dbo.Posts AS P WHERE P.Title LIKE N'SQL%';
[SQL Server query plan: helicopter team]

But if you fumbled around and found out, you might be able to downsize your columns to a byte length that actually fits the data, and do a lot better performance-wise. This search only takes about 460 milliseconds, even if we scan the entire index.

You may not like it, but this is what better performance looks like.
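If you’re wondering whether a downsize is even on the table for you, a quick check of the longest value actually stored is a decent start. This is just a sketch; DATALENGTH returns bytes, and nvarchar stores two bytes per character:

SELECT
    longest_title = MAX(DATALENGTH(P.Title)) / 2 -- bytes / 2 = characters for nvarchar
FROM dbo.Posts AS P;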

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server’s Filtered Indexes Don’t Always Have To Be Selective

Somewhere Between


I’ve had quite a bit of luck with filtered indexes over the years. When they work well, they’re absolute aces.

But there’s a bit of trickiness to them, both for getting the optimizer to match the expression in your query to the filter on the index (they have to be an exact match), and for getting it to choose that index (parameterization is a nuisance for this).

You also need to define them correctly, which means any columns that you’re filtering on need to reside in the key or included columns as well. Not doing that risks the index not being used.

One thing I hear quite a bit — I’ve probably even said it myself at times — is that unless the index filters the number of rows down an appreciable amount, the filter might be more trouble than it’s worth.

But what if…

Camel Wides


I see queries that sometimes have a lot of predicates. Where clause, join clause, the works.

When you have queries that reliably filter on some literal values, like bits or a status or something equivalent, you may want to consider filtering your index on those, even if they’re not selective, in order to accomplish two things (there’s a sketch after this list):

  1. Reduce the number of candidate key columns
  2. Reduce the effort of applying residual predicates
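Here’s a minimal sketch of the idea, using a hypothetical Orders table where most queries filter on an is_processed bit that isn’t selective at all. Per the earlier note about defining filtered indexes correctly, the filtered column rides along in the includes so the index stays usable:

CREATE INDEX ix_orders_unprocessed
    ON dbo.Orders (customer_id, order_date)
    INCLUDE (total_due, is_processed)
    WHERE is_processed = 0;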

Designing effective indexes is tough, and it’s something I see a lot of people struggle with. Particularly with complex queries that filter, join, group by, order by, maybe even throw a window function in there for good measure…

You see what I’m getting at. For most OLTP queries, I want my indexes to be where clause centric. The faster I can filter rows down to find the ones I care about, the more problems I avoid later on.

Filtering indexes is just another way of doing that.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Creating SQL Server Indexes: Stored Options vs. Creation Options

No Lies


When you create an index, there are a lot of options. Recently while working with nice people who pay me, I’ve gotten the same question a few times.

I like to hand off index change scripts to people to help them drop unused indexes, merge duplicative indexes, and add in helpful missing indexes. When I do, I always specify some options along with them to help the create process along, like MAXDOP, ONLINE, and SORT_IN_TEMPDB.

The thing is, those settings aren’t ones that kick in automatically next time you rebuild the index or something; you have to specify them each time.

Here are index creation options:

  • SORT_IN_TEMPDB
  • DROP_EXISTING
  • ONLINE
  • RESUMABLE
  • MAX_DURATION
  • MAXDOP

Here are stored index options:

  • PAD_INDEX
  • FILLFACTOR
  • IGNORE_DUP_KEY
  • STATISTICS_NORECOMPUTE
  • STATISTICS_INCREMENTAL
  • ALLOW_ROW_LOCKS
  • ALLOW_PAGE_LOCKS
  • OPTIMIZE_FOR_SEQUENTIAL_KEY
  • DATA_COMPRESSION

Roll Out


Where you use these options is situational.

For example, only Enterprise Edition can create indexes online, or using a parallel plan. You can sort in tempdb for any of them, though. I might use different DOPs depending on the size of the server, and of course if I’m creating a column store index (for those, DOP 1 is sometimes a good idea).

For the stored options, I leave most of them alone. I always start Fill Factor off at 100, and with page compression turned on. Those are both things you can adjust or remove later if they turn out to not be ideal, but I love testing them out.
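Put together, a handoff script from me usually looks something like this sketch (the exact values are situational, like I said above):

CREATE INDEX ix_handoff
    ON dbo.Posts (OwnerUserId)
    WITH
    (
        MAXDOP = 8,              -- creation option: parallelism for this build only
        ONLINE = ON,             -- creation option: Enterprise Edition only
        SORT_IN_TEMPDB = ON,     -- creation option: sort work goes to tempdb
        FILLFACTOR = 100,        -- stored option: sticks with the index
        DATA_COMPRESSION = PAGE  -- stored option: sticks with the index
    );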

Data compression is especially useful on Standard Edition servers with a limited buffer pool (128GB) and large data sets, but can be just as useful on Enterprise Edition when that antichrist VM admin refuses to part with a single more GB of memory.

And hey, maybe in the future as more workloads move to 2019+, I’ll get to spend more time optimizing for sequential keys.

Maybe.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

EAV Can Be Great In SQL Server, But It Depends On How You Query It

Valued


EAV styled tables can be excellent for certain data design patterns, particularly ones with a variable number of entries.

Some examples of when I recommend it are when users are allowed to specify multiple things, like:

  • Phone numbers
  • Physical or email addresses
  • Contact names

This is a lot better than adding N number of columns to a table, especially when either most people won’t use them, or it adds artificial restrictions.

For example, if you have a large table that was designed 10 years ago, you’re not gonna rush to add a 3rd phone number field to it for a single customer. Changing tables like that can be painful, depending on version and edition of SQL Server.

Careful


Where you need to be careful is how you design them. One particularly rough spot to end up in is with a table like this:

CREATE TABLE 
    dbo.eav
(
    entity int,
    attribute int,
    int_value int,
    date_value date,
    short_string_value nvarchar(100),
    long_string_value nvarchar(max)
);

While it does make data type consistency easier, I have to wonder about the wisdom of making the values “good” for searching. Certainly, indexing this table would be aggravating if you were going to go that route.

A design that I generally like better looks like this:

CREATE TABLE 
    dbo.eav
(
    entity int,
    attribute int,
    value sql_variant
);

While the sql_variant type is certainly not good for much, this is a proper time for it, particularly because this data should only be written to once, and only read from after. That means no searching the sql_variant column, and only allowing lookups via the entity and attribute.
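Here’s a sketch of that access pattern, with an index to support the lookups (the values are made up):

CREATE CLUSTERED INDEX ea
    ON dbo.eav (entity, attribute);

INSERT dbo.eav (entity, attribute, value)
VALUES
    (1, 1, CONVERT(sql_variant, N'555-0100')),               -- phone number
    (1, 2, CONVERT(sql_variant, CONVERT(date, '20240101'))); -- signup date

SELECT e.value
FROM dbo.eav AS e
WHERE e.entity = 1
AND   e.attribute = 1; -- lookups by entity and attribute only; never by value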

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Signs You Need Batch Mode To Make Your SQL Server Queries Faster

My Name Is


I speak with a lot of DBAs and developers who have either heard nothing about column store and batch mode, or they’ve only heard the bare minimum and aren’t sure where it can help them.

Here’s a short list of reasons I usually talk through with people.

Your Reporting Queries Hit Big Tables

The bigger your tables get, the more likely you are to benefit, especially if the queries are unpredictable in nature. If you let people write their own, or design their own reports, nonclustered column store can be a good replacement for nonclustered row store indexes that were created specifically for reporting queries.

In row store indexes, index key column order matters quite a bit. That’s not so much the case with column store. That makes them an ideal data source for reporting queries, since columns can be scanned and filtered independently.
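A sketch of what that kind of replacement might look like (the name and column list are illustrative):

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_reporting
    ON dbo.Posts (OwnerUserId, PostTypeId, Score, CreationDate);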

Your Predicates Aren’t Always Very Selective

Picture the opposite of OLTP. Picture queries that collect large quantities of data and (usually) aggregate it down. Those are the ones that get along well with column store indexes and batch mode.

If most of your queries grab and pass around a couple thousand rows, you’re not likely to see a lot of benefit, here. You wanna target the ones with the big arrows in query plans.

Your Main Waits Are I/O And CPU

If you have a bunch of waits on blocking or something, this isn’t going to be your solve.

When your main waits are CPU, it could indicate that queries are overall CPU-bound. Batch mode is useful here, because for those “big” queries, you’re passing millions of rows around and making SQL Server send each one to CPU registers. Under batch mode, you can send up to 900 at a time. Just not in Standard Edition.

When your main waits are on I/O — reading pages from disk specifically — column store can be useful because of the compression they offer. It’s easy to visualize reading more compact structures being faster, especially when you throw in segment and column elimination.

Your Query Plans Have Some Bad Choices In Them

SQL Server 2019 (Enterprise Edition) introduced Batch Mode On Row Store, which lets the optimizer heuristically select queries for Batch Mode execution. With that, you get some cool unlocks that you used to have to trick the optimizer into before 2019, like adaptive joins, memory grant feedback, etc.

While those things don’t solve every single performance issue, they can certainly help by letting SQL Server be a little more flexible with plan choices and attributes.

The Optimizer Keeps Choosing Cockamamie Parallel Sort Merge Plans That Make No Sense And Force You To Use Hash Join Hints All The Time

🤦‍♂️

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

The Right Way To Check For NULLs In SQL Server Queries

101ers


This is still one of the most common problems I see in queries.

People are terrified of NULLs. People are afraid to merge on freeways in Los Angeles.

What results is this endless stream of poorly performing queries, and some surprise logic bugs along the way.

I don’t have much else of an intro. The TL;DR is that you should use natural expressions like IS NULL or IS NOT NULL, rather than any of the built-in functions available to you in SQL Server, like ISNULL, COALESCE, et al., which are presentation layer functions with no relational meaning whatsoever.

From here on out, we’ll be calling them unnatural expressions. Perhaps that will get through to you.

Tuning Wizard


First is something I’ve covered before, but when you use unnatural expressions, the optimizer will not give you feedback about useful indexes.

[SQL Server query plan: tenting]

The first query generates a missing index request, the second one does not. The optimizer has abandoned all hope with the use of an unnatural expression.
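The queries behind those plans aren’t shown here, but a pair along these lines reproduces the behavior (the column choice is hypothetical):

SELECT c = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE P.ClosedDate IS NULL; -- natural: generates a missing index request

SELECT c = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE ISNULL(P.ClosedDate, '19000101') = '19000101'; -- unnatural: no request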

Lethargy


The other issue with unnatural expressions comes down to implicit conversion.

Take this, for instance.

DECLARE 
    @i int = 0;

SELECT 
    c = 
        CASE ISNULL(@i, '') 
             WHEN '' 
             THEN 1 
             ELSE 0 
        END;

This will return a 1, because 0 and '' can be implicitly converted to each other.

Perhaps less obvious, and more rare, is this:

DECLARE 
    @d datetime = '19000101';

SELECT 
    c = 
        CASE ISNULL(@d, '') 
             WHEN '' 
             THEN 1 
             ELSE 0 
        END;

Which will also return 1.

Not many databases have stuff going back to 1900, but I do see people using that as a canary value often enough.

Perfidy


If that’s not enough to get you off the idea, let’s look at how this stuff plays out in the real world.

First, let’s get ourselves an index. Without that, there’s fundamentally no difference in performance.

CREATE INDEX v ON dbo.Votes
    (BountyAmount);

Our gold standard will be these two queries:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.BountyAmount IS NULL;

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.BountyAmount IS NOT NULL;

The first one that checks for NULL values returns a count of 182,348,084.

The second one that checks for NOT NULL values returns a count of 344,070.

Keep those in mind!

The query plans for them both look like this:

[SQL Server query plan: jumbotron]

Which run, respectively (and respectably), in 846ms and 26ms. Obviously the query with the more selective predicate will have a time advantage, here.

Wrongly


Here’s where things start to go wrong.

This query returns incorrect results, but you’re probably used to that because of all the NOLOCK hints in your queries anyway.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE ISNULL(v.BountyAmount, '') = '';

A count of 182,349,088 is returned rather than 182,348,084, because there are 1004 rows with a bounty of 0.

Even though we have an empty string in our query, it’s implicitly converted to 0.

[SQL Server query plan: checked]

And you thought you were so clever.

Badly


The exercises in futility that I see people carrying on with often make use of ISNULL, COALESCE, and CASE expressions.

It’s worth noting here that COALESCE is only a CASE expression underneath, anyway. They are interchangeable in this respect.

For finding NULLs, people will screw up and do this:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE ISNULL(v.BountyAmount, -1) = -1;

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE COALESCE(v.BountyAmount, -1) = -1;

We can use -1 here because it doesn’t naturally occur in the data. Results are correct for both, but performance is comparatively horrible.

[SQL Server query plan: up high]

We’re looking at 2.5 seconds compared to 900ms. This situation gets worse with the more selective predicates, too.

[SQL Server query plan: down low]

These both take roughly the same time as the other unnatural forms of this query, but recall the natural version of this query finished in under 30ms.

Deadly


I hope I don’t have to write about this anymore, but at the rate I see people doing this stuff, I kind of doubt it.

Broken Record Enterprises, it feels like sometimes.

I’m not sure why anyone thinks this is a good idea. I’ve heard rumors that it comes from application developers who are used to NULLs throwing errors, and who write SQL queries the same way even though NULLs don’t pose the same threat there.

Who knows, though. Maybe people just really like the festive pink text color that functions turn in SSMS.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Dealing With Wide Missing Index Requests In SQL Server

We’ve All Been There


You’re running a query that selects a lot of columns, and you get a missing index request.

For the sake of brevity, let’s say it’s a query like this:

SELECT
    p.*
FROM dbo.Posts AS p
WHERE p.ParentId = 0;

The missing index request I get for this query is about like so:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Posts] ([ParentId])
INCLUDE ([AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],
[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],
[OwnerUserId],[PostTypeId],[Score],[Tags],[Title],[ViewCount])

But that’s laughable, because it’s essentially a shadow clustered index. It’s every column in the table ordered by <some column>.

And Again


Under many circumstances, you can trim all those included columns off and make sure there’s a usable index with ParentId as the leading column.

I’m not a fan of single key column indexes most of the time, so I’d avoid that practice.

But sure, if you have reasonably selective predicates, you’ll get a decent seek + lookup plan. That’s not always going to be the case, though, and for various reasons you may end up getting a poor-enough estimate on a reasonably selective predicate, which will result in a bad-enough plan.

Of course, other times you may not have very selective predicates at all. Take that query up there, for example. There are 17,142,169 rows in the Posts table (2013), and 6,050,820 of them qualify for our predicate on ParentId.

This isn’t a case where I’d go after a filtered index, either, because it’d only be useful for this one query. And it’d still be really wide.

There are four string columns in there, all nvarchar.

  • Title (250)
  • Tags (150)
  • LastEditorDisplayName (40)
  • Body (max)

Maybe Something Different


If I’m going to create an index like that, I want more out of it than I could get with the one that the optimizer asked for.

On a decently recent version of SQL Server (preferably Enterprise Edition), I’d probably opt for creating a nonclustered column store index here.

You get a lot of benefits from that, which you wouldn’t get from the row store index.

  • Column independence for searching
  • High compression ratio
  • Batch Mode execution

That means you can use the index for better searching on other predicates that aren’t terribly selective, the data source is smaller and less likely to be I/O bound, and batch mode is aces for queries that process a lot of rows.
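As a sketch, something like this (the name and column list are illustrative; note that max-typed columns like Body can’t go into a nonclustered column store index anyway):

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_posts
    ON dbo.Posts
    (ParentId, PostTypeId, Score, CreationDate, OwnerUserId, AnswerCount);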

Column store indexes still have some weird limitations and restrictions. Especially around data types and included columns, I don’t quite understand why there isn’t better parity between clustered and nonclustered column store.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.