Indexing

Common SQL Server Consulting Advice: Adding Indexes To Replace Multiple Indexes

Posted on April 19, 2022 (updated May 16, 2022) by Erik Darling

Buffet Tables

If you’ve ever spent any time reading about relational database theory, you’ve probably read all sorts of things about normal forms and all that good stuff. You’ve also probably immediately forgotten all of it.

At least that’s what I’d guess is what happened based on what I see when I look at database designs out there. Tables are often dozens of columns wide, with all sorts of things in there that should be in their own table.

Columns with a common prefix, like Customer_Name, Customer_Address, etc.
Columns with a common suffix, like PhoneNumber_1, PhoneNumber_2, etc.

Even fairly well-designed tables often get somewhat overloaded with information, especially when years of additional development requirements lead to a choice between adding one more column to a table and normalizing several columns out to a new table.

Sound familiar?

Let’s move on.

The more columns you have in a table, the more potential column combinations there are for indexes. Much like columns, indexes tend to get added following the path of least resistance.

Very rarely does someone consider current indexes when deciding to add an index. They might be doing something like:

Evaluating missing index requests in a query plan
Evaluating missing index requests in the DMVs
Running DTA like a darn fool and saying yes to everything
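Before saying yes to any of those, it helps to see what’s already on the table. Here’s a minimal sketch that lists each nonclustered index with its key columns in order. It assumes SQL Server 2017 or later (for STRING_AGG) and uses dbo.Posts from the examples as a stand-in for your table.

```sql
/* One row per nonclustered rowstore index, with key columns in key order.
   Included columns are left out on purpose; they have key_ordinal = 0. */
SELECT
    index_name = i.name,
    key_columns =
        STRING_AGG(c.name, N', ')
            WITHIN GROUP (ORDER BY ic.key_ordinal)
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON  ic.object_id = i.object_id
    AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON  c.object_id = ic.object_id
    AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'dbo.Posts') /* assumption: swap in your table */
AND   i.type = 2                            /* nonclustered rowstore */
AND   ic.key_ordinal > 0                    /* key columns only */
GROUP BY
    i.name
ORDER BY
    key_columns;
```

Sorting by the key column list puts indexes that share leading columns next to each other, which makes overlaps easier to spot.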

I’ll often run into tables that have indexes like these:

CREATE INDEX p1 ON dbo.Posts(OwnerUserId);
CREATE INDEX p2 ON dbo.Posts(OwnerUserId, PostTypeId);
CREATE INDEX p3 ON dbo.Posts(OwnerUserId, CreationDate, LastActivityDate);
CREATE INDEX p4 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate);
CREATE INDEX p5 ON dbo.Posts(OwnerUserId, PostTypeId, Score);
CREATE INDEX p6 ON dbo.Posts(CreationDate, LastActivityDate, Score);

I’m leaving off includes here, because you can combine those in any order without disturbing much.

Key columns are a different story, of course. Order matters quite a bit, because via the magic of co-dependency key column order defines how queries can efficiently access data.

Merging

Looking at this list of indexes, we know a couple things immediately:

The index on just OwnerUserId isn’t of any obvious value, and we can get rid of it
The index that leads on CreationDate serves a totally different type of query, and we should leave it alone for now

Assuming all of these indexes have a not-crappy read to write ratio, what do we do next?

Make sure none of them are unique
Make sure none of them are filtered
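Both of those checks can be read straight out of sys.indexes. This is a small sketch, again assuming dbo.Posts is the table you’re looking at:

```sql
/* Unique and filtered indexes are not safe candidates for a blind merge:
   uniqueness enforces a rule, and a filter changes which rows are indexed. */
SELECT
    index_name = i.name,
    i.is_unique,
    i.is_unique_constraint,
    i.is_primary_key,
    i.has_filter,
    i.filter_definition
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.Posts') /* assumption: swap in your table */
AND   i.index_id > 1                        /* nonclustered indexes only */
ORDER BY
    i.name;
```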

These indexes all have multiple key columns that sort of line up:

CREATE INDEX p2 ON dbo.Posts(OwnerUserId, PostTypeId); 
CREATE INDEX p3 ON dbo.Posts(OwnerUserId, CreationDate, LastActivityDate); 
CREATE INDEX p4 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate); 
CREATE INDEX p5 ON dbo.Posts(OwnerUserId, PostTypeId, Score);

Of these, three of them are on OwnerUserId and PostTypeId, and one isn’t. We can set that one aside, since it isn’t a merge candidate.

CREATE INDEX p2 ON dbo.Posts(OwnerUserId, PostTypeId); 
CREATE INDEX p4 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate); 
CREATE INDEX p5 ON dbo.Posts(OwnerUserId, PostTypeId, Score);

Okay, now what? Just looking at the definitions here, we can safely get rid of the first index, because any query that uses it can safely use either of the other two indexes.

But for the second two, we can’t do much. It’s not necessarily safe to add the third key column in either one as a fourth key column.

We have to look at how they’re used, first. If one index is unused or rarely used, we could make that change.

Things might get more difficult to figure out if we have another index like this:

CREATE INDEX p4 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate); 
CREATE INDEX p5 ON dbo.Posts(OwnerUserId, PostTypeId, Score);
CREATE INDEX p7 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate, LastActivityDate);

Sure, it means we could get rid of the index that ends with CreationDate, but it makes the picture a little less clear about how we should deal with Score.

Ending

To be safe, I’d probably stop after just getting rid of the ones where the key columns are a match as far as order goes, and are a subset of wider indexes.

CREATE INDEX p1 ON dbo.Posts(OwnerUserId); --Drop this one
CREATE INDEX p2 ON dbo.Posts(OwnerUserId, PostTypeId); --Drop this one
CREATE INDEX p3 ON dbo.Posts(OwnerUserId, CreationDate, LastActivityDate); --Keep this one
CREATE INDEX p4 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate); --Drop this one
CREATE INDEX p5 ON dbo.Posts(OwnerUserId, PostTypeId, Score); --Keep this one 
CREATE INDEX p6 ON dbo.Posts(CreationDate, LastActivityDate, Score); --Keep this one
CREATE INDEX p7 ON dbo.Posts(OwnerUserId, PostTypeId, CreationDate, LastActivityDate); --Keep this one
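As a rough sketch, the cleanup for the list above might look like this. It assumes SQL Server 2016 or later for DROP INDEX IF EXISTS, that you’ve scripted out the definitions first so you can put them back, and that p7 really exists on your table.

```sql
/* p1 and p2 are left-based subsets of p4, p5, and p7 */
DROP INDEX IF EXISTS p1 ON dbo.Posts;
DROP INDEX IF EXISTS p2 ON dbo.Posts;

/* p4 is only redundant because p7 starts with the same three key columns;
   skip this one if p7 isn't there */
DROP INDEX IF EXISTS p4 ON dbo.Posts;
```

If dropping feels too final, ALTER INDEX ... DISABLE keeps the definition in metadata and can be undone with a rebuild.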

After that, I’d probably look at usage stats, to see if any of them just aren’t holding up their end of the bargain.
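A minimal way to look at those usage stats is sys.dm_db_index_usage_stats. Keep in mind the counters reset when SQL Server restarts, so this sketch is only as good as your uptime. The table name is an assumption carried over from the examples.

```sql
/* Reads vs. writes per nonclustered index since the last restart.
   Indexes with no row in the DMV have not been touched at all. */
SELECT
    index_name = i.name,
    reads = ISNULL(us.user_seeks + us.user_scans + us.user_lookups, 0),
    writes = ISNULL(us.user_updates, 0),
    us.last_user_seek,
    us.last_user_scan,
    us.last_user_lookup
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON  us.object_id = i.object_id
    AND us.index_id = i.index_id
    AND us.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.Posts') /* assumption: swap in your table */
AND   i.index_id > 1
ORDER BY
    reads,
    writes DESC;
```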

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Common SQL Server Consulting Advice: Adding And Removing Indexes

Posted on April 18, 2022 (updated April 21, 2022) by Erik Darling

A Good Thing

Since you’re here reading this, I’m going to assume you know what food is. Further, I’m going to assume that you also know what salt is, because salt plays a pretty big part in how good or bad food tastes.

Skimp on salt, and you’ll see a lot of disappointed faces
Go overboard on salt, and you’ll see a lot of sour faces

This is a lot like what you’ll see if you do either one of those things with indexes in your database, unless you have Special Medical Exemptions©.

If it’s a data warehouse, you’ll probably want fewer indexes, but you’ll also probably want to use column store indexes
If it’s normal throughput OLTP, you’ll probably want more indexes than a data warehouse, but with narrower key column choices to support seek and lookup plans
If it’s OLTP + Reporting (or analytics or whatever you wanna call it) you’re going to need to make a lot of tough indexing choices for them to live together
If it’s high throughput OLTP, you’ll probably want even fewer overall indexes, and you may need to look into in-memory options for the busiest tables

With all that in mind, there might be some local factors that make some of this stuff not align particularly well with your workload. Things like version, edition, query patterns, and hardware might be holding you back. And of course, being at the mercy of a terrible vendor application would also likely put a damper on many of these plans.

If you want a quick and easy way to find indexes to both remove and add, go ahead and grab sp_BlitzIndex.

Practically Common

In my consulting work, the most common type of workload I see is OLTP with reporting on top of it, followed by analytics workloads. Pure OLTP workloads are pretty rare.

The index strategy for these workloads might look something like this:

Narrow key indexes with very few includes for OLTP queries
Fairly narrow key indexes with wider includes for reporting queries, or…
Nonclustered column store indexes for reporting queries

It’s always tricky figuring out what the best set of indexes to have around for your workload is. There are so many missing index requests, new queries, users complaining, and new tables getting added that it’s hard not to feel like you’re drowning.

You can’t have the perfect index for every query, and you shouldn’t try to. You’d end up with the sour-salt face kind of database. But what I need you to do is make a deal with yourself: Any time you want to add an index, you have to look for an index to get rid of.

Why?

Because without fail, I see these patterns all the time:

Indexes with a single key column alongside wider indexes that lead with the same column
Indexes with the exact same key columns and key column order (sometimes one has more key columns at the end)
Indexes that are totally unused by read queries but get modified a lot
Indexes with only a few reads and a ton of writes

It’s all very bothersome clutter, and you’ve got to clear these out before you can add in more.
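sp_BlitzIndex does a much more thorough job of this, but if you just want a quick look at the first two patterns, a sketch like this one finds tables where more than one nonclustered index leads with the same column. It only looks at the first key column, so treat the output as a list of places to dig, not a list of things to drop.

```sql
/* Tables in the current database with 2+ nonclustered indexes
   that share a leading key column */
WITH leading_keys AS
(
    SELECT
        i.object_id,
        leading_column = c.name
    FROM sys.indexes AS i
    JOIN sys.index_columns AS ic
        ON  ic.object_id = i.object_id
        AND ic.index_id = i.index_id
        AND ic.key_ordinal = 1 /* first key column only */
    JOIN sys.columns AS c
        ON  c.object_id = ic.object_id
        AND c.column_id = ic.column_id
    WHERE i.type = 2 /* nonclustered rowstore */
    AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
)
SELECT
    table_name =
        QUOTENAME(OBJECT_SCHEMA_NAME(lk.object_id)) +
        N'.' +
        QUOTENAME(OBJECT_NAME(lk.object_id)),
    lk.leading_column,
    index_count = COUNT_BIG(*)
FROM leading_keys AS lk
GROUP BY
    lk.object_id,
    lk.leading_column
HAVING COUNT_BIG(*) > 1
ORDER BY
    index_count DESC;
```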

Commonly Practical

There are many good reasons to avoid over-indexing. Everyone says “writes get slow”, but that’s a shabby meme. Your writes are slow because your storage is crappy and possibly because of your poorly thought out Availability Group.

Why having an excessive number of indexes (again, barring Special Medical Exemptions©) can be bad for your database performance:

Locking:
- More objects to lock means more locking generally
- More objects to lock means a greater chance of lock escalation attempts
Buffer pool:
- More indexes means more objects in the buffer pool
- Even indexes unused by reads need to be modified when the table changes
Transaction logging:
- More indexes to modify means more to keep track of in the transaction log
- Even databases in simple recovery log the same amount 99% of the time
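To put a number on the buffer pool part, here’s a hedged sketch that shows how much memory each index on one table currently occupies. It assumes dbo.Posts as the table, and be warned that sys.dm_os_buffer_descriptors can take a while to read on servers with a lot of memory.

```sql
/* Buffer pool pages (8 KB each) currently held by each index on one table */
SELECT
    index_name = ISNULL(i.name, N'(heap)'),
    buffer_pool_mb =
        CONVERT(decimal(18, 2), COUNT_BIG(*) * 8 / 1024.)
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
    ON  (au.type IN (1, 3) AND au.container_id = p.hobt_id)      /* in-row and row-overflow */
    OR  (au.type = 2       AND au.container_id = p.partition_id) /* LOB */
JOIN sys.indexes AS i
    ON  i.object_id = p.object_id
    AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
AND   p.object_id = OBJECT_ID(N'dbo.Posts') /* assumption: swap in your table */
GROUP BY
    i.name
ORDER BY
    buffer_pool_mb DESC;
```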

So the next time you want to go add an index to a table, I want you to try a few things:

Look for indexes on the table that are totally unused
Look for indexes on the table with overlapping key columns to merge
Look for indexes that could be tweaked rather than creating a whole new one

To do that, run sp_BlitzIndex like so:

EXEC dbo.sp_BlitzIndex
    @DatabaseName = N'YourDatabase',
    @SchemaName = N'ProbablyDBO',  
    @TableName = N'YourIndexingCatastrophe';

Thanks for reading!

Getting The Top Value Per Group With Multiple Conditions In SQL Server: Row Number vs. Cross Apply With MAX

Posted on February 24, 2022 (updated May 16, 2022) by Erik Darling

Throat Music

In yesterday’s post, we compared two ways of handling a simple situation: finding the top scoring post for each user.

In today’s post, we’re going to add another condition: we want the highest scoring post for each type of post someone has made.

Now look, most people don’t get involved with any of these things, but whatever. It just poses an interesting and slightly more complicated problem.

Slightly Different Index

Since we’re going to be using PostTypeId in the window function, we need it in the key of our index:

CREATE INDEX p ON dbo.Posts(OwnerUserId, PostTypeId, Score DESC)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Now our query looks like this:

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId,
                    p.PostTypeId --This is new!
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
) AS p
    ON  p.OwnerUserId = u.Id
    AND p.n = 1
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

S’good? S’good. Let’s go.

Row Number Query Plan

Similar to yesterday’s plan, this one is rather slow, rather serial, and generally not how we want to be spending our precious time.

Let’s look at the apply method, because we have to change our query a little bit to accomplish the same thing.

Cross Apply With MAX

Rather than go with TOP (1), we’re going to GROUP BY OwnerUserId and PostTypeId, and get the MAX(Score).

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT 
        p.OwnerUserId,
        p.PostTypeId,
        Score = MAX(p.Score)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    GROUP BY 
        p.OwnerUserId, 
        p.PostTypeId
) AS p
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

This will give us the same results, but a lot faster. Again.

Cross Apply Query Plan

Like I was saying…

Down to ~400ms now. Not bad, right?

Lower Selectivity

If we take those same queries and lower the reputation filter from 50,000 to 1, some interesting changes to the query plans happen.

Repartition streams and I have had some problems in the past. It’s not necessarily “to blame”, it just has a tough time with some data distributions, especially, it seems, when it’s order preserving.

The cross apply with aggregation works really well. It’s kinda neat that both queries get slower by the same amount of time, but the ROW_NUMBER query is still much, much slower.

All of this is interesting and all, but you know what? We haven’t looked at batch mode. Batch mode fixes everything.

Sort of. Don’t quote me on that. It’s just really helpful in the kind of BIG QUERY tuning that I end up doing.

Batch Mode

This is the only thing that makes the ROW_NUMBER query competitive in this scenario, owing to the fact that batch mode often removes Repartition Streams, and we’re eligible for the Window Aggregate operator.
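The post doesn’t say how batch mode was turned on for these rowstore queries, so take this as one possible route and not necessarily the one used here. A common trick is an empty filtered nonclustered columnstore index, which makes queries against the table eligible for batch mode without storing any rows. It assumes SQL Server 2016 or later, and the index name is made up.

```sql
/* The filter can never be true, so the index stays empty.
   Its presence is what lets the optimizer consider batch mode. */
CREATE NONCLUSTERED COLUMNSTORE INDEX batch_mode_please /* hypothetical name */
ON dbo.Posts (Id)
WHERE Id = -1
AND   Id = -2;
```

On SQL Server 2019 Enterprise Edition with compatibility level 150, batch mode on rowstore can kick in on its own, in which case you may not need the trick at all.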

I’m dumping these results to #temp tables because I don’t want to wait for SSMS to render the large result set, but you can still see the positive overall effect.

The poorly performing ROW_NUMBER query is now very competitive with the CROSS APPLY query.

Thanks for reading!

Getting The Top Value Per Group In SQL Server: Row Number vs. Cross Apply Performance

Posted on February 23, 2022 (updated May 16, 2022) by Erik Darling

Introductions

There are many ways to express queries in SQL. How different rewrites perform will largely be a function of:

You not doing anything ridiculous
Queries having good indexes in place
The optimizer not seeing through your tricks and giving you the same query plan

The first rule of rewrites is that they have to produce the same results, of course. Logical equivalency is tough.

In today’s and tomorrow’s posts I’m going to compare a couple different scenarios to get the top value.

There are additional ways to rewrite queries like this, of course, but I’m going to show you the most common anti-pattern I see, and the most common solution that tends to work better.

Right And Proper Indexing

For today’s post, we’re going to use this index:

CREATE INDEX p ON dbo.Posts(OwnerUserId, Score DESC) INCLUDE(PostTypeId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Which is going to give this query proper support. Sure, we could also add an index to the Users table, but the one scan is trivially fast, and probably not worth it here.

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
) AS p
    ON  p.OwnerUserId = u.Id
    AND p.n = 1
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

The general idea is to find all users with a reputation over 50,000, along with their highest scoring post.

I know, you’re looking at this and thinking “jeez Erik, why are you selecting * here? don’t you know how bad and dumb you are for that?”

Well, SQL Server is smart enough to ignore that and only deal with the columns in the outer select.

The Query Plan

If you create the index and run this, the query plan looks something like this:

The majority of the time spent in this plan is the ~11 seconds between the scan of the Posts table and the Filter operator.

The filter is there to remove rows where the result of the ROW_NUMBER function is greater than 1.

I guess you could say less than 1, too, but ROW_NUMBER won’t produce rows with 0 or negative numbers naturally. You have to make that happen by subtracting.

A Better Query Pattern?

Since only ~2400 rows are leaving the Users table, and we have a good index on the Posts table, we want to take advantage of it.

Rather than scan the entire Posts table, generate the ROW_NUMBER, apply the filter, then do the join, we can use CROSS APPLY to push things down to where we touch the Posts table.

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (1)
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    ORDER BY p.Score DESC
) AS p
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

This is logically equivalent, but it blows the other query out of the water, performance-wise.

A Better Query Plan?

Here’s the query plan for the cross apply query:

Why Is This Better?

In this case, having a good index to use, and a small outer result from the Users table, the cross apply query is way better.

This is also due to the Id column of Users being the Primary Key of the table. For this sort of one to many join, it works beautifully. If it were a many to many scenario, it could be a toss up, or ROW_NUMBER could blow it out of the water.

The way this type of Nested Loops Join works (Apply Nested Loops), is to take each row from the outer input (Users table) and seek to it in the Posts table.

Without that good index up here, this would likely be a disaster with an Eager Index Spool in the plan. We definitely don’t want that, here.

But you know, there are many different types of posts. We might want to know someone’s high score for questions, answers, and more.

In tomorrow’s post, we’ll look at how to do that, and performance tune the query.

Thanks for reading!

Software Vendor Mistakes With SQL Server: Indexing #Temp Tables Incorrectly

Posted on January 20, 2022 (updated May 16, 2022) by Erik Darling

Understanding

I’ve covered this sort of information in previous posts.

But some points should be made together, so I’m going to combine them a bit, and expand on a few points too.

I know that it’s probably an overly-lofty goal to expect people who don’t seem to have the hang of indexing regular tables down to not repeat those errors with #temp tables.

But hey, hope burns eternal. Like American Spirits (the cigarettes, not some weird metaphorical thing that Americans possess, or ghosts).

Nonclustered Index Follies: Creating Them Before You Insert Data

I’m not saying that you should never add a nonclustered index to a #temp table, but I am saying that they shouldn’t be your first choice. Make sure you have a good clustered index on there first, if you find one useful. Test it. Test it again. Wait a day and test it again.

But more importantly, don’t do this:

CREATE TABLE #bad_idea
(
    a_number int,
    a_date datetime,
    a_string varchar(10),
    a_bit bit
);

CREATE INDEX anu ON #bad_idea(a_number);
CREATE INDEX ada ON #bad_idea(a_date);
CREATE INDEX ast ON #bad_idea(a_string);
CREATE INDEX abi ON #bad_idea(a_bit);

Forget for a minute that these are a bunch of single-column indexes, which I’m naturally and correctly opposed to.

Look what happens when we try to insert data into that #temp table:

You have to insert into the heap (that’s the base table here, since we don’t have a clustered index), and then each of the nonclustered indexes. In general, if you want nonclustered indexes on your #temp tables, you should create them after you insert data, to not mess with parallel inserts and to establish statistics with a full scan of the data.
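Here’s a minimal sketch of the load-first, index-second order. The table name and the generated rows are made up for the demo, and the TABLOCK hint is there to keep the insert eligible for a parallel plan; whether you actually get one is up to the optimizer.

```sql
CREATE TABLE #better_idea /* hypothetical name */
(
    a_number int,
    a_date datetime,
    a_string varchar(10),
    a_bit bit
);

/* Step 1: load the heap while there are no nonclustered indexes to maintain */
INSERT #better_idea WITH (TABLOCK)
    (a_number, a_date, a_string, a_bit)
SELECT
    x.n,
    DATEADD(MINUTE, x.n, '20220101'),
    CONVERT(varchar(10), x.n),
    CONVERT(bit, x.n % 2)
FROM
(
    /* 100,000 throwaway rows generated from a system view */
    SELECT TOP (100000)
        n = CONVERT(int, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
    FROM sys.all_columns AS ac1
    CROSS JOIN sys.all_columns AS ac2
) AS x;

/* Step 2: build the index afterwards; the build reads every row,
   so its statistics come from a full scan */
CREATE INDEX anu ON #better_idea (a_number);
```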

Nonclustered Index Follies: If You Need Them, Create Them Inline

If for some insane reason you decide that you need indexes on your #temp table up front, you should create everything in a single statement to avoid recompilations.

CREATE TABLE #bad_idea
(
    a_number int,
        INDEX anu (a_number),
    a_date datetime,
        INDEX ada (a_date),
    a_string varchar(10),
        INDEX ast (a_string),
    a_bit bit,
        INDEX abi (a_bit)
);

I don’t have a fantastic demo for that, but I can quote a Great Post™ about #temp tables:

Do not explicitly drop temp tables at the end of a stored procedure, they will get cleaned up when the session that created them ends.
Do not alter temp tables after they have been created.
Do not truncate temp tables
Move index creation statements on temp tables to the new inline index creation syntax that was introduced in SQL Server 2014.

There are some other good points there, too. Pay attention to those as well.

Of course, there is one interesting reason for dropping #temp tables: running out of space in tempdb. I tend to work with clients who need help tuning code and processes that hit many millions of rows or more.

If you’re constantly creating large #temp tables, you may want to clean them up when you’re done with them rather than letting self-cleanup happen at the end of a procedure.

This applies to portions of workloads that have almost nothing in common with OLTP, so you’re unlikely to run into the kind of contention that those OLTP-oriented #temp table performance features are there to relieve. Reporting queries rarely do.

Thanks for reading!

Software Vendor Mistakes With SQL Server: Thinking Index Rebuilds Solve Every Problem

Posted on January 13, 2022 (updated May 16, 2022) by Erik Darling

Salt Flats

It’s hard to believe that this is still a lively debate, given how far technology has come since the original conversations around if and when fragmentation is a problem were a thing, but here we are in 2020 for the third year straight.

I will give people some credit where it’s due, I’ve seen index maintenance habits change a bit over the years:

Reducing frequency to once a week or once a month from every night
Bumping the thresholds at which they reorg and rebuild to way higher than 5% and 30%, like 50% and 80%
Abandoning it altogether when using AGs or other data synchronization technologies
Realizing how powerful and flexible statistics updates are, especially when using Ola’s scripts
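For that last point, here’s a sketch of a statistics-only run using Ola Hallengren’s IndexOptimize procedure. It assumes his Maintenance Solution is installed in the database you run this from. Setting the three fragmentation parameters to NULL skips index reorgs and rebuilds entirely.

```sql
/* Update only statistics that have had modifications; touch no indexes */
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = NULL,
    @FragmentationHigh = NULL,
    @UpdateStatistics = 'ALL',
    @OnlyModifiedStatistics = 'Y';
```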

It’s a good start, but people still ascribe far too many benefits to doing it. Rather than rehash everything I’ve ever said about it, I’m gonna post a video of Erin Stellato and me discussing the pros, cons, whens, wheres, whys, and hows:

Thanks for reading (and watching)!

Software Vendor Mistakes With SQL Server: Lowering Fill Factor For Every Index

Posted on January 12, 2022 (updated May 16, 2022) by Erik Darling

Times And Places

I can’t remember the last time I saw someone lower fill factor for a good reason. Perhaps those just aren’t the type of folks who need my help, or perhaps no one has done it for a good reason since Windows 2003 dropped and part of Any Good Consultation™ was checking to make sure disks were aligned properly.

What a bad time for disks, that was. Thank the Magnetic Fields that SSD and Flash came along to sit at the other end of a SAN with a shoestring and some Dixie cups between them.

But anyway, the story used to go something like this:

We have a lot of page splits

Are they good or bad?

Aren’t they all bad?

No, but we should look more closely at page density to figure out…

FIXED IT!

What?

I set Fill Factor to 70 for everything. We’re cool.

This is, of course, wrong-headed in the same way that applying anything that works to fix a specific thing across the board is.

What Fill Factor Does

When you change Fill Factor, whether it’s at the server level or the index level, what you’re doing is telling SQL Server how full to make each data page at the leaf level of an index, with the rest left as free space. Set it to 70, and you get pages that are 30% empty. But only when you rebuild or reorganize them. At no point in-between does SQL Server care about that percentage.

At the very best, you’re only buying yourself some time until you have “page splits” again. Depending on how busy a bottom your table is, you might need to do index maintenance quite frequently in order to get that fill factor back.

And you know what? That’s a real bad time. It’s what I refer to as spinning disk mentality.

I can’t imagine how anyone would track bad page splits in a meaningful way, and apply fill factor in a way that would permanently keep them at bay.

The worst part about Fill Factor is that it gets applied to all pages — even ones that are in no danger of facing a split — and every time you apply it, your indexes get bigger as you add free space back to new pages.

Since people always seem to want to shrink the ever lovin’ beet juice out of their databases, I assume they hate big databases that take up a lot of disk space. One way to get a big database really fast is to add a bunch of empty space to all of your tables and indexes.
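If you’ve inherited a database and want to know whether someone already did this to you, a quick sketch like this shows the server-wide setting and any indexes in the current database with a lowered fill factor. Values of 0 and 100 both mean pages get filled completely.

```sql
/* Server-wide default fill factor (0 = fill pages completely) */
SELECT
    c.name,
    c.value_in_use
FROM sys.configurations AS c
WHERE c.name = N'fill factor (%)';

/* Indexes in the current database with a non-default fill factor */
SELECT
    table_name =
        QUOTENAME(OBJECT_SCHEMA_NAME(i.object_id)) +
        N'.' +
        QUOTENAME(OBJECT_NAME(i.object_id)),
    index_name = i.name,
    i.fill_factor
FROM sys.indexes AS i
WHERE i.fill_factor NOT IN (0, 100)
AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY
    i.fill_factor;

/* To put one back, rebuild it with the default. Left commented out
   because a rebuild is not something to run by accident.
ALTER INDEX p ON dbo.Posts REBUILD WITH (FILLFACTOR = 100);
*/
```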

What Fill Factor Doesn’t Do

Fill Factor doesn’t make read queries faster, especially if you’ve designed them mostly to accommodate Seeks in an OLTP environment. Seeks do not fall victim to these things the way scans do, because they navigate to individual rows.

They do just about the same amount of work no matter what, unless you add more levels to your indexes, but that tends to happen as they get larger, anyway.

And, good news, lowering Fill Factor will make Big Scans way slower. Why? They have to read more pages, because you decided to add a bunch of empty space to them. You’re really racking up the wins here, boss.

Not only is your database artificially huge, but all those reporting queries you’re worried about bringing your transactional queries to a halt take longer and do the dreaded “more reads” 😱 😱 😱

I often call Fill Factor “silent fragmentation”, because it reduces the density of your data pages dramatically, depending on what you lower it to. And it’s the actual bad kind of fragmentation — physical fragmentation — not the stuff your index maintenance scripts look at.
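Page density is something you can measure. This sketch reads it from sys.dm_db_index_physical_stats in SAMPLED mode, since the default LIMITED mode doesn’t populate the density column. The table name is an assumption, and on big tables even a sampled pass reads a fair number of pages.

```sql
/* Average leaf page fullness per index; lower numbers mean more empty space */
SELECT
    index_name = ISNULL(i.name, N'(heap)'),
    ps.page_count,
    ps.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats
(
    DB_ID(),
    OBJECT_ID(N'dbo.Posts'), /* assumption: swap in your table */
    NULL,
    NULL,
    'SAMPLED'
) AS ps
JOIN sys.indexes AS i
    ON  i.object_id = ps.object_id
    AND i.index_id = ps.index_id
WHERE ps.index_level = 0
AND   ps.alloc_unit_type_desc = 'IN_ROW_DATA'
ORDER BY
    ps.avg_page_space_used_in_percent;
```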

Thanks for reading!

Software Vendor Mistakes With SQL Server: Not Using Filtered Indexes Or Indexed Views

Posted on January 11, 2022 (updated May 16, 2022) by Erik Darling

Basic Training

In yesterday’s post, we covered some of the basics of designing nonclustered indexes to make queries go faster, but sometimes those aren’t quite enough.

In today’s post, I’m going to give you more of my paid training about filtered indexes and indexed views.

What I cover here is how to use them correctly, and some of the things they just don’t work well with. Again, if you like what you see, hit the link at the end of the post for 75% off.

Filtered Indexes

Here’s the intro to filtered indexes:

Here are the demos:

Indexed Views

Here’s the intro to indexed views:

Here are the demos for indexed views:

Software Vendor Mistakes With SQL Server: Designing Nonclustered Indexes Poorly

Posted on January 10, 2022 (updated May 16, 2022) by Erik Darling

Jammy Jam

There’s a bit of magic to index tuning, once you move beyond the basics. But we do have to start with the basics. In order to do that quickly, I’m putting a couple videos from my paid training in this post. If you like what you see, hit the link at the end of the post to get 75% off everything.

Software Vendor Mistakes With SQL Server: Avoiding Clustered Indexes

Posted on January 7, 2022 (updated May 16, 2022) by Erik Darling

Come Clean

Without fail, I seem to have this conversation every week:

Why is this a heap?

The vendor doesn’t like clustered indexes.

Ah, okay, why?

I don’t know.

There’s a nonclustered primary key, though…

Isn’t that the default?

No, you have to go out of your way to choose that.
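If you want to see that for yourself, here’s a small sketch with two throwaway #temp tables (the names are made up). A plain PRIMARY KEY gets a clustered index by default; you only end up with a heap plus a nonclustered primary key if you ask for it.

```sql
CREATE TABLE #default_pk
(
    id int NOT NULL PRIMARY KEY /* clustered unless you say otherwise */
);

CREATE TABLE #heap_with_pk
(
    id int NOT NULL PRIMARY KEY NONCLUSTERED /* the out-of-your-way version */
);

/* #default_pk shows one CLUSTERED row;
   #heap_with_pk shows a HEAP row plus a NONCLUSTERED primary key row */
SELECT
    table_name =
        CASE i.object_id
             WHEN OBJECT_ID(N'tempdb..#default_pk')
             THEN N'#default_pk'
             ELSE N'#heap_with_pk'
        END,
    i.type_desc,
    i.is_primary_key
FROM tempdb.sys.indexes AS i
WHERE i.object_id IN
      (
          OBJECT_ID(N'tempdb..#default_pk'),
          OBJECT_ID(N'tempdb..#heap_with_pk')
      )
ORDER BY
    table_name,
    i.index_id;
```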

Wann Ist Es

Heaps aren’t always bad, but you have to be careful when you choose to use them, because the problems that sneak up with them are tricky to detect and annoying to fix. If you never run into them, great.

But if you do, try to keep an open mind. Clustered indexes work wonderfully for a great many people, and it’s unlikely that you fall far out of that category, especially if you have anything within a horseshoe or hand grenade from an OLTP workload.

When I look at client workloads that have problems with heaps, the main things that I call out are either:

Forwarded fetches that result in uneven I/O patterns
Captive pages from tiny deletes that don’t remove empty pages

Rather than go on and on about this stuff, here are a couple videos where I discuss the downsides and upsides of heaps. Normally they’re part of my paid training, but I’m making them available here to you for free:

To find tables that might need clustered indexes, a good, free tool is sp_BlitzIndex. It’s part of an open source project that I’ve spent a lot of time on. It’ll warn you about heaps that have forwarded records in them, and that have been deleted from. From there, it’s up to you to figure out if the table is big and critical enough to warrant adding a clustered index to.
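If you can’t install anything, a rough do-it-yourself version is sketched below. It lists heaps in the current database along with forwarded fetches and deletes from sys.dm_db_index_operational_stats. Those counters only go back as far as the object’s metadata has been cached, so low numbers on a freshly restarted server don’t prove much.

```sql
/* Heaps (index type 0) with their forwarded fetch and delete activity */
SELECT
    table_name =
        QUOTENAME(OBJECT_SCHEMA_NAME(i.object_id)) +
        N'.' +
        QUOTENAME(OBJECT_NAME(i.object_id)),
    forwarded_fetches = SUM(os.forwarded_fetch_count),
    deletes = SUM(os.leaf_delete_count)
FROM sys.indexes AS i
JOIN sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS os
    ON  os.object_id = i.object_id
    AND os.index_id = i.index_id
WHERE i.type = 0 /* heap */
AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
GROUP BY
    i.object_id
ORDER BY
    forwarded_fetches DESC;
```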