Our job now is to figure out how to even things out. To do that, we’re gonna need to mess with out index a little bit.
Right now, we have this one:
CREATE INDEX whatever ON dbo.Posts(PostTypeId, LastActivityDate) INCLUDE(Score, ViewCount);
Which is fine when we need to Sort a small amount of data.
SELECT TOP (5000) p.LastActivityDate, p.PostTypeId, p.Score, p.ViewCount FROM dbo.Posts AS p WHERE p.PostTypeId = 4 AND p.LastActivityDate >= '20120101' ORDER BY p.Score DESC;
There’s only about 25k rows with a PostTypeId of 4. That’s easy to deal with.
The problem is here:
SELECT TOP (5000) p.LastActivityDate, p.PostTypeId, p.Score, p.ViewCount FROM dbo.Posts AS p WHERE p.PostTypeId = 1 AND p.LastActivityDate >= '20110101' ORDER BY p.Score DESC;
Theres 6,000,223 rows with a PostTypeId of 1 — that’s a question.
Don’t get me started on PostTypeId 2 — that’s an answer — which has 11,091,349 rows.
What a lot of people try first is an index that leads with Score. Even though it’s not in the WHERE clause to help us find data, the index putting Score in order first seems like a tempting fix to our problem.
CREATE INDEX whatever ON dbo.Posts(Score DESC, PostTypeId, LastActivityDate) INCLUDE(ViewCount)
The result is pretty successful. Both plans are likely fast enough, and we could stop here, but we’d miss a key point about B-Tree indexes.
What’s a bit deceptive about the speed is the amount of reads we do to locate our data.
We only need to read 15k rows to find the top 5000 Questions — remember that these are very common.
We need to read many more rows to find the top 5000… Er… Whatever a 4 means.
Nearly the entire index is read to locate these Post Types.
Meet In The Middle
The point we’d miss if we stopped tuning there is that when we add key columns to a B-Tree index, the index is first ordered by the leading key column. If it’s not unique, then the second column is ordered within each range of values.
Putting this together, let’s change our index a little bit:
CREATE INDEX whatever ON dbo.Posts(PostTypeId, Score DESC, LastActivityDate) INCLUDE(ViewCount) WITH (DROP_EXISTING = ON);
With the understanding that seeking to a single PostTypeId column will bring us to an ordered Sort column for that range of values.
Now our plans look like this:
Which allows us to both avoid the Sort and keep reads to a minimum.
When designing indexes, it’s important to keep the goal of queries in mind. Often, predicates should be the primary consideration.
Other times, we need to take ordering and grouping into account. For example, if we’re using window functions, performance might be unacceptable without indexing the partition by and order by elements, and we may need to move other columns to parts of the index that may not initially seem ideal.
Thanks for reading!