Let’s Design An Index Together Part 1

Just One Index


I want both of these queries to be fast.

SELECT TOP (5000)
       p.LastActivityDate,
       p.PostTypeId,
       p.Score,
       p.ViewCount
FROM dbo.Posts AS p
WHERE p.PostTypeId = 4
AND   p.LastActivityDate >= '20120101'
ORDER BY p.Score DESC;


SELECT TOP (5000)
       p.LastActivityDate,
       p.PostTypeId,
       p.Score,
       p.ViewCount
FROM dbo.Posts AS p
WHERE p.PostTypeId = 1
AND   p.LastActivityDate >= '20110101'
ORDER BY p.Score DESC;

Get to work.



24 thoughts on “Let’s Design An Index Together Part 1

  1. Create an Index on PostTypeID(if it is not clustered index) and lastactivitydate

    I don’t see the table design (sorry if I missed that)

  2. Alright I’ll take a shot, I am guessing LastActivityDate is more selective than PostTypeId. And Score probably more so than LastActivityDate, although Score and PostTypeId are possibly skewed a lot more? Are your stats up to date? 🙂

    CREATE NONCLUSTERED INDEX IX_ShotInTheDark ON [dbo].[Posts] ([LastActivityDate], [PostTypeId], [Score] DESC)
    INCLUDE ([ViewCount]);

    I have a feeling Score first may be the faster though.

  3. LastActivityDate (the way we’re using it) is not very selective; PostTypeId is pretty good. For PostTypeId = 1, there are 1825 distinct scores. For PostTypeId = 5, there is only 1. There are lots of duplicate scores, and the vast majority of them are low scores. The lowest score that would be included is 391, with about ~20 that could be chosen from arbitrarily.

    SEDE link (don’t have the database on my machine and too lazy to download it): https://data.stackexchange.com/stackoverflow/query/1177637/erik-darling-design-an-index-together-v1?opt.withExecutionPlan=true#resultSets

    We could maybe try a filtered index, although I strongly suspect that we couldn’t get one that would match these predicates (and we’d have to make it on the underlying table anyway, and then it definitely wouldn’t be used).

    I think a CCI is actually going to be the best index, but if we want to avoid columnstore I’d probably guess something like this:

    CREATE NONCLUSTERED INDEX PlaytimeWithErik
    ON dbo.Posts ( PostTypeId, Score DESC, LastActivityDate DESC )
    INCLUDE ( ViewCount );

    PostTypeId is first because I think it is always going to be the most selective.

    I’m uncertain about Score vs LastActivityDate; intuitively I could imagine SQL Server scanning the Score list in descending order, checking for LastActivityDate >= our filter, and quitting once 5000 rows are returned. I’m reasonably confident it won’t do that, in which case LastActivityDate should probably be second.

    LastActivityDate DESC so we can scan forwards instead of backwards (again, intuition that SQL Server can figure this out, not empirical).

    Score DESC to match the ORDER BY, but this only matters if it lets us avoid the sort. It’ll only avoid the sort if my intuition above works, because LastActivityDate is not equality but a range, so I’ll move it to be in the INCLUDE if we change the order.

    ViewCount in the INCLUDE because we don’t need it for anything fancy.

  4. I would create an index

    CREATE NONCLUSTERED INDEX IX_PostTypeID_LastActivityDate_Score_includes ON [dbo].[Posts] (PostTypeId, LastActivityDate, Score) INCLUDE (ViewCount);

  5. I guess I am the weirdo in the group, but putting score at the front of the index allows it to get rid of the sort and grab the necessary records much quicker:

    CREATE INDEX IX_UR_MOM on posts(Score,PostTypeID,LastActivityDate ) INCLUDE(ViewCount)

  6. I don’t really like designing indexes for specific queries only. I much prefer to look at the entire workload.
    But assuming these are the CEO’s pet reports, I’ll give it a stab.

    Equality filters are more effective then inequality, so the index leads with PostTypeId, followed by LastActivityDate. There is no need to put Score as in index column, after an inequality filter the remaining columns are no longer sorted in the result set. (Had the query used equality on LastActivityDate, my suggestion would have been different). So Score is not indexed, and neither is ViewCount. But they do need to be included, else we’ll need a lookup.
    (If any of the two included columns is in the clustered index, then including them makes me look silly but doesn’t hurt).

    CREATE INDEX DontTellEriksBossHeDidntDoThisHimself
    ON dbo.Posts(PostTypeID, LastActivityDate)
    INCLUDE(Score, ViewCount);

    If you really want to go overboard (but I expect that it will save milliseconds at best, at the price of being an overall far less usable index), then add a filter too:

    CREATE INDEX DontTellEriksBossHeDidntDoThisHimself
    ON dbo.Posts(PostTypeID, LastActivityDate)
    INCLUDE(Score, ViewCount)
    WHERE PostTypeID IN (1,4) AND LastActivityDate >= ‘20110101’;

      1. It took you THIS long to figure that out? Guess where that places you in the smartness ranking?

        Seriously: You are right that my answer was incorrect. I realized that after posting. But you are not correct if you claim that my answer is not the fastest possible index. It might be, and the question doesn’t contain enough data to know for sure.

        Using the second query as example, let’s say the PostTypeID filter reduces the data to a million rows. If 990,000 of them are from before 2011, then I want “my” index to not have to read those 990,000 rows and I’ll happily pay the price of sorting the remaining 10,000 rows.
        But if 990,000 of those posts are recent, then “your” index prevents a huge sort of almost a million rows, while only reading 1% more data.

        So as usual, the only correct answer is “it depends”.
        (Perhaps I should use ix_ItDependsButOnWhat as my future index name … for all indexes)

  7. Because we have:
    – TOP(5000)
    – Order by Score DESC
    – WHERE p.PostTypeId =

    i will create one index

    create nonclustered index idx_Posts___PostType_Score_LastActivity_I ON dbo.Posts
    (
    PostTypeId
    ,Score DESC
    ,LastActivityDate
    )
    Include (ViewCount)

Leave a Reply

Your email address will not be published. Required fields are marked *