Distinct Rows, And The Columns You Choose

Footnote

I have two queries. They return the same number of rows.

The only difference is one column in the select list.

The first query includes the Id column, which is the primary key and clustered index key of the Posts table.

The query plan for it looks like this:

Eligible

Notice that no operator in this plan performs any kind of aggregation.

There’s no Hash Match Aggregate, no Stream Aggregate, no Distinct Sort, NADA!

It runs for ~1.9 seconds to return about 25k rows.

Lessen

Watch how much changes when we remove that Id column from the select list.

This is what the query plan now looks like:

What’s wrong with you.

Zooming in a bit…

Woof.

After we Scan the Posts table, we sort about 47k rows.

After the join to Votes, we aggregate data twice. There are two Stream Aggregate operators.

What do we sort?

Boogers.

We Sort every column in the select list by every column in the select list.

In other words, we order by every column we’ve selected.

What do we aggregate?

Everything. Twice.
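If it helps to picture why a Sort feeds an aggregate here, this is a minimal Python sketch of sort-based duplicate removal: order by every selected column so equal rows end up next to each other, then keep one row per run. The rows and column names are made up for illustration, and this is not a description of SQL Server's internals.

```python
from itertools import groupby

# Hypothetical rows shaped like (OwnerUserId, Score, PostTypeId).
# There is no unique column here, so duplicates are possible.
rows = [
    (22656, 10, 2),
    (1, 3, 2),
    (22656, 10, 2),
    (1, 3, 2),
    (1, 5, 2),
]

def sort_based_distinct(rows):
    # Step 1: order by every selected column so duplicates sit side by side.
    ordered = sorted(rows)
    # Step 2: walk the ordered rows and keep one row per run of equal rows.
    return [key for key, _ in groupby(ordered)]

distinct_rows = sort_based_distinct(rows)
print(distinct_rows)
```

The sort is the expensive part. The wider the rows, the more data gets compared and moved around just to line the duplicates up.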

What Does It All Mean?

When selecting distinct rows, it can be beneficial to include a column that the optimizer can guarantee is unique in the set of selected columns. Think of a primary key, or another column with a uniqueness constraint on it.

Without that, you can end up doing a lot of extra work to create a distinct result set.
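Here's one way to see why a guaranteed-unique column helps. If every Posts row can only show up once, duplicates can only come from the join, so checking that a matching Votes row exists is enough. The sketch below uses SQLite through Python as a stand-in (not SQL Server), with tiny made-up tables that borrow the Posts and Votes names. It only checks that the two query shapes return the same rows; it says nothing about which plan any optimizer picks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (Id INTEGER PRIMARY KEY, OwnerUserId INTEGER, Score INTEGER);
CREATE TABLE Votes (Id INTEGER PRIMARY KEY, PostId INTEGER);
-- Posts 1 and 2 have several votes each; post 3 has none.
INSERT INTO Posts VALUES (1, 10, 5), (2, 10, 5), (3, 20, 1);
INSERT INTO Votes VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 2);
""")

# DISTINCT over a join, with the unique Id column in the select list.
with_distinct = conn.execute("""
    SELECT DISTINCT p.Id, p.OwnerUserId, p.Score
    FROM Posts AS p
    JOIN Votes AS v ON v.PostId = p.Id
    ORDER BY p.Id
""").fetchall()

# The same rows from an existence check, with no duplicate removal at all.
semi_join = conn.execute("""
    SELECT p.Id, p.OwnerUserId, p.Score
    FROM Posts AS p
    WHERE EXISTS (SELECT 1 FROM Votes AS v WHERE v.PostId = p.Id)
    ORDER BY p.Id
""").fetchall()

print(with_distinct == semi_join)
```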

Of course, there are times when that changes the logic of the query.
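A small sketch of how the logic can change, again using SQLite through Python as a stand-in with made-up data. Two posts share the same OwnerUserId and Score, so they stay separate rows when Id is selected and collapse into one row when it isn't.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (Id INTEGER PRIMARY KEY, OwnerUserId INTEGER, Score INTEGER);
CREATE TABLE Votes (Id INTEGER PRIMARY KEY, PostId INTEGER);
-- Posts 1 and 2 are identical apart from Id.
INSERT INTO Posts VALUES (1, 10, 5), (2, 10, 5), (3, 20, 1);
INSERT INTO Votes VALUES (1, 1), (2, 2), (3, 3);
""")

# With Id in the select list, every post is its own distinct row.
with_id = conn.execute("""
    SELECT DISTINCT p.Id, p.OwnerUserId, p.Score
    FROM Posts AS p
    JOIN Votes AS v ON v.PostId = p.Id
    ORDER BY p.Id
""").fetchall()

# Without Id, posts 1 and 2 are indistinguishable and collapse into one row.
without_id = conn.execute("""
    SELECT DISTINCT p.OwnerUserId, p.Score
    FROM Posts AS p
    JOIN Votes AS v ON v.PostId = p.Id
    ORDER BY p.OwnerUserId, p.Score
""").fetchall()

print(len(with_id), len(without_id))
```

If the row counts differ like this on your data, adding the unique column is a different query, not a free tuning trick.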

Thanks for reading!
