When I was checking out early builds of SQL Server 2019, I noticed a new DMV called dm_db_missing_index_group_stats_query, that I thought was pretty cool.
It helped you tie missing index requests to the queries that requested them. Previously, that took a whole lot of heroic effort, or luck.
With this new DMV, it’s possible to combine queries that look for missing indexes with queries that look for tuning opportunities in the plan cache or in Query Store.
It seems to tie back to dm_db_missing_index_groups, on the index_group_handle column in this DMV joined to the group handle column in the new DMV.
If you’re wondering why I’m not giving you any code samples here, it’s because I’m going to get some stuff built into sp_BlitzIndex to take advantage of it, now that it’s documented.
Special thanks to William Assaf (b|t) for helping to get this done.
Plus, you get a whole lot more information about what went on with all the different query metrics. As far as a source of “what happened and when”, it’s much more reliable. It’s almost like having a monitoring tool in SQL Server.
Do you trust your plan cache to tell you what was slow more than a couple days ago?
One really big bonus is that it’s portable, and it’s per-database. You can backup or clone a database so someone else can easily look at it somewhere else.
On top of that, it’s more usable: there are built in reports, a graphical interface, settings you can tweak, and plans you can force with a couple clicks.
You could solve a lot of frustrations by moving things here.
A nice to have would be predicate normalization for identical and morally equivalent queries, with totals and averages for resource usage, so you could group things together. Especially for morally equivalent queries, this would allow plan forcing where the queries get different ids.
Some monitoring tools have this in their interface, and the number of times I’ve been able to track down outliers within a group of executions to solve a problem is, well, at least memorable to me.
Back when I first wrote sp_BlitzQueryStore, I was totally enamored with Query Store.
Like the plan cache, but better. History. Different plans for the same query. Not disturbed by memory pressure or restarts.
Then I waited patiently to find a client on 2016 using it.
And waited, and waited, and waited.
And finally, some came along.
Slow Pokes And No Pokes
When I ran it, it took forever. Not even the XML part. The XML part was fast.
Gathering the initial set of data was slow.
With some time to experiment and dig in, I found that the IN_MEM tables cause significant performance issues when:
Query Store is actively logging data
Query Store is > 25 MB or so
Yes, children, in memory tables can be slow, too.
The Problem
Let’s take a couple simple queries against Query Store tables:
SELECT TOP 10 *
FROM sys.query_store_runtime_stats AS qsrs
WHERE qsrs.avg_cpu_time >= 500000
AND qsrs.last_execution_time >= DATEADD(DAY, -1, GETDATE())
ORDER BY qsrs.avg_cpu_time DESC;
SELECT TOP 10 *
FROM sys.query_store_plan AS qsp
WHERE qsp.query_plan IS NOT NULL
AND qsp.last_execution_time >= DATEADD(DAY, -1, GETDATE())
ORDER BY qsp.last_execution_time DESC;
The first query runs for 10 seconds, with the entirety of the time spent filtering data out of the IN_MEM table:
Ho hum.
The second query is even worse, at nearly 2 minutes:
Filtering on the 1Fingerling on the floor
“Unrealistic”
I know, this configuration is probably unsupported because I used SELECT * or something.
I wrote this query hoping to quickly get the worst plans by a specific metric.
WITH the_pits
AS
(
SELECT TOP ( 101 )
qsrs.plan_id,
qsp.query_id,
qsrs.avg_duration / 100000. AS avg_duration_s,
qsrs.avg_cpu_time / 100000. AS avg_cpu_time_s,
qsrs.avg_query_max_used_memory,
qsrs.avg_logical_io_reads,
qsrs.avg_logical_io_writes,
qsrs.avg_tempdb_space_used,
qsrs.last_execution_time,
/*
You can stick any of the above metrics in here to
find offenders by different resource abuse
*/
MAX(qsrs.avg_cpu_time) OVER
(
PARTITION BY
qsp.query_id
ORDER BY
qsp.query_id
ROWS UNBOUNDED PRECEDING
) AS n
FROM sys.query_store_runtime_stats AS qsrs
JOIN sys.query_store_plan AS qsp
ON qsp.plan_id = qsrs.plan_id
WHERE qsrs.avg_duration >= ( 5000. * 1000. )
AND qsrs.avg_cpu_time >= ( 1000. * 1000. )
AND qsrs.last_execution_time >= DATEADD(DAY, -7, GETDATE())
AND qsp.query_plan IS NOT NULL
/*
Don't forget to change this to same thing!
*/
ORDER BY qsrs.avg_cpu_time DESC
)
SELECT p.plan_id,
p.query_id,
p.avg_duration_s,
p.avg_cpu_time_s,
p.avg_query_max_used_memory,
p.avg_logical_io_reads,
p.avg_logical_io_writes,
p.avg_tempdb_space_used,
p.last_execution_time,
qsqt.query_sql_text,
TRY_CONVERT(XML, qsp.query_plan) AS query_plan
FROM sys.query_store_plan AS qsp
JOIN the_pits AS p
ON p.plan_id = qsp.plan_id
JOIN sys.query_store_query AS qsq
ON qsq.query_id = qsp.query_id
JOIN sys.query_store_query_text AS qsqt
ON qsq.query_text_id = qsqt.query_text_id
ORDER BY p.n DESC;
It works pretty well. Sometimes.
Other times, it runs for 4.5 minutes.
I know what you’re thinking: “Erik, you’re doing all sorts of crazy stuff in there. You’re making it slow.”
But none of the crazy stuff I’m doing is where the slowdown is.
It’s all in the same stuff I pointed out in the simpler queries.
12.5 seconds…FOUR MINUTES
Testing, testing
I can’t stress how much I want Query Store to be successful. I absolutely love the idea.
But it just wasn’t implemented very well. Simple filtering against the data takes forever.
And yes, you can have NULL query plans for some reason. That’s rich.