Why Plan Cache Scripts Can Be Misleading

A Long Way From Home


Most scripts (even ones I’ve worked on!) that look at the plan cache, have had the ability to sort it by different metrics: CPU, reads, writes, duration, etc.

A lot of people are very interested in long running queries — and I am too!

Heck, they’re how I make money. Blogging pays like crap.

But there’s a slight problem with only looking at query duration.

Fake Locks


Let’s say I have a stored procedure that looks like this:

CREATE OR ALTER PROCEDURE dbo.dblock (@id INT)
AS
BEGIN
    BEGIN TRAN
        UPDATE u
        SET u.Reputation = 2147483647 
        FROM dbo.Users AS u
        WHERE u.Id = @id
        
        WAITFOR DELAY '00:00:07.000'
    ROLLBACK
END;
GO

It’s there to simulate a long running modification query. You know, the kind of thing that’ll lead to blocking.

The kind of blocking that can make other queries feel like they take a long time.

Street Team


Here’s another stored procedure:

CREATE OR ALTER PROCEDURE dbo.the_lox (@id INT)
AS
BEGIN
    SELECT u.Id
    FROM dbo.Users AS u
    WHERE u.Id = @id
END;
GO

This will finish instantly. The Id column of the Users table is the Primary Key, as well as the Clustered Index.

There is literally nothing to tune here.

But it may look like it, sometimes…

D’evils


If I run these in different windows, the lox will be blocked by dblock.

--One Window
EXEC dbo.dblock @id = 22656;
GO 

--Other window
EXEC dbo.the_lox @id = 22656;
GO

If you were wondering why I had a 7 second wait, it’s because it generally takes me two seconds to flip windows and run a query.

When the situation resolves, this is what the metrics look like:

SELECT OBJECT_NAME(deps.object_id) AS stored_procedure_name,
       ( deps.last_worker_time / 1000. ) AS cpu_time,
       ( deps.last_elapsed_time / 1000. ) AS run_time
FROM   sys.dm_exec_procedure_stats AS deps
WHERE  deps.object_id = OBJECT_ID('dbo.the_lox');
GO
Everything You Got

The query barely used any CPU, but it ran for 5 seconds.

If you order your plan cache by elapsed time, you might see blameless queries up at the top.

There are a lot of reasons why they could end up there, and blocking is one of them.

Unfortunately, there’s currently nothing to tell you why. If you’re just getting started with query tuning, this could look befuddling.

Are Averages Better?


Sometimes it helps to look at what runs for a long time on average — you can figure that out by looking at the number of executions.

That can make better sense of things, but not if a query has only run once, or if it gets blocked a lot.

It may make more sense to factor in metrics that represent physical work, like CPU, reads, or writes, rather than just relying purely on wall clock time.

Thanks for reading!



2 thoughts on “Why Plan Cache Scripts Can Be Misleading

  1. this is where using a Median function is a lot better than using Average. A few out-liers will not skew the result.

Leave a Reply

Your email address will not be published. Required fields are marked *