Columnstore Precon Details

I thought it could be helpful to go into more detail for what I plan to present at the columnstore precon that’s part of SQL Saturday New York City. Note that everything here is subject to change. I tried to include all of the main topics planned at this time. If you’re attending and really want to see something covered, let me know and I’ll be as transparent as possible about what I can and cannot do.

  • Part 1: Creating your table
    • Definitions
    • Delta rowgroups
    • How columnstore compression works
    • What can go wrong with compression
    • Picking the right data types
    • Deciding on partitioning
    • Indexing
    • When should you try columnstore?
  • Part 2: Loading data quickly
    • Serial inserts
    • Parallel inserts
    • Inserts into partitioned tables
    • Inserts to preserve order
    • Deleting rows
    • Updating rows
    • Better alternatives
    • Trickle insert – this is a maybe
    • Snapshot isolation and ADR
    • Loading data on large servers
  • Part 3: Querying your data
    • The value of maintenance
    • The value of patching
    • How I read execution plans
    • Columnstore/batch mode gotchas with DMVs and execution plans
    • Columnstore features
    • Batch mode topics
    • When is batch mode slower than row mode?
    • Batch mode on rowstore
    • Bad ideas with columnstore
  • Part 4: Maintaining your data
    • Misconceptions about columnstore
    • Deciding what’s important
    • Evaluating popular maintenance solutions
    • REORG
    • REBUILD
    • The tuple mover
    • Better alternatives

I hope to see a few of you there!

Join Joe Obbish In NYC On October 4th To Learn Everything About Columnstore

Second To None


Prior to this year’s SQL Saturday in NYC, we’re running one very special precon, with Joe Obbish:

Joe Obbish – Clustered Columnstore For Performance

Buy Tickets Here!

Clustered columnstore indexes can be a great solution for data warehouse workloads, but there’s not a lot of advanced training or detailed documentation out there. It’s easy to feel all alone when you want a second opinion or run into a problem with query performance or data loading that you don’t know how to solve.

In this full day session, I’ll teach you the most important things I know about clustered columnstore indexes. Specifically, I’ll teach you how to make the right choices with your schema, data loads, query tuning, and columnstore maintenance. All of these lessons have been learned the hard way with 4 TB of production data on large, 96+ core servers. Material is applicable from SQL Server 2016 through 2019.

Here’s what I’ll be talking about:

– How columnstore compression works and tips for picking the right data types
– Loading columnstore data quickly, especially on large servers
– Improving query performance on columnstore tables
– Maintaining your columnstore tables

This is an advanced level session. To get the most out of the material, attendees should have some practical experience with columnstore and query tuning, and a solid understanding of internals such as wait stats analysis. You don’t need to bring a laptop to follow along.

Buy Tickets Here!

Wanna save 25%? Use coupon code “actionjoe” at checkout — it’s good for the first 10 seats, so hurry up and get yours today.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

The Hardest Part Of SQL Server Query Tuning: Getting The Right Results

Grey Matter


Often when query tuning, I’ll try a change that I think makes sense, only to have it backfire.

It’s not that the query got slower, it’s that the results that came back were different.

Now, this can totally happen because of a bug in previously used logic, but that’s somewhat rare.

And different results make testers nervous. Especially in production.

Here’s a Very Cheeky™ example.

Spread’em


This is my starting query. If I run it enough times, I’ll get a billion missing index requests.

WITH topusers AS
(
    SELECT   TOP (1)
             u.Id, u.DisplayName
    FROM     dbo.Users AS u
    ORDER BY u.Reputation DESC
)
SELECT   u.Id,
         u.DisplayName,
         SUM(p.Score * 1.0) AS PostScore,
         SUM(c.Score * 1.0) AS CommentScore,
         COUNT_BIG(*) AS CountForSomeReason
FROM     topusers AS u
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = u.Id
JOIN     dbo.Comments AS c
    ON c.UserId = u.Id
WHERE    p.Score >= 5
AND      c.Score >= 1
GROUP BY u.Id, u.DisplayName;

For the sake of argument, I’ll add them all. Here they are:

CREATE INDEX ix_tabs
    ON dbo.Users ( Reputation DESC, Id )
    INCLUDE ( DisplayName );

CREATE INDEX ix_spaces
    ON dbo.Users ( Id, Reputation DESC )
    INCLUDE ( DisplayName );

CREATE INDEX ix_coke 
    ON dbo.Comments ( Score) INCLUDE( UserId );

CREATE INDEX ix_pepsi
    ON dbo.Posts ( Score ) INCLUDE( OwnerUserId );

CREATE NONCLUSTERED INDEX ix_tastes_great
    ON dbo.Posts ( OwnerUserId, Score );

CREATE NONCLUSTERED INDEX ix_less_filling
    ON dbo.Comments ( UserId, Score );

With all those indexes, the query is still dog slow.

Maybe It’s Me


I’ll take my own advice. Let’s break the query up a little bit.

DROP TABLE IF EXISTS #topusers;

WITH topusers AS
(
    SELECT   TOP (1)
             u.Id, u.DisplayName
    FROM     dbo.Users AS u
    ORDER BY u.Reputation DESC
)
SELECT *
INTO #topusers
FROM topusers;

CREATE UNIQUE CLUSTERED INDEX ix_whatever 
    ON #topusers(Id);

SELECT   u.Id,
         u.DisplayName,
         SUM(p.Score * 1.0) AS PostScore,
         SUM(c.Score * 1.0) AS CommentScore,
         COUNT_BIG(*) AS CountForSomeReason
FROM     #topusers AS u
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = u.Id
JOIN     dbo.Comments AS c
    ON c.UserId = u.Id
WHERE    p.Score >= 5
AND      c.Score >= 1
GROUP BY u.Id, u.DisplayName;

Still dog slow.

Variability


Alright, I’m desperate now. Let’s try this.

DECLARE @Id INT, 
        @DisplayName NVARCHAR(40);

SELECT   TOP (1)
         @Id = u.Id, 
         @DisplayName = u.DisplayName
FROM     dbo.Users AS u
ORDER BY u.Reputation DESC;


SELECT   @Id AS Id,
         @DisplayName AS DisplayName,
         SUM(p.Score * 1.0) AS PostScore,
         SUM(c.Score * 1.0) AS CommentScore,
         COUNT_BIG(*) AS CountForSomeReason
FROM dbo.Posts AS p 
JOIN  dbo.Comments AS c 
    ON c.UserId = p.OwnerUserId
WHERE    p.Score >= 5
AND      c.Score >= 1
AND      (c.UserId = @Id OR @Id IS NULL)
AND      (p.OwnerUserId = @Id OR @Id IS NULL);

Let’s get some worst practices involved. That always goes well.

Except here.

Getting the right results seemed like it was destined to be slow.

Differently Resulted


At this point, I tried several rewrites that were fast, but wrong.

What I had missed, and what Joe Obbish pointed out to me, is that I needed a cross join and some math to make it all work out.

WITH topusers AS
(
    SELECT   TOP (1)
             u.Id, u.DisplayName
    FROM     dbo.Users AS u
    ORDER BY u.Reputation DESC
)
SELECT     t.Id AS Id,
           t.DisplayName AS DisplayName,
           p_u.PostScoreSub * c_u.CountCSub AS PostScore,
           c_u.CommentScoreSub * p_u.CountPSub AS CommentScore,
           c_u.CountCSub * p_u.CountPSub AS CountForSomeReason
FROM       topusers AS t
JOIN       (   SELECT   p.OwnerUserId, 
                        SUM(p.Score * 1.0) AS PostScoreSub, 
                        COUNT_BIG(*) AS CountPSub
               FROM     dbo.Posts AS p
               WHERE    p.Score >= 5
               GROUP BY p.OwnerUserId ) AS p_u
    ON p_u.OwnerUserId = t.Id
CROSS JOIN (   SELECT   c.UserId, 
                        SUM(c.Score * 1.0) AS CommentScoreSub, 
                        COUNT_BIG(*) AS CountCSub
               FROM     dbo.Comments AS c
               WHERE    c.Score >= 1
               GROUP BY c.UserId ) AS c_u
WHERE      c_u.UserId = t.Id;

This finishes instantly, with the correct results.

The value of a college education!

Realizations and Slowness


After thinking about Joe’s rewrite, I had a terrible thought.

All the rewrites that were correct but slow had gone parallel.

“Parallel”

Allow me to illustrate.

[Image: SQL Server query plan, captioned "In a row?"]

Repartition Streams usually does the opposite.

But here, it puts all the rows on a single thread.

“For correctness”

Which ends up in a 236 million row parallel-but-single-threaded-cross-hash-join.

Summary Gates Are On My Side


Which, of course, is nicely summarized by P. White.

SQL Server uses the correct join (inner or outer) and adds projections where necessary to honour all the semantics of the original query when performing internal translations between apply and join.

The differences in the plans can all be explained by the different semantics of aggregates with and without a group by clause in SQL Server.
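To see those semantics in isolation, here’s a tiny sketch against a throwaway temp table (the table is purely illustrative): a scalar aggregate returns a row even when there’s no input, while a grouped aggregate returns nothing. That difference is what the optimizer has to preserve when it rewrites the query.

-- Throwaway table, just to show scalar vs. vector aggregate semantics.
CREATE TABLE #no_rows (Score INT NULL);

-- No GROUP BY: always returns exactly one row, even with no input (0, NULL).
SELECT COUNT_BIG(*) AS CountForSomeReason,
       SUM(Score * 1.0) AS PostScore
FROM #no_rows;

-- With GROUP BY: returns zero rows for empty input.
SELECT COUNT_BIG(*) AS CountForSomeReason,
       SUM(Score * 1.0) AS PostScore
FROM #no_rows
GROUP BY Score;

DROP TABLE #no_rows;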

What’s amazing and frustrating about the optimizer is that it considers all sorts of different ways to rewrite your query.

In milliseconds.

It may have even thought about a plan that would have been very fast.

But we ended up with this one, because it looked cheap.

Untuneable


The plan for Joe’s version of the query is amazingly simple.

[Image: SQL Server query plan, captioned "Bruddah."]

Sometimes giving the optimizer a different query to work with helps, and sometimes it doesn’t.

Rewriting queries is tough business. When you change things and still get the same plan, it can be really frustrating.

Just know that behind the scenes the optimizer is working hard to rewrite your queries, too.

If you really want to change the execution plan you end up with, you need to present the logic to the optimizer in different ways, and often with different indexes to use.

Other times, you just gotta ask Joe.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Thoughts On Setting MAXDOP Correctly In SQL Server

Microsoft recently published new guidance on setting server level MAXDOP. I hope to help the community by analyzing the new guidance and offering some of my own thoughts on query parallelism.
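For reference, by server level MAXDOP I mean the sp_configure setting. Here’s a minimal sketch of changing it; the value 8 is just a placeholder, not a recommendation:

-- 'max degree of parallelism' is an advanced option, so expose it first.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

EXEC sys.sp_configure N'max degree of parallelism', 8;
RECONFIGURE;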

Line by line


Documentation is meant to be shared after all, so hopefully no one minds if I quote most of it:

Starting with SQL Server 2016 (13.x), during service startup if the Database Engine detects more than eight physical cores per NUMA node or socket at startup, soft-NUMA nodes are created automatically by default. The Database Engine places logical processors from the same physical core into different soft-NUMA nodes.

This is true and one of the bigger benefits of auto soft-NUMA as far as I’ve been able to tell.

The recommendations in the table below are aimed at keeping all the worker threads of a parallel query within the same soft-NUMA node.

SQL Server is not designed to keep all worker threads in a single soft-NUMA node. That might have been true in SQL Server 2008, but it changed in 2012. The only semi-official documentation that I know of is here and I looked into the behavior here. Read through both if you’re interested in how scheduling of parallel worker threads is performed by SQL Server, but I’ll provide a quick summary via example here.

Suppose you have two soft-NUMA nodes of 6 schedulers each and the server just restarted. NUMA node 0 has positions 0-5 and NUMA node 1 has positions 6-11. The global enumerator starts at position 0. If I run a MAXDOP 4 query then the enumerator advances by 4. The parallel workers are allowed in positions 0-3 which means that any four out of six schedulers can be chosen from NUMA node 0. All parallel worker threads are in NUMA node 0 for the first query. Suppose I run another MAXDOP 4 query. The enumerator advances by 4 and the allowed positions are 4-7. That means that any two schedulers can be chosen from NUMA node 0 and any two schedulers can be chosen from NUMA node 1. The worker threads are split over two soft-NUMA nodes even though query MAXDOP is less than the size of the soft-NUMA nodes.

Unless you’re on a server with a single soft-NUMA node it is difficult to guarantee that all worker threads end up on the same soft-NUMA node. I strongly recommend against aiming for that as a goal. There are more details in the “Preventing hard NUMA worker splits” section of this blog post.
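If you want to see how schedulers are grouped into soft-NUMA nodes on your own instance, here’s a quick sketch against sys.dm_os_schedulers:

-- Count visible online schedulers per (soft-)NUMA node.
SELECT parent_node_id,
       COUNT(*) AS schedulers_in_node
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;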

This will improve the performance of the queries and distribution of worker threads across the NUMA nodes for the workload. For more information, see Soft-NUMA.

I’ve heard some folks claim that keeping all parallel workers on a single hard NUMA node can be important for query performance. I’ve even seen some queries experience reduced performance when thread 0 is on a different hard NUMA node than parallel worker threads. I haven’t heard anything about the importance of keeping all of a query’s worker threads on a single soft-NUMA node. It doesn’t really make sense to say that query performance will be improved if all worker threads are on the same soft-NUMA node. Soft-NUMA is a configuration setting. Suppose I have a 24 core hard NUMA node and my goal is to get all of a parallel query’s worker threads on a single soft-NUMA node. To accomplish that goal the best strategy is to disable auto soft-NUMA because that will give me a NUMA node size of 24 as opposed to 8. So disabling auto soft-NUMA will increase query performance?

Starting with SQL Server 2016 (13.x), use the following guidelines when you configure the max degree of parallelism server configuration value:

Server with single NUMA node [and] Less than or equal to 8 logical processors: Keep MAXDOP at or below # of logical processors

I don’t understand this guidance at all. If MAXDOP is set above the number of logical processors then the total number of logical processors is used. This is even mentioned earlier on the same page of documentation. This line is functionally equivalent to “Set MAXDOP to whatever you want”.

Server with single NUMA node [and] Greater than 8 logical processors: Keep MAXDOP at 8

This configuration is only possible with a physical core count between 5 and 8 and with hyperthreading enabled. Setting MAXDOP above the physical core count isn’t recommended by some folks, but I suppose there could be some scenarios where it makes sense. Keeping MAXDOP at 8 isn’t bad advice for many queries on a large enough server, but the documentation is only talking about small servers here.

Server with multiple NUMA nodes [and] Less than or equal to 16 logical processors per NUMA node: Keep MAXDOP at or below # of logical processors per NUMA node

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node, so this covers all server configurations with more than 8 physical cores. Soft-NUMA scheduler counts per node can range from 4 to 16. If you accept this advice then in some scenarios you’ll need to lower MAXDOP as you increase the number of physical cores per socket. For example, if I have 24 schedulers per socket without hyperthreading then auto soft-NUMA gives me three NUMA nodes of 8 schedulers, so I might set MAXDOP to 8. But if the scheduler count is increased to 25, 26, or 27 then I’ll have at least one soft-NUMA node of 6 schedulers. So I should lower MAXDOP from 8 to 6 because the physical core count of the socket increased?

Server with multiple NUMA nodes [and] Greater than 16 logical processors per NUMA node: Keep MAXDOP at half the number of logical processors per NUMA node with a MAX value of 16

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node. I believe that this is impossible. At the very least, if it is possible I can tell you that it’s rare. This feels like an error in the documentation. Perhaps they were going for some kind of hyperthreading adjustment?

NUMA node in the above table refers to soft-NUMA nodes automatically created by SQL Server 2016 (13.x) and higher versions.

I suspect that this is a mistake and that some “NUMA node” references are supposed to refer to hard NUMA. It’s difficult to tell.

Use these same guidelines when you set the max degree of parallelism option for Resource Governor workload groups.

There are two benefits to using MAXDOP at the Resource Governor workload group level. The first benefit is that it allows different workloads to have different MAXDOP without changing lots of application code. The guidance here doesn’t allow for that benefit. The second benefit is that it acts as a hard limit on query MAXDOP as opposed to the soft limit provided with server level MAXDOP. It may also be useful to know that the query optimizer takes server level MAXDOP into account when creating a plan. It does not do so for MAXDOP set via Resource Governor.
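As a sketch of what the workload group level option looks like (the group name is invented, and you would still need a classifier function to route sessions into the group):

-- A workload group MAXDOP cap acts as a hard limit on query DOP.
CREATE WORKLOAD GROUP reporting_queries
WITH (MAX_DOP = 8);

ALTER RESOURCE GOVERNOR RECONFIGURE;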

I haven’t seen enough different types of workloads in action to provide generic MAXDOP guidance, but I can share some of the issues that can occur with query parallelism being too low or too high.

What are some of the problems with setting MAXDOP too low?


  1. Better query performance may be achieved with a higher MAXDOP. For example, a well-written MAXDOP 8 query on a quiet server may simply run eight times as quickly as the MAXDOP 1 version. In some scenarios this is highly desired behavior.
  2. There may not be enough concurrent queries to get full value out of the server’s hardware without increasing query MAXDOP. Unused schedulers can be a problem for batch workloads that aim to get a large, fixed amount of work done as quickly as possible.
  3. Row mode bitmap operators associated with hash joins and merge joins only execute in parallel plans. MAXDOP 1 query plans lose out on this optimization.

What are some of the problems with setting MAXDOP too high?


  1. At some point, throwing more and more parallel queries at a server will only slow things down. Imagine adding more and more cars to an already gridlocked traffic situation. Depending on the workload you may not want to have many active workers per scheduler.
  2. It is possible to run out of worker threads with many concurrent parallel queries that have many parallel branches each. For example, a MAXDOP 8 query with 20 branches will ask for 160 parallel workers. When this happens parallel queries can get downgraded all the way to MAXDOP 1.
  3. Row mode exchange operators need to move rows between threads and do not scale well with increased query MAXDOP.
  4. Some types of row mode exchange operators evenly divide work among all parallel worker threads. This can degrade query performance if even one worker thread is on a busy scheduler. Consider a server with 8 schedulers. Scheduler 0 has two active workers and all other schedulers have no workers. Suppose there is 40 seconds of CPU work to do, the query scales with MAXDOP perfectly, and work is evenly distributed to worker threads. A MAXDOP 4 query can be expected to run in 40/4 = 10 seconds since SQL Server is likely to pick four of the seven less busy schedulers. However, a MAXDOP 8 query must put one of the worker threads on scheduler 0. The work on schedulers 1 – 7 will finish in 40/8 = 5 seconds but the worker thread on scheduler 0 has to yield to the other worker threads. It may take 5 * 3 = 15 seconds if CPU is shared evenly, so in this example increasing MAXDOP from 4 to 8 increases query run time from 10 seconds to 15 seconds.
  5. The query memory grant for parallel inserts into columnstore indexes increases with MAXDOP. If MAXDOP is too high then memory pressure can occur during compression and the SELECT part of the query may be starved for memory.
  6. The query memory grant for memory-consuming operators on the inner side of a nested loop is often not increased with MAXDOP even though the operator may execute concurrently once on each worker thread. In some uncommon query patterns, increasing MAXDOP will increase the amount of data spilled to tempdb.
  7. Increasing MAXDOP increases the number of queries that will have parallel workers spread across multiple hard NUMA nodes. If MAXDOP is greater than the number of schedulers in a hard NUMA node then the query is guaranteed to have split workers. This can degrade query performance for some types of queries.
  8. Worker threads may need to wait on some type of shared resource. Increasing MAXDOP can increase contention without improving query performance. For example, there’s nothing stopping me from running a MAXDOP 100 SELECT INTO, but I certainly do not get 100X the performance of a MAXDOP 1 query. The problem with the below query is the NESTING_TRANSACTION_FULL latch:

[Image: SQL Server query plan properties]
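As a rough sketch of the shape of statement being described (object names are invented, and this is not the exact query from the screenshot):

-- Deliberately over-parallelized SELECT INTO; the MAXDOP hint is the point,
-- not the table names.
SELECT *
INTO dbo.select_into_target
FROM dbo.some_large_table
OPTION (MAXDOP 100);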

Preventing hard NUMA worker splits


It generally isn’t possible to prevent worker splits over hard NUMA nodes without changing more than server level and query level MAXDOP. Consider a server with 2 hard NUMA nodes of 10 schedulers each. To avoid a worker split, an administrator might try setting server level MAXDOP to 10, with the idea being that each parallel query spreads its workers over NUMA node 0 or NUMA node 1. This plan won’t work if any of the following occur:

  • Any query runs with a query level MAXDOP hint other than 0, 1, 10, or 20.
  • Any query is downgraded in MAXDOP but still runs in parallel.
  • A parallel stats update happens. The last time I checked these run with a query level MAXDOP hint of 16.
  • Something else unexpected happens.

In all cases the enumerator will be shifted and any MAXDOP 10 queries that run after will split their workers. TF 2467 can help, but it needs to be carefully tested with the workload. With the trace flag, as long as MAXDOP <= 10 and automatic soft-NUMA is disabled then the parallel workers will be sent to a single NUMA node based on load. Note that the execution context 0 thread can still be on a different hard NUMA node. If you want to prevent that then you can try Resource Governor CPU affinity at the Resource Pool level. Create one pool for NUMA node 0 and one pool for NUMA node 1. You may experience interesting consequences when doing that.

The most reliable method by far is to have a single hard NUMA node, so if you have a VM that fits into a single socket of a VM host and you care about performance then ask your friendly VM administrator for some special treatment.

Final thoughts


I acknowledge that it’s difficult to create MAXDOP guidance that works for all scenarios and I hope that Microsoft continues to try to improve their documentation on the subject. Thanks for reading!

SQL Server T-SQL Protip: watch those TOPs without ORDER BY

In the documentation for TOP, the following is listed as a best practice:

In a SELECT statement, always use an ORDER BY clause with the TOP clause. Because, it’s the only way to predictably indicate which rows are affected by TOP.
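Here’s a tiny sketch of why (a throwaway temp table, nothing to do with the real example below): without an ORDER BY, TOP is free to return whichever rows it likes.

-- Without ORDER BY, any row could come back; with ORDER BY, the result is deterministic.
CREATE TABLE #t (ID INT NOT NULL);
INSERT INTO #t VALUES (1), (2), (3);

SELECT TOP (1) ID FROM #t;               -- could be any of the three rows
SELECT TOP (1) ID FROM #t ORDER BY ID;   -- always returns 1

DROP TABLE #t;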

Let’s work through a real world example.

The good


One of the great things about the “Top Resource Consuming Queries” query store SSMS report is that it is always able to render the query plan, even for very complex queries. I’m not aware of a pure T-SQL solution that can avoid requiring the end user to save xml to files in all cases. The report nearly always takes a long time to run, so it’s easy to capture the T-SQL that powers the grid details version:

DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(distinct p.plan_id) >= 1
ORDER BY total_duration DESC;

Note the presence of the ORDER BY. I get exactly the results that I was expecting:

The bad


If I ask for extra details (who doesn’t want more details?), a significantly more complex query is generated:

-- grid format query with additional details
DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
With wait_stats AS
(
SELECT
    ws.plan_id plan_id,
    ws.execution_type,
    ROUND(CONVERT(float, SUM(ws.total_query_wait_time_ms)/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))*1,2) avg_query_wait_time,
    ROUND(CONVERT(float, SQRT( SUM(ws.stdev_query_wait_time_ms*ws.stdev_query_wait_time_ms*(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms)))*1,2) stdev_query_wait_time,
    CAST(ROUND(SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms),0) AS BIGINT) count_executions,
    MAX(itvl.end_time) last_execution_time,
    MIN(itvl.start_time) first_execution_time
FROM sys.query_store_wait_stats ws
    JOIN sys.query_store_runtime_stats_interval itvl ON itvl.runtime_stats_interval_id = ws.runtime_stats_interval_id
WHERE NOT (itvl.start_time > @interval_end_time OR itvl.end_time < @interval_start_time)
GROUP BY ws.plan_id, ws.runtime_stats_interval_id, ws.execution_type ),
top_wait_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(ws.avg_query_wait_time*ws.count_executions))*1,2) total_query_wait_time,
    SUM(ws.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM wait_stats ws
    JOIN sys.query_store_plan p ON p.plan_id = ws.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (ws.first_execution_time > @interval_end_time OR ws.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
),
top_other_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    ROUND(CONVERT(float, SUM(rs.avg_cpu_time*rs.count_executions))*0.001,2) total_cpu_time,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_reads*rs.count_executions))*8,2) total_logical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_writes*rs.count_executions))*8,2) total_logical_io_writes,
    ROUND(CONVERT(float, SUM(rs.avg_physical_io_reads*rs.count_executions))*8,2) total_physical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_clr_time*rs.count_executions))*0.001,2) total_clr_time,
    ROUND(CONVERT(float, SUM(rs.avg_dop*rs.count_executions))*1,0) total_dop,
    ROUND(CONVERT(float, SUM(rs.avg_query_max_used_memory*rs.count_executions))*8,2) total_query_max_used_memory,
    ROUND(CONVERT(float, SUM(rs.avg_rowcount*rs.count_executions))*1,0) total_rowcount,
    ROUND(CONVERT(float, SUM(rs.avg_log_bytes_used*rs.count_executions))*0.0009765625,2) total_log_bytes_used,
    ROUND(CONVERT(float, SUM(rs.avg_tempdb_space_used*rs.count_executions))*8,2) total_tempdb_space_used,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
)
SELECT TOP (@results_row_count)
    A.query_id query_id,
    A.object_id object_id,
    A.object_name object_name,
    A.query_sql_text query_sql_text,
    A.total_duration total_duration,
    A.total_cpu_time total_cpu_time,
    A.total_logical_io_reads total_logical_io_reads,
    A.total_logical_io_writes total_logical_io_writes,
    A.total_physical_io_reads total_physical_io_reads,
    A.total_clr_time total_clr_time,
    A.total_dop total_dop,
    A.total_query_max_used_memory total_query_max_used_memory,
    A.total_rowcount total_rowcount,
    A.total_log_bytes_used total_log_bytes_used,
    A.total_tempdb_space_used total_tempdb_space_used,
    ISNULL(B.total_query_wait_time,0) total_query_wait_time,
    A.count_executions count_executions,
    A.num_plans num_plans
FROM top_other_stats A LEFT JOIN top_wait_stats B on A.query_id = B.query_id and A.query_sql_text = B.query_sql_text and A.object_id = B.object_id
WHERE A.num_plans >= 1
ORDER BY total_duration DESC

Now we have not 1, not 2, but THREE TOP operators! But only one of them has an ORDER BY. The results are completely different, and are pretty much useless:

The ugly


This has nothing to do with TOP as far as I know, but I included it just for fun:

Final thoughts


All of you developers out there should watch your TOPs and make sure you’re using ORDER BY as needed. Otherwise, you might end up with annoyed end users writing blog posts about your code.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why is SYSDATETIME() slower than SYSUTCDATETIME() In SQL Server?

Consider the following code that calls SYSDATETIME() 10 million times in a loop:

GO

CREATE OR ALTER PROCEDURE #p_local AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @dummy DATETIME2(7), @loops INT = 0;

	WHILE @loops <= 10000000
	BEGIN
		SET @dummy = SYSDATETIME();

		SET @loops = @loops + 1;
	END;
END;

GO

EXEC #p_local;

On my machine the code takes about 11.6 seconds to execute. Replacing SYSDATETIME() with SYSUTCDATETIME() makes the code take only 4.3 seconds to execute. Why is SYSUTCDATETIME() so much faster than SYSDATETIME()?

It’s always a CPU problem


Both while loops drive a CPU core to 100% while executing. Within SQL Server, elapsed time nearly equals CPU time. For this code, I wouldn’t expect waits, latches, or spinlocks to reveal any clues. Query plans, execution stats, and other DMVs are unlikely to help either. One thing that could help is information about the internals of SYSDATETIME() and SYSUTCDATETIME(). There’s a little bit on books online:

SQL Server obtains the date and time values by using the GetSystemTimeAsFileTime() Windows API. The accuracy depends on the computer hardware and version of Windows on which the instance of SQL Server is running. The precision of this API is fixed at 100 nanoseconds. The accuracy can be determined by using the GetSystemTimeAdjustment() Windows API.

Most of the useful information is likely locked behind the vault at Microsoft. Fortunately, both SQL Server and the guest OS are happy to give us information about what SQL Server is using its CPU to do if we have the right public symbols and ask nicely. I know of four primary ways of doing this:

  1. A True Professional could use WinDbg (the cool kids pronounce it “wind bag”) or another debugger to step through the code of both functions. I don’t have the skills or patience to do that.
  2. DBCC STACKDUMP or sqldumper.exe can be used to create a filtered memory dump on demand. The problem with this approach is that it only provides a single snapshot and our functions execute very quickly.
  3. Extended events can provide callstacks for many events. I’m not aware of an extended event that would fire at the right times for this code.
  4. ETW tracing can provide a summary of call stacks during a sampled time period. Common choices are Windows Performance Recorder or PerfView.

Option 4 is exactly what we need. We can run a single query at a time that performs uniform work and we want to see where CPU time is spent during the entire query’s execution. My tool of choice is PerfView.

Revealing SQL Server’s secrets


I used PerfView’s default settings and collected ETW profile data separately during both while loops. After resolving symbols and changing folding to 0, there’s a clear difference between SYSDATETIME() (on the left) and SYSUTCDATETIME() (on the right):

kernelbase!GetTimeZoneInformation shows up with a high percentage for exclusive time for SYSDATETIME(). It does not show up at all for SYSUTCDATETIME(). In case it helps, here’s documentation for a few of the column names:

Exc – The amount of cost (msec of CPU time) that can be attributed to the particular method itself (not any of its callees)
Exc % – The exclusive cost expressed as a percentage of the total cost of all samples.
Inc – The cost associated with this node as well as all its children (callees) recursively. The inclusive cost of the ROOT contains all costs.
Inc % – The inclusive cost expressed as a percentage of the total cost of all samples (will be 100% for the ROOT node)

sqltses!CXVariant::GetSysDatetime sounds an awful lot like SYSDATETIME() to me. It also shows up in the right place in the stack to be a match. Drilling into the methods that are called by it:

Note that the call to kernelbase!GetTimeZoneInformation takes up nearly all CPU time. There’s also a call to kernelbase!GetSystemTimeAsFileTime that barely shows up as well.

Sharp-eyed readers might be wondering if SYSDATETIMEOFFSET() results in calls to sqltses!CDatetimeOffset::SetSystemDatetimeOffset. The answer is yes, but this is left as an exercise for the reader.

Again, sqltses!CXVariant::GetSysUTCDatetime sounds an awful lot like SYSUTCDATETIME() to me. Drilling into the methods that are called by it:

Note the low percentage of inclusive time for the call. SYSUTCDATETIME() is so cheap to call that most of the CPU time is spent executing the loop code instead of that function. kernelbase!GetTimeZoneInformation does not show up at all. There is still a call to kernelbase!GetSystemTimeAsFileTime that barely shows up.

This is probably an oversimplification, but I think it’s fair to say that each call to SYSDATETIME() needs to get time zone information. This results in a call to kernelbase!GetTimeZoneInformation that is relatively expensive. There is no need to call that method for SYSUTCDATETIME().
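A quick way to see the relationship between the functions is to run them side by side; local time is just UTC adjusted by the current time zone offset, which is where the extra work comes in:

-- Local time is UTC plus the time zone offset, which is why SYSDATETIME()
-- needs the extra GetTimeZoneInformation work.
SELECT SYSUTCDATETIME()    AS utc_time,
       SYSDATETIME()       AS local_time,
       SYSDATETIMEOFFSET() AS local_time_with_offset;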

Does this matter? It depends™


Database folks like answering questions with “it depends”. Regrettably, they don’t always provide sufficient or accurate details about what it depends on. I strive to not fall into that trap. For many queries the performance difference between the two functions won’t matter. Consider the following two batches:

DROP TABLE IF EXISTS #BENCHMARK_TIME_FUNCTIONS;

CREATE TABLE #BENCHMARK_TIME_FUNCTIONS (SAVED_DATETIME DATETIME2(7) NULL);

INSERT INTO #BENCHMARK_TIME_FUNCTIONS WITH (TABLOCK)
SELECT TOP (10000000) SYSDATETIME()
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

GO

DROP TABLE IF EXISTS #BENCHMARK_TIME_FUNCTIONS;

CREATE TABLE #BENCHMARK_TIME_FUNCTIONS (SAVED_DATETIME DATETIME2(7) NULL);

INSERT INTO #BENCHMARK_TIME_FUNCTIONS WITH (TABLOCK)
SELECT TOP (10000000) SYSUTCDATETIME()
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

SQL Server doesn’t need to execute the datetime functions once per row. If it did, we’d expect a run time difference of about 7 seconds between the two queries. Instead, they execute in nearly exactly the same amount of time. I think that the details about guarantees of this behavior were found on a Connect (RIP) comment.

The performance difference can make a difference when the function is used to profile code that executes row by row. Suppose that we have related SQL statements and we want to measure the execution time of just one of the statements by calculating the time difference before and after execution. I’m not saying that this is the best way of doing this analysis, but I suspect it to be a common technique. Below is an example stored procedure with one copy using SYSDATETIME() and the other using SYSUTCDATETIME():

GO

CREATE OR ALTER PROCEDURE #USE_SUPERIOR_AMERICAN_TIME AS
BEGIN
	DECLARE @start_time DATETIME2(7),
	@end_time DATETIME2(7),
	@total_microseconds BIGINT = 0,
	@loops INT = 0;

	SET NOCOUNT ON;

	DROP TABLE IF EXISTS dbo.OFFICIAL_STATEMENTS;
	CREATE TABLE dbo.OFFICIAL_STATEMENTS (OFFICIAL_STATEMENT VARCHAR(30));
	
	WHILE @loops <= 50000 BEGIN INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO COLLUSION');
		IF @@ROWCOUNT > 0	
		BEGIN
			SET @start_time = SYSDATETIME();

			INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO OBSTRUCTION');

			SET @end_time = SYSDATETIME();

			SET @total_microseconds = @total_microseconds + DATEDIFF(MICROSECOND, @start_time, @end_time);
		END;

		SET @loops = @loops + 1;
	END;

	SELECT @total_microseconds;
END;

GO

EXEC #USE_SUPERIOR_AMERICAN_TIME;

GO

CREATE OR ALTER PROCEDURE #USE_INFERIOR_EUROPEAN_TIME AS
BEGIN
	DECLARE @start_time DATETIME2(7),
	@end_time DATETIME2(7),
	@total_microseconds BIGINT = 0,
	@loops INT = 0;

	SET NOCOUNT ON;

	DROP TABLE IF EXISTS dbo.OFFICIAL_STATEMENTS;
	CREATE TABLE dbo.OFFICIAL_STATEMENTS (OFFICIAL_STATEMENT VARCHAR(30));
	
	WHILE @loops <= 50000 BEGIN INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO COLLUSION');
		IF @@ROWCOUNT > 0	
		BEGIN
			SET @start_time = SYSUTCDATETIME();

			INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO OBSTRUCTION');

			SET @end_time = SYSUTCDATETIME();

			SET @total_microseconds = @total_microseconds + DATEDIFF(MICROSECOND, @start_time, @end_time);
		END;

		SET @loops = @loops + 1;
	END;

	SELECT @total_microseconds;
END;

GO

EXEC #USE_INFERIOR_EUROPEAN_TIME;

The second insert query takes a total of 4381219 microseconds with SYSDATETIME() and 4164658 microseconds with SYSUTCDATETIME(). That’s a 5% difference just from changing the datetime function. Of course, given that this code is measuring time differences, SYSUTCDATETIME() is the correct choice because it remains accurate even if there’s a daylight savings switch during code execution. My own approach is to avoid running one-off stored procedures around 2:00 AM.

Final thoughts


There’s a large performance difference between SYSDATETIME() and SYSUTCDATETIME(). It isn’t necessary to guess which one is slower and why. ETW tracing tools such as PerfView can reveal exactly where the performance difference comes from. Thanks for reading!

Breaking Columnstore Delta Rowgroups

The documentation on delta rowgroups says:

A delta rowgroup is a clustered index that’s used only with columnstore indexes. It improves columnstore compression and performance by storing rows until the number of rows reaches a threshold and are then moved into the columnstore.

This clustered index isn’t chosen by the person who creates the table. It’s a hidden internal column automatically added to delta rowgroups. Perhaps the implementation is similar to that of table spools:

The worktable is structured as if it was defined with a clustered index on zero columns. This means that a 4-byte uniqueifier is added to each row stored except the first.

You know how programmers are: they love reusing old code. If a similar implementation is used for delta rowgroups then it should be possible to see SQL Server errors with the right pattern of data loading. More investigation is required.

Defining the rules


It might be possible to find evidence of this clustered index by using DBCC PAGE. In the T-SQL code below, I create a table with a clustered columnstore index, insert 500 rows, delete 499 rows, insert 500 more rows, and delete 499 rows again:

CREATE TYPE dbo.SEAN_GALLARDY_INT FROM SMALLINT NOT NULL;

DROP TABLE IF EXISTS dbo.view_hidden_clustered_index;

CREATE TABLE dbo.view_hidden_clustered_index (
	ID SEAN_GALLARDY_INT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

GO

CREATE OR ALTER PROCEDURE #p AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @value_to_insert INT = 1;

	WHILE @value_to_insert <= 500
	BEGIN
		INSERT INTO dbo.view_hidden_clustered_index VALUES (@value_to_insert);

		SET @value_to_insert = @value_to_insert + 1;
	END;

	DELETE FROM dbo.view_hidden_clustered_index
	WHERE ID < 500;

	WHILE @value_to_insert <= 1000
	BEGIN
		INSERT INTO dbo.view_hidden_clustered_index VALUES (@value_to_insert);

		SET @value_to_insert = @value_to_insert + 1;
	END;

	DELETE FROM dbo.view_hidden_clustered_index
	WHERE ID > 500 AND ID < 1000;
END;

GO

EXEC #p;

There’s only a single page that stores the 2 rows currently held by the table. This can be viewed with the undocumented DBCC PAGE, TF 3604, and the undocumented sys.dm_db_database_page_allocations:

DECLARE @file_id SEAN_GALLARDY_INT;
DECLARE @page_id INT;

SELECT @file_id = allocated_page_file_id, @page_id = allocated_page_page_id
FROM sys.dm_db_database_page_allocations 
    (DB_ID(), OBJECT_ID('dbo.view_hidden_clustered_index'),NULL, NULL, 'DETAILED')
WHERE is_allocated = 1 AND allocation_unit_type = 1 AND is_iam_page = 0 and page_type = 1;

DBCC TRACEON(3604);
DBCC PAGE('TEST',@file_id,@page_id,3) WITH TABLERESULTS;
DBCC TRACEOFF(3604);

The information that we’re looking for is near the bottom of the result set:

A few things are worth calling out. The 500th row that was inserted into the table has a value of 499 for the “CSILOCATOR” field. This value appears to be stored in little-endian format in the memory dump for Slot 0 Offset 0x60. You can decode the raw value to 499 in T-SQL if desired:

SELECT CAST(0x000001F3 AS INT);

The 1000th row that was inserted into the table has a value of 999 for the CSILOCATOR field. Most importantly, this CSILOCATOR field has a length of four bytes. A typical four byte int in SQL Server has a maximum value of 2147483647. If it’s possible to load billions of rows into a single delta rowgroup then we may run out of values for the CSILOCATOR field.

Of course, a single delta rowgroup cannot hold more than 1048576 rows. As you can see in this example, SQL Server does not always reuse freed up values for the CSILOCATOR field. The table currently has two rows, yet the field has advanced to 999. The right pattern of deletes, inserts, and updates should allow the CSILOCATOR to continue to grow without running out of room for rows in the delta rowgroup.
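If you want to watch the delta rowgroup while running a test like this, here’s a quick sketch against the rowgroup physical stats DMV:

-- Delta rowgroup state and row counts for the test table.
SELECT row_group_id,
       state_desc,
       total_rows,
       deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.view_hidden_clustered_index');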

As a final note, I cannot prove that the CSILOCATOR field corresponds to the clustered index, but it certainly seems to serve the function of a uniqueifier that would be needed for said clustered index.

Running up the score


I need to find a relatively efficient way to advance the CSILOCATOR because I need to do it over 2 billion times, if my theory is correct about the maximum allowed value. Both updating all of the rows in a delta rowgroup and deleting and reinserting advance the CSILOCATOR. I expected that small batch sizes would work best, and they did. For my table’s schema, the sweet spot for updates is about 275 rows and the sweet spot for delete/inserts is about 550 rows. Delete/inserts appeared to be faster than updates for the purpose of constantly reloading the same rows over and over.

Strategies that use multiple CPU cores are possible, but I wanted to do other work on this PC and didn’t want to listen to a loud fan all day. Here’s what the final testing code looked like:

DROP TABLE IF EXISTS dbo.delta_store_test;

CREATE TABLE dbo.delta_store_test (
	ID TINYINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);


DROP TABLE IF EXISTS dbo.LOG_TABLE;

CREATE TABLE dbo.LOG_TABLE (
	log_time DATETIME,
	loop_count INT,
	PRIMARY KEY (log_time)
);

GO

DROP TABLE IF EXISTS dbo.delta_store_source;
CREATE TABLE dbo.delta_store_source (
	ID TINYINT NOT NULL
);

INSERT INTO dbo.delta_store_source
SELECT TOP (550) 1
FROM master..spt_values t1
OPTION (MAXDOP 1);

GO

CREATE OR ALTER PROCEDURE #p AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @loops INT = 0;

	WHILE @loops <= 8000000
	BEGIN
		DELETE FROM dbo.delta_store_test

		INSERT INTO dbo.delta_store_test
		SELECT 1
		FROM dbo.delta_store_source WITH (TABLOCK);

		SET @loops = @loops + 1;

		IF @loops % 10000 = 0
		BEGIN
			INSERT INTO dbo.LOG_TABLE
			VALUES (GETDATE(), @loops);
		END;
	END;
END;

GO

EXEC #p;

If you’re wondering about the temporary stored procedure creation, it’s a habit that I’ve gotten into whenever I write a T-SQL while loop. While not applicable here, very fast loops can incur ASYNC_NETWORK_IO overhead due to the passing of DONE tokens to the client.

Winning the game


After about seven hours the code reaches its glorious end:

Msg 666, Level 16, State 2, Procedure #p, Line 11 [Batch Start Line 26]
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 72057596406595584. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.

I loaded a few more rows into the table until I was unable to insert even one row. Using DBCC PAGE and friends again, we can see that the CSILOCATOR has a very large value of 2147483646.

Issuing a REBUILD does resolve the issue because it wipes away our sins, as REBUILDs often do. Naturally using another clustering key is not an option.

Joe: 1

Microsoft: 0

The cost of playing


My desktop computer has an Intel Core i5-4670 processor. Intel claims a TDP of 84 watts. Using CPUID HWMonitor, it looks like my CPU uses about 15 W of additional power when running the workload. It’s a single core workload, so I feel that 15 W is reasonable. 15 watts is 0.015 kW, and when used over 7 hours it translates to 0.105 kWh. My last electric bill has a rate of $0.11663 per kWh, so the total cost of this test comes out to about 1.2 cents. I will be sending Erik an invoice.

Final thoughts


It is possible to hit error 666 when maliciously loading data into a columnstore index. It requires loading over 2 billion rows into the same delta rowgroup, so it’s difficult to think of a production scenario that would lead to this outcome. For those of you with eccentric workloads that give cause for concern, you can roughly check if you are running out of IDs by running a query similar to the following for each columnstore table:

SELECT 2147483647 - 1 - MAX(CAST(SUBSTRING(%%physloc%%, 5, 4) AS INT)) REMAINING_CSILOCATORS
FROM dbo.delta_store_test;

Thanks for reading!

An Extra Special Guest Blogger: Joe Obbish

Joe, colorized c.2019

Friend Zone


Today I’m thrilled to announce that I’ll be hosting all of Joe Obbish’s past and future blogging material on my site.

Not because Joe was all like “hey man, I really like what you’re doing there” or anything.

He just didn’t wanna deal with hosting his own content anymore.

Well, fine.

Joe is brilliant beyond words, and I’m happy to have him here. Hopefully it’ll raise the website IQ a little bit.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server 2017 CU14 Change to CCI Insert Memory Time-out Behavior

A KB article was recently published that suggested that memory time-out behavior was changed for CCI inserts:

When you try to create a Columnstore Index on a table in Microsoft SQL Server 2016 or 2017, the request may fail after 25 seconds with memory time-out error 8645, depending on how many memory consuming queries are running concurrently and/or how much memory is available at the time of the request. The memory time-out error may occur even when you configure longer memory request time-out at an instance level or at a resource pool level.

I’ve never seen that error occur for serial inserts. Serial inserts time out after 25 seconds and execute with minimum required memory. They write directly to the delta store instead of attempting compression. So it looked like this change affects parallel inserts but the details weren’t at all clear to me.

Time out for time-outs


Why should we care about memory grant timeouts for CCI insert queries? Simply put, lots of bad things can happen when those queries time out, both for serial and for parallel inserts. For serial insert queries, I’ve observed deadlocks, extremely poor performance along with long SLEEP_TASK waits, and extremely long rollbacks. For parallel insert queries, I’ve observed queries that run seemingly forever, poor performance of the SELECT part, and error 8645. You probably don’t want any of that occurring in production. It would be very helpful if it was possible to extend the 25 second time-out for queries that insert into columnstore tables.

Time out for time-outs


I’ve found it extremely useful to keep around a 2017 RTM environment with no CUs to figure out when issues related to columnstore were introduced into the product. First we’ll take a look at the behavior on 2017 RTM to see error 8645 in action. I want a relatively simple parallel insert query that will run for longer than 25 seconds and will take a large memory grant. I decided on the following query:

INSERT INTO dbo.TARGET_CCI_1 WITH (TABLOCK)
SELECT ca.SLOW, ca.INFLATE_GRANT
FROM (VALUES (0), (1), (2), (3)) v(x)
CROSS APPLY (
	SELECT TOP (1048576) sc.SLOW, sc.INFLATE_GRANT
	FROM dbo.SLOW_TO_COMPRESS sc
	WHERE sc.FOR_SEEKING = v.x
) ca
OPTION (MAXDOP 2);

I get the properties that I’m after with a few undocumented tricks. The inflate grant column is a VARCHAR(8000) column. That data type significantly increases the memory grant for parallel inserts even with all NULLs. For the SLOW column, I’m loading integers evenly distributed between 0 and 7999. That data pattern can take longer than expected to compress. The worst case is with around 16000 distinct evenly distributed integers. If you’d like to understand why check out this answer by Paul White. Finally, the CROSS APPLY pattern means that I’ll get demand-based parallelism with each nested loop execution reading exactly enough rows from SLOW_TO_COMPRESS to fill up one rowgroup. It may be helpful to look at the query plan:

[Image: query plan]

The query takes about 40 seconds to execute on my machine. If you’d like to follow along at home, set max server memory to 8000 MB and run the following code:

DROP TABLE IF EXISTS dbo.SLOW_TO_COMPRESS;

CREATE TABLE dbo.SLOW_TO_COMPRESS (
	FOR_SEEKING INT NULL,
	SLOW BIGINT NULL,
	INFLATE_GRANT VARCHAR(8000) NULL
);

CREATE CLUSTERED INDEX CI ON dbo.SLOW_TO_COMPRESS
(FOR_SEEKING);

INSERT INTO dbo.SLOW_TO_COMPRESS WITH (TABLOCK)
SELECT q.RN / 1048576, RN % 10000, NULL
FROM
(
	SELECT TOP (4 * 1048576) -1 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

GO

DECLARE @table_id INT = 1,
@sql NVARCHAR(4000);

WHILE @table_id <= 6
BEGIN
	SET @sql = N'DROP TABLE IF EXISTS dbo.TARGET_CCI_'
	 + CAST(@table_id AS NVARCHAR(2))
	 + N'; CREATE TABLE dbo.TARGET_CCI_'
	 + CAST(@table_id AS NVARCHAR(2))
	 + N'(
		SLOW BIGINT NULL,
		INFLATE_GRANT VARCHAR(8000) NULL,
		INDEX CCI1 CLUSTERED COLUMNSTORE
	)';

	EXEC sp_executesql @sql;

	SET @table_id = @table_id + 1;
END;

GO

CREATE OR ALTER PROCEDURE dbo.INSERT_INTO_TARGET_CCI
(@table_id INT)
AS
BEGIN

DECLARE @sql NVARCHAR(4000) = N'INSERT INTO dbo.TARGET_CCI_'
+ CAST(@table_id AS NVARCHAR(2))
+ N' WITH (TABLOCK)
SELECT ca.SLOW, ca.INFLATE_GRANT
FROM (VALUES (0), (1), (2), (3)) v(x)
CROSS APPLY (
	SELECT TOP (1048576) sc.SLOW, sc.INFLATE_GRANT
	FROM SLOW_TO_COMPRESS sc
	WHERE sc.FOR_SEEKING = v.x
) ca
OPTION (MAXDOP 2)';

EXEC sp_executesql @sql;

END;

Error code 8645


Batch files that call sqlcmd are a convenient way to kick off lots of queries. For example:

START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 1;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 2;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 3;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 4;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 5;" > nul

Note that I do not have Resource Governor enabled. If I kick off five queries at once using the batch file I don’t get an error. After 25 seconds two of the five queries are able to execute with the same memory grant as others:

[Image: sys.dm_exec_query_memory_grants output on RTM]

It does make me uncomfortable to see query memory grants exceed the target memory for the semaphore by so much, but at least it’s not over max server memory:

[Image: resource semaphore output on RTM]

I ran the same test but kicked off a sixth query in SSMS. After 25 seconds I saw the following error for the sixth query:

Msg 8645, Level 17, State 1, Line 1
A timeout occurred while waiting for memory resources to execute the query in resource pool ‘default’ (2). Rerun the query.

I want my lawyer


On SQL Server 2017 CU14 I ran a variety of tests by changing the memory time-out settings at the Resource Governor query level or at the instance level. I tried different Resource Governor pools and even serial queries. I still saw a timeout of 25 seconds no matter what I did. I contacted the attorney that I keep on retainer to help me interpret SQL Server KB articles. Let’s review the relevant text again:

When you try to create a Columnstore Index on a table in Microsoft SQL Server 2016 or 2017, the request may fail after 25 seconds with memory time-out error 8645, depending on how many memory consuming queries are running concurrently and/or how much memory is available at the time of the request. The memory time-out error may occur even when you configure longer memory request time-out at an instance level or at a resource pool level.

He pointed out that the article doesn’t actually say that the time-out is now configurable. Just that it wasn’t configurable in the past. The symptom may strictly describe error 8645. So perhaps the adjustment was very narrow and has to do with avoiding that error only. Fair enough. I ran the same test that saw error 8645 on RTM and the sixth query still hit error 8645.

Two of these things aren’t like the others


Let’s kick off five queries on CU14 and take another look at sys.dm_exec_query_memory_grants:

[Image: sys.dm_exec_query_memory_grants output on CU14]
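For reference, a sketch of the kind of query that produces similar output (not necessarily the exact one behind the screenshot):

-- Current memory grants, including how much of the grant was actually used.
SELECT session_id,
       request_time,
       grant_time,
       requested_memory_kb,
       granted_memory_kb,
       max_used_memory_kb,
       timeout_sec
FROM sys.dm_exec_query_memory_grants
ORDER BY session_id;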

That’s odd. The two queries that hit the 25 second timeout have lower values for max_used_memory_kb than the queries that didn’t time out, even though the memory grants are the same. Looking at sys.dm_db_column_store_row_group_physical_stats for one of the tables with the lower memory grant:

[Image: sys.dm_db_column_store_row_group_physical_stats output]

All rows were written to delta stores even though each thread got over 2 million rows. The query still takes its required memory grant but it doesn’t use most of the part reserved for columnstore compression. My best guess is that this is the change described in the KB article. A superficial glance suggests that the internal implementation is the same one used for writing to a delta store in serial:

[Image: call stacks]

I think that I can understand the argument for making this change. However, I see intermittent intra-query parallel deadlocks when queries time out in this way:

Msg 1205, Level 13, State 78, Line 1
Transaction (Process ID 61) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

I’ve also seen this error when going too far above target memory:

Msg 701, Level 17, State 123, Line 1
There is insufficient system memory in resource pool ‘default’ to run this query.

Personally, I’d like to see clearly defined behavior that focuses on stability. When writing code that does parallel inserts into columnstore indexes it may be desirable to catch errors 8645, 1205, and 701 and to retry the insert after rolling back, perhaps at MAXDOP 1.
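Here’s a rough sketch of that retry idea, wrapped around the demo procedure from earlier in this post. It only shows the error handling shape; a real version would also want a serial (MAXDOP 1) fallback, which the demo procedure doesn’t expose:

-- Retry the insert a couple of times if it fails with one of the errors above.
DECLARE @attempts INT = 0;

WHILE @attempts < 3
BEGIN
    BEGIN TRY
        EXEC dbo.INSERT_INTO_TARGET_CCI @table_id = 1;

        BREAK;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK;

        -- Only retry for the memory and deadlock errors called out above.
        IF ERROR_NUMBER() NOT IN (8645, 1205, 701)
        BEGIN
            THROW;
        END;

        SET @attempts = @attempts + 1;
    END CATCH;
END;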

Final thoughts


KB articles for SQL Server fixes can sometimes be misleading because they may focus on how the problem was described in a support ticket even if the scope of the fix or the solution has little to nothing to do with said description. After reading KB article 4480641 you could easily think that error code 8645 no longer occurs with parallel columnstore inserts or that it’s possible to override the 25 second memory timeout for columnstore inserts. Neither one of those is true. Parallel inserts into columnstore tables can still exceed target memory, but they write to multiple delta stores if they hit their 25 second timeout. Thanks for reading!

Extra Documentation for KB Article 4462481

SQL Server KB 4462481 is a bit light on details:

Assume that you use Microsoft SQL Server data warehousing. When you run parallel insert queries to insert data into a clustered columnstore index, the query operation freezes.

To quote one of the Top Men of SQL Server product support:

Hope this helps!

How to freeze your query


Consider a query execution that meets all of the following criteria:

  1. A parallel INSERT INTO… SELECT into a columnstore table is performed
  2. The SELECT part of the query contains a batch mode hash join
  3. The query can’t immediately get a memory grant, hits the 25 second memory grant timeout and executes with required memory

The query may appear to get stuck. It no longer uses CPU and the parallel worker threads appear to wait on each other. Let’s go through an example on SQL Server 2017 RTM. First create all of the following tables:

CREATE PARTITION FUNCTION PART_FUN_REP_SGM(BIGINT)
AS RANGE LEFT
FOR VALUES (1, 2, 3);

CREATE PARTITION SCHEME PART_SCHEME_REP_SGM
AS PARTITION PART_FUN_REP_SGM
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.SOURCE_DATA_FOR_CCI;

CREATE TABLE dbo.SOURCE_DATA_FOR_CCI (
    PART_KEY BIGINT NOT NULL,
    ID BIGINT NOT NULL,
    FAKE_COLUMN VARCHAR(4000) NULL
) ON PART_SCHEME_REP_SGM (PART_KEY);

INSERT INTO dbo.SOURCE_DATA_FOR_CCI WITH (TABLOCK)
SELECT TOP (1048576) 1, ROW_NUMBER()
    OVER (ORDER BY (SELECT NULL)) % 16000, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

INSERT INTO dbo.SOURCE_DATA_FOR_CCI WITH (TABLOCK)
SELECT TOP (1048576) 2, ROW_NUMBER()
    OVER (ORDER BY (SELECT NULL)) % 16000, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.LARGE_BATCH_MODE_MEMORY_REQUEST;

CREATE TABLE dbo.LARGE_BATCH_MODE_MEMORY_REQUEST (
    ID VARCHAR(4000),
    INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.LARGE_BATCH_MODE_MEMORY_REQUEST
WITH (TABLOCK)
SELECT TOP (2 * 1048576) CAST(ROW_NUMBER()
    OVER (ORDER BY (SELECT NULL)) AS VARCHAR(8000))
    + 'DYEL'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

DROP TABLE IF EXISTS dbo.CCI_SLOW_TO_COMPRESS_TARGET_1;

CREATE TABLE dbo.CCI_SLOW_TO_COMPRESS_TARGET_1 (
    ID BIGINT NULL,
    INDEX CCI CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.CCI_SLOW_TO_COMPRESS_TARGET_2;

CREATE TABLE dbo.CCI_SLOW_TO_COMPRESS_TARGET_2 (
    ID BIGINT NULL,
    INDEX CCI CLUSTERED COLUMNSTORE
);

Consider the following query:

INSERT INTO dbo.CCI_SLOW_TO_COMPRESS_TARGET_1
WITH (TABLOCK)
SELECT LEFT(t1.ID, 1)
FROM LARGE_BATCH_MODE_MEMORY_REQUEST t1
INNER JOIN LARGE_BATCH_MODE_MEMORY_REQUEST t2
    ON t1.ID = t2.ID
UNION ALL
SELECT ID
FROM dbo.SOURCE_DATA_FOR_CCI
OPTION (MAXDOP 2);

Here’s what the plan looks like:

[Image: query plan]

I set Max Server Memory to 8000 MB and executed two queries with a maximum allowed memory grant of 60% via Resource Governor. The first query finished in about 40 seconds. The second query hasn’t finished after 30 minutes. During those 30 minutes the second query has only used 1184 ms of CPU time. The COLUMNSTORE_BUILD_THROTTLE wait type shows up in sys.dm_os_waiting_tasks:

[Image: sys.dm_os_waiting_tasks output]

Execution context id 2 is waiting on execution context id 1 with a wait type of HTDELETE. Execution context id 1 has a wait type of COLUMNSTORE_BUILD_THROTTLE. I don’t think that this wait is supposed to show up for parallel inserts. It can show up by design when creating or rebuilding a columnstore index in parallel:

When a columnstore index is being built, the memory grant estimate is based on a segment size of one million rows. The first segment of the index is built using a single thread so the real, required per-thread memory grant is found. Then the memory grants are given per thread and the other segments are built multi-threaded. Although all the threads for the operation are allocated at the start of the build, only one thread is used for the first segment and all the others incur a COLUMNSTORE_BUILD_THROTTLE wait.

The important point is that a wait type of COLUMNSTORE_BUILD_THROTTLE means that the worker thread is waiting on another thread to do something. But so does a wait type of HTDELETE. There are only two worker threads and both of them are waiting on another thread to do something. As a result, the query makes no progress. I’ve let similar queries run for 72 hours before giving up and killing them.
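If you want to see this for yourself, here’s a sketch of the kind of check I run while the query appears stuck (the session id is a placeholder):

-- Which parallel workers for a session are waiting, and on which execution context.
SELECT wt.session_id,
       wt.exec_context_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_exec_context_id
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id = 56
ORDER BY wt.exec_context_id;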

How to unfreeze your query


Upgrading to SQL Server 2017 CU11 or higher fixes the issue. With that said, if you’re seeing this issue it means that you have columnstore insert queries waiting at least 25 seconds for a memory grant. That should be considered a problem even if the queries didn’t get stuck. If you find yourself in this situation, I strongly recommend increasing memory available to SQL Server or limiting memory grants via Resource Governor or some other method.

Final Thoughts


If you have parallel insert queries into CCIs that appear to be stuck with no CPU usage and long COLUMNSTORE_BUILD_THROTTLE wait times, check your version of SQL Server. Upgrading to at least CU11 will resolve the immediate issue, but consider that a more complete fix will include addressing the RESOURCE_SEMAPHORE waits that must be present on your system.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.