Columnstore Loading on Eight Socket Servers

I had a brief opportunity to do SQL Server workload testing on an eight socket server. It didn’t go well.

Sockets

I’ll give an extremely brief introduction to NUMA and sockets because I have a bit more free time these days. You’re probably reading this blog post on a machine with a single socket. All that means is that there’s a single CPU chip plugged into the motherboard. All of the RAM for the machine is local to that CPU chip. The recent king of single socket performance for SQL Server is the Intel 8280. 28 cores at 2.7 GHz with the latest instruction sets is quite a lot of processing power. There’s a newer chip out there now but it’s not clear to me if anyone can buy it yet.

What can you do if a workload can’t be scaled out and it needs more than the CPU power available with a single socket solution? A two socket solution could be the answer. A two socket server has two CPU chips plugged into different slots on a motherboard and each CPU has a local bank of RAM. Any part of memory can be accessed by either CPU, but there is a performance penalty for foreign memory access, which is just accessing memory that is local to a different CPU. See the simple diagram below:

As mentioned earlier, it is more expensive for CPU0 to access memory local to CPU1 compared to its own local memory. The performance penalty of NUMA will depend on the workload as well as the hardware. It can range from not noticeable at all to a serious issue that needs to be addressed in some way. In some situations this can be as easy as convincing your VM admin to change how the VM is set up. In others, your best bet may be to figure out a way to keep the workload on a single socket. Some folks advocate for single socket solutions compared to two socket solutions for OLTP workloads due to the NUMA penalty.
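As an aside, one quick way to see the NUMA layout that SQL Server built at startup is to query the sys.dm_os_nodes DMV. A minimal sketch:

SELECT node_id, node_state_desc, memory_node_id,
    online_scheduler_count, active_worker_count
FROM sys.dm_os_nodes;

Each row is a SQLOS node (the hidden DAC node shows up as well), and the scheduler count gives a rough idea of how many cores are attached to each node.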

Perhaps you are wondering what happens if a workload needs more power than a two socket server can provide? Enter the four socket server. Four socket servers have four slots for CPU chips. An Intel 8280 processor in a four socket configuration gives you a total of 112 physical cores. That’s a lot. Four socket servers are more expensive than two socket servers and are significantly less common to find in data centers. Often, they are a custom order which takes months for the hardware vendor to build and deliver. In addition, the connections between CPUs are more complicated. Below are two common configurations:

Under configuration A, it takes two hops for CPU0 to access memory local to CPU3. That will increase latency more than going to CPU1 or CPU2. However, memory latency between CPUs won’t be the same for all pairs even in configuration B. In addition, it’s more complicated for SQL Server to try to manage memory on a four socket box. Even in the best case scenario for memory management, not everything stored in SQL Server owned memory is NUMA aware by design; the columnstore object pool is one example. Nonetheless, if you need more CPU than a two socket server or VM can provide then you’re stuck living with the complexities of a four socket solution.

What if you need even more power? Hardware manufacturers once again have you covered with eight socket servers. These are even more expensive and uncommon than four socket solutions. Different vendors handle NUMA differently. There are simply more possible permutations with eight CPUs compared to four. I’m told that the following is an example layout for an eight socket machine:

I’m not a NUMA expert by any means. However, I think it’s fair to say that it looks a lot more complicated. There are many more instances where accessing foreign memory will take two hops instead of one. I also don’t see a way to carve out a four socket VM with all CPUs just one hop away. All in all, that diagram makes me nervous. As a final note, eight socket machines are not the limit of what’s possible, but I’ll stop here because I’ve never run code on anything larger than eight sockets.

Test Setup

I elected to use a high concurrency CCI insert workload to compare performance between a four socket VM and an eight socket VM. Quite conveniently, I already had a test columnstore workload that I knew pushed the SQL Server scalability limits in terms of memory management. To perform the threading I used the SQL Server Multi Thread open source framework. I wanted all sessions to go to their own schedulers. That could have been tough to manage with tests up to 200 threads but the threading framework handles that automatically.

For those following along at home, testing was done with SQL Server 2019 with LPIM and TF 876 enabled. Guest VMs were built with VMware with Windows Server 2019 installed. The four and eight socket VMs were created on the same physical host with about 5.5 TB of RAM available to the guest OS in both configurations.
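Trace flag 876 is undocumented, so treat it as a test-environment-only setting. If you want to reproduce the setup, here is a minimal sketch for turning on a trace flag globally at runtime (a -T startup parameter works just as well):

DBCC TRACEON(876, -1); -- the -1 argument applies the trace flag globally
DBCC TRACESTATUS(876); -- confirm that it is enabled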

Test Results

In case you didn’t click the earlier link, each job inserts about 10.4 million rows into a table with a clustered columnstore index. The smallest test ran just a single thread. The largest test on the four socket VM ran 100 concurrent jobs for a total insert volume of a billion rows. The largest test on the eight socket VM ran 200 concurrent jobs for a total insert volume of two billion rows. In all test scenarios, the four socket VM performed more work per second than the eight socket VM:

Quite a disappointing result. Double your SQL Server license fees for a slower overall system! There are the expected memory related wait stats for the 200 thread result:
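As an aside, if you want to check your own system for a similar pattern, a minimal sketch for pulling the top waits by total wait time looks like this:

SELECT TOP (10) wait_type, waiting_tasks_count,
    wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

The specific memory related wait types here are particular to this workload, so expect different names at the top of your list.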

I grabbed a sample of callstacks with PerfView:

Drilling into the top stack shows that decommitting memory is a scalability issue:

The stacks for the 100 thread test on the four socket VM results are a bit surprising by comparison:

Overall CPU reported by PerfView is the same. Perhaps this is a limitation of the tool. However, it really does seem like all of those extra cores simply aren’t helping under the eight socket configuration. Throughput goes down instead of up. I had very limited time with this machine and don’t want to repeat much of a previous blog post, so I’ll stop the technical analysis here. It’s likely that nothing can really be done to improve the workload except large configuration or hardware changes.

If I had to make this workload perform better, I would directly engage with the vendors. If that didn’t work, I would try a bare metal configuration, SQL Server on Linux, or running on an eight socket machine from a different vendor. I think that it’s fair to say that the test results would be different, but I can’t speculate as to whether they would be better or worse or what the difference would be.

Final Thoughts

We live in a wonderful era of hardware prosperity. With a big enough checkbook, you can run your SQL Server workloads on servers with hundreds of CPU cores. However, proceed with extreme caution if you’re considering moving a SQL Server workload to a server or VM with more than four sockets. It might not scale as well as you’re hoping. Thanks for reading!

The Trillion Row Operator

SQL Server friend and excellent webmaster Mr. Erik C. Darling showed me a query plan gone wrong:

48 billion rows for a single operator is certainly a large number for most workloads. I ended up completely missing the point and started wondering how quickly a query could process a trillion rows through a single operator. Eventually that led to a SQL performance challenge: what is the fastest that you can get an actual plan with at least one operator processing a trillion rows? The following rules are in play:

  1. Start with no user databases
  2. Any query can run up to MAXDOP 8
  3. Temp tables may be created and populated but all such work needs to finish in 30 seconds or less
  4. A RECOMPILE query hint must be present in the final query
  5. Undocumented features and behavior are fair game

Unless stated otherwise here, all testing was done on SQL Server 2017 CU20. The processor of my local machine is an Intel i7-9700K. If you’re following along at home you may see different results depending on your hardware. The queries are extremely CPU intensive. If you’d like to skip ahead to the winning query then scroll down to the bottom.

Loop Join

A natural way to generate a lot of rows quickly is to cross join together two equally sized tables. Throwing a million rows into a temporary table takes very little time. Finding the count of the result set should meet the challenge requirements as long as the right optimizer transforms are not available. An initial attempt for data prep:

DROP TABLE IF EXISTS #loop_test;

CREATE TABLE #loop_test (
ID TINYINT NOT NULL
);

INSERT INTO #loop_test WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #loop_test (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8);

The query returns the expected result of 1 trillion, but there is no operator that processes one trillion rows:

The query optimizer uses the LocalAggBelowJoin rule to perform an aggregate before the join. This brings down the total query estimated cost from 1295100 units to 6.23274 units for quite a savings. Trying again after disabling the transform:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

Results in a new query plan:

This query meets the requirements of the challenge. The stream aggregate, nested loops, and row count spool operators will process 1 trillion rows each. However, it is unsatisfactory from a performance point of view. The row mode stream aggregate has a significant profiling penalty compared to a batch mode hash aggregate alternative. It is simple enough to create a dummy table to add batch mode eligibility for the query plan:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The new plan has a nice batch mode operator to do the counting:

With an actual plan request, the new query processes rows at roughly twice the rate of the row mode query. Perhaps query profiling does its work once per batch for batch mode operators and once per row for row mode operators. Each batch has around 900 rows so profiling might be much cheaper per row for the hash aggregate compared to the stream aggregate. Strictly speaking, the goal of this challenge is to optimize the amount of time it takes to get an actual plan. This might lead to different tuning results than optimizing for query completion time without an actual plan request.

The final time for the batch mode query is 2 hours and 12 minutes. I uploaded an actual plan to pastetheplan.com. The query has a CPU to elapsed time ratio of only 5.37. That means that, on average during query execution, more than 2.5 CPU cores were not doing useful work. Getting all eight CPU cores to perform work as much as possible during query execution should allow the query to finish faster. The CPU imbalance in this case is caused by how rows are distributed to threads on the outer side of the nested loop operator:

The query engine has different algorithm choices to perform a parallel scan of a table. In this case, it doesn’t pick an algorithm that leads to optimal performance. The nested loop operator in total drives 42717 seconds of CPU work. It’s important to have that work as balanced between threads as possible. In this situation it is helpful to reduce the number of rows per page. To understand why, I recommend watching this PASS Summit presentation by Adam Machanic. This results in better row distribution between threads and theoretically better performance. The following code only allows for a single row per data page:

DROP TABLE IF EXISTS #wide_1_million_rows;

CREATE TABLE #wide_1_million_rows (
ID TINYINT NOT NULL,
FILLER VARCHAR(4200) NOT NULL
);

INSERT INTO #wide_1_million_rows WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID, REPLICATE('Z', 4200) FILLER
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #wide_1_million_rows (ID) WITH FULLSCAN, NORECOMPUTE;

The new query plan generates in 1:29:40 with a CPU to elapsed time ratio of 7.977, which is near perfect. Rows are quite balanced between threads:

I don’t know of a method to further optimize the nested loop join query. Most of the query’s execution time is spent on profiling. Sampled stacks from 15 seconds of query execution:

ETW tracing tools such as PerfView along with internals knowledge can be used to more finely break down how CPU is used by this query. Almost 75% of the query’s execution time is spent on sqlmin!CQScanProfileNew:GetRowImp and on methods called by it. The query only takes 23 minutes to complete without an actual plan, which is quite close to 25% of the runtime with an actual plan. If you’re interested in more examples of using PerfView to draw conclusions about SQL Server, consider checking out the list of links in this stack exchange answer.

Merge Join

A cross join can only be implemented as a nested loop join but the same end result can be emulated with a merge join. Let’s start with testing a MAXDOP 1 query. Setup:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #merge_serial;

CREATE TABLE #merge_serial (
ID TINYINT NOT NULL
);

INSERT INTO #merge_serial WITH (TABLOCK)
SELECT TOP (1000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE CLUSTERED INDEX CI ON #merge_serial (ID);

The query to run:

SELECT COUNT_BIG(*)
FROM #merge_serial t1 WITH (TABLOCK)
INNER JOIN #merge_serial t2 WITH (TABLOCK) ON t1.ID = t2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MERGE JOIN, MAXDOP 1, QueryRuleOff LocalAggBelowJoin);

I did not allow the query to run to completion because it would have taken around 25 hours to complete. On an 8 core machine with merge join, there’s no reason to expect that a parallel query could do any better than 25/8 = 3.125 hours. This makes merge join an inferior option compared to loop join so there isn’t a good reason to pursue further approaches that use a merge join.

For completeness, the best merge plan time I was able to get was 5 hours and 20 minutes. The plan avoids the scalability issues found with parallel merge and order preserving repartition streams. The end result is still poor. The merge join simply consumes a lot of CPU to do its work. The additional overhead added by the loop join is not helpful.

Row Mode Hash Join

Perhaps we’ll have more luck with a hash join compared to a merge join. After all, the query optimizer assigns a lower estimated cost for a row mode hash join compared to a merge join for the trillion row scenario. What could go wrong? The first thing that goes wrong is that the addition of the empty CCI will lead to a batch mode hash join for this query. Avoiding that requires changing the join to a type that doesn’t allow for batch mode execution. Among other methods, this can be accomplished by putting the hash join on the inner side of a nested loop or changing the join to also match on NULL. The second technique will be used at first because it leads to significantly less overhead compared to the loop join approach.

Starting again with a MAXDOP 1 test:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_maxdop_1;
CREATE TABLE #hash_maxdop_1 (
	ID TINYINT NULL
);

INSERT INTO #hash_maxdop_1 WITH (TABLOCK)
SELECT TOP (1000000) 0 
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_maxdop_1 (ID) WITH FULLSCAN, NORECOMPUTE;

The test query has a more complex join condition:

SELECT COUNT_BIG(*)
FROM #hash_maxdop_1 t1 WITH (TABLOCK)
INNER HASH JOIN #hash_maxdop_1 t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (MAXDOP 1, RECOMPILE)

Note how the join now includes rows where both sides are NULL. There are no such rows in the table but the query optimizer does not know that so a row mode join is used. The query completed in 5.5 hours. This means that a parallel hash join approach could be competitive with the nested loop join. An optimistic time estimate for a parallel query is 41 minutes.

Getting a parallel query isn’t as simple as it sounds: changing MAXDOP from 1 to 8 on its own will not have the desired effect. Take a look at the following query plan born from such an attempt:

The repartition stream operators have a partitioning type of hash. That means that a hashing algorithm is used to assign rows to different threads. All of the rows from the table have the same column value, so they will all hash to the same value and will end up on the same thread. Seven CPUs will have no work to do even though the query technically executes at DOP 8. This will not lead to performance that’s better than executing at DOP 1.

Through testing I found that the following values for a nullable TINYINT all go to different threads: 0, 1, 4, 5, 7, 9, 14, and 17. Note that I’m relying on undocumented details here that might change between releases. Inserting 250k rows of each value into one table and 500k rows of each value into another table should result in 8 * 250000 * 500000 = 1 trillion rows total. Code to do this is below:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #row_hash_join_repartition_8_build;
CREATE TABLE #row_hash_join_repartition_8_build (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_build
SELECT TOP (2000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_build (ID) WITH FULLSCAN, NORECOMPUTE;


DROP TABLE IF EXISTS #row_hash_join_repartition_8_probe;
CREATE TABLE #row_hash_join_repartition_8_probe (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_probe
SELECT TOP (4000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_probe (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #row_hash_join_repartition_8_build t1 WITH (TABLOCK)
INNER JOIN #row_hash_join_repartition_8_probe t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, HASH JOIN, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The query plan appeared after 42 minutes with a CPU ratio of 7.7. This was a better result than I was expecting. The repartition stream operators don’t seem to have much of a negative effect on overall runtime.

For completeness, I also tested a parallel hash join plan without the repartition stream operators. All of the hash join work is on the inner side of a nested loop join. The loop join adds significant overhead so this ends up being a losing method. If you’d like to see for yourself, here is one way to set up the needed tables:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_join_demand_62500_rows;
CREATE TABLE #hash_join_demand_62500_rows (
	ID TINYINT NOT NULL
);

INSERT INTO #hash_join_demand_62500_rows WITH (TABLOCK)
SELECT TOP (62500) ISNULL(CAST(0 AS TINYINT), 0) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_join_demand_62500_rows (ID);

As well as one way to generate such a query:

SELECT COUNT_BIG(*)
FROM
(VALUES
	  (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
) v(v)
CROSS JOIN (
	SELECT 1 c
	FROM #hash_join_demand_62500_rows t1 WITH (TABLOCK)
	INNER HASH JOIN #hash_join_demand_62500_rows t2 WITH (TABLOCK) ON t1.ID = t2.ID
) ca
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, FORCE ORDER, MAXDOP 8, MIN_GRANT_PERCENT = 99, NO_PERFORMANCE_SPOOL);

It might be difficult to understand what’s going on here, so here’s a picture of the query plan:

Work is split up into 256 buckets (the VALUES clause contains 32 rows of 8 values each). Each row fed into the outer side of the nested loop join results in 62500 rows hash joined with 62500 rows, for 256 * 62500 * 62500 = 1 trillion rows in total.

The disappointing final result takes about 2 hours of runtime. This query might perform better on a busy system compared to the simple parallel hash join because work is assigned to threads somewhat on demand. By contrast, the repartition streams query requires equal work to be completed on all threads. However, the testing environment only has a single query running at a time.

The query pattern here is similar to a parallel apply. Perhaps you’ve heard good things about that pattern and are wondering why it didn’t work out here. In my experience, that query pattern does best when an order preserving repartition streams operator can be avoided, there’s a supporting index on the inner side, and the row count coming out of the inner side is relatively low. The above query is 0 for 3.

Batch Mode Hash Join

I’ll be honest. I expected batch mode hash join to be the winner. I was wrong. Batch mode parallelism works in a fundamentally different way than row mode parallelism. There is no repartition streams operator which assigns rows to different threads based on a hash or other algorithm. Rather, each operator takes care of parallelism by itself. In my experience this is generally quite advantageous for both performance and scalability, but there can be drawbacks.

We intend to get a batch mode hash join, so it seems logical enough to join together a CCI with one million rows. Sample code to do that:

DROP TABLE IF EXISTS #batch_mode_hash_equal_sized;

CREATE TABLE #batch_mode_hash_equal_sized (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #batch_mode_hash_equal_sized WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #batch_mode_hash_equal_sized (ID) WITH FULLSCAN, NORECOMPUTE;

The query to be run no longer needs the join to the empty CCI table:

SELECT COUNT_BIG(*) 
FROM #batch_mode_hash_equal_sized c1
INNER JOIN #batch_mode_hash_equal_sized c2 ON c1.ID = c2.ID
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

An actual plan is generated on my machine in about 27.5 minutes. While clearly better than all other attempts so far, the CPU to elapsed ratio suggests that significant improvement is possible: it is only 4.6 out of a possible 8.0. Batch mode processing is generous enough to provide a helpful wait stat. There are about 5607 seconds of HTDELETE waits. Put simply, some of the threads of the hash join ran out of work to do and waited for the other threads to finish their work. As I understand it with batch mode joins, the probe side is what matters in terms of work imbalance. Row count per thread of the probe side:
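One way to look at per-thread row counts like these, either live or alongside the actual plan, is to query sys.dm_exec_query_profiles while the query runs. The DMV only returns rows if query profiling is enabled for the session, for example by requesting an actual plan. A minimal sketch, with a placeholder session id:

SELECT node_id, physical_operator_name, thread_id, row_count
FROM sys.dm_exec_query_profiles
WHERE session_id = 56 -- placeholder: the session id of the running query
ORDER BY node_id, thread_id;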

The million row CCI table is split up into 9 chunks of rows which are divided between 8 threads which leads to many threads having no work to do for a while. Quite unfortunate. Oddly enough, switching to MAXDOP 9 could lead to much better performance for this query. That isn’t an available option on my local machine so I switched to testing on a VM with Intel 8280 sockets. The theory works out: the MAXDOP 8 query takes 50 minutes to complete and the MAXDOP 9 query takes 25 minutes to complete. At MAXDOP 9 the row distribution is quite nice on the probe side:

The rules for the different algorithms available to do parallel scans on CCIs are not documented. I previously investigated it here. Interestingly enough, this is an example where an exchange operator could make a positive difference. However, the batch mode hash join doesn’t have that as an option. While there are various tricks to achieve even row distribution from a CCI, it seems simplest to just replace the tables with row store. Converting 2 million rows from row mode to batch mode won’t add any measurable overhead. Using the same table that was used for the best loop join time:

SELECT COUNT_BIG(*) 
FROM #wide_1_million_rows c1
INNER JOIN #wide_1_million_rows c2 ON c1.ID = c2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin)

The new query plan finishes in just under 16 minutes. Row distribution per thread on the probe side is much more even:

You might be thinking that 16 minutes is a pretty good result. I too initially thought that.

The Winner

All of the previous approaches spent the vast majority of their time performing a join to generate rows from smaller tables. The scans of the underlying tables themselves were quite fast. Perhaps a better approach would be to avoid the join entirely. For example, consider inserting a billion rows into a table and scanning that table 1000 times with UNION ALL. That would send one trillion rows to a single operator without the expense of a join. Of course, the downside is that SQL Server would need to read one trillion rows total from a table. It would be important to make the reading of rows as efficient as possible. The most efficient method I know is to create a columnstore table with extreme compression.

The title of this section gives it away, but let’s start with a small scale test to see if the approach has merit. Code to load 100 million rows into a CCI:

DROP TABLE IF EXISTS #count_test;

CREATE TABLE #count_test (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #count_test WITH (TABLOCK)
SELECT TOP (100000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #count_test (ID);

SQL Server is able to compress this data quite well because there’s just a single unique value. The table is only 1280 KB in size. The data prep code does exceed the 30 second time limit on my machine due to the statistics creation, but I’ll ignore that for now because it’s just a trial run. I can avoid all types of early aggregation with the following syntax:

SELECT COUNT_BIG(*)
FROM
(
	SELECT ID
	FROM #count_test
	UNION ALL
	SELECT ID
	FROM #count_test
) q
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER)

Picture of the query plan:

On my machine, that query takes about 41 ms to process 200 million rows. Based on that, it seems reasonable to expect that a query that processes a trillion rows could finish in under four minutes. There is a balancing act to perform here. Putting more rows into the table makes it harder to complete data prep within 30 seconds, but cuts down on the number of UNION ALLs performed which reduces query compilation time.

The approach that I settled on was to load 312.5 million rows into a table. Reading that table 3200 times results in a trillion total rows. 3200 is a convenient number to achieve with nested CTEs and I’m able to load 312.5 million rows quite reliably in under 30 seconds even with MAXDOP 4. I can’t say that this is the best possible approach but it seems to work fairly well.

The last detail in terms of data prep is to deal with statistics creation. SQL Server forces a sample rate of 100% for very small tables. Unfortunately, extremely compressed columnstore indexes can fall into the 100% range even if they have a high row count. Creating a statistic object with a sample of 312.5 million rows will take longer than 30 seconds. I don’t need good statistics for query performance so I work around the issue by creating a statistic with NORECOMPUTE after an initial load of 500k rows. Code to create and populate a CCI with 312.5 million rows:

DROP TABLE IF EXISTS #CCI_1M_ONE_COLUMN;

CREATE TABLE #CCI_1M_ONE_COLUMN (ID TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

DECLARE @insert_count TINYINT = 0;
WHILE @insert_count < 8
BEGIN
	INSERT INTO #CCI_1M_ONE_COLUMN
	SELECT TOP (125000) 0
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	OPTION (MAXDOP 1);

	SET @insert_count = @insert_count + 1;
END;

DROP TABLE IF EXISTS #C;

CREATE TABLE #C (I TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

INSERT INTO #C WITH (TABLOCK)
SELECT TOP (500000) 0
FROM #CCI_1M_ONE_COLUMN
OPTION (MAXDOP 1);

CREATE STATISTICS S ON #C (I) WITH NORECOMPUTE, FULLSCAN;

WITH C13 AS (
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
)
, C312 AS
 (
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
)
INSERT INTO #C WITH (TABLOCK)
SELECT 0
FROM C312
OPTION (MAXDOP 4, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);


DBCC TRACEON(10204);

ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

DBCC TRACEOFF(10204);

The above code isn’t as efficient as possible but it gets the job done. It also provides a preview of the syntax that is used to query the table 3200 times. A nested CTE approach is used instead of writing out UNION ALL 3199 times. Trace flag 10204 is used because there’s no need to act on rows outside of the delta rowgroups.

Query compile time can be quite significant for this query. On an older PC while running these queries, I could determine when the query compile ended by a change in fan noise. The new additions to the query hints are there to minimize query compile time as much as possible. This is important when a table is referenced 3200 times in a query. I’m told that there are undocumented trace flags which can reduce query compile time even more but I’ll leave that as an exercise to the reader. Here’s the query text of the UNION ALL approach:

WITH C8 AS (
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I  FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
)
, C160 AS
 (
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
)
, C3200 AS
(
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
)
SELECT COUNT_BIG(*)
FROM C3200
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);

The good news is that generating an actual plan takes about 3 minutes on my machine. The bad news is that uploading a 27 MB query plan is a lot more difficult than I thought. I ended up loading it to my personal website. You can view all of the XML in its glory here.

For a method more friendly to my personal bandwidth, you can look at the screenshot below:

I don’t know of a faster way to complete the challenge. Without an actual plan, the query takes about 63 seconds to complete. However, the query takes about 3 minutes to complete on SQL Server 2019, even without requesting an actual plan. The difference is caused by lightweight query profiling in SQL Server 2019, which is turned on by default. I find query profiling to be very useful in general for evaluating performance issues in real time and for estimating how long a query will take to complete. However, it can still add significant overhead for certain types of queries. The difference is quite apparent in the call stacks:

This is why I did all of my testing for this blog post on SQL Server 2017, in case you were wondering.
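If you’re stuck on SQL Server 2019 and run into the same overhead, lightweight query profiling can be turned off at the database level instead of server wide. A minimal sketch, assuming losing live query progress is an acceptable trade in your environment:

ALTER DATABASE SCOPED CONFIGURATION SET LIGHTWEIGHT_QUERY_PROFILING = OFF;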

Final Thoughts

Summary of all attempts:

The results are quite specific to the test scenario: they should not be used to draw conclusions about general join performance. As mentioned earlier, in a few cases the bottleneck is the query profiling infrastructure instead of typical aspects of query performance. If anyone out there knows of a faster technique than those tested here, I would be delighted to learn about it. Thanks for reading!

Using Exchange Demand Partitioning to Improve Parallel Query Scalability

One of our SQL Server workloads runs many concurrent parallel queries on a possibly already busy server. Other work occurring on the server can have dramatic effects on parallel query runtimes. One way to improve scalability of parallel query runtimes is to achieve query plans with operators that allow a dynamic amount of work to be completed by each parallel worker thread. For example, exchange operators with a partitioning type of hash or round robin force a typically even amount of work to be completed by each worker thread. If a worker thread happens to be on a busy scheduler then query runtime may increase due to the longer runtime of the worker thread that is competing for CPU resources. Serial zones in parallel query plans can of course have the same problem. Batch mode operators, exchange operators with a partitioning type of broadcast or demand, and the parallel page supplier are all examples of operators which can do a dynamic amount of work per thread. Those are the operators that I prefer to see in query plans for this workload.

Very little has been written about exchange operators with a partitioning type of demand, so I forgive you if you haven’t heard of it before today. There is a brief explanation available here, an example of using demand partitioning to improve some query plans involving partitioned tables, and a Stack Exchange answer for someone comparing round robin and demand partitioning. You have the honor of reading perhaps the fourth blog post about the subject.

Start


The demos are somewhat dependent on hardware so you may not see identical results if you are following along. I’m testing on a machine with 8 CPUs and max server memory set to 6000 MB. Start with a table with a multi-column primary key and insert about 34 million rows into it:

DROP TABLE IF EXISTS #;
CREATE TABLE # (
ID BIGINT NOT NULL,
ID2 BIGINT NOT NULL,
STRING_TO_AGG VARCHAR(MAX),
PRIMARY KEY (ID, ID2)
);

INSERT INTO # WITH (TABLOCK)
SELECT RN, v.v, REPLICATE('REPLICATE', 77)
FROM (
SELECT TOP (4800000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values
CROSS JOIN master..spt_values t2
) q
CROSS JOIN (VALUES (1), (2), (3), (4), (5), (6), (7)) v(v);

The query tuning exercise is to insert the ID column and a checksum of the concatenation of all the STRING_TO_AGG values for each ID value ordered by ID2. This may seem like an odd thing to do but it is based upon a production example with an adjustment to not write as much data as the real query. Not all of us have SANs in our basements, or even have a basement. Use the following for the target table:

DROP TABLE IF EXISTS ##;

CREATE TABLE ## (
ID BIGINT NOT NULL,
ALL_STRINGS_CHECKSUM INT
);

Naturally we use SQL Server 2019 CU4 so the STRING_AGG function is available to us. Here is one obvious way to write the query:

INSERT INTO ## WITH (TABLOCK)
SELECT ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2))
FROM #
GROUP BY ID
OPTION (MAXDOP 8);

The above query takes about 63 seconds on my machine with a cost of 2320.57 optimizer units. The query optimizer decided that a serial plan was the best choice:

This is a rather lame result for a query tuning exercise so I will assume that I know better and force a parallel query plan with undocumented trace flag 8649. SQL Server warns us that the estimated cost is 4816.68 optimizer units but surely doesn’t expect a detail like that to stop me. The adjusted query executes in 75 seconds:
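For reference, one way the adjusted query might look, assuming the undocumented trace flag is applied at the query level with QUERYTRACEON rather than enabled globally:

INSERT INTO ## WITH (TABLOCK)
SELECT ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2))
FROM #
GROUP BY ID
OPTION (MAXDOP 8, QUERYTRACEON 8649);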

My arrogance is checked. The parallel query is slower than the serial version that the query optimizer wanted us to have. The problem is the hash exchange operator. It is an order preserving exchange with a high MAXDOP and wide rows. This is the worst possible situation for exchange performance. How else can we write the query? Batch mode is not available for STRING_AGG so that’s out. Does anyone remember anything about tuning row mode queries?

The Dark Side


Query hints along with carefully constructed T-SQL are the pathway to many abilities, some considered to be unnatural. We can give the classic Parallel Apply query pattern made famous by Adam Machanic a shot to solve this problem. Perhaps you are thinking that there is no driving table for the outer side of a nested loop join, but we can create one by sampling the clustered index of the base table. I’ll skip that part here and just use what I know about the data to divide it into 96 equal ranges:

DROP TABLE IF EXISTS ###;
CREATE TABLE ### (s BIGINT NOT NULL, e BIGINT NOT NULL);

INSERT INTO ###
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000);

I can now construct a query that gets a parallel apply type of plan:

INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM ### driver
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN driver.s AND driver.e
	GROUP BY ID
) ca
OPTION (MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

This is an unnatural query plan. The query optimizer assigned it a cost of 36248.7 units. I had to add the TOP to get a valid query plan with an index seek. Removing the TOP operator results in error 8622. Naturally such things won’t stop us and running the query results in an execution time between 15 and 19 seconds on my machine, which is the best result yet.

This query plan has an exchange partitioning type of round robin. Recall such exchange types can lead to trouble if there’s other work executing on one of the schedulers used by a parallel worker thread. So far I’ve been testing these MAXDOP 8 queries with nothing else running on my 8 core machine. I can make a scheduler busy by running a MAXDOP 1 query that has no real reason to yield before exhausting its 4 ms quantum. Here is one way to accomplish that:

SELECT TOP (1) t1.high + t2.high + t3.high + t4.high
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
CROSS JOIN master..spt_values t4
ORDER BY t1.high + t2.high + t3.high + t4.high
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

Wait stats for this query if you don’t believe me:

Running this query at the same time as the parallel query can apply a large performance penalty to the parallel query. The parallel query can take up to 48 seconds to execute if even a single worker thread has to share time on a scheduler with another. That is, the query ran 3 times slower when I added a single MAXDOP 1 query to the workload. Looking at thread details for the parallel query:

As you can see, one of the worker threads took a significantly longer amount of time to complete its work compared to the other threads. There is no logged wait statistic for this kind of performance problem in which the other parallel worker threads complete their work much earlier than when the query finishes. If there’s no worker thread then there is no wait associated with the query. The only way to catch this is to look at actual row distribution or the CPU time to elapsed time ratio.
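A minimal sketch of that kind of check for a query that is still executing, using sys.dm_exec_requests with a placeholder session id (cpu_time and total_elapsed_time are both reported in milliseconds):

SELECT session_id, status, command, cpu_time, total_elapsed_time,
    CAST(cpu_time AS DECIMAL(19, 2)) / NULLIF(total_elapsed_time, 0) AS cpu_to_elapsed_ratio
FROM sys.dm_exec_requests
WHERE session_id = 56; -- placeholder: the session id of the parallel query

For a MAXDOP 8 query with perfectly balanced work and no waits, the ratio would approach 8.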

You may be wondering why the query is worse than twice as slow as before. After all, if all workers do an equal amount of work but one now gets access to half as much CPU as before it seems reasonable to expect the runtime to double instead of triple. The workers of the parallel query have many reasons they might yield before exhausting their full 4 ms quantum – an I/O wait for example. The MAXDOP 1 SELECT query is designed to not yield early. What is very likely happening is that the MAXDOP 1 query gets a larger share of the scheduler’s resources than 50%. SQL Server 2016 made adjustments to try to limit this type of situation but by its very nature I don’t see how it could ever lead to a perfect sharing of a scheduler’s resources.

Demanding Demand Partitioning


We can get an exchange operator with demand based partitioning by replacing the driving temp table with a derived table. Full query text below:

INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM (
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000)
) driver( s, e)
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN CAST(driver.s AS BIGINT) AND CAST(driver.e AS BIGINT)
	GROUP BY ID
) ca
OPTION (MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

Query performance is effectively random. The query was observed to execute as quickly as 15 seconds and as slowly as 45 seconds. In some situations there was an incredible amount of skew in row distributions between threads:

This is an unexpected situation if there are no other queries running on the server. Query performance is most disappointing.

Is there a trace flag?


Yes! The problem here is that the nested loop join uses the prefetch optimization. Paul White writes:

One of the available SQL Server optimizations is nested loops prefetching. The general idea is to issue asynchronous I/O for index pages that will be needed by the inner side — and not just for the current correlated join parameter value, but for future values too.

That sounds like it might be wildly incompatible with a demand exchange operator. Querying sys.dm_exec_query_profiles during query execution proves that the demand exchange isn’t working as expected: worker threads no longer fully process the results associated with their current row before requesting the next one. That is what can lead to wild skew between the worker threads and as a result query performance is effectively random.

Documented trace flag 8744 disables this optimization. Adding it to the query using QUERYTRACEON results in much more stable performance. The query typically finishes in about 15 seconds. Here is an example thread distribution:
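In case it saves you a search, below is a sketch of the query with the trace flag added to the hint list. Only the first four ranges from the derived table are included to keep the example short; for a faithful repro use all 96 (with only a handful of outer rows the optimizer may not choose prefetching in the first place):

INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM (
VALUES (1, 50000), (50001, 100000), (100001, 150000), (150001, 200000)
) driver(s, e)
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN CAST(driver.s AS BIGINT) AND CAST(driver.e AS BIGINT)
	GROUP BY ID
) ca
OPTION (QUERYTRACEON 8744, MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);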

If you fight the ISV fight like I do, you may not be able to enable trace flags for individual queries. If you’re desperate you could try artificially lowering the cardinality estimate from the derived table. An OPTIMIZE FOR query hint with a direct filter is my preferred way to accomplish this. I like to set the cardinality estimate equal to MAXDOP but I have no real basis for doing this. Here is the full query text:

DECLARE @filter BIGINT = 987654321987654321;
INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM (
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000)
) driver( s, e)
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN CAST(driver.s AS BIGINT) AND CAST(driver.e AS BIGINT)
	GROUP BY ID
) ca
WHERE driver.s <= @filter
OPTION (OPTIMIZE FOR (@filter = 350001), MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

Query performance is the same as with TF 8744:


Does this query do better than round robin partitioning when there is a busy MAXDOP 1 query running at the same time? I ran it a few times and it completed in about 15-16 seconds every time. One of the worker threads does less work and the others cover for it:

In this example the nested loop join only gets the prefetch optimization if the cardinality estimate is more than 25 rows. I do not know if that number is a fixed part of the algorithm for prefetch eligibility but it certainly feels unwise to rely on this behavior never changing in a future version of SQL Server. Note that prefetching is a trade-off. For some workloads you may be better off with round robin partitioning and prefetching compared to demand without prefetching. It’s hard to imagine a workload that would benefit from demand with prefetching but perhaps I’m not being creative enough in my thinking.

Final Thoughts


In summary, the example parallel apply query that uses demand partitioning performs 2-3 times better than the query that uses round robin partitioning when another serial query is running on the server. The nested loop prefetch optimization does not work well with exchange operator demand partitioning and should be avoided via a trace flag or other tricks if demand partitioning is in use.

There are a few things that Microsoft could do to improve the situation. A USE HINT that disables nested loop prefetching would be most welcome. I think that there’s also a fairly strong argument to make that a query pattern of a nested loop join with prefetching with a first child of a demand exchange operator is a bad pattern that the query optimizer should avoid if possible. Finally, it would be nice if there was a type of wait statistic triggered when some of a parallel query’s workers finish their work earlier than others. The problematic queries here have very little CXPACKET waits and no EXECSYNC waits. Imagine that I put a link to UserVoice requests here and imagine yourself voting for them.

Thanks for reading!

Trace flag 3656 is not sufficient for symbol resolution on SQL Server 2019

You may have noticed that TF 3656 appears to no longer work in SQL Server 2019 RC1. Symbols are not resolved in Extended Events event data even with that trace flag enabled. Trace flag 2592 must also be enabled to resolve symbols. This was recently added by Microsoft to the documentation. This concludes the shortest blog post I will ever write.

What is the PVS_PREALLOCATE wait type?

I was workload testing on SQL Server 2019 RC1 when I ran into a wait type I’d never noticed before: PVS_PREALLOCATE. Wait seconds per second was about 2.5 which was pretty high for this workload. Based on the name it sounded harmless but I wanted to look into it more to verify that.

The first thing that I noticed was that total signal wait time was suspiciously low at 12 ms. That’s pretty small compared to 55000000 ms of resource wait time and suggested a low number of wait events. During testing we log the output of sys.dm_os_wait_stats every ten seconds so it was easy to graph the deltas for wait events and wait time for PVS_PREALLOCATE during the workload’s active period:

This is a combo chart with the y-axis for the delta of waiting tasks on the left and the y-axis for the delta of wait time in ms on the right. I excluded rows for which the total wait time of PVS_PREALLOCATE didn’t change. As you can see, there aren’t a lot of wait events in total and SQL Server often goes dozens of minutes, or sometimes several hours, before a new wait is logged to the DMV.
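A minimal sketch of the kind of snapshot query that could feed a chart like this, filtered down to the wait type in question:

SELECT GETUTCDATE() AS sample_time_utc, wait_type,
    waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'PVS_PREALLOCATE';

Log the output somewhere every ten seconds and the deltas between samples give you the data points for the chart.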

This pattern looked like a single worker that was almost always in a waiting state. To get more evidence for that I tried comparing the difference in logging time with the difference in wait time. Here are the results:

Everything matches within a margin of error of 10 seconds. Wait stats are logged every 10 seconds so everything fits. The data looks exactly as it should if a single session was almost always waiting on PVS_PREALLOCATE. I was able to find said session:
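A minimal sketch of that kind of check is to look at the waiting tasks DMV while the wait is active:

SELECT session_id, exec_context_id, wait_type, wait_duration_ms
FROM sys.dm_os_waiting_tasks
WHERE wait_type = N'PVS_PREALLOCATE';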

I did some more testing on another server and found that all waits were indeed tied to a single internal session id. The PVS_PREALLOCATOR process starts up along with the SQL Server service and has a wait type of PVS_PREALLOCATE until it wakes up and has something to do. Blogging friend Forrest found this quote about ADR:

The off-row PVS leverages the table infrastructure to simplify storing and accessing versions but is highly optimized for concurrent inserts. The accessors required to read or write to this table are cached and partitioned per core, while inserts are logged in a non-transactional manner (logged as redo-only operations) to avoid instantiating additional transactions. Threads running in parallel can insert rows into different sets of pages to eliminate contention. Finally, space is pre-allocated to avoid having to perform allocations as part of generating a version.

That’s good enough for me. This wait type appears to be benign from a wait stats analysis point of view and I recommend filtering it out of the queries that you use for wait stats analysis.
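If it helps, here’s a minimal sketch of that kind of filter, assuming a plain query against sys.dm_os_wait_stats rather than any particular published script:

-- exclude PVS_PREALLOCATE (and any other housekeeping waits you don't care about)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'PVS_PREALLOCATE')
ORDER BY wait_time_ms DESC;

Thanks for reading!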

Columnstore Precon Details

I thought it could be helpful to go into more detail for what I plan to present at the columnstore precon that’s part of SQL Saturday New York City. Note that everything here is subject to change. I tried to include all of the main topics planned at this time. If you’re attending and really want to see something covered, let me know and I’ll be as transparent as possible about what I can and cannot do.

  • Part 1: Creating your table
    • Definitions
    • Delta rowgroups
    • How columnstore compression works
    • What can go wrong with compression
    • Picking the right data types
    • Deciding on partitioning
    • Indexing
    • When should you try columnstore?
  • Part 2: Loading data quickly
    • Serial inserts
    • Parallel inserts
    • Inserts into partitioned tables
    • Inserts to preserve order
    • Deleting rows
    • Updating rows
    • Better alternatives
    • Trickle insert – this is a maybe
    • Snapshot isolation and ADR
    • Loading data on large servers
  • Part 3: Querying your data
    • The value of maintenance
    • The value of patching
    • How I read execution plans
    • Columnstore/batch mode gotchas with DMVs and execution plans
    • Columnstore features
    • Batch mode topics
    • When is batch mode slower than row mode?
    • Batch mode on rowstore
    • Bad ideas with columnstore
  • Part 4: Maintaining your data
    • Misconceptions about columnstore
    • Deciding what’s important
    • Evaluating popular maintenance solutions
    • REORG
    • REBUILD
    • The tuple mover
    • Better alternatives

I hope to see a few of you there!

Thoughts on MAXDOP

Microsoft recently published new guidance on setting server level MAXDOP. I hope to help the community by analyzing the new guidance and offering some of my own thoughts on query parallelism.

Line by line


Documentation is meant to be shared after all, so hopefully no one minds if I quote most of it:

Starting with SQL Server 2016 (13.x), during service startup if the Database Engine detects more than eight physical cores per NUMA node or socket at startup, soft-NUMA nodes are created automatically by default. The Database Engine places logical processors from the same physical core into different soft-NUMA nodes.

This is true and one of the bigger benefits of auto soft-NUMA as far as I’ve been able to tell.
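If you want to see what auto soft-NUMA actually did on a given instance, a quick sketch like the following counts the visible schedulers per node:

-- scheduler count per (soft-)NUMA node
SELECT parent_node_id, COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;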

The recommendations in the table below are aimed at keeping all the worker threads of a parallel query within the same soft-NUMA node.

SQL Server is not designed to keep all worker threads in a single soft-NUMA node. That might have been true in SQL Server 2008, but it changed in 2012. The only semi-official documentation that I know of is here and I looked into the behavior here. Read through both if you’re interested in how scheduling of parallel worker threads is performed by SQL Server, but I’ll provide a quick summary via example here.

Suppose you have two soft-NUMA nodes of 6 schedulers each and the server just restarted. NUMA node 0 has positions 0-5 and NUMA node 1 has positions 6-11. The global enumerator starts at position 0. If I run a MAXDOP 4 query then the enumerator advances by 4. The parallel workers are allowed in positions 0-3, which means that any four out of six schedulers can be chosen from NUMA node 0. All parallel worker threads are in NUMA node 0 for the first query. Suppose I run another MAXDOP 4 query. The enumerator advances by 4 and the allowed positions are 4-7. That means that any two schedulers can be chosen from NUMA node 0 and any two schedulers can be chosen from NUMA node 1. The worker threads are split over two soft-NUMA nodes even though query MAXDOP is less than the size of the soft-NUMA nodes.

Unless you’re on a server with a single soft-NUMA node it is difficult to guarantee that all worker threads end up on the same soft-NUMA node. I strongly recommend against aiming for that as a goal. There are more details in the “Preventing hard NUMA worker splits” section of this blog post.
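If you want to see where a parallel query’s workers actually landed, a rough sketch like the following (the session id is a placeholder) groups a running request’s tasks by node and scheduler:

-- where did this session's tasks end up?
SELECT t.session_id, s.parent_node_id, t.scheduler_id, COUNT(*) AS task_count
FROM sys.dm_os_tasks t
INNER JOIN sys.dm_os_schedulers s ON s.scheduler_id = t.scheduler_id
WHERE t.session_id = 55 -- replace with the session id of interest
GROUP BY t.session_id, s.parent_node_id, t.scheduler_id
ORDER BY s.parent_node_id, t.scheduler_id;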

This will improve the performance of the queries and distribution of worker threads across the NUMA nodes for the workload. For more information, see Soft-NUMA.

I’ve heard some folks claim that keeping all parallel workers on a single hard NUMA node can be important for query performance. I’ve even seen some queries experience reduced performance when thread 0 is on a different hard NUMA node than the parallel worker threads. I haven’t heard anything about the importance of keeping all of a query’s worker threads on a single soft-NUMA node. It doesn’t really make sense to say that query performance will be improved if all worker threads are on the same soft-NUMA node. Soft-NUMA is a configuration setting. Suppose I have a 24 core hard NUMA node and my goal is to get all of a parallel query’s worker threads on a single soft-NUMA node. To accomplish that goal the best strategy is to disable auto soft-NUMA, because that gives me a NUMA node size of 24 as opposed to 8. So disabling auto soft-NUMA will increase query performance?

Starting with SQL Server 2016 (13.x), use the following guidelines when you configure the max degree of parallelism server configuration value:

Server with single NUMA node [and] Less than or equal to 8 logical processors: Keep MAXDOP at or below # of logical processors

I don’t understand this guidance at all. If MAXDOP is set to above the number of logical processors then the total number of logical processors is used. This is even mentioned earlier on the same page of documentation. This line is functionally equivalent to “Set MAXDOP to whatever you want”.

Server with single NUMA node [and] Greater than 8 logical processors: Keep MAXDOP at 8

This configuration is only possible with a physical core count between 5 and 8 and with hyperthreading enabled. Setting MAXDOP above the physical core count isn’t recommended by some folks, but I suppose there could be some scenarios where it makes sense. Keeping MAXDOP at 8 isn’t bad advice for many queries on a large enough server, but the documentation is only talking about small servers here.

Server with multiple NUMA nodes [and] Less than or equal to 16 logical processors per NUMA node: Keep MAXDOP at or below # of logical processors per NUMA node

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node, so this covers all server configurations with more than 8 physical cores. Soft-NUMA scheduler counts per node can range from 4 to 16. If you accept this advice then in some scenarios you’ll need to lower MAXDOP as you increase the number of physical cores per socket. For example, if I have 24 schedulers per socket without hyperthreading then auto soft-NUMA gives me three NUMA nodes of 8 schedulers, so I might set MAXDOP to 8. But if the scheduler count is increased to 25, 26, or 27 then I’ll have at least one soft-NUMA node of 6 schedulers. So I should lower MAXDOP from 8 to 6 because the physical core count of the socket increased?

Server with multiple NUMA nodes [and] Greater than 16 logical processors per NUMA node: Keep MAXDOP at half the number of logical processors per NUMA node with a MAX value of 16

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node. I believe that this is impossible. At the very least, if it is possible, I can tell you that it’s rare. This feels like an error in the documentation. Perhaps they were going for some kind of hyperthreading adjustment?

NUMA node in the above table refers to soft-NUMA nodes automatically created by SQL Server 2016 (13.x) and higher versions.

I suspect that this is a mistake and that some “NUMA node” references are supposed to refer to hard NUMA. It’s difficult to tell.

Use these same guidelines when you set the max degree of parallelism option for Resource Governor workload groups.

There are two benefits to using MAXDOP at the Resource Governor workload group level. The first benefit is that it allows different workloads to have different MAXDOP without changing lots of application code. The guidance here doesn’t allow for that benefit. The second benefit is that it acts as a hard limit on query MAXDOP as opposed to the soft limit provided with server level MAXDOP. It may also be useful to know that the query optimizer takes server level MAXDOP into account when creating a plan. It does not do so for MAXDOP set via Resource Governor.
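As a quick illustration of the workload group approach, a sketch looks like the following. The group name is made up, and routing sessions into the group still requires a classifier function:

-- hard per-workload MAXDOP cap; sessions reach the group via a classifier function (not shown)
CREATE WORKLOAD GROUP reporting_group
WITH (MAX_DOP = 8);

ALTER RESOURCE GOVERNOR RECONFIGURE;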

I haven’t seen enough different types of workloads in action to provide generic MAXDOP guidance, but I can share some of the issues that can occur with query parallelism being too low or too high.

What are some of the problems with setting MAXDOP too low?


  1. Better query performance may be achieved with a higher MAXDOP. For example, a well-written MAXDOP 8 query on a quiet server may simply run eight times as quickly as the MAXDOP 1 version. In some scenarios this is highly desired behavior.
  2. There may not be enough concurrent queries to get full value out of the server’s hardware without increasing query MAXDOP. Unused schedulers can be a problem for batch workloads that aim to get a large, fixed amount of work done as quickly as possible.
  3. Row mode bitmap operators associated with hash joins and merge joins only execute in parallel plans. MAXDOP 1 query plans lose out on this optimization.

What are some of the problems with setting MAXDOP too high?


  1. At some point, throwing more and more parallel queries at a server will only slow things down. Imagine adding more and more cars to an already gridlocked traffic situation. Depending on the workload you may not want to have many active workers per scheduler.
  2. It is possible to run out of worker threads with many concurrent parallel queries that have many parallel branches each. For example, a MAXDOP 8 query with 20 branches will ask for 160 parallel workers. When this happens parallel queries can get downgraded all the way to MAXDOP 1.
  3. Row mode exchange operators need to move rows between threads and do not scale well with increased query MAXDOP.
  4. Some types of row mode exchange operators evenly divide work among all parallel worker threads. This can degrade query performance if even one worker thread is on a busy scheduler. Consider a server with 8 schedulers. Scheduler 0 has two active workers and all other schedulers have no workers. Suppose there is 40 seconds of CPU work to do, the query scales with MAXDOP perfectly, and work is evenly distributed to worker threads. A MAXDOP 4 query can be expected to run in 40/4 = 10 seconds since SQL Server is likely to pick four of the seven less busy schedulers. However, a MAXDOP 8 query must put one of the worker threads on scheduler 0. The work on schedulers 1 – 7 will finish in 40/8 = 5 seconds but the worker thread on scheduler 0 has to yield to the other worker threads. It may take 5 * 3 = 15 seconds if CPU is shared evenly, so in this example increasing MAXDOP from 4 to 8 increases query run time from 10 seconds to 15 seconds.
  5. The query memory grant for parallel inserts into columnstore indexes increases with MAXDOP. If MAXDOP is too high then memory pressure can occur during compression and the SELECT part of the query may be starved for memory.
  6. The query memory grant for memory-consuming operators on the inner side of a nested loop is often not increased with MAXDOP even though the operator may execute concurrently once on each worker thread. In some uncommon query patterns, increasing MAXDOP will increase the amount of data spilled to tempdb.
  7. Increasing MAXDOP increases the number of queries that will have parallel workers spread across multiple hard NUMA nodes. If MAXDOP is greater than the number of schedulers in a hard NUMA node then the query is guaranteed to have split workers. This can degrade query performance for some types of queries.
  8. Worker threads may need to wait on some type of shared resource. Increasing MAXDOP can increase contention without improving query performance. For example, there’s nothing stopping me from running a MAXDOP 100 SELECT INTO, but I certainly do not get 100X of the performance of a MAXDOP 1 query. The problem with the below query is the NESTING_TRANSACTION_FULL latch:
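Below is a sketch of the kind of query I mean; the source table name is made up and any sufficiently large table will do:

-- a sketch only: a parallel SELECT INTO with an absurdly high MAXDOP hint
SELECT *
INTO #select_into_target
FROM dbo.BigTable -- hypothetical large table
OPTION (MAXDOP 100);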

Preventing hard NUMA worker splits


It generally isn’t possible to prevent worker splits over hard NUMA nodes without changing more than server level and query level MAXDOP. Consider a server with 2 hard NUMA nodes of 10 schedulers each. To avoid a worker split, an administrator might try setting server level MAXDOP to 10, with the idea being that each parallel query spreads its workers over NUMA node 0 or NUMA node 1. This plan won’t work if any of the following occur:

  • Any query runs with a query level MAXDOP hint other than 0, 1, 10, or 20.
  • Any query is downgraded in MAXDOP but still runs in parallel.
  • A parallel stats update happens. The last time I checked these run with a query level MAXDOP hint of 16.
  • Something else unexpected happens.

In all cases the enumerator will be shifted and any MAXDOP 10 queries that run after will split their workers. TF 2467 can help, but it needs to be carefully tested with the workload. With the trace flag, as long as MAXDOP <= 10 and automatic soft-NUMA is disabled, the parallel workers will be sent to a single NUMA node based on load. Note that the execution context 0 thread can still be on a different hard NUMA node. If you want to prevent that then you can try Resource Governor CPU affinity at the Resource Pool level. Create one pool for NUMA node 0 and one pool for NUMA node 1. You may experience interesting consequences when doing that.
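A minimal sketch of that Resource Pool setup might look like the following. The pool and group names are made up, and the classifier function that routes sessions into the groups is omitted:

-- one pool per hard NUMA node, each with its own workload group
CREATE RESOURCE POOL node0_pool WITH (AFFINITY NUMANODE = (0));
CREATE RESOURCE POOL node1_pool WITH (AFFINITY NUMANODE = (1));

CREATE WORKLOAD GROUP node0_group USING node0_pool;
CREATE WORKLOAD GROUP node1_group USING node1_pool;

ALTER RESOURCE GOVERNOR RECONFIGURE;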

The most reliable method by far is to have a single hard NUMA node, so if you have a VM that fits into a single socket of a VM host and you care about performance then ask your friendly VM administrator for some special treatment.

Final thoughts


I acknowledge that it’s difficult to create MAXDOP guidance that works for all scenarios and I hope that Microsoft continues to try to improve their documentation on the subject. Thanks for reading!

T-SQL Protip: watch those TOPs without ORDER BY

In the documentation for TOP, the following is listed as a best practice:

In a SELECT statement, always use an ORDER BY clause with the TOP clause. Because, it’s the only way to predictably indicate which rows are affected by TOP.
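Here’s a tiny, self-contained illustration of why that matters. Without an ORDER BY, TOP is free to return whichever rows it likes:

DROP TABLE IF EXISTS #top_demo;
CREATE TABLE #top_demo (id INT NOT NULL);

INSERT INTO #top_demo VALUES (1), (2), (3), (4), (5);

SELECT TOP (3) id FROM #top_demo; -- some 3 rows; which ones is not guaranteed
SELECT TOP (3) id FROM #top_demo ORDER BY id DESC; -- deterministic: 5, 4, 3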

Let’s work through a real world example.

The good


One of the great things about the “Top Resource Consuming Queries” query store SSMS report is that it is always able to render the query plan, even for very complex queries. I’m not aware of a pure T-SQL solution that can avoid requiring the end user to save xml to files in all cases. The report nearly always takes a long time to run, so it’s easy to capture the T-SQL that powers the grid details version:

DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(distinct p.plan_id) >= 1
ORDER BY total_duration DESC;

Note the presence of the ORDER BY. I get exactly the results that I was expecting:

The bad


If I ask for extra details (who doesn’t want more details?), a significantly more complex query is generated:

-- grid format query with additional details
DECLARE @results_row_count INT = 100,
@interval_start_time DATETIMEOFFSET = '2019-05-24 15:30:00 +00:00',
@interval_end_time DATETIMEOFFSET = '2019-05-24 18:00:00 +00:00';
 
With wait_stats AS
(
SELECT
    ws.plan_id plan_id,
    ws.execution_type,
    ROUND(CONVERT(float, SUM(ws.total_query_wait_time_ms)/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))*1,2) avg_query_wait_time,
    ROUND(CONVERT(float, SQRT( SUM(ws.stdev_query_wait_time_ms*ws.stdev_query_wait_time_ms*(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms))/SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms)))*1,2) stdev_query_wait_time,
    CAST(ROUND(SUM(ws.total_query_wait_time_ms/ws.avg_query_wait_time_ms),0) AS BIGINT) count_executions,
    MAX(itvl.end_time) last_execution_time,
    MIN(itvl.start_time) first_execution_time
FROM sys.query_store_wait_stats ws
    JOIN sys.query_store_runtime_stats_interval itvl ON itvl.runtime_stats_interval_id = ws.runtime_stats_interval_id
WHERE NOT (itvl.start_time > @interval_end_time OR itvl.end_time < @interval_start_time)
GROUP BY ws.plan_id, ws.runtime_stats_interval_id, ws.execution_type ),
top_wait_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(ws.avg_query_wait_time*ws.count_executions))*1,2) total_query_wait_time,
    SUM(ws.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM wait_stats ws
    JOIN sys.query_store_plan p ON p.plan_id = ws.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (ws.first_execution_time > @interval_end_time OR ws.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
),
top_other_stats AS
(
SELECT TOP (@results_row_count)
    p.query_id query_id,
    q.object_id object_id,
    ISNULL(OBJECT_NAME(q.object_id),'') object_name,
    qt.query_sql_text query_sql_text,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) total_duration,
    ROUND(CONVERT(float, SUM(rs.avg_cpu_time*rs.count_executions))*0.001,2) total_cpu_time,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_reads*rs.count_executions))*8,2) total_logical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_writes*rs.count_executions))*8,2) total_logical_io_writes,
    ROUND(CONVERT(float, SUM(rs.avg_physical_io_reads*rs.count_executions))*8,2) total_physical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_clr_time*rs.count_executions))*0.001,2) total_clr_time,
    ROUND(CONVERT(float, SUM(rs.avg_dop*rs.count_executions))*1,0) total_dop,
    ROUND(CONVERT(float, SUM(rs.avg_query_max_used_memory*rs.count_executions))*8,2) total_query_max_used_memory,
    ROUND(CONVERT(float, SUM(rs.avg_rowcount*rs.count_executions))*1,0) total_rowcount,
    ROUND(CONVERT(float, SUM(rs.avg_log_bytes_used*rs.count_executions))*0.0009765625,2) total_log_bytes_used,
    ROUND(CONVERT(float, SUM(rs.avg_tempdb_space_used*rs.count_executions))*8,2) total_tempdb_space_used,
    SUM(rs.count_executions) count_executions,
    COUNT(distinct p.plan_id) num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY p.query_id, qt.query_sql_text, q.object_id
)
SELECT TOP (@results_row_count)
    A.query_id query_id,
    A.object_id object_id,
    A.object_name object_name,
    A.query_sql_text query_sql_text,
    A.total_duration total_duration,
    A.total_cpu_time total_cpu_time,
    A.total_logical_io_reads total_logical_io_reads,
    A.total_logical_io_writes total_logical_io_writes,
    A.total_physical_io_reads total_physical_io_reads,
    A.total_clr_time total_clr_time,
    A.total_dop total_dop,
    A.total_query_max_used_memory total_query_max_used_memory,
    A.total_rowcount total_rowcount,
    A.total_log_bytes_used total_log_bytes_used,
    A.total_tempdb_space_used total_tempdb_space_used,
    ISNULL(B.total_query_wait_time,0) total_query_wait_time,
    A.count_executions count_executions,
    A.num_plans num_plans
FROM top_other_stats A LEFT JOIN top_wait_stats B on A.query_id = B.query_id and A.query_sql_text = B.query_sql_text and A.object_id = B.object_id
WHERE A.num_plans >= 1
ORDER BY total_duration DESC;

Now we have not 1, not 2, but THREE TOP operators! But only one of them has an ORDER BY. The results are completely different, and are pretty much useless:

The ugly


This has nothing to do with TOP as far as I know, but I included it just for fun:

Final thoughts


All of you developers out there should watch your TOPs and make sure you’re using ORDER BY as needed. Otherwise, you might end up with annoyed end users writing blog posts about your code. Thanks for reading!

Why is SYSDATETIME() slower than SYSUTCDATETIME()?

Consider the following code that calls SYSDATETIME() 10 million times in a loop:

GO

CREATE OR ALTER PROCEDURE #p_local AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @dummy DATETIME2(7), @loops INT = 0;

	WHILE @loops <= 10000000
	BEGIN
		SET @dummy = SYSDATETIME();

		SET @loops = @loops + 1;
	END;
END;

GO

EXEC #p_local;

On my machine the code takes about 11.6 seconds to execute. Replacing SYSDATETIME() with SYSUTCDATETIME() makes the code take only 4.3 seconds to execute. Why is SYSUTCDATETIME() so much faster than SYSDATETIME()?

It’s always a CPU problem


Both while loops drive a CPU core to 100% while executing. Within SQL Server, elapsed time nearly equals CPU time. For this code, I wouldn’t expect waits, latches, or spinlocks to reveal any clues. Query plans, execution stats, and other DMVs are unlikely to help either. One thing that could help is information about the internals of SYSDATETIME() and SYSUTCDATETIME(). There’s a little bit on books online:

SQL Server obtains the date and time values by using the GetSystemTimeAsFileTime() Windows API. The accuracy depends on the computer hardware and version of Windows on which the instance of SQL Server is running. The precision of this API is fixed at 100 nanoseconds. The accuracy can be determined by using the GetSystemTimeAdjustment() Windows API.

All of the truly useful information is probably locked behind the vault at Microsoft. Fortunately, both SQL Server and the guest OS are happy to give us information about what SQL Server is using its CPU to do if we have the right public symbols and ask nicely. I know of four primary ways of doing this:

  1. A True Professional could use WinDbg (the cool kids pronounce it “wind bag”) or another debugger to step through the code of both functions. I don’t have the skills or patience to do that.
  2. DBCC STACKDUMP or sqldumper.exe can be used to create a filtered memory dump on demand. The problem with this approach is that it only provides a single snapshot and our functions execute very quickly.
  3. Extended events can provide callstacks for many events. I’m not aware of an extended event that would fire at the right times for this code.
  4. ETW tracing can provide a summary of call stacks during a sampled time period. Common choices are Windows Performance Recorder or PerfView.

Option 4 is exactly what we need. We can run a single query at a time that performs uniform work and we want to see where CPU time is spent during the entire query’s execution. My tool of choice is PerfView.

Revealing SQL Server’s secrets


I used PerfView’s default settings and collected ETW profile data separately during both while loops. After resolving symbols and changing folding to 0, there’s a clear difference between SYSDATETIME() (on the left) and SYSUTCDATETIME() (on the right):

kernelbase!GetTimeZoneInformation shows up with a high percentage for exclusive time for SYSDATETIME(). It does not show up at all for SYSUTCDATETIME(). In case it helps, here’s documentation for a few of the column names:

Exc – The amount of cost (msec of CPU time) that can be attributed to the particular method itself (not any of its callees)
Exc % – The exclusive cost expressed as a percentage of the total cost of all samples.
Inc – The cost associated with this node as well as all its children (callees) recursively. The inclusive cost of the ROOT contains all costs.
Inc % – The inclusive cost expressed as a percentage of the total cost of all samples (will be 100% for the ROOT node)

sqltses!CXVariant::GetSysDatetime sounds an awful lot like SYSDATETIME() to me. It also shows up in the right place in the stack to be a match. Drilling into the methods called by it:

Note that the call to kernelbase!GetTimeZoneInformation takes up nearly all of the CPU time. There’s also a call to kernelbase!GetSystemTimeAsFileTime that barely shows up.

Sharp-eyed readers might be wondering if SYSDATETIMEOFFSET() results in calls to sqltses!CDatetimeOffset::SetSystemDatetimeOffset. The answer is yes, but this is left as an exercise for the reader.

Again, sqltses!CXVariant::GetSysUTCDatetime sounds an awful lot like SYSUTCDATETIME() to me. Drilling into the methods called by it:

Note the low percentage of inclusive time for the call. SYSUTCDATETIME() is so cheap to call that most of the CPU time is spent executing the loop code instead of that function. kernelbase!GetTimeZoneInformation does not show up at all. There is still a call to kernelbase!GetSystemTimeAsFileTime that barely shows up.

This is probably an oversimplification, but I think it’s fair to say that each call to SYSDATETIME() needs to get time zone information. This results in a call to kernelbase!GetTimeZoneInformation that is relatively expensive. There is no need to call that method for SYSUTCDATETIME().

Does this matter? It depends™


Database folks like answering questions with “it depends”. Regrettably, they don’t always provide sufficient or accurate details about what it depends on. I strive to not fall into that trap. For many queries the performance difference between the two functions won’t matter. Consider the following two batches:

DROP TABLE IF EXISTS #BENCHMARK_TIME_FUNCTIONS;

CREATE TABLE #BENCHMARK_TIME_FUNCTIONS (SAVED_DATETIME DATETIME2(7) NULL);

INSERT INTO #BENCHMARK_TIME_FUNCTIONS WITH (TABLOCK)
SELECT TOP (10000000) SYSDATETIME()
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

GO

DROP TABLE IF EXISTS #BENCHMARK_TIME_FUNCTIONS;

CREATE TABLE #BENCHMARK_TIME_FUNCTIONS (SAVED_DATETIME DATETIME2(7) NULL);

INSERT INTO #BENCHMARK_TIME_FUNCTIONS WITH (TABLOCK)
SELECT TOP (10000000) SYSUTCDATETIME()
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

SQL Server doesn’t need to execute the datetime functions once per row. If it did, we’d expect a run time difference of about 7 seconds between the two queries. Instead, they execute in nearly exactly the same amount of time. I believe the details about whether this behavior is guaranteed were in a Connect (RIP) comment.
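If you want to check for yourself, one rough sketch (using the temp table from the batches above) is to count how many distinct values actually got saved; per-row evaluation would have produced a huge number of distinct values:

-- how many distinct datetime values did the insert actually write?
SELECT COUNT(DISTINCT SAVED_DATETIME) AS distinct_saved_values
FROM #BENCHMARK_TIME_FUNCTIONS;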

The performance difference can matter when the function is used to profile code that executes row by row. Suppose that we have related SQL statements and we want to measure the execution time of just one of the statements by calculating the time difference before and after execution. I’m not saying that this is the best way of doing this analysis, but I suspect that it’s a common technique. Below is an example stored procedure with one copy using SYSDATETIME() and the other using SYSUTCDATETIME():

GO

CREATE OR ALTER PROCEDURE #USE_SUPERIOR_AMERICAN_TIME AS
BEGIN
	DECLARE @start_time DATETIME2(7),
	@end_time DATETIME2(7),
	@total_microseconds BIGINT = 0,
	@loops INT = 0;

	SET NOCOUNT ON;

	DROP TABLE IF EXISTS dbo.OFFICIAL_STATEMENTS;
	CREATE TABLE dbo.OFFICIAL_STATEMENTS (OFFICIAL_STATEMENT VARCHAR(30));
	
	WHILE @loops <= 50000 BEGIN INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO COLLUSION');
		IF @@ROWCOUNT > 0	
		BEGIN
			SET @start_time = SYSDATETIME();

			INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO OBSTRUCTION');

			SET @end_time = SYSDATETIME();

			SET @total_microseconds = @total_microseconds + DATEDIFF(MICROSECOND, @start_time, @end_time);
		END;

		SET @loops = @loops + 1;
	END;

	SELECT @total_microseconds;
END;

GO

EXEC #USE_SUPERIOR_AMERICAN_TIME;

GO

CREATE OR ALTER PROCEDURE #USE_INFERIOR_EUROPEAN_TIME AS
BEGIN
	DECLARE @start_time DATETIME2(7),
	@end_time DATETIME2(7),
	@total_microseconds BIGINT = 0,
	@loops INT = 0;

	SET NOCOUNT ON;

	DROP TABLE IF EXISTS dbo.OFFICIAL_STATEMENTS;
	CREATE TABLE dbo.OFFICIAL_STATEMENTS (OFFICIAL_STATEMENT VARCHAR(30));
	
	WHILE @loops <= 50000 BEGIN INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO COLLUSION');
		IF @@ROWCOUNT > 0	
		BEGIN
			SET @start_time = SYSUTCDATETIME();

			INSERT INTO dbo.OFFICIAL_STATEMENTS VALUES ('NO OBSTRUCTION');

			SET @end_time = SYSUTCDATETIME();

			SET @total_microseconds = @total_microseconds + DATEDIFF(MICROSECOND, @start_time, @end_time);
		END;

		SET @loops = @loops + 1;
	END;

	SELECT @total_microseconds;
END;

GO

EXEC #USE_INFERIOR_EUROPEAN_TIME;

The second insert query takes a total of 4381219 microseconds with SYSDATETIME() and 4164658 microseconds with SYSUTCDATETIME(). That’s a 5% difference just from changing the datetime function. Of course, given that this code is measuring a difference between two times, SYSUTCDATETIME() is the correct choice because it remains accurate even if there’s a daylight saving time switch during code execution. My own approach is to avoid running one-off stored procedures around 2:00 AM.

Final thoughts


There’s a large performance difference between SYSDATETIME() and SYSUTCDATETIME(). It isn’t necessary to guess which one is slower and why. ETW tracing tools such as PerfView can reveal exactly where the performance difference comes from. Thanks for reading!

Breaking Columnstore Delta Rowgroups

The documentation on delta rowgroups says:

A delta rowgroup is a clustered index that’s used only with columnstore indexes. It improves columnstore compression and performance by storing rows until the number of rows reaches a threshold and are then moved into the columnstore.

This clustered index isn’t chosen by the person who creates the table. Instead, it’s built on a hidden internal column that’s automatically added to delta rowgroups. Perhaps the implementation is similar to that of table spools:

The worktable is structured as if it was defined with a clustered index on zero columns. This means that a 4-byte uniqueifier is added to each row stored except the first.

You know how programmers are: they love reusing old code. If a similar implementation is used for delta rowgroups then it should be possible to see SQL Server errors with the right pattern of data loading. More investigation is required.

Defining the rules


It might be possible to find evidence of this clustered index by using DBCC PAGE. In the T-SQL code below, I create a table with a clustered columnstore index, insert 500 rows, delete 499 rows, insert 500 more rows, and delete 499 rows again:

CREATE TYPE dbo.SEAN_GALLARDY_INT FROM SMALLINT NOT NULL;

DROP TABLE IF EXISTS dbo.view_hidden_clustered_index;

CREATE TABLE dbo.view_hidden_clustered_index (
	ID SEAN_GALLARDY_INT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

GO

CREATE OR ALTER PROCEDURE #p AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @value_to_insert INT = 1;

	WHILE @value_to_insert <= 500
	BEGIN
		INSERT INTO dbo.view_hidden_clustered_index VALUES (@value_to_insert);

		SET @value_to_insert = @value_to_insert + 1;
	END;

	DELETE FROM dbo.view_hidden_clustered_index
	WHERE ID < 500;

	WHILE @value_to_insert <= 1000
	BEGIN
		INSERT INTO dbo.view_hidden_clustered_index VALUES (@value_to_insert);

		SET @value_to_insert = @value_to_insert + 1;
	END;

	DELETE FROM dbo.view_hidden_clustered_index
	WHERE ID > 500 AND ID < 1000;
END;

GO

EXEC #p;

There’s only a single page that stores the 2 rows currently held by the table. This can be viewed with the undocumented DBCC PAGE, TF 3604, and the undocumented sys.dm_db_database_page_allocations:

DECLARE @file_id SEAN_GALLARDY_INT;
DECLARE @page_id INT;

SELECT @file_id = allocated_page_file_id, @page_id = allocated_page_page_id
FROM sys.dm_db_database_page_allocations 
    (DB_ID(), OBJECT_ID('dbo.view_hidden_clustered_index'),NULL, NULL, 'DETAILED')
WHERE is_allocated = 1 AND allocation_unit_type = 1 AND is_iam_page = 0 and page_type = 1;

DBCC TRACEON(3604);
DBCC PAGE('TEST',@file_id,@page_id,3) WITH TABLERESULTS;
DBCC TRACEOFF(3604);

The information that we’re looking for is near the bottom of the result set:

A few things are worth calling out. The 500th row that was inserted into the table has a value of 499 for the “CSILOCATOR” field. This value appears to be stored in little-endian format in the memory dump for Slot 0 Offset 0x60. You can decode the raw value to 499 in T-SQL if desired:

SELECT CAST(0x000001F3 AS INT);

The 1000th row that was inserted into the table has a value of 999 for the CSILOCATOR field. Most importantly, this CSILOCATOR field has a length of four bytes. A typical four byte int in SQL Server has a maximum value of 2147483647. If it’s possible to load billions of rows into a single delta rowgroup then we may run out of values for the CSILOCATOR field.

Of course, a single delta rowgroup cannot hold more than 1048576 rows. As you can see in this example, SQL Server does not always reuse freed up values for the CSILOCATOR field. The table currently has two rows, yet the field has advanced to 999. The right pattern of deletes, inserts, and updates should allow the CSILOCATOR to continue to grow without running out of room for rows in the delta rowgroup.

As a final note, I cannot prove that the CSILOCATOR field corresponds to the clustered index, but it certainly seems to serve the function of a uniqueifier that would be needed for said clustered index.

Running up the score


I need to find a relatively efficient way to advance the CSILOCATOR because, if my theory about the maximum allowed value is correct, I need to do it over 2 billion times. Both updating all of the rows in a delta rowgroup and deleting and reinserting them advance the CSILOCATOR. I expected that small batch sizes would work best, and they did. For my table’s schema, the sweet spot for updates is about 275 rows and the sweet spot for delete/inserts is about 550 rows. Delete/inserts appeared to be faster than updates for the purpose of constantly reloading the same rows over and over.

Strategies that use multiple CPU cores are possible, but I wanted to do other work on this PC and didn’t want to listen to a loud fan all day. Here’s what the final testing code looked like:

DROP TABLE IF EXISTS dbo.delta_store_test;

CREATE TABLE dbo.delta_store_test (
	ID TINYINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);


DROP TABLE IF EXISTS dbo.LOG_TABLE;

CREATE TABLE dbo.LOG_TABLE (
	log_time DATETIME,
	loop_count INT,
	PRIMARY KEY (log_time)
);

GO

DROP TABLE IF EXISTS dbo.delta_store_source;
CREATE TABLE dbo.delta_store_source (
	ID TINYINT NOT NULL
);

INSERT INTO dbo.delta_store_source
SELECT TOP (550) 1
FROM master..spt_values t1
OPTION (MAXDOP 1);

GO

CREATE OR ALTER PROCEDURE #p AS
BEGIN
	SET NOCOUNT ON;

	DECLARE @loops INT = 0;

	WHILE @loops <= 8000000
	BEGIN
		DELETE FROM dbo.delta_store_test

		INSERT INTO dbo.delta_store_test
		SELECT 1
		FROM dbo.delta_store_source WITH (TABLOCK);

		SET @loops = @loops + 1;

		IF @loops % 10000 = 0
		BEGIN
			INSERT INTO dbo.LOG_TABLE
			VALUES (GETDATE(), @loops);
		END;
	END;
END;

GO

EXEC #p;

If you’re wondering about the temporary stored procedure creation, it’s a habit that I’ve gotten into whenever I write a T-SQL while loop. While not applicable here, very fast loops can incur ASYNC_NETWORK_IO overhead due to the passing of DONE tokens to the client.

Winning the game


After about seven hours the code reaches its glorious end:

Msg 666, Level 16, State 2, Procedure #p, Line 11 [Batch Start Line 26]
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 72057596406595584. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.

I loaded a few more rows into the table until I was unable to insert even one row. Using DBCC PAGE and friends again, we can see that the CSILOCATOR has a very large value of 2147483646.

Issuing a REBUILD does resolve the issue because it wipes away our sins, as REBUILDs often do. Naturally using another clustering key is not an option.
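For reference, the rebuild against the demo table defined earlier is nothing exotic:

-- wipe away our sins; CCI is the index name from the CREATE TABLE above
ALTER INDEX CCI ON dbo.delta_store_test REBUILD;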

Joe: 1

Microsoft: 0

The cost of playing


My desktop computer has an Intel Core i5-4670 processor. Intel claims a TDP of 84 watts. Using CPUID HWMonitor, it looks like my CPU uses about 15 W of additional power when running the workload. It’s a single core workload, so I feel that 15 W is reasonable. 15 watts is 0.015 kW, and when used over 7 hours it translates to 0.105 kWh. My last electric bill has a rate of $0.11663 per kWh, so the total cost of this test comes out to about 1.2 cents. I will be sending Erik an invoice.

Final thoughts


It is possible to hit error 666 when maliciously loading data into a columnstore index. It requires loading over 2 billion rows into the same delta rowgroup, so it’s difficult to think of a production scenario that would lead to this outcome. For those of you with eccentric workloads that give cause for concern, you can roughly check if you are running out of IDs by running a query similar to the following for each columnstore table:

SELECT 2147483647 - 1 - MAX(CAST(SUBSTRING(%%physloc%%, 5, 4) AS INT)) REMAINING_CSILOCATORS
FROM dbo.delta_store_test;

Thanks for reading!