Batch Mode On Row Store vs Batch Mode Tricks

Quiet Time


I think Batch Mode is quite spiffy for the right kind of query, but up until SQL Server 2019, we had to play some tricks to get it:

  • Do a funny join to an empty table with a column store index
  • Create a filtered column store index with no data in it

If you’re on SQL server 2019 Enterprise Edition, and you’ve got your database in compatibility level 150, you may heuristically receive Batch Mode without those tricks.

One important difference between Batch Mode Tricks™ and Batch Mode On Rowstore (BMOR) is that the latter allows you to read from row mode tables using Batch Mode, while the former doesn’t.

Tricks have limits, apparently.

Squish Squish


To cut down on typing, I’ll often create a helper object like this:

CREATE TABLE dbo.t
(
    id int NULL,
    INDEX c CLUSTERED COLUMNSTORE
);

If you read this post, you’ll understand more why.

Now, let’s compare these two queries:

SELECT 
    p.OwnerUserId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
LEFT JOIN dbo.t
    ON 1 = 0
WHERE p.Score < 50
GROUP BY p.OwnerUserId
HAVING COUNT_BIG(*) > 2147483647
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'), MAXDOP 8);

SELECT 
    p.OwnerUserId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.Score < 50
GROUP BY p.OwnerUserId
HAVING COUNT_BIG(*) > 2147483647
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'), MAXDOP 8);

 

One executes in compatibility level 140, the other in 150.

Splish Splash


There are a couple interesting things, here.

the porter

Even though both queries have operators that execute in Batch Mode (Filter, Hash Match), only the second query can read from the row store clustered index in Batch Mode. In this case, that shaves a couple hundred milliseconds off the seek.

There is likely some additional invisible benefit to not having to convert the row mode seek to a batch mode hash join at the next operator, since one executes for 501ms, and the other executes for 278ms. There’s nothing in the query plan to signal that happening, so you’ll just have to use your imagination.

Thanks for reading!

A Word From Our Sponsors


First, a huge thank you to everyone who has bought my training so far. You all are incredible, and I owe all of you a drink.

Your support means a lot to me, and allows me to do nice stuff for other people, like give training away for free.

So far, I’ve donated $45k (!!!) worth of training to folks in need, no questions asked.

Next year, I’d like to keep doing the same thing. I’d also like to produce a whole lot more training to add value to the money you spend. In order to do that, I need to take time off from consulting, which isn’t easy to do. I’m not crying poor, but saying no to work for chunks of time isn’t easy for a one-person party.

I’m hoping that I can make enough in training bucks to make that possible.

Because this sale is extra sale-y, I’ve decided to name it after the blackest black known to man.

From today until December 31st, you can get all 25 hours of my recorded training content for just $100.00. If you click the link below to add everything to your cart, and use the discount code AllFor100 to apply a discount to your cart.

Everything

Everything

Everything

Some fine print: It only works if you add EVERYTHING. It’s a fixed amount discount code that you need to spend a certain amount to have kick in.

Thank for reading, and for your support.

Making The Most Of Temp Tables Part 4: Batch Mode

Le Big Zoom Zoom


When you have queries that need to process a lot of data, and probably do some aggregations over that lot-of-data, batch mode is usually the thing you want.

Originally introduced to accompany column store indexes, it works by allowing CPUs to apply instructions to up to 900 rows at a time.

It’s a great thing to have in your corner when you’re tuning queries that do a lot of work, especially if you find yourself dealing with pesky parallel exchanges.

Oh, Yeah


One way to get that to happen is to use a temp table with a column store index on it.

SELECT 
    v.UserId, 
    SUM(v.BountyAmount) AS SumBounty
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON  v.PostId = c.PostId
    AND v.UserId = c.UserId
GROUP BY v.UserId
ORDER BY SumBounty DESC;

CREATE TABLE #t(id INT, INDEX c CLUSTERED COLUMNSTORE);

SELECT 
    v.UserId, 
    SUM(v.BountyAmount) AS SumBounty
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON  v.PostId = c.PostId
    AND v.UserId = c.UserId
LEFT JOIN #t AS t 
    ON 1 = 0
GROUP BY v.UserId
ORDER BY SumBounty DESC;

Keep in mind, this trick won’t work if you’re on SQL Server 2019 and using in memory tempdb. But aside from that, you’re free to rock and roll with it.

If you end up using this enough, you may just wanna create a real table to use, anyway.

Remarkable!


If we look at the end (or beginning, depending on how you read your query plans) just to see the final times, there’s a pretty solid difference.

you can’t make me

The first query takes around 10 seconds, and the second query takes around 4 seconds. That’s a pretty handsome improvement without touching anything else.

Thanks for reading!

A Word From Our Sponsors


First, a huge thank you to everyone who has bought my training so far. You all are incredible, and I owe all of you a drink.

Your support means a lot to me, and allows me to do nice stuff for other people, like give training away for free.

So far, I’ve donated $45k (!!!) worth of training to folks in need, no questions asked.

Next year, I’d like to keep doing the same thing. I’d also like to produce a whole lot more training to add value to the money you spend. In order to do that, I need to take time off from consulting, which isn’t easy to do. I’m not crying poor, but saying no to work for chunks of time isn’t easy for a one-person party.

I’m hoping that I can make enough in training bucks to make that possible.

Because this sale is extra sale-y, I’ve decided to name it after the blackest black known to man.

From today until December 31st, you can get all 25 hours of my recorded training content for just $100.00. If you click the link below to add everything to your cart, and use the discount code AllFor100 to apply a discount to your cart.

Everything

Everything

Everything

Some fine print: It only works if you add EVERYTHING. It’s a fixed amount discount code that you need to spend a certain amount to have kick in.

Thank for reading, and for your support.

Columnstore Loading on Eight Socket Servers

I had a brief opportunity to do SQL Server workload testing on an eight socket server. It didn’t go well.

Sockets

I’ll give an extremely brief introduction to NUMA and sockets because I have a bit more free time these days. You’re probably reading this blog post on a machine with a single socket. All that means is that there’s a single CPU chip plugged into the motherboard. All of the RAM for the machine is local to that CPU chip. The recent king of single socket performance for SQL Server is the Intel 8280. 28 cores at 2.7 GHz with the latest instruction sets is quite a lot of processing power. There’s a newer chip out there now but it’s not clear to me if anyone can buy it yet.

What can you do if a workload can’t be scaled out and it needs more than the CPU power available with a single socket solution? A two socket solution could be the answer. A two socket server has two CPU chips plugged into different slots on a motherboard and each CPU has a local bank of RAM. Any part of memory can be access by either CPU but there is a performance penalty for foreign memory access, which is just accessing memory from a different CPU. See the simple diagram below:

As mentioned earlier, it is more expensive for CPU0 to access memory local to CPU1 compared to its own local memory. The performance penalty of NUMA will depend on the workload as well as the hardware. It can range from not noticeable at all to a serious issue that needs to be addressed in some way. In some situations this can be as easy as convincing your VM admin to change how the VM is set up. In others, your best bet may be to figure out a way to keep the workload on a single socket. Some folks advocate for single socket solutions compared to two socket solutions for OLTP workloads due to the NUMA penalty.

Perhaps you are wondering what happens if a workload needs more power available from a two socket server? Enter the four socket server. Four socket servers have four slots for CPU chips. An Intel 8280 processor in a four socket configuration gives you a total of 112 physical cores. That’s a lot. Four socket servers are more expensive than two socket servers and are significantly less common to find in data centers. Often, they are a custom order which makes months for the hardware vendor to build and deliver. In addition, the connections between CPUs are more complicated. Below are two common configurations:

Under configuration A, it takes two hops for CPU0 to access memory local to CPU3. That will increase latency more than going to CPU1 or CPU2. However, memory latency between CPUs won’t be the same for all pairs even in configuration B. In addition, it’s more complicated for SQL Server to try to manage memory on a four socket box. Even in the best case scenario for memory management, by design everything stored in SQL Server owned memory isn’t NUMA aware, such as the columnstore object pool. Nonetheless, if you need more CPU than a two socket server or VM can provide then you’re stuck living with the complexities of a four socket solution.

What if you need even more power? Hardware manufacturers once again have you covered with eight socket servers. These are even more expensive and uncommon than four socket solutions. Different vendors handle NUMA differently. There are simply more possible permutations with eight CPUs compared to four. I’m told that the following is an example layout for an eight socket machine:

I’m not a NUMA expert by any means. However, I think it’s fair to say that it looks a lot more complicated. There are many more instances where accessing foreign memory will take two hops instead of one. I also don’t see a way to carve out a four socket VM with all CPUs just one hop away. All in all, that diagram makes me nervous. As a final note, eight socket machines are not the limit of what’s possible, but I’ll stop here because I’ve never run code on anything larger than eight sockets.

Test Setup

I elected to use a high concurrency CCI insert workload to compare performance between a four socket VM and an eight socket VM. Quite conveniently, I already had a test columnstore workload that I knew pushed the SQL Server scalability limits in terms of memory management. To perform the threading I used the SQL Server Multi Thread open source framework. I wanted all sessions to go to their own schedulers. That could have been tough to manage with tests up to 200 threads but the threading framework handles that automatically.

For those following along at home, testing was done with SQL Server 2019 with LPIM and TF 876 enabled. Guest VMs were built with VMware with Windows Server 2019 installed. The four and eight socket VMs were created on the same physical host with about 5.5 TB of RAM available to the guest OS in both configurations.

Test Results

In case you didn’t click the earlier link, each job inserts about 10.4 million rows into a table with a clustered columnstore index. The smallest test ran just a single thread. The largest test on the four socket VM ran 100 concurrent jobs for a total insert volume of a billion rows. The largest test on the eight socket VM ran 200 concurrent jobs for a total insert volume of two billion rows. In all test scenarios, the four socket VM performed more work per second than the eight socket VM:

Quite a disappointing result. Double your SQL Server license fees for a slower overall system! There are the expected memory related wait stats for the 200 thread result:

I grabbed a sample of callstacks with PerfView:

Drilling into the top stack shows that decommitting memory is a scalability issue:

The stacks for the 100 thread test on the four socket VM results are a bit surprising by comparison:

Overall CPU reported by PerfView is the same. Perhaps this is a limitation of the tool. However, it really does seem like all of those extra cores simply aren’t helping under the eight socket configuration. Throughput goes down instead of up. I had very limited time with this machine and don’t want to repeat much of a previous blog post, so I’ll stop the technical analysis here. It’s likely that nothing can really be done to improve the workload except large configuration or hardware changes.

If I had to make this workload perform better, I would directly engage with the vendors. If that didn’t work, I would try a bare metal configuration, SQL Server on Linux, or running on an eight socket machine from a different vendor. I think that it’s fair to say that the test results would be different, but I can’t speculate as to whether they would be better or worse or what the difference would be.

Final Thoughts

We live in a wonderful era of hardware prosperity. With a big enough checkbook, you can run your SQL Server workloads on servers with hundreds of CPU cores. However, proceed with extreme caution if you’re considering moving a SQL Server workload to a server or VM with more than four sockets. It might not scale as well as you’re hoping. Thanks for reading!

The Trillion Row Operator

SQL Server friend and excellent webmaster Mr. Erik C. Darling showed me a query plan gone wrong:

48 billion rows for a single operator is certainly a large number for most workloads. I ended up completely missing the point and started wondering how quickly a query could process a trillion rows through a single operator. Eventually that led to a SQL performance challenge: what is the fastest that you can get an actual plan with at least one operator processing a trillion rows? The following rules are in play:

  1. Start with no user databases
  2. Any query can run up to MAXDOP 8
  3. Temp tables may be created and populated but all such work needs to finish in 30 seconds or less
  4. A RECOMPILE query hint must be present in the final query
  5. Undocumented features and behavior are fair game

Unless stated otherwise here, all testing was done on SQL Server 2017 CU20. The processor of my local machine is an Intel i7-9700K. If you’re following along at home you may see different results depending on your hardware. The queries are extremely CPU intensive. If you’d like to skip ahead to the winning query then scroll down the bottom.

Loop Join

A natural way to generate a lot of rows quickly is to cross join together two equally sized tables. Throwing a million rows into a temporary table takes very little time. Finding the count of the result set should meet the challenge requirements as long as the right optimizer transforms are not available. An initial attempt for data prep:

DROP TABLE IF EXISTS #loop_test;

CREATE TABLE #loop_test (
ID TINYINT NOT NULL
);

INSERT INTO #loop_test WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #loop_test (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8);

The query returns the expected result of 1 trillion, but there is no operator that processes one trillion rows:

The query optimizer uses the LocalAggBelowJoin rule to perform an aggregate before the join. This brings down the total query estimated cost from 1295100 units to 6.23274 units for quite a savings. Trying again after disabling the transform:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

Results in a new query plan:

This query meets the requirements of the challenge. The stream aggregate, nested loops, and row count spool operators will process 1 trillion rows each. However, it is unsatisfactory from a performance point of view. The row mode stream aggregate has a significant profiling penalty compared to a batch mode hash aggregate alternative. It is simple enough to create a dummy table to add batch mode eligibility for the query plan:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The new plan has a nice batch mode operator to do the counting:

With an actual plan request, the new query processes rows at roughly twice the rate as the row mode query. Perhaps query profiling does its work once per batch for batch mode operators and once per row for row mode operators. Each batch has around 900 rows so profiling might be much cheaper per row for the hash aggregate compared to the stream aggregate. Strictly speaking, the goal of this challenge is to optimize the amount of time it takes to get an actual plan. This might lead to different tuning results than optimizing for query completion time without an actual plan request.

The final time for the batch mode query is 2 hours and 12 minutes. I uploaded an actual plan to pastetheplan.com. The query has a CPU to elapsed time ratio of only 5.37. That means on that average during query execution, more than 2.5 CPU cores were not doing useful work. Getting all eight CPU cores to perform work as much as possible during query execution should allow the query to finish faster. The CPU imbalance in this case is caused by how rows are distributed to threads on the outer side of the nested loop operator:

The query engine has different algorithm choices to perform a parallel scan of a table. In this case, it doesn’t pick an algorithm that leads to optimal performance. The nested loop operator in total drives 42717 seconds of CPU work. It’s important to have that work as balanced between threads as possible. In this situation it is helpful to reduce the number of rows per page. To understand why I recommend watching this Pass Summit presentation by Adam Machanic. This results in better row distribution between threads and theoretically better performance. The following code only allows for a single row per data page:

DROP TABLE IF EXISTS #wide_1_million_rows;

CREATE TABLE #wide_1_million_rows (
ID TINYINT NOT NULL,
FILLER VARCHAR(4200) NOT NULL
);

INSERT INTO #wide_1_million_rows WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID, REPLICATE('Z', 4200) FILLER
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #wide_1_million_rows (ID) WITH FULLSCAN, NORECOMPUTE;

The new query plan generates in 1:29:40 with a CPU ratio of the query is 7.977 which is near perfect. Rows are quite balanced between threads:

I don’t know of a method to further optimize the nested loop join query. Most of the query’s execution time is spent on profiling. Sampled stacks from 15 seconds of query execution:

ETW tracing tools such as PerfView along with internals knowledge can be used to more finely break down how CPU is used by this query. Almost 75% of the query’s execution time is spent on sqlmin!CQScanProfileNew:GetRowImp and on methods called by it. The query only takes 23 minutes to complete without an actual plan which is quite close to 25% of the runtime with an actual plan. If you’re interested in more examples of using PerfView to draw conclusiosn about SQL Server consider checking out the list of links in this stack exchange answer.

Merge Join

A cross join can only be implemented as a nested loop join but the same end result can be emulated with a merge join. Let’s start with testing a MAXDOP 1 query. Setup:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #merge_serial;

CREATE TABLE #merge_serial (
ID TINYINT NOT NULL
);

INSERT INTO #merge_serial WITH (TABLOCK)
SELECT TOP (1000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE CLUSTERED INDEX CI ON #merge_serial (ID);

The query to run:

SELECT COUNT_BIG(*)
FROM #merge_serial t1 WITH (TABLOCK)
INNER JOIN #merge_serial t2 WITH (TABLOCK) ON t1.ID = t2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MERGE JOIN, MAXDOP 1, QueryRuleOff LocalAggBelowJoin);

I did not allow the query to run to completion because it would have taken around 25 hours to complete. On an 8 core machine with merge join, there’s no reason to expect that a parallel query could do any better than 25/8 = 3.125 hours. This makes merge join an inferior option compared to loop join so there isn’t a good reason to pursue further approaches that use a merge join.

For completeness, the best merge plan time I was able to get was 5 hours and 20 minutes. The plan avoids the scalability issues found with parallel merge and order preserving repartition streams. The end result is still poor. The merge join simply consumes a lot of CPU to do its work. The additional overhead added by the loop join is not helpful.

Row Mode Hash Join

Perhaps we’ll have more luck with a hash join compared to a merge join. After all, the query optimizer assigns a lower estimated cost for a row mode hash join compared to a merge join for the trillion row scenario. What could go wrong? The first thing that goes wrong is that the addition of the empty CCI will lead to a batch mode hash join for this query. Avoiding that requires changing the join to a type that doesn’t allow for batch mode execution. Among other methods, this can be accomplished by putting the hash join on the inner side of a nested loop or changing the join to also match on NULL. The second technique will be used at first because it leads to significantly less overhead compared to the loop join approach.

Starting again with a MAXDOP 1 test:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_maxdop_1;
CREATE TABLE #hash_maxdop_1 (
	ID TINYINT NULL
);

INSERT INTO #hash_maxdop_1 WITH (TABLOCK)
SELECT TOP (1000000) 0 
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_maxdop_1 (ID) WITH FULLSCAN, NORECOMPUTE;

The test query has a more complex join condition:

SELECT COUNT_BIG(*)
FROM #hash_maxdop_1 t1 WITH (TABLOCK)
INNER HASH JOIN #hash_maxdop_1 t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (MAXDOP 1, RECOMPILE)

Note how the join now includes rows where both sides are NULL. There are no such rows in the table but the query optimizer does not know that so a row mode join is used. The query completed in 5.5 hours. This means that a parallel hash join approach could be competitive with the nested loop join. An optimistic time estimate for a parallel query is 41 minutes.

Getting a parallel query isn’t as simple as changing MAXDOP from 1 to 8 will not have the desired effect. Take a look at the following query plan born from such an attempt:

The repartition stream operators have a partitioning type of hash. That means that a hashing algorithm is used to assign rows to different threads. All of the rows from the table have the same column value, so they will all hash to the same value and will end up on the same thread. Seven CPUs will have no work to do even though the query technically executes at DOP 8. This will not lead to performance that’s better than executing at DOP 1.

Through testing I found that the following values for a nullable TINYINT all go to different threads: 0, 1, 4, 5, 7, 9, 14, and 17. Note that I’m relying on undocumented details here that might change between releases. Inserting 250k rows of each value into one table and 500k rows of each value into another table should result in 8 * 250000 * 500000 = 1 trillion rows total. Code to do this is below:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #row_hash_join_repartition_8_build;
CREATE TABLE #row_hash_join_repartition_8_build (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_build
SELECT TOP (2000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_build (ID) WITH FULLSCAN, NORECOMPUTE;


DROP TABLE IF EXISTS #row_hash_join_repartition_8_probe;
CREATE TABLE #row_hash_join_repartition_8_probe (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_probe
SELECT TOP (4000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_probe (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #row_hash_join_repartition_8_build t1 WITH (TABLOCK)
INNER JOIN #row_hash_join_repartition_8_probe t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, HASH JOIN, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The query plan appeared after 42 minutes with a CPU ratio of 7.7. This was a better result than I was expecting. The repartition stream operators don’t seem to have much of a negative effect on overall runtime.

For completeness, I also tested a parallel hash join plan without the repartition stream operators. All of the hash join work is on the inner side of a nested loop join. The loop join adds significant overhead so this ends up being a losing method. If you’d like to see for yourself, here is one way to set up the needed tables:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_join_demand_62500_rows;
CREATE TABLE #hash_join_demand_62500_rows (
	ID TINYINT NOT NULL
);

INSERT INTO #hash_join_demand_62500_rows WITH (TABLOCK)
SELECT TOP (62500) ISNULL(CAST(0 AS TINYINT), 0) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_join_demand_62500_rows (ID);

As well as one way to generate such a query:

SELECT COUNT_BIG(*)
FROM
(VALUES
	  (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
) v(v)
CROSS JOIN (
	SELECT 1 c
	FROM #hash_join_demand_62500_rows t1 WITH (TABLOCK)
	INNER HASH JOIN #hash_join_demand_62500_rows t2 WITH (TABLOCK) ON t1.ID = t2.ID
) ca
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, FORCE ORDER, MAXDOP 8, MIN_GRANT_PERCENT = 99, NO_PERFORMANCE_SPOOL);

It might be difficult to understand what’s going on here, so here’s a picture of the query plan:

Work is split up into 256 buckets. Each row fed into the outer side of the nested loop join results in 62500 rows hash joined with 62500 rows.

The disappointing final result takes about 2 hours of runtime. This query might perform better on a busy system compared to the simple parallel hash join because work is assigned to threads somewhat on demand. In this case, the repartition streams query query requires equal work to be completed on all threads. However, the testing environment only has a single query running at a time.

The query pattern here is similar to a parallel apply. Perhaps you’ve heard good things about that pattern and are wondering why it didn’t work out here. In my experience, that query pattern does best when an order preserving repartition streams operator can be avoided, there’s a supporting index on the inner side, and the row count coming out of the inner side is relatively low. The above query is 0 for 3.

Batch Mode Hash Join

I’ll be honest. I expected batch mode hash join to be the winner. I was wrong. Batch mode parallelism works in a fundamentally different way than row mode parallelism. There is no repartition streams operator which assigns rows to different threads based on a hash or other algorithm. Rather, each operator takes care of parallelism by itself. In my experience this is generally quite advantageous for both performance and scalability, but there can be drawbacks.

We intend to get a batch mode hash join, so it seems logical enough to join together a CCI with one million rows. Sample code to do that:

DROP TABLE IF EXISTS #batch_mode_hash_equal_sized;

CREATE TABLE #batch_mode_hash_equal_sized (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #batch_mode_hash_equal_sized WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #batch_mode_hash_equal_sized (ID) WITH FULLSCAN, NORECOMPUTE;

The query to be run no longer needs the join to the empty CCI table:

SELECT COUNT_BIG(*) 
FROM #batch_mode_hash_equal_sized c1
INNER JOIN #batch_mode_hash_equal_sized c2 ON c1.ID = c2.ID
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

An actual plan is generated on my machine in about 27.5 minutes. While clearly better than all other attempts so far, the CPU to elapsed ratio suggests that significant improvement is possible: it is only 4.6 out of a possible 8.0. Batch mode processing is generous enough to provide a helpful wait stat. There is about 5607 seconds of the HTDELETE wait. Put simply, some of the threads of the hash join ran out of work to do and waited for the other threads to finish their work. As I understand it with batch mode joins, the probe side is what matters in terms of work imbalance. Row count per thread of the probe side:

The million row CCI table is split up into 9 chunks of rows which are divided between 8 threads which leads to many threads having no work to do for a while. Quite unfortunate. Oddly enough, switching to MAXDOP 9 could lead to much better performance for this query. That isn’t an available option on my local machine so I switched to testing on a VM with Intel 8280 sockets. The theory works out: the MAXDOP 8 query takes 50 minutes to complete and the MAXDOP 9 query takes 25 minutes to complete. At MAXDOP 9 the row distribution is quite nice on the probe side:

The rules for the different algorithms available to do parallel scans on CCIs are not documented. I previously investigated it here. Interestingly enough, this is an example where an exchange operator could make a positive difference. However, the batch mode hash join doesn’t have that as an option. While there are various tricks to achieve even row distribution from a CCI, it seems simplest to just replace the tables with row store. Converting 2 million rows from row mode to batch mode won’t add any measurable overhead. Using the same table that was used for the best loop join time:

SELECT COUNT_BIG(*) 
FROM #wide_1_million_rows c1
INNER JOIN #wide_1_million_rows c2 ON c1.ID = c2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin)

The new query plan finishes in just under 16 minutes. Row distribution per thread on the probe side is much more even:

You might be thinking that 16 minutes is a pretty good result. I too initially thought that.

The Winner

All of the previous approaches spent the vast majority of their time performing a join to generate rows from smaller tables. The scans of the underlying tables themselves were quite fast. Perhaps a better approach would be to avoid the join entirely. For example, consider inserting a billion rows into a table and scanning that table 1000 times with UNION ALL. That would send one trillion rows to a single operator without the expense of a join. Of course, the downside is that SQL Server would need to read one trillion rows total from a table. It would be important to make the reading of rows as efficient as possible. The most efficient method I know is to create a columnstore table with extreme compression.

The title of this section gives it away, but let’s start with a small scale test to see if the approach has merit. Code to load 100 million rows into a CCI:

DROP TABLE IF EXISTS #count_test;

CREATE TABLE #count_test (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #count_test WITH (TABLOCK)
SELECT TOP (100000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #count_test (ID);

SQL Server is able to compress this data quite well because there’s just a single unique value. The table is only 1280 KB in size. The data prep code does exceeds the 30 second time limit on my machine due to the statistics creation, but I’ll ignore that for now because it’s just a trial run. I can avoid all types of early aggregation with the following syntax:

SELECT COUNT_BIG(*)
FROM
(
	SELECT ID
	FROM #count_test
	UNION ALL
	SELECT ID
	FROM #count_test
) q
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER)

Picture of the query plan:

On my machine, that query takes about 41 ms to process 200 million rows. Based on that, it seems reasonable to expect that a query that processes a trillion rows could finish in under four minutes. There is a balancing act to perform here. Putting more rows into the table makes it harder to complete data prep within 30 seconds, but cuts down on the number of UNION ALLs performed which reduces query compilation time.

The approach that I settled on was to load 312.5 million rows into a table. Reading that table 3200 times results in a trillion total rows. 3200 is a convenient number to achieve with nested CTEs and I’m able to load 312.5 million rows quite reliably under 30 seconds even with MAXDOP 4. I can’t say that this is the best possible approach but it seems to work fairly well.

The last detail in terms of data prep is to deal with statistics creation. SQL Server forces a sample rate of 100% for very small tables. Unfortunately, extremely compressed columnstore indexes can fall into the 100% range even if they have a high row count. Creating a statistic object with a sample of 312.5 million rows will take longer than 30 seconds. I don’t need good statistics for query performance so I work around the issue by creating a statistic with NORECOMPUTE after an initial load of 500k rows. Code to create and populate a CCI with 312.5 million rows:

DROP TABLE IF EXISTS #CCI_1M_ONE_COLUMN;

CREATE TABLE #CCI_1M_ONE_COLUMN (ID TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

DECLARE @insert_count TINYINT = 0;
WHILE @insert_count < 8
BEGIN
	INSERT INTO #CCI_1M_ONE_COLUMN
	SELECT TOP (125000) 0
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	OPTION (MAXDOP 1);

	SET @insert_count = @insert_count + 1;
END;

DROP TABLE IF EXISTS #C;

CREATE TABLE #C (I TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

INSERT INTO #C WITH (TABLOCK)
SELECT TOP (500000) 0
FROM #CCI_1M_ONE_COLUMN
OPTION (MAXDOP 1);

CREATE STATISTICS S ON #C (I) WITH NORECOMPUTE, FULLSCAN;

WITH C13 AS (
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
)
, C312 AS
 (
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
)
INSERT INTO #C WITH (TABLOCK)
SELECT 0
FROM C312
OPTION (MAXDOP 4, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);


DBCC TRACEON(10204);

ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

DBCC TRACEOFF(10204);

The above code isn’t as efficient as possible but it gets the job done. It also provides a preview of the syntax that is used to query the table 3200 times. A nested CTE approach is used instead of writing out UNION ALL 3199 times. Trace flag 10204 is used because there’s no need to act on rows outside of the delta rowgroups.

Query compile time can be quite significant for this query. On an older PC while running these queries, I could determine when the query compile ended by a change in fan noise. The new additions to the query hints are there to minimize query compile time as much as possible. This is important when a table is referenced 3200 times in a query. I’m told that there are undocumented trace flags which can reduce query compile time even more but I’ll leave that as an exercise to the reader. Here’s the query text of the UNION ALL approach:

WITH C8 AS (
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I  FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
)
, C160 AS
 (
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
)
, C3200 AS
(
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
)
SELECT COUNT_BIG(*)
FROM C3200
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);

The good news is that generating an actual plan takes about 3 minutes on my machine. The bad news is that uploading a 27 MB query plan is a lot more difficult than I thought. I ended up loading it to my personal website. You can view all of the XML in its glory here.

For a method more friendly to my personal bandwidth, you can look at the screenshot below:

I don’t know of a more faster way to complete the challenge. Without an actual plan, the query takes about 63 seconds to complete. However, the query takes about 3 minutes to complete on SQL Server 2019, even without requesting an actual plan. The difference is caused by lightweight query profiling in SQL Server 2019, which is turned on by default. I find query profiling to be very useful in general for evaluating performance issues in real time and for estimating how long a query will take to complete. However, it can still add significant overhead for certain types of queries. The difference is quite apparent in the call stacks:

This is why I did all of my testing for this blog post on SQL Server 2017, in case you’ve been wondering that.

Final Thoughts

Summary of all attempts:

The results are quite specific to the test scenario: they should not be used to draw conclusions about general join performance. As mentioned earlier, in a few cases the bottleneck is the query profiling infrastructure instead of typical aspects of query performance. If anyone out there knows of a faster technique than those tested here, I would be delighted to learn about it. Thanks for reading!

SQL Server 2019 Disappointment: sp_estimate_data_compression_savings

Missed It The First Time


This is documented, but I know a lot of people are allergic to documentation, whether it’s writing or reading.

In SQL Server 2019:

  • Exciting stuff: In memory tempdb!
  • Exciting stuff: sp_estimate_data_compression_savings can evaluate columnstore compression!
  • Disappointing stuff: If you use in memory tempdb, you can’t have any columnstore anything in tempdb

That means if you’re using sneaky tricks like clustered columnstore indexes on temp tables to induce batch mode, you’re gonna get a lot of errors.

Likewise, you won’t be able to evaluate if columnstore will help your tables.

EXEC sp_estimate_data_compression_savings 
     @schema_name = 'dbo',
     @object_name = 'Badges',
     @index_id = 1,
     @partition_number = 1,
     @data_compression = 'COLUMNSTORE';

Msg 11442, Level 16, State 1, Line 4
Columnstore index creation is not supported in tempdb when memory-optimized metadata mode is enabled.

There’s no workaround for this, either. You can’t tell it to use a different database, this is just the way it’s built.

Hopefully in the future, there will be more cooperation between these two features.

Thanks for reading!

TABLOCK Isn’t Always Better

Utility


There are a lot of good reasons to use tablock on your inserts, as long as you can handle the reduced concurrency of a table level lock.

You may get minimal logging here, here. And you may even get parallel inserts. You may even avoid weird compression side effects.

All very good reasons to vie for the affections of tablock hints!

You should do that.

You should try.

Because Sometimes


You may run into oddball scenarios where the results without tablock are better.

For example, I was experimenting with clustered column store temp tables for another post, and I came across this oddity.

Assume two identical inserts, but one has a tablock hint and one doesn’t.

DROP TABLE IF EXISTS #t_cci;
GO 

CREATE TABLE #t_cci (UserId INT, VoteTypeId INT, CreationDate DATETIME, INDEX c CLUSTERED COLUMNSTORE);

INSERT #t_cci WITH (TABLOCK) -- A tab of locks
    ( UserId, VoteTypeId, CreationDate )
SELECT v.UserId, v.VoteTypeId, MAX(v.CreationDate)
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
GROUP BY v.UserId, v.VoteTypeId;

DROP TABLE IF EXISTS #t_cci;
GO 

CREATE TABLE #t_cci (UserId INT, VoteTypeId INT, CreationDate DATETIME, INDEX c CLUSTERED COLUMNSTORE);

INSERT #t_cci -- A lack of tabs
    ( UserId, VoteTypeId, CreationDate )
SELECT v.UserId, v.VoteTypeId, MAX(v.CreationDate)
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
GROUP BY v.UserId, v.VoteTypeId;

The plans for those queries, in that order, look like this:

Grey Skies

The fully parallel query takes 4.6 seconds, and the serial insert query takes 1.7 seconds.

Even more strange, the insert with tablock leaves four open rowgroups, and the non-tablock query has one compressed rowgroup.

It was not a dream.

Adding to that, using sp_spaceused on the temp tables, we can see some important differences.

neener

The uncompressed temp table is much larger, has more unused space, and has… index size?!

Yep, because open row groups, or delta stores, are stored as b-trees.

the columnstore index might store some data temporarily into a clustered index called a deltastore

I believe this happens with the parallel insert, because each thread inserts ~75k rows, which is fewer than the minimum 102,400 needed to qualify for skipping the delta store and being compressed. A simple way to test this is to run the tablock query at MAXDOP 2.

Hoowee

That ends up being a little faster than the serial insert, and also gets full compression.

Each thread is ~150k rows and gets compressed. Hooray.

Looking at a PerfView diff of the good and bad samples, this path is only present in the slow run:

Step by step

When inserting to the delta store, we need to acquire exclusive latches like just about any b-tree insert, and reserve log space to do so.

Morality Hilarity


To summarize the problem:

  • The TABLOCK hint gets us a fully parallel insert plan
  • There aren’t enough rows on each parallel thread to get row group compression
  • We need to insert into a clustered index behind the scenes that makes up the delta store
  • Which results in threads waiting on exclusive latches and reserving log space

Will you ever hit this? Maybe someday.

This will also happen with real tables (not just temp tables), so I suspect it’s a potentially common scenario for column store users.

You get all worked up getting these parallel inserts and then your compression sucks.

Thanks for reading!

No Column Store Indexes With In-Memory Tempdb? No Problem!

Old Habits


Let’s say you’re on SQL Server 2019. No, seriously. It’s been out for a couple weeks now.

You could be.

I say that you could be because you’re the kind of brave person who tries new things and experiments with their body server.

You may even do crazy things like this.

Stone Cold


CREATE TABLE #t ( id INT, INDEX c CLUSTERED COLUMNSTORE );

SELECT    COUNT_BIG(*) AS records
FROM      dbo.Users AS u
JOIN      dbo.Posts AS p
    ON u.Id = p.OwnerUserId
JOIN      dbo.Comments AS c
    ON u.Id = c.UserId
LEFT JOIN #t AS t ON 1 = 0;

Woah ho ho. What happened there? A #temp table with a clustered column store index on it left joined on 1 = 0?

Yes. People do this.

People do this because it’s getting some batch mode operations “for free”, which have the nasty habit of making big reporting queries run a lot faster.

Yonder Problem


When you enable 2019’s new in memory tempdb, which can really help with stuff tempdb needs help with, you may find yourself hitting errors.

Msg 11442, Level 16, State 1, Line 14
Columnstore index creation is not support in tempdb when memory-optimized metadata mode is enabled.
Msg 1750, Level 16, State 1, Line 14
Could not create constraint or index. See previous errors.

The good news is that this works with *real* tables, too.

CREATE TABLE dbo.t ( id INT, INDEX c CLUSTERED COLUMNSTORE );

SELECT    COUNT_BIG(*) AS records
FROM      dbo.Users AS u
JOIN      dbo.Posts AS p
    ON u.Id = p.OwnerUserId
JOIN      dbo.Comments AS c
    ON u.Id = c.UserId
LEFT JOIN dbo.t AS t ON 1 = 0;

And you can get plans with all sorts of Batchy goodness in them.

Long way from home

Yeah, you’re gonna have to change some code, but don’t worry.

You’re the kind of person who enjoys that.

Right?

Thanks for reading!

Stack Overflow Demo Database Column Store Edition: Helpful Links

Sniff


If you’re starting from scratch with column store, here are some links that’ll help you get a better understanding of how they work, and what they’re good for, start here.

For some background information on column store indexes, see these:

For a detailed look at column store from Niko Neugebauer:

For scripts to help you analyze column store indexes:

Joe Obbish’s posts here:

Thanks for reading!

Stack Overflow Demo Database Column Store Edition: Exploring Relationships

Special Relativity


To get you started exploring the Stack Overflow column store database, here are some queries that show how tables are related.

The two main relationships are User Id, and Post Id.

User Id

/*User Id*/
SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Badges AS b
    ON b.UserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Comments AS c
    ON c.UserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Votes AS v
    ON v.UserId = u.Id;

Post Id

/*Post Id*/
SELECT COUNT_BIG(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Comments AS c
    ON c.PostId = p.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id;

Note-ry


A couple things to note, here:

  • Joining Users to Votes is unreliable, because most of the voting is anonymized in the dump
  • Things get much more interesting when you start building queries within relationships

For example, using the User Id columns in tables that aren’t users to join larger tables together, or joining Comments to Votes on Post Id.

You can really start to feel your CPU fans.

Thanks for reading!