Stack Overflow Demo Database Column Store Edition: Helpful Links

Sniff


If you’re starting from scratch with column store, here are some links that’ll help you get a better understanding of how column store indexes work, and what they’re good for.

For some background information on column store indexes, see these:

For a detailed look at column store from Niko Neugebauer:

For scripts to help you analyze column store indexes:

Joe Obbish’s posts here:

Thanks for reading!

Stack Overflow Demo Database Column Store Edition: Exploring Relationships

Special Relativity


To get you started exploring the Stack Overflow column store database, here are some queries that show how tables are related.

The two main relationships are User Id and Post Id.

User Id

/*User Id*/
SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Badges AS b
    ON b.UserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Comments AS c
    ON c.UserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Votes AS v
    ON v.UserId = u.Id;

Post Id

/*Post Id*/
SELECT COUNT_BIG(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Comments AS c
    ON c.PostId = p.Id;

SELECT COUNT_BIG(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id;

Note-ry


A couple things to note, here:

  • Joining Users to Votes is unreliable, because most of the voting is anonymized in the dump
  • Things get much more interesting when you start building queries within relationships

For example, using the User Id columns in tables that aren’t users to join larger tables together, or joining Comments to Votes on Post Id.
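That Comments-to-Votes join, as a quick sketch. Both tables carry a Post Id column, so Posts doesn’t even need to appear in the query:

```sql
/*
  Joining two non-Users tables on their shared Post Id.
  This fans out per post, which is what makes it expensive.
*/
SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON v.PostId = c.PostId;
```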

You can really start to feel your CPU fans.

Thanks for reading!

Stack Overflow Demo Database Column Store Edition: Downloads And Links

Gett Itt


If you want to download the database, here’s the magnet link for the torrent. I don’t have another means of distributing this; it’s too big of a file.

If you want the GitHub scripts to create and load data, head over here.

Info


If you’re downloading the database, it’s about an 80 GB backup file that needs to be restored to SQL Server 2017 or higher. It expands to a database that’s about 160 GB. It’s not the biggest database in the world, but it’s a good starting place to learn about column store. You can always make it bigger, if you want.

Current table row counts:

  • Badges: 811,264,500
  • Comments: 1,885,946,200
  • Posts: 1,147,995,425
  • Users: 273,307,375
  • Votes: 4,439,806,603

For more information about the schema, head over here.

Downloading vs. Building


If you’re comfortable with a database of that size relative to your hardware, then downloading is fine. The computers I use it on have 64-128 GB of RAM.

Some people may want to build their own and find a size that better fits their hardware, which is where the create and build scripts make more sense. I wouldn’t wanna see you trying to query tables of this size on a laptop with a VM admin amount of RAM (say 16GB or less).

Scripts To Help You Explore Column Store


Great scripts to help you look at what SQL Server’s DMVs have to say about column store indexes live here: Columnstore Indexes Scripts Library

And of course, poke around this site for Joe Obbish’s posts about column store, along with Niko’s site for his material.

Thanks for reading!

Stack Overflow Demo Database Column Store Edition: Making It Bigger

Stack Over Sharing


Rather than an F5 script to do this, I decided to make a stored procedure out of it. It’s in good shape, but if you find any problems or have any requests, head over to my GitHub repo to file an issue.

One thing I’d love feedback on is advancing dates. Right now, the script doesn’t do that at all. I thought of different ways to handle it, but didn’t like any of them, mostly because of how it might mess with the partitioning function I’m using. I felt like I was overthinking it quite a bit, and decided to leave dates as-is, and only increment User and Post Ids.

A quick note: This script assumes that a database called StackOverflow will be the source of the loads. If you need to use a different version, that’s a manual change. I didn’t want to go down the dynamic SQL route here until I gauged popularity.

Options


The stored procedure has relatively few options.

@loops INT = 1
@truncate_tables BIT = 1
@rebuild_when_done BIT = 1
@count_when_done BIT = 1

In order: how many loops you want to run, whether you want to start fresh by truncating tables first, whether you want to rebuild indexes after you’re done loading, and whether you want to get a count from each table at the end.

These are included so you can decide what fits your experiment. For instance, you might want to:

  • Start fresh, and see what the DMVs say about column store compression without rebuilding
  • Load on top and then see what they say without rebuilding
  • Do the opposite
  • Make additional bullet points

Iter and Iterate Walked Into A Store


Without being too explain-y, the way the script works is to:

  • Figure out if we need to increment Ids
  • Go through each of the main tables (Badges, Comments, Posts, Users, Votes) and insert the contents into the CCI version
  • Do this for as many loops as you specify

At the end, the script does some cleanup of things that shouldn’t exist, then rebuilds indexes and gets counts if you asked for them.
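The loop can be sketched roughly like this. To be clear, this is a simplified stand-in for the real procedure, not its actual code: the column list is abbreviated, and the max-Id offset is my assumption about how the Id incrementing works.

```sql
/*
  Simplified sketch of the load loop -- not the real procedure.
  Each pass shifts Ids by the current max so freshly copied rows
  don't collide with rows already loaded.
*/
DECLARE @loop INT = 0,
        @loops INT = 2,
        @id_offset BIGINT = 0;

WHILE @loop < @loops
BEGIN
    /* Figure out if we need to increment Ids */
    SELECT @id_offset = ISNULL(MAX(u.Id), 0)
    FROM dbo.Users AS u;

    /* Insert the source table into the CCI version, shifted.
       The same pattern repeats for Badges, Comments, Posts, Votes. */
    INSERT dbo.Users WITH (TABLOCK)
        (Id, DisplayName, CreationDate, Reputation)
    SELECT u.Id + @id_offset,
           u.DisplayName,
           u.CreationDate,
           u.Reputation
    FROM StackOverflow.dbo.Users AS u;

    SET @loop += 1;
END;
```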

Examples


This doesn’t require too much, but here you go:

EXEC dbo.make_big_stack @loops = 25, @truncate_tables = 1, @count_when_done = 1, @rebuild_when_done = 1;

EXEC dbo.make_big_stack @loops = 25, @truncate_tables = 0, @count_when_done = 1, @rebuild_when_done = 1;

In tomorrow’s post, I’ll give you download links and some more details about the database.

Thanks for reading!

Stack Overflow Demo Database Column Store Edition Introduction

In The Way Back Machine


I really wanted a version of the Stack Overflow data dump that was all clustered column store. In SQL Server 2016, that didn’t really work because of restrictions around MAX data types. In 2017 it did, but… If your data warehouse has a bunch of max data type columns, that’s bad and you should feel bad.

The problem here is that once you drop out the “big” columns (AboutMe from Users, Text from Comments, Body from Posts), the entire ~300GB database compressed down to about 6GB. That means if we want a realistically sized data warehouse, we’d need a “Make Big” script, like people used to use for Adventure Works before it went out of business.

This week I’m going to talk about that process, and share links to download the examples so you can mess with them, or create your own.

Today, I’m going to talk about some of the design considerations, and what the initial setup script does, if you want to build your own version.

Hand In Hand


“Large” column store tables should be partitioned. While no one in their right mind would consider partitioning to be a performance feature with row store indexes, it can be beneficial to column store indexes. One of the first things I create after the standard database script-out is a partition function and scheme.

    CREATE PARTITION FUNCTION pfunc (DATETIME)
    AS RANGE RIGHT FOR VALUES
    (
        N'2007-01-01T00:00:00.000',
        N'2008-01-01T00:00:00.000',
        N'2009-01-01T00:00:00.000',
        N'2010-01-01T00:00:00.000',
        N'2011-01-01T00:00:00.000',
        N'2012-01-01T00:00:00.000',
        N'2013-01-01T00:00:00.000',
        N'2014-01-01T00:00:00.000',
        N'2015-01-01T00:00:00.000',
        N'2016-01-01T00:00:00.000',
        N'2017-01-01T00:00:00.000',
        N'2018-01-01T00:00:00.000',
        N'2019-01-01T00:00:00.000',
        N'2020-01-01T00:00:00.000'
    );
The years here go up to 2020. This covers you in case your source database is either the full size version, or the 2010 or 2013 version.
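If you want to sanity check which partition a date lands in, $PARTITION works against the function directly. Since the function is RANGE RIGHT, each boundary date belongs to the partition on its right:

```sql
/* Which partition does a date land in? Partition 1 is everything
   before 2007, and each boundary date starts a new partition. */
SELECT $PARTITION.pfunc('2013-06-15') AS partition_number;
/* Returns 8: the [2013-01-01, 2014-01-01) partition */
```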

The scheme I create puts everything on the primary filegroup. Since the bones of this database are a backup/restore, it has four files, but they’re all in the primary filegroup. You’re welcome to change that, but I don’t find it necessary.

CREATE PARTITION SCHEME pscheme AS PARTITION pfunc ALL TO ([PRIMARY]);

Also I’m a bit lazy.
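For illustration, here’s roughly what a table built on that scheme looks like, with an inline clustered column store index. The column list is abbreviated, so treat it as a sketch rather than the actual table definition:

```sql
/*
  Sketch of a partitioned clustered column store table on pscheme.
  Column list is abbreviated; the real table has more columns.
*/
CREATE TABLE dbo.Votes
(
    Id INT NOT NULL,
    PostId INT NOT NULL,
    UserId INT NULL,
    VoteTypeId INT NOT NULL,
    CreationDate DATETIME NOT NULL,
    INDEX cci_Votes CLUSTERED COLUMNSTORE
) ON pscheme (CreationDate);
```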

Swimmin’ In Synonyms With Their Own Condominiums


I did something kind of goofy at first. When I was experimenting with doing this, everything was in one database. So uh, I suffixed all the column store tables with “_cs”.

That turned out to be really annoying when running different demo scripts against this database, because I’d have to change all the names. To get around that, I created synonyms, but that felt hacky too.

For instance, any time I needed to write a DMV query that referenced a table, I’d screw up and reference the synonym, which doesn’t quite work as well as you’d hope. By that I mean not at all.

In the final version, all object names match those in other versions of the Stack Overflow database.

Tomorrow


In tomorrow’s post, I’ll show you parts of the script that use a StackOverflow database of your choice as a source to build up the column store version.

Thanks for reading!