Normalization – Darling Data

One Who Overflew

In the Stack Overflow database, the biggest (and probably most important) table is Posts.

The Comments table should be truncated hourly. Comments were a mistake.

But the Posts table suffers from a serious design flaw in the public data dump: Questions and Answers are in the same table.

I’ve heard that it’s worse behind the scenes, but I don’t have any additional details on that.

Aspiring Aspirin

This ends up with some weird data distributions. Certain attributes can only ever be “true” for a question or an answer.

For example, only questions can have a non-zero AcceptedAnswerId, or AnswerCount. Some questions might have a ClosedDate, or a FavoriteCount, too. In the same way, only answers can have a ParentId. This ends up with some really weird patterns in the data.

Was it easier at first to design things this way? Probably. But introducing skew like this only makes dealing with parameter sniffing issues worse.

Even though questions and answers are the most common types of Posts, they’re not the only types of Posts. Even if you make people specify a type along with other things they’re searching for, you can end up with some really different query plans.

Designer Drugs

When you’re designing tables, try to keep this sort of stuff in mind. It might not be a big deal for small tables, but once you realize your data is getting big, it might be too late to make the change. It’s not just a matter of changes to the database, but the application, too.

Late stage redesigns often lead to the LET’S JUST REWRITE THE WHOLE APPLICATION FROM THE GROUND UP projects that take years and usually never make it.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Chalky

I see a lot of tables that look something like this:

CREATE TABLE dbo.orders
(
    order_id int NOT NULL PRIMARY KEY
         DEFAULT (NEXT VALUE FOR dbo.order_id),
    order_date datetime NOT NULL,
    order_ship_date datetime NOT NULL,
    order_total money NOT NULL,
    order_tax money NOT NULL,
    customer_id int NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250)
);

Looking at the design, there are two big problems:

There are “order” columns that are going to get a lot of inserts and updates
You’re going to be storing the same customer information over and over again

The more related, but not independent, data you store in the same table, the harder it becomes to effectively index that table.

A while back, I blogged about Tables Within Tables, but uh… surprisingly, the problem still exists! Usually when I blog about something, the problem disappears. Hm.

Better But Not Best

If we follow some practical guidance and put named columns into their own table, we end up with this:

CREATE TABLE dbo.orders
(
    order_id int NOT NULL PRIMARY KEY
         DEFAULT (NEXT VALUE FOR dbo.order_id),
    order_date datetime NOT NULL,
    order_ship_date datetime NOT NULL,
    order_total money NOT NULL,
    order_tax money NOT NULL,
    customer_id int NOT NULL
);

CREATE TABLE dbo.customers
(
    customer_id int NOT NULL PRIMARY KEY
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250)
);

This is a better scenario, because we only store customer information once, and inserts/updates to order information don’t impact people working with customer data.

But this still isn’t great — what if a customer wants to send an order to a different address?

If we wanted to store everything in this table, we’d be breaking other practical rules: we’d have to have multiple rows for users, or we’d have to add columns columns to the table to deal with multiple addresses. That’s a mess both for people who don’t use all those extra columns, and for people who might have half a dozen addresses they send to.

Getting There

A better way to phrase the customer table might be like this:

CREATE TABLE dbo.customers
(
    customer_id int NOT NULL PRIMARY KEY
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    default_fullname nvarchar(250),
    default_street nvarchar(250),
    default_street_2 nvarchar(250),
    default_city nvarchar(250),
    default_state nvarchar(250),
    default_zip nvarchar(250),
    default_country nvarchar(250)
);

Most of the time, people are going to send stuff to one address — call it home if you want. It’s probably also their billing address, so it makes sense for it to be the default, and to have it be the first choice.

Then we’ll have a table of EAV data that looks like this:

CREATE TABLE dbo.customers_address_book
(
    address_id int NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.address_id),
    customer_id int NOT NULL,
    address_type tinyint,
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250),
    CONSTRAINT pk_cab_id PRIMARY KEY (customer_id, address_id)
);

In a table like this, whenever a customer ships to a non-default address it gets stored off here. Now customers can have as many addresses as they want to choose from without us having to have an extra bloated table of default information plus non-default information.

Because of the way this data is modeled, we don’t need to keep adding columns to accommodate multiple addresses. We just tack rows on, and since this data isn’t likely to get updated the insert/select pattern should end up with minimal blocking.

Tomato Sauce

I know, horrifying. You might have to write a join. You poor, downtrodden developer.

Of course, this makes the most sense when you’re dealing with OLTP workloads. And sure, a lot of these columns probably don’t need to be as long as they are, but that’s a totally different post.

When you’re dealing with reporting data, de-normalizing is generally preferred. Though if you’re doing serious reporting and using column store indexes, I’d probably wanna keep the strings out as much as possible, and just key back to them in other tables. Yuck.

Have I ever mentioned that strings in databases were a mistake?

Thanks for reading!

Category: Normalization

How Avoiding Normalization Contributes To Skewed Data In SQL Server Tables

One Who Overflew

Aspiring Aspirin

Designer Drugs

Going Further

Normalizing SQL Server Tables To Reduce Query Blocking

Chalky

Better But Not Best

Getting There

Tomato Sauce

Going Further