Our demo database has the same data in two tables
One is partitioned, the other is not.
Want to run the demos?
Instructions on how to get the sample database and run the scripts are in the next lesson. Onward!
Transcript
Let me give you a quick introduction to the data that we will be using in this course.
Meet the sample data
We’ll be using sample data that contains baby names. This data comes free from the Social Security Administration in the United States, and it’s aggregate data that we get from them. Every year they put out a list that says the list of names and then the number of babies total that got each name. They report for names that had five or more babies born with that name.
I’ve faked out some data
Now, that data is pretty compact for each year and we want a fair amount of data to test with partitioning. To create some larger tables to have fun with, I transformed that data into some tables that have one row per baby born.
Almost as if every time a baby was born in a hospital, you know, it transmitted the information to a centralized service that inserted a row into the database. As far as I know, that doesn’t really happen. Where you see the word ‘fake’ on a column, that is data that I have made up. Just to be clear, when you see the word ‘fake’, it’s really fake.
We have two tables that we’ll be comparing
The table in the dbo schema is not partitioned. The table in the pt schema is partitioned into years.
They have the same columns and the same data in them.
They each have about 159 million rows.
Their clustered indexes are 3.7 GB. They have the same definition of unique clustered index, too.
I wanted to try to keep things very, very similar between the tables, because I’m not trying to cheat on my performance here, right? I really want to compare the tables in a very similar state, except we have partitioned the one on the right.
The partitions are not lumpy. They’re pretty evenly sized. Each year has between 2.7 million and 3.3 million rows and they’re not very big.
This is a pretty tiny table.
Here is a look at the partition function
The partitioning column is FakeBirthDateStamp, which is DATETIME2(0) .
When I created the partition function I gave it a list of values, each of them one year apart.
Each of the tables has a nonclustered index on the BirthYear column. BirthYear is a computed column and sometimes we’re going to query against the BirthYear for a table.
We have a columnstore index just on the new (partitioned) table
We decided that when we’re going to test out partitioning, we’re also going to test out columnstore nonclustered indexes for fun, because why not try out all the things?
I will show you how against the new table we can say, “Hey we don’t want to test the nonclustered columnstore index.” It’s really easy when you’re testing queries to throw a hint in and say, “Okay, I don’t want to use that now, I want to compare how you do without the columnstore, and THEN how you do with the columnstore.”
The nonclustered columnstore index does not contain the BirthYear column though, because it’s not allowed. In my current version of SQL Server, you cannot put computed columns in your nonclustered columnstore index. So I just couldn’t put it in there.
It is a pretty dinky little table, like I said. Now, it does meet the basic standard for partitions in a columnstore index. You want to have at least a million row per partition, so I’m not SO small that I don’t qualify for the columnstore, but I wanted to keep the database in a relatively reasonable size for you to be able to download and play with, without it taking hours for you to download.
I tried to keep the tables really narrow and relatively lean, but still big enough that we can test queries who are faster against the nonpartitioned table, and get slower when we run the same query against the partition table.
Let’s dig in and look at some of those problem queries.