A tour of the Popular Names query (7 minutes)

This video doesn’t spoil the solutions! In this video, I outline the challenge, then step through the challenge query. If you’re already comfortable diving into the TSQL challenge script, feel free to skip this video.

Transcript

-Welcome to the SQL Challenge: Speeding up the Popular Names Query. I’m Kendra Little from SQLWorkbooks.com.

The query that we’re tuning today ranks baby names in popularity for a given year

We specify a year, and it gives us the top 10 names for that year, and then it also says, what’s the rank 10 years later, that uses an inner join for that. In addition, it uses outer joins to look at the rank 10 years prior, 20 years prior, and 20 years later if those rows exist. Here is an example of query results when we pass in the year 1991.

You can see that Michael is ranked number one, and he’s held that rank for a while. He was number one 20 years ago, as well 10 years ago, but in 1992, he drops down to rank number two. And 20 years after this, he is down at rank number 10, but he’s still in the top 10 20 years later. If we look at other names, names like Ashley, she may be ranked number three in 1991, but she’s come a long way because, 20 years before, she was ranked 439, but Ashley, her popularity as a name does go down, and she doesn’t stay in the top 10 for too long.

So, this is our example dataset for 1991, and…

Our challenge isn’t to touch indexes, we only want to be changing the TSQL, and we want to make it so that query uses less then 500 logical reads when run for the year 1991, we do want it to return the same result.

As we walk through sample solutions for this, you’re going to notice that there’s different types of tuning we’re going to do here.

We going to for look for quick fixes. Is there a way we can just quickly make this work, even if it has some downsides? And we’ll talk about what those downsides are. We’ll then look for a more scalable longer term solution.

Can we get the query under 500 reads without using those quick fixes? And we’ll step through a real-world query tuning process to get to that more scalable fix.

I’ve already downloaded the sample database for this

This is the exact version of the sample database that you’ll be seeing in this script. We do get new baby names every year, and I am about to update this with names from 2017 that have just been released. But, in this current version of the database in Release 1.2, we only have names through 2016. I’ve already run this code here that restores the database and makes sure that the database compatibility level is set to 140. All of that has already completed successfully.

Before we start tuning, we’re going to just take a quick tour through the popular names query. The way the query is written, it has a local variable at the top called YearToRank, and then we’ve got a common table expression named rankbyyear at the top. The common table expression rankbyyear has a query in it that pulls all the rows from agg.FirstNameByYear, and it performs a rank on them. It looks at the row, and then it says, okay, in the context of this table, I’m going to group or partition the data by ReportYear. And for each ReportYear, I am going to order the data by NameCount Descending.

I’m going to do this for a Rank function, so I am ranking for each report year ordered by names count descending. So the name counts who are the highest, the names that had the most babies, use that name, they will be at the top. So rankbyyear, our common table expression, is referenced in our query multiple times.

Our FROM: we are selecting From rankbyyear, aliasing it as startyear. For startyear, we are returning columns from that, we’re retuning the gender from it, we’re returning the rank from that startyear, and then, also, we have some predicates on it down at the bottom if the query. We are looking at rows where the startyear.ReportYear is equal to YearToRank. We are specifying 1991 as the value for the local variable of YearToRank, and further, we’re saying, we want the names who, in startyear, their rank was in the top 10. Now, we are joining to other versions of that CTE.

We’re saying join a rankbyyear as ten_years_later as well, and for this one who’s 10 years later, startyear.ReportYear plus 10 is what we want that to be. So we want to look at the table and say, okay, we want to go 10 years ahead and look at the rank there. And for 10 years later, that is what we are returning for 10 years later, the rank that year, and then we’re aliasing that column as Rank 10 years later.

Let’s go ahead and give our query run

We haven’t done any tuning out, we’re just looking at the problem query here, and we’re using the set statistics IO statement at the top of this session to say, when you run the query, after you complete this, this is Set Statistics IO. I’ve also said, turn on Statistics Time as well. This tells its SQL server, please print to the Messages tab.

This is the Statistics IO portion in here. It tells us how many logical reads it did. So, we’re trying to get this under 500 logical reads, just against our reads for the first name by your table alone, we’re at 26,000.

We have a ways to go to get this down to 500 logical reads or less. Looking at our data results, there is Michael, he is number one, and here are our results. Our next step is going to be stepping through our query and figuring out how can we get this below 500 logical reads.