Opinions expressed on this site are solely those of Kendra Little of Catalyze SQL, LLC.
on January 21, 2011
Optimizing queries is the most fun when you don’t need to add indexes. There’s nothing quite so nice as finding a way to make reading data faster, without slowing down writes or creating new data structures that need to be maintained.
Here’s one way you can use BACKWARD scans to do this.
The Scenario: Clustered index on an increasing integer, and you’d like recently created rows
This is a common enough situation: you have a table with a clustered index on an integer value which increases with each row. You have another column which records the date the row was created.
You’d like to frequently query the most recently created rows over some period of time.
The table has very frequent inserts, so for performance reasons you want to use the minimal indexes required. (And in general, this is the best practice.)
Question: Do you need to add a nonclustered index on the column containing the date the row was created?
Answer: Maybe not!
Getting the right clustered index scan
Say we’re working with the following table, which we have filled with five million rows of Tweetie birds. (Note: The rows are generated with a tally table technique I found on Stack Overflow, where it is attributed to Itzik Ben-Gan.)
CREATE TABLE dbo.Birds (
birdId INT NOT NULL ,
birdName NVARCHAR(256) NOT NULL,
rowCreatedDate DATETIME2(0) NOT NULL )
GO
--Insert 5 million Tweetie birds
--Make them as if they were all created a minute apart.
;WITH
Pass0 as (select 1 as C union all select 1),
Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
Tally as (select row_number() over(order by C) as Number from Pass5)
INSERT dbo.Birds (birdId, birdName, rowCreatedDate)
SELECT Number AS birdId ,
'Tweetie' AS birdName ,
DATEADD(mi, Number, '2000-01-01') AS rowCreatedDate
FROM Tally
WHERE Number <= 5000000;
GO
--Cluster on birdId. We won't add any other indexes.
CREATE UNIQUE CLUSTERED INDEX cxBirdsBirdId ON dbo.Birds(birdId);
GO
Say we would just like to see the maximum value in the rowCreatedDate column.
The most basic way to get this row is with this query:
SELECT MAX(rowCreatedDate)
FROM dbo.Birds;
GO
However, that leads to a full scan of the clustered index, which reads every page in the table. We get lots of reads: 22,975 logical reads and 201 physical reads.
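If you want to reproduce read counts like these, one way is to turn on STATISTICS IO for the session before running the query. This is a minimal sketch; the exact numbers you see will depend on your server, and physical reads in particular depend on what is already in the buffer pool.

```sql
-- Report logical, physical, and read-ahead reads in the Messages tab.
SET STATISTICS IO ON;
GO
SELECT MAX(rowCreatedDate)
FROM dbo.Birds;
GO
-- Turn the reporting back off when you are done measuring.
SET STATISTICS IO OFF;
GO
```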
If we know we have a strong association between the BirdId column and the RowCreatedDate column, and that the highest ID in the table is the most recent row, we can rewrite the query like this:
SELECT MAX(rowCreatedDate)
FROM dbo.Birds
WHERE birdId = (SELECT MAX(birdId) FROM dbo.Birds);
GO
This query still does a clustered index scan, yet it does only 3 logical reads and 2 physical reads.
Looking at the execution plan, our query was able to use the extra information we provided to scan the index backwards. It stopped when it had everything it needed, which was after a short distance. After all, it only needed recent rows, and those are all at one end of the table.
This backwards scan can be very useful, and it can make the MAX aggregate much cheaper than it first appears.
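Another way to ask for the same row, again assuming the highest birdId always belongs to the most recently created row, is TOP (1) with an ORDER BY on the clustering key. This is a sketch of an alternative phrasing, not the query measured above. In the graphical plan, the Clustered Index Scan operator’s Scan Direction property is where you would expect to see BACKWARD.

```sql
-- Assumption: the row with the highest birdId is also the newest row.
-- Ordering by the clustering key descending lets the engine read
-- from the end of the index and stop after the first row.
SELECT TOP (1) rowCreatedDate
FROM dbo.Birds
ORDER BY birdId DESC;
GO
```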
But you usually need more than just the max value…
To see a bit more about how you can extend this logic, compare these three queries:
Query A
This makes you think you need that non-clustered index: it does 22,975 logical reads, 305 physical reads, and 22,968 read-ahead reads.
--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS;
GO
SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00';
GO
Query B
We can introduce the backwards scan by adding an ORDER BY birdId DESC to the query. The scan direction alone doesn’t help, though, because the whole index is still read: now we get 23,019 logical reads, 47 physical reads, and 22,960 read-ahead reads.
--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS;
GO
SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdId DESC;
GO
Query C
This last query gives the optimizer extra information: it uses BirdId to do a BACKWARD scan to grab the maximum BirdId at or before the cutoff date, and then uses that value to do a BACKWARD seek of the clustered index in nested loops to get the data. It does only 50 logical reads, 4 physical reads, and 817 read-ahead reads.
--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS;
GO
SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE birdId >=
(SELECT MAX(birdId)
FROM dbo.Birds
WHERE rowCreatedDate <= '2009-07-01 05:00:00')
AND rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdId DESC;
GO
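One edge case in the Query C pattern is worth sketching. If no row is at or before the cutoff, the subquery returns NULL and the comparison on birdId filters out every row. The variant below is a hedged sketch, not something measured in this post: the @cutoff variable and the fallback value of 0 are my own illustrative assumptions (0 only works if every birdId is positive, as in the demo table). Using a variable may also change the plan, so check that you still get the backward scan and the seek.

```sql
-- Hypothetical variable name; DATETIME2(0) matches the column type.
DECLARE @cutoff DATETIME2(0) = '2009-07-01 05:00:00';

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE birdId >=
    -- Fall back to 0 when no row is at or before the cutoff.
    -- Assumption: all birdId values are greater than 0.
    ISNULL((SELECT MAX(birdId)
            FROM dbo.Birds
            WHERE rowCreatedDate <= @cutoff), 0)
AND rowCreatedDate >= @cutoff
ORDER BY birdId DESC;
GO
```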
Be Careful Out There
The examples I’m using work because there is a correlation between the integer field and the date field. Not all tables may be like this. As with all queries, you need to be familiar with your data.
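One way to check that assumption is to count the rows whose created date is earlier than the date on the row just before them in birdId order. This is a rough sketch: it uses the LAG window function, which requires SQL Server 2012 or later, and it reads the whole table, so treat it as an occasional sanity check rather than something to run often. On the demo table above it should return 0, because the dates were generated to increase along with birdId.

```sql
-- Count rows that break the "higher birdId means later date" assumption.
-- LAG requires SQL Server 2012 or later.
SELECT COUNT(*) AS outOfOrderRows
FROM (
    SELECT rowCreatedDate,
           LAG(rowCreatedDate) OVER (ORDER BY birdId) AS previousRowCreatedDate
    FROM dbo.Birds
) AS ordered
WHERE rowCreatedDate < previousRowCreatedDate;
GO
```

If the count is not zero, rewrites like the ones above can silently return the wrong rows, and a nonclustered index on the date column may be the safer choice.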
Consider Your Options, Even the Ones You Don’t Think Are Great
I’m quite sure BACKWARD index reads are covered in some talks and publications on tuning. But I learned about this by considering multiple approaches, even those I didn’t think would work at first. It pays to try things out, and you can learn a lot by looking carefully at execution plans (including the properties) and your Statistics IO output.
What this means to me: it’s good to keep an open mind.