on January 19, 2017
Over the years, I’ve come across a pattern fairly frequently: an application in an OLTP database periodically creates new tables, loads some data into them and fixes it up, then does a switcheroo and replaces old tables with the new tables.
This can cause major problems with blocking if anyone else is querying the table.
The problem looks like this
- Innocent query / queries are selecting from the current production table
- Switcheroo process comes along and tries to do an sp_rename
- Switcheroo process is blocked and is waiting on a schema modification lock
- Switcheroo process starts blocking EVVERRRYBODYYYY else who wants to query the table
Sometimes this clears quickly, but it varies depending on how long the queries take and how complex the Switcheroo process is. On a busy system, a big blocking backlog can even potentially cause THREADPOOL waits, which means the SQL Server generally feels like it’s not working for anyone.
This is a tough problem, because you can’t get around it with isolation levels (even read uncommitted/nolock queries require a shared schema lock, which blocks a schema modification lock). You also can’t get around it with optimistic locking.
In the past, I wrote that if you have to do this switcheroo, sp_rename is better than ALTER SCHEMA TRANSFER, but it still has a bunch of problems.
Today, a little light bulb went on. There is a better way to do this than to use sp_rename if you have existing code using this pattern!
Note: If you can avoid the ‘switcharoo’ pattern altogether and simply create and manage multiple versions of your tables, and have your application use the latest version, that is best because it avoids the locking problem entirely. The trick in this post is for existing codebases where sp_rename or ALTER SCHEMA TRANSFER is already in use, and you need something to mitigate blocking problems in the short term.
Here’s a 12 minute video talking through the optional pattern. You can also scroll down below the video to read a written version of the solution. If you enjoy the video, you might like to subscribe to the podcast. I would also love a review on iTunes!
This pattern works in SQL Server 2014 and higher. And it even works in Standard Edition of 2014.
Some folks will see the word ‘Switch’ in this pattern and assume the pattern that I’m suggesting is Enterprise Edition only for versions before SQL Server 2016 SP1.
However, oddly enough, you can use partition switching even in Standard Edition, as long as the tables only have one partition.
And all rowstore tables have at least one partition! That happens automagically when you create a table.
Here’s the pattern: Use SWITCH PARTITION with WAIT_AT_LOW_PRIORITY
If you’d like to play around with this in full, I’ve got a big old code sample in a gist for you to use on your test instance. But here’s the part that contains the magic:
BEGIN TRAN
ALTER TABLE dbo.ProductionTable SWITCH PARTITION 1 TO dbo.ProductionTableOld PARTITION 1
WITH ( WAIT_AT_LOW_PRIORITY ( MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = BLOCKERS ));
--Anyone who tries to query the table after the switch has happened and before
--the transaction commits will be blocked: we've got a schema mod lock on the table
ALTER TABLE dbo.StagingTable SWITCH PARTITION 1 TO dbo.ProductionTable PARTITION 1;
COMMIT
Some notes:
- For Standard Edition below SQL Server 2016 SP1, the syntax is a little different: ALTER TABLE dbo.ProductionTable SWITCH TO dbo.ProductionTableOld; (Thanks to Ben for pointing this out in the comments!)
- The reason this solution helps is that WAIT_AT_LOW_PRIORITY won’t cause the big blocking chain behind it if this gets blocked. It will sit by the side MUCH more gracefully, even if it can’t get a schema modification lock. (Here’s a Microsoft post that goes into a lot of detail about wait at low priority - if you want the nitty gritty.)
- This also helps because you get to pick how long it waits, and what it does after that time is up. In this sample I say to kill off the blockers. That’s not going to be a good choice all the time: maybe you’d like it to just sit there waiting at low priority, or to give up itself. You get the choice.
- You probably do want to use an explicit transaction with this, unless it’s OK that someone queries the table while it’s empty, between the switching. And if that’s the case, you’d want the WAIT_AT_LOW_PRIORITY options on the second switch.
- This only works when the tables are in the same filegroup. Otherwise it wouldn’t be a metadata only change.
- You have to create matching indexes on your staging table and production table to make the switching work. (But you were doing that work before you renamed the new table in, anyway.) I kept the sample gist super simple, but you can add indexes to that if you want to see it in action.
What if I want to truncate the data, instead of switching it out?
The TRUNCATE TABLE command doesn’t have the WAIT_AT_LOW_PRIORITY option. Using it would put you right back in your big old blocking chain problem. A DELETE statement will be logged and also has blocking problems.
Instead, you can use the SWITCH pattern above, and then just immediately truncate dbo.ProductionTableOld. As long as that table isn’t being read by anyone, you don’t have a blocking problem truncating it.
Anyone think this is a terrible idea?
There could well be something I’m missing about this solution. Happy to hear about it in the comments if you see a problem!