Find out why that update statement stuck around for so long
Learn other reasons that may cause this pattern to happen, and which application patterns are inherently problematic for this type of blocking.
Links from this lesson
Articles on doomed/abandoned transactions from Michael J. Swart:
- Case study: Troubleshooting Doomed Transactions - http://michaeljswart.com/2017/01/case-study-troubleshooting-doomed-transactions
- Don’t Abandon Your Transactions http://michaeljswart.com/2015/10/dont-abandon-your-transactions
Error handling articles from Erland Sommarskog:
- Error and Transaction Handling in SQL Server - Part One – Jumpstart Error Handling - http://www.sommarskog.se/error_handling/Part1.html
- A description of XACT_ABORT with some notes on limitations - http://www.sommarskog.se/error-handling-II.html#XACT_ABORT
Transcript
Let’s get into solutions for our repeat blocker.
One thing I want to make sure is clear is:
Not all sleeping sessions are doomed transactions or orphan transactions or any of the other nasty names that I called them
This session was such a problem because it had done bgein tran, ran a bunch of commands, and then gone to sleep without committing or rolling back.
Sessions that are just sleeping, that don’t have open transactions use minimal resources against the SQL Server.
Now it is totally true that if I’m connected to the SQL Server and I’m using it in a database I’m gonna have a really lightweight lock shared just on that database just to make sure nobody blocks the database while I’m around. But that doesn’t consume a lot of resources and it’s not gonna block at the table level in that database.
So if you do run a built-in command like sp_Who2, it’s gonna show you a lot of things that are sleeping, those things aren’t necessarily causing problems.
sp_WhoIsActive will ignore sleepers unless they’re doing things like have an open transaction and that’s why sp_WhoIsActive showed this sleeper to us
One of the things I like about sp_WhoIsActive is it makes it easy to tell if you have a problem sleeper or not because it just want show you sleeping dogs that are okay to let lie in there.
When I tried to solve the problem using NOLOCK it was kind of like I’d cut off my finger and was like, okay maybe if I take a big handful of Aspirin it won’t hurt anymore.
That’s the problem with NOLOCK is it’s likely to cause other problems and it doesn’t at all address the cause of the locking, so don’t just NOLOCK your problems away.
Back when I hit this as an early DBA it was really tough to solve the problem
We kept finding the issue happening and sleeping it, and we did investigate, we didn’t have a script quite as cool as sp_WhoIsActive but we did figure out that it was always the same code that fell asleep holding its locks to block other things. We could put together a job to more quickly kill the session but that just felt really dangerous and bad.
What if we killed off the wrong thing, what if it had done more work in the transaction, what if something changed and we’re auto-creating a long painful rollback that’s causing terrible problems.
We didn’t want to do that and we started talking with the developers who owned that section of the code.
- The devs said, “We didn’t cause this,” “we didn’t change anything recently,” “we didn’t change our code.”
- And we were like, “Every time it happens it’s your code.” Which was true but maybe not the most helpful way to put it.
- And they said, “Look it didn’t do this before” “and we didn’t change anything,” “so it’s something in the SQL Server.”
- But the SQL Server wasn’t making it sleep, so we’re like, “It’s not the SQL Server, it’s you,”
- and they said, “It’s not us, it’s you,”
- and we’re like “The users are really, really mad,” “maybe we should tell the users it’s you,”
- but they’re like, “It’s not us.”
This conversation was not successful at all
Everyone ended up unhappy and essentially we just ended up in this standoff where we’re all like, no it’s you, no it’s you, everyone got very defensive.
Finally, the problem was solved
A developer from another team solved it.
Essentially on our team, I was like, “Maybe we bring in a deadlock manager who can point out who’s really at fault,” and be like, “No, it’s really you.”
A developer on another team was looking at release notes and KBs, and figured out that the .NET framework, which was what this application used, it had a bug in it that could cause orphaned transactions on the SQL Server. Which is exactly what we were seeing. We got a patch for the application servers, the application servers got patched by that team.
Problem did not happen again.
It was in a different place in the stack for us than we had imagined
I have run into this pattern though of the sleeping blocker, many times during my career it keeps coming back and it sometimes had different causes, it’s not always the .NET framework.
There’s some coding design patterns, this is a similar one but it’s not always a sleeper.
Problematic pattern: disabling a trigger in a transaction
Some coding design problems I’ve seen will do a begin transaction, they’ll do something like an alter table, and often seen this one a couple times, they do alter table and they disable the triggers on the table because they want to update a bunch of data and they don’t want the triggers firing while they’re doing it.
Then the whole time they’re doing their work, and it may go to sleep for awhile because sometimes it goes off to do its work in some other part of the application not even related to the SQL Server before it comes back to commit. While the whole time it’s doing before it reenables the triggers and commits it has an exclusive lock on that whole table because disabling triggers is modifying the schema of a table and no one else can touch that table while you have an exclusive lock. Everyone else has to wait.
So this one’s difficult to troubleshoot, because you have to say, “If you gonna disable the triggers and do a bunch of work, you can’t do that at the same time that people are actually using the table.”
This is an application design pattern that we have to find some other way to do it.
Either you have to be able to leave the triggers on or there has to be some other way to let people use the table, because it’s just not a concurrent operation.
Problematic pattern: holding locks and going off to do something else
I’ve also found cases where the application starts a transaction and the SQL Server does begin tran, does some work, and then its complex environment, they go off and talk to another data source, and the one I’m thinking of they were doing something in Mongo DB.
And sometimes they would go do something in Mongo DB and their query into Mongo DB would just disappear, like it would never finish. And so it would never come back tot he SQL Server because it was just lost in the Mongo DB and this would leave blocking in the SQL Server, so the application had to be changed to handle that because SQL Server will faithfully wait.
Problematic pattern: user leaves an open transaction in SSMS
And of course I’ve also seen this happen where someone’s running a script in Management Studio and they accidentally leave a transaction open and they’ve done some stuff and they went to get a coffee, or they went home, and it caused a lot of problem. Whenever folks have access to Management Studio there is a risk of them accidentally taking out locks and leaving open transactions.
Doomed transactions
Other folks in the SQL Server community have fought this battle against doomed transactions as well, and Michael Swart has written some fantastic blog posts on doomed transactions. His posts talk about the importance if you’re using stored procedures then SQL Server of being very careful with your error handling.
The transact abort setting in SQL Server defaults to off. You can turn it on in your stored procedures meaning, hey, if something fails I really want you to roll the whole transaction back. In his case it really helps simplify error handling and avoid some of these sleeping sessions holding locks and causing problems issues. Check out the post on his site, it’s called Don’t Abandon Your Transactions. It’s really, really helpful for this.
But Michael also admits in other posts on his site that transact abort, setting it on doesn’t cure everything in SQL Server that can cause a transaction to go to sleep and not get rolled back. Transact abort, it won’t fix compilation errors, you can still get hangs there.
There’s complications with RAISEERROR
You want to essentially make sure that you don’t want to just knee-jerk turn it on everywhere either because in an existing application it might interfere with some error handling logic you already have in place you wanna keep like, you wanna do logging before you bail out after a problem, so be careful turning it on.
Erland Skommarskog has great resources if you have to troubleshoot this specific doomed transactions and you think it’s due to error handling, he has some fantastic resources on his site on error handling, everything from how you handle transact abort to other things like try / catch, really valuable content that if you need to go down that troubleshooting, error handling road you want Erland’s help with it. (http://www.sommarskog.se/error_handling/Part1.html)
Big picture strategy
Look at the whole stack.
When you find these sleepers who are blocking others you want to know not only has the code in the application Server changed recently but we recently gone to a new version of services on the app Servers.
When the calling code comes into SQL Server, what does it look like? At what point does it run begin_tran and what point does it start to do work?
Is it doing this in stored procedures? If so what is their error handling like that?
Is it possibly related to triggers on tables that it’s using too?
You’ve gotta look everywhere from the application server down through the triggers on the table.
It can be quite time consuming, just don’t my mistake
Back when I did this as a new DBA, I was too fixated on saying, “It’s gotta be your code.”
Instead, ask questions:
Can you help me investigate this change?
Can you help me find the calling code and where it is in the calling app that actually makes this call?
Can you help me sift through the error handling?
Asking those questions more opens up things as your strategy and will get you farther
It will pull people into your investigation to help you rather than leading to a roadblock like I got stuck in.
Pulling back together what we learned in this module. When you’re not sure if blocking is your problem, and sometimes it’s not, we can have performance problems for a lot of reasons.
Wait statistics in SQL Server are super valuable
Sys.dm_os_wait_stats dynamic management view shows you, here’s the historic waits on this instance. Those are the waits that have finished.
If you want to look at waits that are in progress, is there blocking right now, you can look at sys.dm_os_waiting_tasks.
Instead of writing a complex query with a bunch of joins you can write run, the free sp_WhoIsActive procedure from Adam Machanic as a shortcut. And I love this to quickly see, do I have blocking right now?
I’m also a big fan of using performance-based alerts in the SQL Server agent to quickly say, “Hey, there’s blocking going on.”
Now it doesn’t tell you everything about the blocking, it just lets you know, “Hey, blocking happened now.” So we are gonna dive in though to this issue of what happens if I’m not at the keyboard when this alert comes in, and how do I troubleshoot it?
That will be up in our next module all about the sneaky blocker in SQL Server. I’ll see you then.