By Kendra Little on September 23, 2024
Asking Microsoft for support for SQL Server or Azure SQL is a lousy experience these days. This is true whether you are using a cheaper service tier or the more expensive support tier formerly known as “Premier Support.” Either way, I’ve found the same issues: as the person requesting support, I must know a whole lot about the root cause of my problem and how to solve it, or my request will be dismissed with misinformation. I need to have data and metrics that back up my claims in order to get the ticket escalated to someone who can help, and I will need to provide those receipts three or four times. Once something is escalated to the Product Group, I may get a helpful response, but it will generally take a while. If I’m not engaged directly with the Product Group and the answer is being relayed through a lower support tier, it often won’t make much sense.
These issues don’t happen due to bad work ethics or personal failings of support workers. These are good humans, who are trying their best! The problem is worse, because it’s systemic.
From where I sit as an outside observer, support staff don’t seem equipped to learn deeply about the products they support. There is clearly a culture of guessing at answers, which suggests they lack access to subject matter experts who can help them get the right answers when they aren’t sure. They also seem to have an incredibly hard time accessing telemetry for hosted services with any level of detail. I wonder whether they can see many of the messages uploaded to the support ticket, or even previous emails I’ve sent. I suspect this is all being routed through a ticketing system that’s terrible to use. Basically, Microsoft support for SQL Server products is persistently bad, even when support team members are clearly trying to deliver good service, which is most of the time. To me, this indicates that the failure starts a long way above them.
But still, you will have occasions when you need help from Microsoft for SQL Server products to report bugs with the “run it yourself” product or outages with Azure SQL Database or Azure SQL Managed Instance. The support experience is repetitive and time consuming, and it’s frustrating as a customer. Here are my tips for making it through the Microsoft Support Machine as efficiently as you can while minimizing your own frustration.
Start the ticket promptly, but immediately start researching on your own. You’re going to have to be the expert here.
A lot of folks assume that Microsoft Support employees will understand quite a bit about SQL Server, Azure SQL Database, or Azure SQL Managed Instance. This isn’t the case: my consistent experience is that they know just enough to be dangerously wrong.
With the cheaper production tier of Microsoft Support, you will need to persistently restate the name of the exact product you are using, and you will frequently be given responses that only apply to a different flavor of SQL Server offering – and which aren’t even possible with the product you are using. You’ll be given queries that don’t run successfully, answers that seem to come from another database, or universe.
With the more expensive tier of production support, they do consistently understand what product you are using. (If you’ve used the cheaper tier, this will feel like a relief: it is quite depressing to repeatedly explain to Microsoft Support that one of their products is not the same thing as another product.) However, knowledge about how the product works is very limited. It’s a lot like interacting with ChatGPT: you’ll get back blocks of somewhat complex text that typically don’t hold up to much scrutiny.
For instance, imagine that you report seeing messages like this in the error log of an Azure SQL Managed Instance:
SQL Server has encountered [#] occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [blah blah blah] in database [blah].
You might be told that this is how Managed Instance throttles storage under the General Purpose tier, with links to documentation about IO limits, and that to resolve this you need to pay more for a more expensive tier. This could cost you quite a lot of money to solve, unless you understand that:
- You can see and prove how much I/O was being used at the time from the sys.server_resource_stats dynamic management view, and that it’s well below the limits.
- You are familiar with the error and know that it means what it says, 15 plus seconds of storage not responding, and that the error exists because this is effectively a failure of the storage subsystem to the database storage engine.
- You already know for sure that this is not how transaction log throttling works on the service, and prior conversations with folks in the Product Group have led you to believe that this is not how other I/O is throttled, either.
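The first point above, proving that I/O was well below the limits, can be sketched roughly as follows. This is a minimal illustration, not a definitive method. It assumes you have already pulled rows from sys.server_resource_stats into plain dictionaries, and that the columns start_time, end_time, io_requests, io_bytes_read, and io_bytes_written hold totals for each sample interval. The function names and the limit figures are hypothetical, so substitute the documented limits for your own tier. Each figure is an average over its sample interval, so a very short burst inside an interval would not show up.

```python
from datetime import datetime, timedelta


def peak_io_usage(rows, window_start, window_end):
    """Return (peak IOPS, peak MB/s) across samples that overlap the window."""
    peak_iops = 0.0
    peak_mb_per_s = 0.0
    for row in rows:
        # Ignore samples that fall entirely outside the incident window.
        if row["end_time"] <= window_start or row["start_time"] >= window_end:
            continue
        seconds = (row["end_time"] - row["start_time"]).total_seconds()
        if seconds <= 0:
            continue
        # Assumption: counters are totals for the interval, so divide by its length.
        iops = row["io_requests"] / seconds
        total_bytes = row["io_bytes_read"] + row["io_bytes_written"]
        mb_per_s = total_bytes / seconds / (1024 * 1024)
        peak_iops = max(peak_iops, iops)
        peak_mb_per_s = max(peak_mb_per_s, mb_per_s)
    return peak_iops, peak_mb_per_s


def describe_usage(peak_iops, peak_mb_per_s, iops_limit, mb_per_s_limit):
    """Format the peaks as a share of the limits, ready to paste into a ticket."""
    iops_pct = 100 * peak_iops / iops_limit
    mb_pct = 100 * peak_mb_per_s / mb_per_s_limit
    return (
        f"Peak {peak_iops:.0f} IOPS ({iops_pct:.1f}% of limit {iops_limit}); "
        f"peak {peak_mb_per_s:.1f} MB/s ({mb_pct:.1f}% of limit {mb_per_s_limit})"
    )


if __name__ == "__main__":
    # Made-up sample standing in for one row of the view.
    start = datetime(2024, 9, 1, 12, 0, 0)
    sample = {
        "start_time": start,
        "end_time": start + timedelta(seconds=15),
        "io_requests": 1500,
        "io_bytes_read": 30 * 1024 * 1024,
        "io_bytes_written": 15 * 1024 * 1024,
    }
    peaks = peak_io_usage([sample], start, start + timedelta(minutes=5))
    # The limits below are placeholders, not real service limits.
    print(describe_usage(*peaks, iops_limit=500, mb_per_s_limit=100))
```

A one-line summary like this, with the raw rows attached, is the kind of receipt that is quick to restate each time the throttling explanation comes back.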
(Pausing for a second– can you imagine a hosted database service that throttles I/O by making the storage completely stop responding for 15 seconds or more if you touch the max threshold for IOPS or throughput? The storage file would, by design, just not respond at all to the database storage engine for 15+ seconds for hundreds of requests? Would there be a worse way to design it? Especially when you’ve already built a feature into the product called “Resource Governor” that has features to govern IOPS requests?)
You might also be told that RECOMPILE hints are causing PAGELATCH waits, that you need to be the only active user in a database in order to shrink a file, and that if an ordinary query you are running triggers stack dumps, you should just stop running that query.
This issue with Microsoft Support is not new, but quality has degraded significantly in recent years. This degradation is in part due to the complexity of the various Azure SQL offerings, where the products have very similar names, have been re-branded multiple times, and share blended documentation pages. It’s also incredibly hard to learn in depth about how these products work without actually being a user of the products. I’ve never seen evidence that support staff have the opportunity to gain real-world experience. While I know that Microsoft used to offer week-long training courses on SQL Server Internals to employees, taught by industry experts, I haven’t heard of this happening for many, many years.
You may need a consultant.
If you don’t have someone on staff who understands SQL Server deeply, it is worth hiring a consultant to help you manage the support process when you are having problems.
I realize this feels like it shouldn’t be necessary. I agree, Microsoft Support should be better. But I have very little hope that the experience will improve soon– Generative AI is going to have the exact same problems. Early versions of Gen AI Advice in the Azure Portal certainly do– it’s just as quick to recommend solutions for the wrong product as the cheapest service tier does. If anything, Generative AI is going to bring things down to the level of the cheaper, and worse, support tier.
Working with someone who knows SQL Server deeply to help with your problems may seem expensive at first, but so is SQL Server licensing and/or cloud costs. A good expert will help you call bullshit (politely) on bad advice from Microsoft Support. That support advice is often to solve your problem by throwing more, more, more dollars into licenses or the Azure cloud to try to spend the problem away. And that solution doesn’t always work, either.
Again, I don’t think there is any nefarious intent from support staff in this advice– they are almost certainly doing their best guessing how to solve your problem with what documentation they have, and that documentation generally puts an overly optimistic aura around the more expensive tiers of the product. But that advice is often not in your best financial interest as a customer, and it often won’t even work. There is, at an organizational level, a bit of a conflict of interest when your cloud provider also creates the software that you use: organizational incentives are generally around their own profit margins, not around helping customers find the most cost effective solution.
Keep all of your research in a document so that you can easily copy and paste it back into the ticket. You’re going to need to provide data repeatedly.
If you are the person working the ticket, don’t fall into the trap of only entering data into the email thread. The thread is going to be long, and people are going to miss the information you have provided.
I keep a copy of all my research and data in a document so that I can easily and quickly restate it without having to go back through the thread. I just imagine that support staff can’t see anything I provided previously and have to work every day as if they’re an actor in Memento or Groundhog Day – I can’t change the system I’m working with, but I can make it easier for myself to manage this way.
This has a secondary benefit of also making it easy to share back to the team the nature of the problem, the data we have gathered, the current status of the ticket, what workarounds we have evaluated, and what we need from the support request: it’s in my document.
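If you prefer something more structured than a free-form document, the same idea can be sketched as a small record that renders to paste-ready text. This is only one possible layout. The class name, field names, and sample values are hypothetical, chosen to mirror the items listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TicketNotes:
    """Everything that tends to need restating on a long support thread."""
    product: str              # exact product name, restated every time
    problem: str              # the nature of the problem
    status: str               # current status of the ticket
    needed_from_support: str  # what we need from the support request
    evidence: List[str] = field(default_factory=list)
    workarounds: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return plain text suitable for pasting into an email or ticket."""
        lines = [
            f"Product: {self.product}",
            f"Problem: {self.problem}",
            f"Current status: {self.status}",
            f"What we need: {self.needed_from_support}",
            "Data gathered:",
        ]
        lines += [f"- {item}" for item in self.evidence] or ["- (none yet)"]
        lines.append("Workarounds evaluated:")
        lines += [f"- {item}" for item in self.workarounds] or ["- (none yet)"]
        return "\n".join(lines)


if __name__ == "__main__":
    notes = TicketNotes(
        product="Azure SQL Managed Instance",
        problem="I/O requests taking longer than 15 seconds",
        status="Waiting on escalation",
        needed_from_support="Root cause from the Product Group",
        evidence=["I/O usage during the incident was well below limits"],
    )
    print(notes.render())
```

Because the product name leads every rendered summary, it also helps with the problem of answers arriving for the wrong product.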
Save off blurbs that block common brush-offs, and use them to save time.
Support folks worldwide work off scripts. You’ll quickly notice patterns with Microsoft Support: if you report an unexpected downtime in Azure SQL Database or Azure SQL Managed Instance, you’ll absolutely get back a set of long paragraphs about how your application needs to use retry logic. This will be suggested as a “solution”.
If, like almost every application, you already have retry logic, go ahead and save off phrases that you can use to either head these suggestions off proactively when you file the ticket, or to redirect the conversation back to the real problem. It’s not worth your time to custom write these each time: you get to work off scripts, too.
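For context, the retry logic those scripted replies describe usually amounts to something like the sketch below: retry a failed operation a few times, waiting longer after each failure. It is a generic illustration and not any particular driver's API. `TransientError` and `run_with_retry` are hypothetical names, and deciding which real errors count as transient is left to your application.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for whatever your driver raises on a transient fault."""


def run_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                   sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts, so surface the failure to the caller
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add jitter so many clients do not all retry at the same moment.
            sleep(delay + random.uniform(0, delay / 2))
```

If your application already does something equivalent, saying so up front, along with the attempt count and the delays you use, is a good candidate for one of your saved blurbs.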
Frequently ask for escalation to the Product Group.
If you aren’t getting a helpful answer right away, ask early and ask often for the ticket to be escalated. Depending on the support contract, there may be several levels of escalation needed. Folks inside the Product Group DO tend to know the product very well, and they tend to be well equipped to understand and diagnose problems. They can make things happen to fix real bugs! They can also give you straight up advice and insights that are very useful.
You will find that sometimes you can’t seem to find the right people in the Product Group. For example, at this point I think probably nobody is home when it comes to the Change Tracking feature, and there isn’t anyone at Microsoft who understands why it seems to have serious intermittent problems when used with a SQL Server in a synchronous Availability Group relationship at high load for many users.
But, most of the time, if I can get an issue escalated with the right supporting data, I can get some helpful information back. It just takes a while, and I need to do a lot of research and testing on my own, as well as come up with my own workarounds and optimizations.
Someone’s going to call at the end to put a human face on all of this.
Microsoft will call you at the end of the process when you’ve reached some sort of acceptance about your issue. Microsoft will call for this feedback, even if you’ve specified that you prefer to be contacted by email. I’m guessing metrics show that if people answer the phone then their frustration with this whole system is somewhat lessened by talking to a human. In any case, they will ask you how your support experience was.
Like all the support people involved, this will be a good human, who is trying their best. But this whole system shows few signs of positive change.