By Kendra Little on December 18, 2024
What are your stories of unbelievably bad performance from cloud vendors? I’ll go first. For years, Azure SQL Managed Instance’s General Purpose Tier has documented “approximate” storage latency as being “5-10 ms.” This week they added a footnote: “This is an average range. Although the vast majority of IO request durations will fall under the top of the range, outliers which exceed the range are possible.”
How approximate is that 5-10 milliseconds, you might wonder? If you use Azure SQL Managed Instance these days, you will regularly find messages in your SQL Server Error log indicating that all data and log files have experienced latency of up to 60 seconds. At least, 60 seconds is the maximum I’ve observed personally, looking in the logs of several customers' Managed Instances. Could it be worse? Microsoft hasn’t documented a ceiling. My testing shows that this latency occurs randomly to your workload and is not related to your resource usage: using less IO will not make the errors less likely. You have no way to avoid these storage failures (I don’t see how 15-60 second latency is not a failure), and they can occur anytime.
I’m not a mathematician here, but one minute appears to be 6000x higher than that advertised 10ms latency. Is that what a prospective customer would assume when reading the documentation? And when it comes to a hosted database service, is that an acceptable product feature?
What are reasonable latencies for database storage?
Back in 2013, Paul Randal, who was once a PM on the Storage Engine team at Microsoft, wrote:
Everyone has their idea of what constitutes good or bad I/O latency, and here’s my take:
- Excellent: < 1ms
- Very good: < 5ms
- Good: 5 – 10ms
- Poor: 10 – 20ms
- Bad: 20 – 100ms
- Shockingly bad: 100 – 500ms
- WOW!: > 500ms
Paul is talking about average latencies. So let’s think about what average latency ends up looking like in a 5 minute window when you have a 45 or 60 second IO stall on all of the read/write requests to your data and log files. I don’t want to put words in Paul’s mouth, but I think we’re in “what the hell is wrong with this?” territory by his scale.
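If you’re curious where your own instance lands on that scale, here’s a minimal sketch using sys.dm_io_virtual_file_stats. Keep in mind that these numbers are cumulative since the last restart, so a single 45- or 60-second stall gets averaged away against every other I/O the file has ever done. Comparing snapshots taken a few minutes apart will show a spike far more clearly than the lifetime average.

```sql
-- Rough average I/O latency per database file, cumulative since the last restart.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_write_latency_ms DESC;
```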
General Purpose - more like No Purpose
Microsoft SQL Server, the product, agrees that multiple seconds of stalled IO are an alarming problem and should be considered a serious error for the storage subsystem. Back in 2010, Microsoft’s Bob Dorr shared his excellent SQL Server I/O Presentation. In it, he wrote:
Stalled/Stuck I/O: SQL Server 2000 SP4 added a warning that the I/O was taking too long and appears to be stuck or stalled. When an I/O request is posted (async) the time is kept (sys.dm_io_pending_io_requests) with the tracking information. Lazy writer checks these lists periodically and if any I/O is still pending at the operating system level (FALSE == HasOverlappedIoCompleted) and 15 seconds has elapsed the warning is recorded. Each file will report the number of stalls at most every 5 minutes to avoid flooding the log.
Since a normal I/O request should respond in ~15ms or less, 15 seconds is way too long. For example, if the I/O request is stalled for 30 seconds and the query timeout is 30 seconds, it can cause query timeouts. If the stalled I/O request is for the log, it can cause unwanted blocking situations.
If you are seeing these warnings you need to double check the I/O sub-system and use SQLIOSIM.exe to help narrow the problem. It can be anything from the configured HBA queue depth, multi-path failover detection mechanism, virus scanners or other filter drivers.
But these days, dear reader, it could simply be that you are using Microsoft’s Azure SQL offering. You will not be able to take action on any of those things.
Bob Dorr’s warnings are worth reiterating here: stalled/stuck I/O will regularly cause blocking in SQL Server. The longer the duration of the stall, the more blocking can pile up, and this can push the impacts of that storage latency on your workload to far longer than 60 seconds.
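If you happen to catch one of these stalls in the act, the DMV from Bob Dorr’s quote shows the I/O requests SQL Server is still waiting on. A minimal sketch (it requires VIEW SERVER STATE, and the file join is a best-effort lookup):

```sql
-- I/O requests SQL Server has handed to the OS and is still waiting on.
-- io_pending_ms_ticks is roughly how long each request has been outstanding;
-- during a stall you'll see values climb into the tens of thousands of milliseconds.
SELECT
    pio.io_pending_ms_ticks,
    pio.io_type,
    pio.io_pending,                     -- 1 = still pending at the OS level
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name
FROM sys.dm_io_pending_io_requests AS pio
LEFT JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    ON pio.io_handle = vfs.file_handle
LEFT JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY pio.io_pending_ms_ticks DESC;
```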
How much does General Purpose cost again?
Here are some sample prices for pay-as-you-go in US-East today:
- 24 vCore, Premium Series Memory Optimized - 16 TB storage - $7,547.47/month - $90,569.64/year
- 64 vCore, Premium Series Memory Optimized - 16 TB storage - $16,412.74/month - $196,952.88/year
You can spend a lot on something where the storage is regularly slower than a fax machine.
It’s simple to find these errors. Here’s how.
To find these errors, simply open the SQL Server log and search for “longer than 15 seconds.”
You’ll find messages that look like this:
SQL Server has encountered [#] occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [blah blah blah] in database [blah].
The end of the message will report the current highest-latency IO, which I’ve seen reach 60 seconds on Azure SQL Managed Instance. I’m curious if you ever see it go higher– please let me know in the comments.
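You don’t have to scroll through the log viewer to look for them, either. Here’s a quick sketch using xp_readerrorlog (undocumented, but long-standing): the first parameter is the log number (0 is the current log), the second is the log type (1 means the SQL Server error log), and the third is the search string.

```sql
-- Search the current SQL Server error log for I/O stall warnings.
EXEC sys.xp_readerrorlog 0, 1, N'longer than 15 seconds';
```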
Some tips:
- It’s incredibly slow to open the logs in Azure SQL Managed Instance, so I recommend using Erik Darling’s free sp_loghunter utility to find these programmatically. This also allows you to copy/paste the message info out.
- Azure SQL Managed Instance General Purpose deletes log files after it restarts (ugh), so you may not see the messages shortly after maintenance has occurred. This doesn’t mean the errors haven’t occurred recently, it just means the evidence has been lost. My experience has been that these messages regularly occur, both on read/write instances and failover partners, in different regions.
- As mentioned in Bob Dorr’s quote above, each file will log these errors at most once every 5 minutes. The stalls can occur far more often than that! You are only seeing a glimpse of the full scope of this problem.
What about GPV2? Why is General Purpose still GPV1? Here’s the gossip.
These errors occur on the only fully supported version of Managed Instance General Purpose, which Microsoft is now calling GPV1. A new version, GPV2, is in public preview. Microsoft may suggest you use it, but be sure to ask about the fine print. Ask them: if there is an outage on GPV2, will Microsoft provide credits for that outage?
This is why everyone is still using GPV1: it’s fully supported, and preview products are not.
But GPV2 has been in preview for a long time. With these massive problems of latency in GPV1, why hasn’t Microsoft simply made GPV2 fully supported?
I have no secret or non-secret info here: Microsoft hasn’t told me this. And let’s face it, due to a long stream of my complaints and failed attempts to get Microsoft to either fix this storage problem or be transparent about it with their customers, I don’t think Microsoft will be telling me any secrets in the future.
But I can tell you what rumors are going around in the community. These may or may not be true, but they do reflect what non-Microsoft DBAs and consultants in this space believe is the case about why GPV2 hasn’t shipped:
- It will destroy the Business Critical offering. For a good while, rumors were that GPV2 was moving so slowly because having better storage in General Purpose would make the more expensive Business Critical offering even less attractive. Personally, I’ve found that Business Critical Managed Instance already isn’t worth the price tag.
- Internal wheeling and dealing has the Managed Instance product locked into paying for General Purpose V1 storage for a long time, and they don’t want to take the financial loss of moving off it soon.
Basically, the community of folks who work with these products outside Microsoft think that GPV2 would work well, but internal politics and pricing and packaging issues keep customers stuck with a product that regularly has abysmally bad storage performance, with the only alternative being to pay through the nose for Business Critical.
These rumors may or may not be accurate, but they reflect the depression of the professional community around the Azure SQL offering.
Does Azure SQL Database also have regular storage latency of up to a minute?
As far as I’m aware, Azure SQL Database General Purpose uses the same storage model and will have the exact same issues. The Fabric version is probably the same.
Unfortunately for users, Azure SQL Database does not provide access to the SQL Server error log, nor any other way to see all of the errors in it. It only provides very limited access to security-related errors.
This probably means that latencies of up to 60 seconds can also occur on General Purpose Azure SQL Databases, but you’ll have a much harder time proving it.
Unreasonable customers
I’ve professionally been a champion of Microsoft SQL Server for nearly 20 years. I’ve been proud of the product as it’s grown and worked to help developers and DBAs around the world get the most out of it. I’ve thought for a long time that it’s worth the money.
I don’t believe that regular storage stalls of 15-60 seconds are acceptable for a hosted database product, though. And I also feel strongly that cloud vendors should be up front with customers about expected latencies with their products if they aren’t going to fix them.
I’ve been trying to be a good partner and let Microsoft document this issue before I share my technical observations here publicly. Yesterday, a friend asked if I’d seen the documentation pull request to update the resource limits page. The PR includes the comment:
Added clarification to the IO latency duration range to make clearer that IO requests exceeding the range are possible. (Although not as clear as I’d like - using “compromise language” based on comment from Microsoft Employee–name redacted intended to balance clarity with keeping us out of trouble with unreasonable customers).
I’m not linking to the PR because I don’t think it matters who wrote this. I also don’t particularly care if I’m the “unreasonable customer” here – if thinking regular storage latency between 15 and 60 seconds is ludicrous and should not be kept secret is unreasonable, yeah, I’m REALLY unreasonable. You betcha.
What I think is notable here is the blatant butt-covering. The focus isn’t on what’s best for the customer. The focus is on protecting the corporation FROM customer expectations. We might demand that products work as they are described.
Y’all, if you’re a customer, check your error logs. Be willing to be called unreasonable for asking for something to live up to SQL Server 2000’s standards. Because customers are going to have to force Microsoft to make their hosted versions of SQL Server live up to the product’s own expectations from SQL Server 2000. Complain to your Microsoft sales reps and cloud architects, and let them know this matters.
I’ve already gotten hate messages over this
I’ve been a Microsoft employee twice in the past. I complained about this storage issue and PR comment on LinkedIn, and I’ve already gotten a message from a former teammate that “there’s nothing productive” in my criticism and “we have some unresolved feelings about your working here.” Giving them some grace, maybe they couldn’t believe the 60-second thing was real and thought I was making things up– although I do tend to be a bit of a stickler for accuracy, so safe to say they don’t know me well.
But I’m not bothered. Y’all, it’s fine if Microsoft never wants to hire me again. It’s OK if some Microsofties don’t like the way I’m trying to make the product better– I’m just sharing basic technical observations and putting them into a customer’s realistic context. I’m a database nerd, and I’m on my customers' side.
This isn’t the first time it’s taken someone being brutally honest to change something at Microsoft – it was only after Brent Ozar wrote SQL Server 2014 Standard Edition Sucks, and It’s All Your Fault that Microsoft raised the ridiculously low memory limit on Standard Edition from 64GB to 128GB. And I recall that some Microsoft employees were super mad that he dared to write that post at the time.
Please, join me in complaining– perhaps Microsoft can work through whatever internal political blockades exist and push GPV2 into full support. Maybe we can make Azure SQL less of a cautionary tale. From my experience in working in software companies in general, I absolutely believe that customer complaints can be a real asset to the folks inside the corporation who do care about customer experience and making things better. These complaints help them justify getting the resources to make changes. That is something that everything not labeled “Fabric” badly needs.
What are your cloud stories? How bad does it get?
Like I said at the beginning: I want to hear your stories. Azure, AWS, Google Cloud, whatever the Oracle Cloud is– do you have a story of cloud performance that’s shockingly slow? I know I’m not alone here, and I’d love to hear it.