By Kendra Little on April 6, 2025
The biggest lesson I’ve learned from helping folks manage data in Azure is this: if you’ve got a truly terrible problem you’d rather people didn’t notice, a great way to hide it is to educate your support staff and users about something bad but not AS terrible (something with a small mitigation), and to constantly refocus them on that.
The user base, and even your own support staff, will think that anyone who talks about the bigger issue just doesn’t understand how to fix the “known” problem.
This is the story of Azure General Purpose storage for Azure SQL Managed Instance and Azure SQL Database.
Flaw 1: consistently lousy storage performance
Here’s the “really bad but you can mitigate it a little” thing about General Purpose storage in Azure SQL:
In the General Purpose service tier, every database file gets dedicated IOPS and throughput that depend on the file size. Larger files get more IOPS and throughput… There’s also an instance-level limit on the max log write throughput (see the previous table for values, for example 22 MiB/s), so you might not be able to reach the max file throughput on the log file because you’re hitting the instance throughput limit. (Microsoft Docs)
There’s a table in the document above that describes the IOPS and throughput levels for files, and why you need to grow them to 130 GB, 514 GB, 1026 GB, or more.
The larger your files, the more you pay, so this is not only inconvenient, but also expensive.
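If you want to see where your files currently sit relative to those size tiers, a quick look at sys.database_files will tell you. This is just a minimal sketch; compare the sizes it returns against the current resource-limits table in the docs, since the boundaries can change.

```sql
-- File sizes for the current database.
-- size is reported in 8 KB pages, so convert to GB to compare against the documented tiers.
SELECT
    name       AS logical_file_name,
    type_desc,
    size * 8.0 / 1024.0 / 1024.0 AS size_gb
FROM sys.database_files;
```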
If you contact Microsoft support with any question about storage performance, they will fixate on this weird, complex limitation.
But this is far from the worst storage problem in Azure SQL General Purpose.
Flaw 2: intermittently atrocious storage performance
There’s a footnote on the resource limits above, which reads:
1 This [Storage IO latency measurement] is an average range. Although the vast majority of IO request durations will fall under the top of the range, outliers which exceed the range are possible.
It’s unclear what they mean by “vast majority”, but let’s be generous and say it’s 95% of requests. Now imagine that the other 5% of your database storage requests blow past that range and take more than 15 seconds to complete. In fact, I’ve observed these outliers to run between 15 and 60 seconds. 60 seconds. Serially.
Stop and wait for 60 seconds. That’s a real long time.
Now think about what happens when a storage request underneath a database waits that long. The query may be holding locks that block things. All sorts of activity may back up. Things get real, real bad.
Even if it’s much smaller than 5% (in fact, I think on average this probably happens to a given database about 10 times a week), you don’t want to run a database on storage like that. Not even an unimportant database. Particularly for what you pay for Azure SQL Database or Managed Instance.
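If you want to catch one of these stalls in the act, one approach (a sketch, not an official diagnostic) is to look for requests that are currently stuck on storage-related waits:

```sql
-- Requests currently waiting on storage. wait_time is in milliseconds,
-- so anything in the tens of thousands is a multi-second IO stall.
SELECT
    session_id,
    command,
    wait_type,
    wait_time AS wait_time_ms,
    blocking_session_id
FROM sys.dm_exec_requests
WHERE wait_type LIKE N'PAGEIOLATCH%'
   OR wait_type IN (N'WRITELOG', N'IO_COMPLETION')
ORDER BY wait_time DESC;
```

Point-in-time sampling is hit-or-miss for an intermittent problem, which is part of why these stalls are so easy to dismiss; logging results like this on a schedule, or using a monitoring tool, makes the pattern much harder to argue away.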
This is beyond alarming, but every time a Microsoft employee reads my post on how Azure SQL Managed Instance Storage is Regularly as Slow as 60 Seconds, they send me an email or direct message, or leave me a comment, about how I’ve missed the “bad” issue and that Azure SQL storage performance varies by file size. They don’t ask if it’s related, either; they’re confident that I’m ignorant of the issue and that there’s a way to make this better. This happens so often that I’ve had to update the post to add an explicit disclaimer for Microsoft employees.
In fact, the storage stall problem isn’t mitigated at all by growing data or log files. You can have the largest files allowed by General Purpose storage and you’ll still see IO freezes of 15 to 60 seconds happen regularly on them over time, just like the smallest files. You will see it when you’re so far under the IOPS and throughput thresholds that it’s laughable.

You will see it in a can, you will see it in a ham;
You will see it while you troubleshoot with every trace and plan;
File size won’t change a thing, it’s not the magic key;
You’ll still get hit with storage freezes unpredictably.
But the fact that people know about the “weird file size problem” blocks them from understanding this. Folks regularly argue with me when I let them know that I’ve received confirmation from the Managed Instance product group that increasing file size will not mitigate or stop the storage stalls. They are sure that if you grow the files it will make the issue better.
When the workaround becomes misdirection
The workaround (“grow the files! make ’em huge! pay more for maybe better IOPS!”) has become the performance placebo of Azure SQL support. It gives people something to recommend. It creates the illusion of understanding.
But the workaround creates tunnel vision. “THIS is the performance problem. We’ve solved it. Move along.”
Meanwhile, the nastier problem remains massively disruptive to users, and Microsoft support won’t even acknowledge it.
Cognitive Biases
This all lines up with common cognitive biases:
- The Streetlight Effect: people keep poking at the file size thing because it’s the only issue illuminated by documentation and support scripts.
- Maslow’s Hammer: once the workaround exists, suddenly every performance problem starts to look like it must be file-size related.
- Normalcy Bias: hey, 60-second storage stalls sound too extreme to be real, right?
And when you show someone new evidence, Semmelweis Reflex kicks in—we reject new information out of habit, because it contradicts what we’ve been taught.
In a complex world, it’s easy to fall into a feedback loop that feels logical in the moment, but keeps us stuck. Our brains prefer a known explanation to a messier reality.
What to do about it?
First off, if you’re using Azure SQL General Purpose and you’re experiencing weird lockups, long-running queries that don’t make sense, or sessions waiting on I/O for disturbingly long periods—you’re not imagining things. You can try growing your files, but it’s not going to fix the whole issue. And sorry, but Azure Support isn’t going to understand the issue unless you escalate about six levels, and then you’re going to be told to just wait for GPV2.
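For reference, growing a file on Managed Instance looks roughly like this; the database and file names are placeholders, and the target size should come from the current resource-limits table. It may buy you more baseline IOPS and throughput, but again, it won’t stop the intermittent stalls.

```sql
-- Hypothetical names; find your real logical file names in sys.database_files.
-- 133120 MB = 130 GB, one of the target sizes mentioned above; check the docs
-- table for the exact tier boundaries before picking a number.
ALTER DATABASE [YourDatabase]
MODIFY FILE (NAME = N'YourDataFile', SIZE = 133120MB);
```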
More generally, we need to keep our minds open to new information.
Ask the awkward questions. Use data to measure performance yourself, and break it down critically. When you find something that doesn’t line up with the documentation or the community folklore, go ahead and speak up.
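One way to do that measurement yourself (a sketch, assuming sys.master_files is available, as it is on Managed Instance) is to look at cumulative IO stall time per file. Keep in mind these are averages since startup, and averages are exactly what hide a 60-second outlier, so comparing snapshots taken a few minutes apart tells you far more than a single read.

```sql
-- Cumulative IO stalls per database file since startup.
-- io_stall_read_ms / io_stall_write_ms are total milliseconds spent waiting on IO.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.name                  AS logical_file_name,
    vfs.num_of_reads,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
    vfs.num_of_writes,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON  mf.database_id = vfs.database_id
    AND mf.file_id     = vfs.file_id
ORDER BY avg_write_latency_ms DESC;
```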
Our most valuable technical trait isn’t knowing database internals or mastering syntax. It’s being persistently curious enough to keep asking questions when the data doesn’t all line up.