By Kendra Little on February 21, 2024
If you use readable secondaries in Availability Groups or read scale-out instances in Azure SQL Managed Instance, you may have queries fail repeatedly when there is a glitch and statistics are not successfully “refreshed” on the secondary replica. Those queries may keep failing until you manually intervene.
It’s unclear if Microsoft will ever fix this. There is a well-established support deflection article that documents the issue and provides ‘workarounds’.
As a user on StackExchange wrote in September 2022, “This is an outstanding bug in SQL Server which has been ignored for quite some time, unfortunately.” Based on a link in that post, it appears this was reported to Microsoft as far back as SQL Server 2012, when Availability Groups were introduced.
More than 12 years later, how viable are readable secondaries / scale-out read servers for production use if it’s acceptable for queries to fail at an undocumented/unknown rate?
What causes statistics not to refresh on a secondary? Can you prevent it?
It’s not really clear to me what causes this error. Microsoft’s support deflection article says:
This issue occurs because an active transaction prevents the cache invalidation log record from accessing and refreshing the statistics on the secondary replica.
When I ran into this issue, there were no long-running transactions on the primary replica or any secondary replica. Perhaps an active transaction had blocked the refresh at some point in the past, and the process that refreshes statistics gave up, or was chosen as a deadlock victim and didn’t retry? All I have are guesses.
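If you’re troubleshooting, a query along these lines can at least rule long-running transactions in or out. This is my own diagnostic sketch, not something from the support article; it uses standard transaction DMVs and is worth running on both the primary and the secondary:

```sql
-- Sketch: list the oldest active transactions on this replica.
-- Long-running transactions are the suspects Microsoft's article names,
-- so check both the primary and any readable secondaries.
SELECT
    tst.session_id,
    tat.transaction_id,
    tat.name AS transaction_name,
    tat.transaction_begin_time,
    DATEDIFF(SECOND, tat.transaction_begin_time, SYSDATETIME()) AS age_seconds
FROM sys.dm_tran_active_transactions AS tat
JOIN sys.dm_tran_session_transactions AS tst
    ON tst.transaction_id = tat.transaction_id
ORDER BY tat.transaction_begin_time;
```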
The workarounds suck here, folks
If this issue happens to you, the article presents three options for workarounds:
- Dump all of the caches on the secondary replica
I don’t love this step as a fix for a production database server. It’s fine for some workloads, but there are plenty where it causes performance issues. The article also indicates that this is not as “permanent” an option as option 3.
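For reference, “dumping the caches” boils down to something like the following; double-check the support article for the exact commands it recommends, as this is a sketch from memory:

```sql
-- Run on the SECONDARY replica that is throwing the error.
-- Clearing these caches forces cached metadata (including statistics
-- information) to be reloaded, but every query on the instance pays
-- a fresh compile afterwards, which can hurt busy workloads.
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('ALL');
```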
- Fail the node over
OK, so this option is to take a brief outage on the server? Really? How is this option 2? Is this a production database or an Easy-Bake Oven?
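On a SQL Server Availability Group, that “brief outage” means a manual failover run from the secondary you’re promoting; the AG name below is a placeholder. On Azure SQL Managed Instance you’d trigger a user-initiated failover through PowerShell or the Azure CLI instead.

```sql
-- Run on the SECONDARY replica that will become the new primary.
-- [MyAg] is a placeholder for your Availability Group name; a planned
-- manual failover requires a synchronized, synchronous-commit replica.
ALTER AVAILABILITY GROUP [MyAg] FAILOVER;
```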
- Manually delete the statistic on the primary and recreate it
I guess I choose this as the “least worst” option. For undisclosed reasons, this is labeled a “more permanent workaround.”
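Mechanically, the fix itself is simple T-SQL on the primary. The table, column, and statistic names below are hypothetical stand-ins for whatever appears in your error message:

```sql
-- Run on the PRIMARY replica; the change flows to secondaries via the AG.
-- dbo.Orders, CustomerId, and the statistic name are hypothetical:
-- substitute the statistic named in your own Msg 2767 error.
DROP STATISTICS dbo.Orders.[Stat_Orders_CustomerId];
CREATE STATISTICS [Stat_Orders_CustomerId] ON dbo.Orders (CustomerId);
```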
Do I want to have to automate proactively finding and repairing this issue? No.
Does Microsoft know how often this happens?
I don’t think Microsoft understands how often this happens.
First, the article linked above is intended to drive down support requests. This saves money for Microsoft, and frankly users appreciate it because none of us want to work through support tickets. However, it means that Microsoft doesn’t get signals about how many people are hitting the problem.
I can say that this does still happen with the latest versions of SQL Server and that it occurs on Azure SQL Managed Instance, based on what I have seen and what I see others reporting in blogs and forum posts. The support article also references Azure SQL Database, so presumably it happens there as well.
What’s up with the mention of sp_table_statistics2_rowset?
In Microsoft’s support deflection article, the full sample error is:
Msg 2767, Level 16, State 1, Procedure sp_table_statistics2_rowset, Line [Batch Start Line ] Could not locate statistics ‘’ in the system catalogs.
The sp_table_statistics2_rowset system procedure is not fully documented. This procedure is run automatically when you use a linked server to run a query, but the article doesn’t mention linked servers at all or say that the bug only occurs when using a linked server.
Since the article explains that a statistic is not present or not usable on the secondary replica itself, presumably this can impact queries that don’t involve linked servers, but it’s not clear to me whether those queries fail or simply ignore the statistic (and maybe get a worse plan). I have seen a query fail due to this when run over a linked server while the same query ran successfully when run directly against the replica. More on that in an upcoming post.
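For context, here is the shape of query I mean: a four-part name over a linked server. When the local optimizer costs the remote table, it calls sp_table_statistics2_rowset on the remote instance to fetch statistics, and that is where Msg 2767 surfaces. The linked server, database, and table names here are placeholders:

```sql
-- [SecondaryLink] is a hypothetical linked server pointing at the
-- readable secondary. The four-part name makes the local optimizer call
-- sp_table_statistics2_rowset remotely to pull statistics for costing.
SELECT o.OrderId, o.CustomerId
FROM [SecondaryLink].[SalesDb].[dbo].[Orders] AS o
WHERE o.CustomerId = 42;
```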
Can you rely on an architecture built on readable secondaries in SQL Server / read scale-out instances in Azure SQL Managed Instance?
Given how long this issue has been around and the level of customer impact it seems to have (queries failing repeatedly with no warning until an administrator intervenes), I think this is an important question worth asking.
I would love to hear more from Microsoft about this in their blog posts or support articles:
- Why is the “manual” statistics fix “more permanent”?
- Can this happen with statistics related to indexes?
- What happens when sp_table_statistics2_rowset isn’t involved?
- Is it something Microsoft intends to fix eventually? Who needs to complain to make that happen?
As it is, it just seems like another reason not to use readable secondaries.