By Kendra Little on November 24, 2024
This is the worst bug I’ve found in SQL Server to date. Previously, my top find was SQL Server Online Index Rebuild sometimes happens offline without warning. This one has taken top slot because it makes my life more difficult on a daily basis.
Background: SQL Server generates a query_hash
for each query. This is stored in sys.query_store_query
and it’s one of the primary ways you can identify what a query is across different Query Stores, or even the same Query Store over time, as surrogate query_id
values get reset if Query Store is cleared or data ages on. The query_hash
is a “Zobrist hash over the shape of the individual query, based on the bound (input) logical query tree. Query hints aren’t included as part of the hash.” (Source)
Except that’s wrong. These days it’s NOT always a hash of the query. If you force query plans, or if you use something like Automatic Plan Correction which forces plans, frequently the query_hash
value will be bizarrely overwritten by a query_plan_hash
value. So sometimes the query_hash
ends up being the “MD5 hash of the individual plan.” (Source) This occurs first in the query_plan XML, then it weirdly invades the Dynamic Management Views in Query Store for all query_hash
values for that query_id
.
This presents a big challenge identifying queries in Query Store, or writing reliable diagnotic tooling and automation. I’ve reported this, but Microsoft hasn’t acknowledged that it’s a bug yet. At this point they are still asking me if there’s a pattern in how my TSQL is written that causes the query_hash
value to be the same as the query_plan_hash
.
Dear reader, if my TSQL was that powerful I would be doing MUCH wilder things than stunts with query_hash
values.
This post shows a repro of the problem and just how unreliable the query_hash
has become in SQL Server 2022 and Azure SQL Managed Instance. It may happen in additional versions, I haven’t tested more broadly than this.
Set up our environment to reproduce this
I’m using a copy of the StackOverflow2013 database to reproduce this. I’m currently running Developer Edition with @@version
Microsoft SQL Server 2022 (RTM-CU15-GDR) (KB5046862) - 16.0.4155.4 (X64).
To get started, I reset basic configuration, make sure Query Store is enabled, clear Query Store, set database compat level to 160, and make sure Automatic Plan Correction is off.
Warning: Automatic Plan Correction automatically forces query plans and it causes this bug to spread rapidly. Especially avoid on relying on
query_hash
to identify queries if you have that feature enabled. However, the bug occurs with plan forcing even if you’re manually forcing plans, so I’m reproducing that in this post.
USE [master]
GO
exec sp_configure 'show advanced options', 1;
GO
RECONFIGURE
exec sp_configure 'max degree of parallelism', 8;
GO
exec sp_configure 'cost threshold for parallelism', 50;
GO
RECONFIGURE
GO
ALTER DATABASE [StackOverflow2013] SET QUERY_STORE = ON
GO
ALTER DATABASE [StackOverflow2013] SET QUERY_STORE (
OPERATION_MODE = READ_WRITE,
MAX_STORAGE_SIZE_MB = 4000,
CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 10))
GO
ALTER DATABASE [StackOverflow2013] SET COMPATIBILITY_LEVEL = 160
GO
USE StackOverflow2013;
GO
ALTER DATABASE CURRENT SET QUERY_STORE CLEAR;
GO
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING ( FORCE_LAST_GOOD_PLAN = OFF );
GO
Create indexes and procedure for a parameter sniffing situation
Next, I create two indexes and a stored procedure. This is not meant to be beautiful code, it’s code I wrote for a demo on problems that can arise because Automatic Plan Correction doesn’t exclude temp tables, but it works to reproduce this issue as well.
EXEC dbo.DropIndexes;
GO
CREATE NONCLUSTERED INDEX ix_Posts_OwnerUserId_INCLUDES
on dbo.Posts (OwnerUserId) INCLUDE (Id);
GO
CREATE NONCLUSTERED INDEX ix_Users_DisplayName_INCLUDES
on dbo.Users (DisplayName) INCLUDE (Id);
GO
CREATE OR ALTER PROCEDURE dbo.PostsForDisplayname
@DisplayName NVARCHAR(40)
AS
BEGIN
SET XACT_ABORT, NOCOUNT ON;
SET @DisplayName=@DisplayName + '%'
SELECT u.Id as UserId
INTO #users
FROM dbo.Users as u
WHERE u.DisplayName like @DisplayName;
/* This query selects from the #users temp table
That means it can use statistics on the temp table to recompile if the table
has a lot more data in it than in the cached plan */
SELECT top 10 p.*
INTO #bin
FROM dbo.Posts as p
JOIN #users as u on p.OwnerUserId = u.UserId
ORDER BY Score DESC;
END
GO
Run the procedure and get two plans in cache for the second query
We run the following SQL:
exec dbo.PostsForDisplayname 'Clothilde';
GO
exec dbo.PostsForDisplayname 'Jon';
GO
There are no posts by users named ‘Clothilde%’, but there are a LOT of posts by users named ‘Jon%’, so a recompile naturally occurs due to statistics on the temporary table.
Run the following query to review the hashes from both the Query Store DMVs and extract the same values from the query plan XML:
SELECT
q.query_id,
p.plan_id,
qt.query_sql_text,
CAST(p.query_plan AS XML) AS QueryPlan,
q.query_hash as QueryHashFromSysQueryStoreQuery,
p.query_plan_hash as QueryPlanHashFromSysQueryStorePlan,
CAST(p.query_plan AS XML).value('declare namespace sp="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; (/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/@QueryHash)[1]', 'varchar(64)') AS QueryHashExtractedFromPlanXML,
CAST(p.query_plan AS XML).value('declare namespace sp="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; (/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/@QueryPlanHash)[1]', 'varchar(64)') AS QueryPlanHashExtractedFromPlanXML,
CAST(p.query_plan AS XML).value('declare namespace sp="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; (/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan/@UsePlan)[1]', 'int') AS IsUsePlanExtractedFromPlanXML
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p
ON q.query_id = p.query_id
WHERE q.object_id = OBJECT_ID('dbo.PostsForDisplayname', 'P');
GO
We care about the second query in the procedure here, which has been assigned query_id 7. Query_id 7 has two different query plans, which have plan_ids 9 and 10. Here are the hash values we currently have (extracted from the image):
Query Id | Plan Id | Query Hash sys.query_store_query | Query Plan Hash sys.query_store_plan | Query Hash in Plan XML | Query Plan Hash in Plan XML |
---|---|---|---|---|---|
7 | 9 | 0xD55EEC4C7A81E644 | 0x44A186BFC52AA053 | 0xD55EEC4C7A81E644 | 0x44A186BFC52AA053 |
7 | 10 | 0xD55EEC4C7A81E644 | 0x56049465CF9FFB95 | 0xD55EEC4C7A81E644 | 0x56049465CF9FFB95 |
This all looks normal and good:
- Query_id 7 has a
query_hash
that is the same insys.query_store_query
as it is in the query plans. - The
query_plan_hash
is a different value because it’s a hash of the query execution plan, not the query. - The query_hash values in
sys.query_store_plan
match the values in the query plan XML.
Force a query plan, then rerun the stored procedure for a new value
Next, we’re going to force the “small” query plan for this– the one we ran for Clothilde
which had very few rows in the temp table.
After forcing that plan, we run the procedure for the value J
, which is going to use that forced plan but will find a whole lotta rows. Run that twice:
exec sys.sp_query_store_force_plan @query_id=7, @plan_id=9;
GO
exec dbo.PostsForDisplayname 'J';
GO 2
Something’s gone weird in the query_hash value in our execution plan
Rerun the query above retrieving the query_hash
and query_plan_hash
values from the DMVs and the query plans.
We have a new query plan, and something is starting to look weird:
Here are the values for query_id 7 extracted from the image:
Query Id | Plan Id | Query Hash sys.query_store_query | Query Plan Hash sys.query_store_plan | Query Hash in Plan XML | Query Plan Hash in Plan XML | Is Use Plan Hint in Plan? |
---|---|---|---|---|---|---|
7 | 9 | 0xD55EEC4C7A81E644 | 0x44A186BFC52AA053 | 0xD55EEC4C7A81E644 | 0x44A186BFC52AA053 | |
7 | 10 | 0xD55EEC4C7A81E644 | 0x56049465CF9FFB95 | 0xD55EEC4C7A81E644 | 0x56049465CF9FFB95 | |
7 | 11 | 0xD55EEC4C7A81E644 | 0x61807D991A7B6914 | 0x61807D991A7B6914 | 0x61807D991A7B6914 | Yes |
SQL Server saw that we had forced a Query Plan. It also recognized that the row estimates in the forced query plan were way lower than the estimates for the currently executing query, so it created a morally equivalent plan which has the same shape as the forced plan, but larger row estimates.
That’s cool, but this weird bug showed up, too:
- The “morally equivalent plan” XML has the wrong
query_hash
value in it. It’s stored thequery_plan_hash
as thequery_hash
, too! - The
query_hash
in the execution plan for plan_id = 11 does not match thequery_hash
insys.query_store_query
.
Here’s what this looks like when viewed in the query plan properties in SSMS:
It gets worse: further executions make the query_hash incorrect for every row with this query_id in sys.query_store_query
Now run the stored procedure for another value:
exec dbo.PostsForDisplayname 'Kendra'
GO
Rerunning the same query to check query_hash
values in the dynamic management views shows that we have a new “morally equivalent plan” added.
And the Query Store DMVs have now become infected with that messed up query_hash
value.
I’ve extracted the data for query_id 7 to text, which is:
Query Id | Plan Id | Query Hash sys.query_store_query | Query Plan Hash sys.query_store_plan | Query Hash in Plan XML | Query Plan Hash in Plan XML | Is Use Plan Hint in Plan? |
---|---|---|---|---|---|---|
7 | 9 | 0x61807D991A7B6914 | 0x44A186BFC52AA053 | 0xD55EEC4C7A81E644 | 0x44A186BFC52AA053 | |
7 | 10 | 0x61807D991A7B6914 | 0x56049465CF9FFB95 | 0xD55EEC4C7A81E644 | 0x56049465CF9FFB95 | |
7 | 11 | 0x61807D991A7B6914 | 0x61807D991A7B6914 | 0x61807D991A7B6914 | 0x61807D991A7B6914 | Yes |
7 | 12 | 0x61807D991A7B6914 | 0x868D1CD7976DCFF8 | 0x868D1CD7976DCFF8 | 0x868D1CD7976DCFF8 | Yes |
Now we are in a state where:
- One “morally equivalent” query plan has erroneously set its
query_hash
value to thequery_plan_hash
value in the plan XML - Another “morally eqivalent” plan being generated has caused the
query_hash
value insys.query_store_query
to be bizarrely updated to the value of thequery_plan_hash
for one of the plans. - We have three query plans where the
query_hash
value in the plan XML does not match thequery_hash
value insys.query_store_query
- If we have any tooling that has recorded the previous
query_hash
value of0xD55EEC4C7A81E644
for this query, it (or you) will no longer find that insys.query_store_query
.
What this means if you force query plans
I’m hoping Microsoft recognizes this a bug and fixes the issue, because this makes using Query Store much harder for administrators. I think, most critically, it makes learning to use Query Store harder.
First of all, it makes no sense to me that the hashes of two different things (query text and a query plan) would regularly be the same value. Holy hash collision, Batman. It just is confusing.
But this also means:
- You can’t reliably use the
query_hash
value from an execution plan that you find with a tool likesp_WhoIsActive
or from a monitoring tool to find the query insys.query_store_query
.- Using the
query_plan_hash
and looking it up insys.query_store_plan
should be more reliable, but you have to know to do that.
- Using the
- If you have a single tenant architecture (multiple copies of the same database schema in production), you can’t reliably find the same query across different query stores using the
query_hash
value.- If query plans have been forced in some places, the
query_hash
could be set to who knows what, and you’ll incorrectly think the query isn’t running there. - You can’t use
query_plan_hash
for this, either, because possibly the query has totally different plans. last_compile_batch_sql_handle
insys.query_store_query
is potentially the best bet at this point. But that’s for the entire batch. Maybestatement_sql_handle
insys.query_store_query_text
is the best bet, because that value is also in query plan xml.
- If query plans have been forced in some places, the
- The same thing also goes for queries that you’ve saved off in the past and want to examine now, but Query Store data has changed.