Why do we have repeatable read and serializable? (10 minutes)

Transcript

Welcome to Repeatable Read and Serializable Isolation Levels in SQL Server. I’m Kendra Little from SQLWorkbooks.com.

Today, we will be answering five questions:

  1. Why do we need the repeatable read and serializable isolation levels? We’ll do demos showing how we can get incorrect data under the read committed isolation level.
  2. How do repeatable read and serializable protect us? With these levels we’re saying: under different situations, we have to have correct data, or we’re just not going to finish our query.
  3. What are the ways the serializable isolation level may sneak in, even if you aren’t using it on purpose? Even if you haven’t changed your isolation level to serializable or used a hint, there are scenarios where SQL Server will escalate your isolation level behind the scenes.
  4. How can you tell if your apps are even using these – as well as a very important question:
  5. What’s the way to make sure that you get correct data in SQL Server?

First up: why do we need repeatable read and serializable?

In fact, there can be clowns and ghosts – or phantoms – sneaking into your data!

The default isolation level in SQL Server is called read committed. This is the default isolation level in what’s called the “boxed product” of SQL Server. If you install SQL Server yourself in a virtual machine or on a server– I’m not talking about Azure SQL Database– but if you create a database on a SQL Server you install yourself, the default isolation for your queries is called read committed. In Azure SQL Database, it’s different, because read committed has some issues.
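If you’re curious what your own session and database are actually using, here’s a quick sketch using standard system views (the isolation-level codes in the comment come from the sys.dm_exec_sessions documentation):

```sql
-- What isolation level is my current session using?
-- 1 = read uncommitted, 2 = read committed, 3 = repeatable read,
-- 4 = serializable, 5 = snapshot
SELECT session_id, transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;

-- Is read committed snapshot enabled for this database?
-- (It's ON by default in Azure SQL Database, OFF in the boxed product.)
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = DB_NAME();
```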

“Read committed” sounds really good, right?

Okay, I’m going to read committed data. That sounds like it would be great, but read committed is lock-based.

When we read, say, from an index in read committed, we run through that index taking locks as we go and releasing them right away. That also sounds good, but when you look into the details, here’s what it means: since we let go of each lock as soon as we’ve read that row, and our query may still be running, data can move around on us.

We may read some rows twice: if we read a row and pass it, and it then moves ahead of us in the index, we may see it again.

We may miss a row entirely: if I’m running through the data and a row is updated and moves behind me, I may never see it as I traverse that index.

And we can even return combinations of data that never existed: plain old part “before” data, part “after” data. We’ll dig into this last one in detail – it’s the weirdest one – with a demo of how it can happen in read committed.

Isolation levels are defined by “phenomena”

When we talk about isolation levels – not just in SQL Server, but in relational databases altogether – the levels are actually defined by which phenomena can happen under them.

These phenomena are different types of weirdness – different “issues,” if you will – that can happen under a given isolation level.

Dirty reads

Under the default isolation level of read committed, the phenomenon of dirty reads – represented here by the poop emoji – is not possible. So read committed at least has the virtue of not allowing the dirty reads phenomenon.

But there are other phenomena that can happen in read committed: we can have non-repeatable reads, which we will see as a clown. And we can have phantom reads.

Non-repeatable reads

The name didn’t make any sense to me at first, but it really means what it says.

It means: I read this data, and if I were to read it again within my transaction, I would not get the same value. This can be a huge problem, and the data will just look wrong to users.

Phantom reads are a little different

If I read the same set of rows again, I won’t get the same set. Maybe a row was inserted: maybe there were three rows when I read the data first, and four rows when I read the data again. Nothing was *updated*, but a phantom row appeared.

Simple examples of non-repeatable reads and phantom reads

This is not that hard to imagine.

For a non-repeatable read, imagine that I’m running a report

I’m running this in read committed, and I do a BEGIN TRAN. My first SQL statement selects the SUM of my revenue from the table. I’ve summed up my revenue. After that first statement completes, another transaction updates a row in that table, and it changes some of the data. My report has the summary data at the top, and it has a detail section below it. So I then run a statement that gets all the detail data. In read committed, that second query is going to see the updated information. So, when we selected the SUM of revenue, we had a non-repeatable read in there: we read some data, it changed after that, we read it again, and it was different.
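Here’s a minimal sketch of that report pattern in T-SQL. The dbo.Sales table, its columns, and Session 2’s update are all hypothetical, just for illustration:

```sql
-- Session 1: the report, under the default read committed level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRAN;

    -- Summary section at the top of the report
    SELECT SUM(Revenue) AS TotalRevenue
    FROM dbo.Sales;

    -- ... meanwhile, Session 2 commits:
    -- UPDATE dbo.Sales SET Revenue = Revenue + 100 WHERE SaleId = 1;

    -- Detail section: sees the updated row, so the detail rows
    -- no longer add up to TotalRevenue (a non-repeatable read)
    SELECT SaleId, Revenue
    FROM dbo.Sales;

COMMIT;
```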

Phantom rows are similar. We’re running our same report.

We select our summary data, and then, after we select that summary data but before we do anything else, another transaction comes along. It inserts a new row into the same data we’re selecting. In this case we don’t even have a WHERE clause – we don’t have any predicates at all. It just inserts a row into that table.

Now, when we run the detail query, we’ve got an extra row in there that isn’t accounted for in our sum.
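Using the same hypothetical dbo.Sales table, the phantom version of the report looks like this:

```sql
-- Session 1: same report, still in read committed
BEGIN TRAN;

    -- Summary section
    SELECT SUM(Revenue) AS TotalRevenue
    FROM dbo.Sales;

    -- ... meanwhile, Session 2 commits a brand-new row:
    -- INSERT dbo.Sales (SaleId, Revenue) VALUES (4, 500);

    -- Detail section: returns a phantom row the summary never counted
    SELECT SaleId, Revenue
    FROM dbo.Sales;

COMMIT;
```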

This data isn’t right

In both of these cases, in the case of the non-repeatable read and in the case of the phantom read, if your CEO is reading this report, she’s going to look at this and be like, “This data isn’t right. These numbers don’t match.”

So, we either look like a clown, or we’re seeing a ghost in our records, neither of which are good.

Higher isolation can protect you from clowns and ghosts

The isolation levels repeatable read and serializable were introduced to make sure that if we don’t want these phenomena, we don’t have to have them. We can prevent them. If we aren’t worried about phantoms and we only want to prevent non-repeatable reads, we can set our isolation level to repeatable read. (You can also use a query hint, but setting the isolation level says: for my entire session, for everything I run, I want to use repeatable read unless I hint otherwise – see the sketch below.) We start our report, and we select the SUM of our revenue.
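Both options look like this in T-SQL – again a sketch, with the hypothetical dbo.Sales table standing in for your real data:

```sql
-- Option 1: set the isolation level for the whole session,
-- until it's changed again
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Option 2: hint a single table reference in a single query
SELECT SUM(Revenue) AS TotalRevenue
FROM dbo.Sales WITH (REPEATABLEREAD);
```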

The difference is, now that I’m in repeatable read, after I select this data I’m going to hold on to locks that protect it. Those locks say: I have a transaction where, if I read this data again, it needs to be the same, so I need to protect it. While my transaction is ongoing, somebody else comes in and tries to update that row. They are going to be blocked, and they can’t proceed until I’m done – until I commit my transaction or roll it back.

So now, when I select my detail data, my reads are repeatable – they are all the same.

My CEO sees data that matches. I do not look like a clown.

This is good, right?

But there was an impact. There was a tradeoff.

That transaction that came in to update a row, it was blocked, and it had to wait.

What happens if in the rest of this transaction I’m doing stuff that takes a whole lot of time? And, what if the update is really important and there’s an executive from one of my clients waiting for that update to complete on a screen?

So, there are tradeoffs here, if we’re using this method to make sure our data is correct.

Because if we are using a disk-based table, repeatable read is going to protect this with locks. And with locks can come blocking.
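If you want to catch that blocking while it’s happening, one quick way (a sketch using a standard dynamic management view) is:

```sql
-- Which sessions are currently blocked, and by whom?
SELECT session_id,
       blocking_session_id,
       wait_type,      -- e.g. LCK_M_U for an update waiting on a lock
       wait_time       -- milliseconds spent waiting so far
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```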

What if I’m worried about inserts? Well, the repeatable read isolation level is holding locks on what I read: it isn’t doing anything to prevent rows from being inserted. If I’m worried about those phantom rows, I need to set my transaction isolation level to serializable.

Serializable says: you won’t have any non-repeatable reads, and you won’t have any phantoms. So it includes the protections of repeatable read, and then adds additional protections. In this case, I run my first query and I acquire key-range locks that are held to the end of my transaction. I’m not just protecting exactly what I’ve read – I’m protecting a range, which says: new rows cannot be inserted into this range. If an insert happens, the insert will be blocked. Same thing for an update: if an update happens, it will be blocked. So when I run my last statement, I can get data consistent with the first statement.
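Here’s a sketch of the serializable version of the report, with a peek at the key-range locks the session is holding (dbo.Sales is still our hypothetical table, and key-range locks assume there’s an index to range over):

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;

    -- Summary section: acquires key-range locks
    -- that are held to the end of the transaction
    SELECT SUM(Revenue) AS TotalRevenue
    FROM dbo.Sales;

    -- Peek at this session's locks: look for request_mode values
    -- like 'RangeS-S', the shared key-range locks
    SELECT resource_type, request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = @@SPID;

COMMIT;
```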

But again, what we’re talking about with repeatable read and serializable are pessimistic isolation levels. We’re collecting locks – key-range locks, in the case of serializable – and our tradeoff is blocking.

But we get correct data!

And that is a good thing!