Logo

Chapter 14: Reliable Storage

"The central question of this chapter is: How can we make a storage system more reliable than the physical device out of which it is built?"

The question of reliability actually revolves around 3 principles.

Persistent storage systems, in particular, must deal with two threats to reliability.

To address this, there are two mainstream solutions, one for each issue.

14.1 Transactions

Pre-Transactional Approaches

Many older file systems used ad-hoc approaches to solve the problem of interrupted operations.

FFS, for instance, would simply ensure that any changes to the disk were always completed in a specific order. Then, if a crash occurred during a group of updates, a disk scan could identify the inconsistency and repair the data structures. This disk scan, for FFS, was the fsck utility.

However, this approach has largely been abandoned during modern times for a few reasons.

  1. Complex Reasoning. Like multi-threaded concurrency, this approach quickly becomes complicated to reason about as systems grow more complex.
  2. Slow Updates. File systems are forced to insert barriers between dependent operations, reducing pipelining and parallelism.
  3. Slow Recovery. After a crash, a machine must scan all of its disks for inconsistencies. As persistent storage rapidly grew, the resulting overhead grew larger and larger.

Transaction Abstraction

As aforementioned, a transaction abstracts a sequence of several operations as a single, atomic unit. A transaction finishes in one of two ways, it...

A transaction, at its core, aims to fulfill the four ACID properties. (These are most commonly referenced in database system design, but really apply to any system desiring reliability).

Transaction Implementation

A transactional file system must deal, of course, with the inherent fact of persistent storage devices—their only atomic operation is writing a single sector or page. Typically, they do this by, before executing a transaction itself, recording a transaction's intentions in persistent storage, i.e. the operations it intends to perform. Only once these are all safely stored, the transaction should begin to perform its operations; any interruption can be recovered by the system re-executing the stored intentions.

Redo Logging is one way of doing exactly this. It consists of four stages.

  1. Prepare. Append all planned updates to the log.
  2. Commit. Append a commit record to the log, indicating that the transaction has committed. (Or, it may be rolled back. In this case, simply not writing a commit record may suffice.)
  3. Write-back. The transaction's updates are actually carried out.
  4. Garbage collect. Finally, the transaction's updates in the log are removed.

Note that step 2 is known as the atomic commit. Before that moment, the transaction may safely be rolled back; after step 2 completes, the transaction must take effect. This becomes clear after understanding the recovery routine: for each record in the log (scanning sequentially), if it is...

  1. an update record for a transaction, then add this record to a sequence of planned updates for this transaction.
  2. a commit record for a transaction, then write-back this transaction's logged updates (i.e. the sequence of planned updates for this transaction)
  3. a rollback record (if rollback records are supported), then discard the sequence of planned updates.

When the end of the log is reached, any remaining sequences of planned updates are discarded, as these transactions did not have commit records in the log!

There are, however, a few key details that must be observed.

Undo Logging
Transactions can also be implemented via undo logging, which essentially just replaces logging the transaction's intentions with logging the old version of the object. In the event of a failure, the system simply writes the old version of the object back into place. However, undo logging frequently requires more random I/Os (and just more I/O in general, especially if objects are large) and gives up the flexibility (and performance) of asynchronous write-backs. Thus, redo logging is more commonly used.

Undo/redo logging involves logging both the old and new versions of an object. This allows redoing the updates of committed transactions and undoing the updates of uncommitted transactions.

Concurrent Transactions and Shared State

What happens if concurrently executing transactions operate on the same shared state? Then, these transactions may not actually appear atomic. To address this, one popular method is two-phase locking (2PL). This divides a transaction into two phases: during expanding (pre-commit), locks are only acquired; during contracting (post-commit), locks are only released. 2PL helps ensure a strong form of isolation known as serializability, i.e. the results appear as if the transactions are processed sequentially, one at a time.

Notably, 2PL has the risk of introducing deadlock. Transactions, though, provide a simple solution themselves! If a set of transactions deadlocks, one or more transactions are rolled back, release their locks, and are queued for execution later.

Multiversion Concurrency Control (MVCC)

MVCC is an alternative to locks for enforcing transaction isolation. It is, essentially, the concept of RCU Locks abstracted and applied to transactions.

Isolation Relaxation

The strict requirements of serializability inevitably impact performance, as it forces transactions to, at times, block or roll back. In some cases, it may be beneficial to relax isolation requirements to improve performance. For instance, snapshot isolation only requires that each transaction's reads appear to come from a "snapshot" of the system's committed data when the transaction starts. Transactions are buffered until they commit; then, the system checks for write-write conflict, essentially two transactions attempting to modify the same object with overlapping transaction times. If there exists a write-write conflict, the transaction being committed is rolled back. Note that snapshot isolation is weaker than serializability because a transaction's reads and writes happen at logically separate times (i.e., you can imagine that each transaction is split into one operation of reads and one operation of writes, and serializability is only guaranteed after this split). This allows write skew anomalies, where a transaction reads object xx and updates object yy, while a concurrent transaction reads object yy and updates object xx.

Finally, let's discuss the performance of redo logging. Surprisingly, it's not too bad, despite seemingly duplicating every single write (once to the log, once to the actual data). This is for several reasons.

Transactional File Systems

Finally, we discuss the actual details of transactions in file systems. Traditional, update-in-place file systems were one of two types.

Many modern file systems, though, are copy-on-write. They offer several performance optimizations over journaled and log-structured file systems.

14.2 Error Detection and Correction

Storage systems include multiple defenses against device failures.

We'll discuss each of these in turn, separately.

Storage Device Failures and Mitigations

Individual storage devices—whether magnetic disk or flash—can fail in two distinct ways.

  1. An isolated disk sector or flash page can lose existing data or degrade to the point where it cannot consistently store new data
  2. An entire device can fail

Sector/Page Failures

Storage devices use two mitigation techniques.

Error Correcting Codes
Error correcting codes are a clever way of encoding data such that an error in a small number of bits within a page can be fixed automatically by the device hardware.

Remapping
Disks and flash devices typically include some number of spare sectors/pages. When a sector or page permanently fails, the OS or device firmware can remap the failed sectors/pages to good, spare ones.

Device Failures

When a whole storage device fails, the host computer will detect this failure, and any reads/writes will return error codes. Failure rates are typically described by an annual failure rate or the mean time to failure (MTTF). In general, "the number of failure events in a fixed time period can be mathematically modeled as a Poisson process, as the interarrival time between failure events follows an exponential distribution."

RAID

RAID, or Redundant Array of Inexpensive Disks, is a system that cleverly distributes data redundantly across several disks for the purpose of recovering data in the event of an individual disk failure. The two most commonly discussed RAID architectures are

Notably, there are a couple performance considerations to be had with the organization of rotating parity redundancy.

Below is a diagram of RAID 5 in action.

raid5.png

Atomic Updates

What if the system crashes while we're updating the parity block? Then, it'll seem like we have corrupted data, when really our update was just interrupted midway through. To deal with this, it's necessary to make our RAID updates atomic. There are a few solutions.

Non-volatile write buffer
Hardware RAID systems can often include a battery-backed write buffer, which simply includes the planned/in-progress writes.

Transactions
Transactions provide atomicity by nature.

Recovery scan
After a crash, the system scans the blocks and updates any inconsistent parity blocks.

RAID: Extended

Though we describe in detail just RAID 1 and RAID 5 above, there are actually many more RAID levels! The original paper described RAID 0 – RAID 5; three of these variants see wide use today.

RAID 0: JBOD (Just a Bunch Of Disks)
This system doesn't actually add any redundancy; it just spreads data across multiple disks.

RAID 1: Mirroring (described previously)

RAID 5: Rotating Parity (described previously)

More variants have also been developed.

RAID 6: Dual Redundancy
RAID 5 but with two parity blocks, generated using erasure codes (e.g. Reed-Solomon codes) that tolerate up to two disk failures.

RAID 10, RAID 50: Nested Raid
These were originally called RAID 1+0 and RAID 5+0. Quite simply, they just combine RAID 1 with RAID 0 or RAID 5 with RAID 0, respectively.

RAID Reliability?

So, how reliable are RAID 1 and RAID 5? They can actually lose data in 3 distinct ways.

The expected time until one such event occurs is the mean time to data loss (MTTDL).

But we can, of course, further improve RAID reliability. Broadly, there are three techniques.

  1. Increasing redundancy - double or even triple redundancy, i.e. mirroring data on more than one disk for RAID 1. A dual redundancy array allows data to be reconstructed tolerating at most two disk failures, i.e. this is just RAID 6, which is double redundancy for RAID 5.
  2. Reducing non-recoverable read error rates
    Scrubbing - periodically reading all disk sectors, detecting sectors with unrecoverable read errors, reconstructing their contents with RAID, and remapping them to a spare.
    More reliable disks - yeah... idk why this is listed in the textbook but yup
  3. Reducing mean time to repair (MTTR)
    Some systems include a hot spare disk that is idle, but can quickly replace a failing disk. MTTR is, though, inevitably limited by the time to write all the reconstructed data to the new disk too.
    Some systems also decluster data, splitting the reconstruction across multiple disks to allow parallel reconstruction.

Software Integrity Checks

These are integrity checks implemented at the software/file system level, to detect hardware failures themselves. We discuss two common techniques.