Logo

Chapter 6: Multi-Object Synchronization

6.1 Multiprocessor Lock Performance

On a multiprocessor system, throughput may only be slightly faster than on a uniprocessor. This is typically a result of one of

Linux kernel locking evolution. Symmetric multiprocessing (SMP) was first added in version 2.0. To avoid concurrency issues, Linux added the Big Kernel Lock (BKL), which protected all shared data structures in the kernel. This allowed Linux to run on a multiprocessor machine, while more fine-grained SMP support was still being developed. As of version 2.6, the BKL is deprecated, and removed from most kernel subsystems.

6.2 Lock Design Patterns

Four design patterns

Commutative interface design. The UNIX open syscall API specifies that the syscall returns the next open file descriptor (ascending order). This forces open to use a lock, as two calls to open one after the other must return x,x+1x,x+1 (or something similar in ascending order). A commutative API ensures that the result of two calls is the same regardless of which call was made first. For open, this would require returning any unique integer not currently associated with a file descriptor, and thus would not need a lock (outside of handling stdin, stdout).

6.3 Lock Contention

6.3.1 MCS Locks

The Mellor-Crummey-Scott (MCS) lock is an implementation of a spinlock optimized for the case when there are many waiting threads. On a multiprocessor system, acquiring and releasing a lock is associated with a much higher overhead, since the lock and shared data must be transferred between all waiting processors. Moreover, each thread attempting to acquire a lock queues an atomic RMW instruction, and the hardware has no way to know which one to prioritize (the release instruction!).

A naive solution is test and test-and-set. In essence, the spin lock is implemented as

while(lock == BUSY || test_and_set(&lock));

However, this doesn't help. The lock's release forces the previously holding processor to communicate the updated lock state to the waiting processors. This is an issue if a lock's critical section is typically short, as most time is spent on inter-processor communication.

A more scalable solution is to assign each thread its own memory location to spinlock on, and releasing the lock sets a bit for one thread's memory location, instead of letting all waiting threads know. Almost like a CV's signal instead of broadcast for spinlocks.

MCS is a popular implementation of this, that essentially maintains a queue of waiting threads. Waiting threads add themselves to the tail of the queue, and spin on the flag in its queue entry. The thread releasing the lock will set the flag for the first entry in the queue. This makes it much more efficient when there are many waiting threads, as the lock is passed efficiently along the threads in the queue. The overhead of the queue makes MCS less efficient with few waiting threads.

6.3.2 Read-Copy-Update (RCU) Locks

RCU locks are a read/write-lock that is highly optimized for read operations, while sacrificing write operations.

RCU makes three important changes to the standard RW lock:

Some diagrams that demonstrate the idea:
rcu-ex1.png

rcu-ex2.png

As you may notice from the second figure, writes must still be serialized. In particular, their publish phases must execute in order (otherwise, versioning would not be linear). However, any number of concurrent reads may occurs during a write.

The writers must also synchronize, i.e. wait until all readers are finished before freeing the previous version. To do this performantly, RCU locks often interface with the scheduler, where the scheduler has some function quiescentState() that informs the waiting writer that the old version is quiescent, i.e. no reader has access to it.

6.4 Multi-Object Atomicity

Sometimes, it is necessary for there to be concurrency-safe (essentially atomic) interactions between shared objects. There are some solutions to consider for this problem:

6.5 Deadlock

A deadlock is simply a waiting cycle among threads. The simplest example occurs in mutually recursive locking, i.e. thread 1 acquires A then B, while thread 2 acquires B then A. Another example is nested waiting, i.e. thread 1 acquires two locks, releases one with a call to a CV's wait, but other thread that will signal CV needs the other lock.

Deadlock vs. Starvation
Starvation: a thread fails to make progress for an indefinite period of time.
Deadlock: starvation, but no threads within a group can make progress ever.

There are four necessary conditions for deadlock.

Note, though, that these conditions are necessary but not sufficient for deadlock.

There are a number of ideas to help prevent deadlock.

6.5.4 Banker's Algorithm for Avoiding Deadlock

Banker's Algorithm is a procedure developed by Dijkstra that improves on the performance limitations of acquire-all. In the algorithm, a thread states its maximum resource requirements upon beginning a task, and acquires/releases resources incrementally as the task runs. Banker's algorithm uses this knowledge of each thread's maximum resource requirements to ensure that, if a request is added, all threads will eventually complete, via simulating the resource allocations. If a request comes in that would result in an unsafe state, i.e. could potentially lead to a deadlock, it will delay that request.

6.5.5 Detecting and Recovering from Deadlocks

There are two primary approaches:

6.6 Non-Blocking Synchronization

In essence, non-blocking synchronization is the idea of writing concurrent programs entirely without locks. This eliminates lock contention and deadlocks, but can make programs much more complex and dangerous to design. In particular, wait-free data structures guarantees progress for every thread, regardless of other threads' states, while lock-free data structures guarantee progress for at least one thread. In essence, wait-free implies lock-free, and lock-free can cause starvation.

Aside: Handling Deadlocks

162 defines 4 ways of handling deadlocks: