Chapter 5: Synchronizing Access to Shared Objects

Sequential reasoning does not work in programs with cooperating threads:

Threads' access to shared state may be interleaved
Nondeterministic program execution due to varying scheduling decisions, processor frequencies, etc.
Compilers and processor hardware may reorder instructions

5.1 Challenges

Race condition. When behavior of a program changes depending on thread interleaving.
Atomic operations. These operations cannot be interleaved with/split by other operations.

5.2 Structuring Shared Objects

Shared objects can be accessed safely by multiple threads. All shared state in a program should be encapsulated in a shared object.

There are several layers in shared object implementation:

Shared object layer. The underlying details of these objects are abstracted away—they seem to work perfectly in multithreaded situations.
Synchronization variable layer. Synchronization variables include semaphores, locks, and conditions variables. These coordinate access to state variables.
Atomic instruction layer. Some instructions must execute atomically in order to support synchronization variables.
Hardware layer. Of course, there must be hardware support for atomic instructions.

5.3 Locks: Mutual Exclusion

A lock is a synchronization variable that provides a thread exclusive access to some subset of shared thread. Any operations done while a thread holds a lock appear to be atomic to other threads.

A lock has two states—BUSY (held by a thread) or FREE (can be acquired)—and two associated methods—Lock::acquire() and Lock::release(). Lock::acquire() blocks until the lock is FREE, and then atomically transitions the lock state to BUSY. Lock::release() atomically transitions the lock state to FREE. If there are pending acquire operations, it will trigger one to proceed.

Formal Properties. A lock should ensure the following 3 properties:

Mutual Exclusion: At most one thread holds the lock.
Progress: If no thread holds the lock and any thread attempts to acquire the lock, eventually some thread will succeed.
Bounded waiting. If thread $T$ tries to acquire a lock, there is a bound on the number of times other threads can successfully acquire the lock before $T$ does.

Critical Section. A critical section is a section that atomically accesses shared state. So, a section of code in which a lock is acquired by a thread is a critical section.

5.4 Condition Variables: Waiting for a Change

A condition variables is a synchronization object that lets a thread efficiently wait for a change to shared state protected by a lock. That is, it avoids busy waiting. It has three methods:

CV::wait(Lock *lock). Atomically releases the lock and suspends execution of the calling thread. This avoids busy waiting by immediately yielding the processor. When the calling thread continues, it will reacquire the lock before returning.
CV::signal(). This call takes a thread off the condition variable's waiting list and marks it as ready to run (READY state), if the waiting list is nonempty.
CV::broadcast(). Same as CV::signal(), but for all threads.

A condition variables also has three noticeable properties:

Memoryless. A condition variable has no internal state other than a queue of waiting threads.
CV::wait atomically releases the lock. This is necessary to avoid a race condition, in which a signal might slip in between the lock releasing and the thread being put onto the waiting list.
CV::signal or CV::broadcast does not guarantee a waiting thread will run immediately. It will simply move the thread to the ready queue with no special priority, and then the scheduler can run it at a later time at its own discretion.

Notably, the above properties mean that wait must always be called within a loop, as, after a signal or broadcast call, the lock may not be in the FREE state anymore and the shared state variables may not satisfy the desired conditions. e.g., the following code...

if (testStateVariables(...)) {
	wait(&lock);
}

...could run into issues, as other threads may have run between the time of the signal and this thread continuing, resulting in changes to the shared state such that testStateVariables is no longer true, despite it being true when signal occurred.

Mesa vs. Hoare Semantics
These are different semantics for condition variables. Mesa has signal and broadcast calls pop waiting threads from a condition variable's waiting list and puts htem on a ready list. In contrast, Hoare has signal immediately suspend execution of the signaling thread and transfer lock ownership to a waiting thread, whose execution is immediately resumed. After this thread finishes, the lock ownership returns to the signaling thread.

Hoare semantics allows programmers to use if instead of while for condition variables. However, practically all modern systems use Mesa semantics due to the modularity advantage and, primarily, the reduction in context switches it provides.

Revisiting the thread life cycle, it's important to note that a RUNNING thread that calls wait is put into the WAITING state, and its TCB is moved into the WAITING thread queue of the condition variable. That is, there are multiple queues for different CVs. Locks function similarly, with a queue of waiting TCBs per lock.

5.5 Designing and Implementing Shared Objects

High Level Approach

Add a lock.
Add code to acquire and release the lock. In particular, the lock should be acquired/released at the beginning/end of each function that accesses shared object. This is the easiest, simplest method, and should generally always be used since acquiring a lock is relatively inexpensive (i.e. acquiring/releasing later/earlier will not significantly optimize the code).
Identify and add condition variables. Consider each method and ask when it can wait. Occasionally, it may also be more convenient to define fewer condition variables for a large number of possible conditions, and simply force methods to broadcast for the condition variable after releasing a lock.
Add loops to wait for using the condition variables. As aforementioned, condition variables should always be used with a loop. Modern operating systems often even allow spurious wakeups, sometimes (thread returns without any signal/broadcast). This also increases code modularity, as this provides guarantees about whatever condition was being waited on for the code following this loop.
Add signal and broadcast calls. signal is appropriate when (1) at most one waiting thread can make progress at a time and (2) any arbitrarily chosen thread can make progress. Meanwhile, broadcast is appropriate when (1) multiple waiting threads can make progress simultaneously or (2) the condition variable is used for different conditions/predicates for multiple threads.

Implementation Best Practices

Consistent structure. Having clean, consistent code allows for easier reasoning about the concurrency and more focus on the core problem.
Always synchronize with locks and condition variables. As in, use locks and CVs instead of semaphores, because they are more "self-documenting."
Always acquire the lock at the beginning of a method and release it right before the return. As previously discussed, this is the easiest, simplest method. If there is a legitimate reason for not doing this, perhaps this logic should be made into its own function.
Always hold the lock when operating on a condition variable. Otherwise, it's undefined behavior.
Always wait in a while() loop. Discussed previously, ensures modularity and portability with implementations that allow spurious wakeups.
(Almost) never use thread_sleep. wait is generally better and more useful.

Three Pitfalls

Double-checked locking. An "optimization" that makes a check before acquiring a lock to see if acquiring a lock is necessary, then acquires the lock, then checks again. The reason the check inside the lock is necessary is because of a race condition between the first check and the lock acquisition. However, this practice can result in code that is complex or even wrong.
Avoid defining a synchronized block in the middle of a method. Again, discussed previously.
Keep shared state classes separate from thread classes. Making thread classes have shared state as well blurs the lines between threads and shared objects, which can make things confusing.

5.7 Implementing Synchronization Objects

Some hardware primitives are required for synchronization objects:

Disabling interrupts.
Atomic read-modify-write instructions.

On a uniprocessor, it's possible to implement locks by simply disabling interrupts on acquisition, and re-enabling on release. However, untrusted userspace program could hold a lock forever, monopolizing the processor. Instead, it's better to use uniprocessor queuing locks. This only disables interrupts while acquiring/releasing the lock, but re-enables interrupts once the lock is acquired/released. This is because, if other threads do try and acquire this lock while it's BUSY, the thread will simply go to the lock's waiting list. In this way, the lock works as expected while avoiding processor monopolization by a single thread.

Critically, if a thread tries to acquire a lock and ends up sleeping, it is the responsibility of the next thread that wakes up from a lock to re-enable interrupts.

On a multiprocessor, though, disabling interrupts is insufficient, as other cores may be executing threads that modify the synchronization objects. Thus, processor architectures must provide atomic read-modify-write instructions to support synchronization. One possible implementation of locks is spinlocks, in which the thread waiting for a BUSY lock spins (busy-waits) while waiting for another thread to release the lock. This can be inefficient, though, if locks are held for a while.*

*Spinlocks can be useful, primarily when the expected waiting time is short, i.e. less than the overhead of context switching to a different thread.

Interrupt handlers and spinlocks
Notably, interrupt handlers that access shared data must use spinlocks. This is because a queuing lock might result in an interrupt handler being blocked, which cannot occur. Also, when a spinlock is acquired for an interrupt handler, the thread must disable interrupts. Otherwise, a deadlock could result if another interrupt arrives while the spinlock is still held.

For multiprocessor queuing locks, it's necessary to have a spinlock for the scheduler READY list. To reduce contention, each lock's internal state is guarded a spinlock. Importantly, it is also necessary to disable interrupts before locking the ready list, as a preemption could result in a deadlock.

Linux's lock implementation exploits the fact that most locks are generally FREE when acquisition is attempted. Therefore, it provides a fast path for acquiring a lock using several hardware optimizations and a design that allows lock acquisition and release without first acquiring the spinlock or disabling interrupts. It does this by keeping an atomic count variable that can denote the lock as FREE (1), BUSY (0), or BUSY and having possible waiting threads (any negative number). The fast path atomically decrements this count variable and then checks the value to see if it's FREE. If not, it falls back to the slow path, which requires acquiring the spinlock and disabling interrupts.

Condition variables are implemented very similarly. However, it's notably necessary to acquire the scheduler spinlock before releasing the lock associated with the CV. This ensures there are no race conditions with signaling/moving threads off the waiting list.

Meanwhile, application-level synchronization can be implemented via kernel-managed threads or user-managed threads. The former generally works very similarly. The latter has all synchronization data structures in process memory, and also disables signals/upcalls rather than interrupts (of course, the ability to disable interrupts from userspace would be devastating for security).

5.8 Semaphores Considered Harmful

Semaphores:

A semaphore has a non-negative value $x$ .
$x$ is initialized upon semaphore creation with the desired "capacity" of the semaphore, i.e. how many concurrent accesses are alowed.
semaphore::P() waits until $x>0$ , then atomically decrements $x$ by 1.
semaphore::V() atomically increments $x$ by 1, and then enables a thread waiting in semaphore::P() to run.
No other operations allowed. In particular, no thread can read the semaphore's internal state, specifically $x$ .

Semaphores can used for locks and condition variables. A mutual exclusion lock is implemented with $x=1$ . For condition variable-like waiting, semaphore::P() functions very similarly to CV::wait() and semaphore::V() functions similarly to CV::signal(). However, there are some distinct differences. Notably, semaphore::P() does not release an associated lock and a condition variable is stateless, while a semaphore is not. The latter, in fact, results in an observable difference: if a thread calls CV::signal with nothing on the waiting list, nothing happens; but, if a thread calls semaphore::V, the next call to semaphore::P will not block.

In general, it's more difficult to write concurrent programs with semaphores, for little benefit. Though, they are critically more performant for synchronizing communication between an I/O device and threads blocking for I/O. This is because the statefulness of semaphores allows for avoiding a race condition between an I/O device calling signal and the destination thread, when it's in between checking for I/O and waiting.

Aside: Atomic Read-Write-Modify Instructions

As aforementioned, multiprocessors cannot use the interrupt method to provide synchronization primitives. Moreover, another negative of the interrupt method is that users cannot use this lock, since users cannot enable/disable interrupts.

An effective alternative is atomic read-write-modify instructions. Hardware is responsible for implementing this properly. Notably, on multiprocessors, this introduces some complexity in the cache coherence protocol. This class considers the following three atomic RWM instructions:

int test_and_set(int* address) {
	int result = *address;
	*address = 1;
	return result;
}

void swap(int* address, int* reg) {
	int tmp = *address;
	*address = *reg;
	*reg = tmp;
}

bool compare_and_swap(int* address, int* reg1, int* reg2) {
	if (*reg1 == *address) {
		*address = *reg2;
		return true; // success
	} else {
		return false; // failure
	}
}

162 actually defines them as follows, exactly:

Consider test_and_set. We can implement a simple spinlock:

int lock = 0;

void acquire(int *lock) {
	// wait until old value is 0 (FREE), set to 1 to claim lock (BUSY)
	while(test_and_set(lock));
}

void release (int *lock) {
	// set to FREE
	*lock = 0;
}

Of course, spinlocks involve busy waiting, which is not ideal. We can do better!

int guard = 0; // global variable, spinlock for checking actual lock
int lock = FREE;

void acquire (int *lock) {
	while(test_and_set(guard)); // short busy-waiting/spinlock
	if (*lock == BUSY) {
		thread_state -> WAITING;
		guard = 0;
		sleep();
	} else {
		*lock = BUSY; // claim lock
		guard = 0;
	}
}

void release(int *lock) {
	while(test_and_set(guard)); // short busy-waiting
	if (wait_queue.size() > 0) {
		wait_queue.pop() -> READY; // first thread gets lock next
	} else {
		*lock = FREE;
	}
	guard = 0;
}

Aside: Fast Userspace Mutex (Futex)

futex is a syscall with the following signature (according to lecture):

int futex(int *uaddr, int futex_op, int val,
			const struct timespec *timeout);

futex_op can be...

FUTEX_WAIT $\implies$ if *uaddr == val, sleep until FUTEX_WAKE, i.e. put on waiting queue associated with uaddr. In essence, this is an atomic check that the condition still holds after disabling interrupts in the kernel. (That is, the user should check this first before calling the futex to avoid context switching overhead). Otherwise, return immediately.
FUTEX_WAKE $\implies$ wake up at most val waiting threads associated with uaddr's waiting queue.

Essentially, a conditional sleep for threads, and can be used to implement synchronization primitives like locks, semaphores, etc. We will consider implementing a lock.

The naive attempt might look something like this:

int lock = 0;

void acquire(int *lock) {
	while (test_and_set(lock)) {
		// assert that lock is BUSY and let current thread sleep
		futex(lock, FUTEX_WAIT, 1);
	}
}

void release(int *lock) {
	*lock = 0;
	futex(lock, FUTEX_WAKE, 1);
}

This avoids spinlocking by inducing each thread to sleep if the lock is currently BUSY (via an atomic check). However, every release requires a syscall to wake up the next thread, even if no threads are waiting. Thus, we can extend this implementation slightly to avoid this overhead.

bool maybe = false; // if there *may* be threads waiting
int lock = 0;

void acquire(int *lock, bool *maybe) {
	while (test_and_set(lock)) {
		// sleep, since lock is busy; and note that there are threads waiting
		*maybe = true;
		futex(lock, FUTEX_WAIT, 1);
		
		// if woken up by release(), ensure next call to release knows there are threads possibly waiting
		*maybe = true;
	}
}

void release(int *lock, bool* maybe) {
	*lock = 0;
	if (*maybe) { // if there may be threads waiting
		// no threads are considered to be waiting since the next thread dequeued will be run
		*maybe = false;
		
		// try to wake up a thread
		futex(lock, FUTEX_WAKE, 1);
	}
}

This implementation is much better because it avoids any syscalls if the lock is uncontended.

Aside: Circular Buffer

We can implement a simple, spin-locked producer-consumer queue with these synchronization variables, notably.

mutex lock = FREE;

Producer(item) {
	acquire(&lock);
	while (buffer_is_full()) {
		release(&lock);
		acquire(&lock);
	}
	enqueue(item);
	release(&lock);
}

Consumer() {
	acquire(&lock);
	while (buffer_is_empty()) {
		release(&lock);
		acquire(&lock);
	}
	item = dequeue();
	release(&lock);
	return item;
}

A better method, though, is to use semaphores!

Semaphore mutex = 1; // equivalent to regular lock for synchronizing access to buffer
Semaphore fullSlots = 0;
Semaphore emptySlots = bufSize;

Producer(item) {
	semaP(&emptySlots);
	semaP(&mutex);
	enqueue(item);
	semaV(&mutex);
	semaV(&fullSlots);
}

Consumer() {
	semaP(&fullSlots);
	semaP(&mutex);
	item = dequeue();
	semaV(&mutex);
	semaV(&emptySlots);
	return item;
}

Aside: Monitor

Monitors are a better abstraction to semaphores.

A monitor is 1 lock and 0 or more condition variables for managing concurrent access to shared data.

Essentially, it's a generalization of locks/condition variables. Here's an example of an infinite synchronized buffer implemented with a monitor.

lock buf_lock;
condition buf_cv;
queue queue;

Producer(item) {
	acquire(&buf_lock);
	enqueue(&item);
	cond_signal(&buf_cv); // signal waiters that item was enqueued
	release(&buf_lock);
}

Consumer() {
	acquire(&buf_lock);
	while (isEmpty(&queue)) {
		cond_wait(&buf_cv, &buf_lock); // if no items, sleep
	}
	item = dequeue(&queue);
	release(&buf_lock);
	return item;
}

There are different types of monitors. This is just Mesa vs Hoare semantics, again.

Hoare monitors.
Signaling thread gives up lock and CPU to waiting thread, and waiting thread runs immediately. When the waiting thread finishes (releases lock), CPU and lock are returned to signaling thread. This allows for this convention:

if (someCondition(...)) {
	cond_wait(&cv, &lock);
}

This makes it easier to prove the program's correctness (liveness) and guarantees the condition holds on resumption. However, it is more complex to implement and, critically, frequently context switches, reducing performance.

Mesa monitors.
Signaling thread keeps lock, waiting thread is placed on CPU's READY list. Waiting thread runs at some arbitrarily later point in time; this means the original condition, which was true at signal time, may not be true anymore! Thus, you must enclose wait calls in a while loop:

while (someCondition(...)) {
	cond_wait(&cv, &lock);
}

This is more performant due to less context switches and separates the signal from the condition check, which allows the decoupling of processes.

Aside: RW Locks

The problem is phrased as follows. We have a database that we'd like to allow concurrent access to. There are some readers and writers, but they cannot arbitrarily read/write to the database; otherwise, uncontrolled concurrency can introduce errors. In particular,

Reader
- Can read while other readers exist
- Cannot read while there is a write in progress
- Should wake up a waiting writer if there are no readers when it concludes
Writer
- Cannot write while there are reads/write in progress
- Should wake up waiting writer upon conclusion, or waiting readers if no waiting writers

Here is an example solution.

int AR = 0; // active readers
int WR = 0; // waiting readers
int AW = 0; // active writers
int WW = 0; // waiting writers
condition can_read = NIL;
condition can_write = NIL;
lock lock;

char *Reader::read() {
	char *buf;

	acquire(&lock);
	
	// while >0 writers waiting/active, wait to read
	while ((AW + WW) > 0) {
		WR++;
		cond_wait(&can_read, &lock); // sleep until can maybe read
		WR--;
	}
	
	// now active!
	AR++;
	release(&lock);
	
	// read-only access
	read(buf);
	
	// checkout of system
	acquire(&lock);
	AR--;
	
	// if all readers are done and >0 waiting writers
	if (AR == 0 && WW > 0) {
		cond_signal(&can_write); // wake up a writer
	}
	release(&lock);
	
	return buf;
}

char *Writer::write(char* buf) {
	acquire(&lock);
	
	// while >0 readers/writers active, wait to write
	while ((AR + AW) > 0) {
		WW++;
		cond_wait(&can_write, &lock);
		WW--;
	}
	
	// now active!
	AW++;
	release(&lock);
	
	// exclusive access
	write(buf);
	
	// checkout of system
	acquire(&lock);
	AW--;
	
	// if >0 waiting writers
	if (WW > 0) {
		cond_signal(&can_write); // wake up a writer
	} else if (WR > 0) { // else if >0 waiting readers
		cond_broadcast(&can_read);
	}
	release(&lock);
}

Aside: Language-Specific Synchronization

C

Well, they provide functions and structs for synchronization! Beyond that, though...

C++

Lock guards! In the destructor of a lock guard, it releases the associated lock. This allows for, e.g.

#include <mutex>

std::mutex lock;

void some_func() {
	std::lock_guard<std::mutex> lg(lock);
	
	...
	
	return;
	
	// mutex released here, after the function returns
}

Python

The with keyword. e.g.,

import threading

lock = threading.Lock();
with lock: # automatically calls acquire()
	...
# automatically releases once `with` statement is done

Java

The synchronized keyword. Every Java object has an associated lock. Lock is acquired on entry into and released on exit from each synchronized method. If an exception occurs in the method, it is automatically released.

Java also has support for monitors—every synchronized object has a condition variable associated with it. For waiting, there is void wait() and void wait(long timeout). For signaling, there is void notify() and void notifyAll().