Logo

Chapter 7: Scheduling

7.1 Uniprocessor Scheduling

Notation

  • Workload: a set of tasks for some system to perform, where each task has an arrival date and completion duration
  • Compute-bound: describes a task whose completion duration is limited by the CPU processing power
  • I/O-bound: describes a task whose completion duration is limited by waiting for I/O
  • Work-conserving: describes a scheduling policy in which the processor is never left idle if there is work pending

Metrics

  • Average response time (minimize)
  • Variance of average response time (minimize)
  • Fairness

FIFO

First-In-First-Out scheduling. Exactly how it sounds, also known as First-Come-First-Served (FCFS). Average response time can be extremely high due to head-of-line blocking (a.k.a. convoy effect) from a very long task. Note that the convoy effect also implies potential starvation of later tasks!

SJF

Shortest Job First scheduling. The preemptive version is known as Shortest-Remaining-Time-First (SRTF). Always runs the job that will end the soonest. For SRTF, a short task arrives while a long task is running, and the short task will finish before the long task finishes, the short task will preempt this task. It turns out that SRTF is precisely the optimal scheduling policy for minimizing average response time. However, it is also the pessimal policy for minimizing variance. Also, note that SJF is not feasible in the real-world, since it requires knowledge of when each job will end, which is not known a priori. SJF can also suffer from starvation and frequent context switches. It's also an unfair scheduler, and programs can game the scheduler by submitting itself in small segments. (Note: SJF can still suffer from head-of-line blocking without preemption).

RR

Round Robin rotates through all the tasks in order. Each task is allowed to run up to some threshold of time, known as a time quantum. When the time quantum elapses or the task finishes/yields the processor, the next task is scheduled on the processor. RR is most effective at preventing starvation. However, the time quantum must be chosen carefully—in particular, the time quantum should be longer than the context switching overhead, but reasonably short so that tasks will not wait too long until their turn. Also, short time slices can negatively affect the cache, as tasks evict each previous tasks' cached data.

Simultaneous Multi-Threading (Hyperthreading)
SMT is a mechanism of simulating two virtual/logical processors on a single physical core, alternating between them on a cycle-by-cycle basis. This helps maximize the use of each physical core, especially since threads frequently need to wait for memory at times. So, at the hardware level, we can technically have zero-overhead context switches!

RR does have weaknesses beyond just the static time quantum tunable. For a workload with several tasks that start at the same time and are the same length, RR will finish all tasks at roughly the same time. This is pessimal in average response time! Additionally, the context switching adds overhead without any benefit here.

RR is also poor when running a mixture of I/O-bound and compute-bound tasks. In these instances, the I/O-bound tasks will yield the processor frequently, whenever I/O blocks. This results in them spending much less time on the processor than their given time quantum, each time they are scheduled. Moreover, they are scheduled much later in the future, as the processor must finish letting the rest of the tasks run before getting back to the I/O-bound task. This results in the I/O-bound task being very slow to complete.

Max-Min Fairness

Fair allocation of resources is also important for scheduler design, particularly for multi-user machines or datacenters. Essentially, max-min fairness is a scheduling algorithm to fairly allocate a single resource. Max-min fairness iteratively increases the minimum allocation of any process until the resource is fully assigned (attempts to maximize minimum allocation). For instance, if all processes are compute-bound, max-min fairness will give each process an equal share of the processor, i.e. RR. If some processes cannot use their entire share, then the unused portion is redistributed equally to other processes.

Also, note that the scheduler cannot switch to the process with the least time so far at every tick—this would be incredibly inefficient, due to context switches. Instead, we take inspiration from RR, and allow each process to get ahead of its fair allocation by one time quantum.

Note that, for multi-resource fairness, schedulers like Dominant Resource Fairness can perform better in terms of fairness.

Weighted Max-Min Fairness

Weighted Max-Min Fairness is a variant of Max-Min Fairness in which each user is assigned a weight, with higher priority users receiving higher weights. Then, user kk gets an allocation of wkiwi\frac{w_{k}}{\sum_{i}w_{i}} of the resource (scheduler).

MFLQ

Multi-Level Feedback Queue is designed with several goals in mind.

MLFQ has multiple queues, typically each running RR (though there can be heterogeneous scheduling policies), of tasks, each with a different priority level and time quantum. Higher priority tasks run before lower priority tasks, and have shorter time quanta. If a task uses up its time quantum, it drops a level. If a task yields the processor to wait on I/O, it stays at the same level or is upgraded a level.

This does not quite achieve starvation-freedom or max-min fairness. To address this, the operating system can maintain two queues—one of tasks who have received their fair share, and another of those who have not—and/or periodically promote processes that have not received their fair share yet to higher priorities.

7.2 Multiprocessor Scheduling

7.2.1 Scheduling Sequential Applications on Multiprocessors

A potential solution is a centralized MLFQ, with a lock to prevent contention issues. However, this has several performance problems:

Therefore, modern operating systems use a per-processor MLFQ. These processors use affinity scheduling—once a thread is scheduled on a processor, it is always returned to the same processor when rescheduled, to increase cache reuse. Rebalancing the processors' MLFQs occurs periodically, only if the imbalance is large enough to compensate for the time to reload processor caches for each thread.

7.2.2 Scheduling Parallel Applications

Oblivious Scheduling

Oblivious scheduling describes an operating system that schedules threads while oblivious to the parallelization of the actual applications. Unfortunately, oblivious scheduling can be problematic for several reasons.

Gang Scheduling

One potential solution is gang scheduling, in which the threads of a single task/program are either scheduled together or not at all. This is especially helpful for programs with barriers. Applications can also pin threads to a specific processor and mark it as high priority. This allows reserving a majority of processors for a parallelized application.

However, gang scheduling can be inefficient when there exist multiple parallel programs. Instead, it's more efficient to divide the processors into two groups, and pin each program to one group instead of time slicing. This minimizes processor context switches, and is known as space sharing.

Scheduler Activations

Applications are essentially given execution context regarding each processor assigned to the application via an upcall. This allows the application to handle its own thread management, i.e. user-level thread management, and decide how it multiplexes its threads onto its assigned processors.

7.3 Energy-Aware Scheduling

There are several different optimizations to consider

Heat dissipation
Heat dissipation is a closely related topic. In particular, the operating system must ensure it never surpasses a certain temperature threshold.

Something important to note is a user's perception of an operating system's responsiveness. There naturally exists a limit to human perception. But, beyond this limit, the perceived delay for user increases at a logarithmic rate. We can take advantage of this to form a three prong strategy:

7.4 Real-Time Scheduling

Occasionally, the operating system scheduler must account for process deadlines. In these situations, a task is only useful if it is completed before a certain deadline. There are known as real-time constraints.

A scheduler should first consider any real-time task as a higher priority than any less time critical tasks. Beyond that, there are three techniques to increase the likelihood of tasks meeting deadlines:

Note that there also exists a distinction between hard and soft real-time constraints. A hard real-time constraint might exist in a car's braking system, where the deadline must be hit (and otherwise completing the task has no value), while a soft real-time constraint might exist for a streaming service, where it's best to hit some deadline, but not absolutely necessary. Note also the distinction between a static and dynamic scheduling policy—a dynamic scheduling policy can change task priorities (priorities set during runtime), while a static scheduling policy cannot change task priorities once set (priorities set before runtime).

For hard real-time constraints, the following scheduling algorithms have seen use:

For soft real-time constraints, the following scheduling algorithms have seen use:

7.5 Queueing Theory

7.5.1 Definitions

7.5.2 Little's Law

Little's Law applies to any stable system where the arrival rate matches the departure rate (λμ\lambda\leq \mu). In essence, it states that N=XRN=XR.

7.5.3 Response Time vs Utilization

Best Case: Evenly spaced arrivals

Worst case: Bursty arrivals.
In general, increases response time in every case due to forced queueing, even if λμ\lambda\ll \mu.

Typically, since most systems lie between these two cases, it's useful to model queueing systems with an exponential distribution. Essentially, a queueing system behaves as a finite state machine; at each change, the queue length increases or decreases, each with some associated probability (in particular, λ\lambda and μ\mu, respectively). We can thus model the average response time as R=S1UR=\frac{S}{1-U}.

7.6 Overload Management

The key idea is to design your system to do less work when overloaded. This is because, when overloaded, your system cannot feasibly serve all requests.

One step is to simply reject some requests. This preserves reasonable response time for the remaining requests, and is often applied for, e.g., streaming applications. Another step is to reduce service time per request. This might involve switching to a slightly lower quality service, e.g. reducing bit rate in streaming, to be able to serve the overload. Yet another step is to return requests early, and update later if the request processing actually returned a different result, if the distribution significantly favors a specific result.

Many systems do the exact opposite, though. Highway traffic throughput is limited significantly when there are many cars, due to stop-and-go traffic. Time-slicing also has this issue, due to fewer cache hits from rotating between many different tasks. Naive network congestion also runs into this issue, as the sender will retransmit packets dropped due to overload, further adding to the load. Though, TCP congestion control is designed precisely to deal with this.

Aside: Lottery and Stride

Lottery: Randomized

Each task/thread is given some number of lottery tickets, with higher priority threads being given more tickets. Each thread receives at least one ticket, to prevent starvation. Then, at each time slice, a lottery ticket is drawn from the remaining pool, and the corresponding thread is scheduled on the processor.

Stride: Deterministic

Stride is the deterministic version of a lottery scheduler. Like the lottery scheduler, each task is assigned some number of tickets. Then, each task is given a stride, which is inversely proportional to the number of tickets. This is typically calculated as si=Wnis_{i}=\frac{W}{n_{i}}, where WW is a really large number and nin_{i} is task ii's number of tickets. On every time slice, the task with the lowest pass is chosen. Pass starts at 0 for all tasks, and is incremented by the task's stride every time that task is chosen. Essentially, for threading, it tracks how long each thread has been scheduled so far. Notably, a smaller stride (more lottery tickets) will mean that the thread will run more frequently.

Aside: EEVDF

Fluid Flow Systems

Fluid flow systems can be likened to the wires in a network context. Each wire has a bandwidth; this is known as cc, the capacity, in a fluid flow system. Similarly, each connection traversing this wire has a packet rate; this is known as rir_{i}, the flow arrival rate. For each connection, it's also assigned a link fair rate, ff, which can be computed as cn\frac{c}{n}. Then, for fairness, each separate flow is assigned an allocation of the link, which is equivalent to min(ri,f)\mathrm{min}(r_{i},f). We can also have weighted fairness, though, in which f=ciwif=\frac{c}{\sum_{i}w_{i}} and each packet is assigned an allocation of min(ri,fwi)\mathrm{min}(r_{i},f\cdot w_{i}).

GPS

Generalized Processor Sharing is the theoretical analogue of fluid flow systems in processor scheduling. Essentially, a processor acts as a link and the threads sharing the processors act as flows. The ideal situation involves having every thread running at the same time, given exactly their flow allocation of the processor. Of course, this is not possible, as (ignoring hyperthreading) only one thread can run on the processor at a time. Instead, for GPS, we require that each thread should receive their proportional share of time on the processor. In essence, the service time of a task ii between times t0t_{0} and t1t_{1} is S(t0,t1)=fi(t0)(t1t0)S(t_{0},t_{1})=f_{i}(t_{0})\cdot(t_{1}-t_{0}), where fi(t)=wijwjf_{i}(t)=\frac{w_{i}}{\sum_{j}w_{j}}.

WFQ

We construct an algorithm known as Weighted Fair Queuing on top of the GPS model. In essence, we utilize weighted round robin (WRR), in which the quantum of each task corresponds to its weight! This successfully approximates GPS—every period of iwi\sum_{i}w_{i} units of time, the processor will have done the same work as GPS. What WFQ is doing, essentially, is just choosing the task that finishes first in GPS within each time period, provided no other tasks are added to the system.

This following visualization of GPS may help show this dynamic of a task finishing first. Note that, here, the red task has a weight of 5, while the others have weights of 1.

gps.png

Virtual Times

If other tasks are added into the system, the finishing times of each task may change! But, critically, the order in which tasks finish never changes (excluding the newly added task, of course, since it can be arbitrarily short/long). Thus, rather than scheduling tasks according to the physical finishing time, we track the number of WRR rounds needed to finish the task. This is known as the virtual finishing time. In essence, virtual time measures the service needed to complete a task. Note also that the system virtual time is thus the index of the round in the WRR scheduler.

More formally, let V(t)V(t) denote the system virtual time. Then, the slope of V(t)V(t), when graphed against tt, represents the "normalized rate at which every task receives service in the fluid flow system." Naturally, the rate of V(t)V(t) scales inversely proportional to the sum of the weights of all these tasks. In particular, dVdt=1iwi\frac{\mathrm{d} V }{\mathrm{d} t }=\frac{1}{\sum_{i}w_{i}}. Note that iwi\sum_{i}w_{i} is equivalent to 1 virtual time unit. Then, in the ideal situation, for every time unit, each task should run exactly once (as it would in WRR).

In other words, as more tasks enter the system, the virtual time slows down!! Therefore, while every task's virtual finishing time stays the same, the virtual time progresses more slowly for each time quantum. In other words, dVdt\frac{\mathrm{d} V }{\mathrm{d} t } decreases.

We can also calculate the virtual time from 00 to t1t_{1} as tasks are added/removed from the system. Just solving the differential equation above for VV provides

V(t1)=0t11iwidtV(t_{1})=\int_{0}^{t_{1}} \frac{1}{\sum_{i}w_{i}} \, \mathrm{d}t

Since tasks enter/exit the system at discrete times, we can split this integral up into multiple integrals, with a new one each time the set of tasks changes (i.e. one per interval where the set of tasks is constant).

EEVDF

Earliest Eligible Virtual Deadline First is an algorithm that builds on top of these stated ideas. Like EDF, it schedules the task with the earliest deadline. However, unlike EDF, it uses virtual deadlines instead of physical deadlines, and only considers tasks that are currently eligible, where eligibility is calculated from a task's previously served requests.

Virtual time is essentially a representation of time weighted by the threads' weights. An increment of tt in real time corresponds to an increase of twi\frac{t}{\sum w_{i}} in virtual time. In essence, you can think of it as a weighted average of all threads' progress. Then, if request rr from thread ii takes tt real time to complete, it should expect to complete it by within a duration of twi\frac{t}{w_{i}}, if provided its fair share.

The below was written using the Linux explanation of EEVDF, and represents the real-world implementation. Might help to skip this and come back to it after reading through the rest. (i.e. this might confuse you...)

To understand eligibility, consider how EEVDF would handle two threads, T1T_{1} and T2T_{2}. WLOG, in the first time period, let T1T_{1} run for more time than it should have, and T2T_{2} run for less time than it should have. Then, EEVDF calculates the lag of each thread, i.e. the difference between the expected service time and the actual service time. A positive lag denotes a thread that did not receive its fair share, and a negative lag denotes a thread that received more than its fair share. Then, for the next time period, only threads that currently have negative lags are considered eligible. In other words, T2T_{2} is the only eligible thread, and will be scheduled since there are no other threads.

We can also calculate the virtual deadline of each thread, since we seek to select the thread with the earliest virtual deadline among those eligible. The virtual deadline denotes the earliest virtual time at which this thread will be fully serviced. First, we note a new term—eligible time—which denotes the earliest time at which a thread becomes eligible. This is just equivalent to min(lag,0)\mathrm{min}(-\text{lag},0) (relative to the current time, i.e. virtual). Then, the virtual deadline of a thread is just the sum of its virtual eligible time and virtual service time. Thus, EEVDF effectively balances fairness (lag/eligible time) and the priority of each thread (virtual time, which is weighted).

More formally, we can write several recurrence relations that define the parameters of EEVDF.

Then, we can write the following recurrence relations/equations.

ve(1)=V(t0i)(1) Base Casevd(k)=ve(k)+r(k)wi(2) Virtual Deadlineve(k+1)=vd(k)(3) Virtual Eligible Time\begin{align*} ve(1) &= V(t_{0}^{i}) && (1) \text{ Base Case} \\ vd(k) &= ve(k) + \frac{r(k)}{w_{i}} && (2) \text{ Virtual Deadline} \\ ve(k+1) &= vd(k) && (3) \text{ Virtual Eligible Time} \\ \end{align*}

Let's proceed through this step-by-step.

(1) First, the virtual eligible time of the first request is defined as the V(t0i)V(t_{0}^{i}), which is essentially the virtual time at which this thread makes its first request.

(2) The virtual deadline of the kkth request is defined as the sum of the virtual eligible deadline and the virtual service time. Threads with a heavier weight will be assigned an earlier virtual deadline, since r(k)wi\frac{r(k)}{w_{i}} will be smaller. In other words, threads with a higher priority will expect to finish sooner. Note that dividing r(k)r(k) by wiw_{i} essentially converts it to virtual service time!!

The division by wiw_{i} to convert to virtual service time is critical to understand for EEVDF, and how these formulas really fairly weight threads. Make sure this makes sense before moving on!

(3) The virtual eligible time of the k+1k+1th request is equivalent to the virtual deadline of the kkth request. The updates for the k+1k+1th request only occur when the kkth request finishes! This may take longer than one quantum.

The above is also very important to understand for the fairness of the theoretical/ideal EEVDF. Consider a request that finishes—its next virtual eligible time is updated to its previous virtual deadline, which represents the virtual time it should finish at in the idealized model of a fluid flow system. However, even if it finishes earlier than this, it should not be scheduled until after its previous, fair virtual deadline has been reached! This may be difficult to understand, so let's consider an example situation. There are two threads, each with equal weights of 11 and requests of length 11. They enter the system at the same time. Then, both start with an eligible time of 00 and a deadline of 11. As aforementioned, it's not possible to have generalized processor sharing on a processor; thus, one of these threads must finish first. Let's say we schedule thread 11 first; it will finish its first request at virtual time 0.50.5. Then, its new virtual eligible time will be updated to 11. Why? Because it has already used up its fair time share in the time slice/quantum 010-1. Therefore, it should not be scheduled again until virtual time has reached at least 11, when it would be fair to schedule it again.

In a real-world CPU scheduling algorithm, step 3 changes slightly because lag can occur, since we do not know a priori how long a task will run for. Every time a quantum qq elapses, we compute ve(k+1)=ve(k)+u(k)wive(k+1)=ve(k)+\frac{u(k)}{w_{i}}, where u(k)u(k) is the actual real service time that the request received. Note how, if u(k)>r(k)u(k)>r(k), the eligible time of this request gets pushed back more than it would have originally been. This is because the request received more time than it should have, i.e. negative lag, and thus its unfairly receiving more time. Similarly, if u(k)<r(k)u(k)<r(k), the eligible time of this request gets pushed back less than intended. This is compensating for positive lag, and ensuring this request gets scheduled sooner since it didn't receive as much time as expected.

Fun Fact!
EEVDF was introduced to replace CFS as the default scheduler in Linux kernel version 6.6.

Aside: Linux O(1)O(1) Scheduler

This scheduler was used in the Linux kernel until version 2.6.

Note that the Linux O(1)O(1) scheduler is a priority-based scheduler, with nice values of [0,100)[0,100) reserved for the kernel and [100,140)[100,140) reserved for the user. A lower nice value means a higher priority, and vice versa. The priority of each task determines its timeslice, like an MLFQ. Notably, there are also two separate task queues: an active and expired queue. Tasks that use up their timeslice are placed in the expired queue; once the active queue is empty, the queues swap.

The scheduler also leverages several O(1)O(1) heuristics, including:

Aside: CFS

The Completely Fair Scheduler is a scheduling algorithm designed to split CPU time between runnable tasks as fairly as possible. It was introduced in Linux kernel version 2.6, and phased out in version 6.6. Essentially, it keeps track of the virtual time of each task, ordered in a red-black tree. Then, it simply chooses to run the task with the smallest virtual runtime, i.e. the task that has received the least execution time so far. Note how CFS essentially replaces the previous O(1)O(1) heuristics (except for RT tasks), as tasks that sleep more will naturally be scheduled sooner.

However, we don't just want fairness in the Linux scheduler. We want to prioritize low latency and starvation freedom too.

Nice values still exist in Linux, though—how does CFS account for this?

The answer is changing the rate of CPU cycles given to threads. Consider two threads AA and BB, where AA has weight 4×4\times that of BB. Then, while the virtual CPU time of both tasks in the red-black tree will be the same, the physical CPU time of AA will be 4×4\times that of BB's. (i.e., the virtual CPU time of AA advances 4×4\times slower than BB's). In fact, this is precisely the notion of weighted virtual time seen in EEVDF!