Chapter 2: The Kernel Abstraction
Protection, i.e. isolation of applications and users from each other and from the operating system, is a critical job for the operating system kernel. A process is the execution of a program with restricted rights, or the abstraction for protected execution. A process needs permission from the kernel to perform certain operations.
2.1 The Process Abstraction
The OS allocates memory to store code (.text), data (.bss, .data, etc.), the heap (grows up), and the stack (grows down). These regions appear in that order in the virtual address space, from low to high addresses. The OS starts the program by setting the stack pointer and jumping to the first instruction. To run multiple copies of a program, the OS duplicates the memory mappings. Frequently, the copies share the same physical memory to reduce overhead, and a page is only duplicated when one of the processes writes to it (copy-on-write). The OS keeps track of each process using a process control block (PCB).
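Copy-on-write can be illustrated with a small simulation. This is only a sketch: the `Page` and `AddressSpace` classes and the reference-counting scheme are assumptions made for illustration, not how any particular kernel implements it.

```python
class Page:
    """A physical page plus a count of address spaces that map it."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1

class AddressSpace:
    """Maps virtual page numbers to (possibly shared) physical pages."""
    def __init__(self, pages):
        self.pages = dict(pages)

    def fork(self):
        # Share every page with the copy instead of duplicating it.
        for page in self.pages.values():
            page.refs += 1
        return AddressSpace(self.pages)

    def read(self, vpn, offset):
        return self.pages[vpn].data[offset]

    def write(self, vpn, offset, value):
        page = self.pages[vpn]
        if page.refs > 1:
            # The page is shared: make a private copy before writing.
            page.refs -= 1
            page = Page(page.data)
            self.pages[vpn] = page
        page.data[offset] = value

parent = AddressSpace({0: Page(b"code"), 1: Page(b"data")})
child = parent.fork()
shared_before_write = child.pages[1] is parent.pages[1]
child.write(1, 0, ord("D"))
```

After the write, only page 1 has been duplicated. Page 0 is still shared between the two address spaces, and the parent still sees its original data.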
2.2 Dual-Mode Operation
The OS kernel executes in supervisor/kernel mode and all other processes execute in user mode. User mode has protection checks in place. On x86, the current privilege level is held in bits 0 and 1 of the CS register, which allows 4 privilege rings. The hardware must support at least three things to allow the OS to properly implement protection:
Privileged Instructions.
Programs executing in user mode are not allowed to execute privileged instructions that could be dangerous. Any attempt will result in a processor exception, in which the processor transfers control to a kernel exception handler.
Memory protection.
Userspace programs should not be allowed to access or modify sensitive kernel memory or the memory of other processes.
A very simplistic approach to memory protection is to add two registers to the processor: base and bound. Every memory access a program makes must fall between these two registers. This approach makes it hard to expand the heap or stack and to share memory between programs, it means addresses differ between different processes of the same program, and it suffers from memory fragmentation.
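The base and bound check can be sketched in a few lines. The function and exception names are hypothetical, and the sketch assumes the bound register holds the first address past the allowed region.

```python
class ProtectionFault(Exception):
    """Raised in place of the processor exception real hardware would trigger."""

def check_access(addr, base, bound):
    # Hardware performs this comparison on every memory access in user mode.
    if not (base <= addr < bound):
        raise ProtectionFault(f"address {addr:#x} outside [{base:#x}, {bound:#x})")
    return addr

in_range = check_access(0x1800, base=0x1000, bound=0x2000)
```

An access at or beyond the bound, or below the base, raises `ProtectionFault` rather than reaching memory.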
Modern approaches use paged virtual memory, and the kernel ensures that only the memory required for the process is mapped into its page table. The exception is some subset of kernel addresses that may stay mapped (see KPTI); page-table flags ensure that user mode cannot access them.
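A rough sketch of how page-table flags gate access is shown below. The `PTE` fields, the 4 KiB page size, and the single-level table are simplifying assumptions; real page tables are multi-level and carry more flags.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size

class PageFault(Exception):
    pass

@dataclass
class PTE:
    frame: int      # physical frame number
    user: bool      # may user mode touch this page?
    writable: bool  # may the page be written?

def translate(page_table, vaddr, user_mode, write=False):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pte = page_table.get(vpn)
    if pte is None:
        raise PageFault("page not mapped")
    if user_mode and not pte.user:
        raise PageFault("supervisor-only page")
    if write and not pte.writable:
        raise PageFault("read-only page")
    return pte.frame * PAGE_SIZE + offset

# Page 0 is user-readable; page 1 is mapped but reserved for the kernel.
table = {0: PTE(frame=7, user=True, writable=False),
         1: PTE(frame=3, user=False, writable=True)}
user_paddr = translate(table, 0x10, user_mode=True)
kernel_paddr = translate(table, PAGE_SIZE + 5, user_mode=False, write=True)
```

The same address that the kernel can reach in page 1 raises `PageFault` when the access is made in user mode.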
Timer Interrupts.
Hardware needs to allow timer interrupts to enable the OS to interrupt a program and regain processor control, if necessary. This is known as preemption.
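The effect of preemption can be sketched with a round-robin simulation in which the timer fires after a fixed quantum. The `run` function, the task names, and the "units of work" are all hypothetical; the scheduling policy is an assumption chosen for illustration.

```python
from collections import deque

def run(tasks, quantum):
    """Simulate timer preemption; returns the sequence of (task, units run)."""
    ready = deque(tasks.items())
    trace = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)
        trace.append((name, ran))
        if remaining > ran:
            # The timer interrupt fired before the task finished: the kernel
            # regains control and puts the task back on the ready queue.
            ready.append((name, remaining - ran))
    return trace

trace = run({"A": 5, "B": 2}, quantum=2)
```

Here task A needs more than one quantum, so it is interrupted and B gets to run before A finishes.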
2.3 Types of Mode Transfer
2.3.1 User to Kernel Mode
This can occur asynchronously—interrupts—or synchronously—processor exceptions and system calls. Synchronous transfers from user to kernel mode are called traps.
Interrupt
An interrupt occurs when some hardware, software, or another processor signals an event, notifying the kernel that one of the processors needs to handle it. The processor switches to kernel mode to run the interrupt handler before returning to user mode execution. An alternative to interrupts is polling, but the processor can't run user code while the kernel is polling.
Processor Exceptions
This is a hardware event caused by the behavior of the user program, e.g. a page fault or division by zero. Errors usually result in the kernel halting the process. On a multiprocessor, the kernel must also send interprocessor interrupts to stop the other processors running the same program in parallel.
System Calls
A system call is a way for a user process to voluntarily transfer control to the kernel to request a restricted operation that the process cannot perform itself.
2.3.2 Kernel to User Mode
This can happen when a new process starts, when the kernel resumes a process after one of the aforementioned user to kernel mode switches, when the kernel switches a processor to run a different process, or when a user-level upcall occurs (asynchronous event notification for user programs).
2.4 Safe Mode Transfer
For safe transitions between user and kernel mode, the OS must at least provide:
- Limited entry into the kernel
- Atomic changes to processor state
- Transparent, restartable execution
Interrupt Vector Table
This is a table of pointers to the various interrupt handlers, indexed by interrupt number. Its address is held in a special register.
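A vector table lookup can be sketched as a mapping from vector number to handler. The vector numbers and handler names below are illustrative assumptions, not a real machine's layout.

```python
def divide_error_handler():
    return "divide error: halt the offending process"

def timer_handler():
    return "timer tick: consider preempting the current process"

def syscall_handler():
    return "system call: dispatch to the requested service"

# Hypothetical vector assignments, chosen only for this example.
VECTOR_TABLE = {
    0: divide_error_handler,
    32: timer_handler,
    128: syscall_handler,
}

def dispatch(vector):
    """Look up the handler for a vector and run it, as hardware would jump to it."""
    handler = VECTOR_TABLE.get(vector)
    if handler is None:
        raise LookupError(f"no handler installed for vector {vector}")
    return handler()

message = dispatch(32)
```

Because user code can only name a vector, not an address, entry into the kernel is limited to the handlers the kernel installed.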
Interrupt Stack
When a switch from user to kernel mode occurs, the kernel will use the interrupt stack. Its address is held in a special hardware register, and the kernel uses it to save user registers and the return address to the user program.
The interrupt stack is necessary for reliability—the process's original stack pointer may not be valid, due to errors or malicious behavior in the process—and security—on a multiprocessor, other threads could modify the user memory during a system call and take over kernel control flow.
Most OS kernels will actually allocate a kernel interrupt stack for every userspace process. This makes it easier to switch to a new process inside an interrupt/syscall handler. This interrupt stack will usually contain the user CPU state, the syscall handler address, and then potentially other information (e.g. I/O driver processing triggered from the syscall).
The kernel interrupt stack can be in one of several states:
- If the process is currently running in user mode, the stack is empty
- If the kernel is running in context of the user process, the stack contains the user CPU state and current state of the kernel handler
- If the process is available to run but is waiting for an available processor, the stack contains the user CPU state
- If the process is waiting for I/O, the kernel stack contains the information for the suspended code to resume when I/O finishes
Note that nested interrupts (i.e. kernel execution is itself interrupted) simply push their state onto the same interrupt stack.
Interrupt Masks
Hardware allows the kernel to temporarily disable interrupts or, more accurately, to defer them to be handled at a later time. Once the corresponding instruction to enable interrupts executes, pending interrupts are processed. Interrupts are typically disabled whenever an interrupt handler is executing. This prevents confusion, e.g. an interrupt mid-handler may overwrite the previous interrupt handler's information on the interrupt stack. Notably, hardware has limited buffering for interrupts, so interrupts may be lost if they stay disabled for too long.
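Deferral and limited buffering can be sketched with a toy interrupt controller. The `InterruptController` class and its `buffer_size` parameter are hypothetical, and the sketch ignores the fact that a real handler would itself run with interrupts masked.

```python
class InterruptController:
    def __init__(self, handler, buffer_size):
        self.handler = handler
        self.buffer_size = buffer_size  # how many interrupts can wait
        self.enabled = True
        self.pending = []
        self.lost = 0

    def disable(self):
        self.enabled = False

    def enable(self):
        self.enabled = True
        # Deliver everything that was deferred while interrupts were masked.
        while self.pending:
            self.handler(self.pending.pop(0))

    def raise_irq(self, irq):
        if self.enabled:
            self.handler(irq)
        elif len(self.pending) < self.buffer_size:
            self.pending.append(irq)   # deferred, not dropped
        else:
            self.lost += 1             # buffering exhausted

handled = []
controller = InterruptController(handled.append, buffer_size=2)
controller.disable()
for irq in (1, 2, 3):
    controller.raise_irq(irq)
handled_while_masked = list(handled)
controller.enable()
```

With room for two pending interrupts, the third one raised while masked is lost, and the first two are delivered as soon as interrupts are re-enabled.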
Interrupt Handlers
Interrupt handlers can trigger a large amount of processing. As mentioned above, some interrupts may be lost if interrupts stay disabled for too long. Thus, interrupt handlers are frequently split into a top half and a bottom half. (Note: in Linux, this terminology is reversed.) The bottom half executes with interrupts masked and is typically designed to finish quickly. It saves the hardware device state, resets the device for a new event, and then re-enables interrupts. Control then returns to the interrupted task or, if the event is high priority, switches to the top half.
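The split can be sketched with a work queue between the two halves, using the terminology of these notes (bottom half first, top half later). The device dictionary, the queue, and the "processing" step are all hypothetical stand-ins.

```python
from collections import deque

def bottom_half(device, work_queue):
    """Runs with interrupts masked, so it does only the minimum."""
    work_queue.append(device["buffer"])  # save the device state
    device["buffer"] = b""               # reset the device for a new event
    device["ready"] = False
    # A real handler would re-enable interrupts at this point.

def top_half(work_queue, results):
    """Runs later with interrupts enabled and does the expensive processing."""
    while work_queue:
        results.append(work_queue.popleft().decode().upper())

device = {"buffer": b"packet", "ready": True}
work_queue, results = deque(), []
bottom_half(device, work_queue)
top_half(work_queue, results)
```

The time spent with interrupts masked is limited to the copy and reset in `bottom_half`, regardless of how long the processing in `top_half` takes.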
Hardware Support for Context Switches
It's always necessary to save the CPU state (registers) before an interrupt handler runs. Hardware (and assembly ISA) will typically provide some support to perform this operation very quickly.
2.5 x86 Mode Transfer
On an interrupt, processor exception, or system call trap, the x86 hardware performs the following steps:
- Mask interrupts
- Save the stack pointer (esp and ss registers), execution flags (eflags register), and instruction pointer (eip and cs registers) to internal, temporary registers in hardware
- Switch stack pointer to kernel interrupt stack
- Push the previously saved values onto the stack
- Push an error code (or a dummy value if not applicable)
- Jump to the interrupt handler
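The steps above can be sketched as a simulation over a dictionary of registers. The selector values, addresses, and the `mode_transfer` function are illustrative assumptions; the sketch also loads kernel values into cs and ss, which the list above leaves implicit, and models the interrupt mask as a separate flag rather than a bit of eflags.

```python
USER_CS, USER_SS = 0x23, 0x2B      # assumed selectors; low two bits are 3 (user)
KERNEL_CS, KERNEL_SS = 0x10, 0x18  # assumed selectors; low two bits are 0 (kernel)

def mode_transfer(cpu, memory, kernel_stack_top, handler_addr, error_code=0):
    cpu["interrupts_enabled"] = False                     # mask interrupts
    saved = [cpu["ss"], cpu["esp"], cpu["eflags"],        # save to temporaries
             cpu["cs"], cpu["eip"]]
    cpu["ss"], cpu["esp"] = KERNEL_SS, kernel_stack_top   # switch stacks
    for value in saved + [error_code]:                    # push saved state,
        cpu["esp"] -= 4                                   # then the error code
        memory[cpu["esp"]] = value
    cpu["cs"], cpu["eip"] = KERNEL_CS, handler_addr       # jump to the handler

cpu = {"ss": USER_SS, "esp": 0xBFFF0000, "eflags": 0x202,
       "cs": USER_CS, "eip": 0x08048000, "interrupts_enabled": True}
memory = {}
TOP = 0xC0100000
mode_transfer(cpu, memory, kernel_stack_top=TOP, handler_addr=0xC0001000)
```

After the transfer, the kernel stack holds ss, esp, eflags, cs, eip, and the error code from the top down, which is everything the handler needs to resume the interrupted program later.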
2.6 Secure System Calls
A system call proceeds as follows:
- User program calls the user stub syscall
- The user stub executes the trap instruction
- The hardware transfers control to the kernel and uses the syscall vector table to jump to the syscall handler. The handler is a stub on the kernel side that copies and checks the arguments and then calls the kernel's implementation of the syscall
- The syscall's completion returns control to the handler
- The handler returns to the user level stub
- The stub returns to the caller
On x86, the system call number is stored in register eax.
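Dispatching on the number in eax can be sketched as a table lookup. The handler bodies are placeholders; the numbers 4 and 20 are the 32-bit Linux numbers for write and getpid, used here only as an example, and returning a negative errno for an unknown number is likewise an assumption borrowed from Linux.

```python
import errno

def sys_write(fd, data):
    """Placeholder for the kernel's write implementation."""
    return len(data)

def sys_getpid():
    """Placeholder that returns a made-up process ID."""
    return 42

SYSCALL_TABLE = {4: sys_write, 20: sys_getpid}

def syscall_trap(regs, *args):
    """Kernel-side entry point: pick the implementation named by eax."""
    handler = SYSCALL_TABLE.get(regs["eax"])
    if handler is None:
        return -errno.ENOSYS   # unknown system call number
    return handler(*args)

written = syscall_trap({"eax": 4}, 1, b"hi")
```

The user-level stub's only job is to load the number (and arguments) and execute the trap instruction; everything after that is chosen by the kernel's table.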
The kernel stub is a bit complex; it must:
- Locate syscall arguments. These are on the user stack for x86, but are passed via registers in modern architectures (x86-64).
- Validate parameters.
- Copy arguments into kernel memory before checking them, to avoid time-of-check to time-of-use (TOCTOU) attacks.
- Copy back any results from kernel to user memory.
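The copy-then-check order can be sketched as follows. The `open_stub` function, the `/secret` policy, and the `racing_writer` hook (which stands in for another thread modifying user memory at the worst moment) are all hypothetical.

```python
EPERM = -1  # placeholder error value

def kernel_open(path):
    """Placeholder for the kernel's real implementation."""
    return f"opened {path.decode()}"

def open_stub(user_memory, addr, length, racing_writer=None):
    # 1. Copy the argument out of user memory into kernel memory.
    path = bytes(user_memory[addr:addr + length])
    # 2. Validate the kernel's private copy.
    if path.startswith(b"/secret"):
        return EPERM
    # Another thread may rewrite user memory after the check...
    if racing_writer is not None:
        racing_writer(user_memory)
    # 3. ...but the kernel only ever uses the copy it validated.
    return kernel_open(path)

def attacker(mem):
    mem[:] = b"/secret/pw"   # same length as the original path

user_memory = bytearray(b"/tmp/a.txt")
result = open_stub(user_memory, 0, 10, racing_writer=attacker)
```

Had the stub checked the path in place and then read user memory again, the attacker's rewrite would have slipped past the check.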
2.7 Starting a New Process
In order to initialize a new process, the kernel must:
- Allocate the process control block (this should actually happen at kernel initialization)
- Allocate process memory
- Load the program from disk into allocated memory
- Allocate a user-level stack
- Allocate a kernel-level stack
- Copy arguments into user memory (argv)
- Switch to user mode and transfer control to the process
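The steps above can be sketched as a function that fills in a process control block. The `PCB` fields and the `start_process` function are hypothetical, memory is modelled as a plain dictionary, and the PCB is allocated at process start, following the list.

```python
from dataclasses import dataclass, field

@dataclass
class PCB:
    pid: int
    memory: dict = field(default_factory=dict)
    kernel_stack: list = field(default_factory=list)
    registers: dict = field(default_factory=dict)
    mode: str = "kernel"

def start_process(pid, program_image, argv):
    pcb = PCB(pid)                                  # allocate the PCB
    pcb.memory["text"] = bytes(program_image)       # allocate memory, load program
    pcb.memory["user_stack"] = []                   # allocate a user-level stack
    pcb.kernel_stack = []                           # allocate a kernel-level stack
    pcb.memory["user_stack"].extend(argv)           # copy arguments into user memory
    pcb.registers = {"pc": 0, "sp": len(argv)}      # entry point and stack pointer
    pcb.mode = "user"                               # switch to user mode
    return pcb

pcb = start_process(1, b"\x90\x90", ["prog", "-v"])
```

The kernel stack starts out empty, matching the state of a process that is running in user mode.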
2.8 Implementing Upcalls
An upcall is the reverse of a syscall, allowing the kernel to "call up" into a user process. This allows the kernel to notify applications of events, which helps some applications behave more like operating systems. In UNIX, these are called signals. In Windows, they are called asynchronous events.
There are several uses of upcalls:
- Preemptive user-level threads via a periodic timer upcall to allow applications to switch between tasks (e.g. web browser terminates third party script)
- Asynchronous I/O notification so that the application doesn't have to poll the kernel while waiting for I/O.
- Interprocess communication via a syscall from the sender and an upcall to the receiver.
- User-level exception handling, i.e. when an application implements its own, separate exception handling.
- User-level resource allocation, i.e. some applications optimize resource usage according to their resource allocation. So, an OS may upcall to inform the process of a change in allocated resources.
Upcalls are not always needed, as applications can simply poll the kernel, but polling is often less desirable.
We describe some features of the UNIX signals that have proved useful. Note the similarity with interrupts.
- Different signal types
- Signal handlers
- Signal stack
- Signal masking
- Saving processor state
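Signal types, handlers, and masking can be tried out with Python's standard `signal` and `os` modules. This is a minimal sketch that assumes a UNIX-like system and a single-threaded program; the handler name `on_usr1` and the choice of SIGUSR1 are arbitrary.

```python
import os
import signal

received = []

def on_usr1(signum, frame):
    """Signal handler: the kernel's upcall into this process."""
    received.append(signum)

# Install a handler for one signal type.
signal.signal(signal.SIGUSR1, on_usr1)

# Mask the signal, then send it to ourselves: delivery is deferred.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
os.kill(os.getpid(), signal.SIGUSR1)
pending_while_masked = signal.SIGUSR1 in signal.sigpending()
delivered_while_masked = list(received)

# Unmask: the pending signal is delivered and the handler runs.
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
```

As with interrupt masking, the signal is not lost while blocked; it stays pending until the process unblocks it.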