Chapter 9: Caching and Virtual Memory

Almost all computer systems use caches in some way, shape, or form!

Chapter 8 included three widespread hardware caches:

TLBs
Virtually Addressed Caches
Physically Addressed Caches

But there exist many more!

DNS Resolution
Web content
Web search
Email clients
Incremental compilation (Makefiles)
Just in time translation (typically for memory-constrained devices)
Virtual memory (cache for disk)
File systems (also use memory as a cache for disk!)
Conditional branch prediction

All caches, though, face 3 design challenges:

Locating the cached copy. How does the system quickly determine if the cache contains a piece of data and retrieve it?
Replacement policy. The cache must be kept small, relative to the backing store. So, how should the system decide which data is valuable to cache?
Coherence. How do we keep cached data up to date? This is especially important for SMP and distributed systems.

9.1 Cache Concept

The simplest cache is a memory cache (e.g. VAC or PAC) that maps an address to the data stored at that address. A cache access either produces a cache hit (yay!) or a cache miss (sad :<).

For a cache to be useful, two properties must hold.

Cost of retrieving cached data must be significantly less than fetching data from the backing store.
The likelihood of a cache hit must be high enough to make the cache worth it.

For property 2, caches frequently have two sources of predictability: temporal locality (i.e., recently accessed data will likely be accessed again) and spatial locality (i.e. nearby data will likely be accessed). Memory cache design takes advantage of these two features: the choice of eviction algorithm improves temporal locality, while cache line size (e.g. 64 bytes) improves spatial locality. Memory caches even prefetch nearby data into the cache to improve spatial locality.

We can thus represent the expected latency of a read request to memory as Average Memory Access Time (AMAT):

\text{AMAT}=P(\text{hit})\cdot L(\text{hit})+P(\text{miss})\cdot L(\text{miss})

Where $P(x)$ represents the probability of event $x$ and $L(x)$ represents the associated average latency of event $x$ .

Write requests are a bit more complex. In particular, there are two possible cache modes for write requests: write-through and write-back. With write-through mode, write requests to the cache will propagate writes directly to the backing store. This can lead to significant extra overhead from write requests, especially if writes to the backing store are not buffered. Meanwhile, write-back caches only propagate updates to the backing store when the data that is "dirty" (written to) is evicted from the cache.

9.2 Memory Hierarchy

For hardware, there is a fundamental tradeoff between the speed, size, and cost of storage. Faster hardware is more expensive, and thus must be limited in size. Thus, there is a whole hierarchy of caches on a single chip. A computer in a data center might have

First-level cache (L1). Frequently virtually addressed.
Second-level cache (L2). Typically physically addressed.
Third-level cache (L3, typically LLC=Last Level Cache)
First- and second-level TLB
Main memory (DRAM).
Data center memory (DRAM). Facilitated by high-speed LAN connections. Going to nearby nodes' memory instead of disk is known as cooperative caching.
Local disk or non-volatile memory (SSD).
Data center disk.
Remote data center disk.

9.3 When Caches Work and When They Do Not

Caching is influenced by many factors, ranging from the cache size, cache algorithm, and program behavior. Typically, the interaction between all these factors determines cache effectiveness—for instance, some cache algorithms are more suited to certain program behaviors.

Working Set Model

A program's working set is the size of a critical mass of program data such that, when included in the cache, receives good hit rate. On a graph of cache size versus hit rate, the working set is the "knee" of the curve. Of course, a cache should, ideally, fit in the working set of a program; not doing so can significantly impact performance, as the cache will experience a lot of thrashing.

As programs change, working sets will also change. Generally speaking, as long as the cache size is greater than the working set size, cache hit rates will remain high, though they may shift between different working sets. Moreover, when a processor switches from one program to another, the cache will experience a burst of miss rates. This is just an inevitable result of the phase change behavior inherent to these systems.

Zipf Model

Zipf was a linguist who developed a model, now named after himself, to describe the frequency of individual words in text. However, it turns out it's also great for describing common real-world distributions, particularly website access patterns.

The Zipf distribution describes a set of words ranked in order of popularity, and states that the frequency of each word is proportional to $\frac{1}{k^{\alpha}}$ , where $1\leq\alpha\leq2$ . More succinctly, $f_{k}\propto k^{-\alpha}$ . An important characteristic of a Zipf distribution is that it is heavy-tailed. A significant proportion of the total words in a text are the most popular words; however, there is still a non-negligible proportion to less popular references.

In these situations, increasing the cache size improves cache hit rates, but at diminishing returns. This is contrary to the threshold behavior seen in the working set model, in which, once the cache size nears the working set size, the cache hit rates improves significantly.

9.4 Memory Cache Lookup

Caches higher up the systems stack, like page caches, have flexibility in their design—they can use a linked list, trees, hash table. However, hardware caches are constrained in their designs, since their lookup latency must be kept extremely low.

Fully Associative. Any address can go anywhere in the cache. However, upon a cache access, it's necessary to check every cache line for the address. This is great for cache hit rates, but expensive, especially as caches grow larger and larger, as it requires even more comparators (and lookups take more time!).
Direct Mapped. Each address has a specific cache line in which it can be stored. Think hash table! This is cheap and makes lookups extremely fast; however, it drastically reduces hit rates, especially if, e.g., RIP and RSP in x86-64 both map to the same cache line... oops ><
Set Associative. This is a hybrid between the above approaches, and generally the most popular. With a set associative cache, every address directly maps to a set of cache lines (of size $N$ for an $N$ -associative cache), of which it is fully associative within this set. This avoids the thrashing of direct mapped caching, while not being as prohibitively expensive as a fully associative cache.

However, direct mapped and set associative caches can pose a design challenges for the operating system. These caches are much better if a program's working set is spread across the different "buckets"/"sets" in the cache. This makes factors like, e.g., the default page size, able to greatly impact this distribution.

This is particularly an issue if the OS assigns pages that fall into the same bucket to a single process—thus, operating systems typically leverage page coloring to prevent this. In this scheme, physical page frames are partitioned into sets based on the cache bucket they fall into, and the operating system will attempt to distribute a program's page frames across a variety of colors.

9.5 Replacement Policies

There is no one right replacement policy! Some policies are optimal for some workloads, and pessimal for others. In general, choosing the right policy depends heavily on (1) the problem's design constraints and (2) the distribution of requests this cache serves.

Random

Self-explanatory...

But it is occasionally a practical design, especially for systems constrained extremely heavily by latency on the critical path.

FIFO

First-In-First-Out is a caching strategy that operates exactly like FCFS from scheduling. Just think of a queue data structure! Unfortunately, it's also pessimal for several workloads. For instance, consider a program that scans through an array of memory addresses repeatedly; however, the entire array does not fit into the cache.

Belady's MIN

Belady's MIN is known as the optimal cache replacement policy, with regards to hit rate. Essentially, at each eviction, it will replace the block that is used farthest in the future. (Conversely, the worst possible strategy is to always replace the block that is used the soonest).

Is Belady's MIN really the best cache replacement policy?

Well, in terms of cache hit rate, yes. But is that really the only measure of cache effectiveness...? Some researchers disagree!

Unfortunately, like SJF with scheduling, MIN requires knowledge of the future, so it's not practical. However, cache algorithms are frequently designed to approximate MIN.

In fact, with knowledge of the future, we could technically do even better than Belady's MIN, by prefetching everything we need—reducing cache misses to 0.

LRU

Least Recently Used is a caching strategy that takes advantage of temporal locality. In essence, it maintains a queue (really, a linked list) of cache lines ordered by their most recent access. In particular, when a cache line is accessed, it is moved to the back of the queue (noting that it should be popped off the queue last, essentially). This is known as the MRU, most recently used, position.

However, in hardware, a linked list is difficult to implement with very low latencies; thus, it's typically necessary to approximate LRU.

LRU is a great approximation of Belady's MIN in many cases. Of course, however, it is still pessimal for some workloads—most prominently, the "memory scan" workload that was also adversarial for FIFO.

LFU

Least Frequently Used is an analogue of LRU that uses frequency, instead of recency, as its metric for cache value. It's typically used for distributions that closely follow a Zipf distribution (e.g., a CDN web cache), keeping popular pages in the cache.

Additional Factors?

So far, we've assumed the cached objects are all the same size. But what about caching variable size objects? And what about parallel programs—e.g., for MapReduce, any file not being cached is as bad as if all the required files were not cached (since it is like a fork-join model). All of these factors can make cache eviction algorithms more complicated beyond what we cover here; however, we won't discuss this in much detail.

Belady's Anomaly

Intuitively, increasing cache size should improve the cache hit rate... right? Well, not quite. Belady's anomaly notes that, in some instances, adding space to a cache can negatively affect the cache hit rate.

However, this only impacts certain algorithms, and only in certain situations. Stack-like replacement policies like LRU and LFU, in which the blocks kept by a cache of size $k$ are a subset of the blocks kept by a cache of size $k+1$ at the same time $t$ , do not suffer from Belady's anomaly. This is because any cache hit that occurred in the cache of size $k$ will have certainly occurred in the cache of size $k+1$ . In contrast, the same property does not necessarily hold true for FIFO! Thus, there can be workloads such that a cache operating with FIFO increases in size and suffers a decrease in hit rate.

9.6 Memory-Mapped Files

Any modern operating system must implement some system for demand paging. Essentially, the operating system provides an illusion to applications of infinite memory; however, some of this memory actually be stored on disk, rather than in main memory! Thus, when a program tries to access unmapped memory, the operating system must go to disk and retrieve the missing memory page. In short, the operating system uses main memory as a cache for the disk!

Memory-mapped files involve mapping a file contents into a program's address space, and allowing it to directly edit a file's contents that way. This essentially transforms this memory into a write-back cache for the disk. The application also does not need to worry about how this is done; the operating system uses demand paging to retrieve files from disk as they are referenced. This offers several advantages:

Transparency. The program doesn't need to worry about if the file is in memory or not—again, the operating system handles this for it.
Zero copy I/O. The operating system doesn't need to copy file data between disk and kernel buffers and user memory; rather, it just adds an entry to the program's page table, and only needs to worry about copying data between main memory and disk.
Pipelining. The program can begin operating on data in the file once the page tables are setup; it doesn't need to wait for the entire file to load, as the operating system will handle any exceptions (unloaded segments) and provide an abstraction of the file already being there (demand paging)!
Interprocess communication (IPC). Multiple processes can share information through a memory-mapped file mapped in both processes' address spaces, without needing to send data to kernel buffers/disk.
Large files. For extremely large files, i.e. those that cannot fit in main memory (but can fit into the virtual address space), the application does not have to worry about managing which parts of the file are kept in memory, and which are on disk. The OS handles it all!

Implementation

Let's walk through the process of a program attempting to access an unmapped address.

TLB miss $\to$ page table walk
Page table exception $\to$ page fault hardware exception, traps into kernel
Conversion of virtual address to file offset (via segment table)
Disk block read—kernel allocates empty page frame an issues disk read request $\to$ thread yields processor (blocks on disk read)
Disk interrupt—disk interrupts kernel once request is serviced $\to$ kernel thread handling it is resumed
Page table update
Resume process
TLB miss $\to$ page table walk
Page table fetch $\to$ TLB update

There's one fairly important consideration that we glossed over in the above process, though—what if main memory is full? Then, when the kernel allocates an empty page frame, it must evict an existing page from main memory. More specifically, it must

Select a page to evict. How? We discuss in the next section :)
Invalidate all PTEs (page table entries) that point to evicted page. This is done via a core map, i.e. array of all physical page frames, including a list of PTEs pointing to this page. Note that the PTEs should not be removed, because, as far as the application should know, this is still in memory (abstraction of infinite memory!)
Write-back any changes to the evicted page to the disk. This is tracked via a dirty bit in the PTE metadata. This flag is also included in the PTEs in the TLB—whenever it's set (i.e. a write request generates a TLB hit), the change should be propagated to the page table by hardware. Also, note that the page must be copied back to disk if any of the PTEs pointing to this page have the dirty bit set.

Evicting a dirty page is more expensive than evicting a clean page; thus, the OS often proactively cleans pages in the background, to prevent dirty page eviction from affecting an instruction's critical path. Also, evicting a dirty bit requires a TLB shootdown, too—even more expensive!

Emulating a hardware dirty bit in software

It's also possible to emulate a hardware dirty bit in software! The OS just marks a clean page as read-only; the first write will trigger a memory exception, upon which the OS can mark the page as dirty and upgrade the protection to R/W. Cleaning the page also requires resetting the protection to read-only.

Approximating LRU

As aforementioned, the exact LRU algorithm is far too expensive to implement for hardware caches, or even page caches, for that matter. The reason is, even if the operating system manages the page caching algorithm, it would require the hardware to maintain too much metadata (i.e. linked list). Instead, hardware typically chooses to provide a minimal set of metadata so the operating system may approximate LRU or LFU.

One commonly chosen algorithm for approximating LRU (used by the Linux kernel!) is the clock algorithm, or the second-chance algorithm. It leverages the fact that most processor architectures keep a use bit in each PTE, which is set in hardware whenever the PTE is brought into the PTE. As with the dirty bit, a physical page is marked used if any of its PTEs have the bit set.

In essence, the clock algorithm attempts to approximate LRU by evicting any old page, rather than the oldest page. (Note: We will describe it according to CS 162's interpretation, which does not exactly follow the textbook). The algorithm maintains pointer into a queue of page frames (e.g., this may simply be the core map). Then, upon a...

...cache hit, the use bit of this page will be set (hardware already does this with TLB hit)
...cache miss, the pointer will be advanced until it finds a PTE that does not have a use bit set, and will evict this page. Along the way, it will clear the use bit of PTEs. Note that this can cycle through the entire queue, if necessary! Finally, the new page will have its use bit set to 1 (hardware already does this with TLB hit)

According to the textbook (not CS 162), the operating system can also this periodically, not just on every page fault. More frequent sweeps collects more fine-grained information about page usage; however, it increases software overhead.

A generalization of the clock algorithm is the $k$ th chance algorithm. In essence, the kernel will pick a page that has not been used for the last $k$ sweeps of the clock algorithm. The clock algorithm is extended to partition pages based on how recently they were used, with pages being evicted in FIFO order from the $k$ th chance partition. This may be implemented by incrementing some counter every time a page is encountered with is use bit not set (and, if its use bit is set, reset the counter and use bit).

Note: FIFO is the just the first chance algorithm!

Emulating a hardware use bit in software

Simply set a PTE to invalid even when the page is in memory; accessing the PTE will trigger a page fault, allowing the kernel to record usage statistics for the clock algorithm.

9.7 Virtual Memory

We can generalize the idea of memory-mapped files by backing every memory segment with a file on disk. This the design philosophy taken by UNIX—if you're on a Linux system, for example, you can check out /proc/<pid>/ for some files backing the program executable, memory, etc. Unlike memory-mapped files, though, this process memory is ephemeral; upon process exit, there is no need to write-back to disk.

The main advantage of virtual memory is flexibility. When out-of-memory (OOM), the system can "swap" some processes' pages to disk, and simply demand-page them into main memory when needed. However, this does introduce one critical complication: it's necessary to balance which processes' pages are paged to disk, and which are kept in main memory. After all, a modern disk can only handle so many page faults—there's several orders of magnitude difference between the throughput of handling page faults and the throughput of executing instructions.

Self-Paging

One consideration, for instance, is that the behavior of one process can hurt another process. Consider treating all pages equally, as in the clock algorithm. Then, if a process constantly accesses its pages over and over again, it'll make sure it keeps all of its pages in memory! This keeps its own latency down, but massively decreases the speed of other processes, to the point that it may seem they are making zero progress.

One popular solution to this is self-paging. In this scheme, each process/user is allocated its fair share of page frames, using max-min scheduling. Its page allocations can never exceed this fair share; extra pages will simply evict this process/user's own pages. Of course, this can detrimentally affect resource utilization; the question is whether this tradeoff is worth it! (Well, preventing malicious programs is generally worth it...)

Swapping

The simple scheme of swapping out lightly used pages to disk (i.e. outside of the working set of any process) works great for lightly loaded systems. However, under heavy loads, the union of the working sets of all active processes may no longer fit into main memory! In this instance, system performance will degrade dramatically, as page faults will increase significantly (due to the working set phenomenon described previously).

The solution is to prevent this overload condition from ever occurring by essentially starving a selected subset of tasks of their resources. Essentially, it's better to starve some tasks of their fair share if not doing so will essentially freeze the system.

This process is known as swapping. When too much paging activity occurs, the operating system prevents significant performance degradation by moving all of the page frames of some processes to disk. Eventually, when other tasks finish, the kernel can bring the swapped process's pages back into main memory. Unfair? Yes. Unreasonable? Quite the contrary.