9. Coping with NP-Completeness

In this section, we discuss how to "cope" with NP-Complete problems. In other words, how to tackle NP-Complete problems you encounter in real life.

9.1 Intelligent Exhaustive Search

Backtracking

Backtracking is a technique that involves recursively exploring a search space, and then backpedaling steps that were taken. More importantly, though, backtracking actually takes advantage of the observation that it is often possible to reject a solution by looking at just a small portion of it.

Consider, for instance, a SAT problem with the clause $(x_{1}\lor x_{2})$ . Then all assignments with $x_{1}=x_{2}=\text{false}$ can be eliminated without further exploration. Thus, here, we are able to quickly check and reject this partial assignment. So, how can we systematize this idea to effectively "backtrack"?

The critical idea is to incrementally grow a tree of partial solutions. Whenever we encounter a partial solution that doesn't work, we can immediately prune this subtree, and explore it no further. In other words, we backtrack out of this subtree where we know there are no valid solutions.

Formally, a backtracking algorithm requires a test function that receives a subproblem and quickly determines if

Failure: the subproblem has no solution
Success: a solution to the subproblem is found
Uncertainty: needs further exploration

For instance, for a SAT problem, if we recurse through the subproblem tree by, for each step/variable set, $(1)$ removing all clauses that are already satisfied by setting this variable and $(2)$ removing the variable from the remaining clauses that are not satisfied by setting this variable. Consider setting $x=1$ in the SAT problem

(x\lor y)(\overline{x}\lor \overline{y}\lor z)(y\lor z)

Then, it would produce the subproblem

(\overline{y}\lor z)(y\lor z)

Therefore, for the SAT problem, the "test" function would declare failure if the subproblem has an empty clause, success if the subproblem has no clauses remaining, and uncertainty otherwise.

Branch-and-Bound

We now generalize this principle of pruning the search space from search problems to optimization problems. WLOG, we will consider a generic minimization problem (a maximization problem will naturally follow a similar pattern).

The critical idea of branch-and-bound (for a minimization problem) is to quickly compute a lower bound on the cost function. This allows us to reject any subproblems whose lower bound is greater than the cost of an already-encountered solution (must be a full solution, not a subproblem!)—as this implies these subproblems must certainly not be optimal.

Consider an instance of TSP. We let a partial solution be a simple path $a\to b$ traveling through the set of vertices $S\subseteq V$ , where $a,b\in S$ . We denote the partial solution by $[a,S,b]$ . For TSP, $a$ will actually always be the start vertex, and the initial problem is $[a,\{ a \},a]$ . Then, the corresponding subproblem is the best completion of the tour, i.e. the minimum cost complementary path $b\to a$ whose intermediates nodes are precisely $V-S$ .

For each iteration of branch-and-bound, we extend a partial solution $[a,S,b]$ by a single edge $(b,x)$ where $x \in V-S$ . There are at most $\lvert V-S \rvert$ ways to do this, and each branch corresponds to a new subproblem $[a,S\cup \{ x \},x]$ .

Now, the important part—how can we quickly compute a lower-bound on the completion, i.e. path $b\to a$ , of the partial tour $[a,S,b]$ . There are some very complicated solutions, but we will consider a very simple heuristic. The completion of the tour must consist of a path through $V-S$ plus edges $b\to \beta\in V-S$ and $\alpha\to a,\ \alpha\in V-S$ . Thus, its cost is at least the sum of the

The lightest edge to $a$ from $V-S$
The lightest edge from $b$ to $V-S$
The minimum spanning tree of $V-S$

Make sure you understand why before moving on! But, in essence, since computing the minimum spanning tree of a graph has an efficient algorithm, this gives us our desired "quick" lower bound, which we can use to prune branches.

9.2 Approximation Algorithms

We now consider efficient algorithms that provide good approximations of the optimal solution to an NP-Complete problem. WLOG, we will consider minimization problems. First, a little bit of notation

Let $\mathrm{OPT}(I)$ denote the value of the optimum solution. We will assume that $\mathrm{OPT}(I)\in \mathbb{Z}^{+}$ to make our math simpler
Let $\mathcal{A}$ denote an arbitrary algorithm, typically an efficient one, and let $\mathcal{A}(I)$ be its produced solution to an instance $I$

Then, we say the approximation ratio of algorithm $\mathcal{A}$ is defined as

\alpha_{\mathcal{A}}=\underset{ I }{ \mathrm{max} } \frac{\mathcal{A}(I)}{\mathrm{OPT}(I)}

In other words, $1\leq a_{\mathcal{A}}<\infty$ and it measures, multiplicatively, how much worse algorithm $\mathcal{A}$ 's solution is for the worst-case input. That is, the worst possible ratio between $\mathcal{A}$ 's solution and the actual optimal solution observed across all instances $I$ .

Vertex Cover

In Section 5.4, we already observed that the Set Cover problem can be approximated within a factor of $O(\log n)$ by a greedy algorithm. Since the Vertex Cover problem is just a special case of the Set Cover, we already have an algorithm with approximation ratio $\alpha=O(\log n)$ .

However, there exists a better approximation algorithm that is based on the idea of matching, i.e. a subset of edges that have no vertices in common. We call a matching maximal if no more edges can be added to it.

A maximal matching is not the same as a maximum matching. A maximum matching is a matching with the most number of edges amongst all matchings. A maximal matching is simply a matching that cannot be expanded further without removing edges from it.

Critically, any matching in a graph $G$ provides a lower bound on $\mathrm{OPT}$ for the vertex cover problem. This is because each edge of the matching must have one of its endpoints in the set of any vertex cover, by definition!

Additionally, consider a set $S$ that contains both endpoints of each edge in a maximal matching $M$ . Then $S$ must be a vertex cover; otherwise, it must not touch some edge $e\in E$ , which would imply that $M$ is not maximal since we could add $e$ to it—a contradiction! (Since, if $e$ is not covered by $S$ , $e$ has no vertices in common with any edge in $M$ ).

Note that this cover has precisely $2\lvert M \rvert$ vertices, i.e. two vertices for each edge. (And no less because no edges can share a common vertex). From before, we know that each edge of a matching must have, at minimum, one of its endpoints in the set of any vertex cover. Therefore, any vertex cover must have size at least $\lvert M \rvert$ . Thus, the cover we found is at most twice the size of the optimal vertex cover. Therefore, an algorithm that efficiently finds the maximal matching would have an approximation ratio of $\alpha\leq2$ .

It turns out that maximal matchings are exceedingly easy to generate—just repeatedly choose edges that are disjoint from those chosen already until no more can be added. Using a disjoint set union, this can be efficiently completed in $O(n\log n)$ !

Clustering

The clustering problem asks us to divide a set of data points into groups of "close" points.

First, let's formalize the notion of distance. Let $d$ denote the distance function, which may vary depending on the scenario. It must satisfy the usual metric properties

$d(x,y)\geq0,\ \forall x,y$
$d(x,y)=0\iff x=y$
$d(x,y)=d(y,x)$
$d(x,y)\leq d(x,z)+d(z,y)$ (Triangle Inequality)

Then, the $k$ -Cluster problem may be described as follows.

Input:
Points $X=\{ x_{1},\dots,x_{n} \}$
Distance metric $d(.,.)$
An integer $k$
Output:
A partition of the points into $k$ disjoint clusters $C_{1},\dots,C_{k}$ such that $C_{1}\cup\dots \cup C_{k}=X$ .
Goal:
Minimize the maximum cluster diameter, i.e. minimize $\underset{ j }{ \mathrm{max} }\ \underset{ x_{a},x_{b}\in C_{j} }{ \mathrm{max} }\ d(x_{a},x_{b})$ .

Perhaps the easiest way to visualize this is covering a set of $n$ points in 2D space with $k$ circles of equal size, and trying to determine the smallest possible diameter of the circles.

This problem is $\mathrm{NP}$ -Hard, as you might've guessed. However, it has an exceedingly simple approximation algorithm!

The key idea is to pick $k$ data points as cluster centers and then assign the remaining points to the center closest to it, where the cluster centers are chosen by iteratively choosing the next center to be as far as possible from the centers chosen so far (until $k$ centers are chosen).

The algorithm clearly always returns a valid partition, and, in fact, the resulting diameter is guaranteed to be at most twice $\mathrm{OPT}$ .

Let $x \in X$ be the point farthest from $\mu_{1},\dots,\mu_{k}$ (i.e. the next center we would have chosen, if this was a $k+1$ -Cluster problem). Let $r=\underset{ i }{ \mathrm{min} }\ d(x,\mu_{i})$ . Then every point in $X$ must be within distance $r$ of its cluster center. This is because the definition of $x$ implies that $\nexists x'\in X$ such that $r'=\underset{ i }{ \mathrm{min}\ }d(x',\mu_{i})>r$ , i.e. there existed a point that was more than $r$ units away from all cluster centers, and thus $x$ was not, in fact, the farthest from $\mu_{1},\dots,\mu_{k}$ , a clear contradiction. Thus, every point in $X$ must be within distance $r$ of its cluster center, and, by the triangle inequality, every cluster has diameter at most $2r$ . Thus, this algorithm produces a solution with diameter at most $2r$ .

But how does this solution's diameter relate to the optimal solution? Well, consider that we have identified a set of $k+1$ points $\{ \mu_{1},\mu_{2},\dots ,\mu_{k},x \}$ that have a pairwise distance of at least $r$ . Then, any partition into $k$ clusters must place two of these points within the same cluster—thus, the optimal partition into $k$ clusters must have a diameter of at least $r$ . In other words, the algorithm $\mathcal{A}$ presented above has an approximation ratio $\alpha_{\mathcal{A}}\leq2$ .

TSP

We can exploit the triangle inequality to produce an efficient approximation of variants of the Traveling Salesman Problem where the triangle inequality is satisfied! As with the previous algorithms, we consider an efficiently computable structure that is somewhat related to the best TSP tour—a minimum spanning tree.

Consider that removing any edge from a traveling salesman tour produces a path through all vertices, i.e. a spanning tree (albeit one without any branches). However, this does tell us that

\text{TSP cost}\geq \text{this path's cost}\geq \text{MST cost}

So, an MST already provides us with a nice lower bound on the TSP cost. However, remember that the goal of an approximation algorithm is to produce a solution that approximates the optimal TSP solution, not just bound the cost. How can we do this?

One thought is to use each edge of the MST twice, which can create a tour that visits all nodes at least once, but some nodes more than once. This tour has a length at most twice the MST cost, which therefore is at most twice the TSP cost, as evinced by the aforementioned inequality; however, this tour isn't quite a legal solution to the TSP, since it visits some nodes multiple times!

We can fix this by forcing the tour to simply skip any node it is about to revisit, and directly visit the next unvisited node in the list. By the triangle inequality, these bypasses can only make the overall tour shorter; thus, this algorithm $\mathcal{A}$ has an approximation ratio $\alpha_{\mathcal{A}}\leq2$ .

General TSP

But what TSP variants which the triangle inequality does not apply to? Well, as it turns out, these variants provably* cannot be approximated.

*"Provably" as in there does not exist an efficient approximation provided $\mathrm{P}\neq \mathrm{NP}$ .

Recall in [[8. NP-Complete Problems#8.3 Reductions, Reductions, Reductions#Rudrata Cycle $to$ TSP|Section 8.3]] that we discussed a reduction from the Rudrata Cycle/Path problem to the TSP problem. In short, we can reduce any Rudrata Path problem on a graph $G$ to an instance $I(G,\alpha)$ , where $\alpha \in \mathbb{Z}^{+}$ and can be arbitrary, of TSP such that $(1)$ if $G$ has a Rudrata path, then $\mathrm{OPT}(I(G,\alpha))=n$ and $(2)$ if $G$ has no Rudrata path, then $\mathrm{OPT}(I(G,\alpha))\geq n+\alpha$ . Interestingly enough, if any efficient, approximate solution for the general TSP existed, the Rudrata Path problem becomes efficiently solvable!

Let $\mathcal{A}$ be an efficient approximation algorithm for TSP with approximation ratio $\alpha_{\mathcal{A}}$ . Consider an instance $G$ of Rudrata Path. Create an instance $I(G,C)$ of TSP using the constant $C=n\alpha_{\mathcal{A}}$ . Then, $\mathcal{A}$ will output one of two things on $I$ . $(1)$ It will output a tour of length at most $\alpha_{\mathcal{A}}\mathrm{OPT}(I(G,C))=n\alpha _\mathcal{A}$ . Or $(2)$ it will output a tour of length at least $\mathrm{OPT}(I(G,C))>n\alpha_{\mathcal{A}}$ . If $(1)$ occurs, then we can conclude that $G$ has a Rudrata Path, and return the produced TSP as the Rudrata Path!

Therefore, if $\mathrm{P}\neq \mathrm{NP}$ , there cannot exist a polynomial-time approximation algorithm for TSP since we know the Rudrata Path problem is $\mathrm{NP}$ -Hard. Note that this conclusion is drawn from the gap property hinted towards in Section 8.3.

Knapsack

Finally, we come to an approximation algorithm for a maximization problem—knapsack! Recall that the dynamic programming solution we saw to the knapsack problem in Section 6.4 had a runtime of $O(nW)$ or $O(nV)$ , which slight modification, which both are, unfortunately, not polynomial in just $n$ , since $W$ and $V$ can be exponential in $n$ .

Consider the $O(nV)$ algorithm. What if, if $V$ is extremely large, we just scale down each value? For instance, if

v_{1}=117,586,003\qquad v_{2}=738,493,291\qquad v_{3}=238,827,453

we could lose some precision but make the algorithm several orders of magnitude more efficient by using

v_{1}=117\qquad v_{2}=738\qquad v_{3}=238

More precisely, let $v_{\mathrm{max}}=\underset{ i }{ \mathrm{max} }\ v_{i}$ . Then we may rescale each value $v_{i}$ to $\hat{v}_{i}=\left\lfloor v_{i}\cdot \frac{n}{\epsilon v_{\mathrm{max}}} \right\rfloor$ , and run the dynamic programming algorithm on this rescaled version of the problem. Note that $\epsilon>0$ is some approximation factor specified in the input to the algorithm.

Since the rescaled values $\hat{v}_{i}$ are all at most $\frac{n}{\epsilon}$ , the algorithm runs in $O(nV)=O\left( n\left( n\cdot \frac{n}{\epsilon} \right) \right)=O(n^{3}/\epsilon)$ . So the algorithm itself is efficient. What about the approximation factor, though?

As you may have guessed, $\epsilon$ is related to the approximation factor of this algorithm. I'll spare the details, but it can be shown that, the approximation factor is $\alpha_{\mathcal{A}}\geq1-\epsilon$ , i.e. the value $K$ output by this algorithm is related to $\mathrm{OPT}$ by $K\geq \mathrm{OPT}(1-\epsilon)$ .

Approximability Hierarchy

All $\mathrm{NP}$ -Complete optimization problems can be classified into one of the following categories.

No finite approximation ratio is possible (e.g. TSP)
An approximation ratio is possible, but there are limits to how small it can be / it's directly proportional to the input size (e.g. Vertex Cover, $k$ -Cluster, TSP w/ triangle inequality)
An approximation ratio of $\approx \log n$ is possible (e.g. Set Cover)
The approximation ratio has no limits, i.e. polynomial approximation algorithms with error ratios arbitrarily close to zero exist (e.g. knapsack)

Remember that this all relies on

\mathrm{P}\neq \mathrm{NP}

Also, note that these algorithms often perform much better than their worst-case approximation ratio.

9.3 Local Search Heuristics

Local search is a technique inspired by evolution—incrementally introducing random mutations and keeping those that improve performance. It can be applied to nay optimization problem!

The most naive version is iteratively replacing the current solution with a better one nearby, i.e. within its neighborhood. What exactly defines the "neighborhood" of a solution is the central design decision in local search.

TSP

We'll first consider the application of local search heuristics to TSP.

The most obvious choice of definition for "closeness" in TSP is if two tours differ in only a few edges. It's impossible for two tours to differ in only one edge, so we'll first consider two tours as "close" if they differ in exactly two edges. Then, let the 2-change neighborhood of tour $s$ be the set of tours that can be obtained by removing two edges of $s$ and adding in two distinct edges.

So... how well does this do? That is, what's the runtime, and does it always return the most optimal solution?

Unfortunately, it's not too clear what the answer to this question is. Each iteration is fast, since a tour only has $O(n^{2})$ neighbors, but it's unclear how many iterations will be needed, and whether or not the final tour is globally optimal or just locally optimal.

We can somewhat reduce the local optimality limitation by, say, extending the neighborhood to 3-change; however, the size of a neighborhood increases to $O(n^{3})$ , thus increasing runtime. Moreover, extending to 3-change does not eliminate the problem entirely; globally suboptimal but locally optimal minima still exist, though fewer. Thus, there's a balance between the algorithm's runtime and its closeness to the global optimum; the right balance is typically found via iterative experimentation.

Graph Partitioning

Graph partitioning may be described as the following problem.

Input:
Undirected graph $G=(V,E)$ with nonnegative edge weights
Real number $\alpha \in(0,1/2]$
Output: A partition of the vertices into two groups $A$ and $B$ , each of size at least $\alpha \lvert V \rvert$
Goal: Minimize the capacity of the cut $(A,B)$

Note that we actually saw a special case of this in Section 8, Balanced Cut. In fact, we will consider an even stricter variant of Balanced Cut here for designing a local search heuristic: we will focus on the special case $\alpha=\frac{1}{2}$ . (Interestingly enough, the general problem of Graph Partitioning reduces to this special case).

Now, we must define what a neighborhood is / our notion of closeness. Let $(A,B)$ with $\lvert A \rvert=\lvert B \rvert$ be a candidate solution. Then, we will define its neighborhood to be the set of all solutions that can be obtained by swapping one pair of vertices across the cut, i.e.

\{ (A-\{ a \}+\{ b \},B-\{ b \}+\{ a \})\mid (a,b)\in A\times B \}

As with TSP, though, many globally suboptimal but locally optimal solutions exist. How can we fix this?

Dealing with Local Optima

Randomization and Restarts

Randomization adds a bit of variance to our local search! It's typically applied in two ways: $(1)$ picking a random initial solution and $(2)$ randomly choosing a local move when several are optimal.

Randomization is best used by running local search repeatedly with a different random seed on each invocation. If the probability of reaching a good local optimum on any given run is $p$ , then it's expected that within $O(1/p)$ runs, such a solution will be found.

Simulated Annealing

Randomization is great, but it's not always effective. For some problems, as the size grows, the ratio of bad to good local optima increases, often exponentially. Simulated annealing is a different approach that allows moves within the neighborhood that actually increase the cost—the idea is that locally suboptimal moves might actually bring the search out of a bad local optima and closer to a good local optima/the global optimum. Simulated annealing introduces a new parameter: a temperature $T$ , where a higher temperature increases the likelihood that locally suboptimal solutions are chosen for each move.

Typically, simulated annealing will start off with a large $T$ and slowly reduce it to zero (at which point it is effectively the same as normal local search). This allows the local search to initially wander around freely, and gradually begin to prefer the more optimal solutions, allowing it to converge on a solution.

Simulated annealing is a great strategy; however, it's also a much more expensive technique, as higher temperatures inevitably result in more local moves until convergence. Thus, it frequently requires significant experimentation to choose the annealing schedule, i.e. the schedule for slowly reducing the temperature. Ultimately, though, the solution quality improvements that simulated annealing provides are often worth it!

Fun Fact

Simulated annealing was inspired by the physics of crystallization. Search up "annealing" if you want to know more :)