The seventh course in my series going through MIT OpenCourseWare courses looks at Design and Analysis of Algorithms, an undergraduate course on algorithms that's a natural follow-up to Introduction to Algorithms. This course goes further with techniques for tackling NP-complete problems, such as incorporating randomness and other tricks alongside dynamic programming, introduces new data structures such as B-trees and union-find, and looks at running algorithms across distributed systems as well as at cryptography. Both this and Introduction to Algorithms are courses I can see returning to whenever I'm practicing on HackerRank, and both provide a lot of good examples of improving solutions with dynamic programming and data structures.
One thing I found particularly interesting, having spent a good amount of time going through and taking notes on AWS's Builders' Library series, was the section on distributed algorithms and how the puzzles we've been learning about can be run in practice. Having a lecture cover a leader election algorithm after reading about Amazon's strategies for using it in their systems helps remind me of why I'm learning this. Sometimes it feels like there's a disconnect between the computer science algorithms on sites like HackerRank and seen in interviews and the day-to-day work of using a programming language or tools like Docker and AWS, and these connections to distributed systems and cryptography help show why they're practical.
For the list of courses I'm currently going through, this will be the second and last one focused on these types of algorithms, although MIT has many more algorithms courses available, including ones focused on distributed systems, cryptography and more. With the material I've seen so far, there's still plenty to practice and go back to, and it will definitely be worth getting in some practice before looking at more algorithms courses. Still, it's always nice to know there's a lot more to see and learn in a field.
MIT Courses #7: Design and Analysis of Algorithms – 6.046J by Professors Erik Demaine, Srini Devadas and Nancy Lynch
Lecture 1 – Course Overview, Interval Scheduling
The theme of 6.046 is taking themes of 6.006 to a new level. Divide and conquer is developed further, along with more techniques for optimization as well as network flow algorithms, which focus on maximizing the quantity pushed through a network. More time is spent on intractability, where tractable problems have polynomial-time algorithms that solve them, while intractable problems, as far as we know, require exponential time. Number theory and cryptography are covered further in this course as well.
Very small changes in problem statements can cause very different algorithm complexities. P is again the class of problems solvable in polynomial time. NP is the class whose solutions are verifiable in polynomial time, such as the Hamiltonian cycle problem, where given a directed graph you find a simple cycle that contains each vertex in V. Validating a solution is as simple as traversing the cycle, which takes polynomial time, but no polynomial-time algorithm is known for finding the cycle. NP-complete defines the level of intractability for NP; these are the hardest problems in NP, where a polynomial-time solution to one would solve all NP problems.
An interval scheduling problem is used to show the transition from a problem in P to a problem in NP using simple changes. The interval scheduling problem is set up with resources and requests 1 through n. The first example of the problem has a single resource. s(i) is the start time and f(i) is the finish time, where s(i) is less than f(i). Two requests i and j are compatible provided they don't overlap, meaning f(i) is at most s(j) or f(j) is at most s(i). The goal is to select a compatible subset of intervals of maximum size. A greedy strategy can be used to solve this problem. A greedy algorithm is a myopic algorithm that processes the input one piece at a time without looking ahead. These are efficient since each decision is made based on a small amount of input.
A template for a solution to this problem is to use a simple rule to select a request, then reject all others incompatible with it, repeating until all requests are processed. The quality of the algorithm will depend on the rule used to pick the request. Choosing the smallest request doesn't always lead to the maximum number. Another possible rule is to take the request with the largest number of compatible requests, though this doesn't always work either. The optimal rule for this greedy algorithm is to always choose the request with the earliest finish time, which will always get the maximum number of requests included. The claim that "given a list of intervals L, this algorithm will produce k* intervals where k* is maximum" is proven by induction on k*. In the base case k* is 1 and any interval works. For the inductive step, assume the claim holds for k*: L' represents the intervals still compatible after the first (earliest-finishing) item is chosen, the optimal schedule for L' has size k*, and together with the chosen interval this gives k = k*+1. The proof shows how the algorithm breaks a problem into smaller problems.
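As a quick sketch of the earliest-finish-time rule (my own Python, not from the lecture; the function name and tuple format are my choices):

```python
def max_compatible(requests):
    """requests: list of (start, finish) tuples; returns a max-size compatible subset."""
    chosen = []
    last_finish = float("-inf")
    # Sort by finish time, then greedily take each request compatible with the last one chosen.
    for s, f in sorted(requests, key=lambda r: r[1]):
        if s >= last_finish:  # compatible: starts at or after the last chosen finish
            chosen.append((s, f))
            last_finish = f
    return chosen
```

Here sorting dominates, so the whole thing runs in O(n lg n).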
Making the interval scheduling problem a bit trickier, the next example gives each request a weight of w(i) and the goal is to schedule a subset of requests with the maximum total weight. The same greedy algorithm will not work here, since it would prioritize an interval of weight 1 that finishes earlier over an interval of weight infinity that finishes right after. A simple dynamic programming solution defines subproblems R^x = {j in R : s(j) >= x} for x = f(i). So R^x, or R^f(i), is the set of all requests that start after request i finishes. This isn't the same as the set of requests compatible with the ith request; it's the set of requests later than f(i). If n is the number of requests, the number of subproblems equals n. Each subproblem is solved once, then memoized, so the total work is the number of subproblems times the time to solve each (assuming constant-time lookups). The dynamic programming solution guesses by trying each request as a possible first request. The recursion is opt(R) = max over i from 1 to n of (w(i) + opt(R^f(i))). The complexity of this is O(n^2). Unweighted interval scheduling with greedy was near-linear time, the weighted version is still efficiently solvable, and a small change can bring this to NP-complete. If multiple resources can be non-identical and specific tasks can only be run on specific resources regardless of weight (Q(i) is a set of machines that request i can run on, specific to each i), then the resulting decision problem, asking whether k (of at most n) requests can be scheduled, is NP-complete. Future classes will help to get better general performance for these kinds of problems.
Recitation 1 – Matrix Multiplication and the Master Theorem
The first recitation continues with the weighted interval scheduling problem, showing how to get an n lg n solution via dynamic programming. Requests are sorted in earliest-finish-time order, f(1) <= f(2) <= … <= f(n), and p(j) for interval j is the largest index i less than j where i and j are compatible (not overlapping). M is an n-sized array of optimal values where M[k] is the max weight using requests 1 to k. The algorithm sets M[0] = 0 and, for j = 1 to n, runs M[j] = max(w(j) + M[p(j)], M[j-1]). The total algorithm takes O(n lg n) time, dominated by the sorting and the binary searches to compute each p(j).
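A sketch of the recitation's approach in Python, using binary search to compute p(j) (the code and the tuple format are my own, not the recitation's):

```python
import bisect

def max_weight_schedule(requests):
    """requests: list of (start, finish, weight); returns the maximum total weight."""
    reqs = sorted(requests, key=lambda r: r[1])       # sort by finish time
    finishes = [f for _, f, _ in reqs]
    n = len(reqs)
    M = [0] * (n + 1)                                 # M[k] = best weight using requests 1..k
    for j in range(1, n + 1):
        s, _, w = reqs[j - 1]
        # p(j): number of earlier intervals with f(i) <= s(j), found by binary search.
        p = bisect.bisect_right(finishes, s, 0, j - 1)
        M[j] = max(w + M[p], M[j - 1])                # take request j, or skip it
    return M[n]
```

The sort and the n binary searches give the O(n lg n) total.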
Strassen’s algorithm is introduced here as well for matrix multiplication, where the straightforward algorithm multiplies two matrices in O(n^3). Strassen uses divide and conquer to beat this, though it isn’t currently the asymptotically fastest solution. A and B are made 2^k by 2^k by filling in the empty space with zeros. A, B, and C (the product matrix) are each partitioned into 4 matrices of dimension 2^(k-1), so A becomes [[A11, A12], [A21, A22]] with B and C following the same structure. C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22, C21 = A21*B11 + A22*B21, and C22 = A21*B12 + A22*B22. Since matrix addition is far cheaper than multiplication, Strassen simplifies things by going from 8 multiplies down to 7 at the cost of 18 additions. The overall run time is Θ(n^2.8074).
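A minimal sketch of Strassen's seven-product scheme in Python, assuming the inputs are already padded to a power-of-two size (the helper names `add`, `sub` and `strassen` are my own):

```python
def add(A, B):  # elementwise sum of two square matrices
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):  # elementwise difference
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply two n x n matrices where n is a power of 2 (padding step omitted)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Partition each matrix into four h x h quadrants.
    a11 = [r[:h] for r in A[:h]]; a12 = [r[h:] for r in A[:h]]
    a21 = [r[:h] for r in A[h:]]; a22 = [r[h:] for r in A[h:]]
    b11 = [r[:h] for r in B[:h]]; b12 = [r[h:] for r in B[:h]]
    b21 = [r[:h] for r in B[h:]]; b22 = [r[h:] for r in B[h:]]
    # Strassen's seven products replace the eight of the naive scheme.
    m1 = strassen(add(a11, a22), add(b11, b22))
    m2 = strassen(add(a21, a22), b11)
    m3 = strassen(a11, sub(b12, b22))
    m4 = strassen(a22, sub(b21, b11))
    m5 = strassen(add(a11, a12), b22)
    m6 = strassen(sub(a21, a11), add(b11, b12))
    m7 = strassen(sub(a12, a22), add(b21, b22))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(sub(add(m1, m3), m2), m6)
    # Reassemble the quadrants into the product matrix.
    return [r1 + r2 for r1, r2 in zip(c11, c12)] + [r1 + r2 for r1, r2 in zip(c21, c22)]
```

The recursion makes 7 half-size multiplications plus Θ(n^2) addition work, giving the T(n) = 7T(n/2) + Θ(n^2) recurrence.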
The general form of a recurrence is T(n) = aT(n/b) + f(n). If f(n) is polynomially larger than n^(log_b a), then T(n) is Θ(f(n)), and Θ(n^(log_b a)) if the reverse is true. If f(n) is Θ(n^(log_b a) * lg^k n) and k isn't negative, T(n) is Θ(f(n) lg n). A precise bound can't be determined if one side is larger but not polynomially larger. For Strassen, the run time is Θ(n^(log_2 7)). The master theorem is also used to show dn as an upper bound for median finding.
Lecture 2 – Divide and Conquer: Convex Hull, Median Finding
This lecture is the start of a module on divide and conquer algorithms. The divide and conquer paradigm is the idea of dividing a problem of size n into "a" subproblems of size n/b, solving each subproblem recursively (the small problems become easy to solve), and combining the smaller solutions into the overall solution, which is the core of the problem. In general, T(n) = aT(n/b) + [work for merge].
The first problem is a convex hull problem: given n points in a plane, S = {(x_i, y_i) | i = 1, …, n}, assume no two have the same x coordinate, no two have the same y coordinate, and no three are in a line. Squares, triangles, hexagons, and so on are examples of convex polygons, having no concave angles. The convex hull is the smallest convex polygon containing all points in S, denoted CH(S). CH(S) is a sequence of points on the boundary in clockwise order, represented as a doubly-linked list. A brute force algorithm tests drawing lines between pairs of points and checks if all other points are on one side of the line. If they are, then the line can be a side of the convex hull; if not, it can't be. For n points, this leads to O(n^2) segments, O(n) per test, and a total of O(n^3).
A divide and conquer algorithm sorts the points by x coordinate (this is done once). For the input set S, divide into the left half A and right half B by x coordinates, compute CH(A) and CH(B) recursively, and combine. How you merge determines the complexity. If a convex hull is generated for each subproblem, merging needs to take both and generate the new tangents that are part of the overall hull. A trivial merge algorithm looks at every pair of points between the two polygons with O(n^2) complexity. What if you just connected the highest points between the polygons as well as the lowest to construct the convex hull? This seems like an option from one example, but another shows it to be incorrect. A better algorithm starts by connecting the two closest points of the subhulls and walking outward from there:
i = 1, j = 1
while y(i, j+1) > y(i, j) or y(i+1, j) > y(i, j):
    if y(i, j+1) > y(i, j):
        j = j + 1 mod q
    else:
        i = i - 1 mod p
return (a_i, b_j) as upper tangent
In this case, a and b are the two convex hulls being combined, where a has p points, b has q points, and y(i, j) gives the height at which the line through a_i and b_j crosses the dividing line between the hulls; this merge algorithm is order n. So, in total, T(n) = 2T(n/2) + Θ(n) = Θ(n lg n), which is the best solution for a convex hull algorithm. For generating the circular output, find the upper tangent (a_i, b_j) and lower tangent (a_k, b_m), then cut and paste: start at the upper tangent, go down the b's until you see the lower tangent, and loop back up to the upper. The complexity of walking through the list is O(n). A gift wrapping algorithm is an example of a 3D convex hull algorithm.
Another divide and conquer algorithm is "median finding", which looks for a better solution to finding a median than sorting and looking at the n/2 position. Don't be satisfied with a standard algorithm if you think you can do better; in this case, the goal is to beat Θ(n log n) time. Given a set of n numbers, rank(x) is the number of elements in the set that are at most x. The goal is to find the elements with rank floor((n+1)/2) and ceiling((n+1)/2) in linear time. For this example, most of the challenge is in the divide instead of the merge. A Select routine takes a set S and a rank i, picks an element x in S, computes k = rank(x), and generates subarrays B and C, where B holds the y in S with y less than x and C the y greater than x, giving k-1 elements in B and n-k in C. B and C are the two subproblems. If k equals i, return x; otherwise, if k is greater than i, return Select(B, i), or if k is less than i, return Select(C, i-k).
The worst case complexity of this algorithm so far is Θ(n^2), since you could constantly pick extremely unbalanced subarrays. To get the desired complexity, you need a way of guaranteeing balanced subarrays. One way of picking x is to arrange S into ceil(n/5) columns of size 5. Sort each column, with big elements on top, in linear time. Find the "median of medians" x by taking the middle element of each column and then the middle one of those. At least half of the ceil(n/5) groups contribute at least 3 elements greater than x, except for the group containing x and possibly one group with fewer than 5 elements, and the same holds for elements less than x. With this, T(n) = O(1) for n < 140, and otherwise T(n) = T(ceil(n/5)) + T(7n/10 + 6) + Θ(n), and the whole thing is Θ(n) time.
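A sketch of Select with the median-of-medians pivot in Python, assuming distinct elements; the small base-case cutoff here is a simplification of the lecture's n < 140:

```python
def select(S, i):
    """Return the element of rank i (1-indexed) in list S, in worst-case linear time."""
    if len(S) <= 5:
        return sorted(S)[i - 1]
    # Split into columns of 5, take the median of each, then the median of medians.
    columns = [S[k:k + 5] for k in range(0, len(S), 5)]
    medians = [sorted(col)[len(col) // 2] for col in columns]
    x = select(medians, (len(medians) + 1) // 2)
    B = [y for y in S if y < x]         # elements less than the pivot
    C = [y for y in S if y > x]         # elements greater than the pivot
    k = len(B) + 1                      # rank of x
    if i == k:
        return x
    elif i < k:
        return select(B, i)
    else:
        return select(C, i - k)
```

The pivot guarantees neither B nor C is much more than 7n/10 long, which is what makes the recurrence linear.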
Lecture 3 – Divide and Conquer: FFT
The Fast Fourier Transform is the most taught algorithm at MIT, with a wide variety of uses. Today focuses on it in the context of divide and conquer and polynomials, keeping all n terms of a polynomial instead of just the usual dominant term. Polynomials can be written in summation notation, A(x) = sum from k=0 to n-1 of a_k x^k, and in vector notation, ⟨a_0, a_1, …, a_{n-1}⟩. One operation run on polynomials is evaluation: given a value x_0, calculate A(x_0). You could compute each power of x_0 individually, which is quadratic, or build up the powers by repeated multiplication to compute it in linear time. Horner's rule is a neater way of writing this, A(x) = a_0 + x(a_1 + x(a_2 + … + x(a_{n-1})…)), with O(n) time assuming constant time to multiply or add two real numbers. Linear time is good this week and quadratic is bad. The second key operation on polynomials is addition, where C(x) = A(x) + B(x) for all x, so c_k = a_k + b_k for all k. Lastly, multiplying two polynomials is a bit more tricky as all terms get multiplied together: c_k = sum from j=0 to k of a_j b_{k-j}. Computing this naively takes O(n^2), though O(n lg n) will be achieved this week.
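Horner's rule is short enough to sketch directly (my own Python naming):

```python
def horner(coeffs, x0):
    """Evaluate A(x0) for A given as the coefficient vector [a0, a1, ..., a_{n-1}]."""
    result = 0
    # Work from the highest coefficient inward: a0 + x(a1 + x(a2 + ...)).
    for a in reversed(coeffs):
        result = result * x0 + a
    return result
```

One multiply and one add per coefficient, so O(n) total.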
Polynomial multiplication is important since it's equivalent to convolution (used in signal and image processing, deep learning, etc.). Convolution is usually thought of as an operation on vectors A and reverse(B). For instance, this could take an original signal and another signal used to clean it up. Another example combines an image with another to make the original look out of focus. The dot product (inner product) multiplies the corresponding positions and adds them up, and the convolution is the dot product at all shifts. If polynomial multiplication can be solved in O(n lg n), so can convolution.
Different representations of polynomials can be used to help solve this. The standard way of writing down the coefficients a_i is the coefficient vector representation. Another representation is the point or sample representation, (x_k, y_k) for k = 0 to n-1. If A(x_k) = y_k and the x_k's are distinct, this uniquely determines the polynomial. Polynomials can also be represented as a sequence of roots, since every polynomial of degree n has n roots (over the complex numbers). So if r_0 to r_{n-1} are the roots, the polynomial is c(x - r_0)(x - r_1)…(x - r_{n-1}). However, it's impossible in general to compute the roots exactly from a coefficient vector for degrees 5 and higher. Multiplication in the root representation is done by taking the union of the root vectors and multiplying the constants. There's no simple relation between the roots of a sum of polynomials and the roots of the originals, making it a bad representation for addition.
The representations are compared on three operations: evaluation, addition and multiplication. For the coefficient representation, evaluation and addition are linear and multiplication is quadratic. For roots, multiplication and evaluation are linear and addition is really hard, noted as infinite time. For samples, addition and multiplication are linear, but evaluation requires polynomial interpolation, which is possible but for which the best known algorithms are quadratic. In summary, no representation is perfect. O(n lg n) time is accomplished by converting between coefficient and sample representations quickly to get the benefits of both, via the Fast Fourier Transform.
Sampling itself is not so easy and takes quadratic time, as there are n samples to take, each costing linear time. Converting coefficients to samples means multiplying the Vandermonde matrix V by the coefficient vector A, taking O(n^2), and the reverse requires solving V \ A, which can be done with Gaussian elimination in cubic time, or as V^{-1} * A in O(n^2). So converting between representations, from what's been shown, is quadratic. The opportunity to make this transformation easier comes from our ability to choose the x_k values. Divide and conquer is used to accomplish this. The goal is to compute A(x) for all x values in a set X, where X is the set of the x_k. The divide step splits the polynomial into even and odd coefficients: A_even(x) = sum from k=0 to n/2-1 of a_{2k} x^k = ⟨a_0, a_2, a_4, …⟩ and A_odd(x) = sum from k=0 to n/2-1 of a_{2k+1} x^k = ⟨a_1, a_3, a_5, …⟩. The conquer step recursively computes A_even(y) and A_odd(y) for y in X^2 = {x^2 | x in X}. The combine step computes A(x) = A_even(x^2) + x * A_odd(x^2). Evaluating this process, T(n, |X|) = 2T(n/2, |X|) + O(n + |X|), which comes out to O(n^2).
This can be improved if X becomes half the size in the recursion, giving T(n) = 2T(n/2) + O(n), which is O(n lg n). In the base case, |X| = 1 with X = {1}. For |X| = 2, X = {-1, 1}: two values that, when squared, collapse to one value. If there are two square roots for every number, then squaring collapses the set by a factor of 2. A set is collapsing if |X^2| = |X|/2 and X^2 is collapsing, or |X| = 1. The square roots of -1 are i and -i, so |X| = 4 gives {1, -1, i, -i}. Complex numbers are needed once square roots are taken. The square roots of i are ±(sqrt(2)/2)(1 + i), and of -i, ±(sqrt(2)/2)(1 - i). With these and |X| = 8, the values sit on the unit circle. Squaring a number doubles its angle, and these values are the nth roots of unity; all of them lead back to 1 by taking a certain power. Is this the first time trig is coming into the algorithms courses? The point (cos θ, sin θ) is the complex number cos θ + i sin θ, for θ = 0, τ/n, 2τ/n, …, ((n-1)/n)τ, where τ = 2π. cos θ + i sin θ = e^{iθ}, and this helps show how squares work: for θ = kτ/n, (e^{iθ})^2 = e^{i(2θ)} = e^{i(2θ mod τ)}, where the mod τ is just going around the circle again. e^{iτ} = 1 and (e^{ikτ/n})^2 = e^{i(2k)τ/n}. As long as n is a power of 2, the nth roots of unity are collapsing. With this pattern, set x_k = e^{ikτ/n}, choosing the x values to be evenly spaced around the unit circle. Once you use complex numbers, for a vector of size n you only have to deal with the nth roots of unity, and the set X halves with each level of the recursion.
Defining the algorithm: the Fast Fourier Transform is the divide and conquer algorithm for computing the Discrete Fourier Transform. The DFT is V * A for x_k = e^{ikτ/n}, where V_{jk} = x_k^j = e^{ijkτ/n}; multiplying this matrix by a vector is the DFT of the vector, and the FFT is a way of computing it in n lg n time. This converts a coefficient vector to a sample vector, though that needs to be transformed back into a coefficient vector, and that process is almost identical. The polynomial multiplication algorithm in n lg n takes two coefficient-form polynomials A and B, computes samples A* = FFT(A) and B* = FFT(B), and computes C*_k = A*_k * B*_k for all k. Then C = IFFT(C*), where IFFT is the inverse of FFT and is the missing piece. This is done by changing the x_k values: V^{-1} = V̄/n, where V̄ is the complex conjugate of V, and the complex conjugate means reflection through the x-axis (for a + ib, the complex conjugate is a - ib). So for the inverse, x_k = e^{-ikτ/n}. This gives V̄ * A* = nA, and dividing by n gives the inverse transform: flip the sign in the exponent of the x_k's, do the same matrix-vector product, and divide by n. He goes on to prove that this algorithm does compute the IFFT.
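Putting the pieces together, a hedged sketch of FFT-based polynomial multiplication in Python (recursive radix-2 FFT; the function names, the sign convention, and the rounding of the final coefficients are my own choices):

```python
import cmath

def fft(a, invert=False):
    """Radix-2 FFT of vector a (len a power of 2); invert flips to the IFFT direction."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)   # A_even evaluated at the (n/2)th roots of unity
    odd = fft(a[1::2], invert)    # A_odd likewise
    sign = 1 if invert else -1    # forward and inverse just use opposite exponent signs
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]           # A(x) = A_even(x^2) + x * A_odd(x^2)
        out[k + n // 2] = even[k] - w * odd[k]  # the root at k + n/2 is -w
    return out

def poly_multiply(a, b):
    """Multiply coefficient vectors a and b in O(n lg n): FFT, pointwise, IFFT."""
    n = 1
    while n < len(a) + len(b) - 1:
        n *= 2                    # pad up to a power of 2 big enough for the product
    fa = fft(a + [0] * (n - len(a)))
    fb = fft(b + [0] * (n - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The unnormalized inverse returns n * coefficients; divide by n and round off error.
    return [round((c / n).real) for c in prod[:len(a) + len(b) - 1]]
```

The rounding at the end assumes integer coefficients; for real-valued inputs you would keep the floats instead.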
One application for FFT is editing audio, which uses frequency space, viewing the wave as trigonometric functions whose angles (phases) represent how they shift over time. Any audio stream can be converted to and manipulated in this space; Audacity converts to this space to edit and then converts back. This only takes n lg n time, and FFT is implemented in hardware as well. FFTW (the "Fastest Fourier Transform in the West") was made at MIT and is one of the best implementations in software, commonly used in the world.
Recitation 2 – Trees and B-Trees
2-3 trees are balanced search trees where every node that isn't a leaf has 2 children with 1 piece of data or 3 children with 2, all leaves are at the same level, non-leaves branch, data is kept sorted, leaves have one or two fields, and the height is at most lg n. Search starts at the root and traverses down in order in O(lg n), looking at one node per level. Inserting a node is done by searching for where it would go and inserting it there. Then, if and while a node has 3 elements, it's split into its left element, median, and right element, with the median moving up a level, added to its parent or becoming a new root. Delete is done by swapping the desired node with its successor until it's a leaf and then merging to get back to a valid 2-3 tree. Insert and delete are both O(lg n).
More generally, B-trees are tree data structures that store sorted data, a generalization of BSTs where nodes can have more values and children, while supporting search, insert and delete in O(lg n). A B-tree's minimum degree (or branching factor) B means non-root nodes contain at least B-1 keys and all non-leaf, non-root nodes have at least B children. Nodes contain at most 2B-1 keys and all nodes have at most 2B children; an internal node has one more child than the number of keys it stores. B-trees are useful for caching, where reading an entire block costs about the same as reading one element, by setting B to the block size, and they're used in practice by most databases and file systems. MySQL and SQLite along with macOS's HFS, Windows' NTFS, Linux's ext3 and more make use of B-trees.
Lecture 4 – Divide and Conquer: van Emde Boas Trees
van Emde Boas trees are an old data structure named after Peter van Emde Boas. The goal of the data structure is to maintain n elements among {0, 1, …, u-1} and run insert, delete and successor operations on them. Successor is symmetric to a predecessor function, and the two are done in roughly the same way. These operations are known in lg n time from the previous course, but this lecture shows how to do them in O(lg lg u) time, which is often smaller than lg n. If u is polynomial in n, n^O(1), or even n^(lg^O(1) n), then lg lg u is O(lg lg n), allowing for superfast operations. Most network routers use this data structure to store routing tables.
Where does this lg lg u bound come from? The idea is to binary search on the levels of the tree, since the height of the tree is lg u. In terms of recurrences, binary search is T(k) = T(k/2) + O(1) = O(lg k), so the goal is T(lg u) = T((lg u)/2) + O(1). Rewriting in terms of u, T'(u) = T'(sqrt(u)) + O(1) = O(lg lg u). The goal is to take a problem of size u and split it into problems of size sqrt(u). A simple data structure for representing numbers 0 to u-1, handling insert and delete in constant time, is a bit vector: an array of size u where each slot holds a 0 or 1 denoting whether the value is absent or present. Inserting or deleting changes a 1 to a 0 or a 0 to a 1 in constant time, though the successor function takes O(u). This is a good starting point for constructing the data structure. The universe is split into sqrt(u) clusters of sqrt(u) values each. For an example of a length-16 array, the universe is split into clusters of 4, and the bit vector is augmented with a summary vector containing the "or" of each cluster, telling whether or not any value is present in that cluster with a 1 or 0. With the above strategy, insert is still constant time, and successor(x) first looks in x's cluster; if the target isn't found, it looks for the next 1 bit in the summary vector and then the first 1 in that cluster. In each case, this takes O(sqrt(u)) time. This can be improved with the recursive step.
In general, if x = i*sqrt(u) + j, where j is at least 0 and less than sqrt(u), then i is the cluster number and j is the position in that cluster: high(x) = floor(x/sqrt(u)), low(x) = x mod sqrt(u), and index(i, j) = i*sqrt(u) + j. high(x) corresponds to the high half of x's bits and low(x) to the low half. For the recursive step, V of size u represents the overall van Emde Boas structure, with an array V.cluster of sqrt(u) smaller V structures, each of size sqrt(u), and V.summary, a size-sqrt(u) V structure as well. Looking at insert,
Insert(V, x):
    Insert(V.cluster[high(x)], low(x))
    Insert(V.summary, high(x))
The summary keeps track of which clusters aren't empty. For successor,
Successor(V, x):
    i = high(x)
    j = Successor(V.cluster[i], low(x))    // look in x's cluster (high(x))
    if j = inf:
        i = Successor(V.summary, i)
        j = Successor(V.cluster[i], -inf)
    return index(i, j)
This algorithm is still pretty bad since it makes multiple recursive calls, so T(u) = 2T(sqrt(u)) + O(1), which in terms of lg u is T'(lg u) = 2T'((lg u)/2) + O(1). To get the desired result, we need to get this down to one recursive call. Successor can be improved by removing one call by updating the data structure: update V to also store the minimum.
Insert(V, x):
    if x < V.min:
        V.min = x
    …
In successor, replace "j = Successor(V.cluster[i], -inf)" with "j = V.cluster[i].min". This brings it down to lg u performance, which is partial progress. Adding similar code to insert to store V.max, you can update Successor to get O(lg lg u).
Successor(V, x):
    i = high(x)
    if low(x) < V.cluster[i].max:
        j = Successor(V.cluster[i], low(x))
    else:
        i = Successor(V.summary, high(x))
        j = V.cluster[i].min
    return index(i, j)
Unfortunately, insert still takes lg u time, but it can be improved by not storing the minimum recursively: if the min field is blank, store the new value as the minimum and don't recurse.
Insert(V, x):
    if V.min = None:
        V.min = x
        V.max = x
        return
    if x < V.min: swap x <-> V.min
    if x > V.max: V.max = x
    if V.cluster[high(x)].min = None:
        Insert(V.summary, high(x))
    Insert(V.cluster[high(x)], low(x))
This is constant time when inserting into an empty structure, so in either case there's really only the cost of one non-constant recursive call: if the cluster was empty, the summary insert recurses but the insert into the empty cluster is constant; if not, the summary insert is skipped. Successor needs to be updated as well by adding a first line of "if x < V.min: return V.min", since V.min may be represented nowhere else. Finally getting around to the code for delete,
Delete(V, x):
    if x = V.min:
        i = V.summary.min
        if i = None:
            V.min = V.max = None    // x was the only element
            return
        x = V.min = index(i, V.cluster[i].min)    // pull the new min up, then delete its old copy
    Delete(V.cluster[high(x)], low(x))
    if V.cluster[high(x)].min = None:
        Delete(V.summary, high(x))
    if x = V.max:
        if V.summary.max = None:
            V.max = V.min
        else:
            i = V.summary.max
            V.max = index(i, V.cluster[i].max)
Most of the time, lg lg u is the right answer, and it is the lower bound for these operations. The other issue is space: storing the tree takes O(u), where O(n) is desired. To fix the space bound, the idea is to only store nonempty clusters; there's no point in storing the empty ones. Making V.cluster a dictionary instead of an array accomplishes this and gives O(n lg lg u) space. This can be reduced further to O(n) by stopping the recursion when the structure is small, such as size lg lg u.
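Pulling the lecture's pseudocode together, here's a minimal Python sketch of the structure, assuming u has the form 2^(2^k) so that square roots stay integral, and storing only nonempty clusters in a dictionary as described; delete is omitted to keep it short:

```python
class VEB:
    """Sketch of a van Emde Boas structure over {0, ..., u-1}.
    Insert and successor run in O(lg lg u); delete follows the same pattern."""

    def __init__(self, u):
        self.u = u
        self.min = None   # the min lives here and nowhere else (the lazy trick)
        self.max = None
        if u > 2:
            self.root = int(round(u ** 0.5))
            self.cluster = {}       # dictionary: only nonempty clusters are stored
            self.summary = None     # created lazily, tracks which clusters are nonempty

    def high(self, x): return x // self.root
    def low(self, x): return x % self.root
    def index(self, i, j): return i * self.root + j

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x        # empty structure: constant time, no recursion
            return
        if x < self.min:
            x, self.min = self.min, x      # swap: the old min gets stored recursively
        if x > self.max:
            self.max = x
        if self.u > 2:
            i = self.high(x)
            if i not in self.cluster:
                self.cluster[i] = VEB(self.root)
            if self.cluster[i].min is None:
                if self.summary is None:
                    self.summary = VEB(self.root)
                self.summary.insert(i)     # cluster was empty: the summary insert recurses,
                                           # but the cluster insert below is constant time
            self.cluster[i].insert(self.low(x))

    def successor(self, x):
        if self.min is not None and x < self.min:
            return self.min
        if self.u == 2:
            return 1 if x == 0 and self.max == 1 else None
        i = self.high(x)
        c = self.cluster.get(i)
        if c is not None and c.max is not None and self.low(x) < c.max:
            j = c.successor(self.low(x))   # the answer is inside x's own cluster
        else:
            i = self.summary.successor(i) if self.summary else None
            if i is None:
                return None                # nothing after x
            j = self.cluster[i].min
        return self.index(i, j)
```

Note how successor checks the cluster's max before recursing, so exactly one of the two branches makes a non-constant recursive call.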
Lecture 5 – Amortization: Amortized Analysis
This week looks at analysis techniques for data structures used to implement an algorithm. Amortization is a technique from financial analysis, and here it's the idea of worrying about the total cost rather than individual operation costs. Table doubling was an example from Intro to Algorithms that made use of this. With n items in a hash table of size m, the expected cost per operation is O(1 + n/m). If n grows to m, then the table is doubled, which takes O(m) work and happens lg n times over n insertions. The total cost for n insertions is O(n), so the amortized cost per operation is constant. The aggregate method for amortization measures the total cost of k operations and divides by k to get the amortized cost per operation. Using amortized bounds, you can assign an amortized cost to each operation such that the sum of the amortized costs is at least the sum of the actual costs.
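A toy sketch of table doubling that counts the copy work (the class and field names are mine), showing the total stays linear in the number of appends:

```python
class DynamicArray:
    """Append-only dynamic array, tracking how much copying the doublings cost."""
    def __init__(self):
        self.capacity = 1
        self.n = 0
        self.copies = 0            # total element copies performed by doublings

    def append(self, x):
        if self.n == self.capacity:
            self.capacity *= 2     # double the table, copying all n existing items
            self.copies += self.n
        self.n += 1

arr = DynamicArray()
for i in range(1000):
    arr.append(i)
# Copy work is 1 + 2 + 4 + ... + 512 = 1023 < 2n, so appends are amortized O(1).
```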
An example using 2-3 trees looks at three operations: creating an empty tree takes constant time, each insertion takes O(lg n*) amortized time, and deletion can be viewed as free since the deletion cost can be bounded by the insertion cost, making the total cost O(c + i lg n* + d lg n*) for c creations, i insertions and d deletions. Since d is at most i, this is O(c + i lg n*). Since n has different values at different times, actual costs change at different points; n* represents the maximum size over all time. Another method of computing the sum is the accounting method, where an operation can store credit in a bank account which must keep a non-negative balance, and an operation can pay for time using credit stored in the bank. Here, credit would be stored on each insertion to keep the deletion cost zero: each insertion costs O(lg n) amortized and each delete costs zero amortized, since putting down one coin worth Θ(lg n) on each insertion lets each deletion consume a coin. The amortized cost of an operation is the actual cost plus deposits minus withdrawals. For table doubling, when inserting an item, a coin with a constant value c is placed on the item, and at each doubling the coins pay for the cost of the doubling. The last half of any array will have coins that can pay for the next doubling: when doubling, the last n/2 items have coins, so the amortized cost is Θ(n) - c*n/2, which is at most 0 if c is large enough.
Another method of amortizing is the charging method. This doesn't explicitly keep a bank balance, but operations can retroactively charge part of their cost to the past. The amortized cost of an operation is the actual cost minus the total charge to the past plus the total charge from the future. In this example, during a doubling operation you charge the cost to the n/2 insert operations since the last doubling, which makes each insert's amortized cost constant. With the "since the last doubling" clause, each insert will only be charged once. Looking at both inserts and deletes, both doubling and halving are taken into consideration. Keeping m = Θ(n), when a table is 100 percent full you double its size, and when it's 25 percent full you halve the size. Whether doubling or halving an array, the new array is always 50 percent full. A halving operation is charged to the m/4 deletions since the last resize.
The last method is the potential method, which stores the account balance as a potential function of the data structure state. The potential function maps a data structure configuration to a non-negative integer. The amortized cost is the actual cost plus the change in potential, potential(after) - potential(before). The sum of the amortized costs is the sum of the actual costs plus potential(end) - potential(start), so the potential at the start should be constant or zero. A classic example of amortization is incrementing a binary counter, where you may have to change several bits when you increment, as in 001101. The cost of an increment is O(1 + the number of trailing 1's). In this case the potential function, phi, is the number of 1 bits. In general, with t trailing 1's, an increment destroys t 1's and creates a single 1, so the change in potential is 1 - t and the amortized cost is (1 + t) - t + 1 = 2.
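A small sketch of the binary counter (my own Python) confirming the amortized bound of 2 flips per increment:

```python
def increment(bits):
    """Increment a binary counter stored as a list of bits (least significant first);
    returns the number of bit flips, which is 1 + (number of trailing 1s)."""
    flips = 0
    i = 0
    while i < len(bits) and bits[i] == 1:
        bits[i] = 0                 # each trailing 1 is destroyed...
        flips += 1
        i += 1
    if i == len(bits):
        bits.append(1)              # counter grew a new bit
    else:
        bits[i] = 1                 # ...and a single 1 is created
    flips += 1
    return flips

bits = [0]
total = sum(increment(bits) for _ in range(1000))
# Total flips stay below 2 * 1000, matching the amortized bound of 2 per increment.
```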
Back to analyzing 2-3 trees, the next example looks at inserts. There are lg n splits per insert in the worst case, but the amortized number of splits is constant. There are two types of nodes in 2-3 trees: one has a value and two children and the other has x and y values and 3 children. Given a node that has temporarily grown to x, y, z values with 4 children, splitting makes x and z children of y, each keeping their own children; so splitting a 4-node makes two 2-nodes. A good potential function for this is the number of 3-nodes, i.e. nodes with 3 children. The only way expensive inserts doing many splits happen is if you have a chain of 3-nodes, and after a split there will only be one 3-node even if there were k before, similar to a binary counter increment creating at most one 1 bit. In an insert, if the number of splits is k, then the change in potential is delta phi = -k + 1. The amortized cost becomes k - k + 1 = 1.
The next example looks at inserts and deletes in (2,5)-trees, where the number of children of a node is between 2 and 5; when a node goes from 5 to 6 children, it’s split into two 3-nodes. If you just delete a key in a leaf it may become too empty, for instance a 2-node becomes a 1-node with no value. When a node gets too small it tries to merge with a sibling or parent node, which may result in another split. In this case, the potential function is the number of nodes with 2 children + the number of nodes with 5 children. For insert with k splits, delta phi = -k+1, and for delete with k merges, delta phi = -k+1, so amortized, splits and merges are constant per insert and delete. A generalization of B-trees where you can specify upper and lower bounds is (a,b)-trees. While the structure mostly makes sense from these examples, the introduction of B-trees is covered in recitation, so I may want to come back to those if this is a theme.
Recitation 3 – Union-Find and Amortization
Union-Find, also called a disjoint-set data structure, keeps track of a collection S of pairwise disjoint sets with n elements total. Each set has an arbitrarily chosen element that represents the whole, rep[Si]. Union-find supports 3 key operations. Make-Set(x) adds set {x} to S with rep[{x}]=x. Find-Set(x) determines which set Sx in S contains x and returns rep[Sx]. Lastly, Union(x,y) replaces the two distinct sets containing x and y with their union. To make this perform well, amortization is made use of. In practice, union-find is good for keeping track of connected components in undirected graphs that can be altered. If a node and its edges are added, make-set can create a new component for it, find-set can find which components the node connects to, and union can merge them.
One solution for creating the data structure is to use linked lists, where a doubly linked list is used for each set. A hash table can be used to access each node in constant time. rep[Si] is set to the element at the head of Si’s list. Make-set(x) initializes x as a node in worst-case constant time. Find-set walks left in x’s list in worst-case linear time. Union walks to Sx’s tail and Sy’s head, joining the two in linear time. Giving each node a pointer to the list’s representative and keeping track of a list’s tail and length (weight) improves things so find-set is constant. Union is updated to fix rep pointers as well. If union merges the smaller list into the larger, each element’s rep pointer changes at most O(lg n) times across all unions. With these improvements, the amortized runtime of n calls to make-set and n calls to union is O(n lg n).
Union-find can also be represented as a forest of trees, where rep[x] is the root of the tree containing x. Make-set initializes a node x, find-set climbs the tree from x to its root in theta(height) time, and union climbs to the roots of the trees containing x and y and merges them in theta(height) time, setting the smaller tree’s root as a child of the larger tree’s root. This union-by-size rule keeps the height at most O(lg n). Find-set can be improved with path compression, repointing the rep pointers it sees toward the root for future calls, giving an O(m lg n) amortized run time for m operations. With both improvements, the amortized cost of each operation is almost constant.
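The forest version with both tricks is short enough to sketch. This is a minimal implementation, assuming union by size and two-pass path compression; the class and method names mirror the recitation’s operations but are my own choices:

```python
class UnionFind:
    """Forest-of-trees union-find with union by size and path compression."""
    def __init__(self):
        self.parent = {}
        self.size = {}

    def make_set(self, x):
        self.parent[x] = x     # x is its own representative
        self.size[x] = 1

    def find_set(self, x):
        root = x
        while self.parent[root] != root:    # climb to the representative
            root = self.parent[root]
        while self.parent[x] != root:       # path compression: repoint to root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:   # merge smaller tree into larger
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]

uf = UnionFind()
for v in range(6):
    uf.make_set(v)
uf.union(0, 1); uf.union(2, 3); uf.union(1, 2)
assert uf.find_set(0) == uf.find_set(3)   # 0..3 are one connected component
assert uf.find_set(4) != uf.find_set(5)   # 4 and 5 remain singletons
```

This is exactly the connected-components use case: each `union` merges two components, and `find_set` answers whether two vertices are connected.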
Lecture 6 – Randomization: Matrix Multiply, Quicksort
The next few lectures talk about randomized algorithms, looking at different ways of solving previously seen problems and analyzing them. A randomized or probabilistic algorithm is an algorithm that generates a random number, usually from a range, and makes decisions based on that value. With recursion, different levels can take different actions, and the same problem can have varying results across runs. On the same input, different executions may take a different number of steps and can even produce different (sometimes incorrect) outputs. With these algorithms it’s important to analyze and reduce the probability of getting an incorrect output. Monte Carlo algorithms are “probably correct,” while Las Vegas algorithms are “probably fast,” expected to run in polynomial time. Algorithms that are both probably correct and probably fast are classified as Atlantic City algorithms, which can occasionally be both slow and incorrect; these won’t be covered here.
Matrix Product is the first example of a Monte Carlo algorithm, where the goal is to verify whether C=A*B. The simple multiplication algorithm runs in O(n^3) multiplications. Another algorithm, Strassen, multiplies 2×2 matrices using 7 multiplications, which leads to O(n^(log2 7)) = O(n^2.81). Until 2010, Coppersmith-Winograd was the best algorithm at O(n^2.376). The downside is that each of these comes with a large constant factor requiring unrealistically large matrices to be efficient. The goal in this course is an O(n^2) algorithm where if AxB=C, then Pr[output=yes]=1, but if AxB != C, then Pr[output=yes] is at most 1/2. What’s interesting is that when they aren’t equal you may get an incorrect answer. Since the algorithm can be run multiple times, you can drive down the probability of an incorrect answer, as seeing a single “no” means no is the answer. After k runs the probability of error is 1/2^k and the overall algorithm is O(kn^2).
The specific algorithm is called Freivalds’ algorithm, and it works by choosing a random binary vector r[1…n] such that Pr[ri=1]=1/2 independently for i=1,…,n. If A(Br)=Cr, then output “yes,” else output “no.” If AB=C, then A(Br)=(AB)r=Cr, so the output is always correct in that case. If AB != C, the claim is that Pr[ABr != Cr] >= 1/2. To show this, let D=AB-C; if there’s a non-zero entry in D, then the multiplication was done incorrectly. A bad r is one that doesn’t discover the incorrect multiplication: D isn’t 0 but Dr is, leading to a false “yes.” Since D != 0, there’s an i,j pair where dij != 0. Make a vector v that is 0 in every position except vj=1; then (Dv)i = dij != 0, so Dv != 0. Now take any bad r with Dr=0 and let r’=r+v. Then Dr’=D(r+v)=Dr+Dv=0+Dv != 0, so r’ is good. The map from r to r’ is one to one: if r1+v=r2+v, then r1=r2. So with one tweak a bad r becomes a good r, and the number of r for which Dr != 0 is at least the number of r for which Dr=0. This implies Pr[Dr != 0] >= 1/2 for a randomly chosen r. In summary, running this over and over reduces the probability of missing an incorrect product to a very small amount.
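The whole check fits in a few lines of Python. This is a sketch with plain lists of lists for matrices; the `freivalds` name and the `k=20` repetition count are my own choices:

```python
import random

def freivalds(A, B, C, k=20):
    """Probabilistically check whether A*B == C for n x n matrices.
    A false "yes" survives k rounds with probability at most 1/2^k."""
    n = len(A)
    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]   # random binary vector
        # Compute A(Br) with two matrix-vector products: O(n^2) per round.
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:        # a single "no" is always correct
            return False
    return True              # "probably yes"

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]     # the true product A*B
bad = [[19, 22], [43, 51]]   # off by one entry
assert freivalds(A, B, C)
assert not freivalds(A, B, bad)
```

Each round costs O(n^2) instead of the O(n^3) of recomputing the product, which is the whole point.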
An example of a Las Vegas algorithm is quicksort. Merge sort can be impractical because of the auxiliary space it requires. Quicksort is a sorting algorithm which gives practical performance beyond its asymptotic complexity. This randomized divide and conquer algorithm, invented in 1962, runs in place without needing O(n) auxiliary space. All of the work of quicksort is in its divide step. Given an n-element array A, divide is run by picking a pivot element x in A and partitioning A into subarrays L(ess than x) and G(reater than x). Conquer recursively sorts L and G, and combining is trivial. A basic quicksort algorithm sets pivot x to A[1] or A[n] and partitions around the given x in O(n) time. In the worst case, the algorithm that chooses A[1] is O(n^2) when the array is sorted or reverse sorted, as one side can have n-1 elements while the other has 0: T(n)=T(0)+T(n-1)+theta(n)=theta(n^2). Basic quicksort looks bad but works well on random inputs in practice. Often inputs will be shuffled in theta(n) time to avoid sorted arrays.
An intelligent pivot selection algorithm will find a median to ensure balanced partitions. Using median selection, balanced L and G can be guaranteed in linear time, as shown a few lectures ago with median of medians. Using this with quicksort, T(n)=2T(n/2)+theta(n)+theta(n), where the 2T(n/2) is thanks to median-based pivoting, the first theta(n) is the median selection and the second is the divide or partition. This puts a lot of work into the divide step. For a randomized quicksort algorithm, x is chosen at random from array A and the expected time is O(n log n) for all input arrays. Lastly, there’s an algorithm called paranoid quicksort which keeps trying to get balanced partitions, checking after each attempt and trying again. Once achieved, divide and conquer will work well. To do this, repeat choosing a pivot that’s a random element of A and performing the partition, until the resulting partition is such that |L| and |G| are at most (3/4)|A|. A pivot choice is good at least half the time, so the expected number of iterations is 2, which is where the 2 comes from in the recurrence T(n)=T(n/4)+T(3n/4)+2cn. The recursion depth is at most log base 4/3 of n, giving expected O(n lg n) overall.
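The re-pick loop is easy to see in code. This is a clarity sketch rather than the lecture’s in-place version (it builds new lists, so it uses auxiliary space the real algorithm avoids); the function name is my own:

```python
import random

def paranoid_quicksort(A):
    """Paranoid quicksort sketch: re-pick a random pivot until both sides
    hold at most 3/4 of the elements (a good pivot at least half the time,
    so 2 expected tries), then recurse."""
    if len(A) <= 1:
        return A
    while True:
        x = random.choice(A)
        L = [a for a in A if a < x]
        E = [a for a in A if a == x]
        G = [a for a in A if a > x]
        if len(L) <= 3 * len(A) // 4 and len(G) <= 3 * len(A) // 4:
            break    # balanced enough: |L|, |G| <= (3/4)|A|
    return paranoid_quicksort(L) + E + paranoid_quicksort(G)

data = [5, 3, 8, 1, 9, 2, 7]
assert paranoid_quicksort(data) == sorted(data)
```

Since each attempt succeeds with probability at least 1/2, the while loop runs 2 times in expectation, matching the 2cn term in the recurrence.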
Lecture 7 – Randomization: Skip Lists
A skip list is a randomized data structure that requires a probabilistic analysis. The notion of “with high probability” is used instead of a guaranteed bound for operations. William Pugh invented the data structure in 1989 and it’s relatively easy to implement, especially compared to balanced trees. The skip list maintains a dynamic set of n keyed elements in O(log n) time per operation. This is the first randomized data structure in the algorithms courses. It’s relatively easy to show the expected time for a search is O(log n), but the second half of the lecture shows the “with high probability” result, which is a stronger analysis.
The example starts with an unsorted doubly-linked list with a pointer to the first element and the goal of being able to search for an element. With just the linked list, searching will take O(n) time. Even if the linked list is sorted, it’s still O(n). A skip list is then shown, where a second, smaller linked list contains only some of the values in the first, with the analogy being that an express train line may only stop at some stations while all stations are on the longer line. To go far, you ride the express list and then transfer back to the same position in the regular list. Values in the express list are also linked to the regular list to allow for transfers to values not in the express list.
An algorithm for search(x) with 2 linked lists is to first walk right in the top linked list (smaller, L1) until going right again would be too far, then walk down to the bottom list (larger, L0) and continue going right until x is found. Analyzing this, the search traverses the whole top list in the worst case but only a fraction of the bottom if spacings are roughly even, for a cost of about |L1| + |L0|/|L1|. To minimize this cost, the terms should be equal: |L1|^2 = |L0| = n, so |L1| = sqrt(n). With two sorted lists the cost for a search is O(2*sqrt(n)). With 3 sorted lists, the cost is 3*n^(1/3), and for k lists, k*n^(1/k). Based on this, lg n sorted lists would give O(2 lg n). The current problem is that this analysis is for a static set that doesn’t change, when the goal is a dynamic data structure.
Another rule of skip lists is that if a value is present on a higher level, it’s also present on all lower levels, which means a value may need to be inserted into or deleted from multiple lists. For insert(x), you first search(x) to see where it fits in the bottom list. You always insert into the bottom list but then need to decide which other lists to insert into. Coins are flipped to determine how far to promote an element: if heads, promote x to the next level up and repeat until tails, where you stop. A run of heads can lead to a new level being created. Delete(x) searches for and deletes the value from all levels.
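Search and coin-flip insertion can be sketched compactly. This is a minimal version assuming a sentinel head node and no delete (omitted for brevity); the class and helper names are my own:

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level    # next[i] is the successor at level i

class SkipList:
    """Minimal skip list: one coin flip per level decides how far to promote."""
    def __init__(self):
        self.head = Node(float("-inf"), 1)   # sentinel present on every level

    def _predecessors(self, key):
        """The rightmost node with key < search key, at every level."""
        update = [None] * len(self.head.next)
        v = self.head
        for i in reversed(range(len(self.head.next))):
            while v.next[i] is not None and v.next[i].key < key:
                v = v.next[i]     # walk right until going right is too far
            update[i] = v         # then drop down a level
        return update

    def search(self, key):
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key

    def insert(self, key):
        level = 1
        while random.random() < 0.5:          # heads: promote one level up
            level += 1
        while len(self.head.next) < level:    # heads may create new levels
            self.head.next.append(None)
        update = self._predecessors(key)
        node = Node(key, level)
        for i in range(level):                # splice in on every promoted level
            node.next[i] = update[i].next[i]
            update[i].next[i] = node

sl = SkipList()
for k in [3, 1, 4, 5, 9, 2, 6]:
    sl.insert(k)
assert sl.search(5)
assert not sl.search(7)
```

Note how `_predecessors` is exactly the “walk right, then down” search and also gives insert the splice points on every level.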
The number of levels in an n-element skip list is O(lg n) with high probability, meaning with probability 1 - 1/n^alpha, where the constant c that lg n is multiplied by is related to alpha; this is related to Chernoff bounds (rings a bell). This is proved since the failure probability is Pr(more than c lg n levels) = Pr(some element gets promoted more than c lg n times). Using a union bound, this is at most n*Pr(element x gets promoted more than c lg n times) = n*(1/2)^(c lg n) = n/n^c = 1/n^(c-1), so alpha is c-1.
A theorem proved is that any search in an n-element skip list costs O(lg n) with high probability. To show this, the professor analyzes the search backwards. The backward search starts (ends) at the node in the bottom list. At each visited node, if the node wasn’t promoted higher then a tails was flipped and we go (came from) left. If the node was promoted higher, then heads was flipped and we go (came from) up. This process stops once the top level is reached. Proving this, the backward search makes up and left moves each with probability ½ (based on heads or tails). The number of moves going up is less than the number of levels with high probability. The total number of moves is the number of moves until you get c lg n up (heads) moves which is O(lg n) with high probability.
The Chernoff bound says that if Y is a random variable representing the total number of heads in a series of m independent coin flips, where each flip has a probability p of coming up heads, then for all r greater than 0, Pr[Y >= E[Y]+r] <= e^(-2r^2/m). This tells you how the probability falls off as you get further from the expected value. Chernoff bounds are a good tool for showing “with high probability” results. For any c, there is a constant d (invoking the Chernoff bound) such that with high probability, the number of heads in flipping d lg n fair coins is at least c lg n. Lastly, we need both events (at most c lg n levels and enough heads among the d lg n flips) to hold simultaneously, which happens with high probability.
Lecture 8 – Randomization: Universal and Perfect Hashing
This lecture shows a randomized implementation of hashing with constant expected time per operation. The hash table is the data structure while the dictionary is the abstract data type (ADT). The goal of this dictionary problem is to maintain a set of items, each with a key, supporting insert, delete and search. As opposed to predecessor or successor searches, here we just want to know if there’s an item with a specified key. This is an “exact search,” and if the key doesn’t exist you learn nothing else about the structure. This is what a Python dictionary implements. For insertion, the key is assumed not to be in the table for this example, though normally it would overwrite. An AVL tree could solve this in log n time, but this example will do better.
The simplest way to do this previously was with “hashing with chaining” given expected constant time per operation using linear space. Variables were u(niverse size), n(umber of keys currently in the data structure), and m (slots in table). Each spot in the array stores a linked list of items mapped to that slot. With this, we achieve theta(1+alpha) where alpha is the load factor, n/m. This was proved in Intro to Algorithms by assuming simple uniform hashing, which is an assumption that is basically cheating to make things simple. This assumption was that the probability of two distinct keys mapping to the same slot to be 1/m and requires assuming that keys are random leading to an average case analysis. This lecture shows how to achieve constant time even in worst case keys.
Hash is an English word from the 1650’s meaning to cut into small pieces, typically in the context of food. The word comes from the French hacher, meaning to chop up, and is the base of the word hatchet, which comes from hache, “axe” in Old French. Two ways of achieving strong constant time bounds are universal hashing, guaranteeing few conflicts, and perfect hashing, guaranteeing no conflicts on static sets (without inserts and deletes). Universal hashing is a good technique for getting constant time without assumptions about the input. This works by choosing a h(ash function) randomly from a set H of hash functions, where H is required to be a universal hash family: the probability that two distinct keys hash to the same value is small (at most 1/m). The first theorem this lecture states is that for n arbitrary distinct keys and a random h in H where H is universal, the expected (over the choice of h) number of keys colliding in a slot is at most 1 plus alpha. To prove this, consider keys k1 to kn and compute the expectation. Let the indicator random variable I_ij = 1 if h(ki) = h(kj), 0 otherwise. E[number of keys colliding with ki] = E[I_ii + sum over j != i of I_ij] = 1 + sum over j != i of E[I_ij] = 1 + sum over j != i of Pr[I_ij = 1], which is at most 1 + sum over j != i of 1/m, or 1 + n/m.
The next question is how to design a universal hash family. An example of a bad universal hash family is H = {all hash functions h: {0,1,…,u-1} -> {0,1,…,m-1}}. The problem is that its time and space are theta(u): to stay consistent, all of a function’s hash values need to be kept track of, and you can’t use a hash table to store a hash function, since that would be infinite recursion. The challenge is finding an efficient hash family not requiring much space to store or time to compute. A better hash family is the dot-product hash family, where the table size m is assumed to be prime (when it doubles, a nearby prime is found) and u = m^r for an integer r. A key k is viewed in base m, so k is a vector of subkeys <k0, k1, …, k(r-1)> where 0 <= ki < m. For a randomly chosen key a = <a0, …, a(r-1)>, define h_a(k) = (a·k) mod m = (sum from i=0 to r-1 of ai*ki) mod m. Choosing h_a from H and evaluating it should take constant time: in general we’re in the Word RAM model, which assumes we’re manipulating integers that fit in a word, that manipulating a constant number of words takes constant time, and that data values (keys) fit in a word. Another example has h_ab(k) = [(a*k+b) mod p] mod m, where H = {h_ab | a,b in {0,…,u-1}} and p is a prime number bigger than m.
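The second family is easy to play with empirically. This is a sketch assuming a small illustrative table size and a fixed large prime; the names `make_hash`, `p`, and `m` are my own, and the tolerance in the check is loose on purpose:

```python
import random

p = 2_147_483_647   # a prime comfortably larger than the keys we hash
m = 8               # table size for illustration

def make_hash():
    """Draw one h_ab(k) = ((a*k + b) mod p) mod m from the family."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

# Universality says two fixed distinct keys collide with probability about
# 1/m over the random choice of hash function; estimate that by sampling.
trials = 20_000
collisions = 0
for _ in range(trials):
    h = make_hash()
    if h(42) == h(1337):
        collisions += 1
assert abs(collisions / trials - 1 / m) < 0.02
```

The point of the experiment: the randomness is over which function you picked, not over the keys, which is exactly what removes the simple-uniform-hashing “cheat.”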
The next theorem stated and proved is that the dot-product H is universal. Take distinct keys k and k’ which differ in digit d, so kd != k’d. Computing the collision probability over the choice of a:

Pr over a {h_a(k) = h_a(k’)}
= Pr over a {sum from i=0 to r-1 of ai*ki = sum from i=0 to r-1 of ai*k’i (mod m)}
= Pr {ad*(kd - k’d) + sum over i != d of ai*(ki - k’i) = 0 (mod m)}
= Pr {ad = -(kd - k’d)^(-1) * sum over i != d of ai*(ki - k’i) (mod m)}
= E over the ai with i != d of [Pr over ad {ad = f(…) mod m}] = 1/m.

Since m is prime and kd != k’d, the inverse (kd - k’d)^(-1) exists mod m, and the chance of a uniformly random integer hitting any fixed value mod m is 1/m. This is universality. Here f(…) is f(k, k’, a0, …, a(r-1)) and doesn’t depend on the choice of ad, which is why ad was separated from the rest of the sum.
The next problem shown is the static dictionary problem: given n fixed keys, build a table to search with perfect (or FKS) hashing, achieving constant time for search and linear space in the worst case. Building this data structure takes polynomial time with high probability. Performance is great, though it doesn’t support inserts or deletes as it’s static. The big idea for perfect hashing is to use two levels. The universe is mapped with h1 into a table of size m = theta(n). Instead of storing colliding elements in a slot with a linked list, each slot maps to another hash table, and the second level of hashing guarantees there are no collisions. First, h1 is picked from a universal hash family; the first level is basically hashing with chaining. For each slot j in {0,…,m-1}, let lj be the number of keys (among the n) hashing to slot j. If the sum from j=0 to m-1 of lj^2 > cn, pick a new h1 to ensure linear space. Then for each slot j, pick h2,j mapping the same universe to {0,…,lj^2 - 1}. This sizing is based on the birthday paradox: with n people and n^2 birthdays, the probability of a collision is at most 1/2. There should be zero collisions in each of the subtables, so while h2,j(ki) = h2,j(ki’) for some ki != ki’ with h1(ki) = h1(ki’) = j, rechoose h2,j. After this while loop, there will be no collisions (2 different keys mapping to the same value). How many times do these steps need to be redone to get a perfect hash table?
The probability of a collision at the second level is at most the sum over pairs of their collision probabilities over the choice of the hash function, which comes out to at most a half. Similar to last lecture, this means 2 expected trials and O(lg n) trials with high probability. This is done for each secondary table, and there are a linear number of tables; lj is O(lg n) for all j with high probability, so the total build time for repicking second-level hash functions is O(n lg^2 n). For rechoosing the first hash function to get linear space, count how many pairs of items map to the same slot: each pair collides with probability at most 1/m, so the expected space is linear. Last class showed high probability bounds around logs, which required a Chernoff bound; showing bounds linear in n can be done with a Markov inequality, claiming Pr over h1 {sum from j=0 to m-1 of lj^2 >= cn} is at most E[sum of lj^2]/cn, which is O(n)/cn and at most 1/2. The expected number of trials is at most 2 and there are O(lg n) trials with high probability. Linear time is spent per trial, so the total work for this step is O(n lg n). Overall, this is n poly-log, i.e., polynomial time, to build the structure.
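Both re-pick loops can be sketched end to end. This is a minimal FKS construction assuming distinct integer keys; the `make_fks`/`rand_hash` names, the fixed prime, and the constant c = 4 are my own illustrative choices:

```python
import random

def make_fks(keys):
    """Two-level (FKS) perfect hashing sketch for a static set of
    distinct integer keys. Returns a membership-test function."""
    p = 2_147_483_647           # a prime larger than any key
    n = len(keys)

    def rand_hash(m):
        a, b = random.randrange(1, p), random.randrange(0, p)
        return lambda k: ((a * k + b) % p) % m

    # Level one: re-pick h1 until sum of l_j^2 is linear (Markov: ~2 tries).
    while True:
        h1 = rand_hash(n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break

    # Level two: a slot with l_j keys gets a table of size l_j^2, re-picking
    # its hash function until there are zero collisions (birthday bound).
    tables = []
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:
            h2 = rand_hash(size)
            slots = [None] * size
            ok = True
            for k in b:
                if slots[h2(k)] is not None:
                    ok = False          # collision: rechoose h2 for this slot
                    break
                slots[h2(k)] = k
            if ok:
                break
        tables.append((h2, slots))

    def contains(k):                    # worst-case constant time: two hashes
        h2, slots = tables[h1(k)]
        return slots[h2(k)] == k
    return contains

keys = [10, 23, 54, 77, 99, 101, 500]
contains = make_fks(keys)
assert all(contains(k) for k in keys)
assert not contains(42)
```

Each while loop succeeds with probability at least 1/2 per attempt, so the build is fast in expectation while every search is worst-case constant.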
Recitation 4 – Randomized Select and Randomized Quicksort
Randomized-select finds the kth order statistic of an arbitrary array by partitioning array A with randomized-partition and recursing on one side.
Randomized-select(A, p, r, i):
    if p == r:
        return A[p]
    q = randomized-partition(A, p, r)
    k = q - p + 1
    if i <= k: return randomized-select(A, p, q, i)
    else: return randomized-select(A, q+1, r, i-k)

Randomized-partition(A, p, r):
    i = random(p, r)
    exchange A[p] and A[i]
    return partition(A, p, r)
Even when the recursion lands on the larger subarray, the expected run time is linear. Randomized-quicksort works by partitioning array A and recursively sorting each partition. T(n) will be theta(n lg n) in expectation.
Randomized-Quicksort(A, p, r):
    if p < r:
        q = randomized-partition(A, p, r)
        randomized-quicksort(A, p, q-1)
        randomized-quicksort(A, q+1, r)
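The recitation doesn’t spell out the partition subroutine, so here is one runnable interpretation of randomized-select. It uses a Lomuto-style partition that leaves the pivot inside A[p..q], which is what makes the `i <= k` recursion on (p, q) above work; the details of the partition are my assumption:

```python
import random

def randomized_partition(A, p, r):
    """Swap a random element to the front, then partition around it so
    A[p..q-1] < pivot, A[q] == pivot, A[q+1..r] >= pivot; return q."""
    i = random.randint(p, r)
    A[p], A[i] = A[i], A[p]
    pivot = A[p]
    q = p
    for j in range(p + 1, r + 1):
        if A[j] < pivot:
            q += 1
            A[q], A[j] = A[j], A[q]
    A[p], A[q] = A[q], A[p]
    return q

def randomized_select(A, p, r, i):
    """Return the i-th smallest element (1-indexed) of A[p..r]."""
    if p == r:
        return A[p]
    q = randomized_partition(A, p, r)
    k = q - p + 1                 # the pivot's rank within A[p..r]
    if i <= k:
        return randomized_select(A, p, q, i)
    return randomized_select(A, q + 1, r, i - k)

data = [9, 1, 8, 2, 7, 3, 6]
assert randomized_select(data[:], 0, len(data) - 1, 3) == 3
```

Swapping the two recursive calls of quicksort in for the single recursive call here gives randomized-quicksort with the same partition routine.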
Lecture 9 – Augmentation: Range Trees
Data structure augmentation is the idea of taking an existing data structure and adding features or storing additional information, for example to support improved searches and updates. The first example is called “easy tree augmentation,” which includes subtree size as a special case. You start with a tree and store at each node x a function f() of its subtree: the goal is to store f(subtree rooted at x) at each node x in a field, x.f. Suppose x.f can be computed in constant time from x and its children’s f values. Then if a set of nodes S is modified, it costs O(number of ancestors of S) to update f fields, as all nodes from the modified nodes up to the root need to be updated. This requires O(lg n) updates in AVL or 2-3 trees since it will typically be a direct path to the root from the node.
In 6.006 we looked at order-statistic trees, an abstract data type or interface for a data structure that could do insert, delete and successor searches, along with rank(x), which gives the index of key x in sorted order. The inverse of rank(x) is select(i), which finds the key with rank i. All of these should be doable in O(lg n) per operation. Using easy tree augmentation with f(subtree) equal to the number of nodes in the subtree, we can achieve this: x.f = 1 + sum(c.f for c in x.children), and as long as the number of children is constant, this takes constant time. Using this, rank and select will be log n time.
rank(x):
    rank = x.left.size + 1
    walk up from x to the root
    whenever the path goes up-left to a parent x' (i.e., the node is a right child of x'):
        rank += x'.left.size + 1
This walks up a path from a node x to a root, this takes log n time using AVL trees. Similarly,
select(i):
    x = root
    repeat until return:
        rank = x.left.size + 1
        if i == rank: return x
        if i < rank: x = x.left
        if i > rank:
            x = x.right
            i = i - rank
Augmenting with the rank of the node would make select a regular search but could be bad for updates such as if a new minimum was added and all other nodes needed to be updated.
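The subtree-size augmentation above can be sketched concretely. This uses a plain unbalanced BST for brevity (the lecture assumes a balanced tree like AVL or 2-3, which this sketch skips); names are my own:

```python
class OSNode:
    """BST node augmented with subtree size: x.size is recomputed in O(1)
    from the children on the way back up after an insert."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1

def size(x):
    return x.size if x else 0

def insert(root, key):
    if root is None:
        return OSNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    root.size = 1 + size(root.left) + size(root.right)  # update f on the path up
    return root

def select(x, i):
    """Return the key with rank i (1-indexed) in x's subtree."""
    r = size(x.left) + 1
    if i == r:
        return x.key
    if i < r:
        return select(x.left, i)
    return select(x.right, i - r)

def rank(x, key):
    """Return the rank of key in x's subtree (assumes key is present)."""
    r = size(x.left) + 1
    if key == x.key:
        return r
    if key < x.key:
        return rank(x.left, key)
    return r + rank(x.right, key)

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
assert select(root, 4) == 50   # 4th smallest of 20,30,40,50,60,70,80
assert rank(root, 60) == 5
```

Only the sizes on the insertion path change, which is the O(number of ancestors) update cost easy tree augmentation promises.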
Some other examples of using augmentation for more sophisticated operations are shown. The first of these is level-linked 2-3 trees. Level linking is the idea of adding links between nodes on each level, so each node points to its children and to its neighbors on the same level. Can we do this and what’s it good for? Additional pointers will need to be changed each time nodes are added or split, but these operations are still easy to do in constant time: splitting or merging only changes things by a constant factor where constant time is already taken. The goal with this data structure is to achieve the “finger search property”: when you’ve found a key y in a search and want to find a nearby x, this will be fast. Search(x from y) gives a search a starting point y. This should take constant time if x and y are neighbors and log n in the worst case.
The goal is for this search to take O(log |rank(x)-rank(y)|) time. To achieve this, data should be stored in the tree’s leaves. A drawn example shows a 2-3 tree where nodes with children don’t have values and only leaves have values. With level linking all of these leaf nodes with values are linked to nodes next to it. To help with search, use data structure augmentation to store the minimum and max values of a node’s subtrees at each node (not just leaves). This is easy to store as you can compare a new node to min and max values for each node and works in constant time. To search for a key you can look at all children and their min and max to see what interval a node falls in given that the tree’s values are sorted from left to right. Search takes O(lg n) with minimums and maximums stored. This is regular search, not “finger search.” To insert at a leaf node, you may need to have additional splits propagated up to the root. For the finger search operation,
search(x from y):
    v = leaf containing y
    repeat until return:
        if v.min <= x <= v.max:
            return regular (downward) search in v's subtree
        elif x < v.min: v = v.level-left
        elif x > v.max: v = v.level-right
        v = v.parent
The total cost will be O(lg d) where d is the number of elements between x and y. Finger searching is good for many consecutive searches where values are close to each other. The next example (covered further in the Advanced Data Structures course, 6.851) is range trees, which solve the problem of orthogonal range search. The example shows a 2D point set and a 3D point set. A query is an axis-aligned box (x min/max and y min/max, or a 3D box in 3D) and asks about the points inside it. The goal of this problem is to preprocess n points in d dimensions to support queries where, given a box, you find the number of points in the box, or k points, or all points in the box (taking O(k) to list points). Generally, this should take O(lg^d n + size of the output).
The first example is shown in 1D, where you have a line with points on it and are given an interval as a query. The points can be represented in a sorted array or binary search tree. A level-linked tree (O(lg n + k) total) or sorted array can go from point to point in constant time, while a BST would achieve O(lg n + k lg n) (where k is the output size). The example is a 1D range tree that uses a perfectly balanced binary search tree.
range-query([a,b]):
    search(a) - walk up from a and when you get a right parent, look at it and its right subtrees
    search(b) - same, looking at left parents
    trim common prefix - stop when the searches converge
    returns O(lg n) nodes and rooted subtrees - predecessor and successor if not found
This is an implicit representation of the answer which can be improved further. For a 2D range tree, a 1D range tree is stored on all points by x. A search checks whether each of the O(lg n) nodes matching in x also matches in y, so for each node v in the x-tree, store a 1D range tree by y on all points in v’s rooted subtree. There’s a lot of data duplication in this process, though it’s polynomial and preprocessed, and it allows for this search. With log n subtrees for the first and second sets of trees, a search takes O(lg^2 n), and each key is in O(lg n) rooted subtrees for O(n lg^(d-1) n) total space.
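Going back to the 1D case, the sorted-array version of range reporting is a few lines with binary search. This is a sketch using Python's bisect module; the function name is my own:

```python
import bisect

def range_query(points, a, b):
    """1D range reporting on a sorted array: O(lg n) binary searches find
    the boundaries of [a, b], then listing the k points inside is O(k)."""
    lo = bisect.bisect_left(points, a)    # first index with points[lo] >= a
    hi = bisect.bisect_right(points, b)   # first index with points[hi] > b
    return points[lo:hi]

points = [1, 4, 6, 8, 12, 15, 20]
assert range_query(points, 5, 15) == [6, 8, 12, 15]
assert range_query(points, 16, 19) == []
```

The two bisect calls are the 1D analogue of the search(a)/search(b) walks in the tree version; the slice plays the role of reporting the rooted subtrees.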
Lecture 10 – Dynamic Programming: Advanced DP
This lecture looks at three examples of dynamic programming. Reviewing dynamic programming, the idea is to characterize the structure of the optimal solution, then recursively define the value of an optimal solution based on optimal solutions of subproblems. With that you can compute the optimal solution via recursive memoization to get an efficient algorithm, avoiding repeating subproblems. You can also do an iterative bottom up version of dynamic programming instead of recursion and memoization.
The first problem shown is to find the longest palindromic subsequence. A palindrome is a string that is the same as its reverse. For this problem, given a string x[1..n] of length at least one, find the longest palindrome that’s a subsequence of it. A subsequence can skip letters: for instance, “underqualified” has “deified” and “turbo ventilator” has “rotator.” To solve this problem, the strategy is to find L(i,j), the length of the longest palindromic subsequence of x[i…j] where i is at most j. Define a recursive algorithm to compute L(i,j) as
def L(i, j):
    look up L[i,j] in the cache of subproblem results and return it if already computed, else:
    if i == j: return 1
    if x[i] == x[j]:
        if i+1 == j: return 2
        else: return 2 + L(i+1, j-1)
    else: return max(L(i+1, j), L(i, j-1))
The number of subproblems times the time to solve each subproblem (given that smaller ones are solved and lookup is constant) gives theta(n^2).
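The recursion above runs as written once memoization is added. This is a minimal sketch using `functools.lru_cache` as the subproblem cache and 0-indexed strings:

```python
from functools import lru_cache

def longest_palindromic_subsequence(x):
    """Memoized L(i, j) = length of the longest palindromic subsequence
    of x[i..j]; theta(n^2) subproblems, constant time each."""
    @lru_cache(maxsize=None)
    def L(i, j):
        if i == j:
            return 1
        if x[i] == x[j]:
            if i + 1 == j:
                return 2
            return 2 + L(i + 1, j - 1)   # matching ends wrap a smaller problem
        return max(L(i + 1, j), L(i, j - 1))
    return L(0, len(x) - 1)

# The lecture's examples: "deified" and "rotator" are both length 7.
assert longest_palindromic_subsequence("underqualified") == 7
assert longest_palindromic_subsequence("turboventilator") == 7
```

The decorator is doing exactly what the “look up L[i,j] and don’t recurse if already computed” line describes.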
The second problem is optimal binary search trees, with keys k1 through kn where each key is smaller than the one after it and ki = i (to make life easier). Weights w1 through wn are associated with the keys and can be thought of as search probabilities. The goal is to find a binary search tree T that minimizes the sum from i=1 to n of wi*(depth_T(ki) + 1). The depth of the root is zero and the depth one below the root is 1, so the goal is for the high-weight nodes to have low depth to minimize this sum. A concrete example would be a dictionary between languages where some words are more commonly searched, so a word’s search probability tracks how common it is. If weights are search probabilities then the cost function minimizes the expected search cost.
There are exponentially many trees in n. If n is 2, there are two possible trees, since one key is the root and the other its child. If 1 is the root and 2 its child, the tree’s cost is w1 + 2w2; if 2 is the root and 1 its child, the cost is 2w1 + w2. If n is 3, there are 5 possible binary search trees. Computing all trees and taking the minimum would take exponential time, but we can do it in polynomial time. Despite being in a dynamic programming section of the course, a greedy solution is looked at first: pick the root kr greedily, then do the same for every subproblem, treating the left and right subtrees as their own subproblems. e(i,j) is the cost of the optimal BST on ki…kj, so e(1,n) is the cost of the original problem. In an example where greedy root 2 (w=10) has children 1 (w=1) and 4 (w=9), and 4 has a child 3 (w=8), the cost is 54, even though the optimum tree has 3 as root with 2 and 4 as children and 1 as 2’s child, costing 49. Because of the way weights are set up, the highest-weight nodes should occupy the minimum-depth spots. The greedy algorithm isn’t sufficient and dynamic programming is needed: guess all possible roots, and similarly for every subproblem. Subproblems all start one level below the current root, so every key in a subproblem picks up an extra level of depth. e(i,j) = wi if i = j, otherwise the min over i <= r <= j of e(i,r-1) + e(r+1,j) + w(i,j), where w(i,j) is the sum of all the weights from i to j.
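The recurrence translates directly into a memoized function. This is a sketch with 0-indexed weights and a recomputed `w(i, j)` for simplicity (a real version would precompute prefix sums); the function name is my own:

```python
from functools import lru_cache

def optimal_bst_cost(w):
    """e(i, j) for keys k_i..k_j with weights w: guess every root r,
    recurse on the left and right ranges, and add w(i, j), since every
    key in the range drops one level below the chosen root."""
    @lru_cache(maxsize=None)
    def e(i, j):
        if i > j:
            return 0
        total = sum(w[i:j + 1])   # w(i, j), the weight of the whole range
        return min(e(i, r - 1) + e(r + 1, j) + total for r in range(i, j + 1))
    return e(0, len(w) - 1)

# The example above: weights 1, 10, 8, 9 for keys 1..4.
# Greedy's tree costs 54; the DP finds the optimum of 49.
assert optimal_bst_cost((1, 10, 8, 9)) == 49
```

Note e(i, i) falls out of the same formula as w_i, matching the base case in the text.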
The last example is an alternating coins game where there's a row of n coins of values v1…vn, with n even. Each player takes either the first or last coin in the row on their turn, removing it and gaining its value. The player who goes first can't lose: they can compute the total of the odd positions and of the even positions, and guarantee collecting all of whichever is higher. What if you want to maximize value (assuming you move first) instead of just winning? v(i,j) is the maximum value we can guarantee if it is our turn and only coins vi through vj remain. v(i,i) means there's only one coin left and coin i is picked (this case occurs after the opponent's move). With two coins left, v(i,i+1), you pick the max of the two. After we pick, the opponent moves to maximize their own total, leaving us the worse of the two remaining subproblems: if we take vi, we are guaranteed min{v(i+1,j-1), v(i+2,j)}, since the opponent will leave us whichever is worse for us. This gives v(i,j)=max{min{v(i+1,j-1), v(i+2,j)}+vi, min{v(i,j-2), v(i+1,j-1)}+vj}. The complexity is the number of subproblems times the time to solve each; here there are n² constant-time subproblems, so this is Θ(n²).
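A minimal sketch of that recurrence in Python (the function name and the test row of coins are my own illustration, not from the lecture):

```python
def max_guaranteed_value(v):
    """V[i][j] = max total the player to move can guarantee
    when only coins v[i..j] remain."""
    n = len(v)
    V = [[0] * n for _ in range(n)]
    for i in range(n):
        V[i][i] = v[i]  # one coin left: take it
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            # take v[i]; the opponent then leaves us the worse subproblem
            take_left = v[i] + min(
                V[i + 1][j - 1] if i + 1 <= j - 1 else 0,
                V[i + 2][j] if i + 2 <= j else 0,
            )
            # take v[j]; symmetric case
            take_right = v[j] + min(
                V[i][j - 2] if i <= j - 2 else 0,
                V[i + 1][j - 1] if i + 1 <= j - 1 else 0,
            )
            V[i][j] = max(take_left, take_right)
    return V[0][n - 1]
```

For coins [8, 15, 3, 7] the first player can guarantee 22 by taking the 7 first, which the table reproduces.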
Lecture 11 – Dynamic Programming: All-Pairs Shortest Paths
The next few lectures go back to algorithms on graphs. Introduction to Algorithms covered single-source shortest paths, while this lecture goes into all-pairs shortest paths, finding shortest paths between every pair of vertices. Scenarios for shortest path problems are unweighted, nonnegative, general and acyclic graphs. BFS solves single-source shortest paths in unweighted graphs in O(V+E), or linear time. Dijkstra handles nonnegative edge weights in O(V lg V + E). For general graphs, Bellman-Ford solves the problem in O(VE). Lastly, topological sort combined with Bellman-Ford-style relaxation solves acyclic graphs (DAGs) in O(V+E). For all-pairs shortest paths, we can do better than just repeating these. For this problem, given G=(V,E,w), find δ(u,v) for all u, v in V.
All-pairs shortest paths are useful for preprocessing, such as precomputing shortest paths between any two points on a map; routing in computer networks is similar. An obvious solution is to run a single-source algorithm from each source. For an unweighted graph with BFS, this gives O(VE). With Dijkstra it's O(V² lg V + VE), and in the general case running Bellman-Ford from every source is O(V²E), which can be as slow as O(V⁴). In the general case, Johnson's algorithm achieves O(V² lg V + VE), meaning even with negative edge weights you can match Dijkstra's runtime. For dense graphs, this means O(V³) can be achieved in all situations.
Dynamic programming is a natural approach to shortest paths; most dynamic programs can be converted to single-source shortest paths. The five steps of dynamic programming come back from the previous class: first define the subproblems, second determine what to guess, third write a recurrence, fourth make sure the subproblem graph is acyclic (give a topological ordering), and fifth solve the original problem. The subproblem here is duv(m), the weight of the shortest path from u to v using at most m edges. The original problem is duv(n-1), and duv(n-1)=duv(n)=… as long as there are no negative weight cycles, which is exactly when no diagonal entry dvv(n-1) is negative. To compute it, guess the last edge (x,v); the recurrence is duv(m)=min over x in V of dux(m-1)+w(x,v). In the base case, duv(0) is infinity if u ≠ v and 0 if u = v. To get an acyclic ordering, loop through m=0,1,…,|V|-1 with a nested loop over u, v in V.
for m in range(1, n):
    for u in V:
        for v in V:
            for x in V:
                if duv > dux + dxv:
                    duv = dux + dxv
The last if statement is the relaxation step: it updates newly found shortest paths and is about fixing violations of the triangle inequality. Relaxation is used in Dijkstra, Bellman-Ford and other shortest path algorithms. The problem is that the algorithm above is O(V⁴), which doesn't improve anything.
Shortest paths are closely linked to matrix multiplication: given n-by-n matrices A and B, C=AB has cij = sum over k=1 to n of aik·bkj. Replacing multiplication with addition and addition with min converts matrix multiplication into shortest paths and gives exactly the same algorithm. Repeated squaring can then be used to improve the time: W·W gives W², W²·W² gives W⁴, doubling each time, for O(V³ lg V), which is closer. Switching the operations this way only yields a "semiring," which leaves no way to get a minus sign. A related problem is transitive closure, where you want to know whether there's a path from i to j, returning 1 if yes and 0 if no. Here converting multiplication to AND and addition to OR compares to shortest paths but behaves like a whole ring allowing negations, which lets you solve the problem in O(n^2.3728…); that's better for dense graphs, though we can do better for sparse graphs.
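A quick sketch of the operator swap: "multiplying" weight matrices over the (min, +) semiring and repeatedly squaring gives all-pairs distances (function names and the small test graph are my own; no negative cycles are assumed):

```python
INF = float('inf')

def min_plus(A, B):
    """'Multiply' A and B over the (min, +) semiring:
    C[i][j] = min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def apsp_by_squaring(W):
    """All-pairs shortest paths by repeated squaring of the weight
    matrix: W^1, W^2, W^4, ... until paths of >= n-1 edges are covered."""
    n = len(W)
    D, m = W, 1
    while m < n - 1:
        D = min_plus(D, D)  # D^(2m) from D^m
        m *= 2
    return D
```

Each squaring is O(V³) and there are lg V of them, matching the O(V³ lg V) bound above.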
The next goal is O(V³) via the Floyd-Warshall algorithm. The key is the subproblem cuv(k), the weight of the shortest path from u to v whose intermediate vertices all come from {1,2,…,k}. u and v themselves may be larger than k, but every vertex strictly inside the path is in 1 through k, and this method performs better. Guessing is easier: you just guess whether vertex k is in the path. The recurrence is cuv(k) = min{cuv(k-1), cuk(k-1)+ckv(k-1)}. The number of subproblems is the same, but the guessing and recurrence work is reduced, giving O(V³). The base case is cuv(0)=w(u,v):
cuv = w(u,v) for all u, v
for k = 1, 2, …, n:
    for u in V:
        for v in V:
            if cuv > cuk + ckv:
                cuv = cuk + ckv
This is a simple algorithm and the best solution for all-pairs shortest paths in dense graphs. For sparse graphs, we can get closer to Dijkstra by using Johnson's algorithm. First, find a function h mapping V(ertices) to R(eal numbers) such that wh(u,v)=w(u,v)+h(u)-h(v)>=0 for all u,v in V. Second, run Dijkstra's algorithm on (V,E,wh) to get the reweighted shortest paths δh(u,v). Third, claim δ(u,v)=δh(u,v)-h(u)+h(v). To prove the claim, look at any path p from u to v with u=v0, v1,…,vk=v, and compute wh(p) = sum i=1 to k of wh(vi-1, vi) = w(p)+h(v0)-h(vk) = w(p)+h(u)-h(v). Every u-to-v path shifts by the same amount, so shortest paths are preserved and the claim holds.
The main difficulty in Johnson's algorithm is finding a function h that maps values successfully, and that problem is itself very close to shortest paths; we end up using shortest paths to solve shortest paths. To find h, we need w(u,v)+h(u)-h(v)>=0, i.e., h(v)-h(u)<=w(u,v) for all u,v. The w's are given and the h's are unknown. This is a "system of difference constraints," a special case of linear programming. If the graph (V,E,w) has a negative weight cycle, there is no solution to the difference constraints; if it has no negative weight cycles, there is a solution. "Crazy but correct, that's Johnson." The whole goal of shortest paths is to find a point where the triangle inequality isn't satisfied and fix it, and since this problem reduces to the triangle inequality, Bellman-Ford can either find h or report a negative weight cycle. In O(VE) time, we've reduced the problem to the nonnegative case and can run Johnson's algorithm in O(V² lg V + VE) overall.
Recitation 5 – Dynamic Programming (more examples)
This recitation shows more examples of dynamic programming. Rectangular blocks is a problem where, given a set of n rectangular 3D blocks, each with a length, width and height, you find the height of the tallest possible tower built from a subset of the blocks. Blocks can't be rotated (length is east-west, width is north-south, height is up-down) and the length and width of a block on top can't be larger than those of the one below. Sorting blocks in non-increasing order of length and then width, H[i] is the height of the tallest tower with block Bi at the top and equals hi + max over all legal supporting blocks j of H[j], computed recursively. This gives O(n) subproblems, each requiring linear time, for a total running time of O(n²).
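A sketch of that DP in Python (function name, tuple representation and test blocks are my own; "fits" is taken to mean length and width at most those of the block below):

```python
def tallest_tower(blocks):
    """blocks: list of (length, width, height) tuples, no rotations.
    Returns the height of the tallest legal tower."""
    # sort by non-increasing length, then width, so any legal
    # supporting block appears before the block it supports
    blocks = sorted(blocks, key=lambda b: (-b[0], -b[1]))
    n = len(blocks)
    H = [0] * n  # H[i] = tallest tower with block i on top
    for i in range(n):
        li, wi, hi = blocks[i]
        best_below = 0
        for j in range(i):
            lj, wj, _ = blocks[j]
            if lj >= li and wj >= wi:  # block j can sit under block i
                best_below = max(best_below, H[j])
        H[i] = hi + best_below
    return max(H) if H else 0
```

The sort is what makes a single left-to-right pass sufficient: every candidate support has already been processed.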
Counting boolean parenthesizations is a problem where, given a boolean expression consisting of the symbols True, False, AND, OR and XOR, you count the number of ways to parenthesize it so it evaluates to true. For instance, True AND False XOR True can be parenthesized in two ways to be true. Defining the subproblems, T[i,j] is the number of ways substring S[i,j] can be true and F[i,j] is the same for false; Total[i,j]=T[i,j]+F[i,j]. Both can be computed recursively by splitting at each operator and combining the counts with a case per operator, solvable in O(n³) time.
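One way to sketch this is with memoized recursion over (i, j) spans; the function name and input representation (a list of booleans plus a list of operator strings) are my own:

```python
from functools import lru_cache

def count_true(symbols, ops):
    """symbols: list of booleans; ops[k] in {'AND','OR','XOR'} sits
    between symbols[k] and symbols[k+1]. Counts parenthesizations
    of the whole expression that evaluate to True."""
    symbols, ops = tuple(symbols), tuple(ops)

    @lru_cache(maxsize=None)
    def solve(i, j):
        # returns (ways true, ways false) for symbols[i..j]
        if i == j:
            return (1, 0) if symbols[i] else (0, 1)
        t = f = 0
        for k in range(i, j):  # last operator applied is ops[k]
            lt, lf = solve(i, k)
            rt, rf = solve(k + 1, j)
            if ops[k] == 'AND':
                t += lt * rt
                f += lt * rf + lf * rt + lf * rf
            elif ops[k] == 'OR':
                t += lt * rt + lt * rf + lf * rt
                f += lf * rf
            else:  # XOR
                t += lt * rf + lf * rt
                f += lt * rt + lf * rf
        return t, f

    return solve(0, len(symbols) - 1)[0]
```

The recitation's example, True AND False XOR True, comes out to 2.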
The third problem in the recitation is "make change": given a value N and a set S of coin denominations (with infinite supply, in order of value), what's the minimum number of coins needed to make N? C[p] is the minimum required for p cents, so for p > 0, C[p] = min over i with Si <= p of C[p-Si]+1, with base case C[0]=0. With N subproblems each taking O(m) time for m denominations, this is pseudo-polynomial: O(mN).
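That recurrence in runnable form (function name is my own; returning None when change can't be made is my convention, not the recitation's):

```python
def min_coins(N, S):
    """Minimum number of coins from denominations S (unlimited supply)
    summing to N; None if N can't be made. O(len(S) * N) time."""
    INF = float('inf')
    C = [0] + [INF] * N  # base case: C[0] = 0
    for p in range(1, N + 1):
        for s in S:
            if s <= p and C[p - s] + 1 < C[p]:
                C[p] = C[p - s] + 1  # use one coin of value s
    return C[N] if C[N] < INF else None
```

The table size N (not the number of bits of N) is what makes this pseudo-polynomial.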
Lecture 12 – Greedy Algorithms: Minimum Spanning Tree
This lecture shows two greedy algorithms for solving the minimum spanning tree problem. Greedy algorithms make the locally best choice without considering what could happen in the future. A tree is a connected acyclic graph, and a spanning tree of a graph is a tree living inside the graph that hits all of its vertices. For a minimum spanning tree, we have a weighted graph G=(V,E) with edge weights w, a real number for each edge, and the goal is to find a spanning tree T minimizing the sum over e in T of w(e) — the spanning tree with minimum total weight. One solution is to compute every spanning tree and return the minimum, but there can be exponentially many, which is not so good; this lecture works toward a nearly linear solution. Two weeks ago showed an example of greedy failing and falling back on dynamic programming; this week we start with an expensive dynamic program and fall back on greedy to go from exponential to polynomial time.
If your problem can be solved by a greedy algorithm, it usually has two properties: optimal substructure and the greedy choice property. Optimal substructure is essentially an encapsulation of dynamic programming: the optimal solution to a problem incorporates optimal solutions to subproblems. With greedy algorithms, instead of guessing we choose the best-seeming option via the greedy choice property; if we keep making locally optimal choices, we end up with a globally optimal solution. Both properties are present in minimum spanning trees. The optimal substructure for an MST is: if e={u,v} is an edge of some MST, the graph can be contracted by merging u and v. Contracting removes e, and where u and v were each connected to the same third node, the resulting parallel edges are combined by keeping the minimum weight one. If T' is an MST of G/e, then T' union {e} is an MST of G. This is similar to the recurrence in a dynamic program, which would work by guessing an edge e, contracting it, recursing, then decontracting and adding e to the MST. If we start with an edge guaranteed to be in some MST, contract, and find an MST of the contracted graph, putting e back gives an MST. A graph can have multiple MSTs, and if all edges have the same weight, every spanning tree is an MST.
The problem with the dynamic program above is that it's exponential; luckily it becomes polynomial by removing the guessing. The greedy choice property for MSTs says: consider any cut (S, V-S), where S is a subset of vertices. If e={u,v} with u in S and v in V-S is a least-weight edge crossing the cut, then e is in some MST. Cut and paste is the usual way to prove a greedy algorithm and should be the first thing to try: assume an optimal solution that doesn't have the property you want, and modify it by cutting out one piece and pasting in another to show there is some MST or optimal solution with the desired property. Let T* be an MST of G and say e isn't in T*; modify T* to include e. Adding e to T* creates a cycle, and that cycle must contain some other edge e' in T* crossing the cut, so T* - e' union {e} is a spanning tree. To see it's minimum: w(T* - e' union {e}) = w(T*) - w(e') + w(e), and w(e) is at most w(e') since e has the smallest weight crossing the cut, so the new tree's weight is at most w(T*). The only modifications involve edges crossing the cut.
Prim’s algorithm is used to choose what cut to make. This algorithm starts with a single vertex s.
Maintain a priority queue with every vertex in V-S where v.key = min{w(u,v)| u in S}.
Initially, Q stores V, s.key = 0 for arbitrary start vertex s in V.
for v in V - {s}: v.key = inf
until Q is empty:
    u = Extract-Min(Q)  (adding u to S)
    for v in Adj[u]:
        if v in Q and w(u,v) < v.key:
            v.key = w(u,v)  (via Decrease-Key)
            v.parent = u
return {{v, v.parent} | v in V - {s}}
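A runnable sketch of Prim's algorithm in Python (function name and edge-list representation are my own). It uses the standard-library binary heap with lazy deletion instead of a Fibonacci heap with Decrease-Key, so the bound is O(E log V) rather than O(V lg V + E):

```python
import heapq

def prim_mst_weight(n, edges):
    """edges: list of (u, v, w) for an undirected connected graph on
    vertices 0..n-1. Returns the total weight of a minimum spanning tree."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    in_tree = [False] * n
    total = 0
    heap = [(0, 0)]  # (key, vertex), starting from vertex 0
    while heap:
        key, u = heapq.heappop(heap)
        if in_tree[u]:
            continue  # stale entry; u was already added with a smaller key
        in_tree[u] = True
        total += key
        for w, v in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (w, v))
    return total
```

The "stale entry" check plays the role of Decrease-Key: cheaper keys simply sit earlier in the heap.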
This return statement gives the edges forming the minimum spanning tree. A correctness invariant states that the tree TS built within S is always contained in some MST of G: looking at each v in S and the edge from v to its parent, we get TS. The proof is by induction, assuming there's an MST T* containing TS as it grows to TS' = TS union {e}. The greedy choice property guarantees some MST containing e, and T* can be modified to include both e and TS; the property was proved by removing an edge crossing the cut and replacing it with e. The same works here, and the removed edge can't be in TS since edges of TS don't cross the cut. The running time of Prim's algorithm is the same as Dijkstra's, so with Fibonacci heaps, O(V lg V + E).
Kruskal's algorithm is another way to make cuts, based on the idea that a globally minimum weight edge is valid for some cut. At each step, the algorithm takes the minimum weight remaining edge. The catch is that you need to keep track of which nodes are already connected to avoid cycles, and union-find fixes this.
Kruskal's algorithm:
maintain the connected components of the MST-so-far T in a union-find structure
T = ∅
for v in V: Make-Set(v)
sort E by weight
for e = {u,v} in E (in increasing order by weight):
    if Find-Set(u) != Find-Set(v):
        T = T union {e}
        Union(u, v)
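A self-contained sketch in Python with a small union-find (function names and the edge representation are my own; this version uses path halving and union by rank):

```python
def kruskal_mst(n, edges):
    """edges: list of (w, u, v) for an undirected graph on vertices
    0..n-1. Returns the list of MST edges as (u, v, w)."""
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False  # already connected; adding edge would make a cycle
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx  # union by rank
        if rank[rx] == rank[ry]:
            rank[rx] += 1
        return True

    T = []
    for w, u, v in sorted(edges):  # increasing order by weight
        if union(u, v):
            T.append((u, v, w))
    return T
```

As noted below, the sort dominates; the union-find operations are nearly constant amortized.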
Union-find was invented to implement Kruskal's algorithm faster; lots of data structures come from algorithms — Fibonacci heaps, for instance, were invented to improve Dijkstra's algorithm. Union-find operations grow very slowly in cost, so the sorting is the long part here. The overall runtime is O(sort(E) + E·α(V) + V), where E is the number of edges and α(V), the inverse Ackermann function, is the amortized time of a union-find operation. The algorithm is really good when weights are integers and radix sort can sort them in linear time. For correctness, the invariant is that the tree T so far is a subset of some MST T*. Assume T is a subset of T*, and T grows to T' = T union {e}. Apply the greedy choice property with S = CC(u), the connected component containing u: e is a minimum weight edge crossing that cut, so there is an MST T*' containing T'. At all times, the edges chosen are in some MST. Overall this runs in nearly linear time.
Recitation 6 – Greedy Algorithms (more examples)
Process scheduling is a problem where, given a computer and n processes taking times t1 to tn, you pick the order to run them to minimize the average completion time, where a process's completion time Cpi is when it finishes. A greedy solution is the shortest-processing-time-first (SPTF) rule: run processes in order of increasing processing time, provable by swapping out-of-order elements one by one and solvable in O(n log n) time via sorting. For an "online" variant, if a newly arriving process's time is less than the time remaining on the current process, the new one is run first.
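The offline SPTF rule is a one-liner plus a running sum; a minimal sketch (function name and test times are my own):

```python
def min_average_completion(times):
    """Shortest-processing-time-first: run jobs in increasing order of
    processing time and return the minimum average completion time."""
    elapsed = 0
    total_completion = 0
    for t in sorted(times):   # SPTF order
        elapsed += t          # this job completes at time 'elapsed'
        total_completion += elapsed
    return total_completion / len(times)
```

For times [3, 1, 2] the completions are 1, 3, 6, giving an average of 10/3; any other order does worse.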
Event overlap is a problem where, given n events with start and finish times which could overlap, you determine how many clones (k) you would need to attend them all. A greedy solution sorts intervals by start time and assigns each interval to an existing clone when possible; when an event appears that's not compatible with any existing clone, a new one is created and assigned to it. An O(n log n) implementation uses a min-heap of the finish time of each clone's last interval.
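A sketch of that heap-based implementation (function name and the convention that an event ending at time t is compatible with one starting at t are my own assumptions):

```python
import heapq

def min_clones(events):
    """events: list of (start, finish). Returns the fewest clones needed
    so every event is attended; O(n log n)."""
    finish_heap = []  # min-heap of finish times, one entry per clone
    for start, finish in sorted(events):  # process by start time
        if finish_heap and finish_heap[0] <= start:
            # the earliest-free clone can take this event
            heapq.heapreplace(finish_heap, finish)
        else:
            heapq.heappush(finish_heap, finish)  # need a new clone
    return len(finish_heap)
```

The heap size only ever grows when every existing clone is busy, so its final size is exactly the needed k.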
Fractional make-change is a problem similar to make-change: with m kinds of metals and a value N in cents to make change for, metal i is worth Si per kilogram and there are ni kilograms of it. Amounts don't have to be integral, and the goal is to minimize the total weight of metal used, sum(ki). A greedy solution takes the most valuable metal per kilogram and uses as much of it as possible first.
Lecture 13 – Incremental Improvement: Max Flow, Min Cut
A lot of setup is required for flow networks, and max flow is an optimization problem. The max-flow/min-cut theorem is the crucial theorem for flow networks and will be used to get to the Ford-Fulkerson algorithm, the first max flow algorithm. Max flow is an algorithmic hammer used to solve a variety of problems. A flow network is a directed graph with two distinguished vertices, source s and sink t, where flow comes out of s, obeys rules set by the edges, and arrives at t. A law of conservation is associated with this flow: for any vertex other than s or t, anything that enters must exit. Each directed edge (u,v) in E has a nonnegative capacity c(u,v); if (u,v) doesn't belong to E, assume its capacity is zero. This is a directed graph and cycles are allowed. Each edge has a capacity value and a flow value, with the constraint that the flow can never exceed the capacity on any edge. The capacity can be thought of as a rate, only allowing so much over a given time. Effectively, if a node has one exiting edge with capacity 3 and two incoming edges with flows 2 and 1, it's fine, but both incoming edges carrying 2 would be over capacity, since flow can't be left over at a node.
The goal of the problem is to maximize the flow brought from the source into the sink, where s can provide infinite flow. Counterintuitively, decreasing the flow on specific edges can increase the overall flow. For a formal definition of the max flow problem: given a graph G, find a flow with the maximum value on G. The sum of capacities of edges exiting the source is an initial upper bound, but downstream capacities can force the answer lower. Some assumptions to make the problem easier: self-loops aren't allowed, and a cycle between two nodes is expanded with a third node into a triangular path, so cycles are allowed but not of length one or two. With these assumptions there is only a single notion of flow, even though both positive and net flow exist; disallowing short cycles makes the single flow the same as net or positive flow.
A (net) flow on graph G is defined as a function f: V×V → R that satisfies the capacity constraint — for all u,v in V, f(u,v) is at most c(u,v) — and flow conservation: for all u in V - {s,t}, the sum over v in V of f(u,v) = 0. Skew symmetry says that for all u,v in V, f(u,v) = -f(v,u). There has to be an edge from u to v to have positive flow from u to v. These properties can be used to prove theorems showing how algorithms on flow networks work. The value of a flow f is denoted |f| and equals the sum over v in V of f(s,v) = f(s,V); this is the single quantity that needs to be maximized for the max flow problem. Implicit summation notation says that a capital letter refers to a set to be summed over, which is why the last equation holds. Simple properties of this setup are f(X,X)=0, f(X,Y)=-f(Y,X), and f(X union Y, Z) = f(X,Z)+f(Y,Z) if X intersect Y is empty.
Implicit set notation is used to elegantly prove theorems, the first example saying the value of the flow is exactly what goes into the sink: |f| = f(V,t). The proof: |f| = f(s,V) = f(V,V) - f(V-s,V). Since f(V,V) is 0, |f| = f(V,V-s) by skew symmetry, which equals f(V,t) + f(V,V-s-t), and by flow conservation the second term is 0, showing |f| = f(V,t).
Vertices can be cut, or broken into two distinct parts, such that s and t are on separate sides. A cut (S,T) of a flow network G=(V,E) is a partition of V such that s is in S and t is in T. If f is a flow on G, then the flow across the cut is f(S,T). If S contains s and d while nodes a, b, c and t are in T, then f(S,T) = (f(s,a) + f(s,b)) + (f(d,a) + f(d,c) + f(d,b) + f(d,t)), summing the flows between every pair of vertices on opposite sides; flows coming back into S are subtracted (multiplied by negative 1). When there's no edge from s to c, skew symmetry makes f(s,c) and f(c,s) cancel out. The process for defining the value of a cut will come back next time. The capacity of a cut is c(S,T) = (c(s,a) + c(s,b)) + (c(d,b) + c(d,t)), caring only about capacities of actual edges from S to T. The value of any flow is bounded by the capacity of any cut.
Another characterization of the flow value is that for any flow f and any cut (S,T), |f| = f(S,T). It doesn't matter which cut is chosen; the flow in the network equals the flow across the cut because the source and sink are on opposite sides. This is proven with implicit summation notation: f(S,T) = f(S,V) - f(S,S) = f(S,V) = f(s,V) + f(S-s,V). Since the second term's set contains neither s nor t, flow conservation makes it zero, and f(s,V) = |f|. The capacity of any cut therefore bounds the flow of the network, since the flow across a cut is the flow through the network; in particular, the min-cut is an upper bound on the max-flow.
A residual network is a tool for finding min-cuts to find max-flow. It points you to places where capacities aren't maxed out: if capacity is 3 and flow is 2, there's a residual capacity of 1. The residual network Gf = (V, Ef) contains exactly the edges with strictly positive residual capacity cf(u,v) = c(u,v) - f(u,v) > 0; edges in Ef admit more flow. If (v,u) doesn't belong to E, then c(v,u) = 0 but f(v,u) = -f(u,v), so the residual network contains extra edges that don't exist in the original network. A residual network is based on the original's flow: it has the same vertices but different edges. An edge's residual capacity is its original capacity decreased by the flow going across it, with the used capacity represented as a back edge; edges at full capacity don't show up in the residual network at all. The Ford-Fulkerson algorithm looks for augmenting paths, which are paths from s to t in Gf; if one exists, the flow isn't maximum, and the path can be used to update the original graph and increase the flow.
Lecture 14 – Incremental Improvement: Matching
The lecture on network flow goes further into algorithms. As a review of flow networks, a flow network G=(V,E) is a directed graph with two numbers associated per edge: 1:3 means flow 1, capacity 3. Other than source s and sink t, all flow entering a vertex must leave it. The goal is to find the maximum possible flow, |f| = f(s,V) = f(V,t). You can take an arbitrary cut in the flow network, corresponding to any partition (S,T) such that s belongs to S and t belongs to T. Regardless of the actual flow values, |f| = f(S,T) for any cut (S,T), with the corollary that |f| <= c(S,T) for any cut (S,T). The residual graph Gf = (V, Ef) has the edges with strictly positive residual capacities, so cf(u,v) = c(u,v) - f(u,v) > 0. An augmenting path is any path from s to t in Gf, and the residual capacity of an augmenting path is cf(p) = min over (u,v) in p of cf(u,v).
The Ford-Fulkerson algorithm simply tracks, for each edge, the leftover capacity going forward from one node to the next and the amount the flow could be reduced by in the opposite direction.
Ford-Fulkerson algorithm:
set f[u,v] to 0 for all (u,v)
while an augmenting path p in Gf exists:
    augment f by cf(p)
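A runnable sketch of Ford-Fulkerson using breadth-first augmenting paths (i.e., the Edmonds-Karp variant discussed below); the function name and dict-of-dicts capacity representation are my own:

```python
from collections import deque

def max_flow(capacity, s, t):
    """capacity: dict of dicts, capacity[u][v] = c(u, v) for each edge.
    Ford-Fulkerson with BFS (shortest) augmenting paths: O(V E^2)."""
    # residual capacities, with reverse edges initialized to 0
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in capacity:
        for v in capacity[u]:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path: flow is maximum
        # walk back from t to find the path and its bottleneck
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck  # use forward capacity
            residual[v][u] += bottleneck  # allow undoing it later
        flow += bottleneck
```

The reverse-edge updates are exactly the "capacity it could be reduced by in the opposite direction" bookkeeping.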
The max-flow min-cut theorem says the following are equivalent: (1) |f| = c(S,T) for some cut (S,T), meaning some cut is saturated; (2) f is a maximum flow; (3) f admits no augmenting paths. To prove this, show each statement implies the next. 1 implies 2: since |f| is at most c(S,T) for any cut (S,T), the assumption |f| = c(S,T) implies f is a max flow (it can't be increased further). 2 implies 3 is easy as well: if there were an augmenting path, the flow value could be increased, contradicting the maximality of f. For 3 implies 1, if f admits no augmenting paths, define S = {u in V : there exists a path in Gf from s to u} and T = V - S; t is in T and s is in S, so (S,T) is a cut. Every edge in the original network from a u in S to a v in T must be saturated, meaning its capacity is reached, otherwise v would be reachable in Gf. Summing over all u in S and v in T yields f(S,T) = c(S,T).
Unfortunately, the Ford-Fulkerson algorithm's performance can depend on the order in which paths are chosen. An example shows a depth-first search on a 4-vertex graph with capacities of 10⁹ that can take 2 billion iterations when only two are needed. The Edmonds-Karp algorithm instead uses breadth-first augmenting paths: a shortest path in Gf from s to t where each edge has weight 1. Only O(VE) augmentations are needed in the worst case with breadth-first augmentation, so the overall complexity is O(VE²). Assuming capacities are unbounded, the fastest algorithm as of 2011 was King, Rao and Tarjan, running in O(VE log_{E/(V lg V)} V). More recently Orlin gave an O(VE) algorithm using fast matrix multiplication; the O(VE) algorithm is shown in recitation.
The next problem is the baseball elimination problem. Each of 5 teams has wi wins, li losses and ri games remaining to play, and ri,j is the number of remaining games between teams i and j. Team i is certainly eliminated if wi + ri < wj for some j; this condition is sufficient but not necessary, so its absence could give false hope. Based on a table of these values, a flow network can represent the problem; the one drawn determines whether team 5, Detroit, is eliminated. The source s has an edge to a node for each pair of the other teams that still play each other, with capacity equal to the number of games between that pair. Each pair node has edges of infinite capacity to the two nodes in the next layer (nodes 1, 2, 3 and 4, representing the other teams) for its two teams. Each team node i has an edge to t with capacity w5 + r5 - wi, the number of games team i can win without ending up with more wins than team 5.
Assume team 5 wins all its remaining games, then try to divvy up the other remaining games so all teams end with at most w5 + r5 wins; if that's possible, team 5 can still win. Team 5 is eliminated if and only if the max flow doesn't saturate all edges leaving the source, since saturating those edges corresponds to playing all remaining games. If you can't play all remaining games without exceeding the capacity of the i-to-t edges, then team 5 is eliminated. With the given numbers, team 5 is eliminated.
Recitation 7 – Incremental Improvement: Network Flow and Matching
Edmonds-Karp is an efficient implementation of Ford-Fulkerson that selects the shortest augmenting path in the residual graph: assign a weight of 1 to each edge and run BFS to find a shortest path from s to t in Gf. In the analysis, the monotonicity lemma says that δ(v) = δf(s,v), the breadth-first distance from s to v in Gf, increases monotonically during the algorithm. The number of flow augmentations in Edmonds-Karp is O(VE), and the resulting maximum flow algorithm runs in O(VE²) time.
One application of network flow is vertex cover, where given an undirected graph G=(V,E), a set S of vertices covers G if for each edge (u,v), S contains u or v, and the goal is to minimize |S|. This is NP-hard for general graphs and polynomial for bipartite graphs. Given a bipartite graph G=(L union R, E in L×R), define the flow network H: create a source vertex s with edges of capacity 1 to each vertex in L, create a sink t and add capacity-1 edges from each vertex in R to t, and direct the edges of E from L to R with infinite capacity. Running maximum flow then gives the size of the minimal vertex cover, which can be read off a minimum cut.
Lecture 15 – Linear Programming: LP, Reductions, Simplex
Linear programming is a general purpose technique that can help solve a variety of problems. Max flow, shortest paths and more can be expressed as linear programs, though the algorithms are a lot more complex. The simplex algorithm, for instance, runs in exponential time in the worst case but has been good in practice and held its ground. A simple example of an optimization problem is politics. How do you win? You buy an election — but how can you spend the least amount of money while still winning? The setup estimates votes obtained per dollar spent advertising on a particular issue. Demographics are divided into urban, suburban and rural; the issues are building roads, gun control, farm subsidies, and a gasoline tax, and each issue gains or loses some number of votes in each demographic per dollar spent. Ideally you figure out how to win a majority in every demographic (each of which has a known population size) while spending the minimum amount of money. Algebraically, let x1 through x4 denote the dollars spent per issue: minimize x1+x2+x3+x4 subject to ax1+bx2+cx3+dx4 being at least a majority of one demographic, where a, b, c and d are the votes gained or lost per dollar for each issue in that demographic, and subject to the same kind of constraint for each other demographic. There are n variables and m constraints, and the goal is a runtime that's polynomial in n and m.
In general, linear programming variables are real numbers and problems can be solved in polynomial time; if values are forced to be integral, the problem (integer linear programming) is NP-complete. The standard form of a linear program is to minimize or maximize a linear objective function subject to linear inequalities or equations: with variables x = (x1, …, xn), the objective function is cx = c1x1 + … + cnxn and the inequalities are represented as a matrix constraint Ax <= b. Linear programs come in minimization or maximization forms and can have at-most, at-least, and equality constraints; the goal in standard form is to maximize cx so the inequalities hold and x is at least 0. Is there a "certificate" that shows an LP solution is optimal? In the politics problem, combining the three constraints gives x1 + x2 + 140/222 x3 + x4 >= 2800000/111, and a bound like this is what can show a previously unproven claimed value of 3100000/111 is optimal.
LP duality says there will always be a short certificate of optimality you can use to show a matching lower bound and hence an optimal solution. The basic theorem: if you have max cx such that Ax is at most b and x is at least zero, the dual form of the problem flips everything: max cx = min by such that A-transpose y >= c and y >= 0. You can always do this, and the original is known as the primal form; lots of algorithms flip between the two forms for efficiency. Converting "minimize -2x1 + 3x2" to standard form, you negate it to 2x1 - 3x2 and maximize. If xj doesn't have a non-negativity constraint, replace it with xj' - xj'' where both are at least zero. An equality constraint x1 + x2 = 7 becomes x1 + x2 <= 7 and -x1 - x2 <= -7, and a greater-than-or-equal constraint translates to less-than-or-equal by multiplying by negative 1. Non-linear constraints can sometimes be linearized so powerful LP solvers can handle them; the best solvers are commercial.
Maximum flow is a maximization problem: maximize |f| = sum over v in V of f(s,v). The skew symmetry, conservation and capacity constraints seen last week are all naturally linear. This is simple enough that you probably wouldn't get much performance gain from an LP solver, but a multicommodity max flow problem, with multiple kinds of flow, is more appropriate for LP. With f1, c1, f2, c2 you'd have two distinct disjoint optimizations solvable by running max flow twice; with two commodities sharing a single capacity c, the capacity constraint becomes f1(u,v) + f2(u,v) <= c(u,v), and it's all still linear.
The shortest path problem is a bit trickier. For a single source vertex s, d[v] is the distance from the source, minimized so it eventually equals the shortest path distance. Converting to LP, the triangle inequality can be used as a constraint: we want d[v]-d[u] at most w(u,v) for all (u,v) belonging to E, with d[s]=0. Thinking about this, if you have a node v that can only be reached from u1 and u2 via edges w(u1,v) and w(u2,v), one constraint is written per edge, one will be the binding one, and d[v] = min(d[u1]+w(u1,v), d[u2]+w(u2,v)). Shortest paths is a minimization problem, but since the constraints already encode the mins, the LP objective is to maximize the sum over v of d[v]. Translating problems to max flow or LP is something students will likely do in their careers apparently.
How can you solve the standard LP setting itself? The example in the lecture is the simplex algorithm. Simplex represents the LP in slack form and converts one slack form into an equivalent slack form whose objective value has not decreased and has likely increased, repeating until the optimal solution is obvious. The simplex algorithm is simple but is exponential, O((m+n choose n)), in the worst case. An example problem is to maximize 3x1+x2+2x3 subject to x1+x2+3x3<=30, 2x1+2x2+5x3<=24, and 4x1+x2+2x3<=36, with all x's at least zero. Slack form introduces a variable per constraint equation: z = 3x1+x2+2x3, and basic variables x4, x5, and x6 represent slack, i.e. how much room we have to optimize. x4=30-x1-x2-3x3, x5=24-2x1-2x2-5x3, and x6=36-4x1-x2-2x3. Now the solution space has 3 added basic variables alongside the non-basic ones. Running simplex, it takes 3 iterations to get to an optimum solution. A basic solution sets all non-basic variables to zero and computes the basic variable values: the objective function is 3(0)+1(0)+2(0)=0, the non-basic values are all 0, and the basic variables become 30, 24, and 36. The key step is pivoting, where you swap basic and non-basic variables. A pivot selects a non-basic variable xe whose coefficient in the objective function is positive and increases xe's value as much as possible without violating any of the constraints. Variable xe becomes basic and another variable becomes non-basic; the values of the non-basic variables and the objective function may change.
Running through a step: if x1 is selected, it needs to be increased without violating constraints. Everything starts at zero, and x6's constraint is the limit since it has the largest multiplier on x1, giving x1 = 9 - x2/4 - x3/2 - x6/4. Rewrite the other equations with x6 on the right hand side, replacing x1 with the equation above. The point of this is that the objective function value increases. The original basic solution (0,0,0,30,24,36) satisfies the equations after the pivot, and the new basic solution after this first pivot, (9,0,0,21,6,0), has an objective value of 27, much better than 0. With two more pivots on other variables, the objective function reaches its optimum value of 28.
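Not simplex itself, but a sanity check I find useful: since an LP optimum lies at a vertex of the feasible polytope, a tiny brute force over all ways of making three constraints tight can confirm the example's answer. The code and exact-arithmetic helper below are my own sketch, not from the lecture.

```python
from fractions import Fraction as F
from itertools import combinations

def solve3(A, b):
    # Gauss-Jordan elimination on a 3x3 system with exact rationals;
    # returns None if the chosen constraints are not independent.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(3):
        pivot = next((r for r in range(col, 3) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(3):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [M[r][k] - factor * M[col][k] for k in range(4)]
    return [M[r][3] for r in range(3)]

# Constraints a.x <= b for the example LP, non-negativity included.
A = [[F(1), F(1), F(3)], [F(2), F(2), F(5)], [F(4), F(1), F(2)],
     [F(-1), F(0), F(0)], [F(0), F(-1), F(0)], [F(0), F(0), F(-1)]]
b = [F(30), F(24), F(36), F(0), F(0), F(0)]
c = [F(3), F(1), F(2)]  # objective 3x1 + x2 + 2x3

best, best_x = None, None
for rows in combinations(range(6), 3):
    x = solve3([A[r] for r in rows], [b[r] for r in rows])
    if x is None:
        continue
    # a vertex of the feasible region must satisfy every constraint
    if all(sum(A[r][j] * x[j] for j in range(3)) <= b[r] for r in range(6)):
        val = sum(c[j] * x[j] for j in range(3))
        if best is None or val > best:
            best, best_x = val, x

print(int(best), [int(v) for v in best_x])  # -> 28 [8, 4, 0]
```

This enumerates all (6 choose 3) = 20 candidate vertices, which is obviously not how simplex works, but it agrees with where the pivots end up.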
Lecture 16 – Complexity: P, NP, NP-Completeness, Reductions
The entire field of NP-completeness in one lecture. This area is all about converting one problem into another. P is the set of problems solvable in polynomial time and NP is the set of decision (yes or no) problems solvable in nondeterministic polynomial time. In a nondeterministic model, you can guess one of polynomially many options in constant time, and if any guess leads to a yes answer, the machine makes that guess. These algorithms are biased toward yes and can be thought of as "lucky."
3SAT is a problem where, given a boolean formula of the form (x1 OR x3 OR !x6) AND (!x2 OR x3 OR !x7)..., where a literal is a variable or its negation and an "or" of three literals is a clause, can you set the variables to true or false so the formula comes out true? A nondeterministic algorithm guesses whether each variable is true or false and checks whether the formula is satisfied. Since NP is biased toward yes, if it's possible to come out yes the program will return yes, otherwise it returns no. A satisfying assignment is one that makes the formula true, and given a satisfying assignment you can verify a yes answer in polynomial time. There isn't an easy way to verify that no satisfying assignment exists.
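As a concrete, if exponential, stand-in for the "lucky" machine, a brute-force search over assignments plus a polynomial-time checker can be sketched like this. The clause encoding (positive and negative integers for literals) is my own convention, not the lecture's.

```python
from itertools import product

def satisfies(clauses, assignment):
    # Polynomial-time verifier: a clause is a tuple of literals, where
    # literal i > 0 means x_i and i < 0 means NOT x_i.
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in clauses)

def brute_force_3sat(clauses, n):
    # Exponential search standing in for the nondeterministic guess:
    # return a satisfying assignment if one exists, else None.
    for bits in product([False, True], repeat=n):
        assignment = {i + 1: bits[i] for i in range(n)}
        if satisfies(clauses, assignment):
            return assignment
    return None

# (x1 or x3 or not x6) and (not x2 or x3 or not x7), the lecture's example shape
clauses = [(1, 3, -6), (-2, 3, -7)]
print(brute_force_3sat(clauses, 7) is not None)       # satisfiable
print(brute_force_3sat([(1, 1, 1), (-1, -1, -1)], 1))  # x1 and not x1 -> None
```

The asymmetry in the text shows up directly: `satisfies` is fast, but confirming a "no" means the loop ran through all 2^n assignments.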
NP can equivalently be defined as the set of decision problems with polynomial-size certificates and polynomial-time verifiers for yes inputs. X is NP-complete if X is in NP and is NP-hard, and X is NP-hard if every problem Y in NP reduces to X. A reduction from problem A to B is a polynomial-time algorithm converting A inputs to equivalent B inputs. If problem X is NP-hard, it's at least as hard as every problem in NP. If you can solve B then you can solve A, and if either is in P or NP, so is the other. X isn't in P unless P equals NP, so proving something NP-complete is strong evidence it can't be solved with a polynomial-time algorithm. You can prove X is NP-complete by proving that it's in NP and that it's NP-hard (by reducing from a known NP-complete problem to X). To show it's in NP, you can give a nondeterministic algorithm or a certificate and verifier.
For a reduction example, 3SAT is reduced to Super Mario Bros, generalized to an arbitrary screen size of n by n. A gadget forces Mario to fall to the left or right for each variable xi, so there are n variable gadgets per level, each with a true or false outcome. The specific level has three clauses and the output is true only if all are satisfied. Next he shows that satisfying the 3SAT formula is equivalent to completing the Mario level: given that the model is a "lucky" algorithm, if one can be solved the other can. The example focuses more on the general strategy, like big Mario breaking blocks and small Mario fitting through gaps big Mario can't, and the if/else structure this encodes.
The next problem is 3D matching, or 3DM. Matching problems about pairs (2D matching) are tractable; moving to triples becomes NP-complete. Given disjoint sets X, Y, and Z, each of size n, and triples T, a subset of X x Y x Z, the goal is to choose a subset S of T such that every element of X union Y union Z is in exactly one triple of S. This is proved NP-hard by reducing from 3SAT. Drawing out the points, it's shown that choosing one grouping eliminates at least one other, a clause gadget encodes the "or" over groupings, and a garbage-collection gadget soaks up the unchosen pieces. Most NP-complete problems can be reached by reductions from 3SAT.
Subset sum is a problem where, given n integers A={a1,...,an} and a target sum t, decide if there's a subset S of A such that the sum over ai in S of ai equals t. This is "weakly NP-hard." Given a three-dimensional matching instance, we reduce to subset sum by constructing integers representing triples. Numbers are viewed in a base b one larger than the maximum number of 1's that could collide in a digit position. A triple (xi, yj, zk) is represented as b^i + b^j + b^k, a number like 0000100001001000, and the target is t = 1111...1 = sum over i of b^i. Since the base is more than the number of possibly colliding 1's, each 1 in the target must come from exactly one chosen triple and the triples can't overlap. The number of digits is O(n) but the values of the numbers are exponential; for strong NP-hardness the values should be polynomial. This problem is basically knapsack: polynomial in the values but not in the number of digits. Weak NP-hardness usually goes along with pseudo-polynomial algorithms.
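The pseudo-polynomial side of this is easy to see in code: a DP over reachable sums runs in O(n * t) time, polynomial in the value t but exponential in its digit count, which is exactly why subset sum is only weakly NP-hard. The instance below is my own toy example.

```python
def subset_sum(nums, t):
    # Pseudo-polynomial DP: track every sum reachable with a subset so far.
    # The reachable set has at most t+1 entries, giving O(n * t) work,
    # polynomial in the *value* of t but not in its number of digits.
    reachable = {0}
    for a in nums:
        reachable |= {s + a for s in reachable if s + a <= t}
    return t in reachable

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # 4 + 5 = 9 -> True
print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # no subset -> False
```

Doubling the number of digits of t squares its value, so this blows up exactly when the numbers in the reduction above are exponential.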
Partition is a problem where, given n integers A={a1,...,an}, decide if there's a subset S of A such that sum(S) = sum(A-S) = sum(A)/2. Partition is the special case of subset sum where t is sum(A)/2, making it no harder than subset sum, and it could be reduced to and solved with subset sum; the example instead reduces from subset sum to partition to show partition is just as hard. Let sigma be sum(A), and add a_{n+1} = sigma + t and a_{n+2} = 2*sigma - t. This effectively adds sigma - t to one side and t to the other, so the two sides can equalize exactly when a subset summing to t exists: solving the partition instance solves the subset sum instance.
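The reduction in this paragraph is easy to code up and check on small instances. The brute-force solvers here are exponential on purpose and only for verification; the instance numbers are my own.

```python
from itertools import combinations

def has_subset_sum(nums, t):
    # brute force check, fine for tiny examples
    return any(sum(c) == t for r in range(len(nums) + 1)
               for c in combinations(nums, r))

def reduce_subset_sum_to_partition(nums, t):
    # The reduction above: with sigma = sum(nums), append sigma + t and
    # 2*sigma - t. The new list splits into two equal halves exactly when
    # the original instance has a subset summing to t.
    sigma = sum(nums)
    return nums + [sigma + t, 2 * sigma - t]

def has_partition(nums):
    total = sum(nums)
    return total % 2 == 0 and has_subset_sum(nums, total // 2)

nums = [3, 1, 4, 2]
# t = 6 is achievable (4 + 2), t = 11 is not (total is only 10)
print(has_subset_sum(nums, 6), has_partition(reduce_subset_sum_to_partition(nums, 6)))
print(has_subset_sum(nums, 11), has_partition(reduce_subset_sum_to_partition(nums, 11)))
```

Both answers match in both directions, which is the point of the reduction: yes maps to yes and no maps to no.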
Rectangle packing is a problem where, given a large rectangle T and small rectangles Ri, you pack the small rectangles into T without overlap. This is weakly NP-hard, shown by reduction from partition: each ai is converted into a 1 by 3ai rectangle and T is 2 by 3t, where 3t = (3/2) * sum(A), so packing the rectangles requires solving the partition instance. Lastly there's an example with jigsaw puzzles using 4-partition: given n integers in A, split them into n/4 quadruples with the same sum t = sum(A)/(n/4), where the values are at most polynomial in n. Integers are represented as numbers of jigsaw tiles, and packing them into a rectangular board works like rectangle packing. Converting from a number problem to a non-number problem like this requires strong NP-hardness.
Recitation 8 – Complexity: More Reductions
Comparing the "hardness" of different problems can be done with reductions. If A can be solved by a black box that solves B, then A's difficulty can be viewed in terms of B's plus the work of the transformation. A Karp reduction is one where, given decision problems A and B, A is polytime reducible to B if there's a polynomial-time function R mapping A's inputs to B's so that A(x)=B(R(x)). If this holds the problems are comparable, and membership in P, NP, etc. transfers. This recitation focuses on showing reductions between problems. Hamiltonian Cycle, which given a directed graph asks if a cycle visits each vertex exactly once, reduces in polynomial time to Hamiltonian Path, which asks the same of a path; given that Cycle is NP-complete, Path is shown to be NP-complete too. 3-dimensional matching is reduced to 4-partition, showing the latter NP-hard, along with clique to independent set, vertex cover to set cover, and clique to Max-2SAT, with detailed explanations for all.
Lecture 17 – Complexity: Approximation Algorithms
This is the first of two lectures on approximation algorithms, with this being an introductory lecture. The pattern is to take a problem, define it, think of an interesting (often greedy) heuristic, and prove it's close to optimal: these algorithms give approximate solutions in polynomial time that come very close to the exact answer or perform well given limitations. An algorithm for a problem of size n has an approximation ratio rho(n) if for any input it produces a solution with cost c where max(c/copt, copt/c) is at most rho(n); rho(n) can be a constant or a function of n. An approximation scheme takes an additional input epsilon greater than 0, and for any fixed epsilon the scheme is a (1+epsilon)-approximation algorithm. A scheme with running time like O(n^(2/epsilon)) is a polynomial-time approximation scheme, or PTAS, which is polynomial in n but not in 1/epsilon. A fully polynomial-time approximation scheme (FPTAS) is polynomial in both n and 1/epsilon, such as O(n/epsilon^2).
Vertex cover is a problem where, given an undirected graph G=(V,E), the goal is to find a set of vertices covering all edges: a subset V' of V such that if (u,v) is an edge of G, then u is in V' or v is in V' (or both). The optimization problem is finding a V' of minimum size, and no polynomial-time algorithm for it is known. One heuristic is to repeatedly pick a maximum-degree vertex at each step, which gives an algorithm that can be about a lg n factor off of optimal. With ties among maximum degrees you may pick the wrong node: the algorithm could pick all bottom vertices in a two-row construction when the top row is optimal, where the top has k! = n vertices and the bottom ends up around k!(log k). This is pretty good, but the approximation factor grows with the problem size. As a practitioner, max degree is most likely what you'd code first.
Another possible heuristic for the vertex cover problem is to look at arbitrary edges. Set the cover C to empty and E' to E; while E' isn't empty, pick (u,v) in E' arbitrarily, set C to C union {u} union {v}, and delete all edges incident on u or v. After the loop, return C. This simple iterative algorithm is always within a factor of 2 of optimal. To prove approx-vertex-cover is a 2-approximation: the edges picked don't share vertices, since incident edges are deleted whenever an edge is picked. Letting A be the set of edges picked, 2|A| vertices are selected, so the cost c is 2|A|. Meanwhile copt is at least |A|, since every edge must be covered, including all edges in A, and no two edges in A share a vertex, so a different vertex is needed for each. That guarantees a factor of 2, and it's the best approximation algorithm shown for vertex cover today.
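The whole 2-approximation fits in a few lines. This is a direct transcription of the loop in the paragraph above; the example graph is my own, chosen so the algorithm's answer (4 vertices) is exactly twice the optimum ({1, 4}).

```python
def approx_vertex_cover(edges):
    # 2-approximation: repeatedly pick an arbitrary remaining edge, add
    # BOTH endpoints to the cover, then discard every edge touching them.
    cover, remaining = set(), list(edges)
    while remaining:
        u, v = remaining[0]
        cover |= {u, v}
        remaining = [(a, b) for (a, b) in remaining
                     if a not in (u, v) and b not in (u, v)]
    return cover

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
cover = approx_vertex_cover(edges)
print(cover, all(u in cover or v in cover for u, v in edges))
```

Since the picked edges are vertex-disjoint, the cover always has even size, and any valid cover needs at least one endpoint per picked edge, which is the factor-of-2 argument from the text.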
Set cover is a similar problem where, given a set X and a family of (possibly overlapping) subsets S1,...,Sm of X whose union is X, you find C, a subset of {1,2,...,m}, minimizing |C| such that the union over i in C of Si = X. Exact solutions seem to take exponential time, but it can be approximated. The obvious and best heuristic is to take the largest remaining subset at each step; approximate-set-cover is a (ln(n)+1)-approximation algorithm. To do this, pick the largest Si, then remove all elements in Si from X and the other Sj. Suppose an optimal cover Copt of size t exists, and let Xk be the set of uncovered elements at iteration k (X0=X). For all k, Xk can be covered by t sets, so by the pigeonhole principle one of those t sets covers at least |Xk|/t elements, and the algorithm will pick a set of (current) size at least |Xk|/t. Therefore |Xk+1| is at most (1-1/t)|Xk|: the problem size shrinks each iteration and the process ends when there's nothing left to cover (|Xk|=0). The cost is the number of iterations k, since k subsets are selected. Since |Xk| is at most (1-1/t)^k * n, which is less than e^(-k/t) * n, the algorithm must terminate by the time e^(-k/t) * n drops below 1, i.e. once k/t exceeds ln(n), so k/t is at most ln(n)+1.
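The greedy rule is one line of Python once the sets are in hand. A small sketch with my own example universe; indices into the subset list stand in for the index set C.

```python
def greedy_set_cover(universe, subsets):
    # Greedy (ln n + 1)-approximation: repeatedly take the subset covering
    # the most still-uncovered elements, until nothing is uncovered.
    uncovered, chosen = set(universe), []
    while uncovered:
        i = max(range(len(subsets)), key=lambda j: len(subsets[j] & uncovered))
        if not subsets[i] & uncovered:
            return None  # the family doesn't actually cover the universe
        chosen.append(i)
        uncovered -= subsets[i]
    return chosen

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(universe, subsets))  # -> [0, 3]
```

Here greedy happens to find the optimal 2-set cover; the ln(n)+1 bound is for the worst case, where the shrinking-|Xk| argument above is tight.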
The next problem is a partition problem where the cost is the imbalance between two sets. Given a set S of n items with weights s1,...,sn, assumed sorted so that earlier ones are at least as large as later ones, partition them into sets A and B to minimize max(sum over i in A of si, sum over i in B of si), i.e. max(w(A),w(B)). If 2L is the sum from i=1 to n of si, then the ideal solution would be two sets of weight L each, and the worst you could get is 2L with all elements on one side, so even a trivial algorithm is a 2-factor approximation. The approximation partition algorithm defines m as ceil(1/epsilon) - 1, so epsilon = 1/(m+1), and has two phases. The first finds an optimal partition A',B' of s1 through sm by exhaustive search, seeding the algorithm with a partial solution; the work to create the seed is O(2^m), making this a PTAS and not an FPTAS. The second phase sets A to A', B to B', and loops for i from m+1 to n: if w(A) is at most w(B), add i to A, else add i to B.
To show that approximate partition is a PTAS, assume without loss of generality (WLOG) that w(A) is at least w(B); the approximation ratio is then w(A)/L. Let k be the last element added to A; it could have been added during either phase, so both cases need analysis. If k was added to A in the first phase, then A=A' and the partition is optimal, since with n >= m items we can't do better than w(A'), which is optimal for the first m items. If k was added to A in the second phase, we know w(A)-sk is at most w(B), which is why k was added to A, so w(A)-sk is at most 2L-w(A) since w(A)+w(B) is 2L, giving w(A) at most L+sk/2. Since items are sorted in decreasing order, items s1 through sm are each at least sk, so 2L is at least (m+1)sk. Then w(A)/L is at most (L+sk/2)/L = 1+sk/(2L), which is at most 1+sk/((m+1)sk) = 1+1/(m+1) = 1+epsilon.
Lecture 18 – Complexity: Fixed Parameter Algorithms
This is the second lecture on solving NP-hard problems, this time focusing on fixed-parameter tractability instead of approximation. Ideally you could solve hard problems in polynomial time with exact solutions, but only two of those three properties can be achieved at once: polynomial problems get exact solutions in polynomial time, approximation algorithms solve hard problems in polynomial time, and fixed-parameter tractability gives exact solutions to hard problems. The idea of fixed-parameter tractability (FPT) is to find an exact solution for an NP-hard problem while confining the exponential dependence to a parameter. In general a parameter is a size or complexity measure: a non-negative integer k(x) where x is the input. The desired running time is exponential in k but polynomial in the problem size n, so as long as k is small the solution can still be quick even with exponential time. A parameterized problem is a problem plus a parameter, typically written "problem with respect to parameter k." Usually there's one obvious parameter to consider.
The problem k-Vertex Cover is a parameterized problem where, given graph G=(V,E) and a non-negative integer k, the goal is to decide if there is a vertex cover S, a subset of V, of size |S| at most k. The parameter function here is just k. An obvious brute-force solution would be to try all (n choose k) subsets of k vertices and test each for coverage. This could be used to show the problem is in NP, but it makes an exponential algorithm, O(E * V^k), where the exponent depends on k, which is bad. A parameterized problem is FPT if it can be solved in f(k) * n^O(1), which means the exponent of n doesn't depend on k; f(k) carries the exponential part, and there is an algorithm solving vertex cover in such a time.
The bounded-search-tree algorithm can solve k-Vertex Cover better. First, consider any edge e=(u,v) in the graph: either u or v (or both) is in S, and you can guess by trying each. Guess u is in S, delete u and its incident edges to create a subproblem with a new graph G' and decremented k', and recurse on (G', k'); in the second case guess v is in S and do the same. The overall answer is the OR of the two cases at each level. This is similar to dynamic programming but is exponential and memoization isn't used. At each recursive step k and n go down by one, and O(V) time is spent at each node of the recursion tree. For the base case, when k is zero, return whether |E| equals zero: if so there's a vertex cover, if not there isn't. There are 2^k nodes in the height-k tree, giving O(2^k * V), which is a good algorithm.
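The branching above is short enough to write out directly. A minimal sketch with an edge-list representation and my own example graph; a real implementation would pass the graph more efficiently than rebuilding the list.

```python
def k_vertex_cover(edges, k):
    # Bounded-search-tree FPT algorithm: pick any edge (u, v); one of its
    # endpoints must be in the cover, so branch on both choices.
    # Depth is at most k, so roughly O(2^k * |E|) work overall.
    if not edges:
        return True   # nothing left to cover
    if k == 0:
        return False  # edges remain but the budget is spent
    u, v = edges[0]
    for w in (u, v):
        rest = [(a, b) for (a, b) in edges if w not in (a, b)]
        if k_vertex_cover(rest, k - 1):
            return True
    return False

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(k_vertex_cover(edges, 2), k_vertex_cover(edges, 1))  # {1, 4} works, no single vertex does
```

The exponential part is confined to k exactly as the FPT definition wants: doubling the graph size doesn't touch the 2^k factor.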
You can solve a problem in f(k) * n^c (an FPT problem) if and only if you can solve it in f'(k) + n^c'. Given an instance of size n with parameter k: either n is at most f(k), in which case f(k) * n^c is at most f(k)^(c+1), or n is at least f(k), in which case f(k) * n^c is at most n^(c+1). In both cases, f(k) * n^c is at most max{f(k)^(c+1), n^(c+1)}, which is at most f(k)^(c+1) + n^(c+1). Applying this to the bounded-search-tree bound, 2^k * n is at most 4^k + n^2.
A kernelization algorithm is a polynomial-time algorithm that can be thought of as a self-reduction, converting an input (x, k) into an equivalent small input (x', k') for the same problem. Equivalent means the answers are the same; small means |x'| = n' should be at most f(k). Bringing an input down from size n to a function of k in polynomial time can be a huge change. A problem is FPT if and only if it has a kernelization. One direction: if it has a kernel, kernelize and run any exact algorithm on the kernel, giving f(k) * n^c overall. The other direction: an FPT algorithm can be turned into a kernelization. If n is at most f(k), the input is already a kernel. If n is at least f(k), the FPT algorithm runs in O(n^(c+1)), meaning it runs in polynomial time, so just solve the problem and output a trivial constant-size yes or no instance, certainly smaller than f(k). Exponential-size kernels are thus equivalent to FPT, but polynomial kernels may not always exist.
Applying polynomial-time kernelization to k-Vertex Cover, a few simplification rules do the work. If the graph isn't simple and multiple edges can connect two nodes, delete all but one of the edges between the same pair; rules like this are guaranteed correct and make the graph simple. Any vertex with degree higher than k must be in the cover S, since otherwise all of its more-than-k neighbors would have to be; delete it, delete its incident edges, and decrement k. Finding a vertex with degree over k can be done in linear time, and the overall running time is O(VE), improvable to linear with a data structure. After all reductions are done, you have a bounded-degree graph where every vertex has degree at most k and each vertex in S covers at most k edges; since |S| is at most k, |E| is at most k^2. Zero-degree vertices should be deleted as well, so |V| is at most 2k^2 and n = |V|+|E| = O(k^2). If the resulting graph is still bigger than these bounds, the assumption that a cover of size at most k exists must be false, so you can output "no"; otherwise output the smaller graph. Either way a quadratic (in k) sized graph is produced in polynomial time. These are good reductions to make before solving vertex cover, and either the brute-force or bounded-search-tree algorithm can then run faster. Kernelization is similar to how circuits were reduced in EECS.
One example FPT algorithm is to kernelize in O(VE) time and use the brute-force algorithm on the kernel, for roughly O(k^2 * (2k^2)^k) = O(2^k * k^(2k+2)); this shows you can get a reasonable total running time out of a bad algorithm if you kernelize first. Kernelizing and running bounded-search-tree gives O(2^k * k^2) on the kernel and a total running time of O(VE + 2^k * k^2), which is a real improvement. The best known bound, not covered here, is around O(kV + 1.274^k). If you know there will be a relatively small vertex cover, these are the algorithms to use.
Approximation and fixed-parameter algorithms are closely related. Take an optimization problem with an integral OPT and consider the decision problem of whether OPT is at most k, with parameter k. The claim is: if the optimization problem has an EPTAS (efficient polynomial-time approximation scheme, meaning f(1/epsilon) * n^O(1), not quite as good as an FPTAS but close), then the decision problem is FPT. To prove this, say the problem is a maximization problem, and run the EPTAS with epsilon equal to 1/(2k), taking f(2k) * n^O(1) time. The absolute error is the optimal solution minus the approximate solution, and the relative error is the absolute error divided by the optimal solution. The relative error is at most epsilon = 1/(2k), which is strictly less than 1/k, so the absolute error is at most OPT/(2k), which is less than 1 when OPT is at most k. So if we find a solution of value at least k, then OPT is at most (1+1/(2k)) * k, which is less than k+1/2, hence OPT = k, since no integers lie between k and k+1/2.
Recitation 9 – Approximation Algorithms: Traveling Salesman Problem
The traveling salesman problem is a problem where, given an undirected graph G=(V,E) with a non-negative integer cost c(u,v) for each edge, you find the minimum-cost Hamiltonian cycle. The traveling salesman problem is NP-complete and in general has no good approximation, but a 2-approximation is known for the metric traveling salesman problem (metric TSP), where the cost function satisfies the triangle inequality, as shortest-path distances do. Metric TSP is still NP-complete, but the approximation algorithm works off the observation that removing an edge from a Hamiltonian cycle gives a spanning tree: find the minimum spanning tree T of G rooted at a node r, let H be the list of vertices visited in a preorder tree walk of T starting at r, and return the cycle that visits the vertices in H order.
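A centralized sketch of that 2-approximation, assuming a Euclidean metric standing in for any metric: Prim's algorithm builds the MST, then a preorder walk gives the tour. The point set and helper structure are my own.

```python
import heapq

def mst_preorder_tsp(points):
    # Metric TSP 2-approximation: build an MST rooted at vertex 0 with
    # Prim's algorithm, then output the vertices in preorder of the tree.
    n = len(points)

    def dist(i, j):
        return ((points[i][0] - points[j][0]) ** 2 +
                (points[i][1] - points[j][1]) ** 2) ** 0.5

    parent = {0: None}  # doubles as the "in tree" set
    heap = [(dist(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while len(parent) < n:
        d, u, v = heapq.heappop(heap)
        if v in parent:
            continue  # already attached to the tree
        parent[v] = u
        for w in range(n):
            if w not in parent:
                heapq.heappush(heap, (dist(v, w), v, w))

    children = {u: [] for u in range(n)}
    for v, u in parent.items():
        if u is not None:
            children[u].append(v)
    order, stack = [], [0]  # iterative preorder walk of the tree
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
print(mst_preorder_tsp(points))
```

The tour costs at most twice the MST (walk the tree down and back, then shortcut repeats via the triangle inequality), and the MST costs at most the optimal tour, which is the factor-of-2 argument.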
Christofides' algorithm improves on this by modifying the MST. An Euler tour of a graph is one that visits each edge exactly once. Find the MST T of G rooted at r, compute a minimum-cost perfect matching M on the odd-degree vertices of T, and add M to T to get T'. H is then the list of vertices of an Euler tour of T' with duplicate vertices skipped; return the cycle that visits the vertices in H order. This is a 3/2-approximation algorithm for metric TSP.
Lecture 19 – Synchronous Distributed Algorithms: Symmetry-Breaking, Shortest-Paths Spanning Trees
Distributed algorithms are algorithms that run on a network of processors, or on one machine with multiple processors, and much of computing is about them. With many processes running in different places, you have to prepare for timing uncertainty, failures, and inputs and outputs passing between machines, which makes these algorithms harder to design, test, and analyze. They've been studied since 1967, with Dijkstra among the first to study them. This week is an introduction to the field along with two models and common algorithms. The models are synchronous and asynchronous distributed algorithms; the synchronous examples are leader election, maximal independent set, breadth-first spanning trees, and shortest-path trees, and the asynchronous examples are breadth-first spanning trees and shortest-path trees.
Getting formal models and real proofs is very important, as it's easy to make mistakes in distributed problems. Interacting state machines with inputs and outputs are used for distributed proofs; proofs use invariants shown by induction, and abstract descriptions are proved as well. Different complexity measures are used, such as rounds or real time for time complexity and messages or bits for communication complexity.
Distributed networks are represented as an undirected graph G=(V,E), with n=|V| the total number of nodes in the network, gamma(u) the set of neighbors of vertex u, and the degree of u the number of its neighbors, deg(u)=|gamma(u)|. A process is associated with each vertex, and two directed communication channels run along each edge, one per direction. Failure isn't talked about this week. The teacher this week wrote a book called Distributed Algorithms. Processes at the nodes communicate with messages, each process having input and output ports that connect to its channels. Processes don't need to be distinguishable via IDs but should know the local names of the ports they have. An algorithm executes in synchronous rounds: each round, every process determines the messages to send on each port, these go onto the channels and are delivered to the processes at the other end, which change state based on the arriving messages. Costs of local computation are usually ignored in favor of communication costs in this lecture.
Leader election is a problem where, in a connected graph G=(V,E), one node elects itself leader, gaining special permissions for helping run the system such as coordinating tasks or allocating resources. In the simple case of a clique network, where all vertices are directly connected to all others, it's impossible for identical deterministic processes to elect a leader: every node is essentially the same and can't distinguish itself, so if one process ever decides it's the leader, they all do. Something more is needed for leader election to work, such as unique identifiers or randomness. With UIDs, assume processes have one and that UIDs are ordered; then an algorithm with identical processes can elect a leader using n^2 messages in one round, with the process holding the maximum UID electing itself. With randomness, processes can just choose random identifiers, rerolling if there are duplicates: processes choose random ids from a sufficiently large space, exchange them, and if the maximum is unique, that process wins; otherwise, repeat as many times as necessary. Each round succeeds with probability at least 1-epsilon, so the expected number of rounds is under 1/(1-epsilon).
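The randomized clique version is a few lines when simulated centrally. A minimal sketch, with the id space size as a tunable assumption; in a real system each process would draw and exchange its id over the network rather than in one loop.

```python
import random

def election_round(n, id_space):
    # One synchronous round: every process draws a random id, all ids are
    # exchanged, and the holder of the maximum wins if that id is unique.
    ids = [random.randrange(id_space) for _ in range(n)]
    top = max(ids)
    return ids.index(top) if ids.count(top) == 1 else None

def elect_leader(n, id_space):
    # Repeat rounds until one succeeds; with a large id space relative to n,
    # the expected number of rounds is barely above 1.
    rounds, leader = 0, None
    while leader is None:
        rounds += 1
        leader = election_round(n, id_space)
    return leader, rounds

leader, rounds = elect_leader(8, 10**6)
print(leader, rounds)
```

With 8 processes drawing from a million ids, a tie in the maximum is vanishingly rare, which is where the "finishes in one round with probability at least 1-epsilon" claim comes from.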
The second problem is maximal independent set (MIS), where given an undirected graph you choose a maximal independent set. Independent means no two neighbors are in the set, and maximal means you can't add nodes without violating independence. For the distributed version, assume there are no UIDs and that processes have a good upper bound on n. The goal is to compute an MIS S of the entire network, with each process outputting whether it's in or out. This is unsolvable by deterministic algorithms on some graphs, so randomized algorithms are used instead. Distributed MIS comes up in communication systems, to choose which nodes may communicate on an overlay network, and in developmental biology, where cells distinguish themselves from their neighbors as in fruit flies.
Luby's MIS algorithm executes in phases of two rounds each. All nodes start as active, and at each phase some decide whether they're in or out of the MIS while others remain undecided; the next phase continues with the remaining nodes. In each phase an active node u picks a random value r from a large space (1 to n^5) and sends it to its neighbors; if its value is larger than all of its active neighbors', it's a local maximum and outputs "in." A joining node sends messages to its neighbors, who output "out." New random values are chosen each phase. If this terminates, no two neighbors have joined, since a joiner's neighbors all go out, and a node only becomes inactive when it or a neighbor joins, so nothing more could possibly be added. This is guaranteed to terminate, and with probability 1-(1/n) it finishes in 4 lg n phases, because each phase removes half the edges in expectation. If a node u has a neighbor w that chooses a value bigger than all of w's other neighbors' values and bigger than u's other neighbors' values, then w joins and u goes out; the probability of this happening is 1/(deg(u)+deg(w)), so the probability node u goes out is at least the sum over its neighbors w of 1/(deg(u)+deg(w)). The probability of an edge dying is the probability of one of its endpoints dying, and the expected number of edges dying per phase is at least half the live edges, so at the end of each phase the expected number of live edges is at most half the number at the beginning, with a small probability otherwise.
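A centralized simulation of Luby's phases is a compact way to see the in/out/undecided states. A sketch under my own conventions (adjacency dict, random values drawn per phase from roughly the n^5 space the lecture mentions); a real distributed version would exchange the values as messages.

```python
import random

def luby_mis(adj):
    # Simulation of Luby's algorithm: each phase, every active node draws
    # a random value; local maxima join the MIS, and joiners plus all
    # their neighbors deactivate. adj maps node -> set of neighbors.
    active, mis = set(adj), set()
    space = len(adj) ** 5 + 1  # large space so ties are unlikely
    while active:
        r = {u: random.randrange(space) for u in active}
        joiners = {u for u in active
                   if all(r[u] > r[v] for v in adj[u] if v in active)}
        mis |= joiners
        active -= joiners | {v for u in joiners for v in adj[u]}
    return mis

# a triangle 1-2-3 with a pendant node 4 hanging off 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
mis = luby_mis(adj)
print(mis)
```

Which MIS comes out depends on the random draws ({3} and {1, 4} are both valid here), but independence and maximality hold on every run, matching the termination argument in the text.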
Breadth-first spanning trees are a familiar problem in a new setting. Assume a connected graph G=(V,E) with a distinguished vertex v0 as the root of the BFS tree. Processes don't need knowledge about the graph and know their own UID; we can assume WLOG that processes know the UIDs of their neighbors and the connected ports, and algorithms are deterministic or nondeterministic but not randomized. Processes produce a BFS tree rooted at v0: spanning means it should reach all vertices, and the paths from the root should be shortest paths. Every node just needs to know its parent in the tree. A simple algorithm has processes mark themselves as they get incorporated into the tree, starting with the root process i0 marked. In round 1, if a process is i0 it sends a search message to its neighbors; a process receiving a message marks itself, sets i0 as its parent (which it outputs), and gets ready to send a message next round. In later rounds, any process that planned to send sends a search message to its neighbors, and an unmarked process receiving messages marks itself, selects a sending neighbor (arbitrarily) as the parent it outputs, and plans to send next round. Once a node is marked it ignores future messages.
This is nondeterministic, since a process can choose among possible parents, though you could add determinism by choosing based on UID; the algorithm should work correctly however the nondeterminism resolves. The time complexity, the number of rounds for all nodes to output their parent information, is the maximum distance of any node from v0, which is at most the diameter. The message complexity, the number of messages sent throughout the execution, is O(|E|). The algorithm can be augmented so a node responds to a search message with whether the sender became its parent, or so a node sets its distance based on a received distance variable. To ensure termination, a node that gets no responses claiming it as a parent knows it's a leaf, and leaves report "done"; once a node receives done messages from all its children, its subtree is complete, and when the root gets them the entire tree is complete. The root then needs to tell the other nodes, since they previously only knew about the completion of their own subtrees. After the tree is constructed it can be used to broadcast messages to all other nodes, each broadcast taking diameter-many rounds and n messages, and the root can also collect data from the nodes by sending it up the tree in a "convergecast."
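A centralized, round-by-round simulation of the marking algorithm makes the rounds and parent choices concrete. The toy graph is my own; in the real distributed setting the arbitrary choice among simultaneous senders is where the nondeterminism lives.

```python
def bfs_spanning_tree(adj, root):
    # Round-by-round simulation of the synchronous search algorithm:
    # every newly marked node sends "search" to its neighbors the next
    # round, and an unmarked node adopts the first sender as its parent.
    parent = {root: None}
    frontier = [root]  # nodes that will send this round
    rounds = 0
    while frontier:
        rounds += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in parent:  # unmarked: mark it and record a parent
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent, rounds

# a 4-cycle: node 3 could end up parented to either 1 or 2
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
parent, rounds = bfs_spanning_tree(adj, 0)
print(parent, rounds)
```

Both choices of parent for node 3 give a correct shortest-path tree, which is the sense in which the algorithm is correct under any resolution of the nondeterminism.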
Weights can be added to each edge to make this a shortest paths problem. Given a graph G=(V,E) where processes have UIDs and know their neighbors and edge weights, a root vertex v0 is chosen. This is a familiar problem that can be distributed, with each node outputting its parent and distance (if not the root). The distributed algorithm here is the Bellman-Ford shortest paths algorithm. Each node keeps track of its distance, parent, and UID, starting with a best distance from the root of infinity. At each round, each node sends its distance to its neighbors and receives messages from its neighbors. As before, a relaxation step is done where a node improves its distance and resets its parent to a node sending a shorter distance. Each node does its own thing, sending out its information and waiting for a better distance to arrive.
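The per-round broadcast-and-relax loop can be sketched like this; the weighted graph and variable names are made-up illustrations:

```python
import math

# Synchronous distributed Bellman-Ford: every round each node sends its
# current distance to all neighbors, and receivers relax on the values.
def distributed_bellman_ford(adj, root):
    n = len(adj)
    dist = {u: math.inf for u in adj}
    parent = {u: None for u in adj}
    dist[root] = 0
    for _ in range(n - 1):                       # n-1 rounds suffice
        # snapshot of messages sent this round: (receiver, sender, offered distance)
        msgs = [(v, u, dist[u] + w) for u in adj for (v, w) in adj[u]]
        for v, sender, d in msgs:
            if d < dist[v]:                      # relaxation step
                dist[v], parent[v] = d, sender
    return dist, parent

adj = {
    'r': [('a', 4), ('b', 1)],
    'a': [('r', 4), ('b', 2)],
    'b': [('r', 1), ('a', 2)],
}
dist, parent = distributed_bellman_ford(adj, 'r')
print(dist)    # {'r': 0, 'a': 3, 'b': 1}
```

Node 'a' first hears distance 4 directly from the root, then improves to 3 via 'b' in the next round, exactly the relaxation behavior described.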
Lecture 20 – Asynchronous Distributed Algorithms: Shortest-Paths Spanning Trees
Continuing from the same shortest paths problem, this is a bit trickier than BFS. Shortest-paths spanning trees take n-1 rounds with O(n|E|) message complexity, more expensive than BFS's diameter rounds and O(|E|) messages. Convergecast for sending messages back up can be trickier, since a node waits for messages from its children while its children can change. There can be lots of false starts, though eventually the right information makes its way up. The video for this one is noticeably out of sync, with the slide video appearing in the wrong section from what I can tell.
Looking at asynchronous algorithms, the ordering that rounds provided disappears and messages can travel between systems at any time. Instead of understanding exactly how they execute, you look at abstract properties of their executions. Processes at the nodes of an undirected graph communicate with messages. Communication channels are associated with edges, one in each direction. Each process has input and output ports to communication channels, and processes don't need to be distinguishable. A channel is an automaton denoted C(u,v) — an "infinite state automaton" with inputs and outputs, where inputs are messages being sent and outputs are messages received. A state variable mqueue is a FIFO queue which starts empty. A send(m) transition adds message m to mqueue, and a receive transition removes and delivers the head of the queue if it exists.
The rest of the system consists of processes, where the process automaton at vertex u is denoted Pu; each vertex in the graph is associated with a process. The process has send(m)u,v outputs and receive(m)v,u inputs along with additional inputs, outputs, and state variables, and takes steps. An example is a Max(u) process automaton with those input and output actions plus state variables: "max," a number, and a send flag for each neighbor, initially true. Transitions occur on each send and receive. For a receive, if m is greater than max, then max becomes m and the send flag is set for each neighbor. For a send, if send(neighbor) is true and m equals max, the message is sent and send(neighbor) is set to false.
There aren’t synchronous rounds any more, so the system performs enabled steps one at a time in any order — still a sequence of individual steps, not concurrent ones. Each channel will eventually deliver the first message in its queue, and each process eventually performs any enabled step. Each process starts with an initial value xu, sends its initial value, and updates its max until every node holds the global maximum. Sending and receiving can happen in different orders, but the global maximum will eventually be found. Message complexity, the total number of messages sent, is O(n|E|); time complexity isn't as obvious. Here, real time is used instead of the standard asymptotic complexity or rounds. If d is how long a channel takes to deliver a message and l is the time for a process to perform a step, then ignoring local processing time, the max system's upper bound is O(diameter·n·d): considering the time for the max to reach a vertex along a shortest path, each channel can hold up to n messages and so take up to n·d time at worst. These times are only for analyzing complexity and aren't known to an actual system.
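A rough simulation of the Max automaton under an adversarial scheduler might look like the following; the FIFO channels, graph, and seeded scheduler are my own assumptions for illustration:

```python
import random

# Asynchronous max propagation: one FIFO channel per directed edge, and a
# scheduler that delivers any pending message in arbitrary order.
def async_max(adj, init, seed=0):
    rng = random.Random(seed)
    maxval = dict(init)
    # every node initially sends its value to all neighbors
    channels = {(u, v): [init[u]] for u in adj for v in adj[u]}
    pending = [e for e, q in channels.items() if q]
    while pending:
        u, v = rng.choice(pending)               # arbitrary enabled step
        m = channels[(u, v)].pop(0)              # deliver head of FIFO queue
        if m > maxval[v]:
            maxval[v] = m
            for w in adj[v]:                     # propagate the new max
                channels[(v, w)].append(m)
        pending = [e for e, q in channels.items() if q]
    return maxval

adj = {0: [1], 1: [0, 2], 2: [1]}
result = async_max(adj, {0: 5, 1: 9, 2: 3})
print(result)   # every node ends with 9
```

Whatever order the scheduler picks, the loop terminates because a node only forwards when its max strictly increases.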
The next example looks at building a breadth-first spanning tree on an asynchronous network. Running the same synchronous algorithm is a problem because it doesn't make corrections and can get stuck with the wrong parent based on timing. There's a lot more nondeterminism in the order of message deliveries and process steps (though not in process decisions). The process automaton for this example receives and sends search messages along with another output action, parent(v)u, to output its parent. Initially parent ∈ Γ(u) ∪ {⊥} is ⊥ (the ⊥ symbol, "bottom," means undefined here), reported is initially false, and for each neighbor v ∈ Γ(u), send(v) ∈ {search, ⊥} is initially search if u is v0 and ⊥ otherwise. When a process receives a message, if it doesn't have a parent and isn't the root, it sets the parent to v and queues a search to its neighbors. A send delivers the message and resets send(neighbor) to ⊥. parent(v)u reports and sets reported to true if its parent is v and reported is false. This algorithm can produce paths longer than shortest paths, as longer paths can let messages propagate faster. Message complexity is O(|E|) and time is O(diameter·d).
For a second attempt at this problem, a relaxation algorithm appropriate to asynchronous networks is used. Each process keeps track of hop distance and changes its parent when it learns of a shorter path, propagating improvements; eventually this creates a BFS tree. This adds a distance state variable which starts at 0 for the root and infinity otherwise, changes receive to check whether m(essage)+1 is less than dist and set dist to m+1 if so, and adds dist to send(neighbor). Correctness can no longer be shown after a fixed number of rounds; to show you get the right answer, you have to argue that by a certain time you'll have the right result. Message complexity becomes O(n|E|) and time complexity is O(diam·n·d). You can use convergecast again to send done messages from children, though this may run multiple times before it's correct.
Asynchronous Bellman-Ford has to handle long, small-weight paths and asynchrony, leading to message and time complexity that's exponential. The changes here are that for each v ∈ Γ(u), send(v) is a FIFO queue of naturals, initially (0) for the root and empty otherwise. Receive says if m+weight(v,u) < dist, then dist = m+weight(v,u), parent = v, and dist is added to send(neighbor) for each neighbor. An invariant for showing correctness says that if dist isn't infinity then it's the distance along some path, and a timing property says that if p is an at-most-r-hop path then by a certain time node u's dist is at most the weight of p. The problem with this solution is that there are O(n!) simple paths from v0 to any other node u, which is O(n^n) and could be the number of messages sent on a channel. Message complexity is O(n^n·|E|) and time complexity is O(n^n·n·d). Emptying a channel with exponentially many messages can take exponential time, which can happen with repeated small updates — pretty bad despite being a correct algorithm. Convergecast can once again show the algorithm has finished, though. The moral is that unrestrained asynchrony can cause poor performance, and there's a whole course called Distributed Algorithms which goes further with more synchronous and asynchronous algorithms and honestly is something I should check out.
Recitation 10 – More Distributed Algorithms
Given a ring of nodes (with left and right pointers) with UIDs, leader election is possible by each node sending its UID clockwise until it gets it back; a node that receives its own UID knows it's the largest, making it the leader. This takes n time and requires O(n²) messages. The Hirschberg-Sinclair algorithm can improve on this by searching to successively doubled distances. With this algorithm, each process sends its own UID with a hop count h and a direction d in each message. If a process receives another node's message, it decreases the hop count by 1 and forwards it while the hop count is greater than 0 and the UID is greater than or equal to any UID it has seen. If it's the largest UID seen and the hop count is zero, the message is sent back in the other direction. If a node doesn't receive its own message back, it stops sending messages. If a node receives its (UID, h, left) message on its right and its (UID, h, right) message on its left, then it's the largest and becomes the leader. The total number of messages is O(n lg n), and this works in an asynchronous ring as well.
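The basic O(n²)-message scheme (not Hirschberg-Sinclair) can be sketched as a synchronous simulation; the ring contents and round loop are illustrative and assume distinct UIDs:

```python
# Basic ring leader election: every node sends its UID clockwise; a node
# forwards only UIDs larger than its own, so only the maximum survives a
# full trip. A UID that returns home makes its owner the leader.
def ring_leader(uids):
    n = len(uids)
    messages = {i: [uids[i]] for i in range(n)}   # waiting to go clockwise
    leader = None
    for _ in range(n):                            # n rounds for a full lap
        incoming = {i: [] for i in range(n)}
        for i in range(n):
            for uid in messages[i]:
                incoming[(i + 1) % n].append(uid)
        messages = {i: [] for i in range(n)}
        for i in range(n):
            for uid in incoming[i]:
                if uid == uids[i]:
                    leader = uids[i]              # my UID came back: I win
                elif uid > uids[i]:
                    messages[i].append(uid)       # forward larger UIDs only
    return leader

print(ring_leader([3, 7, 2, 5]))   # 7
```

Every node injects a message that can travel up to n hops, which is where the O(n²) message bound comes from.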
The next problem shows an asynchronous algorithm for counting nodes. If a graph has two or more nodes and a root process i0 at v0, use an algorithm to let i0 compute the number of nodes in the network. This sets up a spanning tree and adds a convergecast to accumulate the sum on its way up. The tree is set up so the sender of the first search message becomes a node's parent, and messages get passed on: for v0, send(v) for each neighbor v is a queue initialized with (search). Other nodes have a parent value and a per-neighbor send queue that starts empty; when they receive a search and don't have a parent yet, they set the sender as their parent and add search to each neighbor's send queue. Next, parent(true) or parent(false) responses are sent and tracked: nodes keep responded and children variables and add any neighbor who responded true to their children when the message arrives. Lastly, the convergecast works by adding done and total variables to each node, with each node sending its total to its parent when done and the root outputting the final value.
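Once the tree exists, the convergecast accumulation itself is simple; here's a sketch where a post-order traversal stands in for done messages moving up the tree (the tree shape is made up):

```python
# Convergecast counting on an already-built spanning tree: each leaf
# reports 1, each internal node reports 1 plus its children's totals,
# and the root ends up with the node count.
def convergecast_count(children, root):
    def total(u):
        return 1 + sum(total(c) for c in children.get(u, []))
    return total(root)

children = {'r': ['a', 'b'], 'a': ['c', 'd'], 'b': []}
print(convergecast_count(children, 'r'))   # 5
```

In the real protocol each node waits until all children have sent done messages before reporting upward; the recursion here mirrors that dependency.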
Lecture 21 – Cryptography: Hash Functions
This lecture covers a different application of hash functions and the additional properties it requires, focusing on cryptography. There's no real change to the definition of a hash function, which still maps arbitrary strings of data to a fixed-length output in a deterministic, public, and random-looking manner. These outputs are thought of as large sequences of bits, and hashes need to be computed and used by other applications rather than just stored. It should be infeasible to discover collisions. For instance, h: {0,1}* → {0,1}^d gives a string of length d from an arbitrary-length input. There are no secret keys in these examples and all operations are public, meaning anyone can compute h in polynomial time. Since no one should be able to discover a collision, the O(1)-time hash functions from earlier won't work here. MD4 and MD5 were hashing algorithms assumed to have collision resistance but ended up broken. SHA-1 is a 160-bit algorithm that's expected to be broken soon, and SHA-3 is the recommendation at the time of this lecture.
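The fixed-length, deterministic behavior is easy to see with Python's hashlib (using SHA-256 here rather than the SHA-3 the lecture recommends; the inputs are arbitrary):

```python
import hashlib

# A cryptographic hash maps any-length input to a fixed-length digest,
# deterministically, with small input changes giving unrelated outputs.
h1 = hashlib.sha256(b"hello").hexdigest()
h2 = hashlib.sha256(b"hello").hexdigest()
h3 = hashlib.sha256(b"hello!").hexdigest()

print(len(h1))      # 64 hex chars = 256 bits, regardless of input length
print(h1 == h2)     # deterministic: True
print(h1 == h3)     # one extra character, completely different digest: False
```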
The ideal would be a "random oracle" with the desired properties, though this isn't achievable in practice. The oracle works like this: on an input x ∈ {0,1}*, if x isn't in a book (storing all previous answers), flip a coin d times to determine h(x) and record (x, h(x)) in the book; if x is in the book, return the recorded y. You get a random answer the first time and the same answer ever after. The problem is that everyone would need to query the same book, and that consistency is impossible to achieve, so real hash functions only approximate it. The random oracle is still useful since it's easy to understand and makes the desired properties visible.
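The "book" is just a memoized table of coin flips, which makes it easy to simulate (the class name, d, and seed are illustrative assumptions):

```python
import random

# Toy random oracle: answer each new input with d fresh coin flips,
# record the answer in the "book," and replay it on repeat queries.
class RandomOracle:
    def __init__(self, d, seed=0):
        self.d = d
        self.book = {}                       # records (x, h(x)) pairs
        self.rng = random.Random(seed)

    def query(self, x):
        if x not in self.book:               # flip d coins and write them down
            self.book[x] = ''.join(self.rng.choice('01') for _ in range(self.d))
        return self.book[x]

oracle = RandomOracle(d=8)
a = oracle.query("foo")
print(a == oracle.query("foo"))   # consistent on repeat queries: True
```

The impossibility isn't in this code — it's that every party in the world would need access to the same shared book.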
One desirable property is one-way-ness (OW), which says it's infeasible, given a d-bit vector y, to find any x such that h(x)=y (where x is a pre-image of y). Given a specific hash, say a 160-bit string, it should be infeasible to discover an x value that produces it. In the random oracle model you'd have to query the effectively unbounded book until a match appeared, which takes far worse than exponential time. A hash function like h(x)=x mod p, by contrast, is easily inverted by plugging in values; with the simple hash functions looked at before, it's easy to plug in guesses. As hash functions get more complicated, it becomes harder to break them, and not all hash functions have simple inverses — functions without accessible inverses satisfy one-way-ness.
The next desirable property is collision resistance (CR), where it's infeasible to find x, x' such that x ≠ x' and h(x)=h(x'), meaning you won't find a collision. TCR, or target collision resistance, is the property that it's infeasible, given x, to find an x' ≠ x such that h(x)=h(x'). Pseudo-randomness says the behavior is indistinguishable from that of a pseudo-random function. Lastly, non-malleability says it's infeasible, given h(x), to produce h(x') where x and x' are related in some fashion — for instance, if x' is x+1, someone shouldn't be able to look at your hash and know their input is higher or lower (this comes up with hashed bidding). For relationships between these properties: if h is CR then it's also TCR, as CR is strictly stronger; the reverse isn't true.
Some applications of hash functions in cryptography are given. Password storage is one: the computer stores the hash of your password and compares it against the hash of whatever you enter, avoiding having to store the password itself. If the domain is small enough, it's easy to enumerate, which is part of why longer passwords are better. One-way-ness is important since the stored hash shouldn't be usable to recover the password; collision resistance helps avoid a lucky guess, though it isn't strictly required since systems can do things like lock a user out after a certain number of attempts.
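A minimal sketch of hash-based password checking; the salt is standard practice (not discussed above) to defeat precomputed tables, and the function names and password are made up:

```python
import hashlib
import os

# Store only salt + hash; verify by recomputing and comparing.
def store(password):
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest                      # the plaintext is never stored

def check(password, salt, digest):
    return hashlib.sha256(salt + password.encode()).hexdigest() == digest

salt, digest = store("hunter2")
print(check("hunter2", salt, digest))   # True
print(check("hunter3", salt, digest))   # False
```

Real systems would use a deliberately slow KDF (bcrypt, scrypt, Argon2) instead of a single fast hash, precisely because small password domains are enumerable.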
Another use case is a file modification detector: for each file F, h(F) is stored securely where it can't be changed, and you can check whether F was modified by recomputing h(F) and comparing it to the stored hash. An attack succeeds if the attacker modifies the file while keeping h(F) the same, so protecting against this requires TCR. The next example is digital signatures, a way of digitally signing a document using a secret private key so that anyone with the public key can verify the signature's authenticity. Signing creates a signature sigma = sign(private_key, message), and verify(public_key, message, sigma) outputs true or false. Public and private keys are distinct and reveal nothing about each other, though they have some mathematical relationship to make the process work. For a large message m, it's easier to sign the smaller h(m), replacing m with h(m) in these functions. The risk: Alice signs h(m), but Bob claims she signed m' where the hashes of the two messages are equal. For this process to work, it requires TCR.
Another application is commitments — keeping promises. Say Alice and Bob are in an auction where Alice's bid has value x; she computes C(x) and submits it. The auctioneer can see C(x), and when bidding is over, Alice "opens" C(x) to reveal x. If Bob had information about x, he'd be able to submit x plus 1 and win. C(x) can't reveal x, so it needs to be one-way, and it must be collision resistant so Alice can't change her bid. It also requires non-malleability, since given C(x) Bob shouldn't be able to bid one higher. Even with these properties, it's possible to come up with a bad hash function that gives away information. Lots of hash functions get broken, and this is a changing field.
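A standard hash-based commitment in the spirit of the auction example can be sketched like this; the random nonce keeps a small bid space from being enumerable, and the construction and names are my own illustration, not the lecture's exact scheme:

```python
import hashlib
import os

# Commit: publish C = H(nonce || value). Open: reveal (value, nonce) and
# let anyone recompute the hash. One-wayness hides the bid; collision
# resistance stops Alice from reopening to a different value.
def commit(value):
    nonce = os.urandom(16)
    c = hashlib.sha256(nonce + str(value).encode()).hexdigest()
    return c, nonce

def open_commitment(c, value, nonce):
    return hashlib.sha256(nonce + str(value).encode()).hexdigest() == c

c, nonce = commit(100)                 # the auctioneer sees only c
print(open_commitment(c, 100, nonce))  # True: bid revealed and verified
print(open_commitment(c, 101, nonce))  # False: Alice can't change her bid
```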
Lecture 22 – Cryptography: Encryption
This week's lecture covers encryption. The first example is symmetric key encryption, which assumes there's a secret key k, say a 128-bit number. The ciphertext is c = e_k(m), where e is the encryption function and the message m is the plaintext. Encryption needs to be a polynomial-time algorithm since there could be streams of messages. The reverse is m = d_k(c), where d is the decryption function. This is required to be reversible, as opposed to the one-way hashes from last lecture. AES is a widely used symmetric key cipher running with 128- or 256-bit keys, where encryption and decryption mirror each other, with decryption running the steps in reverse. The main ideas here are symmetry and reversibility.
Key exchange is the process of sharing the secret key. This needs to be secure against eavesdroppers, so it can't just be posted on a public website. An example has Alice and Bob on islands with a pirate ship going back and forth: it will deliver a locked box, open an unlocked one, and steal any key that's sent. How can Alice and Bob exchange a message in a box successfully? Alice puts a message in a box, locks it with K_A, and sends it to Bob, who locks the box with K_B and sends it back to Alice. Alice removes K_A and sends the box to Bob, who unlocks K_B and reads the message. This would be a problem if the locks nested so that you couldn't remove K_A without first removing K_B — commutative locks and keys are required for solving this problem.
Diffie-Hellman key exchange is an algorithm for secure key exchange assuming you have commutative "locks." G = F_p*, the finite field mod p where p is prime, considering only the invertible elements 1 to p-1. Alice selects a random a, computes g^a mod p, and sends it to Bob, who selects b, computes g^b mod p, and sends it to Alice. Alice can compute (g^b)^a mod p = k, and Bob can compute (g^a)^b mod p = k, creating a shared secret between the two. The discrete log problem: given g^a, compute a. The Diffie-Hellman problem: given g^a and g^b, compute g^(ab). These two problems must be hard for the protocol to be secure.
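The exchange can be sketched with Python's built-in modular exponentiation; the tiny prime, generator, and secret exponents below are toy values (real deployments use primes thousands of bits long):

```python
# Toy Diffie-Hellman: both sides end up with g^(ab) mod p without ever
# sending a or b across the channel.
p, g = 467, 2                 # public modulus and generator
a, b = 153, 197               # Alice's and Bob's secret exponents

A = pow(g, a, p)              # Alice sends g^a mod p
B = pow(g, b, p)              # Bob sends g^b mod p

k_alice = pow(B, a, p)        # (g^b)^a mod p
k_bob = pow(A, b, p)          # (g^a)^b mod p
print(k_alice == k_bob)       # shared secret agrees: True
```

An eavesdropper sees p, g, A, and B; recovering the key from those is exactly the Diffie-Hellman problem.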
What if the pirates had their own lock and key for the lock? Alice locks the box and sends it to the pirates who lock it with their own keys and send it back. Thinking this is Bob’s lock, Alice unlocks her lock and sends it back to the pirates who unlock it and retrieve the message. This is a man-in-the-middle attack. To solve this, Alice needs to be able to authenticate Bob and know that a lock came from Bob. This is where public keys come in. Putting a public key online is a way of identifying yourself. With public key encryption, given a message and public key, you can obtain cipher text. So Alice could encrypt her message with Bob’s public key and send it to Bob. Bob’s distinct private key allows him and only him to decrypt the message. The public key and secret key can’t reveal information about each other but will need to have a mathematical relationship.
The RSA algorithm uses primes and number theory to accomplish this; its functionality and security should be considered separately. RSA's functionality: Alice picks two large secret primes p and q, computes N = pq, and chooses an encryption exponent e satisfying gcd(e, (p-1)(q-1)) = 1; e is usually a small number to make encryption fast. Alice's public key is (N, e). She gets the decryption exponent d with the extended Euclidean algorithm so that e·d ≡ 1 (mod (p-1)(q-1)); her private key is (d, p, q). An adversary who sees N shouldn't be able to factor it into p and q. Showing this works requires Fermat's little theorem, which says m^(p-1) ≡ 1 (mod p) for p prime and gcd(m, p) = 1. Let phi = (p-1)(q-1); since ed ≡ 1 (mod phi), ed = 1 + k·phi. Encryption is c = m^e (mod N), decryption is m = c^d (mod N), and m^(ed) should equal m. There are two cases: if gcd(m, p) = 1, then by Fermat's theorem m^(ed) = m^(1+k·phi) = m·(m^(p-1))^(k(q-1)) ≡ m (mod p); if gcd(m, p) = p, then m ≡ 0 (mod p) and m^(ed) ≡ m (mod p) trivially. The same argument works mod q, and since p and q are distinct primes, m^(ed) ≡ m (mod N). RSA is the first public key algorithm to stand the test of time, although the parameter sizes have grown. Given N, it should be hard to recover p and q. The RSA problem: given N, c, and e with gcd(e, (p-1)(q-1)) = 1, find m such that m^e ≡ c (mod N). Being able to factor N would break it.
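The steps above can be traced with tiny primes; this is wildly insecure and exists only to make the arithmetic concrete (p, q, e, and m are toy values):

```python
from math import gcd

# Toy RSA: key generation, encryption, decryption.
p, q = 61, 53
N = p * q                     # public modulus, 3233
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # encryption exponent, gcd(e, phi) == 1
d = pow(e, -1, phi)           # decryption exponent via modular inverse

m = 65                        # message, with 0 <= m < N
c = pow(m, e, N)              # encrypt: c = m^e mod N
print(pow(c, d, N))           # decrypt: c^d mod N recovers 65
```

`pow(e, -1, phi)` (Python 3.8+) performs the extended-Euclidean inverse the lecture mentions.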
The problem of asking whether N is composite is in NP but not known to be NP-complete, while checking whether a graph is "3-colorable" is NP-complete. A reasonable question is whether we should be building cryptography on NP-complete problems. Knapsack is NP-complete as well, but unfortunately, schemes based on knapsack turned out to be easier to crack. That's because NP-completeness is based on the worst case, and in the average case a problem like 3-colorability is easy. Factoring is hard in the average case, and problems hard in the average case are the appropriate ones for cryptography. A special type of knapsack problem, the superincreasing knapsack, is solvable in linear time. The idea for a knapsack cryptosystem is a private key based on a superincreasing knapsack, with a private transform to a hard-looking general knapsack that serves as the public key — you need a situation where, on average, it's hard to find what generated the ciphertext.
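The linear-time greedy for the superincreasing case can be sketched as follows; the weights and target are made up, and the key fact is that each weight exceeds the sum of all smaller ones:

```python
# Superincreasing knapsack: scan weights from largest to smallest. If a
# weight fits, it must be taken, since all smaller weights together can't
# cover the remaining target.
def solve_superincreasing(weights, target):
    chosen = []
    for w in reversed(weights):      # weights assumed sorted ascending
        if w <= target:
            chosen.append(w)
            target -= w
    return chosen if target == 0 else None

weights = [2, 3, 7, 14, 30]          # each exceeds the sum of the previous
print(solve_superincreasing(weights, 24))   # [14, 7, 3]
```

The general knapsack problem has no such greedy shortcut, which is exactly the asymmetry the proposed cryptosystem tries to exploit.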
Recitation 11 – Cryptography: More Primitives
A digital signature scheme is a pair of functions sign and verify, where alpha = Sign(sk_A, m) and b = Verify(pk_A, m, alpha). This can be used to verify the sender, and b should only be true if alpha is a signature of m. Forgeries shouldn't be possible, although there isn't a known way within the scheme itself to stop someone from replaying a message-signature pair they've seen before. In RSA, sign sets alpha = m^d mod n, and verify outputs true if alpha^e ≡ m (mod n), where d is the secret key and (n, e) is the public key. This approach treats sign as decryption and verify as encrypt-and-compare — digital signatures were originally conceived as the inverse of public key encryption: if m is viewed as a ciphertext, decrypting and then encrypting gets it back, and it can't be "decrypted" without the key. Unfortunately, if an attacker had two signed pairs they could easily compute a forgery. Categorizing encryption methods: for confidentiality, private-key encryption is the symmetric solution and public-key encryption the asymmetric one. For integrity, digital signatures are the asymmetric solution, while the symmetric one is message authentication codes, or MACs. These are similar to digital signatures but use a single key k: alpha = MAC(k, m), and verification recomputes the tag and checks for a match.
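The MAC idea is directly available in Python's hmac module; the key and message below are made up:

```python
import hashlib
import hmac

# Symmetric integrity: one shared key both produces and verifies the tag.
key = b"shared-secret-key"
msg = b"transfer $100 to Bob"

tag = hmac.new(key, msg, hashlib.sha256).hexdigest()   # alpha = MAC(k, m)

def verify(key, msg, tag):
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)          # constant-time compare

print(verify(key, msg, tag))                       # True
print(verify(key, b"transfer $999 to Bob", tag))   # False: message tampered
```

`compare_digest` avoids the timing side channel a naive `==` would leak.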
Given a scenario where Alice stores files on a server, how can she tell they aren't modified and are the latest version? A simple method stores a hash of every file locally; assuming that local copy can't be modified, she can detect modification. This is trusted or local storage, but it requires linear storage. Computing a single hash over all the files gives O(1) storage but O(n) time, since verifying or updating one file requires downloading all of them. A Merkle tree is the better solution: data blocks are hashed at the leaves and each internal node stores a hash of its children. The root hash is kept in trusted storage while the intermediate nodes can live in untrusted storage. This requires constant trusted storage and O(lg n) time per verification or update.
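A minimal Merkle root in the spirit of this description might look like the following; the padding rule (duplicating the last hash on odd levels) and the block contents are my own choices:

```python
import hashlib

def sha(b):
    return hashlib.sha256(b).digest()

# Hash the data blocks at the leaves, then repeatedly hash pairs of
# children until a single root remains.
def merkle_root(blocks):
    level = [sha(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"file0", b"file1", b"file2", b"file3"]
root = merkle_root(blocks)
tampered = merkle_root([b"file0", b"fileX", b"file2", b"file3"])
print(root != tampered)    # any modified block changes the trusted root: True
```

Verifying a single file only needs the O(lg n) sibling hashes on its root path, which is where the logarithmic time bound comes from.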
Lecture 23 – Cache-Oblivious Algorithms: Medians and Matrices
In real computers, accessing each memory location doesn't have the same cost. An actual computer has a memory hierarchy dictating which memory stores are fast and which are slow: the CPU can access around four levels of cache, main memory (RAM), flash, and disk, each able to store a different amount. Going down the hierarchy, data gets slower to access but more of it can be stored — reading from an on-chip cache is very fast while disk is very slow. Latency is how long data takes to come back; bandwidth is the other measurement. Levels get bigger but have higher latency. Bandwidth can be made quite large by adding memory stores and parallelism, but latency can't be reduced the same way. Blocking is the tactic for mitigating latency: when you ask for a single word, you actually get a whole block of data, and systems are designed with different block sizes based on this. The amortized cost per word in a block is latency/block-size + 1/bandwidth.
Better algorithms are needed to take advantage of this — algorithms that use all the elements in a fetched block, known as spatial locality. If the rest of the items in a block aren't needed, there isn't necessarily an improvement. Blocks in the cache should also be reused as much as possible, known as temporal locality. The external-memory model is a model for reasoning about this, considering two levels: the CPU has a constant number of registers and a fast pipe into a cache of size M divided into M/B blocks of size B, and there's a slow connection from the cache to the disk where data is transferred in full B-size blocks. Eventually the cache gets full. You count the number of memory transfers between levels — the number of blocks read from or written to disk — viewing cache accesses as free, and separately count the number of operations on the CPU. In this model, the algorithm explicitly reads and writes blocks.
Another model is the cache-oblivious model, which leads to cleaner algorithms that are trickier to analyze. It's almost exactly the same as the external-memory model except that the algorithm doesn't know the cache parameters B or M, and therefore has to work simultaneously for all values of B and M. The entire memory hierarchy is viewed as a giant array of words divided into blocks; when you access a word in memory, the system fetches the block containing it. The model can't explicitly read or write blocks, and when the cache is full, the system evicts the least recently used block. Each level of cache or memory type has its own latency and block size, so explicitly tuning for each would be tedious; a cache-oblivious algorithm works for any size and so handles all levels at once. Assuming a memory array is stored contiguously, without holes, one simple algorithm is
Scanning:
    total = 0
    for i in range(N):
        total += A[i]
This just accesses items in order and the first item in each block costs one. The cost of this will be ceil(N/B) in the external memory model and at most ceil(N/B)+1 in cache-oblivious. In general, this is viewed as O(N/B + 1). Another algorithm can run a constant number of parallel scans in an array A[0:N]. As long as M/B is at least 2, this is also O(N/B + 1).
Parallel scans (reversing an array in place):
    for i in range(N // 2):
        A[i], A[N - 1 - i] = A[N - 1 - i], A[i]
Better examples of cache-oblivious algorithms use divide and conquer. These algorithms recurse all the way down to a constant-size base case as usual; what's different is the analysis, which knows what B and M are and looks at the recursion levels where the problem fits in O(B) blocks or fits in the cache. The first divide and conquer example is median finding: given an unsorted array, find the median. Lecture 2 showed a worst-case linear-time algorithm, and the same algorithm works here. First, view the array as partitioned into columns of five. Sort each column and take its median, then recursively find the median of the medians x. Partition the array by x and recurse on one side. Because one recursive subproblem is much smaller than the other, this comes out to linear time instead of the n lg n that two balanced recursions would suggest. This requires the arrays to be contiguous and comes out to MT(N) = O(N/B + 1), where MT is the number of memory transfers.
Matrix multiplication is another divide and conquer example. Given two N×N matrices X and Y, compute Z = XY. With the standard matrix multiplication algorithm, computing each Z_ij costs O(N/B + 1), so the total cost is O(N³/B + N²). That isn't optimal — you can do better with divide and conquer on submatrix blocks, in the style of Strassen's subdivision, costing O(N³/(B√M)) memory transfers. In the last minute he covers LRU block replacement: on a sequence of block accesses, LRU with a cache of size M is within a factor of 2 of the optimal strategy with a cache of size M/2. This resource-augmented comparison approximates what's available to the algorithm and makes the analysis workable.
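The recursive block decomposition can be sketched as follows; this shows the recursion shape only (the cache benefits come from the layout, which plain Python lists don't model), and it assumes N is a power of two for simplicity:

```python
# Divide-and-conquer multiply: split each N x N matrix into four N/2 x N/2
# quadrants and recurse; each output quadrant is a sum of two products,
# giving eight recursive multiplications.
def mat_mult(X, Y):
    n = len(X)
    if n == 1:
        return [[X[0][0] * Y[0][0]]]        # scalar base case
    h = n // 2
    def quad(M, r, c):                      # extract an h x h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(A, B):
        return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    # Z[i][j] quadrant = X[i][0]*Y[0][j] + X[i][1]*Y[1][j]
    q = {}
    for i in (0, 1):
        for j in (0, 1):
            q[i, j] = add(
                mat_mult(quad(X, i * h, 0), quad(Y, 0, j * h)),
                mat_mult(quad(X, i * h, h), quad(Y, h, j * h)))
    top = [q[0, 0][r] + q[0, 1][r] for r in range(h)]
    bot = [q[1, 0][r] + q[1, 1][r] for r in range(h)]
    return top + bot

X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
print(mat_mult(X, Y))   # [[19, 22], [43, 50]]
```

The analysis works because once a subproblem's three submatrices fit in the cache, all further recursion below it is free of memory transfers.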
Lecture 24 – Cache-Oblivious Algorithms: Searching and Sorting
This week looks at searching and sorting with cache-oblivious algorithms and talks about future algorithm classes to potentially take. LRU is a good eviction strategy because the number of memory transfers LRU makes with a cache of size M is at most 2 times the optimum with a cache of size M/2. LRU belongs to a field called online algorithms: the CPU doesn't know what accesses will come and has to guess what will be best, while OPT is an offline algorithm that knows the best answer because it sees the future. To prove the bound, the timeline is divided into maximal phases of M/B distinct block accesses, viewing the timeline as a sequence of block IDs where accessing a block already in cache is free and anything else costs a transfer. Two claims about these phases: LRU_M(phase) is at most M/B, and OPT_{M/2}(phase) is at least half of M/B. The best case for OPT is that its entire cache is useful and already holds blocks of the phase at the phase's start — at most ½·M/B blocks come free, and the other half it has to load. So OPT pays at least half of what LRU pays.
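The LRU policy itself can be simulated in a few lines; the capacity and access trace here are illustrative:

```python
from collections import OrderedDict

# Toy LRU block cache: on a miss with a full cache, evict the least
# recently used block; count misses as memory transfers.
def simulate_lru(accesses, capacity):
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # a memory transfer
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return misses

print(simulate_lru([1, 2, 3, 1, 2, 4, 1], capacity=3))   # 4 misses
```

Three cold misses fill the cache, the reuse of 1 and 2 is free, and block 4 evicts the stale block 3 so the final access to 1 still hits.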
A search algorithm preprocesses n elements in the comparison model to support predecessor queries: given an element x, what's the largest element smaller than it? Elements can be assumed static. After sorting, you can binary search the array; the problem is that almost all accesses fall in different blocks until the very end, when everything becomes free. T(N) = T(N/2) + O(1) with base case T(O(B)) = O(1), giving T(N) = O(lg(N/B)) — a small improvement over lg N, but we can do better with a B-tree. There the cost is proportional to the height of the tree, O(log_B N) memory transfers, which is optimal. The problem is that B-trees need to know what B is.
The van Emde Boas layout is the structure that solves this problem. Take the elements you want to store and put them in a binary search tree, where O(log_B N) is obtained by storing the nodes in a careful order. The van Emde Boas order splits the tree into a top half and bottom halves and recurses until constant-size triangles remain, numbering the nodes recursively: the top triangle first, then each bottom triangle from left to right. Each triangle of at most B elements is then consecutive in memory and occupies at most 2 blocks. The cost of a search is at most twice the number of triangles visited on a root-to-node path, and the number of visited triangles is at most 2·lg N / lg B = 2·log_B N, so MT(N) is at most 4·log_B N. This gives the optimal O(log_B N) running time, ignoring constant factors.
Moving on to sorting, the obvious way to sort using B-trees is to do N inserts into a B-tree (cache-oblivious or otherwise), for MT(N) = O(N·log_B N). Mergesort is divide and conquer and cache-oblivious, but it costs MT(N) = 2·MT(N/2) + O(N/B + 1), with base case MT(Θ(M)) = O(M/B), since at that size you read the whole thing. Mergesort can be improved by splitting into more groups: M/B-way merge sort works particularly well here, since a merge costs O(N/B + 1) and M/B is how many parallel scans the cache can support. MT(N) = (M/B)·MT(N/(M/B)) + O(N/B + 1). The recursion tree has log_{M/B}(N/M) + 1 levels, which leads to MT(N) = O((N/B)·log_{M/B}(N/B)). Cache-obliviously you can achieve the same bound, requiring the assumption that M is at least B²; an N^ε-way merge is required, accomplished with an algorithm called funnel sort. You can also build a priority queue where each operation costs O((1/B)·log_{M/B}(N/B)) amortized memory transfers.
Lastly, the professor recommends other algorithm-related courses. Advanced Algorithms, 6.854, is the obvious follow-on, a graduate class covering advanced algorithms. 6.047 is Computational Biology, 6.850 is Computational Geometry, 6.849 covers geometric folding algorithms, 6.851 is Advanced Data Structures, and 6.852 is Distributed Algorithms. 6.853, Algorithmic Game Theory, is about optimal algorithms for games with lying players. 6.855, Network Optimization, goes further into graphs and network flow. 6.856 is Randomized Algorithms, 6.857 is Applied Cryptography, and 6.875 is Theoretical Cryptography. 6.816, Multi-Core Programming, focuses on parallel computation. 6.045 and 6.840 are general complexity classes, 6.841 is Advanced Complexity Theory, 6.842 is Randomized Complexity Theory, 6.845 is Quantum Complexity Theory, and 6.441 is Coding Theory, closely related to complexity theory.
Conclusion
After spending my free time over the last month going through algorithm courses, I'll be working soon on getting more practice in the space before looking at any of the courses recommended in the last lecture (assuming they're on OCW). It's crazy that there are more related follow-ups than courses I've gone through so far, with many leading to further follow-ups in their own niches. This will definitely be a course I come back to throughout my career, and hopefully at some point after finishing my current list I can return and learn more in this area. The next few courses will be mostly focused on software and programming in a less abstract manner, and I'm curious to see how MIT teaches those topics.