MIT Courses #6 – Introduction to Algorithms

Welcome to the halfway point in my project of going through MIT Open CourseWare courses to try and figure out what I missed by not going through a traditional computer science degree program. Initially I had a list of 12 courses required for MIT’s computer science bachelor’s degree and Introduction to Algorithms is the sixth course in that list. Perhaps more exciting, I watched some of the lectures from this course awhile ago when practicing algorithm problems on HackerRank, and was something that I’d been meaning to come back to after I got busy with work again. I remember being confused by some of the mathematical notation as well, even though a lot of the algorithms made sense at the time and this confusion was part of what made me decide to approach these courses by order of prerequisites.

Algorithm problems are something I’ll likely be coming back to many times throughout my career and this course is a good overview of a lot of them. Some of this was review for me and some of it was entirely new. Reviewing topics such as depth-first search, breadth-first search and dynamic programming is always useful and this course goes deeper on them than most of my other exposures to them, expanding on shortest path problems with algorithms such as Dijkstra and Bellman-Ford. There’s a four part series on dynamic programming as well and some of the later lectures talk about NP complete problems in more detail than I’ve seen before as well as some use cases of algorithms in chip design.

This course is probably the most relevant to me so far and Design and Analysis of Algorithms seems to be a promising follow up, along with some other courses mentioned in this one which aren’t on my list but I may go through anyway. With 24 lectures and a lot more on my plate, I only covered the lectures for now but at some point may want to go back to the recitations and the textbook to double check if I missed anything important. For the previous lectures, recitation videos usually gave the same information in a different way, slowing down to explain things that may not have been obvious from lectures. This was helpful in Calculus and Physics where I was less familiar with the language, but given that this course’s examples were mostly Python or pseudocode and I’m now familiar with the notation used for proofs, I didn’t find myself confused by the lecture very often. Plus, with the constant pitches for future algorithm courses, it’s hard not to want to keep going for now.

MIT Courses #6: Introduction to Algorithms – 6.006 By Prof. Erik Demaine and Prof. Srini Devadas

Lecture 1: Algorithmic Thinking, Peak Finding

The lecture starts off analyzing the asymptotic complexity of “relatively simple algorithms.” “This course is about efficient procedures for solving problems with large inputs.” Large input examples are facebook or the US highway system. The world is moving faster and we have the ability to compute on large inputs, but exponential complexity can still create problems. Two concerns are efficiency and scalability of problems on large data sets. The course will also go over “classic” data structures such as binary search trees, dictionaries/hash tables and more which have stood the test of time and use and modify them to solve problems. If we like this course, we should do 6.046, Design and Analysis of algorithms. This course will do real implementations of classic data structures and algorithms in Python. Each lecture will have a programming and theory component. The course has 8 modules and the first is Algorithmic Thinking. The other modules are Sorting and Trees, Hashing, Numerics, Graphs, Shortest Paths, Dynamic Programming and Advanced Topics.

Starting off the Algorithmic Thinking module is the problem of a peak finding. In a one-dimensional version, a peak finder works on a 1 dimensional array of numbers. A-i are numbers at position 1 to 9 in an array. Position 2 is a peak if and only if b is greater than or equal to a and b is greater than or equal to c. “If you’re equal to elements on your left and right you’re a peak.” For a and i, they only have to look to their one neighbor. The goal is to find a peak if it exists. The goal is to find a straightforward algorithm and find its complexity as n (the number of elements in the array) grows. With greater than or equal to, any array will have a peak. Without the “equal to”, you can’t make that argument. A straightforward algorithm starts from the left and walks across the array. In the case where the first peak is the middle, you look at n/2 elements but in the worst case it could be all the way on the right and complexity is theta(n). Theta gives both the upper and lower bounds. How can we do better than linear on the one dimensional problem?

The next algorithm is a divide and conquer algorithm which looks at the n/2 position or middle element and looks at left then right. If a[n/2] < a[n/2 – 1] then look at the left hand side for a peak, otherwise if a[n/2]<a[n/2 + 1], look at the right. If neither condition is met, then n/2 is a peak. If T(n) is the work an algorithm does on an input of size n, then T(n)=T(n/2)+theta(1). The base case is a 1 element array which is returned as a peak. The recursive case splits log₂n times, giving theta(log₂n). You can’t do better for the one-dimensional problem, but algorithms get more interesting for the two-dimensional version. In a two-dimensional array or matrix with n rows and m columns, a peak is a hill. For instance, a is a peak if and only if its greater than or equal to the items above below and to its left and to its left and right. The greedy ascent algorithm tries to solve this by picking a direction and going in it to try and find a peak. There are choices for where to start and which way to go and what to do when you hit an edge. Similar to the 1d’s linear approach, here you could end up with theta(nm) or theta(n²) where the array is a square. Trying something similar to the binary search algorithm, pick the middle column, j=m/2 and find a 1D peak at (i, j) to use as a start to find a 1D peak on row i. This is an incorrect strategy as it doesn’t work even if it only takes log(n) time. The problem is a 2D peak may not exist on row i. A recursive strategy that’s better than the greedy algorithm and also works is to pick the middle column, j=m/2 and find the global max on column j at (i,j) and compare (i,j-1), (i,j) and (i,j+1) and pick left if it’s bigger than (i,j) otherwise check the same for the right. If (i,j) is bigger than both, you found a 2D peak. Otherwise, solve the new problem at (i,j-1) or (i,j+1) with half the number of columns. The complexity is T(n,m)=T(n, m/2)+theta(n). The base case, T(n,1) = theta(n), so with the recursive case, T(n,m) = theta(nlog₂m).

Lecture 2: Models of Computation, Document Distance

The word algorithm comes from al-Khwarizmi, the founder of algebra who wrote a book on solving linear and quadratic equations. Algebra and algorithm both come from him. An algorithm is a computational procedure for solving a problem. Generally it has an input that leads to some output to solve the problem. Basically, it’s the mathematical analog for a computer program. The program is built on top of a language whereas algorithms are written in pseudocode which is structured English. A computer in the mathematical world is a model of computation. The model of computation specifies what operations you can do in an algorithm and how much time they cost. Two models of computation seen as styles of programming that give ways of thinking about algorithms are a random access machine or RAM and a pointer machine. Styles such as OOP have analogs as well. RAM(machine) is the mathematical analogic of RAM(memory). In the computer, it’s basically a giant array that can go from 0 to 4 gigabytes in computer RAM. Mathematical RAM is also an array. In constant time, an algorithm can load a constant number of words, do a constant number of computations and store or write a constant number of words. There are a constant number of registers to load words into. Words will be stored in locations specified by registered. This is very similar to assembly programming. A word in this case is w bits, which should be at least lg(size of memory) to be able to specify an index in the array.

Pointer machines are more abstract models which aren’t as powerful but offer simpler ways of thinking about things. This corresponds to object oriented programming and has dynamically allocated objects. An object has a constant number of fields where a field is either a word (such as an integer) or a pointer. A pointer points to another object or has a special value, null. Linked lists have fields in each node, a pointer to the previous and next nodes along with a value of its own. Following a pointer or changing a field takes constant time. Pointer is a weaker model as it could be implemented with RAM. These models have been used since the 70’s or 80’s while modern versions of Python are more recent and can be used to model either one.

Python’s lists are implemented as an array, not linked lists. MIT used to teach Scheme where lists were linked lists. In Python, altering an element in a list takes constant time while a linked list would take linear time. With a reasonable amount of attributes, object operations also take constant time. How do you do L.append(x) on a list, L. Copying an array would take linear time. Python uses table doubling which can basically be done in constant time to be shown later. L=L1+L2 takes a copy of two lists and adds them. This is the same as saying, L=[], for x in L1, L1.append(x) and the same for L2. This would take theta(1+len(L1)+len(L2)). Just seeing the + doesn’t mean that it’s constant time. What about “x in L”? This is linear time if you have to scan through a list. If x is complicated and “==” costs a lot, that can be worse. Lists in python are implemented with a built-in length counter, so looking up length is constant. L.sort() is theta(|L|lg|L|) where |L| is the length of a list. For dictionaries, you can store an arbitrary key with a value in constant time. A hash table is one of the most important data structures in computer science and is covered in lectures 8-10. Dealing with a single key in a dictionary is constant time (with high probability). Longs in Python are long integers. If x+y where x is |x| words long and y is |y| words long, it takes O(|x|+|y|) time to add. x*y takes O((|X|+|y|)^lg3) from Python’s implementation, a little better than quadratic. Lg is base 2 by default if mentioned. Heapq in the Python standard library implements a heap and is covered later.

The document distance problem takes D1 and D2 and has you find d(D1,D2), the distance between them. For instance, how different are two articles on a website to see if they’re basically identical. MIT uses an automated test that’s a similar problem to find plagiarism. If searching Google, you can think of your search as a small document to compare to others though this isn’t exactly how Google works. A document is a sequence of words, where a word is a string of alphanumeric characters. One idea is to look at shared words between documents, though there are lots of possibilities. A document can be thought of as a vector, where D[w] is the number of occurrences of w in D. For example, if D1 is “the cat” and D2 is “the dog”, the professor draws a 3D graph with axis for “the”, “dog” and “cat” and the entries as vectors in the appropriate directions. d’(D1,D2)=D1 dot product D2 = sum over w of D1[w]*D2[w]. The problem is that it’s not scale invariant, a longer document with more matched words but lower percent will be ranked better than even two identical smaller documents. What if instead, the equation is D1D2/|D1|D2|, which is the angle or arc cos. An algorithm for this is, split a document into words, then compute word frequencies and third compute the dot product. Each of these steps can be done in a lot of ways which can range from 0.2 seconds to 228.1 seconds for the same example. To compute word frequencies, loop over words in the document and update a count dictionary which is likely linear time. If you sort the words, that will be almost as fast. To split a document into words, you could split on spaces in linear time. Regex isn’t always linear time and you should be careful with it.

Lecture 3 – Insertion Sort, Merge Sort

One use for sorting is for applications such as phone books and spreadsheets which need to be searched in order. Another use is that some algorithms become easier when a list is sorted, such as finding the median item of an array of numbers, B. If sorted, you can just look at B[n/2] in constant time, otherwise you might need to compare each item. You can sort a list once, or use methods of maintaining the sort as items are added. Binary search is a search algorithm that shows the power of a sorted list. If you have a list of n items and are looking for a specific item k, if the list wasn’t sorted you would have a linear time to check each item. If sorted, you can look at the midpoint and look at the correct half recursively in logarithmic time total. None obvious uses of sorting are for subroutines in data compression or for computer graphics.

The first sorting algorithm shown is insertion sort. For i = 1,2,…n, insert A[i] into sorted array A[0:i-1] by pairwise swaps down to the correct position for the number originally in A[i]. For instance, if the array is [5,2,4,6,1,3], starting at i=1, one swap would happen and we’d look at 4. [5,2,4,6,1,3] -> [2,5,4,6,1,3] -> [2,4,5,6,1,3]. The key here is at 6 and the left hand side is in order so the loop just moves to the next step. Four swaps are required to get to [1,2,4,5,6,3] and 3 more to get [1,2,3,4,5,6]. The algorithm has theta(n) steps and each step is theta(n) compares and swaps. This leads to a theta(n²). In these examples, compares and swaps are often treated as equal though sometimes compares will be more expensive than swaps such as if two pointers are switched based on properties of the item at that point. If only looking at complexity of compares or if compares are more expensive, you could replace the pairwise swaps with binary search to get a theta(n lg n) algorithm in terms of compares as you can do a binary search on the previously sorted part. Unfortunately, it’s still theta(n²) for swaps. This second example is referred to as binary insertion sort.

Merge sort is a divide and conquer algorithm which splits an array A into two parts L and R and sorts them and merges them to get a fully sorted array. The merge function assumes you have two sorted arrays and it merges them together, taking the smaller element of the beginnings of each list until a new sorted array is created. The merge routine does most of the work as the body recursively splits the array and merges it back together. Merge has linear complexity and the overall complexity of merge sort is theta(n lg n). T(n) = C1 (divide) + 2T(n/2) (recursive) + cn (merge). The divide routine eventually splits into n leaves and then merges them back together at each level. The hidden difference is merge sort required theta(n) auxiliary space while insertion sort’s in place sorting had constant auxiliary space. An impractical in-place merge-sort exists but isn’t touched on in this course. In python, merge sort takes 2.2n lg(n) ms and insertion sort is 0.2n² ms. Comparatively, C’s insertion sort is 0.01²ms. As the length grows, merge sort in python beats insertion sort in C.

Lecture 4 – Heaps and Heap Sort

A priority queue is a structure that implements a set S of elements which are each associated with a key. A heap is an abstract data type with different methods. You can insert(S,x) which inserts element x into set S. max(S) returns the element of S with the largest key and extract_max(S) returns and removes the element with the largest key. increase_key(S,x,k) would increase the value of x’s key to k. A heap is an implementation of a priority queue which is an array visualized as a nearly complete binary tree. For instance, with the array, [16,14,10,8,7,9,3,2,4,1] could be visualized as a tree with the first item as the route, the 2nd and 3rd are it’s children. 4 and 5 are children of 2 and 6 and 7 are children of 3. 8 and 9 are children of 4 and 10 is a child of 5. This is a heap structure giving a tree representation of an array. The root of the tree is the first element, i=1. The parent(i) is i/2, the left(i)=2i and right(i)=2i+1. The max-heap property says the key of a node is at least the keys of its children for every node in the tree. Min-heap is the same with “at most”. The max operation is easier on a max heap since it’s the first element, but extract-max requires more work to modify the heap.

Heap operations that need to be implemented while maintaining the max-heap property for an example include build_max_heap which rearranges an array to build a max heap ([2,4,1] to [4,2,1]). Max_heapify corrects a single violation of the heap property in a subtree’s root. Only max-heaps are talked about for the rest of the lecture. Max_heapify will need to be run recursively to build a max heap. For max_heapify(A,i), assume the trees rooted at left(i) and right(i) are max heaps. Leaves are max heaps since they don’t compare with anything. Max_heapify looks at the children at the index and exchanges A[i] with it’s bigger child, then calls max_heapify again on that child if a swap occurs. The height of the tree is bounded by log(n) and max_heapify is theta(log n) with the assumption that left and right are max-heaps. You can use max_heapify to create build_max_heap and build a max heap out of an unordered array. The pseudocode for build_max_heap(A) is for i=n/2 down to 1, do max_heapify(A,i). It starts at n/2 because leaves are automatically good and the second half of the array is all leaves. This starts at the nodes above the leaves and works it’s way up so every step only works with max_heaps as it’s left and right childs.

Build_max_heap is O(n lg n) in the worst case, but you can get O(n) out of it. The first level above leaves is 1 operation, the second could have two, etc. The higher the node, the more operations, but the fewer nodes. With that, you can treat it as O(n). Analysis is about counting operations. Max_heapify takes constant time for nodes one level above the leaves and in general O(l) time for nodes l levels above the leaves. We have n/4 nodes with level 1, n/8 at level 2, and 1 node with lgn level. The total amount of work in the for loop is n/4(1c)+n/8 (2c) + n/16 (3c) +…+1(lgn c). Set n/4= 2^k, so c2^k(½⁰ + 2/2¹ + 3/2² +…+(k+1)/2^k) which converges to roughly 2. This is a convergent series bounded by a constant less than 3, even as k goes to infinity. This shows that depending on implementation it can be done in linear time. For heap sort, build_max_heap from the unordered array, find the maximum element A[i], swap elements A[n] with A[i] and the maximum element will be at the end of the array. Discard node n front the heap by decrementing the heapsize. The new root may violate the max heap property, but the children are max heaps, so max heapify can run to fix this. Find the max element A[i] and repeat the process from there n times. Heap sort takes O(n lg n) time.

Lecture 5 – Binary Search Trees, BST Sort

Binary search is a fundamental design and conquer paradigm and a data structure that goes with it is the binary search tree. A problem to illustrate this is a runway reservation system. For an airport with a single runway, reservations for future landings are built with this system. The system can reserve requests for landings specifying landing time, t, which is added to a set R of landing times if no other landings are scheduled within k minutes. After the plane lands, the reservation is removed from R. The goal is to implement a set R with n items and these operations can be done in O(lg n) time.

An unsorted array won’t work since any operation will be at least linear. You can insert in order one without a check, but the check will take O(n). What about a sorted array? If you get a time t, you can go to the middle and check the left or right to find an insertion point and check it’s left or right values. The search is O(log n), the check is constant and the insertion is O(n) since elements need to be shifted over. The algorithm for the sorted array would be able to find the smallest i where R[i]>=t in O(lg n) time and compare R[i] and R[i-1] against O(1) time. Since the actual insertion of the array requires shifting, it takes O(n). A sorted list is also a problem (linked list not python list/array) since insertion is fast but you can’t do binary search on a list. Heaps are a problem since checking if an element is at most or at least k from t time will take O(n) time. At this point we aren’t supposed to know about dictionaries or hash tables.

If we could do a fast insertion into a sorted array, everything would work. Binary search trees allow for this fast insertion. Binary search trees are a tree made up of nodes. Nodes have pointers to their parent and left and right elements as well as a value of the node. Unlike a heap, this is an actual tree made up of nodes as opposed to an array. In this tree, for all nodes x, if y is in the left subtree of x, then key(y) <= key(x) and if y is in the right subtree, key(y) >= key(x). Key is just the function to get and compare values? How can this be used to solve the runway problem? Looking at inserting, for the first element, you make a node. For the second element, compare the second time and attach it to the first node’s left or right. For every other element, start by comparing it to the first element, then move left or right until it finds its place. Adding the k minute check, at each compare make sure they aren’t k minutes apart. Reject adding the element if it doesn’t meet the check requirements. If h is the height of the tree, then insertion is O(h) time. Find_min is constant time in a min heap, but in a binary search tree would require going to the left every step at O(h) and going to the right until you hit a leaf for a max with the same complexity.

Augmented binary search trees can do more and have more data in them than the pointers mentioned earlier. A new requirement is to compute rank(t) which is how many planes are scheduled to land at or before time t. Augment the binary search tree to add a size property to each node that counts the nodes below it and which is modified every time a node is added. Ideally should be done with recursion in one pass. Delete would require moving pointers. To compute rank(t) from subtree sizes, walk down the tree to find t, and add in the nodes that are smaller, add one to that and add the subtree size to the left. The bad news with binary search trees is if you insert items in order, they’ll look like a list and will need to be balanced to work so h is log n.

Lecture 6 – AVL Trees, AVL Sort

To get the items of a binary search tree in order, recursively check the left, current node then right hand nodes in the same order. Binary search trees ideally are perfectly balanced where the height is theta(lg n). An unbalanced tree has a height of n. The height of a tree is the length of the longest path from a root to some leaf. AVL trees are the original way people found to keep trees balanced in the 60’s. The height of a node in a tree is the longest path down from that node to a leaf or “max{height(left child), height(right child)}+1. Similar to the size from last week, we can maintain a height property on each node. Empty child nodes can be treated as having a height of -1 to make the following methods work. An AVL requires the heights of the left and right children of every node to differ by at most + or – 1. AVL trees are balanced meaning their height is always O(log n).

The worst case is where the right subtree has height 1 more than the left for every node. If N_h is the minimum number of nodes possible in an AVL tree of height h, N_h = 1 +N_h-1+N_h-2. This recurrence is almost fibonacci and is bigger at every level. Fibonacci is exponenacci and is phi^h/ sqrt(5) and is an exponential bound. Phi is about 1.618. This is less than n. The height is always less than 1.44 log n. There are binary search trees that achieve 1 log n, but this is still pretty good. A much easier way of analyzing the recurrence is ignoring the const and seeing making both N_h-1+N_h-2, N_h-2 since we want to know N_h is greater than something. So N_h is greater than 2N_h-2 which is theta(2^h/2) and h < 2 lg n with 1.44 being the real answer and 2 being an easier to find answer.

Today, we’re only focusing on insert but delete is almost identical. For insert, do the BST insert. The second step is to fix the AVL property from the changed node up. The question is how to do step 2. If they aren’t balanced, use a rotation. If x has a left child A and right child Y with left and right children B and C, you can do a left rotate on x to make a tree where y has a left child of x which has A and B and y also has a right child of C. A right rotate of y will get back to the original state. To fix an AVL property with rotation, rotate the opposite direction of the extra weight on the tree. There may be several violations up the tree. Suppose x is the lowest node violating AVC, assume the right child is the heavier one, then if x’s right child is right-heavy, if we right rotate when x is k, left A is k-3 y is k-1 and it’s left child B is k-3 and right C is k-2. After a right rotate, x’s children A and B will have k-3 when y is k-1. This also works if the right child is balanced or left heavy. AVL can also be used to sort, by inserting n items and doing in-order traversal which takes theta(n) time. Inserting n items takes theta(n log n) time. An abstract data type is an algorithmic specification of what your object is supposed to do. In this case, things like insert and delete and min which are a priority queue. The data structure is how you implement it. You can make a priority queue with a heap or AVL. A balanced binary search tree adds additional capabilities. AVL trees have everything so far in log n.

Lecture 7 – Counting Sort, Radix Sort, Lower Bounds for Sorting and Searching

This lecture starts by saying that lower bounds on searching are omega(lg n) and sorting is omega(n log n) and showing in some cases we can get linear. With a comparison model, all input items are treated as black boxes with ADT data structure where the only operations allowed are comparisons to get a binary answer. The cost of an algorithm is treated as the number of comparisons in this case. A decision tree is used to show bounds which works by drawing all possible things the comparison algorithm can do. Any comparison algorithm can be viewed as a tree of all possible comparisons and their outcomes, and the resulting answer, for any particular value of n. An example of this is drawn as a binary search where n=3, meaning the array has 3 elements. The first case checks if the middle value is less than x going down one of two paths for the other elements and checking the values accordingly.

Since a tree can be an exponential size, it’s generally not a great representation, although it’s nice to see all possible outcomes at once. An internal node in the decision tree represents a binary decision or comparison in the algorithm. A leaf represents the answer of the algorithm. A root to leaf path corresponds to an execution of the algorithm and the length of a path corresponds to the run time. The worst case running time of the algorithm is the height of the decision tree.

Proving lower bounds for searching, if you have n preprocessed items, finding a given item among them in the comparison model requires omega(lg n) comparisons in the worst case. Preprocessing can be sorting or using a data structure such as an AVL tree. The proof of this is that the decision tree is binary and must have at least n leaves, with one for each answer and this means that the height is at least log n. Sorting is a similar proof. Proving that the number of comparisons is at least n log n is shown because the decision tree is binary and the number of leaves is at least the number of possible answer which is n factorial, the number of permutations on the input sequence. This means the height is at least lg(n factorial). The log of a product is the sum of the logs which factors down to n/2 lg n – n/2 and ignoring the constant, shows omega(n lg n). The actual answer is n lg n – O(n) if you care about constants. Comparison trees help to show these factors of a tree.

The next example uses the power of the RAM from earlier to sort in linear time. Integer sorting is an appropriate title for this section and often items can be represented as integers. Assuming that n keys being sorted are integers in a range of numbers 0 to k-1, each fitting in a word (machine definition that can be manipulated in constant time). Doing more than comparisons in this way can help a lot. For k items, you can sort in linear time. The best algorithm to solve for any size k at the time of this course is O(n sqrt(lglg n). For cases where k isn’t huge, it’s easy to sort in linear time. Counting sort is the first example. To do this, count the number of each item to write them in numerical order. Allocate an array of size k to store counters and use the index of the item in the range of numbers every time the number appears. Since the array is written in order by key, when you find a non zero number write it down the number of times it appears. A problem with this is that which 3 might matter if 3 represents another object and two slightly different items can be represented by 3. Instead, if L is an array of k empty lists, then for j in range(n), L[key(A[j])].append(A[j]). Set output to [], then for i in range(k), output.extend(L[i]). Creating the lists takes k times, the append is constant and the loop of range k takes the length of L₁ + 1, giving O(n+k), as long as k is order n, it’s linear time.

A better algorithm is radix sort which can take a much larger k and still get linear time. K can be up to n to the power 100 and still be linear time. To do this, imagine each integer as base b. The number of digits will be log_bk. Sort the integers by the least significant digit, then the next least and so on until you sort by the most significant digit. Each sort is done using counting sort. Each digit is sorted in O(n+b) and the total time is O((n+b)d) which is O(n+b)log_bk). Setting b to theta(n) minimizes this and it comes out to O(n log_nk). If k is at most n^c then it’s O(nc), and linear time since c is constant.

Lecture 8 – Hashing with Chaining

Hashes are the most used data structure in computer science and are called a dictionary in Python. A dictionary is an abstract data type, ADT which has a set of items each with a key. Items can be inserted with insert(item) and will overwrite an existing key, deleted with delete(item) and searched for with search(key). Search will return an item with a given key or report that it doesn’t exist. With dictionaries you only care about the specific key you searched for. So far, AVL had the best time with O(lg n). Using dictionaries, we’ll be able to search in constant time with high probability.

In python, the dict data type represents this. D[key] is search, d[key]=value is insert and del D[key] is delete. Calling D.items will give a list of (key, value) pairs. Dictionaries run in constant time in pretty much every programming language and most databases. Spell check looks up words in dictionary, for instance, changning to find correct words and username and passwords are stored in a dictionary as well. Two types of databases exist, using either hashing or trees. Compilers and interpreters use a dictionary as well, with a variable being associated with an item in memory. Early on, the interpreter for python stored dictionaries of variables. Dictionaries are still necessary in compilers. Routers store all connected devices in a dictionary to look up local machines quickly. Substring search and checking similarities between strings can be done using hashing, as well as file or directory synchronization such as with rsync. Cryptography uses hashing to show that a file wasn’t manipulated.

A simple approach to creating a hash table is with a direct-access table. Items are stored in an array indexed by key. Just using an array is bad because we want keys that aren’t necessarily integers. Collisions aren’t an issue here but items are overwritten. Another problem is that it can take a lot of memory, since there’s at least 1 slot in the array per key. Fixing the problem of having keys that aren’t integers, you can prehash (Python calls this hashing even though it’s inaccurate) which maps keys to non-negative integers. In theory, keys are finite and discrete since anything can be written as a string of bits which is an integer. In Python, hash(x) gives the prehash of x and this is a builtin function which maps an object to an integer, though there are some weird examples where two combinations map to the same number. Ideally, hash(x) should only equal hash(y) when x and y are the same, but sadly this isn’t quite true in Python. __hash__ can be added as a property on an object to tell it how to hash, otherwise it’s the space in memory. If a prehash function is implemented make sure it won’t change over time. This is why you can’t put a list as a key value, as it is mutable and could change.

The other problem from the simple method is it’s size. Hashing can help reduce the universe of all keys to a reasonable size m for the table. So, store all possible keys in an array of length m. Only some number of keys are directly present in the array, so a hash function h would map each item in the universe of keys to a spot in the table. If there are two keys that map to the same slot in the hash table, you’ll have a collision which is going to happen a lot. Chaining is one technique for dealing with collisions. Next week, the other method shown is open addressing. Chaining stores the items mapped to an address in the m-length array as a linked list, appending items to the linked list at the array’s index. While in the worst case scenario, every element could be randomly mapped to the same linked list, resulting in actions of theta(n). In general, however this will be pretty much constant time. The hashing function maps the universe to a number under m and does the work of reducing key space. How do we get low time? With simple uniform hashing, each key is equally likely to be hashed to any slot of the table independent of where others are hashed. To do proper hashing, we need both uniformity and independence. If true, then operations can be proved to take constant time. The expected length of a chain given n keys stored in m slots is n/m which is known as the load factor of the table, or alpha. This is constant as long as m is roughly n. Since you have to walk the entire list to look up an item, the run time in the worst case is O(1+alpha).

A couple of methods for creating hashing methods are shown. The first is the division method where h(k)=k mod m, which is a bad method especially is m and k have common factors. If m is prime and not close to powers of 2 or ten, then it’s better. The second method is the multiplication method where h(k)=[(a*k) mod 2^w] >>(w-r). A is a random integer in all possible w bit integers, shift right w-r to get appropriate bits, where m is 2^r. This works quite well in practice and the shifting adds randomization. Last, the best method beyond the scope of the course is universal hashing where h(k)=[(ak+b) mod p] mod m. a and b are random numbers between 0 and p-1 where p is a prime number greater than the universe size. For the worst case distinct keys k1 and k2, the probability (over a and b) of them being equal is 1/m. You can prove the expected number of collisions is the load factor in the worst case. Another pitch for 6.046.

Lecture 9 – Table Doubling, Karp-Rabin

One crucial piece missing from last week when implementing the data structure is how to grow a table. If we’re creating a table size m that should be roughly n but don’t know what n is or n grows, we could be in trouble. Balance between being big enough that it’s fast and small enough that it doesn’t waste space. M should be theta(n) and alpha should be theta(1). To solve these problems, the table can grow, starting small with m=8, we can grow and shrink the table as necessary. A new concept introduced is amortization and the problem of table doubling is used in Python’s list. If we want m to be at least n at all times, if m is less than n, then we should grow the table, making a table of size m’, building a new hash function and rehashing everything. For each item in the old table insert the item into the new table. This process of growing the table takes theta(n) time, generally theta(n+m+m’). m’ =2m is a good way of growing since simply adding a small size requires paying the cost of growing the table multiple times. Every time the table size changes, everything would be rebuilt. If doubling each time, the cost of resizing is only paid log n times, though each doubling cost goes up by doubling as well. Adding all doubling costs is theta(n). Luckily, even though a few operations such as doubling are expensive, most methods of the data structure are roughly constant.

An operation takes “T(n) amortized” if k operations take at most k*T(n) time. The idea of amortization is spreading out a high cost so average costs are cheaper. This is similar to saying the cost is “T(n) on average” where the average is taken over all operations. Data structures have lots of operations run on them over time. In table doubling, k inserts take theta(k) time with a single insert taking theta(1) amortized per insert. This is the same for deletions, which can help run more operations before having to regrow the table. When do you cut the size of the table if it gets too small? Simply cutting in half when m=n/2 is bad since 1 insert will cause it to double. The second idea cuts the table in half when n=n/4 which gives it more room to grow. This causes the amortized time to become constant for k insertions and deletions. Python lists are resizable arrays that allow the ith item to be modified and an item to added or deleted in constant time. Deleting the first item is linear time since all items are copied over. Python uses table doubling in a similar way and list.append and list.pop are theta(1) amortized.

String matching is a problem similar to grep in linux or find in a text editor. Given two strings, s and t, does s occur as a substring of t? Typically s is small and t is large (ex google search vs entire web). How do you do this quickly? A simple algorithm is any(s==t[i:i+len(s)], for i in range(len(t)-len(s)). This compares s to every possible substring of the same length with code in the for loop handling the case of a match. The runtime is theta(|s|*(|t|-|s|)) and O(|s|*|t|) which is quadratic and not very good. Using hashing this can be brought down to linear time. The first suggested solution to comparing s to a rolling string is to use a rolling hash ADT. This requires computing the hash function of each substring of length s and compute the next substring in constant time. Given a hash value r, you should be able to append a character. r maintains a string x and r.append(c) adds c to the end of x. r.skip(c) should delete the first character of x (assuming it’s c). r() is a hash value of x=h(x). If each operation can be done in constant time, then string matching will work. The Karp-Rabin string matching algorithm is

for c in s: rs.append(c)
for c in t[:len(s)]: rt.append(c)
If rs()==rt():
    if s==t[i-len(s)+1:i+1]:
      Return “found a match”
    else:
      This should happen at most 1/|s| 
for i in range(len(s), len(t)):
    rt.skip(t[i-len(s)]
    rt.append(t[i])
    if rs()==rt():
      Same as previous check for match

If rs() is r(), that means hashes are the same but it could be due to a collision. If this happens with probability 1/|s| then we can expect this operation to take O(1) time. If you only care about 1 match, the algorithm is expected to take linear time. Rs and rt are both rolling hashes. This algorithm is from the 90s. The last problem is figuring out how to build the ADT where each operation takes constant time. This algorithm will hold true even if the hashing method is the division method where h(k)=k mod m and m is a random prime at least |s|. Treating the string x as a multi-digit number (this is the pre hash function) and the base is the size of alphabet (256 for ASCII) in string. Looking at append operation, when you add a character it’s the same as taking the number, shifting it over by 1 and adding a value. So, hash value u becomes ua+ord(c), skip sets u to u-ca^|u-1|. So, r.append(c) sets r to ra+ord(c) mod m and r.skip sets r to r-c*a^|x-1| mod m. R stores the current hash value and a to the power of the length of x and can do operations in constant time to make the process work.

Lecture 10 – Open Addressing, Cryptographic Hashing

Open addressing is the simplest way to implement a hash table and it uses a single array without linked lists. This requires being more careful than a hash table with chaining and requires a different uniform hashing assumption. Open addressing is another approach to dealing collisions, as an array would work directly without collisions. Assuming an array with items and at most 1 item per slot, this means that m (number of slots) is at least n (number of elements). This works thanks to probing which is the idea of trying to insert something into a hash table and recomputing a slightly different hash if there’s an item already there, recomputing until a success. This requires a hash function that specifies the order of slots to probe for a key which is used for insert, search and delete. So, the hash function, h, takes the universe of keys and a trial count and produces a number between 0 and m-1. It’s important that h(k,1), h(k,2)…h(k,m-1) to be a permutation of 0,1,…m-1 for an arbitrary key k, so that the entirety of the hash table can be used if you try long enough.

An empty array slot will give a value None and if a hashing function finds something else it will know to fail and try a different trial count value to try a different slot. This is how insert would work. Searching is similar where as long as slots encountered by a hash function aren’t they key searched for and if not move on to the next trial. Keep probing until either the key or find an empty slot. Based on this, the trial counts would go in order from 1 to m-1 each time in the same order. Delete is a bit different. If you simply replace an item with None, any search algorithm going past that index would fail. Instead use a “Deleted” flag and modify insert to look for None or Deleted instead. Search will know the Deleted flag’s slot was passed before for other insertions.

Coding the array structure is straightforward but how well this works is based on the hash function. A bad example of a probing strategy or hash function is linear probing, where h(k,i)=(h’(k)+i)mod m where h’ is an ordinary hash function. This satisfies the permutation property but has a problem with clustering where consecutive groups of occupied slots keep growing and clusters get longer and longer. Clustering is the opposite of load balancing so it’s not useful when you want things evenly split. This also means losing the average constant-time lookups as you’ll keep looking in the same area.

A better strategy is double-hashing where h(k,i)=(h₁(k)+ih₂(k))mod m where where h₁ and h₂ are ordinary hash functions. For this to work, h₂(k) and m should be relatively prime to get a permutation. m=2^r, h₂(k) for all k is odd would satisfy this. Double-hashing is effective in practice as well. The uniform hashing function says each key is equally likely to have any one of the m-factorial permutations as it’s probe sequence. This is hard in practice, but double-hashing can get close which means if alpha=n/m, then the cost of operations is at most 1/(1-alpha) meaning a smaller alpha is more efficient. It’s important for alpha to stay around .5, but it’s important to have a strategy for resizing with open-addressing. One downside to the delete method is that it doesn’t make searching take less time on a collision.

Lastly, this week talks a bit about cryptographic hashes. An example is password storage and how unix systems work when you type in your password. How is the password checked? Storing the password in plain text is a stupid method, but it’s better that even an admin doesn’t know the password. This is done using a one-way cryptographic hash where given h(x)=Q it’s very hard to find x. The system stores the hashed password and compares it with the hash of what you typed in.

Lecture 11 – Integer Arithmetic, Karatsuba Multiplication

The lecture starts out talking about irrational numbers and the square root of two, along with talking about Pythagoreus and his objection to irrational numbers. Catalan numbers are introduced which represent the cardinality of the set P of balanced parenthesis strings, recursively defined as lambda belonging to P where lambda is the empty string and if alpha, beta belong to P, the (alpha)beta belongs to P. Every nonempty balanced parentheses string can be obtained via rule two from a unique alpha, beta pair, so “(())()()” is obtained by having alpha be “()” and beta is “()()”. The catalan number, cn is the number of balanced parenthesis strings with exactly n pairs of parenthesis, C0 is 1, which is an empty string. C1 is also 1, base on this, C2 is C0C1+C1C0 and a general equation for C_n+1 is the sum of k=0 to n of C_kC_n-k where n is at least 0. This creates a weird sequence of numbers. I’m kind of confused here and could definitely use another explanation of Catalan numbers.

Newton’s method is used to find the root of a function f(x)=0 through successive approximation. If the function f(x)=x^{2i+1 is y=f(x_i)+f’(x_i)(x-x_i). Iteratively applying Newton’s method is used to calculate the square root to. x_i+1=x_i-(f(x_i)/f’(x_i)). f(x)=x2-2 gives x, then x_i+1=x_i-((x_i2-a)/2x_i) which equals (x_i+(a/x_i))/2. These numbers can often have millions of digits. If x₀ is 1.0 then as i in x_i grows, the numbers in the decimal become closer to the accurate answer. This is because there’s a quadratic rate of convergence where the number of correct digits doubles. This is why Newton’s method is used in practice.}

The standard multiplication method is O(n²). For these examples, desired digits of precision is d and known before hand. If we want an integer floor(10^dsqrt(2)) which is basically floor(sqrt(2*10^2d)). High precision multiplication can be used to multiply two n-digit numbers (radix r=2, 10) and x and y are between 0 and rⁿ. Using divide and conquer, set x = x₁r^n/2+x₀ where x₁ is the high half and x₀ is the low half (same for y’s) and y=y₁r^n/2+y₀ where x₀,x₁,y₀ and y₁ are between 0 and r^n/2. Let z₀=x₀y₀, z₁=(x₀y₁+x₁y₀). z₂=x₂y₂ and overall, z=xy=x₁y₁rⁿ+(x₀y₁+x₁y₀)r^n/2+x₀y₀. There are four multiplies of n/2 digit numbers which means the algorithm will take theta(n²) time and T(n)=4T(n/2)+theta(n).

How do you do better? The Karatsuba algorithm calculates the z₀ and z₂ in the same way as the previous algorithm and uses those to compute z₁=(x₀+x₁)*(y₀y₁)-z₀-z₂. Because additions are easy and linear and there are only 3 multiplications, T(n)=3T(n/2)+theta(n). So, T(n)=theta(n^(log₂3))=theta(n^1.58…). To do better than this, you can try breaking up x into more chunks and trying to do less multiplications than chunks.

Last is a geometry problem with a circle of a trillion unit longs. C is the center and B and A are points on the edge. The line B to the line from C to A is 1 unit, going straight down to make a right triangle. The length along the line from C to A past this triangle is D, which is AD=AC-CD=½ trillion – sqrt(½ trillion squared – 1). This leads to a sequence including the catalan numbers.

Lecture 12 – Square Roots, Newton’s Method

The goal is to find the millionth digit of sqrt(2) and attempted to do this by computing the floor of sqrt(2*10^2d)) where d is the digits of precision, d=10⁶. Continuing from the problem last week after finding x_i+1=(x_i+a/x_i)/2. This is doing an error analysis of Newton’s method to show that it has a quadratic root of convergence. x_n=sqrt(a)(1+E_n) where E_n) can be positive or negative. To show convergence, as E_n as n gets large goes to zero. Putting that in that in to the x_i+1 equation, shows that X_n+1 = sqrt(a) time 1 + E_n²/(2(1+E_n)) so if E_n is .01, then E_n² is .0001, giving a quadratic rate of convergence. Figuring out the complexity of the division (a/x_i) will give the complexity of finding the square root of 2 or a using Newton’s method.

Multiplication algorithms were started last week with the goal of multiplying d-digit numbers with divide and conquer and are continued here. Toom-Cook algorithm generalized Karatsuba for k of at least 2 and breaks X and Y into 3 parts of d/3 digits. Trying to multiply these would require nine normally, so if you can beat that with mathematical tricks, you get a better algorithm. This algorithm reduces nine multiplications down to 5, giving theta(n^(log₃5) which is theta(n^1.465). Until a few years ago, Schonhage-Strasen was the best method of multiplication which was theta(n logn lg lgn) time using FFT. This, and the other algorithms, are available to test in Python’s gmpy package. In 2007, Furer came up with an algorithm that was theta(n log n 2^O(log*n)). log*n is the “iterated logarithm” which is the number of times the log function needs to be applied to get a result that is at most 1. Practically, Schonhage-Strasen is still better in practice with very large n values and Furer isn’t in gmpy.

How long would division take? To find a high-precision rep of a/b, first compute for 1/b. This means compute floor(R/b) where R is a large value where it’s easy to divide by R (ex R=2^k can be divided by shift operator). Dividing R/b by R gives 1/b. R/b can be computed by using Newton’s method. f(x)=1/x – b/R (with a zero at x=R/b). f’(x)=-1/x². Applying Newton’s as done before by computing a tangent and the new value is the x intercept, x_i+1=x_i-(f(x_i)/f’(x_i)=xi-(1/x_i – b/R)/(-1/x_i². x_i+1=x_i+x_i²(1/x_i – b/R) = 2x_i – bx_i²/R. This would require repeatedly applying multiplication to solve. If you want R/b = 2¹⁶/5 = 13107.2 to see how to get to that. An initial guess could be 2¹⁶/4 which is x₀ and 16384. By, x₂, you’d get 13056, then 13107 on x₃.

There’s a quadratic rate of convergence for division as well and the number of digits doubles at each step. There are d-digits of precision and lg d iterations. Multiplication is in theta(n^alpha) where alpha is at least 1, the complexity of division based on information so far is O(log n * n^alpha). If we want d digits of precision, start with 1 digit and double until d. The initial multiplies are easy, but toward the end a lot more work is done. The number of operation is C*1^alpha+C*2^alpha+…C*(d/4)^alpha+C*(d/2)^alpha+C*d^alpha which is less than 2C*d^alpha. This helps compute more accurately and ultimately means that the complexity of division is the same as multiplication. Lastly, the complexity of computing square roots is the same as the complexity of division and thus multiplication. This follows the same reasoning that smaller numbers require less work and most of the work is done at higher numbers, since the process of calculating the square root calls division.

Lecture 13 – Breadth-First Search (BFS)

Graph search is about exploring a graph. This might be looking for a path between two particular nodes or from one state of a rubix cube to a solved cube in an example. You might also want to look at all possible states reachable from the start or all possible states in general. A graph G=(V,E) where V is a set of vertices and E is a set of edges. Edges are either unordered {v,w} or ordered pairs (v,w). Usually all edges are of one type with unordered leading to undirected graphs and ordered leading to directed. A list of all edges is a bad way of representing a graph since you’d have to search the list to know if two nodes were connected. Some applications of graph search are web crawling, social networking, network broadcasting such as data packets across the internet, garbage collection in modern languages (unreachable data can be thrown away), model checking (proving code does what you think it does by looking at all states), and solving games and puzzles. Breadth-first search is used for a lot of these examples.

The first example represents a 2x2x2 rubik’s cube as a configuration graph of possible states. The number of vertices is 8!*8³ which is 264,539,510, this is because there are 8 cubes in the rubik’s cubes each of which has 3 possible twists. You can divide by 24 because there are 24 symmetries and by a third since only a third is actually reachable. Each vertex is a possible state of the cube. Edges are the move that takes you from one possible move to another, and edges are undirected since you can move back. The graph is drawn starting from a solved state branching out into all possible states reachable in 11 moves (allowing half twists). The diameter is 11 for 2x2x2 and 20 for 3x3x3 and is the best algorithm from the worst starting state. The best known way to compute the diameter is constructing a massive graph of all possible states and using a breadth-first search to go layer by layer. For an nxnxn rubik’s cube you can solve in theta(n²/lgn).

A standard representation of graphs is an adjacency list where you have an array, Adj of size V, where each element is a pointer to a linked list and for each vertex u in V, you can call Adj[u] to get the linked list corresponding to the vertex. Adj[u] also stores u’s neighbors, so Adj[u] is the set of all vertices V such that (u,v) is an edge. You could use an object-oriented way as well to store v.neighbors = Adj[v], though the array/linked list representation is convenient to have multiple Adj arrays on different graphs talking about the same vertices. Adjacency lists are the way to go for multiple lists. Another way of representing graphs is with implicit representation where Adj(u) is a function and v.neighbors is a method and are computed when needed using less space. The bare minimum space to get a representation of the graph is theta(|V|+|E|) and algorithms should run in that time since it’s what’s needed to look at the graph.

With breadth-first search the goal is to visit all nodes that are reachable from a given node s in O(V+E) time. Look at the nodes reachable in 0 moves (s), 1 move (Adj[s]), 2 moves, etc, being careful to avoid duplicates. As long as there’s 1 cycle it could go on forever. A pseudocode implementation shown in the lecture is

BFS(s,Adj):
  level={s:0}, 
  parent={S:None}, 
  i=1, frontier=[s]
  While frontier: 
    next=[]
    for u in frontier:
        for v in Adj[u]:
            if v not in level: 
                level[v]=i
                parent[v]=u
                next.append(v)
    frontier = next
    i+=1

For every node at a level, this creates a list of all possible states reachable from a node at that level and stores it as next if it hasn’t been looked at. After all nodes in the current frontier are looked at, the next level becomes the new frontier. The algorithm runs until the frontier is empty and all vertices were looked at.

If you take a node v and take it’s parent and that node’s parent and so on back to s, this will give a shortest path from s to v and the length of the path will be the level of v. So, the level will tell you the shortest path length and parent pointers give the shortest path. The total running time of the algorithm will be the sum of all vertices of |Adj[v]|=2|E| for undirected or E for directed graphs. This can be said to be O(E) and O(V+E) linear time to also list unreachable vertices.

Lecture 14 – Depth-First Search (DFS), Topological Sort

Depth-first search is a simple algorithm with some useful features such as detecting cycles and topological sort. Depth-first search recursively explores a graph and backtracks as necessary, similar to solving a maze keeping track of paths gone down. The following pseudocode is a basic implementation that can access all vertices from a given source s.

DFS-Visit(Adj, s):
  For v in Adj[s]:
       If v not in parent:
           parent[v]=s
           DFS-Visit(Adj,v)

Starting from s, it looks at all outgoing edges and sees if it’s gone there already. Setting the parent of a node allows this check to avoid repeating. DFS-Visit will only reach vertices accessible from s. The actual DFS will reach all vertices by looping through possible s values.

DFS(V, Adj):
    parent={}
    For s in V:
        If s not in parent:
            parent[s]=None
            DFS-Visit(Adj,s)

This algorithm is theta(V+E) linear time since it visits each vertex once in DFS alone which is O(v) and DFS-Visit is called once per vertex v and pays |Adj[v]| which leads to O(E) for directed graphs. Both DFS and BFS are linear time. BFS is good for finding shortest paths where DFS is not.

In a directed graph, every edge is checked once and in an undirected graph every edge is checked twice. If an edge leads to something unvisited then it’s a tree edge. Parent pointers specify tree edges and occur when a new vertex is visited via that edge. Other types of edges are forward edges, backward edges and cross edges. A forward edge goes from a node to a descendant in a tree, a backward node goes to an ancestor in a tree and other nodes are cross edges. Cross edges are between two non-ancestor-related subtrees. How do you compute which edges are which? By flagging nodes as currently in the recursion stack, if we follow an edge to something currently in a stack means that it’s a backward edge. Detecting forward edges is trickier, with one method being recording the time S got visited and storing start and finish times to see when a node was visited and using that to determine if it’s a forward or cross edge. In an undirected graph, only tree and backward edges exist.

A graph has a cycle if and only if a DFS of the graph has a back-edge. Detecting cycles is easier in undirected graphs and a bit trickier in directed since you have to worry about a cycle. If tree edges connect a node to it’s descendant and a back edge goes back to the ancestor then there’s a cycle. If there’s a back edge, there’s at least 1 cycle. Similarly, if there’s a cycle there’s a back edge somewhere. In a loop that goes from a v₀ to v_k which goes back to v₀, v_k will be visited before v₀ is finished, so the edge between v_k and v₀ is a back edge.

The second application of DFS is topological sort shown through a job scheduling problem. For job scheduling, we have a directed acyclic graph (DAG from 6.042) and want to order the vertices so that all edges point from lower order to higher order. This was shown in the previous course with the example of putting clothes on in an order. Topological sort takes a graph and sorts vertices in the graph by using DFS to output the reverse of the finishing times of the vertices. There are no cycles here and we want to prove that for an edge (u,v), v finishes before u. If u starts before v, then we’ll visit v before u’s recursion finishes. Otherwise, v starts before u, which would mean there’s a back edge and a cycle causing a contradiction, so v will finish first either way. This is a proof that topological sort gives a valid job schedule.

Lecture 15 – Single-Source Shortest Paths Problem

This week looks at graphs with weighted edges which allow for a better set of applications. One motivation for finding shortest paths is to find the fastest way from getting from two points on a map. In weighted graphs examples, a graph is G(V,E,W) where W maps edges to real number weights. One algorithm that will be looked at in the future is Dijkstra’s algorithm which assumes non negative weighted edges and has a complexity of O(VlgV+E) which is practically linear and E is O(v²). A simple graph has at most 1 edge between a pair of vertices. Multigraphs can have multiple edges but aren’t in these lectures. The dominating factor is E and Dijkstra’s algorithm’s linear time is appealing. The other algorithm that uses weighted edges is the Bellman-Ford algorithm which can have positive or negative weighted edges and runs in O(VE). In general Dijkstra’s algorithm is the go too, but if negative weighted edges occur, you’ll need Bellman-Ford.

A path p is a sequence of vertices from from 0 to k and edges is a set of pairs of v_i, v_i+1 in a set E where i is between 0 and k-1 inclusive. w(p) is the sum of i=0 to k-1 of w(v_i, v_i+1) and finding the shortest path is done by finding the path with the shortest minimum weight. There are only E² possible weights in a graph and the range of weights can be exponential. Dijkstra and Bellman-Ford are useful because they don’t depend on the dynamic range of weights. Some more notation for weighted graphs shows that a path P from v₀ to v_k. (v₀) is a path from v₀ to v₀ with a weight of 0. The shortest path weight from u to v is delta(u,v) is the minimum weight of all paths from u to v if there is a path, otherwise it’s infinity if there is no path. This can be tedious since you’ll need to enumerate all possible path weights to know which is the shortest which gets more complicated with negative paths. d(v) is the current weight and pi[v] is the predecessor vertex on the best path to v and pi[s] is nil. The current distance and predecessor relationship are the two variables that need to be iterated on.

Negative weights are important since in some paths you can gain or lose some tracked value such as money, though they make these problems more difficult as you can’t simply ditch a path when it becomes longer than a previously found one. With points in a cycle, you can get back to the same node and paths with the same start and end can make large varieties of weights. An algorithm that’s effective for handling negative cycles should finish in a reasonable amount of time O(VE) and give delta’s for all vertices with finite numbers and mark other weights as being indeterminate. If you don’t know if you have a cycle, you’ll end up having to use Bellman Ford which can detect negative cycles. Dijkstra’s algorithm doesn’t have to do that and is simpler.

The general structure of shortest path algorithms is as follows. Initialize for u in V, d[v] is set to infinity and pi[u] is NIL and d[s] is 0. Repeat the process of selecting an edge (u,v) (done differently by different algorithms). “Relax” edge(u,v) which means if d[v] is greater than d[u]+w(u,v) then you’ve found a better way of getting to v and update d[v] as follows. With the initial infinity value, the first path will always be better. If you found a better path then update d[v] to that path and set pi[v] to u. The repeat is run until all edges have d[v] which is at most d[u]+w(u,v). Relaxing is a fundamental notion that will be used in weighted edge algorithms. The termination condition means that smaller edges can’t be found. This is a brute force algorithm which will work but will be slow. Relaxation is a good process to learn from this, but the random way of selecting algorithms is bad and the termination is an O(E) check by itself which is bad.

Edges with high dynamic ranges including square roots and anything positive should still be handled by Dijkstra’s algorithm. One problem with the example algorithm is that you can end up relaxing edges an exponential number of times making it an exponential time algorithm. The problem is due to pathological ordering and changing the ordering can make the algorithm much better. The optimum substructure states that subpaths of a shortest path are shortest paths. The triangle inequality states that for all u,v,x in X, delta(u,v) is at most delta(u,x)+delta(x,v).

Lecture 16 – Dijkstra

This is the second lecture of four on shortest paths. Last week talked about a general structure and this week looks at Dijkstra’s algorithm. Relaxation is reviewed and is a fundamental feature in shortest paths algorithms. If d[v] is the length of the current shortest path from source s to v, initialized as infinity. Through relaxation, these d values are reduced to delta values which are the length of a shortest possible path. When all d’s are deltas, the algorithm is done. Pi[v] is the predecessor of v in the shortest path from s to v and the predecessor chain can be followed to reconstruct the shortest path. Relax(u,v,w): if d[v]>d[u]+w(u,v), then d[v]=d[u]+w(u,v) and pi[v]=u. You want to make sure relaxation is safe and never reduces below correct values which is dangerous. A simple lemma that can be proved says that the relation operation maintaining the invariant that d[v] is at least delta(s,v) for all vertices. You’ll never get something in the middle less than your shortest path value and the function will eventually terminate and run in polynomial time. Proof by induction on the number of steps by assuming d[u] is at least delta(s,v). By triangle inequality, delta(s,v) is less than delta(s,u)+delta(u,v). Triangle inequality is about shortest paths not single edge weights. The other half of the proof is done by showing delta(s,v) is at most d[u]+w(u,v) = d[v] after relaxation steps. Relaxations will be applied in different ways based on constraints of different problems to get the most efficient algorithm.

For directed acyclic graphs or DAGs, things are a bit easier since there can’t be cycles or negative cycles. Negative edges can still exist and there isn’t a problem if you only go through a negative edge once, since you just subtract. A good way of solving a DAG is to start by topologically sorting it with the path from u to v implying u is before v in the ordering. This gives a linear ordering that can be gone through from left to right in one pass over the vertices. Run through this and relax each edge that leaves each vertex. This takes O(V+E) time and should work for any choice of starting vertex. Unlike the above algorithm, Dijkstra’s algorithm doesn’t work for negative edges though it works for graphs with cycles unlike this one. Both algorithms have their place based on different scenarios.

Dijkstra’s algorithm is very straightforward partially because it’s a greedy algorithm building shortest paths vertex by vertex.

Dijkstra(G,W,s):
  Initialize (G,s) and S (set) becomes null and Q becomes V[G] (entire set of vertices) d[s]=0
  While Q != null:
      u is set to extract-min(Q) //deletes u from Q
      S is set to S union {u}
      for each vertex x in set Adj[u]:
          relax(u,v,w)

As vertices are processed, they’re moved to S which contains vertices where shortest paths are known. As long as Q isn’t empty, it picks the minimum priority from Q and claims it’s something you already computed the shortest paths for. Putting it into S says you’ve computed the shortest path and priority values that need to be changed based on this update are changed. Min priority means the nearest node. Q is a priority queue and ADT, with changing priorities for each value which are the d values. If initially all priorities are set to infinity with one being 0, that will be the starting node chosen. There are theta(V) inserts into the priority queue, theta(V) extract-min operations and theta(E) decrease-key/relax operations. For an array structure as the priority queue, theta(v) for extract min, theta(1) for decrease-key and theta(V²) in total. Using a binary min-heap works better for Dijkstra, which has theta(log v) for extract min, theta(log v) for decrease and theta(v lg v+E lg V) overall. Though this isn’t the best for Dijkstra. A Fibonacci heap that isn’t talked about in the course is better and has theta(log v) for extract min and constant amortized time for decrease giving an overall theta(v log v + E) time. This is essentially linear time.

Lecture 17 – Bellman-Ford

This algorithm can compute the shortest path in a graph even with negative paths and cycles. This is a polynomial time algorithm. Given a graph with a negative weight cycle, an algorithm needs to be able to detect that the cycle exist and say that any paths that could be affected by the cycle are undefined. Taking a second look at the generic shortest path algorithm from last lecture, where you set all vertices to have infinite shortest path weights and predecessors to be NIL, along with d[s] to 0. The main loop repeated selecting an edge and relaxing the edge when appropriate until you can’t relax any more. Two problems with this generic algorithm are that its complexity could be exponential time even with positive edge weights and that it will not terminate if there’s a negative weight cycle reachable from the source. Even with Dijkstra, we haven’t seen yet how to solve a problem with negative edges in the general case (not DAG).

Bellman-Ford(G, W, s)
    Initialize(): // same as generic
    For i = 1 to |V|-1:
        For each edge(u,v) in set E:
            relax(u,v,w)
    For each edge (u,v) in set E:
            If d[v] > d[u] + w(u,v):
                Then report negative weight cycle exists

This algorithm is O(VE) and isn’t dependent on data structure like Dijkstra is, but Dijkstra can be brought down to linear while this can be cubic complexity in terms of vertices. The two parts of this algorithm are finding shortest paths and detecting negative weight cycles. If G=(V,E) contains no negative weight cycles then after execution, d[v]=delta(s,v) for all v in set V. Also, if a value d[v] fails to converge after |v|-1 passes, there’s a negative weight cycle reachable from s. To prove these, let v in set V, p is a path from v₀ to v_k, v₀=s to v_k=v. P is a shortest path with the minimum number of edges, no negative weight cycles implies that p is simple, which implies that k is at most |v|-1. Since you’re relaxing every edge in each pass of the algorithm, after one pass through all edges, we have d[v₁ = d(s,v₁) since the edge v₀, v₁ can’t be smaller. After two passes, d[v₂ = d(s,v₂) since the edge v₁, v₂ can’t be smaller. This pattern follows through to the last element and all reachable vertices will have delta values. If after v-1 passes, an edge can still be relaxed, it means that the current shortest path from s to some vertex v is not simple and there is a repeated vertex and thus a negative weight cycle. In summary, after a certain number of passes if you haven’t finished, you’ve found a negative weight cycle and can quit. If you have a graph with negative weight cycles, finding the shortest simple path is an np hard problem and no solutions better than exponential are known. Longest path is also np-hard and if you can solve one, you can solve the other. Bellman-Ford’s polynomial time comes from its ability to abort and not return a weight, asking for the shortest simple path instead makes it much harder.

Lecture 18 – Speeding up Dijkstra

This is the last lecture on shortest paths and focuses on improving average performance. Dijkstra can be optimized when looking at a single target, by adding a line of code to top if u equals t instead of when Q is null. This will generally run as it won’t generally be the last vertex. This handles the single-source single-target problem. Another problem is a bi-directional search which takes 1 step of a forward search followed by a step backward search from the target, t. Data structures for this problem have to have edges that can go either way. Only the source has 0 priority in the forward search and only the target in the backward search. To have two searches in this way, twice the data structures are required and a backwards frontier is grown as well. Df[u] corresponds to the distances for the forward search and db[u] for the backwards, with priority queues Qf and Qb corresponding the same. The termination condition is that the frontiers meet, which means that some vertex u has been processed both in the forward and backward search and has been deleted from both Qf and Qb.

A harder problem is finding the shortest path after termination from s to t. Two Pi predecessors exist here as well with Pi f being the regular predecessor and Pi b following it backward. The following is an incorrect claim: If w was processed first from both Qf and Qb, then the shortest path using Pi f from s to w and the shortest path using pi b from t to w, following edges backwards. For the forward search, repeatedly apply Pi f to w until you reach s and Pi b to w until you get to t in the backward search. The problem with this approach is that w may not be on the shortest path. The termination condition is correct, but the use of w to construct the shortest path is wrong. The problem is that the frontiers will collide on the shortest length path node wise, not weight wise so this won’t work. The correct solution is to find an x (possibly different from w) with a minimum value of df[x]+db[x] and replace “s to w” in incorrect claim with “s to x” to make it correct.

Some techniques used for modifying graphs themselves to help processes run faster exist. In a goal directed graph, you modify edge weights with potential functions, so w’(u,v) = w(u,v)-lambda(u)+lambda(v). Any path w’(p)=w(p)-lambda_t(s)+lambda_t(t). Any path will be shifted by the same amount and this shows a correctness check, though hopefully you’ll discover it faster. One way of getting the potential w’(p) function is using a landmark. With a landmark l, which is a vertex belonging to V, precompute delta(u,l) for all u in V. The potential lambda_t(u) = delta(u,l)-delta(t,l). With the correct choice of landmark, Dijkstra will run faster.

Lecture 19 – Dynamic Programming 1: Fibonacci, Shortest Paths

The next four lectures talk about dynamic programming, a general and powerful way of designing new algorithms. Dynamic programming is particularly good for optimization problems and can do an exhaustive search in polynomial time. It can be thought of as a “careful brute force.” Many problems only have a polynomial time algorithm coming from dynamic programming. Dynamic programming was invented by Richard Bellman, also behind the Bellman-Ford algorithm. The basics of dynamic programming is to split a problem into subproblems, solve them and reuse the solutions.

For a first example, Fibonacci numbers return, where F₁=F₂=1 and F_nsub>=F_n-1+F_n-2. Starting with a naive recursive algorithm,

fib(n):
    if n<=2: f=1
    else: f=fib(n-1)+fib(n-2)
    return f

which is correct, but is an exponential time algorithm, where T(n)=T(n-1)+T(n-2)+theta(1). With the addition of memoization this becomes much better. The idea is to put every computation of a fibonacci number into a dictionary and check if you’ve computed it before, before doing so again.

fib(n):
    if n in memo: return memo[n]
    if n<= 2: f=1
    else: f=fib(n-1)+fib(n-2)
    memo[n]=f
    return f

This is a general strategy that can make recursive algorithms more efficient by reusing work already done. Fib(k) will only recurse the first time it’s called, otherwise the memoized calls cost theta(1). The number of non-memoized calls is n and the non-recursive work per call is constant. This means the running time goes from exponential to linear, theta(n), though there is a better algorithm that can theta(log n). In general in dynamic programming, memoize and re-use solutions to subproblems that help solve the bigger problem. Dynamic programming is essentially recursion plus memoization. The time taken will be the number of subproblems times the time per subproblem, ignoring memoized recursive calls.

Another use of dynamic programming starts at the bottom and works its way up (versus recursion which is top down).

fib = {}
for k in range(1,n+1):
    if k<= 2: f=1
    else: f=fib[k-1]+fib[k-2]
    fib(k)=f
return fib[n]

This does the same computations and does the topological sort of the subproblem dependency DAG. Subproblems are solved in the order from F(1) to F(n), making up the graph. You can also update the algorithm to save space since the algorithm only needs to remember the last two values.

Looking back at single-source shortest paths, looking for delta(s,v) for all V, this can be written as a recursive algorithm and memoized. One technique for solving the problem is to try all guessing. This is apparently central to dynamic programming which is recursion plus memoization plus guessing. Try all the guesses and take the best one. For shortest paths, try each possible path backward from target v and try to find the shortest path back to s. delta(s,v)=(min over(u,v) in all edges)(delta(s,u)+w(u,v). This is an analog for Fibonacci’s naive recursive algorithm and is exponential. To memoize, check if delta(s,v) is in a memo dictionary before computing it. This memoization works in O(V+E) on DAGs but can be infinite in graphs with cycles. Memoized, it’s essentially doing a DFS, topological sort and Bellman-Ford approached from a different method. For memoization to work, subproblem dependendencies should be acyclic otherwise you get an infinite loop. You can take a cyclic graph and make it acyclic by creating a new layer in a graph for each state/edge followed. s_k(s,v) is the weight of the shortest s to v path that uses at most k edges. s_k=min over (u,v) in edges (s_k-1(s,u)+w(u,v)). This increases the number of subproblems to v² and d_{|v|-1(s,v) and the running time is theta(VE), seeing another way of getting Bellman-Ford’s performance.}

Lecture 20 – Dynamic Programming II: Text Justification, Blackjack

Five easy steps to dynamic programming: 1. Define subproblems. 2. Guess part of the solution. 3. Relate subproblem solutions with recurrence. 4. Build an algorithm by recursion and memoization or building a table bottom up. 5. Solve original problem. Fibonacci and shortest paths problems are further analyzed by dividing them into these steps. In general, we’ll need to count the number of subproblems for analysis, and the number of choices that could be guessed. The third step analyzes the time per subproblem and fourth we need to check that the recurrence is acyclic and compute the total time. Lastly, make sure the actual problem gets solved. Combining solutions can sometimes require extra time.

A new problem this week is text justification where text is split into “good” lines. Basically this is about when to split lines in a text editor. Word used to use a greedy strategy trying to fit as many words as possible while another example editor (“Latex?”) used Dynamic programming. Text is a list of words and badness(i,j) is defined as how bad it is to use words[i:j] as a line. In the case they don’t fit on a line, then badness is infinity and is the worst case. Otherwise, badness is (page-width – total width)³. The hard part is defining subproblems. In this case, a brute force strategy would be guessing where each line begins. After you guess where the second line begins, you’ll have a problem deciding where the next line begins with the remaining words. Subproblems will be suffixes, words[i:] and with n words, there are n sub problems since we’re only thinking about remaining words. The guessing is where to start the second line with at most n-i or O(n) choices. Third, the recurrence, DP(i) considers all possible guesses and loops through j in range(i+1,n+1). If i is the first word of the first line and j is the first word of the second line, DP(i) =min (DP(j) + badness(i,j) for j in range(i+1,n+1)). For each of O(n) choices, constant work is done. There is a topological order in the problem since i = n,n-1,…0 and the base case is where DP(n)=0 and there is no cost for a blank line. The total time is theta(n². Lastly, making sure the original problem gets solved, as long as DP(0) is solved the algorithm will handle all words. Parent pointers are used to remember which guess was best, what was the value of j that gave you the minimum value. Keep track of which one was the best. Parent[i]=argmin(…) = best j value for i. Parent pointers can be applied to any dynamic programming problem.

The next problem covers the game Blackjack. In an example of perfect information Blackjack with a deck a sequence of cards c₀ to c_n-1 with 1 player versus a dealer and 1 dollar per hand, should you hit or stand? The guess step is whether to hit or stand given a card. Specifically, how many times should you hit in the first play? Subproblems are where you start the new hand, suffix c_i:. The number of subproblems is n and the number of choices to guess is at most n. For the recurrence, BJ(i)=max(outcome of 1st play {-1,0,1} + BJ(j) for number of hits in range(0,n) if valid play). J is actually i + 3 + number of hits + number of dealer hits to determine the number of cards left. The dealer will always take another card if below 17. This can be represented as a DAG as with most algorithms a different plays will have different outcomes or weights of edges. Using this you can try and find the longest path where you gain the most money. The actual algorithm would be a bit more complicated but this shows the power of dynamic programming.

Lecture 21 – Dynamic Programming III: Parenthesization, Edit Distance, Knapsack

This lecture shows three more examples of solving problems with dynamic programming. First the professor gives some general tips for finding subproblems where inputs are strings or sequences. Generally, these problems use suffixes as long as you’re taking off the beginning of the sequence, though if taking off the end, you’ll end up with prefixes. Both have linear size. In cases where neither work, you may have to look at all substrings, x[i:j] which is theta(n²) and polynomial. In general if prefixes or suffixes don’t work on their own, you’ll need substrings.

The first problem is parenthesization where you find the optimal evaluation of an associative expression such as matrix multiplication, with n matrices. A₀*A₁*…*A_n-1*. If it doesn’t matter where parenthesis are for finding the right answer, then you can place them to get better times. You can try to guess the outermost or last multiplication and there will be a last multiplication in each subproblem. The subproblem will be the optimal evaluation of A_i…A_j-1 after guessing a split (A_i…A_k-1)*(A_k…A_j-1). The number of guess choices is at most O(n). For the recurrence, DP(i,j)=min( DP(i,k)+DP(k,j)+(cost of (A_i:k)*(A_k:j)) for k in range(i, j). This considers all possible parenthesizations of matrices in polynomial time. The number of subproblems is theta(n²) and time is theta(n³) which is much better than trying all roughly 4ⁿ parenthesizations. The topological order is increasing substring size and a DAG can be drawn from it though Dynamic Programming isn’t the shortest paths in the DAG. The base case is DP(i, i+1)=0. Lastly, the overall problem is DP(0,n). Memoization can be added to the algorithm in the usual way.

The next problem is edit distance where given two strings, x and y, you find the cheapest possible sequence of character edits to transform x into y. Edits allowed are inserting, deleting or replacing a character anywhere in the string. Different replacements could be given different costs based on mutation likelihood such as in autocorrect or DNA sequence problems. Another problem is the longest common subsequence where you find the longest subsequence between two strings where you drop any number from x and y to be equal. Thinking of this as an edit distance problem, you set the cost of insert and delete to be 1 and replace to 0 if replacing with the same string or infinity otherwise (don’t do it). A subproblem for the edit distance problem is to solve edit distance on x[i:] and y[j:] for all i,j and the number of subproblems will be n² if lengths are the same, or theta(|x|*|y|) which is quadratic. Then you have to guess which move to make and on which character(s). The recurrence is DP(i,j)=min((cost of replace x[i]->y[j]+DP(i+1,j+1)), (cost of insert y[j])+DP(i,j+1), (cost of delete x[i])+DP(i+1,j)). For topological order, for suffixes you typically go from the end to the beginning. This can be done for both strings, regardless of order. “For i =|x| down to 0: for j=|y| down to 0” will work. This way, the overall problem will start at DP(0,0). Sub problems are theta(1) and the overall problem is theta(|x|*|y|). Space can be reduced to linear, but time is quadratic.

The last problem for the week is the knapsack problem where you have a list of items each with a size and value, s_i and v_v and a knapsack of total size S. The goal is to maximize the sum of the values for a subset of items of total size of at most S. First, a subproblem looks at suffixes of items creating an order of items. Guess if item i is in the subset or not. For each item, there are two choices. max(DP(i+1), DP(i+1)+V₁ is a bad algorithm since it doesn’t keep track of the size in the bag. A better algorithm keeps track of remaining capacity X of at most S. The number of subproblems in theta(nS). DP(i,s)=max(DP(i+1,X), DP(i+1,X-s_i)+V_i) The total running time is theta(n*S) which isn’t polynomial in n (the size of the input) and can be thought of as pseudo polynomial time since one element is based on a different number in the input. Can be between polynomial and exponential and is the best we can do for the problem.

Lecture 22 – Dynamic Programming IV: Guitar Fingering, Tetris, Super Mario Bros

This lecture shows a different kind of guessing when solving problems with dynamic programming. Usually you guess which subproblems to use to solve the bigger problem or subproblem which are in steps 2 and 3. A higher level of guessing can be used in step one to add more subproblems to guess or remember more features of the solution. This was sort of used in knapsack when information was added to handle the knapsack size, additionally you could find different solutions for different knapsack sizes to gain more information about the problem. This week shows examples where the standard subproblems don’t work without more information.

The first example is of piano and guitar fingering where given a musical piece (sequence of n notes) you figure out which finger you should use to play each note. Fingers are labeled 1 to F and only one note is played at a time initially. A difficulty measure d(p,f,q,g) is computed as well. p and q are notes and f and g are fingers. How hard is it to transition from f on p to g on q? An incorrect example follows. First, the subproblem will use suffixes and will be how to play notes[i:], figuring out how to play notes from left to write. Then you guess which finger to use for note i. The recurrence is DP(i)=min(DP[i+1]+d(i,f,i+1,?) for f in 1 to F). The problem is we don’t know which finger to use for i+1 and will have to add more subproblems.

Another attempt at the note fingering problem starts by defining the subproblem as how play notes[i:] when you use f for notes[i]. The next thing to guess is finger g for notes[i+1] and the recurrence is DP(i,f)=min(DP(i+1,g)+d(notes[i],f,notes[i+1],g)) for g in 1 to F). This takes into consideration the unknown fingering from before. The topological order is a bit trickier with two parameters and goes “for i in reversed(range(n)): for f in 1 to F:”. To solve the original problem, min(DP(0,f) for f in 1 to F) since we don’t know which finger to start with, we just take the min. In DAG form, the graph forms a grid as a complete bipartite graph to take into consideration the four parameters. It’s a bit trickier without a single source, so a source node is added and connected to all first note possibilities. Subproblems take theta(F) and there are nF sub problems, so the total time is theta(nF²) which is almost linear time as F should be small.

Unfortunately, in reality, you tend to play chords or multiple notes at a time. First, the input, notes[i] is a list (at most F) notes that you play at once. The state we need to know about the past is the assignment of all fingers to notes or if they aren’t being used. There are (F+1)^F mappings for what fingers could be doing. The code is pretty much the same, where i is a list of notes instead and finger position is hand position. The overall time becomes theta(n(F+1)^2F).

The next problem is called Tetris training. Given a sequence of n pieces that will fall from the top. Pieces can’t change position once dropped and full rows don’t clear in this example. The width of the board, w, is small and the board is initially empty. Regular Tetris without the constraint is np-complete, but with these assumptions it’s easy. Subproblems are suffixes, how to play pieces[i:] which isn’t enough information. We also need to know what the board looks like, specifically how high each column is. Given a board skyline there are n*(h+1)^w subproblems. Guesses are how to play a piece, how to rotate and where to drop the piece, so for each piece there are 4w choices. The overall run time is theta(n*w*(h+1)¹). This is a case where the answer is a boolean value, but in 0 and 1, where 1 is true, you can do this as a maximization problem.

The last problem for the lecture tries to model Super Mario Bros. Given a level with a small w by h screen, if the goal is to maximize score or minimize level time, we can solve these problems. The configuration is everything on screen. We need to know monsters’ positions and states and positions of objects. There’s something like c^wh information which says there’s something for every pixel or position on the screen. Mario’s velocity is also a factor as turning around is slower than going forward, but there are a constant number of choices. Score and time are also factors and are large integers making this a pseudo polynomial algorithm. Lastly, the screen’s position in the level is a factor. There are theta(c^whST) configurations. Each button or move is a possible guess and that with configurations can lead to a graph which can be solved with dynamic programming or shortest paths. With recursion, all reachable configurations can be memoized.

Lecture 23: Computational Complexity

This week is about complex problems that can’t be solved in polynomial time (6.045 is an entire class continuing this topic). Polynomial-time problems (P) cover most of the problems in this class. Trickier problems can only be solved in exponential time (EXP) which is 2^{n^c} and almost every problem can be solved this way. Polynomial problems are also in the range of problems that can also be solved in exponential time and “R” covers both, where R is all problems solvable in finite time. R is basically solvable problems. Most problems are unsolvable and this will be proved. Negative weight cycle detection is a problem that’s in P (and EXP and R). NxN chess is a problem of figuring out who will win from a position can be solved in EXP and not P. Realistic Tetris (knowing order of pieces in advance) can be solved in EXP and if it’s in P is unknown. A halting problem, where given a program you ask whether it stops, is not in R so there’s no correct algorithm given an arbitrary program.

Almost every decision problem is not computable. Most problems so far were yes or no problems. Every program can be represented as a binary string and thus a natural number in N. Every program reduces to an integer. Decision problems however can be thought of as a function which maps inputs to yes or no or 1 or 0. A decision function takes a binary string in N and maps it to 1 or 0. A decision problem is an infinite string of bits while a program is a finite string of bits. A decision problem is something in R, all real numbers while a program is in N all integers and |R| is much bigger than |N|. This means there are way more problems than programs to solve them. For some reason, most of the problems we think of are computable even though most problems we could think of aren’t computable. There’s some meta theorem of how we think about problems.

NP, nondeterministic polynomial, is an area of problem that sits somewhere between P and EXP. NP is the set of decision problems solvable in polynomial time via a “lucky” algorithm. A lucky algorithm can make guesses and always guesses the right choice. This is not a realistic method of computation but it is a model of computation called nondeterministic. It gets at the core of what is theoretical to solve. An algorithm can compute stuff and make guesses, yes or no. Guesses are guaranteed to lead to a yes if possible in this model. Tetris is in this NP complete set when trying to solve for if you’ll survive. Another way to think about NP is as the set of decision problems with “solutions” that can be “checked” in polynomial time. In other words, if the answer is yes, you can prove it and check the proof in polynomial time. It’s easy to prove you can solve a Tetris situation by saying what moves to make.

You’d win the millennium prize if you proved that P does not equal NP and that NP problems could not be solved in P time. You can’t engineer luck. It’s harder to generate the proof of a theorem than to check a proof. Inspiration is luck which is obtained by guessing. Tetris is believed to be in the space of NP-P if they are different and is believed to be the hardest problem in NP. We can prove that Tetris is not in P if P isn’t NP. Tetris is NP-hard, meaning it’s at least as hard as every problem in NP. Tetris is NP-complete meaning it’s NP-hard (can be EXP or worse as well) and in NP. Chess is EXP-hard, at least as hard as any problem in EXP.

Reductions are a way of designing algorithms we’ve used before. Graph reduction converts a problem to a graph and uses techniques for solving graphs. If you can convert a problem into a form you know how to solve, then you’ll be able to solve it. For instance, if you have an unweighted shortest path and want to solve it with Dijkstra, you can add weights of 1 to every edge. You can reduce a min-product path to a shortest path as well or convert longest path to shortest path to solve with Bellman-Ford by negating all of the weights. An NP-complete problem called 3-partition can be reduced to Tetris. If we reduce A to B, then B is at least as hard as A. All of the NP-complete problems can be reduced to each other so if one NP complete problem can be solved in P, then all others can be by reducing them. The travelling salesman problem is NP-Complete as well as Minesweeper, Sudoku and most interesting puzzles.

Lecture 24 – Topics in Algorithms Research

The first personal computer was built in 1981. Intel invented the Intel 8086 and the x86 architecture. The IMB PC ran at 5 megahertz. The Intel 80490 came out in 1989 with 25 MHZ. Intel’s Pentium processor came out in 1993 with 66 MHZ. The Pentium 4 came out in 2000 with 1.5 GHZ thanks to a 30 stage pipeline. The Pentium had some famous bugs, such as the F00F bug which allowed a program to crash the system without admin access. After 2005’s Pentium D where clock speed got up to 3.2 GHZ, clock speed stopped increasing. For instance, the Quadcore Xeon from 2008 only had 3 GHZ and extra performance came from multiple processors (also called cores) on a chip. Because of this switch, a lot of algorithms are being designed to be parallelized to make use of these multiple processors.

For instance, a chip could have four cores and a large amount of static RAM, SRAM (MB), connected to a large amount of DRAM (GB) off the chip. The connection between processors and SRAM is very fast since both are on chip while the DRAM connection is slower. The problem is there will be a bottleneck to access memory. The number of read ports for memory is roughly 4, and the architecture won’t be sustainable beyond 16 or so cores. People are trying to design an architecture where processors and memory are organized in tiles where processors have cache memory attached to them. The space between the processor tiles is reserved for connection and routing algorithms are required for finding shortest paths between processors and weights representing congestion are factored in. A program/thread runs on each process and to get data from another processor it has to make a request to the address of the other processor and requires a round trip cycle to get data back. It can take tens of cycles to send a message and get data back. This routing is a major bottleneck.

Execution migration is the idea of migrating computation instead of data. If a program wanted to access remote memory, it could migrate the program itself, or the context of the program. The context is the program counter or where you are in terms of executing the program and the register file, cache memory and more. This migration is a one way trip instead of a round trip and you don’t have to get anything back after migrating your execution, finishing on the remote processor instead. A downside to this is that the context can be kbits and larger than the data you’d send otherwise.

Deciding which to use can be looked at as an optimization problem. Assuming we can predict the access pattern of a program, and m₁ through m_N are memory addresses and p(m)’s are the processor caches for each m_i. For example, it might want to go P1,P2,P2,P1,P1,P3,P2. The cost of migration, costmig(S,d) is the distance(s,d)+L (contant size for the architecture). The cost of access(s,d) is twice the distance between s and d. The problem is deciding when to migrate to minimize total memory access calls. Each switch between processors is a migration which will add to the cost. Dynamic programming can be used to solve this problem. The program at P1 has a number, Q, of processors. The subproblems are DP(k,P1) is the cost of the optimal solution of the prefix m₁ through m_k of memory access when the program starts at P1 and ends at Pi. With this, there are NQ subproblems. DP(k+1,Pj)= DP(k,Pj)+cost_access(Pj,P(m_k+1)) if Pj doesn’t equal P(m_k+1), otherwise it’s the minimum i=1 to Q of (DP(k,Pi)+cost_mig(Pi,Pj)) if P1 = P(m_k+1). The cost of a subproblem is O(Q) and the total cost is O(NQ²). This type of analysis is realistic for coming up with hardware and algorithms like these are close to what the professor’s team is working on.

Erik takes over and works more on computational geometry. 6.850 covers computational geometry, 6.849 covers folding, 6.851 covers data structures and 6.889 covers graph algorithms and SP268 covers Recreational Algorithms. Most of these should have material on Open CourseWare. 6.046 is the next course we should take and may be a prerequisite to these. Origami is an example of geometric folding algorithms where a flat sheet of paper is turned into a geometric shape. Folding a crease pattern is an example of an NP Complete problem, but designing a pattern from a 3d can be done by an algorithm called Origamizer. There’s also an algorithm for folding to make a shape with a single cut and he shows some examples.

An issue that exists in real computers that gets even worse in parallel is that there’s a bottleneck between the main memory and cache. At some point you can run out of space and have to access main memory. To help with this, memory transfers happen in blocks. Memory transfers are something else that needs to be optimized for algorithms such as sorting and searching. Optimally, search can be done in theta(log_B n) and sorting in theta(n/B log_C n/B) where B is the size of a block and C is the number of blocks in the cache memory. There’s a whole study of these algorithms and they can be achieved even without knowing B and C.

For Word RAM, if we want to do insert, delete, predecessor and successor functions, we can do better than binary search trees. With n integers in the size of the universe, you can get O(lglg u) with a data structure called van Emde Boa, O(lgn/lglgu) via fusion trees and O(sqrt(lgn/lglgn)) by taking the minimum of the last two and is the best you can do, though the data structures are impractical. Graphs can be assumed to be planar or almost planar, meaning they avoid crossings, similar to a roadmap without overpasses. Dijkstra can do these in O(E) time and Bellman-Ford can do them with negative weights in O(n(lg²n/lglgn)) which is almost linear and is the best known to date.

Conclusion

It’s been refreshing to go through a whole course on algorithms which I can see coming back to whenever I’m practicing on HackerRank or a similar site. This course has been the closest thing to what I was hoping to gain from this program, being directly relevant to computer science and going through the previous courses, the mathematical notation used is familiar to me. Maybe at some point in the future, I’ll come back to these notes and fill them out further with information missed from the recitation videos or will go through the book, but I’m going to continue on for now. The last lecture especially gives me some excitement for future courses both for further information on algorithms and the workings of computer programs and hardware itself.

NikCreate

Leave a ReplyCancel reply

From Linux to Lambda: The Evolutions of Compute – Containerization

NickLiftWeight: A Tracker for Weight Lifting

Nick Ate: Food Logging

Trending