MIT Courses #3 – Mathematics for Computer Science

The path of trying to figure out what content I missed by not following a traditional computer science degree program continues and this time I’m looking at Mathematics for Computer Science. The first course, Introduction to Computer Science was all familiar to me which should be reassuring, while going through the Calculus lectures left me more familiar with some of its concepts with a lot left to learn in the field. One thing I still have to learn is where it can be useful and applied to computer science. Most real world examples given were physics related which I expect to be interesting, but what’s the connection to computer science?

I figured if anything would connect math and computer science for me it would be the third course on my list, Mathematics for Computer Science. Having gone through the lectures, I ended up with a mixed reaction. The course was directly relevant to me as it was basically solving and proving the kind of algorithm problems that appear in interviews for coding jobs. It introduced sorting algorithms, data structures and engaging mathematical riddles that I was familiar with code, without using code. While this certainly helped bridge math and computer science, only a couple of lectures used derivatives and integrals and it seemed like I would have missed out on only a small amount without it. This was the most engaging course for me so far since practicing algorithm problems is something I need to work on in the future and has been the first attempt at seeing math applied to my field and hope to enjoy future courses as much as I had this one.

Mathematics for Computer Science: 6042 – Fall 2010 by Prof. Tom Leighton and Dr. Marten van Dijk

Lecture 1 – Introduction and Proofs

The first lecture introduces proofs, a concept popular in mathematics that also exists outside of them. The professor describes the concept of a proof as a method for ascertaining the truth and some ways of ascertaining truth include observing rules of physics such as gravity, coming up with counter examples or relying on authority to confirm a belief. He also gives the formal definition, “a mathematical proof is a verification of a proposition by a chain of logical deductions from a set of axioms.” This definition leaves a lot to unpack, primarily the definitions of “proposition” and “axioms”.

A proposition is a statement that evaluates to a boolean, true or false. For instance, “3+2=5” is a true statement. Propositions will often be harder to figure out. For instance, “∀n in ∈N, n²+n+41 is prime” states that for all (∀) values of n in the set (∈) containing all natural numbers (N), n²+n+41 is a prime number. This looks good at first, but after enough examples, you can find numbers it doesn’t hold true for. Some propositions are can take a long time to prove, for instance Euler suggested “a⁴+b⁴+c⁴=d⁴” has no positive integer solutions in 1769 and this was eventually disproven two centuries later. So “∃ (there exists) a,b,c,d ∊N+ (positive natural numbers), (where) a⁴+b⁴+c⁴=d⁴.” A similar example stated that “313(x³+y³)=z³ has no positive integer solutions,” which is also false but the shortest counter example has over a thousand digits. This equation is important as it’s an example of an elliptic curve and is apparently central to factoring large integers, which is key to breaking cryptography systems.

Other examples are that “the regions in any map can be colored in 4 colors so that adjacent regions have different colors,” suggested by Francis Guthrie and proved using computers eventually, or that “every positive integer but 2 is the sum of 2 primes” known as “Goelbach’s conjecture” and unproven at the time of this course. Riemann’s hypothesis has been proven to 1.5 billion zeroes, but not completely and the Poincare Conjecture, related to 2d and 3d spaces required an 80 page proof with 350 additional pages of details to prove.

Another important notation is that => represents “implies.” An implication P => q is true if p if false or q is true, which means that:If P and q are true, P=>q is true, q=>P is True, P<=>q is TrueIf P is T and q is F, P=>q is F, q=>P is T, P<=>q is FIf P is F and q is T, P=>q is T, q=>P is F, P<=>q is FIf P is F and q is F, P=> is T, q=>P is T P<=>q is T

Because of this, “pigs fly => I’m king” is said to be true by the last implication.

⇔ is if and only iff, also sometimes written as “iff,” so “∀n in ∊Z (integers), n ≥2<=>n²≥4” means that for all integers, n, n ≥ 2 if and only if n² ≥ 4. “If and only if” requires proving both ways, so P implies q and q implies P. The above example is proven false when n=-3. A couple other definitions given that are related to propositions are predicate which is the proposition whose truth depends on the value of variables and universe of discourse which is the variable space you’re talking about.

Moving on to the axiom, “an axiom is a proposition that is assumed to be true.” There isn’t a proof, it’s just something that seems reasonable or from greek, “to think worthy.” The idea is that, while you have to make assumptions to figure things out, it’s important to state what those assumptions are, so people can follow your reasoning even if they disagree with your axioms.

For example, take the following three statements based on categories of geometry. given a line L and a point p not on L:Euclidean Geometry – there is exactly one line through p parallel to LSpherical geometry – no line on the sphere through p parallel to LHyperbolic geometry – infinitely many lines through p parallel to L

Each of these rules makes sense in its own field of geometry because they assume different axioms. When working with axioms, guiding principles suggest they should be consistent and complete. If no propositions can be both proved false and true, then the axiom is consistent. If a set of axioms can be used to prove every single proposition true or false, then it is complete. In the 1930’s Kurt Godel proved that there aren’t any axioms that are both consistent and true, so if you want consistency, there will be facts you won’t be able to prove.

Lecture 2 – Induction

Proofs shown so far have all been direct proofs, with this lecture showing indirect proofs using contradiction. When doing this, specify in the proof that it’s “by contradiction.” For a proof by contradiction, start with the opposite of what you’re trying to prove and prove that it can’t be true, showing that what you were actually trying to prove is true. This idea is noted, “If ¬P => F is true, P is true,” with ¬ being “not.” For instance, to prove that the square root of 2 is irrational, use the knowledge that any irrational number can’t be expressed as a ratio of integers.Assume that P is false and that the square root of 2 can be expressed as a/b, so 2 is a²/b² implying 2b² =a² , which implies that 2|a, implying a is even, implying 4|a², implying 4|2b² , implying b is even which finally implies that a/b isn’t in it’s lowest terms, leading to a contradiction. The symbol “|” in this case is “divides.”

A proof is ended with a QED to suggest that it’s done. The above proof was discovered by the Pythagoreans and has an interesting backstory or legend. A type of proof that can be misleading and should be avoided is a proof by picture. In this example, two triangles created from a rectangle with an area of 90 degrees are slid apart to create triangles that mistakenly show that 90 is greater than 92 degrees which looks like a paradox, but only because the drawing isn’t to scale.

Induction is the most used proof technique in computer science and should be the biggest takeaway from this course according to the professor. Induction itself is itself an axiom: “Let P(n) be a predicate. If P(0) is true and ∀n in ∊N (P(n) => P(n+1)) is true, then ∀n in ∊N P(n) is true.” In other words, if you can prove a base case at zero and an inductive step, then each predicate implies the next, so every case of n+1 can be assumed to be true.

For example, take the proposition, “∀n ≥ 0, 1+2+3+…+n=(n(n+1))/2.” This states that adding all positive integers from one to n, gives (n(n+1))/2. It’s easy to check for any value but induction is important as its hard to prove it holds true for all values. First figure out what the predicate “P” is, that is, what are you trying to prove. In this case, P(n) can be the proposition that the sum of 1 to n of i is (n(n+1))/2. Prove it works for the base case, 0 which holds true. Then prove the inductive step. For n≥0, show that P(n)=>P(n+1) is true. You can assume P(n) is true for the purpose of induction. For example, assume the function is true for n to show that it’s true for n+1.

Often induction doesn’t give you the answer and just proves its right once you have it. Inductions also don’t always have to start at zero, just make sure to verify your starting point and the implication for all values of n from that starting point. It is also possible to fool yourself with proof by induction and comical example tries to prove that for any set of n where n ≥ 1 horses are the same color. Base case shows the horse is the same color as itself, and then suggests that all n+1 are the same color. The flaw is that it’s never proven beyond the base case of 1 horses and doesn’t prove that P(1) => P(2).

Another problem shown is tiling a courtyard with 3-square L-shaped tiles while leaving one square in the center for a statue. The claim is “∀n, ∃ way to tile a 2ⁿx2ⁿ region where center square is missing.” This is difficult to prove as a 2 by 2 area can’t have a center, so they instead try to prove he can go in a corner, proving something similar to use as a “lemma.” It can often be easier to prove for a seemingly harder case, such as that any square could be missing. “If at first you don’t succeed, prove something harder.” Overall, when it comes to inductions, it’s all about choosing the right induction hypothesis to prove as a good one makes it easier and a bad one makes it harder.

Lecture 3 – Strong Induction

A proof generally has seven characteristics, it should be correct, complete, clear, brief, elegant, well organized, and in order. A tool that can help organize proofs that was briefly touched on is a lemma, which can be used like a subroutine in code. This lecture starts with the following problem. Find a sequence of moves to go from 3×3 grid from A-H (with G and H switched) with a blank spot at (3,3). The goal is to get all letters in order. A legal move is sliding a letter into an adjacent blank square either by row or column. The students eventually determine that there is no sequence of legal moves to invert G and H and return all the other letters to their original order, but still need to prove this. This is proved by finding an invariant where if the property holds true at start and after every step, you can only reach steps with that property. What property is held by every state moved to except the desired goal?

The first lemma, “Lemma 1” used here states that “a row move doesn’t change the order of the items.” It’s proved by moving to cell i+1 or i-1 with nothing else moving, keeping the same item order. Lemma 2 states that in a column move changes the relative order of precisely 2 pairs of items, proved that moving to i-3 or i+3 changes position with 2 other items. A pair of letters L₁ and L₂ form an inverted pair if L₁ proceeds L₂ but comes after it in the puzzle. Lemma 3 is determined and states that during a move the number of inversions can increase by 2, decrease by 2 or stay the same, proved by the previous 2 lemmas. This means that during a move, the parity (if the value is even or odd) of the number of inversions doesn’t change, proved since adding and subtracting 2 doesn’t change the parity. This fact is assigned to “Corollary 1.”

Lemma 4 states that in every state reachable from the start state, the parity of the number of inversions is odd. This is proved by induction using P(n), after any sequence of n moves from the start state. In the base case, n=0, the number of inversions in the start state is 1 and the parity is odd, satisfying the hypothesis for the base case. In the case of the inductive step, after (n+1) moves, any sequence of n+1 moves, by inductive hypothesis, we know that the parity after moves m₁…m_n is still odd. Corollary 1 shows that that the parity of the number of inversions does not change during M_n+1 => the parity after all M_n+1 moves is odd => P(n+1). This completes induction the induction and shows P(n)=>P(n+1). To finally complete the theorem stated, the parity of the number of inversions in the desired state is even (0). By lemma 4, the desired state can not be reached from the start state since its parity is odd.

Going over a few terms used in the above example, corollaries are usually a simple consequence of something else and a short proof. A lemma is usually used to prove another lemma or a bigger proof. You can also use invariants to prove that systems can never reach an undesired state such as a plane crash or anything a system needs to avoid. As an induction is itself an axiom, a stronger version also exists which can help prove theorems easier. This strong induction axiom states “Let P(n) be any predicate. If P(0) is true and ∀n (P(0)) and P(1) and … and P(n)) => P(n+1) is true, then ∀n P(n) is true.” To show an induction you can assume all these are true.

The second problem shown in this lecture is the unstacking game, where teams split stacks of 8 blocks into two stacks each turn and multiply the count of the stacks they split at each move until 8 stacks of 1 exist. For example, splitting 8 into 5 and 3 gives 15 points for that turn. The team with the most points wins. This game played by the students is the first of many trick examples as the students figure out that no matter what move they make, the result is determined from the start. The theorem states that all strategies of n-block game produce the same score, S(n). In the case of n=8, S(8)=28. Proving P(n) this by strong induction, the base case S(1)=0 is the same, so it’s down to the inductive step, assuming P(1), P(2),…P(n) prove P(n+1). First, look at n+1 blocks. N+1 -> k and n+1-k. The score is defined k(n+1-k)+P(k)+P(n+1-k), but we want to be able to say that the depends on n and not k, so that score equation doesn’t prove it. If you make the theorem stronger by saying the function, S(n) = n(n-1)/2. It fits with the base case and works for all examples, as k(n+1-k) + (k(k-1))/2 + ((n+1-k)(n-k))/2 = n(n-1)/2.

Fermat’s last theorem is also shown in this lecture which is ∀n>2 ¬∃ x,y,z ∊ n⁺ such that xⁿ+yⁿ=zⁿ and has been proved.

Lecture 4 – Number Theory 1

This lecture introduces another way of thinking using concepts from previous lectures. Number theory is the study of integers and is commonly used in cryptography, which is the study and practice of hiding numbers. An example is shown from the movie Die Hard 3 where the characters use 3 gallon and 5 gallon jugs to measure 4 gallons by pouring back and forth. The rule “m|a (m divides a) iff ∃_k a=k*m.” is used to set up the theorem and explain the situation. If you have an “a” gallon jug and a “b” gallon jug, assume a≤b. The first theorem states that if m|a and m|b, then m|any result. To show this, the lecture introduces a state machine where states are pairs of x and y where x is the number of gallons in the “a”-jug and y is the number of gallons in the “b”-jug.

The starting state is (0,0) and future states are determined by transitions. There are three types of moves that can be made. You can empty a jug, making the change, (x, y) -> (0,y) or (x,0). You can fill a jug represented as (x,y) -> (a,y) or (x,b). Or, you can pour one jug into another, stopping when full leading to four possible situations.(x,y) -> (0, x+y), (only if) x+y < or = b(x,y) -> (x-(b-y), b) = (x+y-b, b), x+y > or = b(x,y) -> (x+y, 0), x+y < or = a(x,y) -> (a, y-(a-x) = (a, x+y-a), x+y > or = a

An example of how this plays out in this representation for a=3 and b=5 is (0,0) -> (0,5) -> (3,2) -> (0,2) -> (2,0) -> (2,5) -> (3,4). This example succeeds as the goal is to get 4 gallons into a jug. Proving this theorem by induction, you can assume m|a and m|b and use an invariant you expect to hold true. In this case, P(n)= “if (x,y) is the state after n transitions, then m|x and m|y.” At the base case, (0,0), m|0 -> P(0) is considered true, which raises some questions given the “divides zero.” For the inductive step, assume P(n), suppose that (x,y) is the state after n transitions. P(n) -> m|x and m|y. After another transition, each of the jugs are filled with either 0, a, b, x, y, x+y-a, x+y-b, or x+y gallons, since m|0, m|a, m|b, m|x, m|y, m can divide any above value and P(n+1) is shown to be true. For example, plugging in 33 and 55, since both are only divisible by 11, any answer reached has to be divisible by 11 as well.

The greatest common divisor of a and b is used often and represented as gcd(a,b). For example, gcd(3,5)=1 as nothing else divides them. Additionally, a and b are considered “relatively prime” if their greatest common divisor is 1. A corollary using this for the jug problem states that gcd(a,b) can divide any result. The second theorem states that any linear combination, L=sa+t*b, of a and b with o at most L at most b can be reached. For example, 4=-2*3+2*5 for a=3 and b=5. The theorem will be more useful if s is positive and may need to be switched around to get in that format using s’ as a substitute for s. For example, adding 5*3-3*5 to above changes it to the format 3*3 – 1*5. This is proved since L=sa+tb = (s+mb)a + t(t-ma)b, so ∃_s’,t’, L= s’a+t’b with s’>0

An algorithm for solving this problem and get L gallons is to repeat the following steps s’ times. First, fill the “a”-jug, next pour into the “b”-jug, then if the “b”-jub becomes full, empty it and continue pouring the “a”-jug until its empty. By doing this, you can reach the desired state. For instance, with a=3 and b=5, this takes 3 loops.loop 1: (0,0) -> (3,0) -> (0,3).loop 2: (0,3) -> (3, 3) -> (1, 5) -> (1, 0).Loop 3: (0, 1) -> (3,1) -> (0, 4)

Assuming at the end, you’ve filled the a-jug s’ times and the b-jug is emptied u times (with u being an unknown value), let r be the remainder in the “b”-jug, with 0≤r≤b and 0<L<b and r and L calculated by r=s’*a – u*b and L=s’a+t’b, you can show that r equals L. If r=s’a-ub = s’a+t’b-t’b-ub = L – (t’+u)b, what property must (t’+u) have to make that statement true? If t’+u != 0 => [r , 0 or r > b], so t’+u=0 -> u = -t’ => r = L. The values s’ and t’ are important because they make linear combinations. Having positive integers is important because you can’t repeat steps a negative number of times, so the algorithm wouldn’t work.

Moving on, Euclid’s algorithm states that for any b and a, there exists a unique q (quotient) and r (remainder or rem(b,a)) such that b = qa+r with the property that 0 < or = r < a. This can be used to determine the lemma gcd(a,b) = gcd(rem(b,a), a) that the greatest common denominator of a and b is equal to the greatest common denominator of the remainder of b and a and a. An example with 105 and 224 shows gcd(105, 224) = gcd(rem(224, 105), 105) = … = gcd(7,14) = gcd(rem(14,7),7) = gcd(0,7) = 7. A proof that [m|a and m|b] => [m|b-qa = rem(b,a) and m|a] shows that if rem(b,a) isn’t 0, then [m|rem(b,a) = b-qa and m|a] -> [m|a and m|b] and if rem(b,a) = 0 = b-qa = > b=qa and m|a => m|b. Proving both inequalities in the lemma also proves the equality in this case.

The next theorem states that gcd(a, b) is a linear combination of a and b from. This can be proved by induction via the invariant, “P(n)=”If Euclid’s Algorithm reaches gcd(x,y), after n steps, then both x and y are linear combinations of a and b.” Ths holds true for the base case and the inductive step shows that ∃_q, rem(y,x)=y-qx and there is a linear combination of a and b => P(n+1). By combining all three theorems so far, we can get the final result and theorem, “gcd(a,b) is the smallest positive linear combination of a and b.” This is true as results from 0 to b can be reached and gcd is a linear combination that can be reached by the theorem and divides all results from 0 to b. Therefore, it must be the smallest one and the previous theorems prove this.

Lecture 5 – Number Theory 2

Moving on with number theory, this lecture dives deeper into encryption. This is an application of number theory and helps hide information so it can only be accessed by the proper people. The first step in the encryption process occurs the information transfer. Keys are exchanged that will allow the recipient to decrypt the information. For the encryption step, m’=E_keys(m), transforms keys so that output doesn’t give any information about the original message. The sender sends the encrypted message and the receiver decrypts in the decryption step. D_keys(m’), can only get the original message from the encrypted message if you have the keys. So a message needs to be sent and is encrypted and decrypted with functions based on the keys used.

Turing first proposed to use number theory in cryptography. The proposal was rejected, but this lecture is looking at some examples of his proposal. An example of version one of his code would send “victory” as m=“2209032015182513,” which has each letter’s place in the alphabet with “13” added at the end to make it prime. Beforehand, a secret prime number “k” would be exchanged. To encrypt, m’=mk and decrypt with m’k=m. IT’s hard to factor a product of 2 large primes, but if two messages got intercepted, you could use both to calculate gcd of k, making decryption possible by intercepting two messages. Because prime numbers only have 1 and themselves as factors, prime numbers as factors creates a problem in this case.

The second version of Turing’s code starts by exchanging a public prime “p” and a secret prime “k.” To encrypt the message, take the message as a number m ∊ { 0, 1,…p-1 } and compute m’=rem(mk, p). To decrypt, use m=rem(m’k¹,p). For some definitions, a and b are considered relatively prime iff gcd(a,b)=1 iff ∃_s,tsa+tb=1 and x is congruent to y modulo n iff n|(x-y). The congruent symbol is ≅, for example 31≅16(mod 5). The multiplicative inverse of x mod n is a number x^-1, in {0, 1…n-1} such that x*x^-1≅1(mod n) and rem(mk,p)≅mk(mod p). If k*k^-1≅1(mod p), then m’k^-1≅mk*k^-1≅m (mod p) and m = rem(m’k^-1, p). First we proved that the difference was a multiple of p and knew that m was in range 0 to p-1. This code is slightly more complicated to crack, but there’s a known attack using plaintext. If the adversary knows the message, m and encryption m’=rem(mk,p). P is a public prime and m’=mk(mod p). Since the greatest common denominator of the message and the prime is 1, compute m^-1 such that m*m^-1 ≅ 1(mod p) and compute k^-1(mod p).

Euler’s totient function states that φ(n) denotes the number of the integers in {1,2,3…,n-1} that are relatively prime to n. For example, n=15 has 1,2,4,7,8,11,13,14, so φ(n) is 8. Euler’s theorem states that if gcd(n,k)=1, then k^φ(n)=1 (mod n). To prove this theorem, use a couple lemmas. First, lemma 1 states that if gcd(n,k)=1, then ak≅bk(mod n) => a≅b(mod n). gcd(n,k)=1 iff k has a multiplicative inverse. This proof of this is “gcd(n,k)=1 ⇔ ∃_s,tsa+tb=1 ⇔ ∃_t n|(kt-1) ⇔ kt ≅ 1 (mod n).” Lemma 2 states, “Suppose that gcd(n,k)=1, let k₁,…,k_r in {1,2,3,…,n-1} denote the integers relatively prime to n (r=φ(n)).” So {rem(k₁*k, n),…,rem(kr*k,n)} = {k₁…k_r}. Using this, show that it’s not possible for 2 remainders to be equal. This second lemma is proven with 2 proofs. First, rem(k1*k, n) = rem(kj*k, n). If they are all different then {rem(k₁*k, n),…,rem(k_r*k,n)} should have exactly r. If we know they’re equal, kik≅kjk (mod n), you can use the previous lemma to show that k₁≅k_j(mod n) => k_i=k_j and any two remainders can only be equal if k_i=k_j. The second version of the proof states that gcd(n, rem(k1k,n) = gcd(n, k1k) = gcd(n,k)=1 and gcd(n,k1)=1, and proves set of remainders is a subset of {k₁…k_r}. A proof of k₁*k₂*…*k_r = rem(k₁k, n)…rem(k_rk, n) ≅ k₁k, k₂k,…k_rk (mod n) ≅k₁k₂…k₁*k^r (mod n) is that the whole product is relatively prime to n and you can sub a for 1 and b for k_r so 1≅k^r(mod n) so r=φ(n) proves the theorem. Another famous theorem, Fermat’s little theorem states that if p is prime and k ∈{1,2,…p-1}, then k^p-1≅1 (mod n). The proof to this is 1,2,…,p-1 are relatively prime to p, so φ(p) = p-1. k^φ(n)≅1 (mod p) => k^p-1≅1 (mod k). k*k^p-2 = k^p-1≅1 (mod p) and k^-1 = k^p-2 (mod k).

Lastly, the lecture talks about the RSA algorithm which was created decades after Turing and uses number theory to create a public key encryption scheme where the sender and receiver don’t exchange a secret key. Beforehand, the receiver creates a public key and a secret key. They publish the public key and keep the secret key. Anyone can use the public key to encrypt a message and send it to the receiver, who can use the secret key to decrypt the message. First, two distinct primes, p and q are generated and n is set equal to p*q. Select a value for s such that gcd(e, (p-1)(q-1)) = 1, and the public key is the pair (e, n). Compute d such that de≅1 (mod (p-1)(q-1)). The secret key is the pair(d, n). Encryption is done by setting m’ = rem(m^e, n) and decryption by m=rem((m’)^d, n).

This decryption works because m’ = rem(m^e, n) ≅ m^e(mod n), implying that (m’)^d ≅ m^ed(mod n). ∃_r ed=1+r(p-1)(q-1), so (m’)^d≅ m^ed≅m*m^r(p-1)(q-1)(mod n). Applying fermat’s theorem

RSA is widely used after many decades and number theory is still used as of this class’s run. Appling fermat’s theorem, If m is not congruent to 0 (mod p) then m^p-1≅ 1 (mod p) and if m is not congruent to 0 (mod q), then m^q-1≅1 (mod q). The message can be determined to be in range 0 to n-1 and (m’)^d≅m (mod n) and m=rem((m’)^d,n).

Lecture 6 – Graph Theory and Coloring

This lecture introduces the data structure known as a graph, represented as a series of nodes connected by edges. Nodes are typically labeled with edges sometimes being labelled as well. A formal definition is that a graph G is a pair of sets (V,E) where V is a nonempty set of items called vertices or nodes and E is a set of 2-item subsets of V called edges. For example, V={x₁, x₂, x₃…x₇}, E = {{x₁,x₂}, {x₁, x₃}, …, {x₅,x₇}}. Edges are also written x₁-x₂, based on a graph of V nodes connected by E edges. It’s possible to have a graph of nodes with no edges as well. Two nodes x_i and x_j are considered adjacent if they’re connected by an edge (ex. If x_i-x_j is an edge). The number of edges incident to a node is the degree of a node. A graph is considered simple if it has no loops or multiple edges, where a loop is an edge that connects a node to itself and a multi-edge is 2 edges connecting the same two nodes.

One problem brought up in this lecture was a study from the university of chicago suggesting that men had more opposite gender partners which this lecture attempts to debunk with graph theory. When modelling the number of partners, nodes are people and edges are used to connect them. |V| is used to denote all of the nodes, about 300 million, with |V_m| being 147.6 million men and |V_w| being 152.4 million women. The goal is to find |E|, the cardinality of the edges. To find this, look for the average degree of men compared to the average degree of women. A_m is the average number of partners for men and A_w is the same for women. What is A_m/A_w? UChicago says 1.74 and a study by ABC news said it was 3.33. But A_m can be calculated with A_m = |E|/|V_m| and A_w=|E|/|V_w|, and A_m/A_w is |V_w|/|V_m| which equals 1.0325. In actuality, the number could be calculated based on population counts as if a male had a female partner, then the female had a male partner.

Another graph example takes MIT classes as nodes with edges meaning overlapping student enrollment to avoid having exams at the same time. Courses that are not connected with edges can have exams at the same time. A series of slots for exams are specified, with the goal of using earlier exam slots and the goal is to put every node in a slot so that nodes hooked by an edge aren’t put in the same slot. This is an example of a graph coloring problem, where given graph G and K colors, a color is assigned to each node so that adjacent nodes get different colors. In a graph coloring problem, the minimum value of K for which such a coloring exists is known as the chromatic number of the graph (χ (G)). For example, in a graph with 3 nodes forming a triangle with edges, the minimum amount of slots that could be used is 3 since none of them can happen at the same time as each other in the test example. This is an example of an NP complete problem, where no one knows how to solve it with less than exponential time. With NP complete problems, solutions are easy to check and if you figure out how to solve one in better time then the solutions to the rest should become clear. Better solutions are generally considered impossible and are one of computer science’s biggest challenges.

An example of an algorithm that can solve the graph coloring problem, even if not great, is to first order the nodes v₁ through v_n and order the colors c₁ through c_n. Then, for i, 1 through n, assign the lowest legal color to v_i. Different orderings will result in different numbers of colors. You’ll generally do better if you color higher degree nodes first. This is an example of a greedy algorithm, it does the best it can do at each step and doesn’t go back to improve. Sometimes it works great, sometimes it doesn’t, but greedy algorithms can be a good starting point.

One theorem proposed is that if every node in G has a degree of at most “d”, then the basic greedy algorithm uses at most “d”+1 colors for G. For example, a triangle of nodes with 2 degrees leads to a minimum of 3 colors. When proving this by inductions and working with graphs, it’s important to be careful about taking the theorem as the induction hypothesis. P(d) can lead to a disaster and it’s better to induct on n nodes or edges. If every node is in an n-node graph G, fit n into the theorem and induct on n. In the base case, n = 1 and there are 0 edges so the degree is 0, and one color works. For the inductive step, assume P(n) is true and let G=(V,E) be any (n+1)-node graph. Let d be the max degree in G. Order the nodes in any way. Remove the v_n+1 node from G to create G’=(V’,E’) and remove edges connected to the node to get the previous graph. G’ has a max degree of at most d and n nodes so P(n) says the basic algorithm uses at most d+1 colors for v1 through v_n. V_n+1 has at most d neighbors => ∃ color in {c₁,c₂,…,c_d+1} not used by any neighbor, so give v_n+1 that color. The basic algorithm uses at most d+1 colors on G, so P(n+1) is true and the induction is complete.

While the theorem looks good, it’s way off on other graphs, such as where every edge is connected to a center or for the example where two groups of men and women are only connected to the members of the opposite group. A graph G = (V,E) is considered bipartite if V can be split into V_L, V_R such that all the edges connect a node in V_L to a node in V_L. It’s called this because you can display it in two colors, though ordering algorithms that aren’t great can lead to more colors mistakenly. A real world example of this are servers that have critical functions, an edge between server nodes means that they can’t be updated at the same time. Another example is the previously mentioned map coloring problem where each country is a node and you can’t color adjacent ones the same color since you wouldn’t be able to distinguish countries. Graph coloring problems are often disguised as scheduling problems.

Lecture 7 – Matching Problems

Continuing, on the subject of a graphs, an example of a graph matching problem is “given graph G=(V,E), a matching is a subgraph of G where every node has degree 1.” An example is shown of a connected graph, creating a subgraph so each node is only created to one other. This usually drops some nodes, but if you can get every node from the original in the subgraph, you get a perfect matching. A matching is considered perfect if it has size |V|/2. In some cases, some matchings are more important than others. They can be assigned weights in a weighted graph where the goal is to find the matching with the minimum weight where the weight of a matching M is the sum of the weights on the edges in M. Usually, when looking at weighted matches, you have a perfect matching. A min-weight matching for a graph is a perfect matching for it with the minimum weight. Finding minimum weight and maximum edge problems are solved and not NP complete, generally with quadratic or cubic solutions. A matching M, x and y form a rogue couple if they prefer each other to their mates in M and a matching is stable if it has no rogue couples. The goal is to find a perfect matching that is stable, it might not have the minimum weight, so every node doesn’t get its preference but also its stable as it won’t break. You can prove a graph doesn’t have a stable matching by assuming there is one.

An example of a stable graph shows there’s a love triangle where each person’s preferred choice is the next, plus a remaining one no one has a preference for. In this case, M isn’t stable as anyone matched to the fourth leads to a rogue couple. You can always find stable matchings between two distinct groups forming pairs, but this doesn’t work if any node can be matched to any other.

An example of an algorithm for finding stable matchings solves a problem where there are n boys and n girls. Each boy has ranked preferences of girls and vice versa. Lists are complete and have no ties, the goal is to find a perfect matching with no rogue couples. Giving each boy or girl their first choice won’t work in this scenario as it leads to rogue couples, but the following algorithm is a good solution. Each day, each boy goes to the highest priority girl on their list. Each girl chooses their favorite from the ones that went to them. Boys who got rejected cross that girl off their list and repeat the algorithm the next day until everyone’s matched. This matching algorithm terminates in at most n²+1 days as by contradiction, if it didn’t terminate on a day, a girl had to reject a boy who would move on to another. Progress continues based on crossouts. If there are n lists with n names each, then there are n² crossouts, which contradicts the idea it couldn’t.

This is a common proof technique where you show progress keeps getting made each loop and that if progress is being made, it will lead to the algorithm terminating. To help prove the matching algorithm leads to stable pairings, first set an invariant P, “If a girl G ever rejected a boy B, then G has a suitor who she prefers to B.” Lemma 1 states that this is an invariant, via proof by induction on the number of days. No one’s been rejected in the base case and in the induction step, assume P holds after d days. There are two cases, first the girl rejects the boy on day d+1, because there’s another boy there. So P is true on day d+1. In case 2, the girl rejected the boy before d+1, which implies that she had a better suitor on day, d and thus has the better suitor at d+1, so P holds true.

Another theorem states that everyone is married in the matching algorithm. This is proved by contradiction. Assuming a boy is not married at the end, then the boy was rejected by all girls, implying they had a better suitor and all girls are married. With 2 groups size n, if one group is married, then the other is, so there’s a contradiction. The third theorem states that it produces a stable matching. Let B and G be an unmarried pair. In case 1, G rejected B and had a better suitor so they can’t be rogue. In case 2, G didn’t reject B, which means B never got to G as B found a better partner, so no rogue pairing. Letting S be the set of all stable matchings, for each person P, we define the realm of possibility for P to be the { Q | ∃ M ∊ S, {P,Q}∊M}. A person’s optimal mate is defined as their favorite from the realistic possibilities and their pessimal mate is their least favorite. Theorems 4 and 5 state that the algorithm marries every boy with their optimal mate and every girl with their pessimal mate. By contradiction, assuming the fourth theorem that boys match with their optimal mate, if there is a stable matching M where a girl G fails worse than in the matching algorithm, B’-G in m would have to be worse than B-G in the matching algorithm. In the real world, this type of algorithm is commonly used for intern/workplace matching and load balancing algorithms, notably at Akamai where the professor works. It helps to balance cost and performance for load balancers.

Lecture 8 – Graph Theory 2: Minimum Spanning Trees

Continuing on with graph theory, this lecture focuses around minimum spanning trees. To start explaining those, two important definitions related to trees are walks and paths. A walk is defined as a sequence of vertices that are connected by edges, for example a walk from v₀ to v_k. A path is a walk where all v_i’s are different from each other. It’s possible to start with a walk and construct a path from it. The first lemma of this lecture states this possibility as “if there exists a walk from u to v, then there also exists a path from u to v. To prove this, suppose a walk from u to v exists. By the well ordering principle, a walk of minimal length exists as well, u=v₀-v₁-…-v_k=v. By contradiction, if its not a path, then the two vertices must be the same. In base case, k=0, where k is 1, u-v are directly connected with one edge. With k ≥ 2, suppose the walk isn’t a path. There is an i that isn’t equal to j, where v_ij. U = v₀-…-v_i=v_j-…-v_k=v is a shorter walk, which is a contradiction.

A graph is considered to be connected if every two pairs of vertices are connected by edges. Two nodes u and v are connected if there is a path that goes from u to v. Two additional features graphs can have are cycles and closed walks. A closed walk is a walk for which the start and end vertex are the same, for example v₀-v₁-…-v_k=v₀. If k≥3 and v₀,v₁,…v_k-1…v_k=v₀ are all different, then it’s called a cycle. This lecture introduces trees as connected and acyclic graphs, with their leaves being any node with a degree of one.

A second lemma states that any connected subgraph of a tree is itself a tree and is proved by contradiction. If the connected subgraph weren’t a tree then it must have a cycle, but since it’s taken from a tree with no cycles, removing a node won’t add a cycle, so it must be a tree. The third lemma states that a tree with n vertices has n-1 edges and is proved by using induction of n. P(n)=”There are n-1 edges in any n vertex tree.” In the base case, P(1), there are no edges which is one less than 1 vertex. In the inductive step, assume P(n) is true and let T be a tree with n+1 vertices. You can remove a leaf to get to a previous state. Let v be a leaf of the tree and delete it, creating a connected subgraph which is a tree as proved by previous lemma. By P(n), it has n-1 edges. Lastly, re-attach v so T has (n-1)+1 which is n edges.

A new type of tree introduced is a spanning tree, where a spanning tree (ST) of a connected graph is a subgraph that is a tree with the same vertices as the graph. The lecture’s first theorem states that every connected graph has a spanning tree, which is proved by contradiction. If this weren’t true, then there should be a connected graph, G, without a spanning tree. Then let T be a connected subgraph of G with the same vertices and the smallest number of edges possible. If T isn’t a spanning tree then it would have to have a cycle, yet all vertices in G are still connected after removing an edge, constructing a smaller graph with the same vertices and proving that T can’t be the smallest possible graph.

When adding weights to edges, the Minimum Spanning Tree of an edge-weighted graph G is the spanning tree of G with the smallest possible sum of edge weights. A greedy algorithm for creating a spanning tree graphs a subgraph one edge at a time, at each step adding a minimum weight edge that keeps the subgraph acyclic. The second theorem to be proved states that for any connected weighted graph, G, the algorithm will create a minimum spanning tree. A lemma that helps with this states that S is the first m edges the algorithm selects and that there is a minimum spanning tree T=(V,E) for G such that S less than or equal to E. Proving the theorem, If < n-1 edges are picked, then there exists an edge in E-S that can be added without creating a cycle. The algorithm can always select an edge without creating a cycle. Once m=n-1, we know that S is a minimum spanning tree.

The lemma can be proved by induction on m: P(m)=”∀G, ∀S consisting of the first m selected edges, ∃ MST T=(V,E) of G such that S is a subset of E.” In the base case, m=0 and S is an empty set as S is subset of E for any MST T=(V,E). The inductive step assumes P(n) and has e denote the (m+1)st selected edge and S denote the first m selected edges. By P(m): Let T*=(V,E*) be a MST such that S is a subset of E*. In the first case, e ∊ E*: S∪{e}≤ E* -> P(m+1) where ≤ refers to a subset and ∪ is the union of sets. In the other case, where e ∉ E*, S∪{e} has no cycle. T* is a tree which implies (V, E*∪{e}) has a cycle and the cycle has an edge e’ ∊ E*-S. Find e’ which is in E* but not in S. The algorithm could have selected e or e’ => weight of e ≥ weight of e’. Swap e and e’ in T: Let T**=(V,E**). E**=(E*-{e’})∪{e}, T** is acyclic because we removed e’ from the only cycle in E*∪{e}, it’s connected and contains all vertices of G. These three points together prove that T** is a spanning tree of G. We know that the weight of T** ≤ weight of T* by relation by e and e*, so we know that T* is a minimum spanning tree, and T** must be a minimum spanning tree as well.

Lecture 9 – Communication Networks

One great application of graph theory is routing packets through networks such as computers routing data flow and telephones switching networks. A new type of tree that can help with these problems is a binary tree, specifically a complete binary tree for this problem. Let O ⇔ O show two nodes with arrows allowing data to flow in both directions. O is a switch and direct packets through networks, eventually reaching a terminal which is the source and destination of data, represented here as D. A binary tree is drawn out using nodes, which communicate both ways, with terminal to end nodes being one direction with a terminal for each direction. The 4 input nodes and 4 output nodes create a 4*4 network. Input and output terminals are separated, but the graph is designed so that you can reach any output from any input since the binary trees has edges for both directions.

Latency is the time required for a packet to travel from an input to an output. Diameter of a network is the length of the shortest path between the input and output that are furthest apart. In order to figure out worst case path from input to output, so figure out diameter. If you have n inputs and n outputs, you’ll have to go all the way up to the root of the tree and all the way back down. The diameter is 2*(1+logN) where n is a power of 2 for the example and the log is base 2. About half of the switches are 3 by three switches, two nodes underneath and one node above. Node above terminals is 2*2 since it can only send in two directions. In 4*4, the diameter is 6 since 3 edges and 7 switches. With the root row having one, the following having 2 and final row having four.

Imagining traffic flowing through the graph raises the question, hat if you had one master switch that all input and outputs went into? All inputs and outputs could be connected to a single node/switch. While this looks good, it requires more complicated routing is in the monster switch with hidden complexity. A problem that can occur in these systems is congestion especially at a route node when multiple inputs are hitting a node at the same time. In the case of a binary tree, the number of smaller switches that are used is 2n-1 switches where n is number of inputs. A permutation is a function pi: {0 to n-1} -> {0 to n-1} such that no two numbers are mapped to the same value. For instance, pi(i) = pi(j) iff i=j. A permutation routing problem for pi is For each i, direct the packets at input i to output pi(i): path taken is denoted by P_i,pi(i). For a definition of congestion, the congestion of paths P_0,pi(o),…,P_N-1,pi(N-1) is equal to the largest number of paths that pass through a single switch. So, the congestion for 4*4 is 4, since all input paths could have something stuck at root at once. The max congestion for the binary tree is n.

Another type of graph that can be used for networking is a 2D array which looks likeIn₀ -> T -> O -> O -> O -> O V V V VIn₁ -> T -> O -> O -> O -> O V V V VIn₂ -> T -> O -> O -> O -> O V V V VIn₃ -> T -> O -> O -> O -> O V V V VOutputs: T T T T Out₀ Out₁ Out₂ Out₃

For any n, diameter is 2n. Switches can be smaller as every switch is 2*2, though there are n² switches. A congestion theorem states that the congestion of an n-input array is max 2. To prove this, let n be a permutation. P_i,pi(i)= path from ini rightward to column pi(i) and downward to output pi(i). If you look at the switch in row i and column pi(i), they transmit at most two packets since they can only come from left or top. If pi(0), pi(N-1) = N-1 => congestion 2.

Another type of network shown is a butterfly network which is similar but also lets nodes route diagonally to rows 1, 2 or 4 nodes away to help avoid congestion in the following example.In₀ -> T -> O -> O -> O -> O -> T out₀ 4X 2X XIn₁ -> T -> O -> O -> O -> O -> T out₁ 4X 2X XIn₂ -> T -> O -> O -> O -> O -> T out₂ 4X 2X XIn₃ -> T -> O -> O -> O -> O -> T out₃ 4X 2X XIn₄ -> T -> O -> O -> O -> O -> T out₄ 4X 2X XIn₅ -> T -> O -> O -> O -> O -> T out₅ 4X 2X XIn₆ -> T -> O -> O -> O -> O -> T out₆ 4X 2X XIn₇ -> T -> O -> O -> O -> O -> T out₇

Each node can send either to the next or to a switch NX above or below in the next column. This combines a tree structure with a grid and represent inputs and positions by bits. Columns 000, 001, 010, 011, etc. for starts of inputs by level. A switch is uniquely identified by its row and column, for example (b₁,…,b_logN, L) -> (b₁…b_L+1,…b_logN,L+1) . You gould go from (x₁ to x_logN, 0) to (y₁, …,y_logN, logN). Route the initial traffic to (y₁, x₂, x₃,…x_logN, 1), switching x₁ to y₁, and leaving the other bits the same. One by one, swapping the bits as you move, outputting 5’s binary is 101. If the input’s bit is different than the output, cross for that bit, if they’re the same, go straight. The diameter is 2 logN, switches are 2×2, the number of switches is N(1+logN) and congestion is sqrt(N) or sqrt(n/2) depending on n is even or odd.

Lastly, is a Benes network, which is similar to a butterfly network and completely eliminates congestion.In₀ -> T -> O -> O -> O -> O -> O -> O -> T out₀ 4X 2X X 2X 4XIn₁ -> T -> O -> O -> O -> O -> O -> O -> T out₁ 4X 2X X 2X 4XIn₂ -> T -> O -> O -> O -> O -> O -> O -> T out₂ 4X 2X X 2X 4XIn₃ -> T -> O -> O -> O -> O -> O -> O-> T out₃ 4X 2X X 2X 4XIn₄ -> T -> O -> O -> O -> O -> O -> O -> T out₄ 4X 2X X 2X 4XIn₅ -> T -> O -> O -> O -> O -> O -> O -> T out₅ 4X 2X X 2X 4XIn₆ -> T -> O -> O -> O -> O -> O -> O -> T out₆ 4X 2X X 2X 4XIn₇ -> T -> O -> O -> O -> O -> O -> O -> T out₇

This graph has a 1+2logN diameter, 2×2 switches though there are 2NlogN switches in total and can route for maximum congestion of 1. The final theorem of the lecture states that the congestion of the N-input Benes network is 1, when N=2^a for some a ≥ 1. This is proved by induction of a. The base case is N=2² either pi(0) or pi(1) =1 with a maximum congestion of one. Assume P(a) is true and following steps show the switches changing rows to avoid congestion. If two packets must pass through different subnetworks then there is an edge between them. The packet destined for output 0 (pi(6)=0) and the packet for output 4 (pi(2)=4) cannot pass through the same subnetwork. By dividing into subnetworks to show that inputs can be routed through different subnets so that they only have congestion of one, this shows the larger network can be handled in the same way.

Lecture 10 – Graph Theory 3

An Euler tour is a walk that traverses every edge exactly once and is a tour, so that it starts and finishes at the same vertex. It’s easy to characterize undirected graphs with Euler tours. The first theorem of the lecture states that a connected graph has an Euler tour if and only if every vertex has an even degree. If and only if statements need to be proved in both directions, so first we need to show that if G=(V,E) has an Euler tour, you can walk from v0 -> v1 ->…-> vk-1 -> vk -> v0, since every edge in E would be traversed once. Every vertex is entered and left, so if you see a vertex that has at least two edges then it should always have an even number of edges. All edges should be different since you’ll never leave or enter a node the same way and the degree(u) is the number of times it appears in the tour times two. For the other case, assume that for G=(V,E), assume degree(v) is even for all vertex in V. Let W: v₀ – v₁ – … – v_k be the longest walk that traverses no edge more than once. If an edge is not covered in the walk, the walk can be lengthened, so v_k-u not in W means a longer walk would be v₀-v₁…-v_k-u and is a contradiction as all edges incident to v_k are used in W. We want to show W will be an Euler tour and that v_k = v₀ (start and end nodes), otherwise vk has odd degree in W which by the first proof is a contradiction as it can’t be odd. Suppose that W is not a Euler tour. G is connected, so there is an edge that is not used in W, but it is incident to some vertex that is used in W. Let u – v_i be this edge: u-v_i-v_i+1– …-v_k = v₀ – v₁-v_i. This is a longer walk, which is a contradiction.

Similar to a Euler tour, a hamiltonian path is a path where every vertex is used exactly once. A directed graph or digraph is a graph where every edge has a direction and two edges are needed to go two ways. The starting point is the tail and it goes to the head on the other side of the edge. The indegree of a vertex, vi is the number of incoming edges and its outdegree is the number of outgoing edges.

For the second theorem, let G=(V,E) be an n-node graph with V={v₁…v_n}, let A= {a_ij} denote the adjacency matrix for G. That is a_ij= 1 if v_i->v_j is an edge, otherwise a_ij = 0. Easily computed by taking powers of the adjacency matrix. Let P_ij^(k) = # of directed walks of length k that go from v_i to v_j. Then A^k={P_ij^(k)}. Matrix A is (111,001,010) with (v₁,v₂,v₃) for columns and rows. A² would be (122,010,001), A³ would be (133,001,010). Proving this theorem by induction, a_ij^(k) denotes the (i, j)th entry in A^k and the goal is to show that it’s equal to P_ij^(k) for all entries. P(k) means the theorem is true for k, or P(k) = “∀_i,j a_ij^(k) = P_ij^(k). The base case is k=1, as edge v_i -> v_j: P_ij⁽¹⁾ = 1 = a_ij⁽¹⁾. If there’s no edge then P_ij⁽¹⁾ = 0 = a_ij⁽¹⁾. Assume P(k) to show P_ij^(k+1) = sum for h (such that v_h->v_j is edge G) of p_ih^(k) = sum h=1 to n of P_ih^(k)*a_hj and use the induction step to get sum h=1 to n of a_ih^(k)*a_hj= a_ij^(k+1).

A digraph G=(V,E) is strongly connected if for all u,v in vertices set. There is a directed path from u to v in G. A directed graph is called a directed acyclic graph (DAG) if it doesn’t contain any directed cycles. A tournament graph is a type of digraph where every vertex represents a team and a team can beat or lose to another team, with the results shown by edges. An edge exists between each vertex and either team u beats team v and there’s a directed edge from u -> v or v wins and the edge is v <- u. For example. A directed Hamiltonian path is a directed walk that visits every vertex exactly once.

With these new definitions, the lecture’s third theorem states that every tournament graph contains a directed Hamiltonian path. This is proved by induction on n nodes: P(n) = “Every tournament graph on n nodes contains a directed Hamiltonian path.” In the base case, n is 1 and in the inductive step assume P(n) to prove P(n+1). Take out one node v, to show that there’s still an edge between the other nodes, giving a tournament graph. Then show that you can create a graph including the removed v that creates the directed Hamiltonian path. In the case where v->v₁, there’s no problem. In the other case where v₁ -> v, find the smallest i so v->v_i, v₁->…->v_i-1->v_i->…->v_n. If v->v_i-1, then there’s a contradiction, so v_i-1 beats v. You can also use the largest i so v_i->v.

An example of a tournament graph is a chicken tournament where either chicken u pecks v or v pecks u. In addition, chicken u can virtually peck v if u->u or there’s a chicken w where u->w->v. A chicken that virtually pecks every other chicken is king and there can be multiple king chickens. The lecture’s last theorem states that the chicken with the highest outdegree is a king. This is proved by contradiction letting u have the highest outdegree without being king. This would mean there must be a v such that v -> u and for all w, w -> u or v -> w. If u -> w, then v -> w, meaning that outdegree (v) ≥ outdegree(u) + 1 which is a contradiction.

Lecture 11 – Relations, Partial Orders, and Scheduling

A relation from a set A to a set B is a subset R<= AxB. For example, R={(a,b): where student a is taking class b}. This can be written as (a,b)∊R = aRb or a~_Rb. A relation on A is a subset R≤AxA, for instance A=2: xRy iff x≅y(mod 5) or N: xRy iff x|y. A relation is nothing more than a directed graph. Set A together with R to get G=(V,E), where V=A, E=R. A relation R on A is reflexive if xRx for all x in A and is considered symmetric if xRy and yRx for all x,y in A. If xRy and yRx, then x=y means a graph is antisymmetric. It’s not possible to have a symmetric and antisymmetric graph at the same time. The last property of a relation states that a graph is transitive if xRy and yRz -> xRz. A few examples of these properties are that x≅ y(mod 5) is reflexive, symmetric, and transitive, x|y is reflexive, antisymmetric, and transitive and x≤ y is reflexive, antisymmetric, and transitive. An equivalence relation is reflexive, symmetric and transitive. Equality is an example itself (“=”), and another example is x≅y(mod n). The equivalence class of x in set A in set of all elements in A related to x by R and is denoted [x]. For instance, [x]={y:xRy}.

A partition of A is a collection of disjoint, non empty sets A₁…A_n, <=A, whose union is A, for instance, {…-5, 0, 10,…} or {…-4, 1 6, 11, …) or {…,-1, 4, 9, 14…}. The lectures first theorem states that the equivalence class of an equivalence relation on a set A form a partition of A. A relation is a (weak) partial order if it is reflexive, antisymmetric and transitive. (A, ⊑) is called a partially order set or poset which is a directed graph with vertex set A and edge set, denoted with ⊑. A Hasse diagram represents a partial order and an example is shown denoting what order people can put their clothes on in. A Hasse diagram for a poset (A,⊑) is a directed graph with vertex set A and edge set ⊑ minus all self-loops and edges implied by transitivity. The second theorem states that a poset has no directed cycles other than self-loops. This is proved by contradiction by assuming there exist at least two n, so distinct elements, a₁…a_n, so they form a cycle. Take smaller sets and use induction to show that a₁ and a_n are related. After deleting the self loops, we get a directed acyclic graph.

A and b are considered incomparable if neither a⊑b or b⊑a and comparable otherwise. A total order is a partial order in which every pair of elements is comparable and forms a straight line. A total order consistent with a partial order is called a topological sort. A topological sort of a poset(A, ⊑) is a total order (A,⊑_T) such that ⊑S⊑_T. In other words, if x is related to y, then x is related to y in the total order. The lecture’s third theorem states that every finite poset has a topological sort. x ∊ A is minimal if there is no y in A, which is different from x such that y is smaller than x. x∊A is maximal if the same but x is smaller than y. Not every poset has a minimal element, but every finite poset does which is our first lemma. A chain is a sequence of distinct elements such that a₁≤a₂≤a₃… and the lemma can be proved by letting a₁⊑a₂⊑…⊑a_n be a max length chain (existing due to finite elements). If you consider other elements, there are two cases. First, the element may not be in the chain and its not possible to get a longer chain if a is less than a_i. Otherwise, a is an element of one in the chain. If it’s less, then there is a cycle, so there’s a contradiction. It’s not true that a ≤ a₁, so a₁ is minimal.

Lecture 12 – Sums

An annuity is a financial payment with a value associated with it and a common problem is asking if you are getting enough value for what you’ll pay back. For instance, an n-year $m-payment annuity pays m dollars at the start of each year for n years. These problems have real life implications as how these were packaged and sold led to the mortgage disaster. It’s important to ask how much the package is actually worth, since usually there is a fixed interest rate and devaluation of the value based on inflation noted P. For instance 1 dollar today is $(1+P) in a year, $(1+P)² in two years, $(1+P)³ in 3 years and so on. Similarly, 1/(1+P) turns into one dollar in a year, 1/(1+P)² in 2 years, etc. You can compute the value of payments by year this way, potentially making payments in the future more valuable. V=sum i=0 to n-1 of m/(i+p)ⁱ and is the total current value or what you should pay today for the annuity. Putting the sum in a more familiar formula, x is (1+p)ⁱ, m sum i=0 to n-1 of xⁱ = m(1-xⁿ)/(1-x).

The first theorem of the lecture states that ∀n ≥1, x!= 1, sum i=0 to n-1 of xⁱ = (1-xⁿ)/(1-x). Induction proves this is the right answer, but how do you figure out the answer in the first place. One method is the perturbation method where you set S=1+x+x²+…+x^n-1, multiply the original by S and subtract from itself with most values in the series cancelling to give (1-x)S = 1-xⁿ, s=xⁱ in this case. V=m(1-xⁿ)/(1-x) = m(1-(1/1+P)ⁿ/(1-1/(1+p)))=m((1+P-(1/(1+P)^n-1)/P). With m=50k, n=20, and p=.06, the value will need to be $607,906 and the bigger p gets the less the payment is worth.

Given the choice between 50,000 a year for 20 years or 1,000,000 today, which should you choose? What about 500,000 or 700,000 today? A claim to help with this is that if n=inf, then V=m((1+P)/P). This is proven by lim n-> inf 1/(1+P)^n-1 -> 0. For m=$50k, p=.06, and V=$883,333. Based on this, you’re better off taking a million dollars today than 50k a year forever even without considering that you could put that money into to make more later. A corollary is if |x|<1, sum i=0 to infinity of xⁱ=1/(1-x) as the limit of n->inf of xⁿ=0. For instance, 1 + .5 + .25 + …=2 eventually.

Geometric series generally follow the above 1/(1-x) formula and go to some term. Another form of them is sum i=1 to n of ixⁱ=x+2x²+3x³…+nxⁿ. With perturbation, (1-x)S = (1-x^n-1)/(1-x) – 1 – nxⁿ⁺¹ => S=(x-(n+1)xⁿ⁺¹+nxⁿ⁺²)/(1-x)². Another way to calculate these sums is with the derivative method. Start with the geometric series and take the derivative. For x!=1, sum i=0 to n of xⁱ=(1-xⁿ⁺¹)/(1-x) => sum i=0 to n of x^i-1=(1-(n+1)xⁿ+nxⁿ⁺¹)/(1-x)2. Then multiply by x to get back to the previous form. Sums can be treated as polynomials and manipulated as such. You can also to a version similar to the above manipulation where you take integrals to get an i in the denominator though this is briefly touched on and not shown in detail.

The second theorem for the lecture states if |x|<1,then sum i=1 to inf of ixxⁱ=x/(1-x)x². This is useful if every year you’re trying to get the value of a company growing yearly by a fixed amount m. Annuity pays $im at the end of year i forever. In this case, m(1+p)/px² is a simple formula to tell whether to buy the company. Sum i=1 to inf of im1/(1+P)xⁱ shows that even a company paying more and more yearly still has a finite value. Payments increase linearly while a geometric decrease wipes out their value in the future.

Any time you sum powers, the answer is a polynomial to one higher degree. Integration bounds can be found for sum i=1 to n of f(i) when f is a positively increasing function. Drawing a graph of a function for each value from f(i) to f(n) shows that the sum of i=1 to n of f(i) is at least f(1) + integral 1 to n of f(x)dx and at most same sum of f(n)+integral 1 to n of f(x)dx. The sum is at least the area of the first rectangle (f(1)) + the area under the curve, at most f(n) + area under curve. This lecture is the first to mention integral and seems to be the first lecture to use calculus in this course. No one knows the exact way to calculate the To close bounds on the sum of closed form of sum i=1 to n of sqrt(1) but integrals can help estimate closed bounds as best as possible. For instance, integral i to n of sqrt(x)dx = x^3/2/(3/2) evaluated at 1 and n = ⅔(n^3/2-1). Bounds can be computed with f(1) plus integral and f(n) plus the integral. f(1)+⅔ (n^3/2-1)≤ sum i=1 to n of sqrt(i) ≤ f(n)+⅔ (n^3/2-1) shows sum i=1 to n of sqrt(i) is between ⅔ n ^3/2+ ⅓ and ⅔ n^3/2+ sqrt(n) – ⅔. For instance, evaluated at 100, the function gives an answer between 667 and 676 which helps narrow it down. Sum i= 1 to n of sqrt(1) = ⅔ n^3/2 + S(n) where ⅓ ≤ S(n) ≤ sqrt(n)-⅔ and the value compared to the answer is decreasing as sumi=1 to n of sqrt(i) ~ ⅔ n^3/2. In the above example, g(x)~h(x) means that the limit as x goes to infinity of g(x)/h(x) is 1.

When f is a decreasing positive function, integration bounds are calculated in a similar way with the values flipped. For instance, for the function sum i=1 to n of 1/sqrt(i), the upper bound is sum i=1 to n of f(i) <= f(1) + integral 1 to n of f(x)dx and the lower bound is calculated with sum i=1 to n of f(i) ≥ f(n) + integral 1 to n of f(x)dx. The same formulas are used from increasing functions but positions are switched. This becomes 2 sqrt(n) -2 < sum i=1 to n of 1/sqrt(i) ≤ 2 sqrt(n) – 1after limits are applied and sum i=1 to n of 1/sqrt(i) ~ 2 sqrt(n).

Lecture 13 – Sums and Aysmptotics

This lecture uses the previous lecture’s methods of finding sums to solve a variety of problems. Starting off, a block stacking problem similar to the one from Single Variable Calculus is shown where you stack offset blocks to try to get a block fully off of the table. If you stack high enough, you can go out from table to wall, but would require the stack getting unrealistically high. A greedy algorithm for solving this problem is to get top block out as far as it goes, then the top two, then three, etc. Given n blocks of length 1, r_i=amount by which ith block extends beyond the table. The stability constraint, the center of mass c_k must lie on the (k+1)st block where the table is block n+1. For greedy stacking, c^k=r_k+1. You can use a recursive strategy to compute c_k . The center of mass of the kth block is at r_k -½ and the center of mass of the top k blocks is calculated with C_k =((k=1)C_k-1 + 1(rk-½))/k => r_k+1 = ((k-1)r_k +r_k -½)/k = r_k -r_k+1 =1/2k which shows how much further the kth block sticks out over the k+1th block. This ends with the formula sum i=1 to n of 1/2i => r¹=½ sum i=1 to n of 1/i, a sum that comes up often enough to have its own name.

The Harmonic Sum and nth harmonic number H_n are calculated with sum i=1 to n of 1/i. H1=1, 2= 3/2, 3 = 11/6, 4=25/12 and so on. With 4 blocks, a full block hands out beyond the table. At the 1 millionth harmonic number, 7 blocks are off the table, though that requires a stack 1 million high. Harmonics numbers can grow to infinity and a closed form expression isn’t believed to exist. You can use integration bounds for a good estimation for this decreasing sum, f(n)+ integral 1 to n of f(x)dx ≤ Sum i=1 to n of f(i) ≤ f(1) + integral 1 to n of f(x)dx. Computing the integral gives ln(n), so 1/n + ln(n) ≤ H_n ≤1 + ln(n). h_n~ln(n), with extra work, H_n=ln(n) + δ + 1/2n + 1/12n2+E(n)/120n⁴ where for all n, 0<E(n)<1. δ is Euler’s constant and = 0.5772… This isn’t proved but the greedy strategy appears to be optimal in this case. You can get much further out if you’re able to use multiple blocks on a single level.

Next, the lecture looks at n! Which is the product of i=1 to n of i. ln(n!) = ln(1*2*3*…*n) which is sum i=1 to n of ln(i). Products look messy, but you can convert to sums in order to calculate better and can estimate using integration bounds for an increasing function, ex. f(1) + integral 1 to n of f(x)dx <= sum i=1 to n of f(n) <= f(n)+ integral 1 to n of f(x)dx or nⁿ/e^n-1<= n! <= nⁿ⁺¹/e^n-1. Striling’s formula states that n! =(n/e)ⁿsqrt(2pi n) e^E(n) where E(n) is between 1/12n+1 and 1/12n. Bounds of n! Have been studied and it has tight bounds. N! ~ (n/e)ⁿsqrt(2pi n) which is lower bound and very close to the right answer, though begs the question “how’s pi related to n factorial?”

Asymptotic notation and time complexity are touched on in this lecture, with its three different forms and meanings. Oh, big-oh f(x) is shown as O(g(x)) if lim x-> inf of |f(x)/g(x)| < inf. The limit of the absolute value in this case is finite and can’t diverge. In this case, f(x) grows the same rate or slower than g, since otherwise the limit would be infinity. Big Oh is also written f(x)<=O(g(x)) or f(x) is O(g(x)) or f(x) is a member of O(g(x)). Some examples of this state that if f(x)=x and g(x)=x², then f(x) = O(g(x)). This is proven since the limit of x-> inf of |x/x²|=0 < inf. Similarly, x²!=O(x) as lim x-> inf x²/x = inf. For big O, ignore constant factors since its important to analyze algorithms without taking into consideration machine power and judge the algorithm on its own. For example, “the time to multiply n*n matrices is T(n) which is O(n³)” could be used to describe a function. Any polynomial grows slower than any exponential. L’Hospital’s rule helps with these comparisons. Another example was that 10 was O(1) and that a constant function is O(a constant function). Hn = ln(n) + δ + O(1/n) means that error terms grow no faster than 1/n. This is sometimes written as Hn-ln(n)~δ.

O can’t be used as a lower bound, so f(x)≥O(g(x)) is meaningless. Instead, omega should be used such as f(x)=&F;(g(x)) if lim x -> 0 of |f(x)/g(x)| > 0. A theorem states that f(x) = O(g(x)) iff g(x) = &F;(f(x)). T(n)=&F;(n²) says algorithm is at least quadratic. Theta or F4; means both limits hold true from O and &F;. F(x) = F4;(g(x)) if lim x-> inf of |f(x)/g(x)| >0 and < inf. The last theorem for the lecture states that f(x) = F4;(g(x)) iff f(x) = O(g(x) and f(x) = &F;(g(x). For instance, 10x³-20x+1 = F4;(x³) has a limit of 0, so its not F4;. T(n)=F4;(n²) means T grows quadratically in n. In summary, O means ≤, &F; means ≥, and F4; means =, up to constant factors. A couple other terms shown are o (“little oh”) which means < and ω (“little omega”), which means >. For instance, f(x) = o(g(x)) if lim x-> inf |f(x)/g(x)|=0 and f(x) = ω(g(x)) if lim x-> inf |f(x)/g(x)|=inf. As a note of warning, never use asymptotic notation in induction hypothesis and predicates as you can prove all sorts of false things.

Lecture 14 – Divide and Conquer recurrence

Towers of Hanoi is another problem from Calculus that is reintroduced in Mathematics for Computer Science. The setup for the problem is you have 3 pegs with n disks placed in a stack on one of the pegs. The goal is to recreate the stack of disks on another peg in the minimum number of moves while making sure not to place a larger disk on a smaller disk. Defining T_n as the minimum number of moves to solve a problem with n disk, the goal is to find an algorithm to solve this. For T₁, the single disk can be moved with one move. For T₂, this requires 3 moves, one to move the top disk to a stack, one to move the bottom disk to the other empty stack as it’s bigger than the first disk moved, and a last moved to put the smaller disk back on the larger disk. Similarly, the best example found for T₃ is 7. One solution is to take a problem with n disks and solve it recursively with the following steps. First, take top n-1 and move to another stack with T(n-1) steps.Second, move the largest disk with 1 step. Lastly, move the top n-1 disks back onto largest in T(n-1) steps. Based on this, the number of moves should be T_n≤2T_n-1+ 1 which holds true for T₃ at 7 and is best we can do. The lower bound is T_n≥2T_n-1+1 as well because when the nth disk moved, everything on top of it had to move to another peg requiring at least T_n-1 steps before the larger disk moved. Then there’s at least one move for the larger disk and then another at least T_n-1 steps after it moves to get the rest of the disks back on top. This assumption can be verified by induction, in the base case, it works for 1. For the inductive case, you assume T_n=2ⁿ-1 to prove T_n+1=2ⁿ⁺¹-1. T_n+1=2Tⁿ⁺¹ which becomes 2ⁿ⁺¹-1. This can be found by guessing and verifying but another method to figure out patterns when there’s a less clear answer is to use a brute force algorithm and just filling out all needed factors, for instance, Tn = 1+2T_n-1 = 1+2(1+2T_n-2) = … = 1+2+4+…+2^n-2+2^n-1T₁ = 2ⁿ-1.

Next up, is a return to merge sort. Given n numbers, you compare them and put them in order. To sort n items, x₁,x₂,…x_n, where n is a power of two, first sort the first n/2 items and last n/2 items separately, then merge the two sorted lists. To merge, look at the two smallest items from each list and add them to the merged list until 1 list remains. Add that to the end. T(n)= is defined as the number of comparisons used by merge sort to sort n numbers in worst case. The step of merging takes n-1 comparisons in the worst case. Sorting is done recursively by splitting into smaller and smaller lists and takes 2T(n/2) comparisons for recursive sorting so T(n) = 2T(n/2)+n-1. For example, T(1)=0, T(2)=1, T(4)=5, T(8)=17, and T(16)=49. Showing this with a brute force algorithm, t(n)=n-1+2T(n/2) = … = n-1+n-2 +n-4 + …+ n-2^i-1 + 2ⁱT(n/2ⁱ) = sum of i=0 to log n-1 of (n-2ⁱ) = nlogn – (2^{log n}-1). There are n log n – n + 1 comparisons in merge sort which comes out much better than a naive sorting algorithm, so T(n) ~ n log n. With this notation, the Towers of Hanoi algorithm is T(n)~2ⁿ. An example of a linear equation is S(1)=0, S(n) =S(floor(n/2)) + S(ceil(n/2)) + 1 for n≥ 2 which is S(n)~n.

Merge sort was an example of a divide and conquer algorithm as well as the S(n) algorithm and can be solved by breaking into parts. These usually come in a form where you have a bunch of terms that can be cut down by a common factor, such as T(x) = 0 for x<1, 2T(x/2) + 8/9 T(3x/4)+x2² for x≥1 or T(x) = a₁T(b₁x+E₁(x))+a₂T(b₂x+E₂(x))+…+a_nT(ba_nx+E_n(x)) + g(x) for x>x₀ where a_i>0, Oi<1, k is fixed, |E_i(x)| ≤ O(x/log₂x), |g’(x)| ≤ x^c for constant c. Anything looking similar to the above function is likely recursive.

A theorem by two MIT students, Akra and Bazzi proposes the following equation for solving these kinds of recurrences. Set p so sum i=1 to K of a_ib_i^P=1, then T(x)=F4;(x^P+x^P*integral 1 to x of (g(u))/u^P+1)du). An example of with T(x) = 2T(x/2)+x-1: a1=2, b1=½, k=1, using 2(½)^P=1 implying P is 1 to get the following solution, T(x)=F4;(x+x * integral 1 to x of (u-1)/u²)du) = F4;(x lnx). An example of a tougher problem is T(x), 2(½)^P+8/9 * (¾)^P=1, T(x)=F4;(x²+x² * integral 1 to x of u²/u³ du) = F4;(x²ln x). This formula gives you the asymptotic growth as opposed to the actual answer and you need to calculate P to make it work. This can lead to a problem as in the case of T(x) = 3T(x/3) +3T(x/4) + x², where 3(⅓)^P+e(¼)^P=1 as P would be between 1 and 2 and not an integer. Luckly, since you’re computing the asymptotic growth, if you conclude P is less than the exponent, you don’t need to compute it as it will be a small order term, as stated by the theorem, if g(x)=F4;(x^t) for t≥ 0 and sum i=1 to k of a_ib_i^t<1, then T(x)=F4;(g(x)).

Lecture 15 – Linear Recurrences

Divide and conquer breaks into subproblems and are a type of recurrence, while linear recurrence is a different kind of recurrence. To introduce this, fibonacci numbers are introduced into this course for the first time, initially with graduate students and professor jobs as opposed to rabbits. The total number of jobs is m, which is a fixed value as universities aren’t growing. Every year, each professor generates one graduate (new prof). Except for first year professors, who don’t produce anything. Assuming faculty live forever and there are no retirements, when are all m jobs filled. If f(n) is the number of professors during a given year, f(0)=0, f(1)=1, f(2)=1, f(3)=2, f(4) = 3, and f(5)=5. For n≥2, f(n)=f(n-1)+f(n-2). Fibonacci numbers have been used across history by Leonardo Fibonacci for studying rabbit growth, studying music in India in 200 BC, as well as used to analyze Euclidean GCD, and in economics and more.

A recurrence is considered linear if it is of the form f(n) = a₁f(n-1)+a₂f(n-2)+…+a_df(n-d) = sum i=1 to d of a_if(n-i), for fixed a_i, d=order. Compared with divide and conquer, a linear recurrence subtracts a constant from n while divide and conquer divides n in order to break down a problem. One solution is to guess and verify trying, f(n)=αⁿ for constant α. f(n)=f(n-1)+f(n-2) => αⁿ=α^n-1+α^n-2=>α²=α+1 => α²-α-1 = 0 => α = (1+/-sqrt(5))/2. f(n) = α₁ⁿ or α₂ⁿ where α₁=(1+sqrt(5))/2, a₂=(1-sqrt(5))/2. If f(n)=α₁ⁿ and f(n)=α₂ⁿ are solutions to linear recurrence (w/o boundary conditions) then f(n)=c, α₁ⁿ+c₂α₂ⁿ is a solution for any constants c₁ and c₂. This means that f(n)=c₁((1+sqrt(5))/2)ⁿ + c₂((1-sqrt(5))/2)ⁿ is a solution without a boundary condition. Determine the constant factors, f(0)=0=c₁()⁰+c₂()⁰ = c₁+c₂, f(1)=1 =c₁(1+sqrt(5))¹+c₂(1-sqrt(5))¹ => c₁=1/sqrt(5), c₂=-1/sqrt(5). Using these, the solution is f(n)=1/sqrt(5) * ((1+sqrt(5))/2)ⁿ-1/sqrt(5) * ((1+sqrt(5))/2)ⁿ. As n gets large, the first half of the solution gets larger and the subtracted half means less. All m jobs are filled when f(n) ≥ m. When 1/sqrt(5) * ((1+sqrt(5)/2)ⁿ + δ(n) ≥ m = n≥ (log(sqrt(5)(m-δ(n)))/log((1+sqrt(5))/2) = F4;(log m).

In order to solve a general linear recurrence in the form f(n)=sum i=1 to d of a_if(n-i) and f(0) =b₀, f(1)=b₁,…,f(d-1)=b_d-1. Try using the following function, f(n)=αⁿ => α^d – a₁α^d-1 – a₂α^d-2-…-a_d= 0. In the simple case, all d roots are different (ex a₁ through a_d). In this case, the solution is f(n) = c₁α₁ⁿ+c₂α₂ⁿ+…+c_dα_dⁿ and you can solve for the constants, c₁,c₂,…,c_d from f(1)=bi for 0≤i1+c₂+…+c_d=b₀. In the simple case, this will always have a solution. In the more tricky case of repeated routes, use a theorem that states that if α is a root of characteristic equation and is repeated r times, then αⁿ, nαⁿ,n²αⁿ,…,n^r-1αⁿ are all solutions to the recurrence.

Recurrences are the “discrete analog” to differential equations. Suppose there’s a plant that reproduces in the first year of life, one for one, then never again. The plant lives forever. To solve this, let f(n) be the number of plants in year n. f(0)=0, f(1)=1, what is f(n)? f(n)=f(n-1)+(f(n-1)-f(n-2)) = 2 f(n-1)-f(n-2). In α²-2α+1=0, α=1 is a double root, as it occurs twice. After getting, f(n)=c₁(1)ⁿ+c₂n(1)ⁿ = c₁+c₂n, all that’s left is to plug in boundary conditions so f(0)=0=c₁, f(1) = 1 = c₁+c₂=>c₂=1, so f(n)=n. This solution works for all homogeneous equations, that is if f(n)-a₁f(n-1)-…-a_df(n-d) = 0, an equation is homogeneous, otherwise, it’s inhomogeneous.

The general inhomogeneous recurrence is f(n)-a₁f(n-1)-…-a_df(n-d)=g(n) which can be solved with a three step method. First, replace g(n) by 0 and solve the resulting homogeneous recurrence, ignoring boundary conditions. Next, restore g(n) and find any particular solution again, ignoring boundary conditions. Lastly, add homogeneous and particular solutions together and use the boundary conditions to determine constant factors. An example of this solution for f(n)=4f(n-1)+3ⁿ, f(1)=1 takes α-4 = 0, α=4 and finds the homogenous solution, f(n) = c₁4ⁿ. Then, find a particular solution to f(n)-4f(n-1)=3ⁿ, guess a constant times 3ⁿ and plug it in. For example, c3ⁿ-4c3^n-1=3ⁿ, 3c-4c=3 => c=-3 vives the particular solution, f(n)=-3ⁿ⁺¹. Add both to find the general solution, f(n)=c₁4ⁿ-3ⁿ⁺¹. The tricky part to solving these problems is guessing the particular solution. Additionally, if g(n) is exponential, you’ll need to guess an exponential of the same kind. For example if g(n) = 2ⁿ+3ⁿ, guess a2ⁿ+b3ⁿ. If g(n) is polynomial, guess polynomial of same degree (if g(n) = n²-1, guess an²+bn+c). Adding two guesses together can be useful but can go wrong if a polynomial guess fails, such as trying (an+b)2ⁿ after a2ⁿ fails.

Lecture 16 – Counting Rules 1

Probability theory is based on counting and this lecture is about learning a toolkit for counting. First, the lecture introduces sets and their uses and transformations. A set is an unordered collection of distinct elements, for example, {a,b,c}={c,a,b}. The size or cardinality of set S is the number of elements in S: denoted by |S|. Similarly to a set, a sequence is an ordered collection of elements (components or terms) which are not necessarily distinct. A permutation of a set, S, is a sequence that contains every element of S exactly once, with the number of possible permutations for a set with n elements being n!. Connecting sets of inputs and outputs, a function f:x->y is a relation between x and y such that every element of x is related to exactly one element of y. For this, x is called the domain, and y is the range. A function f:x->y is subjective if every element of y is mapped to at least once, injective if every element of y is mapped to at most once, and bijective if every element of y is mapped to exactly once (meaning it’s both injective and subjective).

For an example permutation, let (a₁…a_n) be a permutation of S={a₁,…a_n}. Define pi(a_i)=i.e., a in set S is mapped to i iff a is in the i-th term in the permutation. In this case, pi is bijective. A rule called the mapping rule states that if f:x->y is subjective then |x|≥|y|, if f:x->y is injective then |x| is at most |y| and if f:x->y is subjective then |x|=|y|. This last rule is also called the bijection rule.

A problem introduced here finds x which is the number of ways you can select 12 donuts from 5 varieties. In the example, there are 2 chocolate, 0 lemon, 6 sugar, 2 glazed, and 2 plain donuts which can be represented in a sequence of 0’s and 1’s with 0’s being donuts and 1’s dividing them into categories. The above sequence is represented as “00110000001001001” which is 2 0’s for chocolate, 1 switches to lemon which has zero, so 1 switches to 6 sugar donuts represented as 000000, and so on. Y is a set of all of the 16 bit sequences with exactly 4 ones. Another example of a bijection from subsets X={1…n} to n-bit sequences. S by function f ->(b₁,b₂…b_n) via bi={ i if i in S, 0 if i is not in S. 2ⁿ is the number of n bit sequences and the number of subsets of an n-element set.

The generalized pigeonhole principle is introduced and states that if |x|>k|y| then for all f:x->y, there exists k+1 different elements of x mapped to the same elements in y. For example if more than n pigeons in set x fly into n holes (y), then at least two pigeons will fly into the same hole. This is a famous rule in counting. Another example of this is the number of not-bald people in Boston. If Boston has about 500,000 non bald people (x), then there are 3 people in Boston with the same number of hairs on their head. Where the number of hairs on head (y) is at most 200,000, meaning |x|>2|y|. Another example of this is to pick 10 arbitrary double digit numbers, such as “21, 71, 14, 31, 25, 60, 92, 80, 29, 91.” You can find two subsets with an equal sum by mapping double digit numbers to the sum of their numbers. X is a collection of subsets of numbers and y is the set of all possible sums. |x|>|y| in this case is a non constructive proof that doesn’t give a specific example of the problem being true.

A “k-to-1” function if a function f:x->y that maps exactly k elements of x to every element of y. The division rule states that if f is k-to-1, then |x|=k|y|. The division rule generalizes the bijection rule because a function is a bijection if and only if its 1 to 1. For example, take the question, “how many ways to place two identical rooks on a chess board in such a way that no row/column is shared?” (r₁,c₁,r₂,c₂) is a way of mapping this positioning. Let y represent the set of valid rook configurations and x represent sequences (r₁,c₁,r₂,c₂) such that r₁!= r₂, c₁!=c₂. Then, every valid rook configuration is mapped to exactly once, so rook 1 and 2 can be switched and it will look the same. Every valid configuration is therefore mapped to two times. This means that f is 2-to-1, for example, |y|=|x|/2 = (8*7)²/2. For r₁ and c₁, there are 8 choices, for r₂,c₂ there are seven.

The generalized product rule states, let S be a set of length k sequences. If there are n1 possible first entries and n₂ possible second entries for each first entries, n₃ possible third entries for each combination of 1st/2nd and n_k possible k-th entries for each previous combination, then |S| = n₁*n₂*n₃*…*n_k. For example, a committee (x,y,z) selected from n members with n ways to choose x, n-1 to choose y, n-2 to choose z. Another example finds defective dollars, where some digit appears more than once in 8-bit serial number. The fraction of non-defective bills is the number of non-defective serial numbers divided by the total number of serial numbers, which is x/y. y = 10⁸, x = 10*9*8*7*6*5*4*3 = 10!/2!, so x/y is 1.8144 percent.

A₁*A₂*…*A_n is the same as {(a₁,…a_n)(a₁ selected from set A₁, a₂ from A₂…a_n from A_n}. So, by the product rule: |A₁*A₂*…*A_n|=|A₁||A₂|…|A_n|, which can be useful for counting all n-bit sequences. With the sum rule, if A₁ up to A_n are disjoint sets, then |A₁u…uA_n| = |A₁|+|A₂|+…+|A_n|. For instance, let’s say passwords are 6 to 8 symbols. If the first symbol is a letter (upper or lower case) and the rest are letters or digits. The first letter can be one of 52 elements, F and the remaining can be 1 of 62 elements S. So if P is the set of possible passwords, P=(FxSxSxSxSxS)u(FxS⁶)u(FxS⁷) and |P|=|F||S|⁵+|F||S|⁶+|F||S|⁷.

Lecture 17 – Counting Rules II

Using inclusion/exclusion techniques, you can exactly compute cardinality of union of sets. For instance, take a Venn Diagram with M and E circles, M\E is M excluding everything in E and |M| is |M\E|+|M⋂E| via the sum rule. |E|= |E\M|+|E⋂M| and |MuE|=|M\E|+|M⋂E|+|E\M|. In these examples, the symbol ⋂ means the intersection of sets. If you were to just add |M|+|E|, then the intersection would be counted twice, so |MuE|=|M|+|E|-|E⋂M| is the correct formula. Adding a third circle, S parts between two circles are counted 2 times and the intersection of the 3 is counted 3 times, requiring a more complicated function, |MuEuS|=|M|+|E|+|S|-|E⋂M|-|M⋂S|-|E⋂S|+|M⋂E⋂S|. Another example of summing unions is |A₁uA₂u…uA_n|=sum i=1 to n of |A_i– sum of every pair of sets from 1 to n of |A_i1⋂A_i2| =sum k=1 to n of (-1)^k+1 times the sum of all subsets in {1…n} such that |S|=k of |intersection for all i in S of A_i.

The next problem asks how many permutations do you have of {0,1,…9} have consecutive 42, 04, or 60? For example, (7,2,5,6,0,4,…} has 04 and 60. To represent this, P₄₂ is the set of permutations with 42, P₀₄ is the same for 04 and P₆₀ is the same for 60. The trick is to find the size of P₆₀ where P₆₀-> permutations of sets that have 60 as a single variable with the rest of the set. This is a bijection, so |P₆₀| = 9!. P₄₂⋂P₆₀ represents the permutations with 42 and 60 so |P₄₂⋂P₆₀|=8!. Continue to compute other intersections. Treat the 60 and 04 intersections as 604 in the set, which is still 8! P₄₂⋂P₀₄ can be treated as 042 in the larger set (for example, {042,1,3,5,6,7,8,9}). Look at the intersection of all 3 to find cardinality. Represent 6042 in the set which will now have 7! elements.

Next up, let’s take a look at the bookkeeper rule which states that if you have distinct copies of letters l₁,l₂,l₃ …l_k, the number of sequences with n₁ copies of l₁ and n₂ copies of l₂ up to n_k copies of l_k, and can be written (n₁+n₂+…+n_k)!/(n₁!*n₂!*…*n_k!). This is the definition of the multinomial coefficient which can be written (n₁+…+n_k choose n₁,n₂,…n_k). If k==2, binomial coefficient. (n choose k) = (n choose k, n-k). For example, the number of bit sequences of length 16 and 4 ones is (16 choose 4) = 16!/(4!*12!). According to the subset rule, the number of k element subsets of an n element set (n choose k).

Next the lecture talks about the Binomial Theorem, which is ∀_n (a+b)ⁿ=sum k=0 to n of (n choose k) a^nkb^k. For example, n=2: a²+ab+ba+b² = a²+2ab+b² and with n=3: a³+a²b+aba+ba²+ab²+bab+b²a+b³ = a³+3a²b+3b²a+b³. The number of terms that have k*a,(n-k)*b is the number of length-n sequences with k a’s, (n-k) b’s which is (n choose k). Using the binomial theorem, we can calculate the probability of different poker hands. Starting with a 52 card deck, each card is represented by one of 4 suits, H,S,C, or D and a value from 2 to 10 or J, Q, K or A. This means, each card has one of 13 possible values and one of 4 suits. A hand is a subset of 5 cards, with the number of possible hands being (52 choose 5) or 2,598,960. One high powered kind of hand is a 4-of-a-kind, which means there are 4 cards in the hand that have the same value, for instance, {8S,9D,8H,8C,8D}. Representation is the key to solving these problems with the hand first represented with 3 variables. First, the value of the 4 cards that are the same has 13 possible choices, then the value of the extra card and the suit of the extra card. So a four of a kind could be represented (8,9,D), meaning each 8 in the deck, and a 9 of diamonds for the last card. The number of sequences that can make this is 13*12*4 = 624 (via the general product product rule).

Another popular poker hand is the full house, consisting of 3 cards of one value and 2 of another, for example, {2C,2S,2D,JC,JD}. One possible representation of this is (2, {C,S,D},J,{C,D}) with 4 values representing the hand. First, the value of the triple (13 choices), then the suits of the triple (4 choose 3 which is 4 choices), third, the value of the pair with 12 choices and lastly the suits of the pair with (4 choose 2) 6 choices. By the general product rule, the number of full house hands is 3744, which is a lot more likely to occur that a 4 of a kind. Hands with 2 pairs can be represented with 6 variables, The value of the first pair with 13 choices, the suits of the first pair with 6 (4 choose 2) choices, the value of the second pair with 12 choices, the suits of the second pair which is 6 (4 choose 2), the value of the last card with 11 choices and the suit of the last card with 4 choices. When applying the general product rule here, we fall into a trap as the pairs can be swapped to create the same hand. This means there’s a 2-to-1 mapping and by the division rule, the solution from the general product rule (13*6*12*6*4=22464) needs to be cut in half to get 11232 possible hands. With examples like this, it’s important to check if there’s really a bijection. Lastly, a hand with every suit can be represented with the values of each suit in order D,C,H,S which is 13⁴, the suit of the extra card (4 choices) and the value of the extra card (12 choices). This is also not a bijection as the 5th card can be swapped with the other card with the same suit requiring the division rule again.

Combinatorial proofs are introduced here with an example being taking n shirts, keeping k and getting rid of n-k. The amount of keepers is (n choose k) and the number trashed is (n choose n-k). When choosing k elements out of n elements or (n choose k), you can also choose the number of combinations with element x=(n-1 choose k-1). Without this, (n-1 choose k) is n choose k when added together. This is known as Pascal’s identity. To solve this, define set S, show |S|=n (potentially by counting) and show |S|=m to conclude that n=m. The last theorem of the lecture states that the sum of r=0 to n of (n choose r) times (2n choose n-r) = (3n choose n). This is proved by setting S to the subsets of n balls chosen from a basket of n red balls and 2n green balls. |S|=(3n choose n) which is the number of n-element subsets from a 3n element subset. So the number of subsets with r red balls is (n choose r)(2n choose n-r) by sum rule, |S|=sum r=0 to n of (n choose r)(2n choose n-r).

Lecture 18 – Probability Introduction

This lecture starts with the Monty Hall problem where contestants pick a prize behind one of three doors with one having a grand prize behind it. They are then shown a door with a bad prize they didn’t choose and asked if they want to switch to the other. The question is whether or not they should switch. While it may seem to be a 50/50 shot either way, MIT and the professor say that math shows that it’s always correct to switch. To start defining the problem, the sample space for an experiment or probability game is the set of all possible outcomes. An outcome (also known as a sample point) consists of all the information about the experiment including the values of all random choices. An outcome of the Monty Hall game when the contestant switches consists of the box with the prize, the box chosen first and the box revealed. So, a sample point (2,1,3) is the outcome where the prize is in 2, player chose 1 and 3 is revealed, and switching wins in this case. (1,2,1) is not a valid sample point as the box with the prize can’t be revealed.

The sample space is represented as a tree with each of the 3 values being a new layer of the tree. The layer for the box with the prize has 3 branches, each branch has 3 branches for each of the potential player choices, and each of those has 2 branches for the revealed box. Labelling every outcome a win or loss gives 6 wins and loses so it seems to show that it’s 50/50, but what’s missing is the likelihood of each outcome, requiring a probability space. A probability space consists of a sample space and a probability function. “Pr”: S->R, such that 1, ∀w∊S, 0≤P(w)≤1 and sum w∊S of Pr(w)=1. So for every outcome in the sample space, the probability must be between 0 and 1 and the sum of all of the probabilities must be 1. ∀w∊S, Pr(w)=probability that w is the outcome.

To get the right probabilities, some assumptions must be made. First, the prize is in each box with probability ⅓. Second, no matter where the prize is, the player picks each box with probability ⅓. Lastly, no matter where the prize is, if there’s a choice of box to reveal, each box is picked with probability ½. This is important as if there was a strategy to the reveal the player could gain information which matters when there are 4 or more boxes. Each branch is labeled with its probability. The first 2 levels of branches have probabilities of ⅓ while the 3rd level either has ½ probability where there’s a choice and 1 otherwise. The probability of a sample point is the product of the probabilities of the probabilities on the path in the tree leading to the sample point. (1,1,2) is 1/18, but (1,2,3) is 1/9, meaning some outcomes are more likely to occur. Based on adding up these probabilities, the chances of winning are ⅔ if you go into the game with the intention of switching from the start. An event is a subset of the sample space so the probability that an event occurs is the sum of w∊E of Pr(w). Ex: Pr(EL)=⅓. I wasn’t 100 percent convinced but this starts making some sense after seeing branchingpaths.

The next game shown is a dice game where there are 3 dice, A=(2,6,7), B=(1,5,9), C=(3,4,8). Each player picks and rolls a die and the winner gets a dollar from the loser. Probability of a roll is ⅓. A sample space S is uniform if every sample point has the same Pr(w) = 1/|S|. Instead of rolling, the players add up all combinations to find probability. C beats A 5/9 in an example, but no matter what the first player choses, the second player can pick a die that wins. B beats C in 5/9 of the situations and A beats B in 5/9 as well. Probability trees are drawn out here as well and outcomes are added up and compared. If you roll twice and add the results then the previously losing die ends up winning 42/81 of the time. Different numbers of rolls create different results.

Lecture 19 – Conditional Probability

Conditional Probability is the probability of one outcome given another. For instance, Pr(A|B)=Probability of A given B. In Monty Hall, if B was the event prize being in box 1 and A was the vent the contestant chose box 1, then Pr(A|B) would be ½. Also, if Pr(B)!=0, Pr(A|B)= Pr(A⋂B)/Pr(B). As long as it’s possible for B to happen, the probability of A given B is the probability of A and B given B. As a side note, Pr(B|B) is 1 as it’s always true. Using the product rule, Pr(A⋂B)=Pr(B)*Pr(A|B) and the general product rule can be used to show that Pr(A₁⋂A₂⋂…⋂A_n) = Pr(A₁)P_r(A₂|A₁)Pr(A₃|A₁⋂A₂)…Pr(A_n|A₁⋂A₂⋂…⋂A_n-1). For example, in a best 2 out of 3 series, the probability of winning the first game is ½, the probability of winning immediately following a win is ⅔ and following a loss is ⅓. The product rule gives probability of WW = Pr(WW)=Pr(W 1st)Pr(W 2nd|W 1st) = ⅓. Pr(WLW)=Pr(W 1st)Pr(L 2nd|W 1st)Pr(W 3rd|W 1st L 2nd) = 1/18. If A is the event that series is won and B is the event that first game is won, then Pr(A|B)=Pr(A⋂B)/Pr(B) = (⅓+1/18)/(⅓+1/18+1/9) = 7/9.

Apostieri probabilities are the values of Pr(B|A) where B precedes A in time. For instance, the probability of winning the first game given you won the series. The example of Pr(B|A)=Pr(B⋂A)/Pr(A) = 7/9 is the same either way in this case, but is generally different. For example, if circle A is part of circle B, the probability you’re in B given you’re in A is 1 but the same isn’t true the other way around. Another example, given 2 coins with 1 fair (Pr(H)=Pr(T)=½) coin and 1 unfair coin (Pr(H)=1,Pr(T)=0), suppose the coins are shuffled and a randomly returned coin flips to show heads. What is the probability it’s the fair coin. The probability of A given B in this case is ⅓. Saying fair is P and unfair is 1-P, the result of A|B is p/2-p, changing how it’s defined changes the result. The probability of choosing the fair coin is important. What if you instead you see how many times you get k straight heads, (½)^k and 1-(½)^k. B is the event of K straight heads and A is the probability of the fair coin. So, Pr(A|B)=Pr(A⋂B)/Pr(B) = P*2^-k/(p*2^-k)+(1-P) = P/(P+2^k(1-P). As k gets big, the chances of a fair coin get close to zero. Also, if P is plugged in as 1, the probability equation becomes 1.

Next is a medical testing problem where 10% of the population has a disease. If you have the disease, there’s a 10% chance test gives you a false negative. If you don’t have the disease, there’s a 30% chance of a false positive. If a random person tests positive, what’s the probability they have the disease? A is the event that person has disease and B is event person tests positive, so Pr(A|B) = Pr(A⋂B)/Pr(B). Writing these into a tree shows a ¼ chance of having the disease after testing positive which seems low given the small chance of getting the disease. If you just say no but don’t actually test you’re right 90% of the time in this example. The same sort of issue comes up in weather prediction.

Next is the carnival dice game, a player picks a number N from 1 to 6 and rolls 3 fair dice. The player wins if N matches at least one die. You can’t just add up the probabilities of the dice and get a correct result as it’s possible to roll 6 dice without getting the desired result. Since the probabilities are disjoint, inclusion/exclusion needs to be used to solve this. So, Pr(A₁UA₂UA₃)=Pr(A₁)+Pr(A₂)+Pr(A₃)-Pr(A₁⋂A₂)-Pr(A₁⋂A₃)-Pr(A₂⋂A₃)+Pr(A₁⋂A₂⋂A₃) = ⅙+⅙+⅙+-1/36-1/36-1/36+1/216 = .421. If looking the probability of A and B given C, the equation is Pr(AUB|C)=Pr(A|C)+Pr(B|C)-PR(A⋂B|C). An example of misleading mathematics shown is a comparison by a newspaper on American airlines and America West, showing the number of flights and rates for 5 cities. The newspaper concluded American West was better because it had a 90% on time rate vs 87% for AA, though AA had a better on time rate for each individual city. Taking separate percentages made AA look like the obvious choice, adding all up and taking percentage makes AW look better despite looking worse in all cases. The issue was weighting where a lot of AW was out of phoenix with a good rate due to weather impacting the total, even though AA seemed like a better choice.

Lecture 20 – Independence

An event A is independent of an event B if Pr(A|B)=Pr(A) or if Pr(B)=0. In order words, A is independent if knowing B happened doesn’t change the probability of A. For example, flip 2 fair “independent” coins. B is the event 1st is heads = ½ and A is the event where the second coin is heads (½). Then Pr(A|B)=½=Pr(A). Another example is a sample space with two separate event circles B and A. Only one can happen so disjoint does not mean independent as you know the probability of one given the other. A picture where the outer circle has an inner circle and half of outer is A and half of inner is B is also independent. The lecture’s first theorem shows a product rule for independent events, if A is independent of B then Pr(A⋂B)=Pr(A)Pr(B). The proof of this for a first case is Pr(B)=0, then Pr(A⋂B)=0=Pr(A)Pr(B) and for the second is Pr(B)>0, then Pr(A⋂B)=Pr(B)Pr(A|B).

The second theorem states the symmetry of independence, if A is independent of B, then B is independent of A. The next example takes 2 independent fair coins where A is the event results match and B is the event the first coin is heads. Are A and B independent? Pr(A|B)=Pr(2nd coin H) = ½, Pr(A)=Pr(HH)+Pr(TT) =½. Since the probability of A is the same as the probability of A given B, they are independent events. This is misleading because if you don’t have fair coins, they are dependent. For instance, Pr(H)=P, Pr(T)=1-P, Pr(A|B)=P, Pr(A)=P²+(1-P)². A and B are independent iff R(B)=0 (p=0) or P=1-2p+2p² which is only true if p is 0, ½, or 1. So only double sided or fair coins lead to independent results.

An example of independent vs dependent factors in real life came up during OJ’s trial where because 1/10 people match type O blood. ⅕ match rh factor +, ¼ match another marker. Prosecutor argued 1/200 chance of matching. This is true assuming that the events are independent, but if dependent on each other could be as small as a ten percent chance of occurring. Events A₁,A₂…A_n are mutually independent if for all i and for all sets J subsets of [1,n]-{i}. Pr(A_i|⋂_{j in J}A_j)=Pr(A_i) or Pr(⋂_{j in J}A_j)=0. In other words, events are mutually independent if no other event influences event I. An equivalent definition for a product rule form states that A₁,A₂,…A_n are mutually independent if ∀J≤[1,n], Pr(⋂_{j∊JAj)=product (pi^{j∊J Pr(A_j). For any subset of events, the probability of all happening is the product of their individual probabilities. For example, n=3. A₁,A₂,A₃ are mutually independent if Pr(A⋂A₂=Pr(A₁)Pr(A₂), Pr(A₁⋂A₃=Pr(A₁)Pr(A₃), Pr(A₃⋂A₂=Pr(A₃)Pr(A₂) and Pr(A₁⋂A₂⋂A₃)=Pr(A₁)Pr(A₂)Pr(A₃).}}

Another coin example flips three fair, mutually independent coins. A₁ is the event where coin 1 matches coin 2, A₂ is the event coin 2 matches 3 and A₃ is the event 3 matches 1. Are these events mutually independent? Pr(A₁)=Pr(HH)+Pr(TT)=¼+¼=½, Pr(A_i⋂A_j) = Pr(HHH)+Pr(TTT)=⅛+⅛=¼=Pr(A_i)Pr(A_j). This is missing the last case, where Pr(A₁⋂A₂⋂A₃) = ¼. Unfortunately, where separate probabilities would be the same for the previous examples, Pr(A₁)Pr(A₂)Pr(A₃) = ⅛, which isn’t ¼. Because of this, the three events aren’t mutually independent, though all pairs are. Events A₁, A₂…,A_n are considered pairwise independent if ∀i,j (i!=j), A_i and A_j are independent. Pairwise doesn’t imply mutual, though mutual implies pairwise. If you knew blood matching factors were pairwise, the best you could say is 1 in 50. The more independence you assume, the better the bounds can get.

The last problem from this lecture is the birthday problem. Given N birthdays (ex N=365) and M people (ex M=100). What is the probability that 2+ people share a birthday? First a couple assumptions should be made, namely that birthdays are mutually independent and distributed uniformly though neither is true in reality. The sample space for the birthday problem is shown as a tree where the first person has N possible paths and the tree has M levels. S={b₁,b₂,…b_m} where 1<=b_i≤N. |S|=N^M and Pr((b₁,b₂,…b_m)) = (1/N)^M. This is a uniform sample space since all probabilities are the same, though it’s easier to count number of sample spaces where all birthdays are different. N(N-1)(N-2)***(N-M+1) = N!/(N-M)!, so the probability that all birthdays differ can be represented as N!/((N-M)!*N^M). Stirling’s formula says N!~sqrt(2 pi N) *(N/e)^N and is within .1 percent when N is at least 100, given this, the probability all birthdays differ is roughly e^{(N-M+1/2 )ln(N/(N-m)-1}. For example, if N is 365 and M is 100, then the probability all birthdays differ is 3.08*10^-7, which is very small. When do you get to 50/50? Assuming M=o(N^⅔), then Pr(all differ)~e^{-(m^2)/(2N)}, which is ½ if M²=2N ln(2) or M=1.177 *sqrt(N). This square root of N is what’s known as the birthday principle and appears in problems where there’s a decent chance that two items go into the same bin.

Another use of a function that will likely place multiple inputs into the same bin is a hashing function where h:L->S, possibly having to handle collisions. For a definition, x collides with y if h(x)=h(y), but x!=y. The same is true for memory functions and error handling. The birthday principle is formally defined, “if |S|≥100, L’<=L, |L’|≥|2sqrt(|S|), and if values of h on L’ are random (uniform) and mutually independent, then with probability ≥ ½, there’s a collision.” The square root of S is enough to give a collision in the function and a technique known as a birthday attack can be used to get a collision and break cryptography.

Lecture 21 – Random Variables

A random variable Rv is a function Rv:S=>R (sample space to real numbers). For example, if flipping 3 coins, Rv could be the number of heads. Rv(H,T,H)=2 and another variable M would track if they all match by returning 1 if so and 0 otherwise. M in this case is known as an indicator (aka, Bernoulli or characteristic) random variable and is a random variable with range = {0,1}. This indicates which sample point has a certain property or characteristic. A random variable partitions a block, with M representing that they’re all the same or now and Rv representing the number of heads in the example.

{w | Rv(w)=x} represents the event Rv=x. Rv can also be written just as R. Pr(R=x) is the sum of all outcomes w for which R(w)=x of Pr(w). For example, Pr(R=2) =Pr(HHT,HTH, THH) which is ⅜ and Pr(M=1) = Pr(HHH,TTT) which is 2/8. For a subset A of the set of real numbers, Pr(R in set A) = sum a in A of Pr(R=a).

Two random variables R₁ and R₂ are independent if ∀x₁,x₂ ∊ R (real numbers) Pr(R₁=x₁|R₂=x₂) = Pr(R₁=x₁) or Pr(R₂=X₂)=0. An equivalent definition of this is written ∀x₁,x₂ ∊ R, Pr(R₁=x₁⋂R₂=x₂) = Pr(R₁=x₁)Pr(R₂=x₂). For example, Pr(R=2⋂M=1) =0 != Pr(R=2)Pr(m=1). 0 is not ⅜ * ¼, so R and M are not independent. Another example uses 2 fair and independent 6-sided dice, D₁ and D₂. If S=D₁+D₂, and T= 1 if S=7 and 0 if not, S and D₁ would be dependent on each other. Pr(S=12⋂D₁) = 0 != Pr(S=12)*Pr(D₁=1). Oddly, in the case of S, Pr(T=1|D₁=1) = ⅙ = Pr(T=1), and the same is true given other numbers. 7 is a unique number where for every D₁ there is exactly D₂ that adds to 7.

R₁, R₂, …,R_n are mutually independent if ∀x₁,x₂,…,x_n∊ R = Pr(R₁=x₁⋂R₂=x₂⋂…⋂R_n=x_n) = Pr(R₁=x₁)…Pr(R_n=x_n). Given rv R, the probability (or point) distribution function (pdf) for R is f(x)=Pr(R=x). The cumulative distribution function F for R is F(x)=Pr(R<=x) = sum for y <=x of Pr(R=y). Certain common distribution functions that frequently occur are the Bernoulli or indicator, where f(0)=P and f(1)=1-P, F(0)=P F(1)=1 and another is the uniform random variable on [1,n], where all values are equally likely, fn(k)=1/n for 1<=k<=n. Fn(k)=k/n.

This lecture goes into a guessing game where envelopes contain y and z ∊ [0,n] where y is less than z. The player picks one of the envelopes and can keep or swap based on the number inside. If they think the other value is higher than the one they see, then they swap. Usually the teacher would pick numbers within 1 place of each other. The optimal strategy for this is shown to be for the player to choose a value x uniformly in {½, 1.5,2.5,…n-.5} as a guessed midpoint between y and z. The player hopes it’s between the two and opens a random envelope, swapping if x is greater than the envelope’s value. The possible outcomes are shown with a tree with the first layer containing 3 paths. Either you guessed too low (x<y<z) at prob y/n, ok (y<x<z) w/prob (z-y)/n, or high (y<z<x) with prob (n-z)/n. The next branch is the revealed value either r=y or r=z with ½ probabilities. The probability of winning, Pr(W) = y/2n + z-y/2n + z-y/2n + n-z/2n = ½+(z-y)/2n, so there’s at least ½ + 1/2n giving a better than 50 percent chance of guessing right with this strategy. This is an optimal strategy for the guesser, but for a person choosing y and z, the optimal strategy is to pick random y between 0 and n-1 and make z be 1 more. If the chooser picks random and adds one, you can’t do better than ½+1/2n.

Lastly, the lecture goes into some binomial distributions. In an unbiased binomial distribution, f_n(k)=(n choose k)2^-n n≥1, 0<=k<=n (p=½). For the general binomial distribution, f_n,p(k)=(n choose k)P^k(1-P)^n-k 0<P<1. For example, with n components, each fails mutually independently with probability p(0<P<1) with P failures. The final theorem in this lecture is Pr(R-k)=f_n,p(k) for 0<=k<=n, which is proved using a tree to show the first component fails with p or succeeds with 1-p, the same is true for the second, and keeps going to nth component with fails with p and succeeds with 1-p. The sample space has 2ⁿ options and (n choose k) sample points have k failed components, each failing with a probability p^k(1-p)^n-k, so Pr(R=k) = (n choose k)P^k(1-P)^n-k=f_n,p(k).

The last example of the course flips 100 and asks the chances of exactly 50 heads, showing it’s less than 1/10 and 1/100. The probability of getting at most 25 heads is less than 1 in a million. The reason for this is shown by fn,p(an)<= and ~ 2^{[a log p/a + (1-a)log (1-p)/(1-a)}n}/sqrt(2pi – a(1-a)n) for o<=a<1. The maximum possible value at a is P, so fn,P(Pn)<= 1/sqrt(2pi p(1-p)n. For n=100 and p=½, Pr(50 heads) = 1/sqrt(50pi)or about .08. For n=100, p=½, a=¼, Pr(25 heads) 1.913*10^-7, or about 1 in 5 million. n in the exponent makes it exponentially small.

Lecture 22 – Expectation 1

The expected value (or the average or mean) of a random variable R over a probability space S is Ex(R)=sum of w∊S of R(w)Pr(w). An example of this in use is rolling a fair 6-sided die, with R being the outcome. Ex(R)=⅙+2/6+3/6+4/6+⅚+6/6 which is 3 and ½. The expected value doesn’t have to be obtainable. The median of R is x∊Range(R) s.t. Pr(Rx)>½. The median of the die roll is 4 as it satisfies this rule where 3 doesn’t.

The first game in this lecture is a coin toss game where 3 players wager money on the result of a coin toss and a student named Adam ends up losing. In order to show that this result was just luck, trees are graphed with each player having a possibility of choosing heads or tails. All possible outcomes of three players’ choices and coin toss results have a chance of 1/16. Adding up the expected gain seems to be zero, but certain paths weren’t taken as the other two players alternated their bets, removing any chances of him taking 4 dollar pot, making expected gain for Adam -½. Whether intentional or not, this is a common trick in betting games. Colluding with players to always pick differently gives an edge in situations where you wouldn’t think an edge could be gained. An MIT professor beat a state lottery by realizing people didn’t pick randomly. Picking randomly helps avoid spikes giving gain of 7%. The state started using random machines to generate numbers afterwards. The expected return can often be better by guessing unlikely options when too many people use likely results.

The lecture’s first theorem states that Ex(R)=sum x∊Range(R) x*Pr(R=x) and is proved by Ex(R)=sum w∊S of R(w)Pr(w) = sum x∊Range(R) sum w∊S whereR(w)=x of R(w)Pr(w). Which is sum x∊Range(R) sum w∊S:R(w)=x of xPr(w) = sum x∊Range(R) * sum w∊S:R(w)=x of Pr(w) = sum x∊Range(R) * Pr(R=x). This is the first calculation I’ve seen yet with a double sum. A corollary states that if R:S-> N, then Ex(R)= sum i=1 to inf of i Pr(R=1) with N being natural numbers in this case.

The lecture’s second theorem states if R:S->N, then Ex(R)=sum i=0 to infinity of Pr(R>i) = sum i=1 to infinity of Pr(R≥i). This is proved with sum i=0 to infinity of Pr(R>i)=Pr(R>0) + Pr(R>1)…to infinity = Pr(R=1)+Pr(R=2)+…to inf = 1*Pr(R=1)+2Pr(R=2)… = Ex(R). An example occurs where a system fails with probability P at each step (mutually independent). R is the step when the first failure occurs. Ex(R) = sum i=0 to infinity of Pr(R>i). Pr(R>i) which equals the Probaility of no failure in first i steps which is Pr(ok in 1st)Pr(ok in 2nd)…Pr(ok in ith step) = (1-P)(1-P)…(1-P) = (1-P)i = ai where a=1-P. At the end, Ex(R) = sum i=0 to inf of aⁱ = 1/(1-a) = 1/P. For example, if having kids until you get a girl, assuming Pr(Boy) is ½, R will be the number of boys which is mutually independent. Ex(R)=1/P – 1 = 1/1/2 = 1. The expected number of boys is the number of expected children minus the girl, since that stops the process, 1 in this case.

Another example is the latency of a communications channel with an average of 100 attempts. D is the delay of a packet on a channel. f(x)=Pr(D=x) is the probability distribution function for D and Pr(D>i) = 1/i. Via the theorem, Ex(D)=sum i=1 to inf of Pr(D≥ i) = sum i=1 to inf of i/i is infinite. i going from 1 to infinity is infinity, so averaging sample points can be a trap. What went wrong? The chance of seeing anything beyond 100 ms is beyond 100, but rare sample points can blow up the equation. In practice, you’ll eventually be hit with an anomaly even if many try to ignore.

Moving on, the stated most important theory in probability is the linearity of expectation. The theorem states that for any random variables R₁ and R₂ on a probability space S: Ex(R₁+R₂)=Ex(R₁)+Ex(R₂). A corollary to this is ∀k∊N, and random variables R₁,R₂…R_k on S, R=R₁+R₂. For instance, Ex(R)=Ex(R₁)+Ex(R₂) = 3.5+3.5=7 in the example of dice. This works regardless of independence. For a new example, the hat check problem is introduced. In this game, n men get their hats mixed up and receive a random hat. R is the number of men who get their hat back and Ex(R)=sum k=1 to n of kPr(R=k). The probability that R=k is 1/(k!(n-k) if k≤ n-2 or 1/n! If k=n-1,n. Trying to get this into the form of an Ex(R) function would create a mess and take a while to solve. The trick is to express R as the sum of random variables. R=R₁+R₂+…+R_n, R_i=1 if ith man gets right hat, 0 otherwise. Ex(R)=Ex(R1)+…+Ex(R_n) = Pr(R₁=1)+…+Pr(R_n=1) leads to 1/n+1/n+…+1/n = 1 man is expected to receive his hat back.

Lecture 23 – Expectation 2

This lecture starts its first theorem, Given probability space S and events A₁,…A_n <=S, which is the expected number of these events to occur. To prove this, let T_i(w) be 1 if w∊A_i, 0 otherwise. T_i=1 iff A_i occurs. Let T=T₁+T₂+…+T_n. Ex(T) = sum i=1 to n of Ex(T_i) = sum i=1 to n of Pr(T_i=1) = sum i=1 to n of Pr(A_i). An example of this is flipping n fair coins. A_i is the event the ith coin is heads, and T is the number of tails. Ex(T)=Pr(A₁)+Pr(A₂)+…+Pr(A_n) = n/2. You don’t need independence for linearity of expectations and it wouldn’t matter if results were tied together. Another method of calculating this would be Ex(T)=sum i=0 to n of i Pr(T=1), which does require mutual independence but will reach the same result of n/2, but proves that sum of i=0 to n of i (n choose i) = n2^n-1 that sum of i=0 to n of i (n choose i) = n2^n-1.

The second theorem is shorter stating Pr(T≥1)<=Ex(T) andis proved as Ex(T)=sum i=1 to infinity of Pr(T≥i). T must be defined in the natural numbers for this to work and greater than Pr(T≥1). A corollary to this is Pr(T≥1)≤ sum i=1 to n of Pr(Ai) which is proved by the first theorem. This theorem is useful when you’re trying to upper bound a disaster, such as a nuclear plant meltdown. Think of every possible event and its probability, crazy operator, terrorists, earthquake. Add up probabilities using the equation. There could be 100 ways it can happen each with 1 in a million chance. This means a 1/10000 chance. With 100 reactors, 1/100 chance of disaster. Buy 100 lottery tickets with the same chance and you have a 1/10000 chance of winning. With the equation, if there are 1000 (n) ways to cause a melt down with a probability (Pr(Ai) of 1/100, then 10 disasters are expected to happen (Ex(T)=10). If results are tied together then it’s possible that one disaster leads to another.

The third theorem in the lecture is Murphy’s law – given mutually independent events A₁,A₂…,A_n, the Pr(T=0)<=e^-Ex(T). Ex(T) is the expected value of T and the probability of no rare events occurring is very small. This is proved with Pr(T=0)=Pr(!A1⋂!A2⋂…⋂A_n) = product i=1 to n of Pr(!A_i) = product i=1 to n of (1-Pr(A_i) <= product i=1 to n of e^-Pr(Ai) = e^-Ex(T). For any x, 1-x <= e^-x. For instance, we expect 10 or more mutually independent events to occur, the probability that no event occurs is less than e^-10 which is less than 1/22,000. Some weird thing is generally going to happen, when something unlikely happens, you aren’t thinking about everything else that didn’t happen.

Next, a game is played where people choose one of 9 starting points going through a deck of cards where the card they see becomes their card and they count it’s value through the next cards to get a new card, repeating the process. Despite starting off counting 1 to 9 cards (10 and higher are treated as 1’s) through the deck and moving forward at intervals based on the card, everyone ended up landing on the same card, since there would eventually be a place where players collided.

The fourth theorem is the product rule for expectation stating for any independent random variables R₁ and R₂: Ex(R₁*R₁)=Ex(R₁)Ex(R₂). For example, rolling 2 6-sided fair independent dice, the expected product can be calculated as follows. Using R₁ as the value of the 1st die and R₂ as value of 2nd, Ex(R₁R₂)=7/2*7/2 =12.25. This only works since the results are independent however. If R₁,R₂,…,R_n are mutually independent then Ex(R₁,R₂…R_n)=Ex(R₁)…Ex(R_n). For constants a, b and any random value. R, Ex(aR+b)=aEx(R)+b.

The next example takes an article comparing code written for and running on different processors and shows a mistake when making comparisons. Let x be a benchmark, R(x)=code length for RISC on x, Z(x)= length of Z8002 processor and Pr(x)=probability of seeing x (uniform). The paper takes the expected value of z/r which does not mean that Ex(z) = 1.2 != 1.2Ex(r), which it tried to use to prove one was faster. Ex(R)=805, Ex(z)=500 is all you can include based on table. Another case R is 2 and Z is 1 for probability 1 and they reverse for 2. Z/R on 1 is 2 and ½ on 2. R/Z is the reverse. The message is never to take averages of ratios without knowing what you’re doing as the average of ratios is not the ratio of the average.

If Pr(R=1000)=½ and Pr(R=-1000) =½, then Ex(R)=0. So far, we’ve measured these by expected value, but it’s also important to look at variance. The variance of a random variable R is var(R)=Ex((R-Ex(R))2), with R-Ex(R) being the deviation from the mean, which is then squared and has its expected value taken. The variance is the expected value of the square of the deviation and the move it deviates, the higher it will be. With the example of winning or losing 1000, the variance is 1,000,000. If that were a betting game it would have high variance. With 1 and negative 1 put in place of the thousands, the variance would only be 1. This means, the expected values can be the same with different variance and risk averse algorithms can avoid high variance. The square is taken instead of simply the Ex(R-Ex(R)) in the above example as the later would be zero. Lastly, for a random variable R, the standard deviation of R is σ(R) = sqrt(var(R) = sqrt(Ex(dev2)) which is the root of the mean of the square of deviation.

Lecture 24 – Large Deviations

This lecture talks about the probability a random value deviates a certain amount from expectation. Markov’s Theorem states that if R is a non-negative random variable then ∀x>0, Pr(R≥x)<=Ex(R)/x. The proof of this is Ex(R)=Ex(R|R≥x)Pr(R≥x)+Ex(R|R0, Pr(R≥cEx(R))≤1/c. For instance, the probability you get twice the expected value is at most ½ and can be proved by setting x to cEx(R) in the theorem. As an example, if the R is the weight of a random person and Ex(R)=100, Pr(R≥200)<åå=100/200=½. The fraction of this population that can weigh at least 200 lbs is at most ½. This gives an upper bound as opposed to an exact solution, for instance, with the hat check problem, it gives 1/n instead of the 1/n! answer. It also requires that R is not a negative number. A corollary is if R≤u for some u∊RealNumbers, then ∀x<u, Pr(R≤x)≤(u-Ex(R))/(u-x). To prove this, Pr(R≤x)+Pr(u-R ≥u-x). u-R is never negative and Pr(u-R≥u-x)≤ (Ex(u-R))/(u-x) or (u-Ex(R))/(u-x). An example of this function in use takes the expected score of a random student. If the max score is 100 and Ex(R) is 75, Pr(R≤50)=(100-75)/(100-50)=½. Sometimes Markov is dead on and others it’s way off, but if you know the variance analog solutions can give better bounds.

Next up is Chebyshev’s Theorem says that ∀x>0 and any rv R, Pr(|R-Ex(R)|≥x)≤Var(R)/x². The proof for this is Pr(|R-Ex(R)|≥x)=Pr((R-Ex(R))²≥x²)

And by Markov’s theorem, the first half is less than Ex((R-Ex(R))²)/x²= var(R)/x². A corollary to this is that Pr(|R-Ex(R)|≥c sigma(R))<=Var(R)/(c² sigma(R)²) = 1/c² and is another version of Markov’s theorem based on variance. If R is the IQ of a random person, assume R≥0, Ex(R)=100, and sigma(R)=15, with sigma being the standard deviations. For Pr(R≥250), Markov says <=100/250 or 40%, while Chebyshev says Pr(R≥250)=Pr(R-100≥150)=Pr(R-Ex(R)≥10 sigma(R)) ≤ Pr(|R-Ex(R)|≥10 sigma(R)) = 1/100 and gives much better bounds due variance squaring deviations to make them count more. The next theorem states that for any rv R, Pr(R-Ex(R)≥c sigma(R))<=1/(c²+2) and Pr(R-Ex(R)≤-c sigma(R)) ≤ 1/(c²+1) and using these bounds 250+ IQ’s probability is at most 1/101.

The next topic talked about is Chernoff Bounds. To calculate, let T₁,T₂…,T_n be any mutually independent rvs such that ∀_j 0<=T_j<=1, let T= sum j=1 to n of T_j. Then for any c>1, Pr(T≥cEx(T))<=e^-zEx(T) where z=c ln(c)+1-c>0. For example, if Ex(T)=100, C=2 => z=2 ln(2)+1-2 > .38, then Pr(T≥2Ex(T))<=e^-.38*100 = e^-38. An example of this is a hundred million people playing a lottery with a win probability of 1/10000. The expected number of winners is 10,000,000/10,000 or 1000. T=T₁+T₂+…+T_1000000 in this example and Pr(≥2000 winners)≤e^-380 with independent picks, while Markov only gives ½. The probability of greater than 1100 winners with c=1.1 and z=1.1 is ln(1.1)+1-1.1 meaning it’s greater than .0048 and less than e^-.0048*1000 or about 1/100. The Chernoff Bounds theorem can be proved by using Chernoff for case T_j for either 0 or 1. Pr(T≥cEx(T))=Pr(c^T≥ c^cEx(T)) ≤ Ex(c^T)/c^cEx(T) by Markov and T=T₁+T₂+…+T_j => C^T=C^T1C^T2…C^Tn => Ex(c^T) = Ex(product j=1 to n of c^Tj) = product j=1 to n of Ex(C^Tj). You can use the product rule as the the T’s are mutually independent, so Ex(c^Tj)=c’Pr(T_j=1)+c⁰Pr(T_j=0)=cPr(T_j=1)+1-Pr(T_j=1) = 1+(c-1)Pr(T_j=1) = 1+(c-1)Ex(T_j). Because 1+x≤e^x, 1+(c-1)Ex(T_j) ≤ e^(c-1)Ex(Tj), implying that Ex(C^T) ≤ product j=1 to n of e^(c-1)Ex(Tj) = e^{(c-1)Ex(sumj=1 to n of Tj)} = e^(c-1)Ex(T). Plugging that back is as the upper bound finishes the equation Pr(T=cEx(T))≤ e^(c-1)Ex(T)/c^(c)Ex(T)=e^{-cln(c)Ex(T)+(c-1)Ex(T)}=e^-zEx(T). This is one of the messiest proofs but has one of the most important results.

A common theme so far is that Pr(A>B)=Pr(f(A)>f(B)) <=Ex(f(A))/f(B). In Chebishev, f was the square function while in Chernoff it was the exponentiation function. An example of this in use is a load balancing device dividing N jobs, B₁,B₂,…,B_n across M servers S₁, S₂,…,S_M. Set n=100,000 and M=10. B_j takes L_j time (0<=L_j≤). Let L=sum j=1 to N of L_j. Assume L=25,000, L/M=25000/10 which is a 2500. With a bunch of jobs and servers, assign servers to jobs. This is better done randomly in practice and random matching algorithms are used in state of the art load balancers from the time of this course. Let Rij be the load on server S_i from job B_j. If B_j is not assigned to S_i, then there’s no load. R_ij=L_j if B_j assigned to S_i, this is probability 1/m, otherwise it’s 0 with 1-1/m probability. Let R_i be sum j=1 to N of R_ij=load on S_i. According to linearity of expectation, Ex(R_i)=sum j=1 to N of Ex(R_ij) = sum j=1 to N of L_j/M = L/M = optimal

And the sum of L_j is just L. The probability of deviation is too much load on the ith server. Chernoff shows that Pr(R_i≥c*L/M) <= e^-2L/m, z=cln(c)+1-c With, c=1.1 and z=.0048, e^-.0048*2500<1/160,000, which is the probability a specific server gets 10 percent too much load. More important, is calculating the worst server’s probability of receiving too much load. Pr(worst of m servers ≥c*L/m) = Pr(R1≥cL/M u R₂≥ cL/mU…UR_N≥Cl/m) <= sum L=1 to M of Pr(Ri≥cL/M)=m/160,000 = 1/16000. This is important, as mistakes such as the mortgage disaster were made by assuming independence that didn’t exist.

Lecture 25 – Random Walks

The last lecture in the course introduces random walks and starts with the problem of a gambler’s ruin. A player starts with n dollars and each bet they win one more with probability p or lose one dollar with probability 1-p. You play until you win m more dollars or lose all n dollars, going broke. An example of this problem is roulette where p is roughly 47.3%. Because the odds are stacked, the target m is less than n. m=$100, n=$1000 in this case. If you get up a small percent, you go home “happy.” Regardless, it’s almost certain you’ll go broke before you win 100 dollars regardless of how much you bring as the probability of winning is less than 1/37,648.

The above problem is an example of a 1 dimensional random walk, where a value goes up, down or stays the same each time some happens. Moves going up and down in this case are independent in this case and are graphed with actions on the x axis with dollars between 0 and m for the y-axis. When you have a random walk with mutually independent moves, it’s called a Martingale. If P isn’t ½, then the random walk is considered biased, otherwise it’s unbiased. W* is event where the walk hits $T=n+m before 0. D is the number of dollars at the start, which is n. X_n=Pr(W*|D=n) and X_n is 0 if n=0 and 1 if n=T, PX_n+1+(1-P)X_n-1 if 0<n<T, proved as X₀=Pr(W*|D=0)=0, X_T=Pr(W*|D=T)=1 and for 0n=Pr(w*|D=n). E is defined as the event where you win the first bet. !E is the event you lose the first bet. With these, Pr(w*⋂E|D=n)+Pr(w*⋂!E|D=n) = Pr(E|D=n)Pr(w*|E⋂D=n)+Pr(!E|D=n)Pr(w*|!E⋂D=n). The probability of winning the first bet is just P and however you got to n+1 dollars doesn’t matter, so this equation can be further simplified to P*Pr(w*|D=n+1)+(1-P)Pr(w*|D=n-1) and PX_n+1+(1-P)X_n-1. Solving for x_n+1, pX_n+1-X_n+(1-p)X_n-1=0, with X₀=0 and X_T=1. A characteristic equation, pr²-r+(1-p)=0 helps. r= (1+/- sqrt(1-4p(1-p))/2p = (1+/- sqrt(1-2p)²)/2p = 1 or (1-P)/P. If p is not ½, then X_n=A((1-P)/P)ⁿ+B(1)ⁿ=A((1-P/P)+B. This can be solved by figuring out A and B using boundary conditions. The first boundary condition is 0=X₀=A+B=>B=-A and the second is X_T=A((1-P)/P)^T-A => A=1/(((1-P)/P)^T-1) and B=-1/((1-P)/P)^T-1. Plugging back into the formula gives X_n=(((1-P)/P)ⁿ-1)/(((1-P)/P)^T-1). A useful technique, when you have a fraction less than 1, adding 1 to the numerator and denominator makes it bigger. The formula is upper bounded by ((1-P)/P))^n-T. (P/(1-P))^T-n = (P/(1-P))^M.

Putting a theorem to show how bad the odds of winning at gambling are, if P<½, then the probability that you win m dollars before you lose n dollars is at most (p/(1-P))^M. For roulette, P is 9/19, so p/1-P = (9/19)/(10/19) = 9/10. With m as 100 and n as 100, the probability of winning is less than (9/10)¹⁰⁰ or 1/37,648. The answer doesn’t even change if you brought a million dollars as the formula has nothing to do with the amount of money brought to the table, it’s all about the low chances of winning. Even the chance of winning 10 dollars is less than 35 percent. These results change drastically if the game is fair and p is ½. In that case, X_n=(A_n+B)(1)ⁿ = A_n+B and has the boundary conditions 0=X₀=B=>B=0 and 1=X_T=AT+B=AT->A=1/T, making X_n=n/T = n/(n+m). If the game is fair with p=1/2, then Pr(win $m before lose $n)=n/n+m. A fair game matches intuition, but the slight percentage changes drastically due to double root. In the case of unbiased, it’s more likely to hit the smaller boundary because it’s random. In a biased case, there’s a downward baseline making it more likely to lose. When it’s a drift (1-2P) downward vs random swings and in random walks, drift usually takes over. After x steps, you’ll have drifted (1-2P)x, Ex(loss)=1(1-P)-1(P)=1-2P. You lose 1-2P x times while swing is just theta(sqrt(x)) and the linear draft overpowers the logarithmic swing, meaning you’ll likely go broke.

Knowing you’re likely to lose, it’s then asked how long it will take. S is the number of steps until the boundary is reached. En=Ex(S|D=n). En=0 if n=0, 0 if n=T and 1+PE_n+1+(1-P)E_n-1 if 0<n<T. The proof is the same as last time looking at both cases. PE_n+1-E_n+(1-P)E_n-1=-1 where E₀=0 and E_T=0. First, a homogenous solution, En = A((1-P)/P)ⁿ+B for P!=½. Then a particular solution, guess E_n=an+b and plugin a as -1/(2P-1) or (1/1-2P) and set b to 0. E_n=A((1-P)/P)ⁿ+B+n/(1-2P) => En=n/(1-2P)-T/(1-2P) * (((1-P)/P)ⁿ-1)/(((1-P)/P)^T-1), which is close to 18,999.44 bets to lose 100 dollars. As M->inf, En~n/(1-2p), you can’t plug in for a fair game as it divides by zero. If P=½, a homogenous solution is E_n=An+B. A particular solution before a polynomial fails is an²+bn+c, where a=-1 and b=c=0. The general fair for this is E_n=An+B-n² with boundary condition E₀=0=B. E_T=0=AT-T₂=>A=T => En=Tn-n₂=(n+m)n-n²=nm, where p=½. In this case, you can expect to play forever.

The last theorem of the course says to quit while you’re ahead. If you start with $n and p=½ and you play until you go broke, then the probability of going broke is 1, even if you expect it to take infinite time. This is proved by contradiction, assuming there’s a number n of dollars and an E that’s greater than 0 where the probability of going broke is less than or equal to 1-E. In that case, for all m, the probability of losing n dollars before winning m is less than or equal to 1-E, implying for all m, m<=(1-E)(n+m) and for all m<=((1-E)/E)n which leads to a contradiction.

Conclusion

Overall, the examples in this course felt a lot more similar to the coding problems I’m more familiar with solving, with a heavy focus on proving why answers were answers. A lot of these proofs made sense, while sometimes I would still feel lost, especially when going back to write up my notes into this form. I’m starting to realize that these course note summaries aren’t going to be very readable. However, it’s still going to be an easier format for me to search through than the original video course and at some point I’ll hopefully come back to further condense the information and my learnings and present them better. For now, I’ve enjoyed this course and look forward to the next course which is physics should be the last math one for a while before moving into the computer science courses.

NikCreate