I’ve reached the final course in my list and one that I’ve been pretty excited for. While I haven’t been the most consistent blogger, the first time I realized the blog could be useful for sticking to projects was with my series of posts going through Kaggle competitions. Before that, I’d started watching through machine learning courses on various sites, sometimes going through the code examples, but I’d eventually get distracted and stop progress pretty quickly. After enough of that, I realized that if I set a list of past Kaggle competitions focusing on different areas to go through and concluding each with a blog post on the resources I learned from, techniques I tried and results I got with each. When doing this, I was a bit frustrated at the lack of understanding with what exactly I was doing as a lot of what was happening was hidden by layers of tools and Python libraries, which led to another reason I wanted to start this project.
As I expected, MIT’s Artificial Intelligence course was a lot more like the algorithms courses and Mathematics for Computer Science than the last few courses about how computers, networks and programming languages work. The course starts off with some examples of games and search problems before moving on to introducing learning techniques. First, it looks at nearest neighbors and has lectures focusing on neural networks, genetic algorithms, support vector machines and boosting. It broadly covers concepts such as computer vision and representing language and covers a lot of the history of artificial intelligence. Overall, this course is a pretty broad introduction to the field of artificial intelligence but does a pretty good job explaining how concepts work when most of my knowledge had just been on how to call pre built functions.
MIT Courses #12: Artificial Intelligence – 6.034 by Prof. Patrick Henry Winston
Lecture 1 – Introduction and Scope
Artificial intelligence is about building models that are targeted at thinking, perception and action. MIT is all about building models and this course is about models of models of thinking. To have a model, you need a representation. So artificial intelligence is also about representations that support building those targeted models. We’ll accumulate a suite of representations for building intelligent programs. An example of the farmer, fox, goose, and grain problem is shown with drawn diagrams and there are 16 possible states drawn out as a graph. With the right representation, constraints are exposed. AI is about “algorithms enabled by constraints exposed by representations that support models targeted at thinking, perception, and action.” The generate and test method of problem solving works by generating possible solutions, feeding them into a function that tests them and checking if they’re failures or successes. A generator shouldn’t give the same solution twice and should be informable, able to categorize to avoid extra work. The Rumplestilskin principle says once you can name something, you gain power over it and can start to talk about it. Purge the word trivial from your vocabulary as trivial suggests little worth, while the most simple ideas can be the most powerful.
AI discussion started with Ada Lovelace, the world’s first programmer who wrote programs 100 years before computers could run them. In 1842, she said that an analytical engine couldn’t originate anything and could only do what we tell it to. In 1950, Alan Turing introduced the Turing test. The dawn age and modern era began with a paper by Marvin Minsky called “Steps toward Artificial Intelligence.” Not long after, Jim Slaygal (sp?) wrote a program that did symbolic integration. The professor then demonstrates a humorous chatbot program called Eliza. Programs that do geometric analogy (A is to B as C is to what?) were another important step. In the late dawn age, programs started looking at vision and shapes and forms. Programs were written that could learn from a small number of examples. The most important thing in the late dawn age was expert systems and a program written at Stanford diagnosed bacterial infections in the blood better than most doctors, though it wasn’t used, though led to several companies. Bizs is a system that knows how to park airlines and saves Delta a lot in Jet Fuel in the “Biz” age. The age beyond this was the “bulldozer” age which had Deep Blue and people saw they had unlimited computing and that computing could be traded for intelligence. Deep Blue beat the world chess champion.
The current age (of 2010) is the age of the “right way.” The initial definition of AI is incomplete as it’s about loops that tie thinking, perception and action together. AI can imagine scenes and determine information from its own internal model. A humorous example returns videos in response to words, with a small database of videos to return. An accident of evolution separates humans from other species, though we don’t know what happened 50,000 years ago to cause this. Chompsky’s conclusion is that a small group of humans learned how to combine two concepts to create a third without disturbing the first two. In other words, we learned to describe things in a way that was connected with language. We think with our eyes, so studying language isn’t enough. Language enables us to describe, which lets us tell stories and lets us “marshal the resources of our perceptual systems” and have our perceptual systems imagine things they’ve never seen.
Lecture 2 – Reasoning: Goal Trees and Problem Solving
If a program can calculate an integral is it intelligent? The kind of problem solving seen today is similar to “generate and test,” which humans do without thinking about it. For instance, solving integral(((-5x^4)/(1-x^2)^(5/2)) dx). Taking that problem and converting it into a simpler problem is called problem reduction. There are lots of calculus techniques for simpler transformations, many of which are safe and can always be done. You should have a skill at some level, but you must understand it lower down and have witnessed it at an even lower level. Removing the constants is a safe transformation. Trig substitution is for day two and doesn’t always work. Another safe transformation is that the sum of the integral is equal to the integral of a sum. Another transformation shown is that the integral of “- f(x)” = “- integral of f(x).” Fourth, if you have the integral of P(x)/Q(x), then you divide. This forms the core of a program which integrates almost nothing, but is easier to integrate. The first step in an integration program is to apply all safe transforms. Next, we check a table to see if we’re done and report “success” if so. So, integral(((-5x^4)/(1-x^2)^(5/2)) dx) becomes integral of ((5x^4)/(1-x^2)^(5/2)) dx which becomes integral of ((x^4)/(1-x^2)^(5/2)) dx.
Applying all of the safe substitutions isn’t enough on its own and some heuristics are needed such as trig substitutions. Another is a family of transformations such as integral of f(tan(x))dx = integral of (f(y)/(1+y)^2) dy. This family transforms trigonometric functions to polynomial ones. Another group converts polynomial to trigonometric forms. 1-x2 becomes x = sin y. Similarly, 1+x2 becomes x = tan y. The integral function can be further simplified to integral of ((sin4y)/(cos4y)) dy. In a graph, an AND node has a graph drawn over it. For problem reductions you can use a tree that is sometimes called an AND/OR tree or a Goal tree. OR nodes split the tree into different possible paths where only one can be taken. A good heuristic for which to take is to take the easier path, with Slaygal choosing the one with the lower depth of functional composition. In this case, “integral of tan4(x) dx” is chosen over “integral of (1/cot4(x)) dx” for the next step. With “y = tan x”, this becomes integral of y4/(1+y)2 which becomes integral of y2 – 1 + (1/(1+y2))dy which is a sum and divides into three pieces split with an AND node. The pieces are integral of y2 dy, integral of -1 dy and integral of 1/(1+y) dy. The second can be transformed again to negative integral of the integral of dy. The third can be reduced to the integral of dz. All of these simplificiations are in the shown table and the problem is solved.
The problem is solved with a loop of running safe transformations, checking the table to see if the problem is done, and if not, applying heuristic transformations before going back to the safe transformations. AND nodes are used to split sums and OR nodes are used to split branching paths. The function always decides between the leaves of the tree even if it backs up because a previously avoided branch is simpler. Describing Jim Slaygal’s program is a miniature introduction to artificial intelligence. The machine Slaygal used had 32K of memory and was stored in MIT behind glass. Given 56 of the hardest problems, it got 54 right. It was missing a couple transformations it needed to solve the 2 it missed. In the worst case, the tree depth was seven levels and on average its depth was approximately 3. The number of unused branches was only about 1, not going too far down a useless tree. In bad paths, nodes quickly hit dead ends.
It’s important to ask questions about knowledge. What kinds of knowledge was involved in this problem? Knowledge about transformations, goal trees and when a problem is done, and what can be looked up in a table. How is the knowledge represented? Expressions, a table of integrals and expressions, goal trees were represented in a procedure. How was it used? Table serves as the bottom of the tree, transformations are used to simplify the problems. A table of 26 elements was enough to solve all of the problems. Knowledge about knowledge is power.
Lecture 3 – Reasoning: Goal Trees and Rule-Based Expert Systems
A program is shown which has a set of blocks stacked in some way. You tell it to put one block on top of another and it will do so. First it has to find space on the target block, grasp a block, move it on to the target, and ungrasp. Grasp also needs to clear blocks off the block being grabbed. So grasp calls clear top which calls get rid of which calls put on. This leads to recursion and the program appears complex. To find space, you also have to call get rid of once or more. For instance, given a setup where BX is on B1 and BY is on B2, to PUT ON B1 B2, a tree divides into four subroutines. FINDSPACE, GRASP, MOVE and UNGRASP. GRASP calls CLEARTOP B1 which calls GET RID OF BX which calls PUT ON BX TABLE, starting another set of tasks. To answer the question of why, go up a level and see why the call was made. To answer how something was done, go down a level. The four nodes PUT ON leads to are AND nodes and this is an example of a goal tree which allows for answering certain questions. Simon’s ant is a metaphor saying the complexity of behavior is the max of the complexity of the program and environment. A simple program can lead to complex behavior for a complex problem.
Rule-based expert systems were started in the 1980’s and people believed that all of human intelligence by writing simple rules such as if/else statements. Thousands of these expert systems were written. An example shown is playing 20 questions with animal details to figure out an animal is a cheetah. For instance, if an animal has claws, sharp teeth and forward pointing eyes, it’s a carnivore. A fast mammal that has spots and is a carnivore is a cheetah. These rules are combined with AND gates which lead to different animals. This kind of system is called a forward-chaining rule-based expert system. This set of rules and AND gates is the same as creating a goal tree and the system can answer questions about its own behavior. Why were you interested in an animal’s claws? Because I wanted to know if it was a carnivore. How do you know it’s a cheetah? Because it has spots, is a carnivore, etcetera. You can also build a backward chaining rule-based expert system that checks that an animal is a cheetah taking a cheetah and checking if it’s a carnivore, mammal, etc. The system can also avoid answering questions that will no longer determine anything based on what’s known so far. Whether backward or forward, this is a deduction algorithm which uses information to get other information. Once you prove something is true, it can’t be false. If thought of as a programming language, however, rules could add or subtract from a database which could change information and take something away. This sort of twenty-question’s style of asking questions is called a Mycin’s dialogue, with Mycin’s rules determining what is learned.
A few principles of knowledge engineering are useful for solving problems. The first is to look at specific cases to gain information, such as watching someone bag groceries. The second is to ask questions about things that appear to be the same but are handled differently. For instance, in bagging groceries, a bag of peas and can of peas are handled differently because of their containers. Asking that question will lead to new rules and power over a domain. The third principle is to try building a system and see where it fails. Eventually a grocery bagging program would either do something wrong or fail and you’d know a case is missing. These are also mechanisms humans use to make themselves smarter. Systems like this have a thin veneer of intelligence which starts to crack. For instance, even if a bagging program knew to put a bag of potato chips on top of heavier objects, but did not know its to avoid them cracking or a customer being disappointed. There are some situations where rules could play a role in human’s understanding of things similar to computers. “When we tell a story, it’s mostly a matter of controlled hallucination. I know what rules are in your head, so I can take advantage of that in telling a story and not have to tell you anything I’m sure you’re going to know.”
Lecture 4 – Search: Depth-First, Hill Climbing, Beam
A map of Cambridge is shown with a starting and goal node and different nodes at different points in the city. A British Museum Search is an algorithm which finds every possible path. You can’t save much work by being clever if you have to try everything. Search isn’t about maps, it’s about choice. Depth first search (DFS) is about barreling ahead in a single minded way. For instance, you could start by going the leftmost path in every choice first and then back up to the last choice and make the other one in a binary tree. This backup is also called backtracking and is useful with depth first search. The natural companion to depth first search is breadth first search (BFS) which builds up a tree level by level. At some point, after completing a level, you’ll find the path that reaches the goal. On maps, these are stupid algorithms as they go through paths multiple times. An improvement on BFS only extends a node if it goes to a node that wasn’t already extended. This requires keeping a list of places which were already the last node seen on an extended path. If a path terminates in a node and a path had previously terminated in that node and was extended, it’s a waste of time to repeat. This speeds up BFS significantly and helps with DFS as well. You don’t backtrack with BFS.
A hill climbing search works like DFS but breaks ties by choosing the node closer to the goal. Hill climbing doesn’t always choose the optimal path but reduces backtracking and improves on DFS. If you have a heuristic for telling you you’re getting closer to the goal, you should do it. Hill climbing is the first “informed” search, taking advantage of extra information. Just as hill climbing is an analog of DFS, beam search adds an informing heuristic for BFS. Beam search limits the number of paths it considers at a level to a fixed number called the beam width. If beam width is two, it will choose the two nodes on a level that get closest to the goal. A final variant is best first search which always works on the path closest to the goal and skips all over the tree to the best known path. Hill climbing can lead to problems such as getting stuck on a local maximum and being fooled into thinking you’re at the top.
Lecture 5 – Search: Optimal, Branch and Bound, A*
How do you find the shortest path as opposed to just a good path? An oracle could tell you about a path but you may still need to check if the given answer is correct depending on how much you trust the source. To find the shortest path, distances travelled so far are kept track of and you can avoid paths that already have a higher accumulated length than the length of a found path to the goal. You won’t always be able to rely on an oracle, however. Branch and bound tries expanding the path with the shortest distance travelled so far at each step. Until the goal is reached, the path to it is treated as having a length of infinity. Once the goal is found, you still need to check the other paths in the same way as when there was an answer from an oracle. This algorithm works by initializing a queue, testing the first path and extending the first path (if not already extended) and sorting elements, it then tests the first path again, creating a cycle. Keep track of nodes that have been extended and don’t extend them again. This is still an improvement of branch and bound and not a different algorithm. By adding in airline distances between points, you can extend the path with the shortest potential distance at each step (current distance travelled plus straight line distance away). If the heuristic estimate is guaranteed to be a lower bound of the actual distance, it’s called an admissible heuristic and using an extended list works much better in the example. In general, it depends on the problem. Putting the extended list and admissible heuristic on branch and bound gives the A* algorithm which gets the improvements of both and is generally better than either working independently. Sort is an expensive computation and you can test the shortest path instead of the first path instead, giving the complete A* algorithm. Since search isn’t just about maps, an admissible heuristic won’t be good enough unless it’s a map and may avoid extending something that needs to be extended. Admissibility only works with maps. If the heuristic distance between point X and a goal is at most the actual distance between them, the heuristic is admissible and A* works if it’s a map. If not a map, we need the stronger heuristic of consistency where |H(X,G)-H(Y,G)| is at most the actual distance between X and Y. Mapping programs make use of these algorithms.
Lecture 6 – Search: Games, Minimax, and Alpha-Beta
This lecture talks about ways to design a program to play games like chess and what programs like Deep Blue add besides speed. The first approach a machine could take to play Chess is to make a description of the board and mix analysis, tactics and strategy to determine a move. The trouble is none of the programs today (2010) know how to incorporate any of that stuff. Another way is to use if-then rules to determine moves based on the board state. This just uses predetermined rules without any added analysis. No one’s made a good Chess player that works this way, though it’s been done for Checkers. The third strategy is to look ahead and evaluate. Look at all of the possible moves and determine which leads to the best outcome based on an evaluation technique. A Chess board contains a lot of features which can be passed to a function to give a static value. This is a value of the board from your perspective and is typically a linear polynomial. A way of playing the game would be picking the value with the best value at each step. There are other scoring methods besides linear scoring polynomials.
Another approach would be to evaluate the entire tree of possible game states. The branching factor (B) of the tree is the number of child nodes from each parent and the depth (D) is the number of steps from the root to the furthest leaf. There will be B^D nodes in a tree with equal number of branches at each node. In chess, there are about 10120 leaf nodes at the bottom of the tree and you’d have to do that many static evaluations to see which nodes were best at the top. There are 1080 atoms in the universe and pi*107 seconds in a year. The British museum algorithm won’t work even with cloud computing.
The fifth approach, and approach that will actually be used, is to look ahead as far as possible. An example shows a tree with leaf nodes of different values. One player wants to reach the highest possible leaf and the other wants the lowest possible leaf. With the minimax algorithm, you go to the bottom and compute the expected values at each parent node based on who would take the turn. The max player wouldn’t expect the highest result since they’d know the min player would try to minimize. Each level of the tree takes a different player’s moves into consideration. For instance, a given binary tree has leaves of 2, 7, 1, and 8. The root node is 2 levels up and starts with the max player. They know the next level will be dependent on the min player, so they’ll either get 2 (2 vs 7) or 1 (1 vs 8) and view the max score they can get as 2. The goal is to get as far down the tree as possible. Eventually, if you get far enough, piece count will be the key measure in Chess. Since the museum search isn’t possible, the next approach is to figure out how to cut out unneeded branches in the tree. The alpha-beta algorithm is a layering on min-max which cuts off large sections of the search tree. In the given example, if max determines that one side gives 2 and sees 1 as a child in the other half, it doesn’t need to look at the 8 leaf since it already knows it will get less than 2 from that side. This gets much more effective in larger trees. The algorithm is called alpha-beta because it has two parameters and saves on both static evaluation and branch traversing. In the optimal case, this requires only 2B^(D/2) static evaluations instead of B^D.
With a time limit it could be important to have an answer ready even if calculations aren’t complete. Looking back at minimax and ignoring alpha beta, there are B^D values at the bottom level of an even tree, B^(D-1) above that, etcetera. If worried about not getting all the way through the calculations on the final level, an insurance policy could look at the level above that. With a branching factor of ten, this would only require 10 percent of the work. If the factor is really big, you may have to get insurance at another level up. You could try starting from the first level to get an insurance policy for every level. The computation needed is S=1+b+b2….+bd-1 to get an insurance policy at each level which is basically bd-1. This idea is called progressive deepening and is useful since the amount of computation at the penultimate level is only slightly less than computing insurance for every level.
Alpha-beta is a layer on top of minimax that doesn’t give you a different answer but speeds it up, using the “dead horse” principle. Progressive deepening is used to calculate different values at different parts of the tree to have an answer ready at any time (an “anytime algorithm”). Progressive deepening is viewed as a use of the “martial arts principle,” using the enemy’s characteristics against them. Deep Blue was minimax and alpha beta plus progressive deepening along with a lot of parallel computing and an opening book, special purpose stuff for the end game and uneven tree development. For instance, if it thinks it might be able to capture a queen, it might prioritize options that lead to that, instead of just calculating each level. Deep Blue still substitutes raw power for human sophistication and uses “bulldozer” intelligence. All game playing algorithms combine minimax, alpha beta and progressive deepening.
Lecture 7 – Constraints: Interpreting Line Drawings
Artificial intelligence is a new field and the professor knows most of the people in it. Guzman was tasked with coming up with a program for determining the number of objects pictured in a line drawing. The program succeeded by using lines as “quanta of evidence” for “faces” belonging to an object. Certain connections of lines are “links” and numbers of links were used to try and conclude the number of objects. Marvin Minsky liked the results and David Huffman didn’t like it and thought it was adhoc. The world is full of three faced vertexes which generally project into an arrow or a fork in the drawing. Guzman was an experimentalist and Huffman was a mathematician, so they approached the problem differently.
Huffman instead looked at a world with certain characteristics for solving the problem. The world was presented in a general position such as from a specific angle showing a cube in a way that shows three faces instead of from the side, the world is trihedral and all vertexes are formed from three planes. How many kind of junctions can you see if the world is composed that way? Guzman only recognized two kinds of lines, ones with links and ones without which isn’t dealt with in the real world. Huffman wanted to get his constraint out of the physical world and dealt with four kinds of lines including concave(+), convex(-) and boundary(<- or ->), depending on which side of the object stuff is one. Labels used are vertexes, edges, junctions and lines. There are 18 ways to arrange labels around a junction and everything else can be excluded. Having these complete constraints is useful for being able to solve the problem. If not in the list of possible junctions, the shape cannot be constructed. Passing this test doesn’t mean it’s possible for the shape to be constructed.
Later, Waltz added cracks, shadows, non-trihedral vertexes and light. With this, Waltz went from 4 labels to more than 50 labels. 18 labels for lines to come together around junctions became thousands in Waltz’s world. Guzman had the problem and something that worked, Huffman had a method that worked on the wrong problem and Waltz brought the problem together. Huffman’s examples could be worked by hand but Waltz’s couldn’t and wrote a unique DFS algorithm. First, Huffman’s algorithm starts by “plopping” on some junction all labels the answer has to be drawn from. Then it looks on the neighboring junctions to see if anything placed is disallowed by a neighboring junction and eliminates ones that aren’t, so it isn’t a complete search through every possibility. Sometimes answers are ambiguous, but it will be able to get correct answers for unambiguous situations. A similar mechanism will be used for other problems finding solutions with a lot of constraints.
Lecture 8 – Constraints: Search, Domain Reduction
This week starts by looking at a depth-first search for a map coloring algorithm choosing different colors at each level to avoid US states that are next to each other from having the same color. Attempting all combinations with a DFS isn’t as bad as Chess, but still takes a long time. For instance, with states like Texas which have a lot of neighbors, if you start far away, you’ll get to Texas at the end and find there isn’t a proper coloring. Constraint propagation won’t work well either. Possible colorings for the US map are red, blue, green and yellow in the example. Solving this problem uses the “martial arts principle” as undiscovered local constraints will cause problems. For some definitions for the problem, a variable V is something that can have an assignment. A value x is something that can be an assignment. A domain D is a bag of values. A constraint C is a limit on variable values. Looking at Texas surrounded by four states, states have the role of variables, colors are values, domains are the remaining color values usabe on a state, and the constraint is “no states sharing a boundary can have the same color.” A procedure can be written from these definitions using pseudocode for a DFS.
For each DFS assignment:
For each variable V considered:
For each xi in Di:
For each constraint C (Xi, Xj) where Xj is an element of Dj
If there isn’t an Xj such that that constraint is satisfied:
Remove Xi from Di
If Di is empty, backup
This is called the domain reduction algorithm. A couple of improvements on DFS are considering assignments, checking neighbors, propagating through domains that reduce to 1 value and propagating checking through variables with reduced domains. What about ordering? Is it better to start with Maine (bordered by 1 state) or Texas (bordered by 4). Ordering states from most constrained to least constrained would work fine with ordinary DFS. Similar to games, use everything you have to deal with the problem and one or more incorporated factors will work great.
This is also useful for all sorts of resource planning problems. Another example is a resource allocation problem, trying to maximize flights with a minimum number of planes which is basically the exact same problem. An aircraft can’t handle two flights at the same time and there’s a minimum ground time constraint. These problems can also be made easier by adding colors or planes. Good resource allocation can be done if you use the most constrained first, propagate through domains reduced to a single algorithm, and converge on a narrow range where it takes a while. Too many resources is fast to complete and too little is fast to terminate.
Lecture 9 – Constraints: Visual Object Recognition
At the end of the lecture, we’ll know what it takes to recognize faces and objects in general. Attempts at solutions have evolved slowly for the 30 years prior to 2010 and will continue to evolve slowly. David Maher suggested finding edge fragments based on images from a camera and uses the edge-based description or “primal sketch” to figure out what was in the world. The primal sketch was decorated with vectors showing where facial features were oriented. This was a 2.5-D sketch since it attempts to say something about the 3D arrangement of faces from a 2D picture. The third step converted this 2.5D sketch to generalized cylinders. A regular cylinder can be described as a circular area moving along an axis. Allowing the circle to change size as it moves, it would shape other things such as wine bottles. The 2.5D sketch converted into a cylinder and matched to a library of descriptions, results in definitions. This was a good idea in that it used transformations, but no one could make it work and it couldn’t distinguish car types or faces. 15 years later, the alignment theories, notably produced by Shimon Ullman. With three images of an object, any view of the object can be reconstructed (if far enough). How do you tell if a 4th object is the same as the one used for three pictures. Pick places on all four to map to each other. The relationship between the points on the unknown object and points on the stored library are linearly related and the points on the fourth can be generated based on the other 3 with linear operations.
If you know the equations are correct, you can solve for unknown variables to figure out the shapes. For a point XU, it could be calculated by the corresponding positions on the first three objects. XU=alpha XA + beta XB + gamma XC + tao. Tao is the translation and all other points can be calculated the same way. Positions of points will be calculated based on the rotation along the x and y axis. Points positions will be able to be calculated, but cos, sin and theta variables will need to be calculated from those points. Looking at a single point on the object, what happens to it between two images. First it’s put in a standard position. The X coordinate in the standard position is XS and Y coordinate is YS. The image is rotated three times, once to form the A picture, one for B and one for C. All of these pictures are drawn on a graph. Targets are based on vectors thetaA, thetaB, and thetaA. How does XA depend on XS and YS. All points on the object will be rotated by the same angle. For instance, if XA = Ss cos thetaA – yS sin thetaA and XB = SS cos thetaB – yS sin thetaB on the same object, all of the images with a particular rotation have constant angles and cosines for all points on the object. Treating these as constant, XS and YX (and same for u), Xu=alpha XA + beta XB where alpha and beta are constants. For orthographic projection (taking projection along x axis), this equation needs to hold. Allowing translation along with rotation adds a constant tao. Adding translation adds a required picture to what’s used for rotation. This works well on identical manufactured objects, but the natural world isn’t static in the same way.
The Goldilocks principle looks for things that aren’t too small or too big. Two faces having similar eyes might not be enough to match them, but eyes and mouth could be enough. Asking for eyes, nose and mouth to be identical might be too much to ask when looking for someone in a crowd. Considering a one dimensional face with the same “signal” in an image at a different point, you could do an integral of the signal in the face and image to see how it multiplies out. In other words, maximixe (over offset X) the integral over x of a face (f(x)) * g(x-X). When the offset, t, is equal to the offset, you’re effectively multiplying the thing by itself and integrating over the extent of the face, giving a large number if lined up and small if not. This is still true after adding noise to the image. If not 1D, then Run Max over X,Y integral over x of f(x,y)g(x-X,y-Y). Other courses remain the “custodian” of topics such as normalization and this is the extent matching faces in images work. An animated program adds noise, blurring an image with lots of faces. A correlation is run over the image using a student’s face as a “mask” and it can find the face at a certain point in the image. Two eyes is a medium sized constraint as 1 matches many different faces. A lot of noise can be added while the program can still find the desired image. One problem is that this only matches the same photo while people can recognize each other from different angles. Do we just record a head turning at all possible angles? What about stretching or turning images? What about when verbs are happening in an image?
Lecture 10 – Introduction to Learning, Nearest Neighbors
“What would a dog think a diet coke is for? That’s another puzzle that you could work on while we go through the material of the day.” There are two kinds of learning. One kind is learning based on observations of regularity, which computers are particularly good at. Nearest neighbors, neural networks and boosting are examples of learning based on regularity. Nearest neighbors is something used heavily in pattern recognition. It’s been around since 1910 (?) and is simple and useful. Neural networks attempt to mimic biology. Boosting is the “gift of the theoreticians.” Regularity-based learning is the branch of bulldozing computing. The other kind of learning is based on constraint and is more human like. For instance, one shot learning teaches something definite for each experience. Explanation based learning is another kind of constraint based learning.
This lecture focuses on nearest neighbors and pattern recognition. In nearest neighbors, some mechanism known as a feature detector generates a vector of features and outputs a vector of values which go into a comparator of some kind. The comparator compares the feature vector to vectors in a library of possibilities. By finding a closest match, the comparator determines what the object is, doing recognition. An example represents electrical outlet covers based on total area and hole area. After plotting values, a closest neighbor plotted on the graph to determine what a new item is, first plotting idealized points to use to represent new ones. Boundary lines are drawn on the graph to make it easier to show the label to give to new points. These lines are known as decision boundaries. A weak principle has if something is similar in one respect, it’s similar in others. If only the hole area is measured on what, there’s a probability it has the same total area as the ideal point plotted in that area, though this may not be true.
An example problem has a collection of articles. The goal is to figure out how to address a question. How do you find the articles relevant to the question? This has been studied by people interested in information retrieval. Count up the words in the articles in your library and compare word counts to those in the probing questions. An example looks for the words hack and computer (plotted on x and y axises). Articles from Wired magazine would include lots of references with a higher x and y value. Articles that use “hack” might refer to a farming magazine, “hacking away” at. If you’re asking a probe question about hacking, the number of references to hacking will be small due to probe size and the farming magazine would appear closer. Instead of just comparing word counts, forget about euclidean distance and look at the angle between the vectors. Cos theta (the angle between vectors) will equal the sum of the unknown values (Ui times the article values (Ai) and divide all that by the magnitude of U times the magnitude of A. That’s a fast computation for seeing if they’re in the same direction. If an article is used to probe the article space, it will find itself, which is good and works better than word count.
Another problem is robotic arm control, trying to get an arm to move a ball along a trajectory with a desired speed, velocity and acceleration. The arm is divided into three parts. One is a stand, an arm segment comes off the top of the stand at an angle and another arm segment (with a hand) comes off the first arm segment. The angle between the top of the stand and first segment is theta1 and angle the second segment comes off the first is theta2. The first problem is the kinematic problem, translating the X,Y coordinates of the ball that are desired into the theta1, theta2 space. Then there’s the problem of getting the ball to go along the directory with desired speeds and acceleration, using Newtonian mechanics. The equations are very long and complex due to the complicated geometry. A table would include values for theta 1, 2, and their accelerations and velocities, as well as Tao1 and Tao2 (torque on first and second motors) as columns. With a lot of these records, the desired trajectory can be divided into small pieces with values that can be looked up in the tables for the set of values that’s closest to the goal. This way, the answer can just be looked up in the table. It works pretty well for some problems. A machine is shown training to bounce a ball repeatedly using a time that gets better every run. The first few bounces perform badly but it writes to it’s table to have more to look up, getting better and better as it tries more times.
If the arm is to be used as a baseball pitcher, for each segment, 100 bytes per joint are recorded and 100 joints are used (as an example). The pitch is divided into 100 segments and a pitcher throws 100 pitches a day (not sure on these numbers). How much memory is needed to record all pitches a pitcher pitches in a career (of 100 days for 100 years). This requires 1012 bytes to store. Our brain has 1010 neurons in the brain with 1011 in cerebellum alone (what). This means the cerebellum dwarfs the rest of the brain. A neuron has a variable number of synapses, but the ones in the cerebellum have 105. This just means you don’t have to worry too much about not having enough storage for 1012 for arm control. Do humans just use a table for motor skills? If data is centralized around a smaller range of x, it’s easy to fix by normalizing the data. What if the answer doesn’t depend on y at all? You’ll get messy results. What if the answer doesn’t depend on the data at all? Then you’re wasting time and confusing yourself.
Lecture 11 – Learning: Identification, Disorder
Nearest neighbors doesn’t work for identifying vampires because candidates either cast a shadow or not. A lot of characteristics either don’t matter or aren’t numeric and can’t be plotted. Additionally, some characteristics matter, but only matter some of the time. Some vampire tests may be more expensive than others, such as going out during daylight to see if they have a shadow or forcing someone to eat garlic. For this kind of problem, you’re talking in terms of tests and not real values. (I’m guessing this lecture was Halloween?) This problem is visualized as a tree of tests. This kind of tree is called an identification tree. A decision tree is a label for something else. An identification tree is good if it is the simplest possible tree that successfully breaks into proper labels. The simplest explanation is often the best explanation due to Akkam’s razor. The tests are shadow, garlic, complexion and accent with outcomes of yes (vampire) or no (person). The shadow test divides the sample test into three groups, yes, no and unknown (some MIT students aren’t seen during the day). No shadow means a vampire and a shadow means not. If someone eats garlic, they aren’t a vampire, if not, it’s unknown. Complexion has three choices, average, pale or ruddy. Accents are normal, heavy and odd. No tests match everyone to their result off the bat. In the example, shadow puts the most people into homogenous sets (everyone except unknowns) and garlic tells if people are human. Depending on the data set, people could be divided into exclusive groups without it being conclusive. A shown tree first calls the shadow test and sends the unknowns to the garlic tests.
We need a way of measuring disorder for overall quality of tests. According to information theorists, the disorder of a set, D(set)=-P/T * log2 P/T – N/T log2 N/T, where P is the number of positives, N is the number of negatives and T is the total number of cases. If P/T is ½, D(set)=1. If P/T=1, D(set)=0 (using L’Hospital’s rule). Given the shadow rule produces three different sets which can have their disorder measured, how do you value the whole test. An initial solution is Q(test) = sum over sets produced of D(set) times the number of samples in the set divided by the number of samples handled by the test. The shadow test still has the lowest disorder and is still used as the first test and then garlic has the lowest for the remaining tests. This sort of evaluation is used in practice for coming up with good test procedures and can also be used with numerical data. For numerical data, you use threshold’s to lead to categorical results. For numerical data, you may find yourself doing the same test at different layers with different thresholds. You don’t need to do all of the tests and can prioritize tests that do the best work. This lecture was before Halloween.
Lecture 12A – Neural Nets
This lecture is divided into two parts which replace the original Lecture 12: Neural Nets and Back Propagation. The topic of neural nets was almost removed from the course after 2010 as none had been effective yet. Two years later, there was success with classifying pictures using neural nets. An example from Toronto had 60 million parameters and was good at characterizing pictures. For instance, it could recognize a container ship, motor scooter and leopard and stated probabilities for its guesses. Some bad examples are calling a convertible a grille and a “madagascar cat” as a squirrel monkey. Since then, lots of effort has been put into neural network technologies. A neuron has a cell body with a nucleus and a long stretch called the axon. One the other side, it has a branching structure that looks like roots called a dendritic tree. At the end of the amazon will be a thickening and another neuron’s dendrite. Similarly, another neuron’s axon hits the dendrite of the neuron modeled. If there’s enough stimulation in the tree, a spike goes down the axon and the neuron goes quiet to recover strength in a refractory period. When the axon is stimulated it dumps vesicles into the inter synaptic space between neurons. To model this, you have a binary input as neurons either fire or don’t. Inputs are weighted (modelling the influence of the synapse on whether the axon is stimulated) and collected together. If the input exceeds a threshold it outputs a 1, if less it’s 0. This is an all or none model with cumulative influence that tries to model synaptic weight. A real neuron also has a refractory period or axonal bifurcation. A pulse in a neuron will go down one branch or another but we don’t know how a lot of it works. Does the timing of arrival of pulses in the dendritic tree have anything to do with what the neuron will recognize? There are a lot of unknowns with actual neurons.
The skull can be viewed as a big box full of neurons. Neurons are full of weights and thresholds. A variety of inputs enter the skull box and a bunch of outputs come out based on the influence of weights and thresholds. Vector Z = f(x,w,t), which are all vectors. A neural net is trained by adjusting those weights and thresholds so what you get out is what you want. A neural net is a function approximator. Some sample data could give a desired vector d=g(x). How well a neural net is doing can be measured by comparing the desired vector to the actual output vector for a set of inputs. What should that performance measuring function be? A simple technique is to measure the magnitude of the difference. A better solution squares this, P=-||d-z||2, where the best value is zero and it’s always a negative value otherwise. The goal is to take the weights and thresholds and maximize performance. A hill climbing approach won’t work with 60 million parameters. Instead of taking a step in every direction, we take partial derivatives. Delta w = rate(∂P/∂wi * i + ∂P/∂w2 j).
Unfortunately, gradient descent requires a continuous surface. Thresholds get in the way, though another input of negative 1 can shift the function over so it fires over 0 and removes the need for a threshold. A step function isn’t good and it’s replaced with a sigmoid function with an S-like shape, 1/(1+e-alpha). Alpha is the input, the bigger alpha goes up to 1 and if alpha is negative the function goes to zero. This is a convenient function. Now we can take partial derivatives to train the neural net. 1 neuron isn’t a net, but 2 is. The simplest neural net has 2 neurons and 2 parameters. An input x goes into a multiplier where it’s multiplied by w1, that value (P1 goes into a sigmoid box which outputs y which goes into a multiplier where its multiplied by w2. That value p2 goes to another sigmoid box which outputs z. The performance function is P=-½ (d-z)2. ∂P/∂w2=∂P/2z * ∂z/∂w2 by the chain rule. With another chain rule, ∂P/∂w2 = ∂P2/∂w2 * ∂z/∂p2 * ∂P/∂z. ∂P/∂w1 = ∂P1/∂w1 * ∂y/∂P1 * ∂Pz/∂y * ∂z/∂p2 * ∂P/∂z. If beta is 1/(1+e-alpha) and we want, dbeta/dalpha, that equals d/dalpha(1+e-alpha)-1 which is (1+e-alpha)-2 e-alpha. The derivative comes out to beta(1-beta) and the ∂ variables can be substituted.
The neural net is shown being trained to make the output being set to the input. Randomization is required and weights are random. It’s important to be careful with the rate constant. If it’s too small, it won’t move and if it’s too big it can jump over a hill. A good function can vary this as it gets closer to the top. A second identical pipeline is added with a x2 value. Values from either can be sent as inputs to parts of the other which can get messy with a variety of paths with weights all over. There appears to be an exponential number of paths which could lead to a blow up as the size grows. Part of the partial derivative with respect to w1 was already calculated during the process for w2. That sort of repetition will occur over and over again and the system won’t blow up exponentially. The influence of changes of P on performance is also all we care about looking back at the network. This reuse of partial results is a similar idea to Fast Fourier Transform. This network is linear in depth and the computation grows linearly as the number of inputs does. With respect to width, any neuron in one row can connect to another, so it will grow exponentially but is readily computed. This was overlooked for 25 years. Simplicity requires tricks and observations that look simple in retrospect.
Lecture 12B – Deep Neural Nets
This lecture goes from a neural net with two parameters to 60 million, which is a pretty big jump. A key idea is that columns are pretty well divided and you can’t change something later on without going through transformations in the column. The number of paths can be exponential, but there is still a limited number of ways one point can affect another. In all cases, take the partial derivative of the performance with respect to weights. Many parts of those calculations are identical even though used in different parts of the neural nets. As you move further back from outputs to the inputs, lots of work is reused. A deep net was used for the image recognition system. That system had an error rate of 15 percent for having the answer in the top 5 and 37% if only looking at the top answer. Dot products of weights and values are used heavily in neural nets.
An example is shown of convolution with a 256 by 256 pixel image. The image is run over by a neuron that looks only at a 10 by 10 square. The neuron is shifted repeatedly, producing an output each time. This process is called convolution and results in an image with a lot of points. Values of local neighborhoods in those points are checked for maximum values and sliding over a field checking for maximum values produces a smaller group of points. This grouping process is called pooling and since we’re taking the maximum, it’s called max pooling. The neuron moved across the image is called a kernel. The output can be repeated, typically 100 times to add more inputs to a neural net.
In the old days before massive computer power was available, an easier to visualize neural net was used. There would be a smaller number of pixels due to computing power. Each pixel would be a value fed as an input to a neuron and you could have a small number of columns of neurons looking for different things in the image. Everything is scaled up massively today in our current day giving performance no one expected was possible.
Given some input values passed to an input layer (of neurons) and then to a much smaller hidden layer which expands to an output layer of the first layer’s size. The output layer can be compared to desired values, equal to the inputs here. This net is trained so that the output is equal to the input and backpropagation is used for training. For this to succeed, all of the original inputs must be compressed into the hidden layer’s smaller space while maintaining the ability to recapture their original information. With training, this process can be quite successful. The process doesn’t know about the values you’ll try to get out of it (such as animal types), but can learn to generalize the values it sees. The neural nets come up with encoded generalizations. This process is known as autoencoding. Each layer can be used to train the next layer and lots of knowledge about the environment can be gained by the machine this way, before passing classes and attempting to classify at the end.
A final classifying layer may look something like a summer taking a -1 value going through a multiplier with a threshold value, and an input value multiplied by a weight. The summer’s output goes into a sigmoid function to give an output z. Z will equal 1/(1+e-wx+t). The function is based on the value of the weight and threshold. Changing the threshold value will shift the sigmoid left or right while changing weight will change its steepness and shape. Values of classes may be associated with specific points on the sigmoid curves. The probability of seeing the class is the probability of seeing that value. You can also have values corresponding to classes not associated with a neuron. The probability of seeing the negative value is 1 minus the point on the curve. The right transformations of the sigmoid function values are passed to will be based on the values seen, by changing threshold and weights. The probability of seeing desired data is based on the shape of the sigmoid curve and the goal is to maximize that probability.
Back propagation has layers of ideas layered on top. The next idea layered on it is that output values are functions of different classes. Instead of picking the value with the highest output and saying it’s the winner, you can provide probabilities for multiple classes. The probability of class 1 is F(c1)/(sum over i of F(ci). The probability is gained by dividing the function output by a normalizing factor. Softmax is the idea of giving a range of classifications with probabilities for each. Combining softmax and autoencoding, you can train the middle layer up and detach it from the output layer while retaining weights connecting the input to the hidden layer. This leads to a trained first layer and untrained output layer. The input layer is then freezed and the output layer is trained with the sigmoid curve. This leads to a good match between outputs and desired outputs. The problem with a lot of neural nets is they get stuck on local maximums. They train better if you effectively flip a coin for each neuron on each iteration with different sets removed or turned off for each. This removal is known as dropout. Deep nets should be called wide nets as they’re rarely more than 10 columns but can get incredibly wide.
All of the sophistication with output layers that are probabilities and training using autocoding doesn’t help much relative to using backpropagation. Backprop with a convolutional neural net seems to do about as good as anything. Widening neural networks also helps to avoid getting caught in a local maxima as the network has a way of avoiding them. To ask if neural networks can really see, it labels incomprehensible static and patterns as objects and animals. They have a pattern for labeling and lots of flaws can appear based on that. Minimized versions generated by the neural nets can be identified by computers but not people. Sometimes they’re based on an outline and identifiable. Deep neural nets are an engineering marvel capable of great things, but they don’t see like we do.
Lecture 13 – Learning: Genetic Algorithms
If it’s been hard to develop an understanding of intelligence and some solutions attempt to solve problems by mimicking evolution to get around this. A cell with a pair of chromosomes duplicates chromosomes with 1 pair ending up in each of two child cells in mitosis. Reproduction is more complicated as chromosomes get mixed up and recombine. Four chromosomes are split into a pair each with two and those split again, giving four cells each with a chromosome produced by twisting of the rope and recombination. One of these along with another cell with a chromosome combine to create a new person. A mother and father’s chromosomes aren’t recombined, the grandparents are. During this process, there is a lot of randomness to determine which cells got to fuse to create an individual. For genetic algorithms and nature, there are lots of choices and places to intervene.
To mimic this, chromosomes could be simulated and some could be mutated, such as 0’s becoming 1’s or vice versa, with chromosomes represented as bytestrings. How many mutations are allowed is a choice you could make. After that, the crossover occurs with two crossing to produce a new chromosome by combining the the front half of one with the back half of another. How many crossovers are allowed is another choice you could make. Now, with a population of modified chromosomes via mutation and crossover, there’s a genotype to phenotype transition where the chromosome determines the individual and the code is interpreted to be something. The code is the genotype and the determination is the phenotype. Because of the varying composition, different individuals have different fitnesses (potentially scored). How the genotype determines a phenotype and the individual’s fitness is scored are additional choices. The fitness could also determine the probability of survival in an environment to the next generation. With this probability, a selection algorithm can take place and phenotypes create genotypes, a new set of chromosomes to begin the loop again with.
The probability of drawing an individual is proportional to the fitness of the individual and all of the probabilities add up to 1. So, Pi = fi/sum over i of fi. There’s a lot of choices for how to calculate the fitness. A genetic algorithm looks for an optimal value in a space. In an example, the fitness f(x,y) = (sin omega x)2 (sin omega y)2e+(x+y)/sigma. The goal is to get the optimal value. A chromosome in this example will consist of an x and y value. The mutation takes one and changes it a bit and the crossover function exchanges the x and y values of pairs. The algorithm for optimizing is a hill climbing algorithm. Initial examples of mutation and crossover in an animation seem ineffective as they climb up local hills. The next attempt is to try and use fitness to get a better probability of survival. This formula is strange since, for instance, if temperature is a measure, the ratio of whether you’ll survive vs the person next to you will depend on if you measure in fahrenheit or celsius. Based on this, instead of caring about the exact fitness, we use a “rank space” method to look at rankings instead. Here, the probability of the highest ranking individual, Pi = Pc, some constant based on choice. If that guy doesn’t get selected, P2 (second highest’s probability) = (1-Pc)Pc and P3=(1-Pc)2Pc. PN-1=(1-Pc)N-2Pc. If you get to the last guy, PN=(1-Pc)n-1. With a higher step size and certain constant, Pc, it’s able to get to the optimal value and not get stuck on local maximum. Increasing the step size at first to cover the space and then reducing it to crawl up to the maximum in the right area is called simulated annealing.
With a large moat area (lots of space without values between smaller values and the largest), it still gets stuck in the current example. To solve this, look at both fitness and diversity, taking fit individuals and ones that are different from the values selected, with fitness and diversity as x and y axis, and trying to optimize for each. With fitness and diversity ranks, it’s able to cross the big moat. With crossover it crosses the best of the x’s and y’s. Removing diversity once near the optimum, the whole space of individuals clusters at the highest point. This is still naive evolution and we don’t fully understand how it works. The naive idea used has lots of space for intervention to try and optimize. Mutation is basically hill climbing, crossover combines strong features of multiple individuals into one. Genotype to phenotype is how to translate zero’s and one’s to an if/then rule up to the developers.
This process has found practical application. What kind of processes involve taking a good front half and back half to make a good whole. Planning is one example where good start steps and end steps can combine. Another example was a student who had a rule-based expert system for predicting the winners of horse races. Using rule’s to determine conclusions, he could use mutations and crossovers to draw conclusions from mutated values. An example is shown of different objects evolving to walk or swim. A comical example talked about creatures moving by hitting themselves. Ones trained to swim were tested on land locomotion. They could be trained to follow a dot and some learned to jump. A food competition shows some going for food and some just trying to keep the opponent from getting food. Does the credent lie in the ingenuity of the programmer or the genetic algorithm itself. In this case, it’s the former.
Lecture 14 – Learning: Sparse Spaces, Phonology
Basic methods such as identification trees and nearest neighbors have been around for a long time and are still useful when you don’t know what to try. Neural nets and genetic algorithms attempt to mimic biology but usually don’t get things done too well and require a lot of work to get them to appear to perform well. This lecture talks about research that was done to attempt an account on things humans do well. Cats and dogs are pluralized differently. “Dog” has more of a “z” sound. How can a machine learn phonological rules like that? Yip and Sussman went to solve problems like this by first looking at the science. Start with a person who wants to say something. An acoustic pressure wave comes out of their mouth. If two people say the same words, the wave looks very different, yet both can understand it. An ear processes it and an output is a sequence of distinctive feature vectors. There are around the order of14 distinctive features determining which phone you’re saying. There are 2^16 or about 16,000,000 possible combinations, yet no language has more than 100 phones (english has about 40). This sequence can be viewed as producing meaning after a set of operations. Many of the distinctive features are hallucinated and feedback into the sequence. The McGurk effect says if you take a video of someone saying a word a different word is said, you may think you hear a word that wasn’t in sound. It’s difficult to pronounce a word correctly if you don’t see the speaker, as well.
A distinctive feature sequence for “apples” could be separated into phones, “ae”, “p”, “l” and “z”. One feature of interest is called syllabic, asking if the sound can form the core of a syllable. “Ae” can, but the others can’t. The voiced feature asks if the phone is voiced and all but “P” are. The continuent feature asks if the vocal apparatus is open. Again, all but “P” are open. The strident feature asks if you use your tongue to form a jet of air, which is only done in “z”. This is a matrix of distinctive features for representing the word apples. Going across the word is done in time. Yip and Sussman started by designing a machine to interpret words and sounds to produce the sounds of the language. The imagined machine had a mystery apparatus for looking at the world and determining what it sees (ex, two apples). A set of registers holds values for concepts such as “noun,” “verb,” and “plural.” Then a set of words is stored with “apple” being one of them. The words know how the concept is rendered as a sequence of phones/distinct features. Most importantly, there’s a set of constraints such as a “plural” constraint which connects to other parts of the machine. Lastly, there’s a buffer of phones to be uttered and translated into an acoustic wave form. The words are connected to the buffer. The plural register is connected to the seen objects/apparatus which is also connected to the words. The plural constraint is connected to the plural port in the register and will activate itself when something is plural. A “z” sound port connects from the plural constraint to the buffer and a “+voiced” value is connected to another part of the buffer. Information can flow in any direction across ports.
In an example, the vision apparatus produces the notion of two apples in stage 1. In stage two, the word lexicon selects apples and plural is selected in the register. In activity number three, the word lexicon (connected to the registers) tells the register apple is a noun and not verb. At the same time, A, P and L are written into the buffer in different elements. The constraint sees stuff in the buffer and see’s that “P” isn’t voiced (in the second to least element where the port to voiced is) and L isn’t a “z” sound (as L is the last element connected to the z check). As time passes, elements flow to the left of the buffer and shift over to the speaker in phase four. In phase 5, L is in the penultimate position and voiced. With voiced and plural, the machine adds “Z” to the end of the buffer. Phonological rules are expressed in constraints and since data can flow in any direction, constraints are called propagators (which are useful in complex systems).
To learn rules, you need positive and negative examples. What about cats vs dogs? Syllabic, voiced, continuent and strident are four of the 14 features of words. In phones KATS, A and T are syllabic, A is voiced, A and A are continuent, and S is strident. For DOGZ (only looking at the last two columns), G is syllabic, G and Z are voiced and Z is continuous and strident. Why does only “dogs” get the z sound? Yip and Sussman started by collecting positive and negative examples and picking a positive example or “seed” to start from (“KATS” is our seed). After this, the next step is to generalize picking spaces in the matrix not to care about, setting them to an ignored value. The seed becomes a pattern and to pluralize a word in the same way, the matrix has to be matched. If you generalize too far, a negative example will be matched and in the algorithm you generalize that far before quitting. Until a negative example matches, keep generalizing. A beam search is done through the space in an example which finds that if the last sound is a “t,” the word uses a z. If strident, a sound ends with “s.” One reason the element works is because it’s a sparse space. Sounds will be about equal distances from each other, so it’s easy enough for the program to tell examples apart. Constants are easy to separate though vowels are much harder. Maher’s catechism says when dealing with an AI problem, first specify the problem, then come up with a representation and then determine an approach or method and pick a mechanism or devise an algorithm and experiment. Usually these steps are looped through to improve steps. In AI, people have tended to fall in love with mechanisms and try to apply them to every problem, such as trying to solve every problem with neural nets. Sussman and Yip is an example of AI congruent with Maher’s catechism. The right representation makes things explicit and exposes constraints. An example where the right answers can be determined in a smaller number of areas is better than one which is more spread out. Avoid mechanism envy.
Lecture 15 – Learning: Near Misses, Felicity Conditions
How can you learn something from every example? Writing a program to figure out what an arch is from pictures of blocks rearranged, the goal would be to represent blocks as symbols. An initial model shows them arranged as an arch. A similar example that differs by a little would be shown such as two standing blocks with the one going on top on the ground instead. The difference would be the relation between the same blocks and this would be a near miss. This near mess makes it clear that support relations are necessary. Another near miss is found when the two support blocks are touching. Exactly one difference can show the touch relations interfere with the arch status. In two steps, two important facts are learned about the nature of arches. Another example could classify an identical image with a red arch top showing that an arch can be red or white (if those are the two colors of arch tops). If we’re working in a world of red, white and blue and confirm that the top can be blue, then we know the color relation doesn’t matter. It’s important at least to know that anything goes for color. Lastly, what if the top is triangular instead of flat block? If it’s still an arch, then there’d be a relation showing that the top is a wedge and the normal top is a block. If it can be both, and the examples are toys, you can choose how far to generalize, with a conservative estimate just saying it’s a brick or wedge compared to another saying it could be any physical object.
What about the problem of distinguishing two sets of trains with one having a short car and closed top. A computer can figure this out in a similar way to arches. In the first case, the computer is shown one at a time learning from each. In the second, it gets all values at once and tries to figure out how to distinguish them. How to solve this, start with one set to be seed or positive examples. Take the same heuristics and search for one that loosens the description to cover more positives. If you include a negative example (an example from the other set), you can make a new description to cover the data better. A large number of heuristics is needed and a beam search is used. This exact program was used to build descriptions of soybean diseases. We now have two ways of deploying the same heuristics. These heuristics are named. For realizing something is essential, that’s the “require link” heuristic. If a link can’t exist, that’s the “forbid link” heuristic. Extending a set with a new possible value is “extend set” and changing a feature to “anything goes” is the “drop link” heuristic, even though it isn’t removed and just pointed to an anything marker. Generalizing from bricks and wedges to blocks is the “climb tree heuristic.” Extending sets, dropping links and climbing trees are generalizations. Near misses lead to specialization. Specialization steps draw in the number of matchable items and generalizations make it broader. Depending on the use case, one of these examples may be more appropriate than another.
The goal of an interaction between a teacher and student is to transform an initial state of knowledge to a new one so the student is smarter and can do things they couldn’t before. The student has a learner and user program to use what it learned. The teacher has a style, so if learning is to take place, one side has to know something about the other. It’s helpful if the teacher knows something about the student. What you know forms a network. Initially you don’t know anything and as you learn, you develop quanta of knowledge which are linked together with prerequisite relationships saying how to get from one point to another with links. What you know at a particular time is a wave front in that space and if the teacher knows what that is, they can better correct what you forgot or didn’t know. An earlier link is likely a mistake and one past the way front can be covered later. Errors at where the wavefront is are an opportunity for learning. Sometimes, you can be taught stuff you can’t use. The student uses the teacher’s style to determine how to learn or whether to come to class.
To learn anything, a machine has to build descriptions to tell something to itself. If you’re like a machine, you can’t learn without building descriptions and talking to yourself. A study showed that the worst students on a test only said 10 things to themselves and smart ones said 34 things to themselves during the test. If you encourage people to talk to themselves when they wouldn’t have normally, would they have scored better on the test. Talking to yourself seems like a useful habit in general (if not in public). Packaging ideas better is a useful way to improve as well and you’ll be able to sell people on your ideas as well. For examples to be memorable, an idea should have a visual symbol (arch), slogan (near miss) and a surprise (machine learning something definite from each example (one shot learning)). It should have a salient idea which sticks out (one shot learning can be obtained via near misses). Lastly, incorporate a story, since humans like stories (he described the example as a martian learning about arches). With this “star” structure, it’s a lot easier to convey and sell ideas.
Lecture 16 – Learning: Support Vector Machines
This lecture comes back to dividing up a space with decision boundaries. Support vector machines or SVMs are a more complex idea than seen before, but have broad applications. Nearest neighbors, ID trees and neural nets can add decision boundaries in different ways. You’d think there wouldn’t be any more breakthroughs until Vladamir Vapnik proposed to divide positive and negative values with an ideal straight line. The straight line is drawn with the goal of adding the widest street from separating the examples. The professor calls this the widest street approach. How do you make a rule using this boundary. Imagine a vector W of any length, perpendicular to the median line of the street. If a vector U also points to an unknown point in the street, determine if it’s on the left or right side of the street. The the dot product W*U and check if its at least some constant C. The dot product takes the projection onto W and the bigger it is, the further out on the line the projection lines, eventually big enough to cross the median line of the street. Without loss of generality, you could also say, W*U+b is at least zero to give a positive value (where b is a constant and c=-b). This is the decision rule, but there isn’t enough constraint to fix a particular b or W. With enough constraints, b and W can be calculated. Taking the dot product of W*X+, adding b should lead to a value of at least value. W*X–+b should be at most -1. A variable yi is introduced where yi is 1 for positive samples and -1 for negative samples (yi will be negative for negative values still). yi(Xi*W+b) – 1 is at least zero and with this mathematical convenience the equations will work for greater than or less than zero. For samples that end up in the gutter, the value will be exactly zero.
The goal is to get the largest line between the two gutters. Drawing the road and putting a vector to the lower edge X– and a vector to the top X+, you could take the difference between the vectors, X+-X–. The width of the street is equal to X+-X– multiplied by a unit vector. If W is normal, then you can take the dot product of X+-X– * W/|W|. Using the constraint for Yi, the result is 2/|W| and is the width of the street. To maximize that width, minimize the magnitude of W (or min ½||W||2 for convenience). LaGrange multipliers will give a new expression which we can maximize or minimize without worrying about constraints. L=½||W||2-sum over all constraints of alphai [yi(W*Xi+b)-1]. Values that aren’t zero will lie in the gutter. To get the maximum/minimum, find derivatives and set them to zero. How do you differentiate a vector? It has a form that looks the same as differentiating with a scalar. ∂L/&partW=W-sum of aiyixi=0, implying W=sum over i of aiyixi and W is a linear sum of the samples. L also needs to be differentiated with respect to anything else that could vary, such as B. ∂L/∂b=-sum of aiyi = 0, implying the sum of aiyi=0.
L=½(sum of aiy*Xi)(sum of aiy*Xi)-sum of aiy*xi*(sum of ajyjxj)-b sum of aiyi+sum of ai, which equals sum of ai – ½ double sum over i and j of aiajyiyjxi*xj. This will be handed off to a numerical analysis. Still it was good to figure out the dependence of the function on the vectors. The optimization only depends on the dot product of pairs of samples. Compared to neural nets, this uses a convex space and can never get stuck in a local maximum. When stuck, due to samples that aren’t linearly separable, you can try and take the points and put them into a different space where they’re easy to separate. A transformation is needed between the current space and a more convenient space (phi(X)). Dot product of phi(xi)*phi(Xj) is needed to maximize and phi(X<i>)*phi(u) is needed to recognize. This means if you have K(xi,xj=phi(Xi)*phi(Xj). That kernel function provides the dot product of the two vectors in another space without needing to know the transformation. Popular kernels are the linear kernel, (U*V+1)n, and a kernel e-||xi-xj||/sigma. This process works much better for solving problems by transforming across spaces easily. Overfitting can still occur, but this is useful for local maxima problems. This originated in a PHD thesis in the 60’s though it made it to the US in the 90’s. Vapnik didn’t like neural nets and thought SVMs would work better. When kernel’s were shown to work on a handwriting recognition problem, Vapnik continued work on it. It was 30 years from the concept to US attention and another 30 before the appreciation of their importance.
Lecture 17 – Learning: Boosting
Boosting is an essential item to have access to for learning. This lecture only looks at binary classification problems. Assuming there’s a set of classifiers to draw on, h produces either a -1 or a +1. Coffee is +1 and tea is -1 for instance. Normally, the world doesn’t give us good classifiers and there will be an error range from 0 to 1. Ideally, the rate would be zero and it’s terrible if it’s 1. If just a little better than a coin toss, the classifier is weak. Can you make a classifier a strong classifier (close to zero) by combining weak classifiers and letting them vote. For instance, H(x) = sign(h1(x)+h2(x)+h3(x)). If two out of 3 or all 3 agree, we’ll get a plus or minus 1. One guy can be wrong as long as the other two are right. For now it’s trained on a sample set, though later it can be tested on new values. With a lot of classifiers to choose from, the undisturbed data is used to produce h1. The data with an exaggeration of h1 errors is then used to produce h2, looking at a distorted set of values where the poor performing ones are exaggerated. The samples where h1 and h2 give different answers are used to get h3. This is part one of a multipart idea.
If this is a good idea, then you effectively have a tree with H(x) as the root, depending on children h1, h2 and h3. These can be votes of other tests as well, expanding the tree. What kind of classifiers are we talking about? Each test is a classifier and decision tree stump. Boosting doesn’t depend on decision tree stumps and can be used with any classifier. Decision tree stumps are easy to visualize, however. The error rate is sum over wrong cases of 1/N (the number of samples). Each plotted sample can have a weight associated with it. At the beginning, they may all be equal. After the first test gets some wrong, the error rate is the sum ovr the wrong choices in the current step times Wi 1/N. The sum of the weights in the whole space needs to be 1. Weights are adjusted to prioritize values as appropriate. To build up the classifier in multiple steps, H(x)=sign(h1(x)+h2(x)+h3(x)+…). Just adding things up wouldn’t be enough, and a weight alpha can be added to each parameter. All will make errors in different parts of the space, so we’re trying to use the wisdom of a weighted crowd of experts.
Start by letting all weights at time 1, W1i=1/N and then pick a classifier which minimizes the error rate. The error rate is used to determine the weight alpha. Then, use the classifier to revise weights, wt+1 and use that to pick a new classifier, running the loop until the classifier produces a perfect set of conclusions on all of the sample data. Suppose Wt+1i=Wti/Z * e-alphat*ht(x)y(x), where Z is a normalizing factor. Y is plus 1 or minus 1 depending on whether the output should be plus or minus 1. This is what was seen in the last lecture. The y just flips the sign if you have the wrong answer. The alphas are the same as in the H(x) formula as weights. The normalizing factor is used to make all the weights sum up to 1. The goal is to minimize the error in the H(x) function. The minimum error bound for the whole thing is achieved if alphat equals ½ ln(1-Et)/Et. The error rate can go up as terms are added, but it will eventually converge on zero. With the formula for alpha, that can be plugged back in to the weight formula to get wt+1i=wti/Z * {sqrt(Et/(1-Et) if correct and sqrt((1-Et)/Et) if wrong. Z is equal to 2*sqrt(Et(1-E)). If correct, Wt+1i=wt/2 * 1/(1-E). Otherwise, it’s equal to wt/2 * 1/E.
The “thank god hole” for dealing with boosting problems is if you sum the new W’ over the correct answers, you’ll get 1 half and the same for the ones you got wrong. Take the weights you got the right answer for with the previous test and sum them. To get the weights for the next generation, scale them so they equal ½. The sum of those weights is just a scaled version of what they were before. You don’t have to compute Z, logarithms, alpha, etc. You just need to old weights. “Thank god hole number two” looks at decision tree stumps. Normally, you’d assume you need tests equal to 1 plus the number of samples. In reality, no test lying between two identically classified samples will do any good and can be removed. While not totally understood, boosting doesn’t seem to overfit.
Lecture 18 – Representations: Classes, Trajectories, Transitions
The problem with systems that use SVMs, boosting, and neural nets seen so far don’t understand what they’re doing. Humans have only been around maybe 200,000 years when dinosaurs were 60 billion years ago. Neanderthals may have had larger brains, so size can’t be everything. They’re stone tools didn’t change much over 10’s of thousands of years. It seems that a group of humans in Africa gained something and took over the world. Symbol carvings and jewelry came about thousands of years ago. According to Noam Chompsky, the ability to take two concepts and put them together to form a third without limit and without disturbing the originals is what lets humans get ahead. Whatever happened seems to be all of a sudden, perhaps with a brain forming that could finally pass a threshold. Humans being able to tell and understand stories is what the professor believes separates them from other primates. The language we use to think isn’t necessarily the same as the one we use to communicate, though they’re related.
A semantic net is a network with nodes and links and has meaning. In a story, nodes could be characters and edges could be actions taken on another, with property edges pointing to object properties. Combinators are a fancy name for the links connecting nodes. Links/edges can also be connected. A murder by one Shakespeare character on another leads to one having a dead propery. Minsky suggested the need for a localization concept with frames which have agents and victims. The agent would be MacBeth and the victim would be Duncan. Once you have combinators and reification, you have something that’s universal, though without enough organization to go to the next level of achievement. Another problem over the graph is the idea of parasitic semantics, where there’s a tendency to project our own understanding to the computer. Putting the diagram into machine form doesn’t mean the machine knows anything.
Classification is a useful concept. We know about various different names on different levels. You can’t easily picture a “tool” or two people could be picturing different things. A piano is much easier for multiple people to visualize and more specific brands and details can be mentioned. They can also tie knowledge to words. A graph comparing general, basic and specific names starts small at general (instrument), has a large jump at basic (piano) and a small jump at specific (Yamaha). Most of the details come with the basic understanding. Another situation shows a car crash in different states. How do you talk about the speed and condition of the car and its distance to the wall before, during and after the crash? What’s changing? Before the crash, the distance changes. During the crash, all three change and afterwards, nothing changes. This is the idea of transition and any system involving stories will involve transitions. The same actions can take place with photons colliding with receptors, though with other differences. The third idea is trajectory. A lot of what we say is about objects moving along trajectories. A trajectory frame has an object moving along a trajectory from source to destination, probably arranged by an agent which may assist the motion. Objects and characters can have different roles in the frame and prepositions such as with, by, to, from, and for can help explain.
In 100 sentences, you’ll find about 25 transitions or trajectories. To put these concepts together, you can use a story sequence which may consist of multiple sentences. For instance, “Pat comforted Chris” lets you construct a role frame with an agent Pat, an unclear comforting action, an object of Chris and a result of a transition frame. The transition frame involves an object, Chris, with a mood that’s presumably improving. These representations allow for some questions to be answered. A more specific action such as “kissed” let’s people form a mental image which will be a trajectory frame with an object of Pat’s lips and destination of Chris’s lips, or other scenario based on relationship. For the language, a story sequence is important for being able to put events and actions in an order. Typically, you’ll have to go back to some start of a sequence for a scenario to make sense. Story libraries can divide up stories into categories. For instance, disasters and parties are both events with various subclasses. Each of these subclasses can provide frames for filling out information in various events. A wedding says something about a marriage and its characters, an event has a time and place, an earthquake frame has a fault and magnitude.
Lecture 19 – Architectures: GPS, SOAR, Subsumption, Society of Mind
Newell and Simon at Carnegie Mellon came up with the idea of the general problem solver. The idea was if you start in a current state and want to get to a goal state, observe the symbolic distance from where you are to where you want to be. Select an operation determined by the distance to get to an intermediate state. After that, the new distance will determine the new operation to get to the next intermediate state. This is abstract but can be used to solve a problem. For instance, if you’re at MIT and want to go home, you might have to take an airplane, as determined by the large distance. Another problem is to get from MIT to the airport. This is recursive as to take the T, you’ll have to walk to a station. In the example, the larger distances are calculated first and the smaller distances to get there are added. There was an imagined chain reaction where this architecture would improve itself and become smarter than humans. The problem of collecting differences and finding operators needs to be solved by humans before it can be used. Building the table relating the two was a hard job.
Eventually Newell and his students developed a more elaborate architecture called SOAR, which stood for state operator and result. SOAR had a long term memory, a short term memory and connections to the outside world such as a vision and action system. Most of the activity takes place in short term memory with long term memory’s contents funneling in and out. The whole program is a rule based system with assertions and rules. They also had a preference system which handled ties where multiple rules could work. Problem spaces were the idea that if you were going to solve a problem, you should develop a space and search through it. Lastly, SOAR used the idea of universal subgoaling. Whenever you can’t figure out what to do next, that becomes your next problem. The general problem solver believed everything was based on the “problem solving hypothesis.” SOAR was based on the symbol system hypothesis, representing everything as symbols. Both architectures believe problem solving is the most important thing we do.
Marvin Minsky was concerned with problem solving and how it comes in layers. In his layered model of an emotion machine, a person has reflective thinking which can be self reflective or self conscious, deliberative thinking, learned reactions, instinctive reactions. It’s important to think about multiple layers, but the blocker for computers is based on the common sense hypothesis. In order to do all of that, you have to have common sense like people. What is that and how can computers gain it? Lots of people have looked into that.
The most prominent alternative to the problem solving approach is Brooks’ subsumption architecture. Brooks believed people were thinking in the wrong way, building vision, reason and action systems which could do something but basically nothing. When vision and reasoning are improved, it breaks and can’t do its old tasks. Brooks turns the idea on its side with layers for different tasks of working with the world. One layer could avoid objects, another could wander, a third could explore and the highest could seek. This is the idea of layers of abstraction and abstraction variables. Each layer could have its own vision, action and reasoning system. Once one layer works, you don’t have to change it. There’s no representation, but the hypothesis is the creature hypothesis. Once you can get a machine to act as smart as an insect, the rest is believed to be easy. Representations make models possible, so what do you do? Use the world instead of the model. Everything you do is reactive in Brooks’ plan. You don’t need a memory of a table to walk around it, just react to it. In their purest form, these were finite state machines. Brooks’ big hit was the Roomba, based on his subsumption architecture. An example of an MIT robot is seen navigating this way as well.
Minsky’s deliberative thinking layer is what SOAR is about. Subsumption is about instinctive and learned reactions. What about his other layers? A genesis system is centered on language which enables the description of events and perception of the system to describe things. Perception is important for imagining what might happen in different events without having to store them in a database. If you can describe events, you can tell and understand stories and gain an understanding of culture. This is either important or could be everything. A study showed putting a piece of food in a corner of the room and spinning a rat, child and person who would go in the same angle afterwards. When a wall was painted blue, the adult human was able to go in the right direction. People become adults when they can describe left and right. If humans have to do a different language process as well, they’ll fail the rat test, suggesting it inhibits a language process. To become smarter, look, listen, draw and talk. Beware of fast talkers because they jam your language processor so you can’t think.
Lecture 21 – Probabilistic Inference I
Lecture 20 is about the business of AI and isn’t available on Open CourseWare for some reason. These next two lectures talk about the use of probability in AI. A table is shown with columns for a statue appearing, a hack having occurred and an art show. All possible combinations are rows in the table and for binary values, there are 2-to-the-number-of-variables values in the table. If a tally is kept for each, the probability that one occurs is the number of tallies for that event divided by all tallys. With a joint probability table, miracles can be performed using it. If you know there’s an art show, you can sum the tallies from scenarios with art shows and use it as the denominator instead. You can also check the probability of a hack or art show, the probability of a hack given a statue and hack given an art show. Another table shows the probability of a burglar given a barking dog.
Unfortunately, the problem with these tables is there are a lot of rows. Adding another binary variable would double the number of rows and AI can involve lots of variables. Probabilistic inference has a value but isn’t’ the only factor in AI. There are lots of scenarios where you don’t have enough information to get some information. How do you deal with probabilities without working out the full table. All probability flows from a small number of axioms. First, a probability is between zero and one. The probability of true is 1 and false is 0. Lastly, P(a)+P(b)-P(a and b)=P(a or b). If a is a sample space represented as a circle in a rectangle, the probability of A is the area of A divided by the area of the rectangle. The probability of A given B is the P(a,b)/P(b). The probability, P(x1,…xn) = product from i=n to 1 of P(i|xi-1,…x1). Independence says P(a|b)=P(a) if a is independent of b. Conditional independence says that P(a|bz)=P(a|z).
A dog barking doesn’t attract a burglar or a racoon to show up. The burglar and racoon showing up are causal relationships for the dog barking. The racoon could also lead to the trash can getting knocked over and the dog could lead to calling the police. This is drawn as a diagram where given its parents, every node is independent of other non-descendents. The burglar and racoon have edges going from them to the dog in the dog’s direction. The dog has an edge aiming toward calling the police. The police call only depends on the dog and trash only on the raccoon. The burglar may have a probability of .1 and racoon of 0.5. The probability the dog barks is calculated for each combination. The ideal dog always barks for a burglar, though a dog could bark even with none showing up. Tables are also drawn for the dog causing the police to be called and racoon knocking over the trash can. Ten numbers were specified. If you tried to build the joint probability table straight away, you would have needed 32 numbers. Using the belief number to get a smaller joint table leads to a lot of saved space.
Lecture 22 – Probabilistic Inference II
The joint probability table can tell us all we want to know and an inference network, sometimes called a Bayes network, was able to simplify the table. The chain rule is used to show the smaller table works from the larger. There are no loops in these networks and there will always be a bottom. Anything that can be done with the table can be done with the network. In a similar matter, the tables can be extended to keep track of tallies to calculate probabilities. With probabilities we can start to simulate what the model will do. The model with probabilities can be used to create samples, though this depends on everything being right.
The next task looks at naive bayesian inference. Since P(a|b)=P(a,b)/P(b), P(a|b)P(ab)=P(a,b)=P(b|a)P(a). This means, P(a|b)=P(b|a)P(a)/P(b). An example classifaciton problem takes symptoms and tells you what disease you have. If a is the class and b is the evidence, finding the probability of the class given the evidence could be difficult while getting the probability of the evidence given the class may be easier. P(c|e)=P(e|c)P(c)/P(e). Selecting between multiple classes, if you know the initial probability of the class and probability of the evidence given each class, you have the two elements in the numerator and are done. The denominator is the same for all classes and is just the probability of the evidence. Summing everything up will add to 1 anyways. What if evidence is actually pieces of evidence, P(ei,…en|ci)P(ci)/d. If all pieces are independent, the joint probability is equal to the product of individual probabilities. P(e1|ci)…P(en |ci)P(ci)/d.
The coin flip example seen in another course is shown. Pick a coin, 1 is rigged with a probability of heads of 0.8 and the other is fair with probability 0.5. A coin is picked randomly with probability 0.5 and flipped twice, once with each result. For the odds the rigged coin is flipped, .5*.8*.2 leads to .16 and for the fair coin, probability is .25. With an additional flip that’s heads, rigged is 0.128 and fair is 0.125. If only flipped once, then it’s just .8 vs .5. A similar example looks at a parent’s political party given their children’s. We can use the Bayesian hack to decide between two different models in a similar way. Bayesian classification can be used for “structure discovery,” but only when looking at a small number of models. If looking at many models, then you can search through the model space. With random numbers, you can keep track of the best examples so far, while doing a random restart when it gets stuck. Bayesian classification is used for problems such as medical diagnosis or school quizzes. Bayesian and probabilistic calculations are the right way to work when you don’t know everything. In diagnosis, all you have are symptoms and need to use them to determine the cause. The structure discovery talked about is the last topic of the course.
Lecture 23 – Model Merging, Cross-Modal Coupling, Course Summary
Bayesian methods have the potential utility of finding structure where you may not find it. For instance, a story graph with nodes representing events. What you want to get is a finite state graph that gives the collection of stories. You might find different events are similar and it’s possible that branches can go off the main graph and merge back later based on conditional events in a story. You can merge actions such as stalk and chase and run away and flee. When you do know stuff, you can make more efficient processes. Humans don’t use clouds of computers and learn to associate mouth form with sounds. How can you use multiple modalities and correspondences between them to sort out the modalities. When we talk, we produce a Fourier transform that moves with our speech. If we say a vowel, it produces a fairly constant Fourier spectrum. Where are the peaks and how do they correspond to the form of the mouth when saying a vowel. When you smooth the graph lines, peaks are called formants. The two modalities here are graphs of ellipses in the mouth and the formants. We do use some sort of cross-modal coupling as humans, but don’t study these graphs. If points are close together on one side, they can be clustered on the other side. Suppose we look at how the examples on one graph project on the other. Taking the projections as components of a vector, two vectors will be in the same direction with cosine zero and can be combined. Both sides can help determine how a system is organized with one giving insights the other can’t align. Regularity is important even without being concerned with Bayesian probability. There’s likely a lot of this going on in human intelligence as we have to make sense of lots of unlabelled data. Something similar to cross-modal coupling likely informs a lot of our knowledge.
With computer science having languages for procedures and the ability to enforce bounds and detail, AI offers new ways to make models and opportunities to experiment. Two types of approaches are how to use all the data on the internet vs how to make conclusions from the least amount of data. Mechanism envy is trying to look for things to do with a particular algorithm. A better approach is to start with a problem and choose the right representation and mechanism.
What classes should we take next? Minsky’s Society of Mind (6.868) has no prepared lectures and is just a conversation with Marvin. It’s an opportunity to see one of MIT’s geniuses think out loud. Bob Berwick teaches Language (6.863) and Evolution (6.048). Sussman is teaching a Large Scale Symbolic Systems course (6.945). The professor also recommends his course, Human Intelligence Enterprise, which is about ideas and is like a humanities course.
Conclusion
Overall, this course was a good introduction to machine learning and explained how concepts I’d tried with more limited understanding work from a more mathematical approach. Overall, this seems like a good conclusion to the process, approaching something I’ve learned about and blogged about in a different way. Before focusing more heavily on AWS and this project, one field I’d wanted to learn more about was natural language processing and using AI for translation. Going through this definitely makes me want to do that more and it’s definitely a candidate for my next project on this blog. While a bit burnt out on these courses for now, I’m also curious how MIT’s Introduction to Machine Learning course compares to this one. It lists Calculus 2 as a prerequisite and is something I’d like to come back to later.





Leave a Reply