MIT Courses #10 – Computation Structures

Throughout my time coding, it often occurred to me that I had little knowledge about how computers work. I had a base level knowledge of bit calculations and knew compilers and interpreters were used to translate the high level Python code to instructions the computer could understand. This course goes into a lot more details on how computers are built on multiple levels. First, it starts off by using concepts from Physics 2 to store information, represent bits and transmit information as voltage differences. The first third of the course focuses on using physical components such as MOSFETs to create logic gates and run functions in circuits. From there, it goes into the base level components of a computer showing internal implementations for CPUs and memory components and tradeoffs between different options. The course shows how to compile a higher level language such as C to an assembly language and how instructions are run on the machine. It explains how virtual machines can work by separating context from each other, how to use parallelization and multiple cores to increase performance. Overall, this course is a high level overview of how computers work both on a physical and a software level.

MIT Courses #10: Computation Structures – 6.004 by Chris Terman

Section 1 – Basics of Information

Information is defined as data communicated or received that resolves uncertainty about a fact or circumstance. For instance, if you receive data about a randomly drawn card, it’s better to know the identity of a card instead of just the suit or number. Given a discrete random variable X with N random possible values, each with associated probabilities. The smaller the probability, the more uncertain it is that X takes on that value. The information received when learning a choice was x_i is I(x_i)=log₂(1/p_i). The uncertainty of a choice is inversely proportional to its probability. Log base 2 is used to measure the magnitude of uncertainty in bits, where bits are digits zero or one. Even if the data doesn’t resolve all uncertainty, such as just knowing the suit of a card leaving the missing value. The formula for information content still works, for instance, given the card is a heart, the information received is I(heart)=log₂(1/(13/52))=2 bits. A common situation is faced with N equally probable choices, you receive data narrowing it down to M choices. The probability that specific data would be sent is M*(1/N) and the received information is I(data)=log₂(1/(M*1/N))=log₂(N/M) bits. For instance, learning the result of a coin flip, we go from 2 choices to a single choice, going from N=2 to M=1 and information content is log₂(2/1)=1 bit. Learning a card is a heart (N=52, M=13) gives log₂(52/13)=2 bits. For the roll of two dice learning the outcome of the roll, N=36 and M=1 would give info content of log₂(36/1) = 5.17, but you can’t have .17 bits, and this problem will require a different encoding. Overall, more bits of information are gained when more uncertainty is eliminated. Learning the card is not the ace of spades gives low information, while knowing it’s the king of hearts gives the most information about a card.

The entropy, H(X) is the average amount of information contained in each piece of data received about the value of X. H(X)=E(I(X))=Sum i=1 to N * p_i*log₂(1/p_i). Given a sequence of data describing the values of random variable X, the entropy is the lower bound on the number of bits to transmit. Ideally, you’d send exactly H(X) bits, but often have to settle for getting close. Sending too few isn’t enough to properly transmit information and too many wastes resources. An is an unambiguous mapping between bit strings and the set of possible data. For example, if A is 00, B is 01, C is 10, and D is 11, an encoding for ABBA would be 00 01 01 00, which would be a fixed length encoding since the bit strings used to represent the symbols all have the same length. Given a bitstring and encoding key, the next bits in the bitstring can be looked up to see the value they represent, so the process can be done backward. If B is represented as 1 instead, then a variable length encoding occurs, and ABBA would be 011101 which is shorter and more efficient. A problem could occur if coding is ambiguous. For instance, if A is 0, B is 1, C is 10, and D is 11, then ABBA is 0110, but so are ABC and ADA. Encodings can be represented as binary trees, with two branches from each node for the next zero or 1, with each leaf containing the encoded value. Read the bits in order until a leaf is reached, then write down that value and go back to the root and start again with the remaining bits.

If symbols we try to encode are equally likely or there’s no reason to be otherwise, then a fixed length code is used and each leaf is an equal number of nodes from the root. For instance, for the entropy of random variables with N equally probable symbols is sum of i=1 to N of (1/N)log₂(1/(1/N)) which is log₂(N). For example, to encode decimal digits, a four bit code binary code is required. Binary-coded decimals are BCDs and the entropy is log₂(10)=3.322. ASCII for printing characters requires 7-bits as entropy is log₂(94)=6.555. Positive integers are a common use of encoding. The formula for converting an N bit binary representation to the corresponding integer is done by multiplying each binary digit by its corresponding weight in the base 2 representation. Value = sum i=0 to N-1 of 2ⁱb_i. For instance, given a string 011111010000, V=0*2¹¹+1*2¹⁰+1*2⁹+… = 1024 + 512 + 256 … = 2000. This keeps only the non-zero terms, as 2 to the 10th power is 1024 and the slot for the 11th power is 0 and therefore empty. With this n-bit representation, the smallest number is zero and the largest representable is 2^N-1. Many digital systems support operations on binary encoding numbers of the same fixed size, such as 32 or 64 bits. Multiple operations would be required for numbers too big to be represented as a 32 or 64 bit string. Long strings of binary digits are tedious and error prone to transcribe, and a higher-radix notation is often used, choosing a radix where it’s easy to recover the original bits string. A good choice is a radix 16 representation called hexadecimal or hex where each group of 4 digits are represented by a hex digit. Since there are 16 possible representations of 4 binary digits, 0-9 and A-F are used as hex digits. For instance, 0000 in binary is 0 in hex and 1111 in binary is F in hex. To convert from binary to hex, group binary digits into sets of four, starting with least significant digits. For instance, 0b011111010000 is 0x7D0 in hex. To avoid confusing binary and hex, binary digits are prefixed with 0b and 0x for hex, as used in many programming languages.

How do you represent signed variables? The first character in a binary string could be zero for a positive number and 1 for a negative value, though there are a couple problems with this. There would be two representations for 0 and different circuitry would be needed for addition and subtraction. Most modern systems use the two’s complement encoding where the high-order bit of the N-bit two’s complement number has a negative weight. For instance, -2^N-1. Negative numbers have a 1 in the order bit. The most negative number is 10…0000, which is -2^N-1 and the most positive number is 01…1111 which is +2^N-1-1. If all bits are 1, then you get the sum of the most negative and most positive number, giving -1. If all bits are 0, then that’s the unique representation for zero. Adding negative 1 and positive 1 with 11…1111 + 00…0001 gives 0000000 which is zero, so two’s complement works for N-bit arithmetic. To compute B-A, you can use addition and compute B+(-A), but what’s the representation of -A. Since A+(-A) is 0, which is 1 + – 1, then -A=(-1 – A) + 1 which is ~A+1. If a bit A_i is zero, then 1 minus that gives 1 and if it’s zero, then 1 minus zero is 1. In each column, the result is the bitwise complement (flip all bits) of A, ~A. So -A is the bitwise complement of A plus 1.

If choices don’t have the same information content and equally likely chance of occurring, you can do better than fixed length. Consider the expected length of an encoding computed by considering each X_i to be encoded weighted by P_i, the probability of its occurrence. We can find encodings with shorter expected lengths which ideally matches the entropy H(x). Practically, if X_i has a higher probability, it should have a shorter encoding and and if it has a lower probability, it should have a longer encoding. Encodings where X_i can have different length codes are called variable length encodings. For example, if B has probability 1/2, then it could be encoded 0, while C with probability 1/12 could have an encoding of 100 while A with ⅓ probability is 11. With D of probability 1/12, the expected length of the encoding would be 1.667. Comparing the expected length for a 1000 symbols, fixed-length for 2 bits per symbol would require 2000, variable length would be 1667 and the lower bound or entropy would be 1.626. The entropy means another variable length encoding could do better.

Given a set of symbols and their probabilities, Huffman’s Algorithm shows how to construct an optimal variable-length encoding. This means, assuming we’re encoding each symbol one at a time, no other variable-length code could have a shorter expected length. Huffman’s Algorithm builds an optimal tree from the bottom up. It starts by building a subtree using 2 symbols with the lowest probability (meaning the highest information content and longest encoding). If two have the same probability, choose one arbitrarily. At each step it choses two symbols/subtrees with the lowest probability and combines them to form a new subtree. 0’s and 1’s can be switched, to create different encodings, but overall the lengths from the root will prove optimal. The expected length of an encoding can also be improved by looking at pairs of symbols instead of just individual ones. With Huffman’s algorithm on pairs, average bits per symbol could be 1.646, which is a slight improvement from the original variable length for the A through D example. An example usage of this is natural language processing where some words occur more frequently. LZW on wikipedia is an example of a data compression algorithm.

What if there’s an error and one or more bits get corrupted. For instance, a coin flip can have a result of heads or tails. Encoding heads simply as 0 and tails as 1 can lead to a single-bit error if the bit is somehow switched in transit. Hamming distance is the number of positions in which corresponding digits differ in two encodings of the same length. For instance, 0110010 and 0100110 differ by two. If the distance is zero, two encodings are identical. The hamming distance between a valid binary code word and the same word with a single bit error is one, since only a single bit is changed. The problem with the simple encoding is that a single-bit error can change a valid code word into another valid code word, and a corrupted tails can’t be distinguished from a real heads. Ideally, the minimum hamming distance is at least two, which can be achieved by adding a parity bit to the original code words. Using even parity, the additional bit is chosen so the number of 1 bits in the new code is even. This would change heads from 0 to 00 and tails from 1 to 11. A single bit error will then be detectable, since corrupted code words would have a parity error. If the number of 1 bits is odd, a single bit error occurred. The boolean function exclusive or can help perform parity checks. Parity won’t help if there’s an even number of bit errors and more sophisticated checks will be required. For detecting multi-bit errors, to detect E errors, a minimum Hamming distance of E+1 is required between code words. For the heads and tails scenario, by increasing the Hamming distance between valid code words to three, the set of words produced by single bit errors in each scenario don’t overlap. With at most a single error, error correction can be performed and the valid code can be recovered. To correct E errors, a minimum Hamming distance of 2E+1 is required between code words. Coding theory is a research area focused on these algorithms and there are entire courses on this topic.

Section 2 – The Digital Abstraction

This section focuses on finding physical representations of bits and finding devices for carrying information. Designing a machine to manipulate information, how should it be physically encoded? Bits should be small and inexpensive as we’ll want to be able to carry billions in music files and access trillions on the web. Bits should be stable and repeatable (once a zero, always a zero). It should be easy and quick to manipulate bits, accessing, transforming, combining, transmitting and storing them. Information can be represented using the phenomenon associated with charged particles. The presence of them creates a difference in potential energy measured as voltages. The flow of charged particles can be measured as currents. The phase and frequency of electromagnetic fields associated with charged particles, which form the basis of wireless communication. In this course, voltages are used to encode information. For instance, 0 volts could represent a zero bit and 1 volt could represent a 1 bit. For sequences of bits, multiple voltage measurements could be used either from multiple wires or a single one with volts spaced out over time. Voltages are easy to generate with outlets and batteries and easy to detect. Small circuits can be built to store, detect, and manipulate voltages and they can run on basically zero electrical power in a steady state. The downsides are that voltages are easily affected by changing EM fields in the surrounding environment, and it’s important to be connected by a wire to transmit information. Changing the voltage on a wire takes time based on resistance and capacitance of the wire. In modern times, these slowdowns are small, but sadly not zero.

Using voltage to represent information in a black and white image, each (x,y) point can use zero volts for black, 1 volt for white, and some percentage for the grey between them. If we can reliably distinguish voltages that differ by 1/2^N volts, we can represent N bits of information by voltages in the range 0V to 1V. In theory, N can be arbitrarily large, but realistically it will be constrained by the ability to distinguish voltages quickly. To represent the complete image, scan points in a raster order, left to right, top to bottom, generating a voltage waveform. This converts an image to a time-varying sequence of voltages. Original televisions worked this way, encoding a picture as a voltage waveform that varied between the representation for black and white. The range of voltages was expanded to allow the signal to specify the end of the horizontal scan and the end of the image which were sync signals. This was a continuous waveform, indicating it could have any value in a specified range at a point in time. Trying to build a system to process this signal, a system is shown with 2 processing blocks. A copy block takes an input voltage and reproduces it on its output. An inverting block converts input v to output 1-v, converting white to black and black to white. Prepackaged blocks help with modularity and abstraction. In theory, an input passed to an even number of copy and invert blocks would produce an exact copy of the original. In practice, the output is a fuzzy version of the original. In reality, the blocks won’t quite match the expected results adding some amount of error to what they produce. It’s still a processable signal, which is the corrupted version of one image and an accurate version of an image which already had the blur, so the system wouldn’t notice. Errors accumulate as the image passes through blocks, where the larger the system, the larger the error. Noise and inaccuracy are unfortunately inevitable and systems should tolerate some amount of error to be able to process information reliably.

Digital abstraction uses the continuous world of voltages to represent a small finite set of values (0 and 1 here). The world isn’t digital but real physical phenomena can help implement digital designs. To use voltages digitally, encode one bit of information at a time and use the same uniform representation for every component and wire in the digital system. A first attempt divides a range of voltages into two subranges with a threshold voltage. If the incoming voltage is less than the threshold, it’s treated as zero, otherwise it’s one. Determining the correct voltage gets more and more time consuming as it gets closer to the threshold. This attempt has an appealing mathematical simplicity but isn’t practical. The second attempt uses two threshold voltages L and H. Voltages less than or equal to L are 0 and greater than or equal to H are 1. Anything between L and H is forbidden to be asked for particular behavior and a system can interpret voltages in the range as either 0 or 1 or not interpret them at all. This allows for a quick and easy voltage to bit converter and only needs to guarantee the correct interpretation as stated. A device is a combinational device if it takes one or more digital inputs, has one or more digital outputs. The output of one device should be able to be connected to the input of another one. A combination device should have a functional specification telling us the value of each output for every valid combination of inputs. Lastly, a device should have a timing specification telling how long it takes for the output to reflect changes on the inputs. At a minimum, there should be an upper bound t_PD saying how long it takes for the output to have a correct value based on inputs. A set of interconnected elements is a combinational device if each circuit element is combinational, every input is connected to exactly one output or a constant voltage representing 0 or 1. The interconnected components can’t contain directed cycles. Systems built with these combination rules will be combinational devices, allowing for big systems to be built from smaller ones.

Looking at a system where a combinational device is trying to send a signal V_L-epsilon, which is a valid zero to a downstream device. Noise on the connected wire causes the downstream to receive V_L+epsilon which is in the forbidden zone. A small amount of noise causes unclear behavior. One fix is to ensure sure that valid outputs have tighter bounds on what’s valid than inputs. Noise can be caused by electrical effects such as IR drops in conductors by Ohm’s law, capacitive coupling between conductors, and L(dI/dt) effects caused by inductance in the components leads and changing currents. Voltage deviations can be caused from manufacturing variations and temperature variations or external EM fields. Noise is unavoidable, but it’s magnitude can be detected. The proposed fix is separate specifications for inputs and outputs. In this case, digital outputs are zero if less than or equal to V_OL and 1 if greater than or equal to V_OH and digital inputs are zero if less than or equal to V_IL and 1 if greater than or equal to V_IH. The O values are further from the center than the I values. The differences between the input and output voltage thresholds are the noise margins. The smaller of the two noise margins is the noise immunity and the goal as engineers is to design signalling specifications to provide as much noise immunity as possible. Combinational devices obeying this specification work to avoid noise before it causes signalling errors.

A buffer is a simple combinational device which has a single input and output, and delays the input before outputting it. A voltage transfer characteristic is shown as a plot of the voltage output vs input where each measurement is taken after any transients have died out. Forbidden zones are shaded regions on the graph and the VTC avoids this region. VTC doesn’t tell you about how fast a device is and measures static behavior, not dynamic. Gain is the change in output voltage for a given input voltage. Devices exhibiting a change in gain across their operating range are non linear. For combinational devices, nonlinear devices with gain (magnitude of the slope of the curve) greater than 1 (in a region) are needed. Gain is the change in output voltage for a given input voltage.

Section 3 – CMOS

Systems should dissipate as little power as possible as batteries have to sustain power for long periods of time, ideally losing zero. A MOSFET is a metal-oxide-semiconductor field-effect transistor. MOSFETs are four-terminal voltage controlled switches and currents flow between diffusion terminals if the voltage on the gate terminal is large enough to create a conduction channel, otherwise it’s off and terminals aren’t connected. With Moore’s law, these have been able to shrink every two years by a factor of two. Integrated circuits today can have billions of devices. The MOSFET is so small, they can’t be viewed by ordinary microscopes. The substrate the IC is built on is a doped p-type silicon substrate with impurities added to make it conductive. The impurity here is an acceptor atom such as boron. The IC has an electrical contact to the p-type substrate called the bulk terminal allowing the voltage to be controlled. To provide electrical insulation between conducting materials, a layer of silicon dioxide is used. When used to isolate the gate of the transistor from the substrate, it needs to be very thin so the field from charges on the gate conductor can affect the substrate. The gate terminal of the transistor is a conductor, in this case polycrystalline silicon. The gate, think oxide insulating layer and the p-type substrate form a capacitor, so changing the voltage on the gate causes electrical changes to the p-type substrate. Early on, the gate was metal and MOS refers to that. After the gate terminal is in place, donor atoms such as phosphorus are implanted into the p-type substrate in two rectangular regions on either side of the gate, changing those regions to an n-type semiconductor which become the source and drain, which are the final two terminals, which are physically identical and distinguished by their role. The MOSFET acts as a voltage controlled switch connecting the source and drain terminals of the device. When the switch is conducting, current flows from the drain to source front he conducting channel formed as the second plate of the gate capacitor. A MOSFET has two main dimensions, length l, which is the distance from drain to source, and width w, which is how much current it can carry. The current is proportional to W/L. In summary, the four electrical terminals of the MOSFET are gate, bulk, source and drain. The length and width are under control of the designer, usually setting the length as low as possible, and width to get a desired current flow.

Conventionally, the diffusion terminal with the highest voltage potential and the lower is the source. This way, any current flowing through the switch flows from drain to source. A MOSFET is designed to have a threshold voltage V_TH which says when the switch goes from non-conducting to conducting. In a modern process, this would be around half a volt in this situation. The substrate must always have a voltage less than or equal to the voltage of the source and drain. The MOSFET is controlled by the difference between the voltage of the gate, V_G and source V_S, V_GS which is V_G-V_S and when V_GS is less than the threshold, the switch is open and there’s no connection between source and drain. A depletion region is an insulator between the source and drain. As V_GS increases, the type of the semiconductor changes from P type to N-type and forms a conducting path between source and drain. This type inversion makes this an inversion layer that acts like a resistor, governed by Ohm’s law. I_DS=V_DS/R(resistance of the channel). If V_GS falls below the threshold, the inversion layer disappears. When V_DS is larger than V_GS. A large V_DS changes the geometry of the electrical fields in the channel and the inversion layer pinches off at the end of the channel with the drain. The electrons tunnel across the pinch off point to reach the conducting inversion layer.

There are two types of MOSFET. An NFET has an n-type source/drain diffusion in a p-type substrate. They have a positive threshold voltage and inversion forms an n-type channel. In the examples, the bulk terminal will be connected to ground, so the voltage of the p-type substrate is always less than or equal to the voltage of the source and drain diffusions. A MOSFET can also be built by flipping all the material types, creating p-type source/drain diffusions in a n-type substrate, leading to a PFET of P-channel MOSFET. Here, control voltages that cause N-channel to be on will cause P-channel to be off and vice versa. Using both times of MOSFET gives transistor types which work in complementary fashion and logic facilities, hence CMOS. These can be useful for building circuits to manipulate information encoded as voltages.

There are two rules for building CMOS which help abstract the behavior of the MOSFET as a voltage controlled switch. First to only use NFETs in pull down circuits to connect a signalling node to the groundrail of the power supply. When the pulldown circuit is conducting, the signalling node will have zero volts and qualify as the digital value zero. Basically, for NFET switches, if not-conducting, the value will be zero and if it is, the value will be one. PFET switches are only used in pullup circuits connecting a signalling node to the power supply voltage. When conducting, the signalling node qualifies as digital value 1. PFETs have a negative threshold and require V_GS to be less than that for the PFET switch to be conduction. If the gate voltage is a digital zero in PFETs, the switch is on. If it’s a 1, the switch is off. This is the opposite of the NFET switch. For complementary pullup and pulldown logic, the pulldown should be off when the pullup is off and vice versa and a system should have a way of handling errors when this isn’t true (ie both are signalling or the output node has no connection to either voltage). CMOS can be used to implement OR. A CMOS gate with only one pullup and pulldown can only implement inverting functions where rising inputs lead to falling outputs and vice versa. You can’t build positive logic such as AND with a single CMOS gate.

The propagation delay (T_PD) of a combinational logic gate is the upper bound on the delay from valid inputs to valid outputs. Identify the time when the input becomes a valid input 1 and output becomes a valid zero, for instance for rising. To make the effective resistance of the MOSFET switch smaller, you could increase its width, but that would add capacitance to the gate terminal slowing down transitions on the input terminal connector. Minimizing propagation delay is an optimization puzzle for transistor design. Similar to propagation delay, contamination delay is a lower bound on the delay from an invalid input to invalid output. The t_CD is generally assumed to be zero. NAND means Not AND. NOR is Not OR. A lenient combinational device guarantees outputs to be valid when any combination of inputs sufficient to determine the output value has been valid for at least t_PD. This tolerates transitions and invalid levels on irrelevant inputs. Most CMOS implementations of logic gates are lenient. Lenient components are important for memory components.

Section 4 – Combinational Logic

This focuses on using combinational logic to represent a functional specification. You could use natural language to do what a function does, though this can cause ambiguities. Alternatives are truth tables where a table shows the outputs of all possible combinations of inputs and their outputs. These are good when there’s a small number of inputs and outputs, but can be too large as inputs grow too large. Alternatively, boolean expressions can be used to show the output as a boolean equation from the inputs using AND(*), OR(+), and inversion/NOT (written with overbar (I’ll probably use ! here). It’s easy to convert a boolean equation to a circuit schematic. Truth tables and boolean equations are interchangeable. You can fill out a truth table by passing the inputs to the equation and filling out the table with the output. Truth tables can always be converted to a boolean equation called a sum of products. To do this, write out the function spec as a truth table and write boolean expressions covering each 1 in the output.

AND and OR gates will be needed for more than two inputs. To do Z=A*B*C, pass the output of A*B to another AND with C, so Z=(A*B)*C. To add D, you can do Z=((A*B)*C)*D. This builds a chain of AND gates implementing an N-way and with N-1 two way AND gates. You can also use trees, such Z=(A*B)*(C*D). The total number of pairwise ANDS is the same, but propagation delay in chains grows linearly with the number of inputs while it grows logarithmically for trees. If inputs come in at different points, performance can change and without knowing the propagation delays of circuits generating input signals it’s hard to know for sure. For CMS gates, as rising inputs lead to falling outputs and vice versa, CMOS gates are naturally inverting and NAND and NOR gates should be used in CMOS designs. NAND and NOR gates can be implemented as a single CMOS gate with one pullup and one pulldown circuit, while AND and OR gates would require two CMOS gates. NAND and NOR operations aren’t associative, so NAND of A,B, and C isn’t NAND of (A and B) and C, and a chaining strategy won’t work. XOR will take many more gates than for a two input NAND or NOR as well. Any logic function can be implemented using only NANDS or NORS which is useful for designing low cost, high performance circuits in CMOS.

Since propagation delay only cares about the slowest and longest path, it can be good to use slower and smaller components for quicker paths where performance won’t be impacted. Demorgan’s laws state that !A*!B=!(A+B), that is that the NOR of A with B is equal to the AND of NOT A and NOT B. A 2-input NOR gate can be viewed as a 2-input AND gate with inverting inputs. This can show how to build NANDs and NORs with large numbers of inputs. For sum of products, a NAND-NAND implementation will be noticeably faster than AND/OR. The reduction identity for boolean algebra says that ab+!ab = b and (a+b)(!a+b)=b, which can be used to simplify functions. With this, an equation built by converting a truth table to an expression, Y=!C!BA+CB!A+CBA+!CBA can be simplified to Y=!CA+CB. Computer programs are used for simplifying expressions but they can grow faster than exponentially with regards to input size if trying to get to the smallest form. You can also find “don’t care” situations and not consider rows with different inputs that don’t affect the calculations. The smallest circuit may not be the best if it causes it to wait for an unneeded input, causing an unneeded delay. This is handled by lenient circuits.

A K-Map (Karnaugh map) is a truth table arranged so terms which differ by exactly one variable are adjacent to each other. These help easily show potential reductions. For example, for a table with C, B, and A inputs, the rows could represent values for C, while columns represent varying A and B values. In that case, there would be two rows for C, 0 and 1 and four columns, 00, 01, 11, and 10 representing AB. For extending it to 4-variable tables, both rows and columns can represent two values. To extend to 5 or 6 variables, a 3rd dimension and 4x4x4 matrix would be needed, which is doable for computers, but hard to visualize. K-maps work well for up to four variables. Patterns of adjacent entries containing 1’s reveal opportunities for simpler terms. An implicant is a rectangular region of a K-map where the function has the value 1. Such as a region that needs to be described by one or more product terms in the sum of products. The width and length of the implicant has to be a power of 2. Implicants can overlap with each other and an implicant is a prime implicant if it isn’t completely contained in another implicant. The goal is to find the largest prime implicants. Each implicant can be uniquely identified by a single product term, and the larger the implicant, the smaller the product term. You can build the minimal sum-of-products expression by combining the product for the prime implicants. These can also help remove glitches from output signals and are a good solution for coming up with appropriate expressions. To make a circuit lenient, include product terms for all prime implicants.

A 2-input multiplexer selects one of its two input values as it’s output value. With data input values D₀ and D₁, if the value of an S is 0, then D₀ is elected, if S is 1, then it’s D₁. MUXes can be generalized to 2^k data inputs and k select inputs. MUXes provide an elegant and general way to implement a logic function. An example is that values A,B,and C can be specified to choose one of data values 0 through 7. The behavior isn’t specified on time of manufacture and waits for a programming step from the user. MUXes with N-select lines are stand ins for N-input logic circuits. A MUX would have 2^N data inputs, which is okay up to 5 or 6, but higher N values are impractical. Any logic function is easy to implement using only 2-to-1 muxes.

The final logic implementation strategy uses read only memory and is useful when different outputs need to be generated from the same set of inputs. MUXes are good for implementing truth tables with only one output column and ROMs are good for implementing truth tables with many output columns. A key component in ROMs is the decoder which has k SELECT inputs and 2^k data outputs. Only one of the data outputs will be 1 or HIGH at a given time while others will be 0 or LOW. The jth output will be 1 when the select lines are set to the binary representation of j. Inputs are connected to select lines of a decoder. For ABC inputs, each of the decoder’s 8 possible outputs are labelled with the input variables for which the output will be HIGH. Decoder outputs control a matrix of NFET pulldown switches. The matrix has one vertical column per output of the truth table and each switch connects a particular vertical column to ground, forcing it to a low value when the switch is one. If no switches force a column value to zero, its output will be 1 and the value on each column is inverted to produce the final output. Decoder outputs indicate which row was selected as all of the pulldowns for the selected outputs are turned on, forcing the columns they connect to to be low. Inverting gives the desired output. Changing the location of the switches can allow it to implement any 3 input 2 output function. For ROMs with many inputs, decoders have many inputs. Some inputs can be used to choose from different decoders. ROMs ignore the structure of the function to be implemented. Size, layout and design are independent of the function and any truth table can be programmed with minor reconfiguration. Typically, the switch matrix is fully populated with all possible switch locations filled with an NFET pulldown. A separate physical or electrical programming operation determines which switches are controlled by the decoder, with the others being in the permanently off state. If the ROM has N-inputs and M-outputs, then the switch matrix will have 2 to the N rows and M output columns. ROMs aren’t lenient and can show glitchy behavior, as decoder outputs turn on and off at slightly different times until stable.

Section 5 – Sequential Logic

A system we haven’t been able to build yet is one where a device has a button and when pushed, a light turns on if it’s currently off, or off if currently one. This is a problem as the output isn’t a function of the current input value. The behavior instead depends on what’s happened before. Odd number pushes turn the light on and even turn it off. Devices that remember something about the history of inputs have state. The push marks an event in time with different states before and after and the transition between states is what we’re interested in. A combinational device’s outputs only depend on the current value of the inputs and can’t do this behavior. A memory component is added, storing one or more bits to store the current state. The state and current input values are used as inputs to a block for combinational output. One output is the next state of the device (encoded with the same number of bits as the current state). The other set of outputs are the signals serving as the output of the system. The memory component has 2 inputs, a load control signal telling it when to replace the current state, and a data input specifying what the next state should be. Periodically trigger the load control to produce a sequence of values for the current state, each determined by previous state and inputs when triggered. Circuits including combinational logic and memory components are called sequential logic. The memory component has a capacity measured in bits. If it has k bits, then there’s an upper bound of 2^k possible states.

Since bits have been represented as voltages, we could use a capacitor to store particular voltages. A capacitor is a passive two terminal device. Terminals are connected to two parallel conducting plates separated by an insulator. Adding charge q to one plate generates voltage difference V and Q=CV. Adding charge by hooking a plate to a higher voltage is charging the capacitor, and connecting the plate terminal to a lower voltage is discharging. One terminal hooks to a stable reference voltage while the other uses an NFET switch to connect to a wire called the bit line. The gate of the NFET switch connects to a wire called the word line. To write a bit of information into the memory device, drive the bit line to the desired voltage and set the word line high, turning on the NFET switch. The capacitor charges or discharges until it has the same voltage as the bit line. Set the word line low, turning off the NFET switch and isolating the capacitors charge on the internal place. In a perfect world, this charge would stay indefinitely. To access this info later, charge the bit line to an intermediate voltage and set the word line high to turn on the NFET switch connecting the charge of the bitline to the charge of the capacitor. The charge sharing between them has an effect on the bit line and its voltage. If the capacitor was storing a 1 and was at a higher voltage, charge flows into the bitline. If the capacitor was storing a 0 and at a lower voltage, charge flows from the bitline to the capacitor, lowering the voltage in the bitline. The change in the bitline’s voltage depends on the ratio of the bitline’s capacitance to C, the capacitor’s capacitance. THis is usually small and a sensitive amplifier called a sense amp is used to detect the small change and produce a legal digital voltage as the value read from the memory cell. The individual storage capacitors are small, and billions of bits of storage can be cheaply stored on Dynamic RAMs. The complex sequences of operations for reading and writing take a while and access times are slow. You also have to maintain the charge on the capacitor in the presence of external electrical noise. The NFET switch isn’t perfect and can leak, requiring periodically refreshing the memory by reading and rewriting the stored information before leakage corrupts the memory. In current technologies, this has to be done every 10ms or so.

You can design a system using feedback for a continuous refresh for stored information, for instance, two combinational inverters in a digital feedback loop. A zero passed to the first produces a 1 and the second takes a 1 and produces a 0. As long as it’s connected to power and ground. This is also stable if you flip the values on the wire, and is a bistable storage element as it has two stable configurations. A two to one MUX can be used to build a settable storage element called a latch. A MUX selects one of it’s two data inputs as its output value. Here, the MUX’s output serves as the state output of the memory component. Internally to the memory component, the output of the MUX is also connected to its D₀ data input. The MUX’s D₁ data input becomes the input of the memory component and the SELECT line of the MUX becomes the memory component’s load signal or gate. When the gate input is low, the MUX’s output is looped back through its D₀ data input forming a bistable positive feedback loop. This creates a cycle and the circuit is no longer a combinational circuit. When the gate input is high, the MUX’s output is determined by the D₁ input. To load new data into the memory component, the gate input needs to be HIGH for long enough for the Q output to become stable. After that, the gate input is set to LOW to switch into memory mode so the feedback loop can keep producing the value the “Q” value its meant to store. This memory device is called a D latch or latch. To avoid invalidation due to input transition, a lenient MUX is required and signals should be stable. While the G input is high, set the D input to the value we wish to store. After T_PD, the value appears at the Q output. Wait another T_PD so the information about the new value of the Q’ input propagates through the latch. D and Q’ will have been stable for T_PD. Q’ is the feedback loop input. If D is stable for 2T_PD, transitions on G won’t impact the Q output. This requirement is the setup time of the latch. We can set G to low while holding Q=D and after another T_PD, the Q output will be unaffected by future changes on D. The hold time on the latch is how long after the transition on G, which is how long after the transition on G that D must stay stable and valid. The setup and hold times are the dynamic discipline which is required to store a value while the gate makes a high to low transition.

Using the latch as the memory component in the sequential logic system, output Q goes through the combinational logic system which provides a new state back as input D of the memory component. The problem is if the gate is HIGH for too long, a loop is created and the new state changes rapidly. This requires a carefully timed interval long enough for the dynamic discipline and short enough to close before the new state information propagates all the way around the loop. This is almost impossible to guarantee, so a load signal needs to mark an instant in time and not an interval. A new component called a D Register uses two back to back D latches. The load signal for the D Register is called its clock while the input and output are the same for the same D Latch. The D input is connected to the master latch and the Q output is connected to the slave latch. The clock signal is the gate input for both latches and is inverted before it is connected to the gate input of the master latch, meaning that only one latch can be open at a time and there will never be a path all the way through the register from D input to Q output. To ensure correct operation, the contamination delay of the master latch has to be at least the hold time of the slave latch. Extra gate delays such as pairs of inverters can be added if necessary.

The single-clock synchronous circuit uses registers in a constrained way to build digital systems. There are no combinational cycles and a single periodic clock signal is shared across clocked devices. Signals hooked to the register inputs need to be stable and valid long enough to meet the setup time and stay stable long enough to meet the register’s hold time. The clock period should be greater than the T_PD of every path from output to inputs plus the setup time. There should be no noise problems this way. Clock skew may need to be accounted for when figuring out timing. The D Register works great for the sequential logic system and the state will be updated every rising clock edge without a loop. Pretty much every digital system is a sequential logic system obeying the dynamic discipline’s timing constraints.

Section 6 – Finite State Machines

Sequential logic adds memory components and combinational logic. The combinational logic part is an acyclic graph of components obeying the static discipline. The static discipline guarantees that valid and stable inputs will guarantee valid and stable outputs by some specified interval. The state register remembers the current state of the sequential logic. A state captures the relevant history of the input sequence in stored state bits. A simple sequential circuit is a digital binary combination logic. The lock has a 1 bit input symbol and the combination is entered as a sequence of bits. The only output signal is if it’s unlocked or not. UNLOCK is 1 if and only if the last 4 inputs were the combination 0110. How many state bits are needed. The whole 0110 would be four input bits, but less can work as we only need to know if it is incorrect. A finite state machine or FSM describes the input output behavior or the sequential logic independent of implementation. An FSM has a periodic clock input and a rising clock edge moves the state forward. The FSM has a fixed number, k, of states and one is an initial state. Some number, m, of inputs are also used as well as n outputs. There are also some number of transition rules specifying how the next state s’ is calculated from the previous state s and input, I. Lastly, there’s a specification for how output values are determined as a function of the current state or current state and inputs.

A state transition diagram is shown for the lock. States are represented as circles with a name for the state it represents and the output when in the state. The initial state is shown with a wider border and an arrow goes from state to state with the input for the transition underneath. Starting from initial state SX, input 0 leads to S0, input 1 to S0 leads to S01, another 1 to S011 and a 0 input to that state leads state S0110, which is the only state with an output of 1. An incorrect input doesn’t necessarily send the state back to SX, as the system only cares that the last four digits were the combination. For instance, a 1 input to S011 would send it back to initial as there’s no combination starting with or having three 1’s in a row, while a 0 input to S01 would transition back to S0. The last bits of an incorrect sequence might be a correct start to the combination. An FSM where the outputs are a function of the current state is a Moore machine and the outputs are written in the state circle. Conversely, a Mealy machine has outputs based on states and inputs, where input and outputs are separated by a slash on the transition line instead. For valid state diagrams, transitions leaving a state must be mutually exclusive. There should also be a transition specified for each possible input value including lines that transition back to the current state when no transition is required. State transition diagrams can also be represented as truth tables. A ROM can be used to compute the next state and outputs front the current state and inputs. Encoding the 5 states with a 3 bit binary value gives a 3 bit state register. The ROM has 4 input signals, 3 for the current state and 1 for the input, giving the ROM 16 location (truth table has 16 rows as well). This ROM is 16×4, with 16 locations of 4 bits each. On startup, there needs to be a way to set the encoding to the original state. There could be a reset signal which selects an initial value to be loaded. If unused state encodings are a concern and bugs cause an unused state to be loaded, one solution could be to have that load the initial state on error.

The next example is a RoboAnt with two sensor inputs as L and R antennae which are 1 if in contact with something. The outputs of the FSM control the ant’s motion, with an F output set at 1 causing it to step forward and TL/TR outputs causing it to turn left or right. If a TL/TR and F output are set together, the ant turns and then moves. An ant can turn if an antenna is touching something, but it can’t move forward. The problem given is finding the way output of a maze with one way out. Implementing the right hand rule of walking along a wall, go forward until L or R are a 1. The ant should turn left until both L and R are zero and the ant hits a wall on its right and walks along the wall. If the ant hits an inside corner, it rotates to put the new wall on its right and keeps going. As it steps it turns right and makes sure it touches the wall before turning left and going forward again. If it hits an outside corner, it turns right and doesn’t find a wall, then moves right to go around the corner and checks for a wall on its right. Modern robots effectively have FSM brains that let them perform their assigned tasks.

Two states are equivalent (and can be merged) if both have identical outputs and every input leads to the same state for both. To reduce an FSM based on this, find pairs of states with the same outputs and check if they’re equivalent and then merge them into a single combined state which reduces the number of state bits needed. Logic gates can also be used instead of ROMs. Group behaviors can be modeled by collections of many FSMs running in parallel. What if an FSM could modify its own transition table?

Lastly, this section looks at asynchronous issues on a sequential logic circuit, where the timing of transitions on the input is independent of the logic clock. Events in the world can be out of our control. If the input can change at any time, it could violate setup and hold times. We could use a synchronizer circuit which takes an unsynchronized input symbol and produces a synchronized signal to change after the rising edge of the clock. A specification for a syncrhonizer has two inputs, IN and CLOCK with transitions at time t_IN and t_c. If t_IN is before t_C – t_E (allowable error), the synchronizer should output a 1 within a bounded time t_D after t_C. If t_IN is after t_C + t_E, the synchro should output a 0 instead. If the two transitions are closer together than specified t_E, the synchronizer can output either 0 or 1. This is an unsolvable problem. A D Register won’t work as we can be lured into the digital abstraction that Q will be 1 or 0 after the propagation delay. The master latch is a lenient MUX that can be configured as a bistable storage element using a positive feedback loop. When in memory mode, it’s a 2 gate cyclic circuit with two constraints, which are VTC of the path through the mux and that V_in=V_out. If IN and CLK transition at the same time, the voltage on Q may be in transition when the MUX closes and could be near or at the middle intersection point. When Q is at the metastable point, the storage loop is in an unstable equilibrium called the metastable state and though a small change in the voltage can set it back toward stable, we can’t bound the amount of time it stays in the metastable state. The metastable state is in the forbidden zone and corresponds to an invalid logic level and the register isn’t guaranteed to produce an output in bounded time. There’s also no upper bound for how long it takes for resolution. Every bistable system has at least one metastable state. An approach to dealing asynchronous inputs is to put the potentially metastable value coming out of the D Register synchronizer into quarantine by adding a second register to the output of the first. If a transition on the input causes the first to go metastable, the second register stops it from going into the system. By adding longer clock periods, the probability it’s still metastable is lower. Multiple quarantine registers can be used in series. Metastability isn’t a practical issue using the quarantine strategy.

Section 7 – Performance Measures

Systems that overlap processing of a sequence of inputs are pipeline systems with each step as a stage. The rate that inputs move through the pipeline is determined by the slowest stage. For instance, in a system with a washer that takes 60 minutes and dryer that takes 30 minutes, N (greater than 1) loads can be done in N*60 minutes by loading the washer before the dryer finishes the previous load. There are two performance metrics used. The latency is the time it takes for the system to process an input. The second is the throughput, which is the rate at which inputs or outputs are processed. Applied to circuits, the latency of a combination logic circuit is it’s t_PD and throughout is 1/t_PD. The propagation delay determines latency and throughput. In a module where input X goe through parallel F and G modules which pass their outputs to an H, latency can’t be improved. After producing outputs, F and G are idle. Dividing the circuit into two stages, where the first runs F and G and second does H, one solution uses registers to hold F and G’s outputs so that F and G can start on the next set of inputs. Pipeline diagrams show the stages of a pipeline system. Rows are stages and columns are clock cycles. The processing for an input value moves diagonally through the diagram, progressing one pipeline stage per clock cycle. A K-stage pipeline is an acyclic circuit with K registers on every path from input to output. A combinational circuit is a 0-stage pipeline. Every pipeline stage has a register on its output and the clock common to all registers has a period sufficient to cover propagation over combinational paths PLUS (input) register t_PD PLUS (output) register t_SETUP. The latency of a K-pipleine is K times the period of the systems clock and the throughput is the frequency of the clock (1/clock period).

It’s important to avoid mixing values and inputs between cycles and that the changes are in sync. A goal is placing the minimum registers needed to get the maximum throughput, with a strategy of finding the slowest system component and placing registers on its inputs and outputs, drawing contours on either side of the module. There can be several choices for drawing this. Pipelining implementations with increasing throughput often has the cost of increased latency. Once the slowest component is isolated, the throughput can’t be isolated further. To continue to improve the performance of the circuit, you can replace slow components with a k-pipe version which can decrease the clock period. Registers must be added to account for the new stages. Adding a second dryer to the dryer problem, dryers run on a staggered schedule and the system can produce a clean and dry load every 30 minutes. This gives a throughput of 1/30 loads/min and latency of 90 min/load. To improve the circuit pipeline further, a pipelined version of the slowest component must be found or interleaving copies of it must be used. A D-latch can capture and hold the input value for the slow components. An N-way interleaving circuit is treated as an N-stage pipeline. By running pipeline systems in parallel, the throughput can continue to be increased.

Processing pipelines so far have been synchronous, globally timed systems where the clock period accommodates the worst case processing times. If for some inputs, some processing stages could be faster, you could use a synchronous, locally timed system, where each stage can signal when it needs a new output and has a new output. Handshake signals can be used to communicate between modules. This handshake protocol works by the upstream stage saying if it will have a new output value at the next rising edge of the clock and the downstream stage says if it will grab the output at the rising edge of the clock. If both systems asserted their signals at the same clock edge, then the transfer will occur then. Either stage can delay a transfer if still working. A clock-free asynchronous, locally timed system can use transition signalling, where the upstream asserts its signal when ready in the first phase. In phase 2, the downstream asserts when it’s ready and in phase 3, the downstream waits to see that the upstream stage has received the downstreams signal. In phase four, once the upstream de-asserts its signal, the downstream de-asserts its signal and the handshake can begin again. In the example, the upstream signals “here’s X” and downstream signals “got X.” A component can be added to allow signalling from one component and two downstream stages. These are some examples of various control structures. Pipelining is key for increasing throughput in most digital systems.

Section 8 – Design Tradeoffs

Some optimization metrics are the area of the design, throughput, latency, power consumption, and energy of executing a task. In some circuits all can be improved, but often tradeoffs have to be made. Design goals are important for knowing which tradeoffs to achieve. For instance, watches and GPUs have different needs. In power dissipation, the usual goal is to meet a power budget or minimize power consumptions while meeting other targets. In CMOS circuits, there are several sources of power dissipation. Static power dissipation is power consumed when the system is idle and nodes don’t change value. With a simple switch model for MOSFETs, CMOS circuits would be expected to have no power dissipation. People came close in the early days, but as dimensions have shrunk and voltages lowered, two sources of dissipation appeared in both NFET and PFET MOSFETs. The first effect is based on the thickness of the gate oxide. As it shrinks, it leads to NFETs carrying more current and faster gate speeds, but allowing for power drain. The second effect is current leakage from the drain to source even if the MOSFET is off. This is called subthreshold conduction and is exponentially related to V_GS-V_TH, and V_TH is negative when the NFET is off. As V_TH is reduced in each generation of technology, the subthreshold conduction has increased. One fix is to change the NFET’s geometry so the conducting channel is a tall narrow fin with the gate wrapped around three sides, known as a trigate configuration. Power dissipated in a circuit if f C V_DD²per node and f N C V_DD² per chip where f is the frequency of charge/discharge and N is the number of changing nodes and chips. Newer chips would naturally dissipate more power if we could let them do so. To reduce power consumption, it’s possible to turn off modules whose outputs aren’t needed, stopping their inputs from changing. Latches can be attached to the inputs of each module or unused portions can have their power supply turned off. The clock can also be slowed when the system has nothing to do.

The most straightforward way to improve performance is to reduce the propagation delay of a circuit. Looking at a ripple carry-adder, find the path from input to output with the largest propagation delay, which is the path determining overall T_PD. Currently, latency is O(N) and the term. IN a ripple carry adder, the high order bits have to wait for the carry in from the low order bits. If the high half and low half could work in parallel, that could work better. For instance, to add two 32-bit numbers, use a low bit adder for A[15:0] and B[15:0] and two high bit adders for A[31:16] and B[31:16], with one assuming the carry in from the low bit adder is 1 and the other 0. This means, three 16-bit adders can work on new inputs. Once complete, the actual carry out from the low half is used to select the output from the correct high half. This is known as a carry select adder. The latency is about half the cost of a 32 bit ripple carry adder at the cost of some additional circuitry. This can be repeated with additional circuitry and can be improved to O(log n). In summary, from going from a chain to a tree, latency can go from linear to logarithmic. Another approach focusing on the carry logic. Generate and propagate blocks can be added to speed up processing and further abstract from the previous input output strategy. For instance, G_HL=G_H+P_HG_L and P_HLP_L say to generate a carry out if the high part generates one, or if the low generates on and high propagates it. It propagates a carry if both high and low parts propagate theirs. The fastest known adders are variations on the generate/propagate strategy and “Kogge-Stone” adders provide more information on this strategy.

Binary multiplication is one of the slowest circuits as it takes a digit of each multiple of the first number multiplied by each digit of the second, shifting a unit to the left as it moves to the next digit of the second number, creating partial products. The multiplication table for binary multiplication is a two-input AND gate with four possible combinations where only 1*1 gives 1 and leads to no carries, and partial products are N-bits wide. If the multiplier has M bits, there are M partial products and adding them together gives a potential (N+M)-bit result. Forming the partial products requires AND gates. The hard part is adding the partial products. For a combinational multiplier, adder modules add the current rows partial products the sum of the partial products front the earlier rows. (M*N) 2-input and gates compute the bits of the M partial products. There are two types of adder modules and a full adder is used when modules need three puts and half adder is used for only two inputs. The longest path is at most (M+N) modules, as inputs move either left of down and latency is O(N). Throughput is O(1/N) and hardware is O(N²). With a 2’s complement multiplier and multiplicand, the high order bit of each has negative weight. When adding partial products, sign extend each of n-bit partial products to the full (n+m)-bit sum. Since the high order bit of the multiplier has a negative weight, subtract the last partial product. Next, add 1’s to various columns and subtract them later to eliminate extra additions from the sign-extension. Add the 1’s to the partial products and propagate carries so the sign extension bits go away. Some AND gates need to be NAND gates to deal with the compliments.

Pipelining can help improve multiplication, diving the computations in half which can double throughput. One technique is a carry-save multiplier where carryouts are connected to a module 1 column to the left and down a row. All additions that have to happen in a specific column still happen there, but in a different row. Pipelining the revised diagram creates stages with approximately two modules worth of propagation delay. The long carry chains are broken by horizontal contours and the latency of each stage is constant and independent of N. This requires adding extra hardware. The latency is constant and the clock period and throughput are as well, leading to O(N) pipeline stages and the system is still O(N). Hardware is still O(N²). Another sequential multiplier design can compute a single partial product in each step and adds it to an accumulating sum. It would take O(N) steps to complete multiplication. In each step, the next bit of the multiplier is ANDed by the multiplicand to form the next partial product, sent to the n-bit carry save adder to be added to the accumulating sum in the P register. The value in the P register and output of the adder are in the carry-save format. This output is saved in the P register and both P and B registers shift right. The clock period is independent of N and the propagation delay is constant for carry save with O(N) cycles and hardware. Latency is linear and throughput is O(1/N). Constant throughput and linear hardware are possible and carry-save is a popular technique for achieving them.

Section 9 – Designing an Instruction Set

This is the first section of the second part of the course, focusing on digital systems that can perform computations on different types of binary data. This part focuses on building computers, the central control element for systems with complex behavior. An example of a problem is computing factorial(N) and a program can store two variables a and b for the accumulator and next value for a loop iteration. For the loop

  int a = 1;
  int b = N;
  do {
      a = a * b;
      b = b - 1;
  } while (b != 0)

A sequential logic FSM can be used to run this process. A machine running this loop is a high level FSM as the outputs of each state are formulas indicating operations to be performed on source variables. At a high level, there’s a start state, loop state, and done state. In hardware, 32-bit D-registers hold a and b, and 2-bit D-registers are used for each state (start, loop, done). Logic to transition are boolean as well (is b zero or not) and logic to perform multiply and decrement and select values to load into a and b registers at the end of a cycle. The design shown first is for a single purpose hardware, where a better method could use general purpose hardware for computing multiple functions. Instead, A_SEL and B_SEL MUXes could select from any of multiple data registers to pass to mathematical operations each with their own blocks and the result is selected by an op_SEL MUX and can be passed back to any data register by setting an enable flag to 1 and using the 2-bit W_SEL signal to select the register to be loaded at the next rising clock edge. Data registers have a load enabled flag input which tells it whether to load a new value from its D-input or reload its current value. A control FSM is added to generate the control signals for the data path. A z signal to the control FSM lets the system perform data dependent operations where actual values can influence the process. This system can be used for factorial, division, exponentiation, and more, but nothing requiring more registers than set up. The Eniac and other early computers worked this way. Mapping a problem to a machine was complex and it could take days to get a problem into Eniac by switching levers and switching.

There are many approaches to building general purpose computers. Almost all modern computers are based on the stored program computer architecture from 1945 by John von Neumann, commonly called the von Neumann model. These computers have three components, FIrst there’s the central processing unit or CPU containing the data path and control FSM and performs operations on values in registers. The CPU is connected to the main memory, which holds an array of W words of N bits each. Even small memories have a billion words and 32 bits. When the CPU wants to operate on items in the array, it sends the memory an address (an array index) and the memory returns the N-bit value stored there. Writes work the same way as reads with data flowing in the opposite direction. Lastly, input and output devices are used to communicate with the outside world and other storage. The main idea is to use the main memory for CPU instructions as well as data. Interpreted as an instruction, a value in memory is a set of fields containing bits containing information about actions for the CPU to perform. The first value in an instruction specifies the operation to perform and subsequent fields contain data registers for inputs and a destination where the result is stored. The CPU would then move on to the next instruction in memory, running step by step. Data values and strings of characters can also be represented in memory and the CPU reads and writes them as needed. How do we know which words are instructions and which are data? It’s how it’s used by the CPU. If loaded into the data path, an item is data. If loaded by the control logic, it’s treated as an instruction. The control unit in a von Neumman computer would load data from main memory to interpret as instructions, keeping a register called a program counter which tracks the address of the next instruction that needs to be run. The control unit also contains logic for translating instructions to signals for the data path. The data path is the brawn and stores data.

Instructions are the main unit for work fetched by the control unit and executed one after another in order. Each specifies an operation or opcode and source operands and a destination for the result. In a von Neumann machine, instructions are executed sequentially with the CPU implementing a loop. The next program counter is the current plus the size of the current instruction. The basic loop is to fetch an instruction, decode it, read source operands, execute instructions, write destination operands and compute the next program counter. This loop is run again and again with modern machines able to run over a billion instructions per seconds. Instructions so far have been abstract. The specification of instruction fields and their meaning are referred to as an ISA, instruction set architecture. This serves as a contract between software and hardware. The ISA is a detailed specification of operations and storage locations and a precise description of how software can invoke and access them. Since programs are in main memory and changeable, they’re referred to as software. The combination of hardware and software determines system behavior. The ISA is a layer of abstraction allowing programs to be written without knowing the implementation of the hardware. This hides the complexity of CPU implementation and enables fast innovation in hardware as software doesn’t need to be changed. A particular ISA can last for decades and designing them can be hard. Taking a quantitative approach, find benchmark programs, evaluate versions of the ISA and implementation with and without the feature and pick what works best based on goals. One strategy is optimizing the common cases, such as arithmetic and memory access.

A beta is a reduced instruction ISA where most instructions only access internal registers. Memory values are loaded and stored with separate instruction and simple address calculation. This leads to smaller and higher performance hardware implementations and simpler compilers on the software side. The ARM and MIPS ISAs are other risk architectures. Intel’s x86 is more complex. The CPU state has 32 registers (each holding a 32-bit value) and a PC program counter. Registered are referred to as r followed by a number. R31 always stores 0 and can’t be changed. The beta is a 32 bit architecture. Main memory is an array of 32 bit words, each containing 4 8-bit bytes. The beta ISA only supports word accesses, loading or storing full 32-bit words, though still uses byte addresses where items in memory differ by 4 and often use hex addressing. The maximum memory size is about 4GB or a billion words of main memory. Large memory is important but fast access is too. Using registers for operands and destinations can increase speed and parallelize while storing lots of data in a large main memory. All program data resides generally in main memory. Variables live in memory, and registers hold temporary values. To operate with memory variables, load them, compute on them and store the results.

The beta has three types of instructions in its ISA. Arithmetic and logical instructions perform operations on general registers, loads and stores move data between general registers and main memory, and branches conditionally change the program counter. Instructions have a fixed length of 32 bits or 4 bytes. Fixed length instructions are used here though variable length instructions can lead to smaller code size. Fixed length encoding leads to a larger code size but keeps the hardware execution engine small and fast. The computation by the beta data path happens in the ALU (arithmetic and logic unit) where instructions are [OPCODE, r_c, r_a, r_b, unused], where the remaining bits are unused.

Features are additional instructions that can (in theory) make an ISA better but tradeoffs must be judged. For instance, many programs use small constants. What if we added a constant in ALU instructions. This would avoid needing to load from memory but require more opcodes and more complex control logic. These changes should be simulated and measured on CPU benchmarks to see their impact. Small constants are very common as the second argument of arithmetic instructions. Half of executed arithmetic instructions, 80 percent of compare instructions and a quarter of load instructions use them. This second instruction format is structured [OPCODE, r_c, r_a, 16-bit signed constant]. A “C” is added to the end of the operation to say it uses a constant, such as ADDC instead of ADD.

The second set of instructions, load and store are the only way of accessing memory values. These can be updated in the same way to use constants. To access memory a CPU generates an address. LD and ST compute the address by adding the sign-extended constant to the contents of the register r_a. To access a constant address, specify R31 a as r_a. To use only a register value as the address, specify a constant of zero. The third set of instructions updates the program counter. So far, the counter has been incremented by four after each instruction, but a way is needed to change the counter based on data values from the program’s execution. This is called branching and if a branch is taken, the counter is changed, otherwise, keep executing sequentially. Branch instructions also use the instruction format. The BEQ instruction handles branching if equal

  BEQ(ra,offset,rc):
      NPC <- PC + 4
      Reg[rc] <- NPC
      if (Reg[ra] == 0)
          PC <- NPC + 4*offset
      else
          PC <- NPC

The offset is a signed constant encoded as part of the instruction and is the distance in words to the branch target, counting from the instruction following the BEQ/BNE. Negative offsets can be seen at the end of a loop to get back to the main program. If statements can use forward offsets (positive) to skip over code if not required. BEQ can implement an unconditional branch which is always taken. Testing R31 to see if its zero will always be true. BNE is the same but the branch is only taken if the values aren’t equal. Beta code can be written for factorial now from the previously seen C code for factorial.

  ADDC(r31, 1, r0)
  L:MUL(r0, r1, r0)
  SUBC(r1, 1, r1)
  BNE(r1, L, r31)

“L:” signals the start of the loop. Each state in a previously shown FSM matches up nicely with the beta code. THe JMP instruction sets the program counter to the value from Ra, going to a predetermined destination.

  JMP(Ra,Rc):
      Reg[Rc] <- PC+4
      PC <- Reg[Ra]

Section 10 – Assembly Language, Models of Computation

This part of the course focuses on building the beta computer shown in the previous section. To program the beta, main memory needs to be loaded with binary-encoded instructions. A computer needs to figure out the encoding and a programming language will be used to specify the opcode and operands for each instruction, letting an assembly language write instructions in symbolic form instead of writing binary. Assembly language could be ADD(R2, R3, R4) or be abstracted further to a higher level language with a = b + c;. This section describes the assembly language used for the beta and the next language shows how to interpret C in the assembly language. Assembly language helps avoid the bit level representations of instructions and the need to know the exact locations of variables and instructions in memory. A program called the assembler reads a text file containing the assembly language program and produces an array of 32-bit words that can initialize main memory. The UASM microassembler is built into BSim, a simulator used in the course, which is a basically a fancy calculator, reading arithmetic expressions and evaluating them to produce 8-bit values which it adds to the array of bytes to eventually be loaded into memory. UASM’s main elements are values, symbols, labels and macros. Symbols and labels give names to values and addresses and macros create shorthand notations for sequences of expressions. Examples of UASM code are

        N = 12              // loop index initial value
        ADDC(r31, N, r1)    // r1 = loop index
        ADDC(r31, 1, r0)    // r0 = accumulated product
  loop: MUL(r0, r1, r0)     // r0 = r0 * r1
        SUBC(r1, 1, r1)     // r1 = r1 - 1
        BNE(r1, loop, r31)  // if r1 != 0, NextPC=loop

With “//” or /*…*/ comments being ignored by the assembler. N and r values are symbols, representing a constant value and aren’t variables. Loop here is a label which is a symbol for an address in memory. ADDC, MUL, SUBC, and BNE are macros that will be evaluated and replaced with full instructions based on the macro’s definition. The assembler maintains a symbol table, matching symbol names to their values. Initially, this is loaded with register mappings. The assembler reads the source file line by lines, evaluating symbols and expanding macros, and translates symbols to values. When the assembler encounters a symbol or label, its replaced with the value in the table. The assembler makes two passes through the file, first loading the symbol table with values from symbol and label definition, and then generates a binary output on the second pass. A statement can then refer to a symbol or label defined later in the file which can be required if branching is used. Register symbols aren’t treated differently from regular symbols and will be replaced by the assembler’s process. No type checking can cause errors. BEQ and BNE macros can help compute offset automatically in UASM which is useful as adding or removing code can cause the amount of addresses the program counter needs to move to change. One macro used to assemble beta instructions is

        .macro betaop(OP, RA, RB, RC) {
            .align 4
            LONG((OP<<26)+((RC%32)<<21)+((RA%32)<<16)+((RB%32)<<11))
        }

“.align 4” ensures instructions begin on a word boundary and have a byte address that’s a multiple of 4 and spans exactly one 32-bit word. The LONG macro generates the four bytes of binary data representing the value of the expression shown. The expression is where the assembly of the fields takes places, each limited to the correct number of bits and shifted left.

      .macro betaopc(OP, RA, CC, RC) {
          .align 4
          LONG((OP<<26)+((RC%32)<<21)+((RA%32)<<16)+((CC % 0x10000))
      }

      //Assemble Beta branch instructions
      .macro betabr(OP,RA,RC,LABEL) betaopc(OP,RA, ((LABEL-(.+4))>>2),RC)
      
      For example, 
      .macro ADDC(RA,C,R)    betaopc(0x30,RA,C,RC)
      ADDC(R15, -32768, r0) becomes betaopc(0x30,15,-32768,0). Here, the ADDC macro expands to an invocation of the betaopc helper macro and passes the correct value for the ADDC opcode and the 3 operands. Betaopc then shifts the opcode (0x30) left to occupy the high order 6 bits of the instruction, then the RA argument (15) is placed in its proper location. The 16-bit constant is positioned in the low 16-bits of the instruction and the RC argument is placed in its proper location. This process is called assembling because it assembles the binary instruction from separate binary values. Pseudo instructions such as pop and push are built using multiple smaller instructions and can help make code more readable. LONG assembles a 32-bit value and values can be written as expressions. The assembler evaluates them and they are not translated to instructions to be computed during program execution. If a value is needed for an initial data value or instruction field, the assembler needs to be able to perform the execution itself. A special symbol “.” called dot is the next byte address to be filled in the main memory. Initially this is 0 and incremented each time a new byte value is generated. This can be set to tell the assembler where in memory to place a value. The assembler is a program in itself and the first one was hand-assembled into memory.

What capabilities need to be included in the ISA. Looking at boolean gates, earlier, we saw any boolean function could be implemented with only NAND games. How do we know if the Beta can solve any problem, what about compared to FSMs and does it depend on the ISA. The Beta ISA needs the ability to perform any computation. If a computation could be described using another well formed model, a universal model should be able to handle it. Can FSM implement all computations solvable by any digital device? An FSM can’t solve the problem of checking if parentheses are balanced, since the internal state is used to encode information on the history of inputs. The FSM would need to count the values seen so far, but with a fixed number of states, if the FSM gets more open parenthesis than it can count, it won’t be able to solve the problem. Finiteness leads to problems requiring unbounded counting.

Alan Turing was one of many mathematicians finding the limits of proofs in computation. Turing machines solved the finite problems of the FSM. Turing proposed a conceptual model of an FSM combined with an infinite digital tape that could be read and written at each step. Assuming a non blank input on a digital tape occupies a finite number or adjacent cells, it can be expressed as a large integer. This way, the Turing machine takes integer inputs and outputs. Other models of computation describing a class of integer function where integer inputs become output were done by Stephen Kleene, Alonzo Church and Emile Post. All wanted to prove the existence of problems unsolvable by realizable machines. Their recursive functions, lambda calculus and production systems (respectively) all ended up being able to compute the exact same set of integer functions, proved by translating steps between the models and showing equivalent descriptions exist. This leads to a notion of computability independent of the specific scheme. Church’s thesis states that every discrete function computable by any realizable machine is computable by some Turing machine. The goal is to find a universal function U(k,j) = T_K[j] that computes the result of calling Turing machine K on J. U is computable and T_U exists. There are infinitely many such machines each capable of running any computation any Turing Machine can compute. The smallest known universal Turing Machine has 4 states and 6 tape signals. K encodes a program(a description of an arbitrary machine, j encodes the input data and T_U interprets the program, emulating its processing of the data. The main idea is interpretation, where you manipulate coded representations of computing machines rather than the machines themselves. The universal Turing Machine is the paradigm for modern general purpose computers and ISAs should meet this as long as data can be stored in memory. Treating programs as data is important and makes abstraction, compiling and language design easier. Turing machines paved the way for modern stored program computers.

There are well defined discrete functions that Turing machines can’t compute f(x) for an arbitrary x in a finite number of steps. The finite memory problem isn’t the only limiting factor. The most famous example is the Halting function f_H(k,j) which is 1 if T_k[j] halts, and 0 otherwise. f_H(k,j) determines whether the k^th TM halts when a given tape contains j. This isn’t computable since if it were it would be equivalent to a Turing Machine T_H, then T_N must be computable which does the opposite of whatever T_X[x] does. Finally, giving N as an argument to T_N creates a contradiction since T_N can’t do the opposite of what it does. This shows that the Halting function isn’t computable.

Section 11 – Compilers

This section focuses on translating high level code into executable instructions. Assembly code writers determine which data should be in main memory as opposed to registers and assembly code is an abstraction from working with bits. A higher level language can be used to abstract away storage allocation and data movement using variable names and letting data processors handle the details. Expressions and other operators such as assignment can be used to describe what would be many statements in assembly language. A subset of C will be used for the example high level language. C was developed in the late 60’s and early 70’s and isn’t object oriented. Avoiding worrying about registers makes code easier to write and consistently execute. Higher level language allows for advantages such as productivity, via concise, readable and maintainable code. The same code can be run on different hardware and code can provide type checking and other tools to ensure correctness. It takes a lot less time to write a program in a high level language than assembly. The same code can run on different ISAs without having to rewrite the code.

Two basic execution strategies are interpretation and compilation. To interpret a high level program, a program called an interpreter runs on a hard machine, M1. The interpreter mimics the behavior of an easier to program M2 and the result is a “M2” on M1. The program on M1 can be viewed as an implementation of M2 and given a program for M2, the interpreter emulates its instructions step by step. Several layers of interpretation are typically used. For instance, a laptop with an Intel x86 CPU runs a Python interpreter, which runs SciPy for data analysis. Each SciPy command is translated back to many Python statements. For each Python statement, the x86 interpreter executes many x86 instructions, such as incrementing loop index and checking for loop termination. 1 SciPy command can be 10’s of Python statements which can be 100’s of x86 instructions. Interpretation is useful for performing a computation once.

A compilation strategy is useful for tasks that need to be executed repeatedly, investing more time up front for more efficiency in the long term. In compilation, starting with M1, take a high level language program and translate it statement by statement into a program for M1. This doesn’t run the higher level language program and just translates it into a program that can run directly on M1. The translation process is compilation and the program doing it is a compiler. The translated code can be run quicker as it doesn’t have to process the higher level source each time. With different compilers, the high level code can be run on different machines. Both methods of executing high levels are commonly used but have their trade offs. The interpreter interprets and runs each command, one at a time and may have to interpret commands multiple times. The compiler only has to translate the code once and will be faster overall if code has to run multiple times, but can be slower if debugging or constantly changing and recompiling.

A good compiler will first check that the source is correct and can provide errors if wrong data types are used or a variable is accessed before assigned. If a program passes scrutiny, the compiler produces a sequence of instructions, often optimizing the code to find faster versions. The two main routines in a simple compiler, compile_statement and compile_expr. Compile_statement compiles a single statement from the source program, which will be called a lot. Four types of statements are unconditional statements, which are executed once, compound statements are sequences of statements executed in turn. Conditional statements (if statements) compute the value of a test expression and then choose between statements to execute. Iteration statements such as while loops also contain a test expression executing statements after checking for each iteration, or terminating. Compile_expr generates code to compute the value of an expression into a register. Expressions can be constants, variables, assignments, operations, etc, and procedure calls (this adds complications). To put the value of a constant into a register, CMOVE(1234,Rx) can move small constants into a register. Otherwise, if the constant is large, it’s stored in main memory and LD(c1,RX) gets the value of the LONG into the register. Variables can be stored in RX, the same way. Array access is more complicated as code is needed to convert the array index into the actual main memory address of the element and other values may need to be computed, then stored. A recursive descent technique is used to find assembly code for complicated arithmetic functions. Looking at compile_statement, unconditional statements are often assignment expressions and procedure calls where compile_expr can generate the appropriate code. Compound statements just require calling compile_statement on each statement one after another. Conditional statements are a bit more complicated as branches and labels will be added. In general tasks are broken down to larger into smaller and smaller tasks. A for loop may be turned into a while loop and then to assembly.

Compared to the bare minimum compiler we just saw, a modern compiler first reads the source program text to produce a machine independent intermediate representation. This is done in the analysis or frontend step, reading the source program, breaking it into basic elements, checking for correctness and reporting errors, before producing the intermediate representation. The front end verifies syntax and that operations take the correct types. This phase converts the text of the source program into an internal data structure which specifies the sequence and type of operations to be performed. There are families of front end programs that translate a variety of language to intermediate representations. The synthesis or backend phase optimizes the intermediate representation and then generates code for the target instruction set architecture, ASM and optimizes the ASM. The analysis phase starts by lexical analysis or scanning the source text and generates a list of tokens objects identifying the type of each piece of the source text, removing newlines and tabs. Token objects also contain information on where they were found and illegal tokens such as 3x would be caught as it’s not a legal number or variable name in C, creating an error. The syntactic analysis or parsing phase processes uses the sequence of tokens to generate a syntax tree, determining the roles of each token in the program. You could use a depth-first tree walk using the label of each tree node to select the appropriate code generation template. The syntax tree makes it easy to see the program is semantically correct as well, for instance, “int x = “bananas”;” is syntactically correct but not semantically correct. When the semantic analysis is complete, the program is successfully transformed into a language independent sequence of operations.

The syntax tree is a useful intermediate representation, independent of source language and target instruction set architecture. This additional language is used as it has information on sequencing and groupings not seen in machine language and assembly doesn’t have the needed information to optimize well. It also allows front ends for different source languages to share a common backend. A common intermediate representation is to rearrange the syntax tree into a control flow graph, where each node in a graph is a sequence of expressions and assignments with an optional ranch at the end. For instance “x = a op b”. The edges of a graph represent branches to another basic block. Conditional blocks have multiple departing arrows. If a block can only be arrived at from a single predecessor block, then any information from the previous block is applicable to the current. The intermediate representation can be optimized with multiple passes over the control flow graph, each doing a simple optimization and repeating until no optimizations can be found. For instance, dead code elimination can eliminate assignments to variables that aren’t used (or reassigned) and blocks that aren’t reached. Constant propagation is identifying variables that are constant, substituting the constant elsewhere. Constant folding is computing substituting constant expressions. Each round can check each of these categories until they can no longer be optimized. Code generation then translates the representation to assembly, allocating registers as needed, translate assignments to instructions, emitting each basic block using labels, assignments and branches. Then blocks are laid out and superfluous jumps removed, and any ISA or CPU specific optimizations are added.

Section 12 – Procedures and Stacks

A procedure or subroutine is a reusable code fragment performing a specific task and has a single named entry point, taking zero or more formal parameters it takes when invoked. The arguments passed are matched up with the formal parameters. The body of the procedure contains local storage and variables are allocated when the procedure is invoked and deallocated once it’s done. They can return values to the caller or not. This abstraction can make functions available as black boxes which can be used without understanding the implementation. Every high level language comes with a library of black boxes. Object oriented languages provide all sorts of procedures for getting information on objects, with internal object operations.

One option for implementing procedure is inlining which replaces the call with its body. A problem with this is that the code size could increase if a lengthy procedure was called many times. This is sometimes used with short procedures. Recursive procedures are also a problem as the inlining process wouldn’t terminate at compile time. The second option is linking where separate code is produced for each procedure and the caller evaluates arguments, stores them and transfers control to the callee’s entry point. The callee runs, stores the result and transfers control to the caller. A calling convention is needed to specify where to store arguments and return values during procedure calls. One option is using specified registers. Branch (BR) and jump (JMP) can implement procedure call and return. All procedure calls should use the same convention for accessing and storing variables. Some early languages such as fortran disallowed recursion.

Where do you store procedure calls’ input arguments, return addresses, and results. The procedure may also need local variables and space to save caller’s register values for registers that are overwritten. Each of these is specific to a particular activation of a procedure, stored in a procedure’s activation records. Space is allocated on a procedure call and freed afterward. For instance, calling factorial(3) would have 4 activation records in the recursive call at factorial(0) going back to 1 record once values are returned to factorial(3). Early compiler records realied records were allocated and deallocated in LIFO order and stacks were made with push and pop elements and access to the top element. C only needs to access the activation record of the currently existing procedure. Some languages support closures (JS) and continuation (Python’s yield statement) which require access to activation records even after procedures have completed and LIFO stack behavior isn’t sufficient, though this is outside the course scope. For a beta, a register is dedicated for the stack pointer, which builds up (toward higher addresses) on PUSH. The stack pointer points to the first unused location with addresses lower than it having been previously allocated. The stack discipline says the stack can be used at any time as long as it’s unchanged after a sequence of instructions. A large region of memory is allocated so the stack can grow without overwriting other storage. Different ISAs and programming languages have different stack conventions. PUSH, POP, ALLOCATE and DEALLOCATE macros are added to help support this on the Beta. Allocate and deallocate reserved or release storage for later use. To maintain stack discipline, ALLOCATE and DEALLOCATE should be used with corresponding PUSH and POP methods. Stacks can save values needed later.

A stack here is used to store a procedures allocation record and words are allocated to store local values and return addresses. The caller and callee will be responsible for allocating and deallocating storage. The caller evaluates argument expressions and saves their values in the activation record. Arguments are pushed in reverse order. The code compiled will be a sequence of expression instructions, followed by a PUSH to save the result on a stack. After the argument values have been pushed, a branch transfers control to the procedure’s entree point, saving a return address in a dedicated register. When the callee returns, all argument values are deallocated from the stack. The callee uses the stack of all of its storage needs and the activation record is sometimes referred to as a stack frame. First, the callee saves the return address back to the caller, then dedicates another register which is the base pointer, pointing to the stack frame. The caller then allocates words in the stack frame and saves registers used when executing its code. The saved values are used to restore register values just before returning to the caller, for the “callee saves” convention, where the callee guarantees register values across a procedure call. Dedicating a base pointer isn’t necessary, but helps with offsets. Arguments are pushed in reverse order to make finding their location from a point easier as well.

In summary, the caller pushes arguments on the stack in reverse order, branches to callee, putting the return address in LP, and removes arguments from the stack on return. The callee performs the promised computation, leaving a result in R0, branches to the return address and leaves the stacked data intact including stacked arguments, and leaves registers except R0 unchanged. Since a new frame is allocated for each recursive call and deallocated in inverse order as they return, recursion is possible. This also allows for a stack trace to give debugging information and see exactly what the program is doing. Dedicated registers in the Beta are R31 which is always 0 due to the ISA, R30 which is used for something else, R29 for SP, the stack pointer, R28 for LP, the linkage pointer, and R27 for BP, the base pointer.

Section 13 – Building the Beta

There are tradeoffs for designing a CPU. One goal for the beta could be maximum performance, measured in mips, millions of instructions per second. Another goal is minimum cost measured by the size of the circuit, and a third goal is the best performance/price ratio. In power-sensitive applications, MIPS per Watt is important, such as in mobile systems. The performance of a processor is inversely proportional to the time taken to run a program. Time per program is equal to the number of instructions per program times the cycles per instructions times the time per cycle. Performance is 1/time. To increase performance, we could reduce the number of instructions or the number of clock cycles needed per instruction (CPI). Minimizing propagation delay in hopes of a short clock period can help. This section focuses on a basic implementation which executes 1 instruction per clock cycle and has fairly long combinational paths and uses pipelining to increase the application’s throughput.

The data path will be built incrementally, building datapaths for each instruction class individually and merging them with MUXes, etc. First, logic is added for ALU, then Load and store instructions, jump and branch instructions and exceptions. The digital logic gates will be used such as multibit registers for state information, MUXes for selecting between alternative values. Computations will be performed by the black box ALU from part 1, taking in two operands and producing a result. Lastly, several different memory components implement main memory and register storage. The Beta ISA specifies 32 32-bit registers as part of the data phase, implemented as load-enabled registers which have an EN signal controlling when it’s loaded with a new value from the D input. If EN is 1, a new value is loaded from the D-input, otherwise it’s loaded with its current value. Adding enabling logic to the clock signal can cause glitches (“No gated clocks”). Multiplexers underneath the registers let you select a value from any of the 32 registers. Since two inputs are needed, there are two MUXes and their select inputs RA1 and RA2 function as addresses, determining which register values to select as operands. Lastly, a decoder determines which of the 32 register load-enables will be 1 based on a 5-bit WA input. This functionality is packaged in a component called a register file which has 2 read puts, which given a 5-bit address input deliver the selected register value on the read data ports. These ports act independently and can read from each other and can read from different registers or the same (if the addresses are the same). WA selects a register to be written with the 32-bit write data, WD, and if the write enable signal WE is 1 at the rising edge of the clock signal, the selected register is loaded with the write data. In the Beta ISA, register 31 should always produce a 0 value with internal logic. You can read and write an address in the same clock cycle, though the old value is read until the next clock edge.

First, work on the data path logic needed to work needed to execute ALU instructions with two register operands. Each instruction requires the same processing steps. Hardware needs to be able to FETCH (read) a 32-bit instruction from main memory for the current cycle, DECODE where the opcode field determines values for the data path control signals, READ operands (Ra, Rb) front the register file, EXECUTE the operation and WRITE-BACK the result to the Register File’s Rc field. The rising clock end marks the end of execution for one instruction and beginning of the next. Since one instruction is executed each clock cycle, the frequency of instructions says the rate instructions are executed. For the FETCH, the program counter stores the instruction address to be fetched. Then 4 is added to the PC, loading a new value at the end of the cycle. An adder performs the PC+3 instruction and puts it back as the next value of the PC. A MUX selects the initial value for the PC when its reset signal is 1. After the memory propagation delay, the instruction bits ID[3:0] are available and processing can begin. Some values can be used and others still need logic to compute. ALU instructions for two register operands use registers connected to their appropriate address inputs for the register file. The Ra and Rb fields provide addresses for the two read ports and Rc for the write port. Outputs of the read data ports are routed to the inputs of the ALU as two operands. ALUFN control signals tell the ALU what to run, determined by control logic from the 6-bit opcode field. If the control logic is a ROM and the bits are the address and control signals are the outputs. The ROM provides the correct signal for each opcode. The ALU’s output is routed to the WD port of the register file to be written into the Rc register at the end of the cycle. Another control signal, WERF (write enable register file) should have the value 1 to write into the Rc register. After the instruction is fetched, supplying Ra and Rb instruction fields, the Ra and Rb register values appear on the read data ports of the register file. The control logic decodes the opcode bits and supplies the appropriate ALUNC code. The ALU’s computation goes back to the register file to be written Rc and WERF needs to be set to write. A similar process is run for ALU operations with the constant, adding a MUX to determine if a constant is used or not.

The load and store instructions access the main memory, which has three ports, 2 read and 1 write. The address calculation is the same as the ADDC instruction, where the contents of Ra are added to the sign-extended 16-bit literal from the low order 16 bits of the instruction. The existing data path hardware computes the address. For LOAD, the output of the ALU is routed to the main memory as the address of the location we wish to access. After the propagation delay, the content of the location is returned by memory and needs to be routed back to the register file to be passed to Rc. The memory has two control signals, MOE (memory output enable) is set to 1 when you want to read a value from memory and MWE is set to one when the value on the memory’s WD port should be stored to the addressed memory location. Another MUX is added to select which value to write back to the register file, the output from the ALU or data from main memory. This is a 3 to 1 mux. A two-bit WDSEL chooses the source of the write back signal. Executing the LOAD instruction, the ALU operands are chosen the same as for ADDC and the ALU is instructed to perform an ADD by the ALUFIN. The ALU result is connected to the address port of main memory whose control signals are set for read operation. The WDSEL control signal is set to 2 to route the returning data to the register file. The execution of STORE is very similar with the complication that the value to be written to memory comes from the Rc register which’s instruction field isn’t connected to a register file read address. Since the Rb register address isn’t used by the STORE instruction since the second ALU instruction comes from the literal field, a MUX can enable the Rc field to be selected as the address for the register file’s second read address. The output from the second read data port is connected to the write data port of main memory. STORE is the only instruction which doesn’t write a result to the register file and the WERF control signal is 0 when executing STORE. Data flows by operands being selected and ALU performing address calculation and sending it to main memory. The Rc field is selected as the address for the second read port on the register and RD2 is written to the correct memory location at the end of the cycle. By setting MWR to 1, the main memory writes the WD data into the selected location at the end of the cycle.

Jumps and branches alter the value in the PC, taking the value in the Ra register and making it the next PC value. A PCSEL MUX lets control logic select the PC value’s source, when zero, the incremented value is chosen. When 2, the value of the Ra register is chosen. The instructions also cause the address of the following instruction to be written to Rc. For JMP, the data flow works by the output of the PC+4 adder being routed to the register file and WERF is set to 1 to let it be written at the end of the cycle. Meanwhile, the value of Ra coming out of the register file is connected to the 2-input of PCSEL and setting PCSEL to 2 will select it as the next value. The branch instruction (BEQ/BNE) requires an additional adder to compute the target address, adding the scaled offset from the instructions liter field to current PC+4 value. The output of the offset adder is the 1 input to the PCSEL MUX where, if taken, will become the next value of the PC. Multiplying by 4 can be done by shifting the literal 2 bits to the left and sign extension requires replicating bit id 15. Implementing this requires no additional logic, though a 32-bit NOR gate can test the value of Ra, outputting a Z value of 1 to determine whether to branch or not and choose the value of PCSEL.

The last instruction introduced is LDR, the load relative instruction. This is similar to LOAD, but the memory address is taken from the branch offset adder. This is used for accessing large constants stored in main memory that are two large for 16-bits. An LDR instruction refers to a nearby C1 with a reference to the value in memory. The storage location needs to be placed so it isn’t treated as an instruction. An ASEL mux is added to route the output of the offset adder to the main memory address port, to select either the Ra register value or output of the offset adder (when ASEL is 1). For LDR, ASEL is 1 and the ALU is asked to compute the boolean operation A, the boolean function whose value is the first operand. This appears on the output connected to the main memory’s address port and the rest of the procedure occurs the same as LOAD.

What should happen if an instruction can’t be executed, such as when an illegal opcode is used or non-existent memory value is accessed, or division by zero is attempted? External events such as inputs and outputs can be unanticipated. Hardware is added to treat exceptions, such as forced procedure calls to special code, saving the PC+4 value of the interrupted program so the process can resume. Exception hardware is the key to interfacing running programs to the operating system and letting the OS handle external events without the program needing awareness. The plan is to interrupt the running program, invoking an exception handler (like a procedure) call, and potentially return to continue execution. Exceptions and interrupts are similar, with exceptions usually referring to synchronous events generated by the program. If a program is rerun, the same exception will occur. Interrupts refer to asynchronous events from I/O devices such as keystrokes or packets being received. The implementation for both exception types is the same, and a branch will be taken to 0x4 for synchronous events and 0x8 for asynchronous events. The PC+4 value is stored to R30, called XP. User programs can’t use this register normally. To handle exceptions, a MUX controlled by the WASEL signal choses the writeback address for the register file. When 1, writeback occurs to R30. The PCSEL’s inputs 3 and 4 are set to the constant addresses for exceptions, 0x4 and 0x8 respectively. The control flow for an exception works as follows. The PC+4 value for the interrupted instruction is routed through the WDSEL mux to be written to the XP register. The control logic choses 3 or 4 for the next instruction. The remaining signals are set to “don’t care” values. The interrupted instruction hasn’t been executed and 4 will need to be subtracted from the value in the XP register to return to it. All of the logic for the data path is shown in a diagram at the end of the lecture. This hardware on a modern integrated circuit could occupy 2 square millimeters, while a modern Intel processor can be 600 square millimeters.

Section 14 – Caches and the Memory Hierarchy

In the circuit from last week, the main memory is the most expensive part of the program. The performance of most modern computers is limited by the bandwidth between CPU and memory, known as the memory bottleneck. This lecture focuses on understanding the bottleneck and mitigating against it. Various memory technologies have different tradeoffs. Registers are build from sequential logic and provide very low latency access to 1000s of bits of data in 20 picoseconds). Static and dynamic RAMs offer large capacities at the cost of longer access latencies. SRAMs provide low latencies to many thousands of locations, around 10 KB to 10 MB in 1 to 10 nanoseconds. DRAM provides access to around 10 GB of data in 80 ns. More locations means longer latencies. INcreasing the number of bits increases the area needed for memory circuitry, leading to longer signal lines and slower circuit performance due to increased capacitive loads. DRAMs are optimized for capacity and low cost, sacrificing access latency. This lecture tries to get low average latency and high capacity using a hybrid system with DRAMs and SRAMs. The key word is “average”. Flash memory and hard disks provide non-volatile memory, retained after they are powered off. Hard disks provide around a terabyte of storage at little cost, but with a latency of 10ms. Flash memories store around 100GB, with access in 100 us. Hard disks and flash memories are used together to provide a hybrid system with improved latency and high capacity.

SRAMs are an array of memory locations where a memory access reads or writes all bits in a single location. This looks like a system of rows and columns. Wordlines are horizontal lines and bitlines are vertical with two per cell. A bit cell has two bitlines on its left and right and a word line above and below it. To access the SRAM, provide enough address bits to uniquely specify the location. For an 8×6 array (8 wordlines and 6 pairs of bitlines), 3 address lines are needed to select 1 of 8 memory locations. The address decoder logic sets one of the 8-word lines high to enable a row location for the upcoming access. Remaining wordlines are set to low and the active worldline enables each of the SRAM bit cells on it, connecting each cell to a pair of bit lines (vertical lines in the array. During reads, bit lines carry analog signals from the enabled bit cells to sense amplifiers which convert analog signals to digital data. During writes, incoming data is driven onto the bit lines to be stored into enabled bit cells. Larger SRAMs have a more complex organization to minimize length and capacitance of the bit lines. Bit cells are the heart of the SRAM, which have two CMOS inverters in a positive feedback loop creating a bistable storage element, storing a 1 or 0 bit. As long as there’s power the noise immunity of the inverters ensure the value is maintained despite noise. Both sides of the loop are connected with access FETs to the word line below. If the worldline is high, FETs are on and when low, they’re off. The bitstable feedback loop will still maintain its value as long as there’s power. During a read, drivers recharge all bitlines to Vdd (a logical 1) and disconnect leaving them floating at 1. The address decoder activates one worldline and then each cell in the activated word slowly pulls down one of the bitlines to GND (ground/0). Sense amplifiers quickly detect the voltage difference between bitlines and produce the correct output data. Write operations drive the bitlines to the desired values (Vdd and GND for 1, GND and Vdd for 0). The address decoder activates one worldline and each cell in the word is overpowered by the drivers, storing a value. This requires balancing MOSFET sizes. SRAMs can have additional read and write ports by adding a set of wordlines and bitlines per port. For N ports, N wordlines are needed, 2*N bitlines and 2*N access FETs are required. The wire often dominates the O(N²) area. Reads and writes are analog operations.

What’s the minimum number of MOSFETs needed to store one bit of information? At least 1 MOSFET serves as the access FET to select affected bits. A simple capacitor can be used by storage and the value set by voltage threshold. The charge on the capacitor which determines speed and reliability of reading the stored value is proportional to capacitance which can be increased by increasing the dielectric constant of the insulating layer between the two plates (e), increasing the area of the plates (A), or decreasing the distance between them (d). Capacitance=eA/d. Googling trench capacitors will give current materials in the capacitors. THe resulting circuit is 20 times smaller area than an SRAM cell, which is denser and cheaper, but the capacitor can leak charge and must be refreshed periodically (every 10ms or so). For DRAM writes, drive the bitline to Vdd or GND, activate the worldline and charge or discharge the capacitor. For reads, precharge the bitline an intermediate voltage such as Vdd/2, activate the worldline, and the charge and bitline share charge. If the capacitor was discharged, the bitline voltage decreases. If the capacitor was charged, the bitline voltage increases slightly. Sense amplifiers detect this small change and produce a 0 or 1. Since operations wipe out the stored information, it must be rewritten to the cell. The first access to a row has a high latency and subsequent accesses to the row have low latency. DRAMs are 2 to 10 times slower than SRAM accesses.

Non-volatile memories maintain system state even after a system is powered down. For flash memory, long term storage is achieved by storing charge on a well-insulated conductor called a floating gate, which will remain stable for years. The floating gate is incorporated in a standard MOSFET between the gate and channel. If no charge is stored on the floating gate, the MOSFET can be turned on by placing a voltage on the gate terminal creating an inversion layer converting the source and drain. If there is a charge on the floating gate, a higher voltage is required to turn on the MOSFET. Setting the gate terminal to a voltage between the two, you can see if the floating gate is charged by seeing if the MOSFET is conducting. If you can measure the current flowing through the MOSFET, you can detect how much charge is stored in the floating gate, making it possible to store multiple bits of information in one flash cell by varying the floating gate’s charge. Flash cells can be connected in parallel or series to form circuits similar to CMOS NOR and NAND gates, allowing for a variety of access architectures for random or sequential access. Flash memories are very dense, approaching the density of DRAMs. Read access times for NOR are several 10ths of nanoseconds while read times for NAND are around 10 microseconds. Write times are long as high voltages are required to let electrons cross the insulating layer around the floating gate. After some number of writes, the layer will be damaged, with guaranteed writes between 100,000 and a million. Flash chips contain clever address mapping algorithms to map writes to the same address to different flash cells on each successive write. Flash memories are a higher cost, but higher performance replacement for the hard disk drive.

The hard disk drive contains magnetic rotating platters and a read/write head positioned above it which can detect or change the orientation of the magnetization of the magnetic material below. The read/write head is mounted on an actuator allowing it to be positioned over different circular tracks. To read a particular sector of data the head must be radially positioned above the correct track and needs to wait for the platter to rotate to the correct sector. This can take around 10 milliseconds, but once the position is correct, data can be transferred at around 100MB/s per second. For random read/writes, it takes around 100KB/s. Hard disk drives provide cost effective non-volatile storage for terabytes of data at the cost of slow access times. In the past decade, flash memories have helped fill the performance gap between processor speeds and hard disk drives, but the gap between processor cycle times and DRAM times has continued to widen.

How can a system appear to have a large, fast and cheap memory. Using a hierarchical system of memories can emulate this, using faster memories for more frequently accessed data. A first approach is to expose the hierarchy to a programmer, letting them move data between memories. There would only be a small amount of the fastest memory. This approach has been implemented to some success, such as in Vector Machines. The second approach is to hide the hierarchy, telling the programmer they have large memory to use as needed, with the machine storing data according to usage patterns, with more frequently accessed data (such as in loops) put into faster memory. This would all be transparent to the programmer. When running general purpose programs, it’s possible to build this memory system which appears as one large uniform memory.

How can the memory system put the right data in the right place at the right time. How can it predict which data will be accessed, and the cost of the move amortized by many accesses? Any block in SRAM should be accessed many times. Data not in SRAM would be in DRAM and DRAM access would be infrequent (ideally only to move to SRAM). Looking at how programs access memory, accurate predictions can be made about which memory locations will be accessed. The guiding principle is locality of reference which says if there’s an access to address X at time t, it implies that access to address X+deltaX at time t+deltat becomes more probable as deltaX and deltat approach zero. The SRAM component of the hierarchical memory system is a cache which has fast access to recently accessed data. If requested data is in the cache, it’s a hit and data is supplied by the SRAM, otherwise it’s a miss and data is moved from DRAM into the cache. The locality principle says we should expect hits to occur more frequently. Modern computers use multiple levels of SRAM caches, with smaller and faster caches closer to the CPU. A miss at one level generates an access to the next, until a DRAM access is used. Hit and miss ratios tell us the fraction of accesses that are hits or misses and sum to one. These can be used to find the Average Memory Access Time, which is the hit time plus the product of the miss ratio and miss penalty, with the goal of improving AMAT. This can be applied recursively to multi-level hierarchies. In a simulation of the spec CPU 2000 benchmark, the hit ratio for a standard size level 1 cache was 97.5% over trillions of accesses.

A cache is a list of cache lines which are data blocks and associated tags. A request from CPU searches for a block with a matching address tag. When found, there’s a hit, for a read the data from the matching line is returned, and on a write, the data on the line is changed and at some point, the corresponding location in main memory is updated. Otherwise, there’s a miss and a cache line will be chosen to store the requested data (possibly replacing old cache data). For read, the data in main memory is read and added to the cache and returned to the CPU. On a write, the tag and data in the cache line is written to, along with the main memory (at some point). The simplest cache hardware is an SRAM with a few additional pieces of logic. This is called a direct mapped cache and each memory location in the CPU’s address space maps to a particular cache line. There are many more memory locations than cache lines, so many addresses are mapped to the same line and the cache will only be able to hold one piece of data at a time. The operation is straightforward and a part of the incoming is used as an index to select a line to be searched. The search consists of comparing the rest of the incoming address to the address tag of the selected line. If they match, there’s a hit and the data in the line satisfies a request. An additional valid bit is set on each line and is set to 0, initially. This is set to 1 when the tag and data are filled. The CPU can mark cache lines as invalid to know to flush data. The complete address of the cache location is formed by combining the tag and index bits of the cache line. This simple search is sufficient to get good cache-hit ratios for the data we care about.

The direct mapped cache can be tweaked to take advantage of locality by increasing block size. If there’s a high probability of accessing nearby words, why not fetch a larger number of words, trading the increased cost of the miss with an increased probability of future hits. Larger block sizes fetch more words on a miss and the miss penalty grows linearly with increasing block size. The miss ratio is also decreased. There should be a sufficient number of blocks to hold accesses in the working set. There’s an optimum block size to minimize miss ratio and increasing the block size past that is counter productive. The formula for average memory access time can be used to choose the block size giving the best value for it. A common block size in modern processors is 64 bytes or 16 words. Direct mapped caches have an achilles heel in conflict misses where it’s possible for the first data item and instruction to be mapped to the same cache line meaning only one can be in the cache at a time. Fetching the first instruction fills the cache line, but then the first data access misses and fills the line with data. The next time through the loop, the first instruction will no longer be there and the cache will never contain the word requested by the CPU. The cache hit ratio can drop from 100 percent to 0 without architectural changes for avoiding this problem.

A fully associative cache has a tag comparator for each cache line. The tag field of every cache line is compared with the tag of the incoming address. All lines are searched and a particular memory location can be held at any cache line. Fully associative caches are flexible with high hit ratios, but are expensive in circuitry required. A N-way set-associate cache is a compromise between direct-mapped and fully associative caches. These are just N direct mapped subcaches operating in parallel each of which compare the tag field of the incoming address with the tag field of the cache line selected by the index bits of the incoming address. The N cache lines searched form a search set and the desired location can be held in any member. An N-way set-associative cache can accommodate up to N blocks whose addresses map to the same cache index so access to up to N blocks with conflicting addresses can still be accommodated without misses. This also allows for lots of cache lines with only the cost of N tag comparators. Rows are referred to as sets and columns are referred to as ways. For most programs, an 8-way set associative cache with a large number of sets will perform on par with the more expensive fully associative cache. Associativity implies choice, how do you choose to replace the contents of a cache line to minimize the impact on the hit ratio in the future. Ideally, you could replace the block the furthest in the future, but that requires knowing the future. If a block hasn’t been recently used, it’s less likely to be used in the near future. This leads to the LRU, least recently used policy, replacing the block accessed furthest in the past. LRU works well in practice, but requires keeping a list ordered by last use for each set of cache lines that needs to be updated on each cache access. Caches usually use approximations with simpler to compute update functions. Other policies are FIFO, where the lead recently replaced line is replaced or random (not very good), but less exploitable by adversaries. LRU is generally a reasonable choice.

The last cache design decision to make is how to handle main memory writes. The write could be done immediately, called write-through, where the cache immediately writes its data to main memory. If many updates are made quickly to the cache, the main memory would only actually need the last one. Write-through is simple, but is slow and wastes bandwidth as the CPU is stalled on each write, though the main memory will always hold the current contents. In a write-behind strategy, CPU writes are cached and writes to main memory are buffered. The CPU keeps executing while writes are completed in the background, overlapping main memory writes with the program execution. If there’s a cache miss while the write is still pending, everything will have to wait until both the write and refill read finish, as the CPU can’t proceed until the cache miss is resolved. This is faster, but still uses a lot of bandwidth. The best strategy is called write-back. The contents of the cache are updated and CPU continues execution immediately. Updated cache values are only written to main memory when the cache line is chosen as the replacement line for cache miss, minimizing main memory writes. This is the strategy used by most modern processors. When replacing a cache line this way, it must be written to the main memory. Using a “dirty bit” on each cache line which is set to 0 on a cache miss. If a write changes the data in the cache, the dirty bit is set to 1 and when a replacement is selected, it only needs to be written to main memory if the dirty bit is 1. Cache tradeoffs can be tested by simulating programs.

Section 15 – Pipelining the Beta

This lecture uses pipelining to try and improve the Beta. One way of analyzing the performance of a system is by saying the time per program equals the instructions per program times the cycles per instruction times the time per cycle for a single-cycle beta performance. The cycles per instruction and clock period are under the control of a CPU designer. CPI was 1 since every instruction was executable in a single clock period. t_CLK is the longest path for any instruction and is low and inflexible here. Instead of forcing every instruction to run in a single clock cycle, simple ones can run in one cycle and more complex can take multiple. Pipelining can help execute all instructions with a short clock period, dividing the execution of an instruction into a sequence of steps. Each step is performed in successive stages of the pipeline and it takes multiple cycles to execute an instruction, but the clock period can be shorter and CPU throughput can be higher. CPU throughput is improved by overlapping consecutive instructions. Any time, there will be multiple instructions in a CPU, each at different stages. The last step of an instruction in each clock cycle will give an instruction throughput of 1 per clock cycle.

Pipelining can be done in many ways, with this lecture focusing on the classic 5-stage pipeline, widely used in integrated circuit CPU designs in the 1980s. The five stages correspond to the steps of executing the instruction in a von Neumann stored program calculator. The first step, IF is the instruction fetch stage which maintains the program counter, fetches an instruction and passes it to RF. RF is the register file stage which reads the source operands from the register file and passes them to the ALU stage which performs the requested operation. The result is passed to MEM, the memory stage, performs the second access to main memory to read and right data for load, load relative or store instructions. For LD, the output is the value from the main memory, otherwise it passes the ALU result to the output. Lastly, the write-back stage, WB, writes the result into the register file.

Problems can arise when execution of an instruction depends on the results of a previous instruction. A data hazard arises when data is required and control hazard occurs when a branch, jump or exception changes the order of execution and the instruction it depends on is also in the pipeline (the earlier instruction hasn’t finished executing). To fix, design a 5-stage pipeline that works with sequences of independent instructions, then handle data hazards and control hazards. For a simplification of the beta, the next program counter is always PC+4 for now and the register file is duplicated to allow different stages not to rely on this. This lets it be easily split into 5 stages and allows for overlapping. Here, the IF stage contains the PC and main memory interface for fetching instructions, RF has the register file and operand multiplexers, ALU uses operands to compute a result, MEM handles memory access for load and store and WB writes back to the register file. Each stage does one instruction per clock cycle. Data accesses to the main memory span 2 clock cycles. Instruction contents are propagated through the pipeline in instruction registers (IR) and control signals for each stage are generated from a corresponding instruction register. Pipeline diagrams show a row for each stage and column per cycle. Register reads happen in the RF stage and writes at the end of WB.

Three strategies for fixing hazards have different tradeoffs. The first is to stall and wait for the result to be available by freezing earlier pipeline stages. Stalling always works, but has a negative impact on throughput and stalling for too long can lose the performance advantages of pipelined execution. The second strategy is to bypass or forward, routing data to the earlier pipeline stage as soon as it is calculated. This can avoid stalling in most data hazards. The third strategy is speculation where you guess a value and continue exciting. If you guess incorrectly, backup execution and restart with the correct value. Only speculate if you can make accurate guesses. Speculation works well on control hazards. Comparing register values can help know when to stall and a stall MUX can be added. Bypassing can be done with a many input MUX to select an appropriate value from other pipeline stages. One data hazard is a load-to-use hazard which will require a stall even with bypassing. More pipeline stages means more frequent data hazard, meaning lower t_CLK and higher CPI. Compilers can rearrange code to put dependent instructions farther away to help this process if independent instructions can be moved around. For speculating on control hazards, guessing the next instruction in PC+4 wastes a cycle on branches and jumps and for deeper pipelines, taken branches waste more cycles. Modern CPUs dynamically predict the outcome of control-flow instructions, predicting both the branch condition and a target and works well as branches have repeated behavior, such as taking loop branches.

Exceptions are implicit branches that cause control flow hazards. It’s important to identify which instruction was affected, that earlier instructions are completed correctly and that the affected and following instructions are annulled. To handle exceptions in stage i, turn the instruction into a BNE(R31, 0, XP) instruction to save PC+4. Annul instructions in stages i-1 to 1, flushing the pipeline, and set PC to IllOp or XAdr. OVerall, stall is added to IF and RF and the options to add NOP (no operations) into those stages. For control hazards, PC+4 is speculated with an option for jumps and taken branches, to null the assumption in the IF stage. Pipelined execution should give the same result as unpipelined execution.

Section 16 – Virtual Memory

Where data in main memory comes from and the process of filling main memory is the topic of this lecture. Flash drives and hard disks are secondary storage, where data is stored until the computer is booted up, moving data into primary storage (main memory). Main memory can be viewed as a higher level of cache for the permanent secondary storage. Today focuses on building a virtual memory system, moving data from secondary into main memory and controlling what data can be accessed by a program, to help many programs run on a single CPU. One problem with the memory hierarchy is that disk access times are 100,000 times longer than DRAM and the additional access time for a contiguous block of words is small compared to the first word in DRAM and even faster for disk. Though a hard disk can store a lot more data, it’s slow access time means we should design a virtual memory system that minimizes misses when accessing main memory. The miss rate will need to be very small compared to the rate of executing instructions. High associativity and flexibility in how disk data can be located in main memory is required. Unnecessary collisions should be avoided. Large block sizes will avoid too many disk accesses and amortizing the costs of a miss over multiple words. A write back strategy should only update the contents of the disk when changed data in main memory needs to be replaced with secondary storage. The organization of main memory and secondary storage access can be handled in software. Hit’s should be handled in hardware and misses in software for more flexibility.

There are two kinds of memory addresses. CPU uses virtual addresses and main memory uses physical addresses. A piece of hardware called the memory management unit translates virtual addresses (VAs) to physical addresses (PAs). Both an MMU and a cache can be used, but the example here assumes no cache. An OS managed table called the page map is used to translate addresses. The MMU uses the virtual address to select an index to select an entry in the table to lookup a corresponding physical address anywhere in main memory, though normally two virtual addresses shouldn’t map to the same physical address. It would be okay if some virtual addresses didn’t map to a physical address as they may not have been loaded into main memory and the MMU could signal an exception to the CPU which could assign a location in physical memory and initialize the location from secondary storage. The MMU table gives the system a lot of control for how physical memory is accessed by a program. Multiple programs could be run in quick succession (time sharing) by changing the page map when changing programs. Only the current working set of the program needs to live in main memory with other variables being able to be in secondary storage.

The MMU plays a central role in the design of a modern time sharing computer system. To separately map each virtual address to a physical one, an impossibly large table would be needed. Instead, the virtual and physical address spaces are divided into fixed size blocks called pages. Page sizes are a power of 2 bytes (such as 2^P. P is the number of address bits needed to select a location on the page. For the address, the low order P bits are the offset and remaining are the page number (page being accessed). Both virtual addresses and physical addresses are separated into page numbers and offsets this way. The MMU manages pages and not locations. Entire pages will be moved and with the principle of locality, nearby locations will be expected to be accessed. Choosing the offset from low order bits, nearby locations will live on the same page unless near a page end. Pages naturally capture the notion of locality and reading or writing many locations is only slightly more time consuming that just the first. The MMU maps virtual to physical pages, using the page map to perform translation. If the requested virtual page isn’t in main memory, a page fault exception is sent so the CPU can load it from secondary storage. Main memory is used as a page cache in a process called paging or demand paging. Initially all virtual pages are in secondary storage and MMU is empty. As the program begins running, each VA is mapped to a PA. If referencing a “RAM-resident” page, the RAM is accessed by hardware. If not found, there’s a page fault and the software handler fetches the page from disk to RAM and adjusts the MMU to map the newly loaded virtual page to a physical page in RAM. If the RAM is full it may have to replace a little-used page to free up space. The working set (pages currently being accessed by the program) is loaded by a series of page faults and the frequency of page faults drops. A simple page map has one entry for each virtual page each with a resident bit R, set to 1 if the page is in RAM, or 0 if not. There’s a page fault if R is 0 and a physical page number is stored for each resident bit. A dirty bit, D is 1 if it’s been changed since loading it from disk (and needs to be written when replaced). Additional bits, such as a “read-only” bit could be used.

The three architectural parameters characterizing a virtual memory system are P (the number of offset bits in both types of addresses), m, the number of address bits for the physical page number, and v, the number of address bits for a virtual page number. A typical page size is 4 to 16 kilobytes. The size of the virtual address is determined by the instruction set architecture. 32-bit architectures support 4 GB where 64 support 16EB (exabyte) virtual address spaces. The size of physical addresses is between 30 and 40 bits from a gigabyte to a terabyte, though this will be abstracted from the typical programmer. A portion of main memory can be used for a page map, using a register called the page map pointer to hold the address of the map in main memory. The downside is that it takes 2 accesses to main memory to retrieve a virtual address, with the first looking up the page map. A special purpose cache can be used to cache the page map entries. Most systems use a TLB, translation look-asside buffer mapping virtual page numbers to physical page numbers, which is small and fast, and usually fully associative. If the physical page number is found using the TLB, the lookup to main memory for the page table entry can be avoided and we’re back to a single physical access for each virtual access. TLB’s have a hit ratio often higher than 99 percent. Locations are cached in the TLB if they need to be looked up after not being found.

A page map provides the “context” for interpreting virtual addresses. A context is a mapping of virtual to physical locations dictated by the contents of the page map. Several programs can be loaded into main memory each in its own separate contexts. When switching programs, contexts must be switched by reloading the page map. In a timesharing system, the CPU will switch from one program to another, giving the illusion that programs are running on their own virtual machine by switching contexts. The operating system is a privileged set of code which manages the sharing of one physical processor in main memory among many programs each with their CPU state and virtual address space. The OS effectively creates many virtual machines and correographs their execution using a single set of shared physical resources. The OS runs in a special OS context called the kernel and contains necessary exception handlers and time sharing support, and is allowed to access physical locations as it needs to. Exceptions in programs cause the hardware to switch to the kernel context, known as kernel mode, where the OS has access to many hardware registers such as I/O devices and MMU state. After the exception process runs, the program switches back to the user context or “user mode.” Users must request for the OS to make changes via the Kernel. Applications are written as if they have access to the entire virtual address space, while the OS kernel controls contexts and prevents collisions between different program’s memories. A standard plan for organizing the virtual address space for an application sets, with the first virtual page being inaccessible helping to catch errors involving references to initialized (ex. zero-valued) pointers. Some number of read only pages hold the application’s code and code from shared libraries it uses. Making code pages read-only avoids hard to find bugs where the program is changed. Read-write pages hold the application’s statically allocated data structures and the rest of the space is divided into the stack and heap which can grow. The stack holds procedure activation records on the lower end of the address space as it grows toward higher addresses. The heap grows down and is used when dynamically allocating storage for long-lived data structures. The page fault handler can allocate new pages as needed, though the stack and heap can’t meet as the address space would be out of virtual memory.

With many applications, one permanently resident page can hold a page directory with entrance to partial page tables in virtual memory. Normally when changing contexts, the OS would reload the page table pointer to point to the appropriate page table or directory. The OS would also have to invalidate entries in the TLB cache and impact the hit ratio hurting average access time until the TLB is refilled. To reduce the impact of context switches, some MMUs use a context number register whose contents are concatenated with the virtual page number to form the query to the TLB, meaning the tag field expands to include the context number provided. To switch contexts, update the context number and page table pointer registers and you don’t have to flush the TLB as each entry’s tag indules the context number. Lastly, to use both a cache and MMU in the memory system, you could put the cache between the CPU and MMU for a virtually-addressed cache, which is fast as there’s no MMU time on a hit. The problem is that you must flush the cache after a context switch, leading to a large cache miss ratio until the cache is refilled. The fix and other alternative is to cache physical addresses by placing the cache between the MMU and main memory. This avoids stale cache data after context switch, but has a slow MMU time on a hit. To get the best of both worlds, you don’t have to wait for the MMU to finish before starting the access to the cache. The cache needs the line number from the virtual address to fetch the appropriate line. If the address bits for the line number are completely contained in the page offset of the virtual address, these bits are unaffected by the MMU translation and the cache lookup can happen in parallel. Once the lookup is complete, the tag field of the cache line can be compared with the physical address from the MMU. If there was a TLB hit, both should be available about the same time, causing no impact on average memory access time. To increase the capacity of the cache, increase associativity to increase capacity without affecting address bits used for the line number.

Section 17 – Virtualizing the Processor

A process is an abstraction that captures the notion of a running programming encompassing all needed resources such as CPU, MMU, virtual I/O devices, PC, stack etc. Each process has a state including the context, machine state and registers. The OS kernel is a special privileged process in its own context that manages the execution of other processes and handles real I/O devices, emulating virtual I/O devices for each process. This manages reading from files and handling network connections and user inputs. To switch between processes, the OS needs to capture the entire state of the user mode process. Each process needs to seem to run in its own separate virtual machine. Physical machines include CPU, memory, a timer, disk, I/O connections and monitors, keyboards and mice for UI. The physical machine is managed by the OS running in the privileged kernel context, which handles low level interfaces to the devices and manages the contexts. The OS creates the VM seen by each process. User mode programs run on the physical processor, but can be interrupted by the timer, giving the OS the ability to save the current process state and move to running the next one. Via the MMU, the OS provides each process an independent virtual address space, isolated from the actions of other processes. The virtual peripherals isolate the process of sharing details from other processes, such as a window isolating pixels. Each process accesses a stream of generated I/O events instead of devices directly. The OS also provides streams of bytes for files and sockets and the process communicates with the OS using supervisor calls or SVCs which invoke code in the OS kernel. A course on operating systems would go into detail on each of these. The OS provides an independent virtual machine for each process periodically switching between them. For example, switching from running process 0 to process 1, the process 0 execution will stop due to an explicit yield or a timer interrupt. Control is transferred to the OS code running in kernel mode and current PC+4 value is saved in the XP register. The OS saves the state (regs, context) of process 0 and loads process 1’s state. The OS uses a jump to return to process 1, just as it was when interrupted earlier. This interrupting of 1 process and returning to another is repeated many times. Time multiplexing of the CPU is called timesharing.

A key technology for timesharing is the periodic interrupt from the external timer device. External devices assert the Beta’s interrupt request input (IRQ). If the Beta is running in user mode, supervisor bit in the PC is 0, asserting IRQ=1 triggers the following. PCSEL is set to 4, WASEL is set to 1, WDSEL to 0 and WERF to 1 so PC+4 is written to the XP register and MWR is set to 0 to ensure if a store instruction is interrupted, it’s stopped correctly. Interrupt hardware is minimal, saving the PC+4 to the XP register and setting the PC to a predetermined counter depending on which external interrupt happened. The remainder of the work to handle the interrupt request is performed in software. The state of the interrupted process is stored in main memory in an OS data structure, UserMState. A procedure (usually written in C) is called to handle the exception and when the procedure returns, process data is reloaded from UserMState and the OS subtracts 4 from the value in XP and resumes execution with a jump. A problem can arise if a second interrupt overrides a first and interrupts are disabled when running in the OS.

The OS also handles operations with illegal instructions, or unimplemented user operations (UUO). These act like an interrupt caused directly by the CPU. The operation of the current instruction is suspended and the control signals are set to values to capture PC+4 in the XP register and set PC to “Illop.” Bit 31, the supervisor bit in the new PC is set to 1, so the OS handler has access to the kernel mode context. The OS should put debug information into a file and exit the process or resume the user code if possible. User mode programs need to communicate with the OS, but if running in a different context, they won’t have access to OS data. User programs should be able to call OS code at specific entry points using registers or user mode virtual memory to send and receive information. Supervisor calls would access an API, such as POSIX implemented by many UNIX-like operating systems. A way of transferring control to the OS is using illegal instructions, and adding an opcode field of one can signal a supervisor call with low order bits saying which service to access. The OS Illop handler is called, but the opcode field sends it to the appropriate handler, which can access the user’s registers in the temporary storage area or via subroutines, any user mode virtual address. If information is to be returned, it can be stored in the temporary storage registers which will be loaded back into the CPU registers when execution continues following the supervisor call. A real OS will have a variety of supervisor calls related to file access, network connections, etc. SVC’s can write messages, yield, wait, and more, and new handler’s can be added easily. SVCs provide controlled access to OS services and data values and offer atomic or uninterruptible execution of instruction sequences.

Section 18 – Devices and Interrupts

When the user types a key on the keyboard, the keyboard triggers an interrupt request to the CPU. The interrupt suspends execution of the currently running process and calls the handler for dealing with the particular I/O event. For the keyboard, the keyboard handler reads the character and stores it in a kernel buffer associated with the receiving process for keystrokes. When the handler exits, the interrupted process resumes. The buffer and kernel can only hold so many values before filling up. A user mode program will execute a ReadKey supervisor call which gets the next character from the buffer and puts it into R0. This is a blocking I/O request and execution can be suspended until there’s a character or return a flag saying there wasn’t one if the application is expected to handle that. The user mode program had no interaction with the keyboard, using an event driven approach instead. One problem with waiting for input before returning to the user program is that interrupts are disabled in supervisor mode, but subtracting 4 from the saved value of the XP register before returning can fix this by re-executing the SVC mode in a loop not disabling interrupts on ReadKey. Adding a call to the scheduler after setting the ReadKey supervisor call to be re-executed will suspend the current process and arrange for the next process to run when the handler exits. The round robin scheduler will eventually come back to the current process and ReadKey will try again. Better CPU utilization.

Time sharing can help spend cycles where they’ll be the most good. There’s a cost of context-switching overhead and scheduling but allows productive use of idle time of one process by running another. To avoid running processes waiting for I/O events that haven’t happened, processes can be active or waiting states with values indicating what value the process is waiting for and only have the scheduler run active processes. The UNIX OS has sleep and wakeup subroutines which take arguments. Once a process is marked as waiting, you can avoid retrying it until it’s active to avoid wasting CPU cycles.

Positive side effects of CPU virtualization are abstraction of machine resources, multiple processes executing concurrently, and better CPU utilization, while a negative side effect is that processing throughput is more variable. Currently, our approach to dealing with the asynchronous world is to separate event handling from event processing. This way, it’s difficult to meet hard deadlines. For instance, an automated car will need to guarantee it will break after a certain amount of time from a sensor event. Systems that can make such guarantees are real time systems. One measure of performance in a real time system is the interrupt latency, which is how much time can elapse between an interrupt and the start of its handler. Bad things can happen when a service is delayed beyond its dead line and the real time system should guarantee the actual latency is always less than the maximum allowable latency. These deadlines lead to hard real time constraints. While handling an interrupt, it takes time to save the state and context switch, and dispatch to the correct interrupt handler. When writing an OS, the amount of code in the setup phase of an interrupt handler can be minimized. The biggest problem is when another interrupt handler is running in kernel mode, as the latency will also include the cost of finishing the current interrupt handler. The goal is to bound and minimize interrupt latency, optimizing interrupt sequence context switching, making unbounded time instructions interruptible (state in registers, etc.) and avoiding or minimizing disable time. Interrupts may need to be allowed in kernel mode sometimes.

In an example of a real time system with three devices, ordering can cause significant latencies if longer running requests are prioritized. One way of handling this is to use a weak or non-preemptive priority system, where the task waiting with the highest priority starts after the current task finishes. One way to set priorities assumes each device has a service deadline D, the time until the next request to the same device if not otherwise specified. Earliest deadline is a strategy for assigning priorities, guaranteed to meet the deadlines if any priority assignment is capable of it. This works by sorting requests by deadlines, assigning the highest priority to the earliest deadline, second to the next, etc. A weak priority system chooses the pending request with the highest priority and earliest deadline. In a weak system, the current running task always completes first. A long running task will hurt the chances of completing other tasks with deadlines. In a strong priority system, lower priority tasks can be interrupted by higher priority requests. To implement a strong priority system in beta hardware, the supervisor bit in PC31 is replaced with a 3-bit PRI in PC31 to 29 indicating which of the 8 priority levels the processor is running at. The interrupt mechanism is changed to also specify the priority it was assigned and a priority encoder circuit is added to select the highest priority request and compare it to PRI. The system will only take the request if it’s higher than the current PRI, saving the current process to XP. Deadlines are more important for calculations than latencies for making sure everything gets done. It’s also important to make sure there are enough CPU cycles to complete tasks.

Section 19 – Concurrency and Synchronization

Computer programs often use multiple processes for concurrency, since some operations can be done in parallel. Applications can use processes for asynchrony, separating the handling of front end user inputs from the backend server functionality. Processes are a useful programming primitive and share information between each other in an often data or event driven way. Processes can communicate by sharing memory data by mapping the same physical page (if on the same system). To make coordination easier, synchronization instructions can be added. OS supervisor calls can also be used to pass messages between processes, which has more overhead than shared memory but can make the environment independent or whether processes run on the same processor. The classic example of the producer-consumer problem has a producer and a consumer process. The producer generates information generating information. The consumer runs in a loop waiting for the information from the producer and doing something with it. In this case, data can’t be consumed before being produced and the ith consumer process can’t be run before the ith producer process. The ith receive must precede the i+1th send. The producer and consumer here are tightly coupled which may not be optimal depending on how much time each process takes. A FIFO buffer can relax synchronization constraints where the producer adds to the buffer and the consumer reads from it. The producer can add to the buffer until it’s full and the consumer can read as long as the buffer isn’t empty.

A semaphore is an abstract data type with an integer greater than or equal to zero, proposed by Dijkstra. A semaphore is accessed by wait and signal operations. Wait waits until the semaphore has a value greater than 0, then it decrements the value by 1. Signal increments the value of the semaphore and one waiting process may be able to proceed. A sapphire guarantees that if initialised to K, signal(s)_i precedes wait(s)_i+K. Semaphores can also be viewed as a pool of shared resources where signal returns resources and wait borrows them. Different processes can call these wait and signal commands. Buffer overflow occurs when the buffer is maxed out and the producer overwrites it. A constraint is needed that the producer can’t send the i+n-th character until the consumer has read the i-th character, where N is the length of the FIFO buffer. Here, the producer consumes spaces in the buffer and produces spaces, where the consumer reads characters and produces spaces. Semaphores are used to track both resources and coordinate the consumer/producer process.

If two users try to remove 50 dollars from the same account on an ATM, depending on ordering you could both get 50 dollars while the balance is only decremented by 50 dollars. It’s important to be careful when writing concurrent programs, particularly when modifying shared data. Critical sections of code shouldn’t be able to overlap in mutual exclusion, where only one execution can execute at once. Semaphores can ensure that each runs atomically. A transaction can perform multiple reads and writes while ensuring no other process can change information while it’s running. A semaphore is used for a lock, which is incremented to 1 when a process finishes. The process calls wait to acquire the lock and signal to release it. Only one process can acquire the lock at a time and two executions can’t overlap. This can also be used to improve buffers to avoid overwriting with multiple producers. Semaphores are shared data themselves and implementing the wait and signal functions requires read, modify, write sequences that must be executed as critical sections. On a time-shared processor with an uninterruptible OS kernel, the supervisor call can be used for this. The ISA can be updated to include test and set instructions for implementing a simple lock semaphore which can be used to protect sections with more complex semaphore semantics. Lookup Dekker’s algorithm on wikipedia for an example of a complex algorithm. Many ISAs support a test-clear (TCLR) instruction which reads a memory location value and then clears it. This can be used to implement a single lock.

Special considerations need to be taken for a system which needs to acquire more than one lock. For instance, if a bank process needs to get locks for two accounts for a transfer, if two users are trying to make transfers between the same accounts, both could get a lock and be unsuccessful in getting the other. Both would wait forever in a deadlock. Conditions for a deadlock are mutual exclusion, that the process holds resources while waiting and that a resource can’t be removed from a process holding it, and circular waiting. The program can either be changed to avoid the problem altogether or to recover somehow. A global ordering can help so both customers try to acquire the same lock first.

Section 20 – System-level Communication

Computer systems are made up of components with implementations that can change with time. Technology comes and goes but interfaces are an abstraction that should outlast generations of technology. For instance, base a specification on an abstraction to hide implementation details. For instance, network interfaces that deliver bytes to named hosts hide the details of sockets, packets, etc. Graphic systems deliver images without the user having to worry about implementation. Interfaces should ensure improved technology isn’t a consistent interruption to processes. Successful systems are able to utilize previous code and TCP/IP is a technology which has lasted decades while it’s components have grown to accept more data. General purpose point to point communication channels have mostly replaced earlier synchronous multi-signal channels of earlier systems.

Most communication happens over wires and are treated as either being connected or not and viewed as equipotential nodes in a circuit from a circuit theorist’s point of view, and not worrying about distance allows them to be replaced and used the same. From an interconnect engineer’s view, they need to be viewed as transmission lines where distance and signals matter. The four parameters telling a wire’s behavior are the resistance of the conductor ®, self-inductance of the conductor (L), capacitance between signal and ground ©, and the current that leaks through the insulator (G). Analyzing what happens when you send a signal down the wire, a wire’s behavior can be described as a single transmission line component with impedance Z₀=sqrt(L/C) at high frequency. Voltage changes steps propagate down the wire at 1/sqrt(LC) m/s. Unless we’re careful, there can be energy left over from previous transitions that could corrupt the current one. While one solution may be to wait, that may not be allowable and energy storage effects should be minimized. Energy will reflect off any impedance continuity. Signals need enough time to reach valid logic levels. When a large voltage triggers oscillations in a wire (ringing) due to the wires inductance and capacitance, it will take time before there’s a reliable signal. Spreading the voltage step over a longer time can diminish this. By paying close attention to wiring design and information put on it, performance implications can be minimized. Storage is preserving information and sending it is communication. Communication takes time that must be budgeted for, putting upper bounds on propagating speeds, distances between components and speed we can change wire voltages at without triggering adverse effects. Timing models will have to account for wire delays. Long and heavily loaded outputs will be slower and a partial fix is to use buffers. Optimizing performance will require paying attention to loading issues.

How should a system design accommodate components users might want to add later. For years, there was a way to attach printed boards to the main motherboard. Power and clock signals timed the communication, address wires selected endpoints on the added card, data wires, and control wires for communications. These signals and more are the system bus which is a collection of wires used to transfer data with a predetermined protocol. The timing of the clock waveform gives enough time for signals to go down the bus and receive proper signals for all receivers. The bus master owns the bus and starts the transaction and buses generally are able to transfer ownership between components. The master sets the bus lines to the desired operation, chooses an address and the data to be sent. The recipient called the slave watches the bus line looking for its address at each sample edge and performs the requested operation signalling when complete, possibly returning over data wires to the master. The bus may look for situations where the slave doesn’t respond and signal for the master to take appropriate action. As system speeds increased, transaction speeds had to as well and issues could arrive by sending messages over the bus. Busses were eventually relegated to low speed communication tasks due to a few issues.

Network technologies connect computer systems separated over larger distances measured in meters. Bits were organized into packets containing the address of the destination and a checksum for detecting errors in transmission. Protocols could request retransmission for corrupted packets. Ethernet was devised in the mid-70’s by Bob Metcalf at Xerox as a bus for computer networking. The software controlling the network is divided into layers of modules each implementing a different communication abstraction. The lowest level is the physical layer that transfers and receives correct packets, discarding errors. There are different modules available for different types of physical networks. The network layer (above physical) deals with addressing and routing of packets, with clever routing algorithms finding the shortest communication path and dealing with outages on particular network links. The transport layer provides reliable communication of a stream of data, dealing with flow control to optimize network uses and limit packet loss due to network congestion. A big idea is the idea of building a reliable communication channel on a best efforts packet network. Higher protocol layers can recover from layers in the lower layer, which is better than trying to fix errors in each. The fastest and least problematic communication channels have a single driver communicating with a single receiver (point to point). Differential signalling is particularly useful, where the receiver measures the voltage difference across two wires as the same amount of noise will apply to both. Almost all high performance communication links use differential signaling and are point to point.

If sending digital data, we can recover timing information from the received signal if the nominal transmitter clock signal is known. The receiver can use transitions in the received waveform to receive timing information for some of the clock edges. The receiver can then use its knowledge to estimate the remaining clock edges, using a phase lock loop to generate a local facsimile of the transmitter clock, updating based on received transmissions. The transmitter adds a training sequence of bits at the front of the packet to ensure the receiver’s phase lock loop is synchronized before the packet itself is sent. The packet data is separated from the training data with a unique set of bits. To guarantee clock edges, the transmitter can take the stream of messages and re-encode them into a stream guaranteed to have transitions regardless of message bits. 8b10b is the most common guarantee which turns 8 bits into 10 bits and a transition occurs at least every 6 bit times (the receiver will have to reverse this). The receiver can get timing and data without having to send a separate clock signal. With this, networks have gone from shared communication channels to point to point links. Point to point links can be used with switches to route packets to their intended destination. Communication on each link is independent, so a network with many links can support a lot of communication bandwidth. With a small amount of buffering and switches, this is an effective networking strategy. More throughput can be added by using multiple point to point links in parallel as well.

The expansion strategy of a modern system still uses an adding card connected to the motherboard, but connecting to one or more point to point communication links. For a recent system, a CPU is connected directly to the memories and talks to all other components over the QuickPath Interconnect which has 20 differential signaling paths in each direction supporting up to 64 billion 20-bit transfers in each direction per second. PCI express is often a communication link between components on the motherboard and a single PCIe v2 lane transmits data at 5 gigabits per second using low voltage differential signaling over wires designed to have 100 ohm characteristic impedance. Data is organized in the same packet sequence, starting with the training bits, then a unique start sequence, the data payload and then the ending sequence. The data payload on the physical layer is organized as a sequence number, transaction layer payload and redundancy check sequence for validating data, which lets the data link layer know when a package is dropped and request a retry. The transaction layer reassembles the message from transaction layer payloads from all lanes and uses the header to identify the intended recipient. With 8-lanes, the maximum transfer rate is 4Gb/s, used in graphics cards. Knowledge of networking has reshaped component communication on the motherboard. Going from parallel busses to serial point-to-point links leads to smaller, better, and faster communications.

In addition to a 1D bus where each node is communicated with separately and a loop bus, where the output on one node is sent to the next in a loop, there are more common communication topologies. One common networking topology is a complete graph where every pair of nodes is connected with quadratic throughput and cost. A variant is the crossbar switch where a particular row and column can be connected to form links between particular A (row) and B (column) components. Assuming the first row and column connect to the same component, there are O(n) messages delivered each time unit with a quadratic cost. In mesh networks, components are connected to a fixed number of neighboring nodes. The cost and throughput are linear and the latency is O(sqrt(n)) for 2d meshes and O(cuberoot(n)) for 3d meshes. 2d 4-neighbor meshes are popular for experimental multicore processors. Hypercube and tree networks offer logarithmic latencies.

Section 21 – Parallel Processing

The running time of a program is the time per instruction times the CPI times the time per cycle. To increase running time, at least one of these terms needs to be decreased. Pipelining lowers time per cycle. The idea CPI is the number of cycles per instruction without stall. Stall can occur from data hazards, no ops, control hazards such as branching and exceptions, and memory latency for a cache miss. The classic 5-stage pipeline substantially reduces T_CLK with modest stall time. Adding stages should help to decrease the clock time, though each requires more NOPs for hazards. Each additional stage has some overhead time, having propagations, setup and hold times for pipeline registers, the potential for clock skew. Since the work can’t always be evenly divided, some time will be waited. This overhead is O and if the original clock time is T, with N stages, t_CLK is T/N+0. At the limit as N goes to infinity, the speedup approaches T/O and the overhead starts to dominate as the time on each stage becomes smaller. At some point, adding stages has no impact on the clock period. Executing in parallel can also help increase the ideal time at the risk of more stall time. It’s important to find instructions that can be executed in parallel or out of order, known as instruction level parallelism, with different low level commands having different rules. Adds and shifts and integer arithmetic can be done in parallel (with multiple adders). A lot of circuitry is required for making sure functions have their required operands and organizing execution results into the correct order. Generally, there’s an average speed up of 2 for an out of order superscalar processor over an in-order single issue processor. Increasing pipeline depth can cause CPI stall and overhead to rise beyond T_clk and further increases in depth are unlikely. It’s unlikely future improvements are substantial in out of order superscalar pipeline processors. Data and thread level parallelism are looked at instead for improvements.

Data sometimes naturally comes in vector or matrix form and it’s common that the same operation will be computed on each element. Special purpose vector processors perform instructions on each element in parallel, fetching data in big blocks and the specified register is loaded in one of the words from the block. A single memory access provides data for all paths in parallel. Executing one instruction on a machine with N data paths is the same as N instructions on a machine with 1 data path. This leads to a lot of parallelism without the complexities from out of order superscalar executions. Data paths can be provided with a local predicate flag if comparisons need to be made for data dependent operations. This can help avoid the cost of calculating incorrect branches. Most modern CPUs use vector extensions operating on 8, 16, 32 or 64 bit operands with parallelism baked into vector programs. These vector programs are commonly used for processing data streams and common in GPU’s which are designed for data parallelism front the ground up. Loading lots of graphics or big data requires this parallelism.

If applications have a lot of parallelism, using a larger number of simpler cores is more efficient. The optimal tradeoff between core cost and number of cores is hard to find as it does have some overhead. Many applications have some processes that can gain from parallelism and some that won’t. If f is the parallel portion of the task, then the overall speedup is 1/((1-f)+f/S) and as S goes to infinity, 1/(1-f). If you write a program that can do 90 percent of the work in parallel, but the other 10 percent is sequential, f=0.9, S=inf and speedup is 10. The best overall speedup is a factor of 10. Multi-core machines are most useful when the target task has lots of natural parallelism. Using multiple independent cores to execute a parallel level task is thread level parallelism where each core executes a separate thread. This can be more flexible than vector processing but more expensive. Shared memory is a common communication model for smaller processors with between 2 and 12 cores. With 10’s or more cores, shared memory overwhelms the memory bandwidth, opting for message passing instead, often with a nearest neighbor mesh network.

Multicores have multiple private caches for performance, implementing a write back strategy. Each core shares the contents of main memory and changes to one are visible to others. Sequential consistency is the notion that executing N parallel threads should correspond to an interleaved execution on a single processor. If this is implemented, then programmers can view the multicore system as providing hardware accelerated time sharing. Two cores shouldn’t be able to disagree about the value of a shared variable. If not reached, you can give up and require weak consistency where they only check that values were updated on each and not that they are the same. Out of order cores also have the complication since success of store instructions can complete in different order. A memory barrier instruction can ensure memory operations before the barrier completes before ones after. Each commercially available multicore system has its own guarantees about what happens when. There should be communication when the value of a shared variable changes and caches can communicate over a shared bus letting other caches know when a shared cached value changes, changing the state for each cache line to say its current data is invalid. If the cacheline’s state value is the only one set to exclusive, then that cache has the only correct value of the data and it’s the same as the main memory. The state can be modified, exclusive, shared or invalid and handled accordingly. At the end of a read request, if there are multiple copies of the cache line, all will be in the shared state. If only one copy, it will be in the exclusive state. This is part of the MESI cache coherence protocol.

Conclusion

There’s definitely a lot to take in and come back to with this course. It’s also an introductory course touching on a lot of specific fields that I may or may not go further with. While I was somewhat familiar with some of the concepts, this was definitely a course that had almost entirely new information for me. I’ve been curious about chip design for a while and seeing concepts I was familiar with in software applied to hardware design was interesting.

It was also good to see everything from previous courses come together, as the concepts from the second physics course were brought into computers. Physics 1 was needed to understand that course and calculus was heavily used in both. Going forward there are only 2 more courses left in this process, but my learning will continue past that. Computer Systems Engineering is a course that has this one as a prerequisite and I’m nervous yet excited to see what that contains, considering the whirlwind of information that this one was.

NikCreate

Leave a ReplyCancel reply

From Linux to Lambda: The Evolutions of Compute – Containerization

NickLiftWeight: A Tracker for Weight Lifting

Nick Ate: Food Logging

Trending