Please type or hand-write your answers and upload to Blackboard. The problems are from the Hennessy and Patterson book, 5th edition.

Problem 1 [5/5/5/5 points]

We begin with a computer implemented in a single-cycle implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle time of 7 ns. After the stages were split, the measured times were IF, 1 ns; ID, 1.5 ns; EX, 1 ns; MEM, 2 ns; and WB, 1.5 ns. The pipeline register delay is 0.1 ns.

(a) What is the clock cycle time of the 5-stage pipelined machine?

Solution: The clock cycle time of the 5-stage pipelined machine is 2.1 ns: the time taken by the slowest pipeline stage (MEM, 2 ns) plus the pipeline register overhead (0.1 ns).

(b) If there is a stall every 4 instructions, what is the CPI of the new machine?

Solution: Without stalls the CPI would be 1 cycle per instruction. With one stall every 4 instructions, it takes 5 cycles to complete 4 instructions. Therefore, CPI = 5/4 = 1.25.

(c) What is the speedup of the pipelined machine over the single-cycle machine?

Solution: To calculate the speedup, we compare the execution times of the unpipelined and pipelined processors. The execution time is given by:

Exe time = I × CPI × Cycle time

Substituting the values, we get:

Speedup = (I × 1 × 7) / (I × 1.25 × 2.1) = 2.67    (1)

(d) If the pipelined machine had an infinite number of stages, what would its speedup be over the single-cycle machine?

Solution: With an infinite number of pipeline stages, the clock cycle time approaches the register overhead alone, and we can also ignore the extra stall cycles. With these assumptions, the speedup is:

Speedup = (I × 1 × 7) / (I × 1 × 0.1) = 70    (2)

Problem 2 [20 points]

Increasing a cache's associativity (with all other parameters kept constant) statistically reduces the miss rate.
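Returning briefly to Problem 1: a minimal sketch (not part of the original solution; the variable names are our own) that reproduces the four answers above:

```python
# Sketch: verifying the Problem 1 arithmetic for the 5-stage pipeline.
stage_times = {"IF": 1.0, "ID": 1.5, "EX": 1.0, "MEM": 2.0, "WB": 1.5}
overhead = 0.1           # pipeline register delay (ns)
single_cycle_time = 7.0  # original clock cycle time (ns)

# (a) pipelined clock = slowest stage + register overhead
pipe_cycle = max(stage_times.values()) + overhead          # 2.1 ns

# (b) one stall every 4 instructions -> 5 cycles per 4 instructions
cpi = (4 + 1) / 4                                          # 1.25

# (c) speedup over the single-cycle machine (instruction count I cancels)
speedup = (1.0 * single_cycle_time) / (cpi * pipe_cycle)   # ~2.67

# (d) infinitely deep pipeline: cycle time -> overhead, CPI -> 1
ideal_speedup = single_cycle_time / overhead               # ~70

print(pipe_cycle, cpi, round(speedup, 2), round(ideal_speedup))
```

The instruction count cancels in both speedup ratios, which is why it never needs a concrete value.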
EE524/CPTS561, Fall 2020

However, there can be pathological cases where increasing a cache's associativity would increase the miss rate for a particular workload. Consider the case of a direct-mapped cache compared to a two-way set associative cache of equal size. Assume that the set associative cache uses the LRU replacement policy. To simplify, assume that the block size is one word. Now construct a trace of word accesses that would produce more misses in the two-way associative cache. (Hint: Focus on constructing a trace of accesses that are exclusively directed to a single set of the two-way set associative cache, such that the same trace would exclusively access two blocks in the direct-mapped cache.)

Solution: We can construct a trace of the form addr1, addr2, addr3, addr1, addr2, addr3, addr1, addr2, addr3, ..., such that all three addresses map to the same set in the two-way associative cache. Because of the LRU policy, every access will evict a block and the miss rate will be 100%. If the addresses are chosen such that in the direct-mapped cache addr1 maps to one block while addr2 and addr3 map to another block, then all addr1 accesses will be hits, while all addr2/addr3 accesses will be misses, yielding a 66% miss rate.

Example for a 32-word cache: consider the trace 0, 16, 48, 0, 16, 48, ... When the cache is direct mapped, address 0 will hit in block 0, while addresses 16 and 48 will keep bumping each other out of block 16. On the other hand, if the 32-word cache is organized as 16 sets of two ways each, all three addresses (0, 16, 48) map to set 0. Because of LRU, that stream will produce a 100% miss rate. This behavior can happen in real code, except that the miss rates would not be that high because of all the other hits to the other blocks of the cache.
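The 32-word example can be checked with a small simulation. This is a sketch under the stated assumptions (one-word blocks, word addresses, true LRU); the helper functions are our own, not from the original solution:

```python
# Direct-mapped cache: one block per index, index = address mod #blocks.
def misses_direct_mapped(trace, n_blocks=32):
    blocks = [None] * n_blocks
    misses = 0
    for addr in trace:
        idx = addr % n_blocks
        if blocks[idx] != addr:   # tag mismatch or empty block
            misses += 1
            blocks[idx] = addr
    return misses

# Two-way set associative cache with LRU: each set is a list ordered LRU -> MRU.
def misses_two_way_lru(trace, n_sets=16):
    sets = [[] for _ in range(n_sets)]
    misses = 0
    for addr in trace:
        s = sets[addr % n_sets]
        if addr in s:
            s.remove(addr)        # hit: refresh to MRU position
        else:
            misses += 1
            if len(s) == 2:
                s.pop(0)          # evict the LRU way
        s.append(addr)
    return misses

trace = [0, 16, 48] * 10          # 30 accesses
dm = misses_direct_mapped(trace)  # 21 of 30 (tends to 2/3 as the trace grows)
tw = misses_two_way_lru(trace)    # 30 of 30: every access misses under LRU
print(dm, tw)
```

The direct-mapped count includes the three cold-start misses, so the measured rate (70%) approaches the asymptotic 66% as the trace lengthens.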
Problem 3 [7/7/8/8 points]

You are trying to appreciate how important the principle of locality is in justifying the use of a cache memory, so you experiment with a computer having an L1 data cache and a main memory (you exclusively focus on data accesses). The latencies (in CPU cycles) of the different kinds of accesses are as follows: cache hit, 1 cycle; cache miss, 105 cycles; main memory access with cache disabled, 100 cycles.

(a) When you run a program with an overall miss rate of 5%, what will the average memory access time (in CPU cycles) be?

Solution: We use the AMAT equation:

Average memory access time = (1 − miss rate) × hit time + miss rate × miss time = 0.95 × 1 + 0.05 × 105 = 6.2 cycles.

(b) Next, you run a program specifically designed to produce completely random data addresses with no locality. Toward that end, you use an array of size 256 MB (all of it fits in the main memory). Accesses to random elements of this array are continuously made (using a uniform random number generator to generate the element indices). If your data cache size is 64 KB, what will the average memory access time be?

Solution: In this case we are accessing the cache in a random fashion. The cache size is 64 KB, which means that we can hold 64 KB of the 256 MB array in the cache at any time. Since the accesses are uniformly random, the probability of a cache hit is 64 KB / 256 MB = 1/4096 ≈ 0.00025. Using this hit rate, the AMAT is given by:

AMAT = 0.00025 × 1 + (1 − 0.00025) × 105 ≈ 104.974 cycles

(c) If you compare the result obtained in part (b) with the main memory access time when the cache is disabled, what can you conclude about the role of the principle of locality in justifying the use of cache memory?

Solution: The access time when the cache is disabled is 100 cycles, which is less than the average access time when the cache is enabled and almost all the accesses are misses.
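The AMAT arithmetic in parts (a) and (b) can be sketched as follows (our own helper, using the exact hit rate 1/4096 rather than the rounded 0.00025):

```python
# AMAT = hit_rate * hit_time + (1 - hit_rate) * miss_time
def amat(hit_rate, hit_time, miss_time):
    return hit_rate * hit_time + (1 - hit_rate) * miss_time

# (a) 5% miss rate: 0.95 * 1 + 0.05 * 105
amat_a = amat(0.95, 1, 105)                        # 6.2 cycles

# (b) uniformly random accesses: hit rate = 64 KB / 256 MB = 1/4096
hit_rate = (64 * 1024) / (256 * 1024 * 1024)       # exactly 0.000244140625
amat_b = amat(hit_rate, 1, 105)                    # ~104.97 cycles
print(amat_a, amat_b)
```

With the exact ratio the answer is 104.975 cycles; the rounded hit rate used in the solution gives 104.974, the same conclusion either way.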
Therefore, if there is no locality at all in the data stream, then the cache memory will not only be useless but also a liability.

(d) You observed that a cache hit produces a gain of 99 cycles (1 cycle vs. 100), but it produces a loss of 5 cycles in the case of a miss (105 cycles vs. 100). In the general case, we can express these two quantities as G (gain) and L (loss). Using these two quantities (G and L), identify the highest miss rate after which the cache use would be disadvantageous.

Solution: Let us assume that the memory access time with the cache off is T_off and with the cache on it is T_on. Let us also assume that the miss rate is m. With these assumptions, the AMAT with the cache on is given by:

T_on = (1 − m)(T_off − G) + m(T_off + L)    (3)

Now, the cache becomes useless when T_off ≤ T_on. That is:

T_off ≤ (1 − m)(T_off − G) + m(T_off + L)    (4)

Rearranging the terms, we get:

m ≥ G / (G + L)    (5)

This means that if the miss rate is higher than this ratio, the cache will not be useful.

Problem 4 [15/15 points]

Suppose the branch frequencies (as percentages of all instructions) are as follows:

Conditional branches – 15%
Jumps and calls – 1%
Taken conditional branches – 60% of conditional branches are taken

(a) We are examining a four-deep pipeline where the branch is resolved at the end of the second cycle for unconditional branches and at the end of the third cycle for conditional branches. Assuming that only the first pipe stage can always be done independent of whether the branch goes and ignoring other pipeline stalls, how much faster would the machine be without any branch hazards?

Solution: We will use the pipeline speed-up equation for this problem. Before using it, we need to determine the number of stall cycles for each type of control instruction above. First, for a jump instruction, the pipeline looks like Figure 1. A jump is an unconditional branch, so the instruction is resolved at the end of the second stage, i.e., ID.
However, in the meantime one instruction after the jump instruction has been fetched, but it is no longer useful, because we have to start fetching instructions from the jump target. As a result, we need to add one stall cycle.

Figure 1: Pipeline for jump instructions

Second, for taken branches the pipeline looks as shown in Figure 2. Similar to the case of jumps, we have stalls since the branch is taken and any instructions that were fetched are not useful. Moreover, since conditional branches are resolved after the EX stage, we have two stall cycles.

Figure 2: Pipeline for taken branch instructions

Finally, in the case of non-taken branches, we have one stall cycle as shown in Figure 3. This occurs because the pipeline fetches the next sequential instruction from the program by default, which happens to be the instruction that follows a not-taken branch. Once the conditional branch resolves in cycle 3, the pipeline determines it does not need to reissue the fetch of instruction i + 1 and can therefore resume executing the instruction it fetched in cycle 2. Instruction i + 1 cannot leave the IF stage until after the branch resolves, because the exercise specifies the pipeline is only capable of using the IF stage while a branch is being resolved.

Figure 3: Pipeline for not-taken branch instructions

Combining the three pieces of information, we can obtain the pipeline stall cycles per instruction as:

Stall cycles = 1 × 0.01 + 2 × 0.15 × 0.60 + 1 × 0.15 × 0.40 = 0.25    (6)

Using the stall cycles, the pipeline speedup is given by:

Speedup = Pipeline depth / (1 + Stall cycles) = 4 / (1 + 0.25) = 3.2    (7)

Now, without any branch hazards, the speedup would be 4.
Therefore, the speedup of the ideal pipeline over the pipeline with branch hazards is 4 / 3.2 = 1.25.

(b) Now assume a high-performance processor in which we have a 15-deep pipeline where the branch is resolved at the end of the fifth cycle for unconditional branches and at the end of the tenth cycle for conditional branches. Assuming that only the first pipe stage can always be done independent of whether the branch goes and ignoring other pipeline stalls, how much faster would the machine be without any branch hazards?

Solution: We employ an analysis similar to part (a). The number of stall cycles is 4 for jump instructions, 9 for taken branches, and 8 for non-taken branches. Using these, the pipeline stalls are given by:

Stall cycles = 4 × 0.01 + 9 × 0.15 × 0.60 + 8 × 0.15 × 0.40 = 1.33    (8)

Using the stall cycles, the pipeline speedup is given by:

Speedup = Pipeline depth / (1 + Stall cycles) = 15 / (1 + 1.33) = 6.44    (9)

Finally, the speedup obtained by the ideal pipeline over the pipeline with branch hazards is 15 / 6.44 = 2.33. This shows that as pipelines get deeper, branch prediction becomes more important for obtaining higher speedup.
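The stall accounting for both pipelines can be sketched in a few lines (our own helper; frequencies and per-type stall counts are taken from the problem and the solutions above):

```python
# Expected branch stall cycles per instruction:
#   jumps contribute f_jump * jump_stalls,
#   taken conditionals contribute f_cond * taken_frac * taken_stalls,
#   not-taken conditionals contribute f_cond * (1 - taken_frac) * untaken_stalls.
def branch_stall_cpi(jump_stalls, taken_stalls, untaken_stalls,
                     f_jump=0.01, f_cond=0.15, taken_frac=0.60):
    return (jump_stalls * f_jump
            + taken_stalls * f_cond * taken_frac
            + untaken_stalls * f_cond * (1 - taken_frac))

# (a) 4-deep pipeline: jumps resolve after stage 2, conditionals after stage 3
stalls_a = branch_stall_cpi(1, 2, 1)      # 0.25
speedup_a = 4 / (1 + stalls_a)            # 3.2; ideal/actual = 4 / 3.2 = 1.25

# (b) 15-deep pipeline: jumps after stage 5, conditionals after stage 10
stalls_b = branch_stall_cpi(4, 9, 8)      # 1.33
speedup_b = 15 / (1 + stalls_b)           # ~6.44; ideal/actual ~ 2.33
print(stalls_a, speedup_a, stalls_b, round(speedup_b, 2))
```

Scaling the per-type stall counts with pipeline depth is exactly what drives the penalty from 0.25 to 1.33 stall cycles per instruction.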