Please type or hand-write your answers and upload to Blackboard. The problems are from the Hennessy and Patterson book, 5th edition.
Problem 1 [5/5/5/5 points]
We begin with a computer that uses a single-cycle implementation.
When the stages are split
by functionality, the stages do not require exactly the same amount of
time. The original machine
had a clock cycle time of 7 ns. After the stages were split, the
measured times were IF, 1 ns; ID, 1.5
ns; EX, 1 ns; MEM, 2 ns; and WB, 1.5 ns. The pipeline register delay is
0.1 ns.
(a) What is the clock cycle time of the 5-stage pipelined machine?
Solution: The clock cycle time of the 5-stage pipelined machine is 2.1 ns, which is the time of the slowest pipeline stage (MEM, 2 ns) plus the 0.1 ns pipeline register overhead.
(b) If there is a stall every 4 instructions, what is the CPI of the new
machine?
Solution: Without stalls, the CPI would be 1 cycle per instruction. With one stall every 4 instructions, those 4 instructions take 5 cycles. Therefore, CPI = 5/4 = 1.25.
(c) What is the speedup of the pipelined machine over the single-cycle machine?
Solution: To calculate the speedup, we will look at the execution time
of unpipelined and
pipelined processors. The execution time is given by:
Exe time = I × CPI × Cycle time
Using this and substituting the values, we get:
Speedup = (I × 1 × 7) / (I × 1.25 × 2.1) = 2.67 (1)
(d) If the pipelined machine had an infinite number of stages, what would its speedup be over the single-cycle machine?
Solution: With an infinite number of pipeline stages, the clock cycle time becomes equal to the overhead alone, and we can also ignore the extra stall cycles. With these assumptions, the speedup is:
Speedup = (I × 1 × 7) / (I × 1 × 0.1) = 70 (2)
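As a quick sanity check of parts (a) through (d), the arithmetic above can be reproduced with a short Python sketch (the variable names below are illustrative, not from the text):

# Sanity check for Problem 1 (illustrative variable names).
stage_times = {"IF": 1.0, "ID": 1.5, "EX": 1.0, "MEM": 2.0, "WB": 1.5}   # ns
register_delay = 0.1                                                     # ns
single_cycle_time = 7.0                                                  # ns

pipelined_cycle = max(stage_times.values()) + register_delay  # part (a): 2.1 ns
cpi = 5 / 4                                                   # part (b): 1.25
speedup = (1 * single_cycle_time) / (cpi * pipelined_cycle)   # part (c)
print(pipelined_cycle, cpi, round(speedup, 2))                # 2.1 1.25 2.67

# Part (d): with infinite stages, the cycle time approaches the register delay alone.
print(round(single_cycle_time / register_delay, 2))           # 70.0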
Problem 2 [20 points]
Increasing a cache’s associativity (with all other parameters kept
constant), statistically reduces
the miss rate. However, there can be pathological cases where increasing
a cache’s associativity
would increase the miss rate for a particular workload. Consider the
case of direct mapped compared
to a two-way set associative cache of equal size. Assume that the set
associative cache uses the
LRU replacement policy. To simplify, assume that the block size is one
word. Now construct a
trace of word accesses that would produce more misses in the two-way
associative cache. (Hint:
Focus on constructing a trace of accesses that are exclusively directed to a single set of the two-way set associative cache, such that the same trace would exclusively access two blocks in the direct-mapped cache.)
Solution: We can construct a trace of the form addr1, addr2, addr3, addr1, addr2, addr3, addr1, addr2, addr3, . . . , such that all three addresses map to the same set in the two-way associative cache. Because of the LRU policy, every access will evict a block and the miss rate will be 100%. If the addresses are chosen such that, in the direct-mapped cache, addr1 maps to one block while addr2 and addr3 map to another block, then all addr1 accesses will be hits, while all addr2/addr3 accesses will be misses, yielding a miss rate of roughly 66%.
Example for a 32-word cache: consider the trace 0, 16, 48, 0, 16, 48, . . . When the cache is direct mapped, address 0 will hit in block 0, while addresses 16 and 48 will keep bumping each other out of block 16. On the other hand, if the 32-word cache is organized as 16 sets of two ways each, all three addresses (0, 16, 48) map to set 0, and because of LRU that stream produces a 100% miss rate. This behavior can happen in real code, except that the miss rates would not be that high because of all the other hits to the other blocks of the cache.
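To make the effect concrete, the trace above can be run through a small simulation of the two organizations; the sketch below is ours, assuming a 32-word cache with one-word blocks as in the example:

# Direct-mapped (32 blocks) vs. two-way set-associative LRU (16 sets) on the
# trace 0, 16, 48, 0, 16, 48, ... (word addresses, one-word blocks).
def misses_direct_mapped(trace, num_blocks=32):
    blocks = [None] * num_blocks
    misses = 0
    for addr in trace:
        idx = addr % num_blocks
        if blocks[idx] != addr:
            misses += 1
            blocks[idx] = addr
    return misses

def misses_two_way_lru(trace, num_sets=16):
    sets = [[] for _ in range(num_sets)]   # each set holds at most 2 blocks, in LRU order
    misses = 0
    for addr in trace:
        ways = sets[addr % num_sets]
        if addr in ways:
            ways.remove(addr)              # hit: refresh LRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)                # evict the least recently used block
        ways.append(addr)
    return misses

trace = [0, 16, 48] * 10
print(misses_direct_mapped(trace), misses_two_way_lru(trace))  # 21 misses vs. 30 misses

The two-way cache misses on every access, while the direct-mapped cache settles into a two-out-of-three miss pattern, matching the argument above.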
Problem 3 [7/7/8/8 points]
You are trying to appreciate how important the principle of locality is
in justifying the use of a cache
memory, so you experiment with a computer having an L1 data cache and a
main memory (you
exclusively focus on data accesses). The latencies (in CPU cycles) of
the different kinds of accesses
are as follows: cache hit, 1 cycle; cache miss, 105 cycles; main memory
access with cache disabled,
100 cycles.
(a) When you run a program with an overall miss rate of 5%, what will
the average memory access
time (in CPU cycles) be?
Solution: We use the equation for AMAT in this question.
Average memory access time = (1 − miss rate) × hit time + miss rate × miss time = 0.95 × 1 + 0.05 × 105 = 6.2 cycles.
(b) Next, you run a program specifically designed to produce completely
random data addresses
with no locality. Toward that end, you use an array of size 256 MB (all
of it fits in the main
memory). Accesses to random elements of this array are continuously made
(using a uniform
random number generator to generate the element indices). If your data
cache size is 64 KB,
what will the average memory access time be?
Solution: In this case we are accessing the cache in a random fashion.
The cache size is 64
KB, which means that we can store 64 KB of the 256 MB array in the
cache. The accesses are made with a uniform random number generator; therefore, the probability of a cache hit is 64 KB / 256 MB = 1/4096 ≈ 0.00025.
Using this hit rate, the AMAT is given by:
AMAT = 0.00025 × 1 + (1 − 0.00025) × 105 ≈ 104.97 cycles
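Both AMAT figures follow directly from the given latencies; here is a minimal Python check (the function name is ours, not from the text):

# AMAT = (1 - miss_rate) * hit_time + miss_rate * miss_time, in CPU cycles.
def amat(miss_rate, hit_time=1, miss_time=105):
    return (1 - miss_rate) * hit_time + miss_rate * miss_time

print(round(amat(0.05), 2))                    # part (a): 6.2 cycles
hit_rate = (64 * 1024) / (256 * 1024 * 1024)   # 64 KB / 256 MB = 1/4096
print(round(amat(1 - hit_rate), 2))            # part (b): about 104.97 cycles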
(c) If you compare the result obtained in part (b) with the main memory
access time when the
cache is disabled, what can you conclude about the role of the principle
of locality in justifying
the use of cache memory?
Solution: The access time when the cache is disabled is 100 cycles, which is less than the average access time when the cache is enabled and almost all of the accesses are misses. Therefore,
if there is no locality at all in the data stream, then the cache memory will not only be useless, it will be a liability.
(d) You observed that a cache hit produces a gain of 99 cycles (1 cycle
vs. 100), but it produces a
loss of 5 cycles in the case of a miss (105 cycles vs. 100). In the
general case, we can express
these two quantities as G (gain) and L (loss). Using these two
quantities (G and L), identify
the highest miss rate after which the cache use would be
disadvantageous.
Solution: Let us assume that the memory access time with the cache off is T_off and with the cache on it is T_on. Let us also assume that the miss rate is m. With these assumptions, the AMAT with the cache on is given by:
T_on = (1 − m)(T_off − G) + m(T_off + L) (3)
Now, the cache becomes useless when T_off ≤ T_on. That is,
T_off ≤ (1 − m)(T_off − G) + m(T_off + L) (4)
Rearranging the terms, we get:
m ≥ G / (G + L) (5)
This means that if the miss rate is higher than the ratio above, the
cache will not be useful.
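With the numbers from this experiment (G = 99, L = 5), the threshold works out as follows; a short check of the result above:

# The cache stops paying off once the miss rate m reaches G / (G + L).
G, L = 99, 5                     # gain on a hit, loss on a miss (cycles)
print(round(G / (G + L), 3))     # about 0.952

The 5% miss rate of part (a) is far below this threshold, while the essentially 100% miss rate of part (b) is above it, which is why the cache was a liability there.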
Problem 4 [15/15 points]
Suppose the branch frequencies (as percentages of all instructions) are
as follows:
Conditional branches – 15%
Jumps and calls – 1%
Taken conditional branches – 60% are taken
(a) We are examining a four-deep pipeline where the branch is resolved
at the end of the second
cycle for unconditional branches and at the end of the third cycle for
conditional branches.
Assuming that only the first pipe stage can always be done independent
of whether the branch
goes and ignoring other pipeline stalls, how much faster would the
machine be without any
branch hazards?
Solution: We will use the pipeline speed-up equation for this problem.
Before using the
equation, we need to determine the number of stall cycles for each type
of control instruction
above. First, for a jump instruction, the pipeline will look as shown in Figure 1.
Jump is an unconditional branch, so the instruction is resolved at the end of the second stage, i.e., ID. In the meantime, however, the instruction after the jump has already been fetched, and it is no longer useful because we have to start fetching instructions from the jump target. As a result, we need to add one stall cycle.
Second, for taken branches, the pipeline looks as shown in Figure 2.
Similar to the case of jumps, we have stalls since the branch is taken
and any instructions that
were fetched are not useful. Moreover, since conditional branches are
resolved after the EX
stage, we have two stall cycles.
Finally, in the case of non-taken branches, we have one stall cycle, as shown in Figure 3. This occurs because the pipeline fetches the next sequential instruction from the program by default, which happens to be the instruction that follows a not-taken branch. Once the conditional branch resolves in cycle 3, the pipeline determines that it does not need to reissue the fetch of instruction i + 1, and it can therefore resume executing the instruction it fetched in cycle 2. Instruction i + 1 cannot leave the IF stage until after the branch resolves, because the exercise specifies that the pipeline is only capable of using the IF stage while a branch is being resolved.
Figure 1: Pipeline for jump instructions
Figure 2: Pipeline for taken branch instructions
Figure 3: Pipeline for not-taken branch instructions
Combining the three pieces of information, we can obtain the average pipeline stall cycles per instruction as:
Stall cycles = 1 × 0.01 + 2 × 0.15 × 0.60 + 1 × 0.15 × 0.40 = 0.25 (6)
Using the stall cycles, the pipeline speedup is given by:
Speedup = Pipeline depth / (1 + Stall cycles) = 4 / (1 + 0.25) = 3.2 (7)
Now, without any branch hazards, the speedup would be 4. Therefore, the speedup of the ideal pipeline over the pipeline with branch hazards is 4/3.2 = 1.25.
(b) Now assume a high-performance processor in which we have a 15-deep
pipeline where the
branch is resolved at the end of the fifth cycle for unconditional
branches and at the end of
the tenth cycle for conditional branches. Assuming that only the first
pipe stage can always be
done independent of whether the branch goes and ignoring other pipeline
stalls, how much
faster would the machine be without any branch hazards?
Solution: We need to employ analysis similar to part (a) for this
question. The number of
stall cycles for the jump instructions is 4, for taken branches it is 9
and for non-taken branches
it is 8. Using these, the average pipeline stalls are given by:
Stall cycles = 4 × 0.01 + 9 × 0.15 × 0.60 + 8 × 0.15 × 0.40 = 1.33 (8)
Using the stall cycles, the pipeline speedup is given by:
Speedup = Pipeline depth / (1 + Stall cycles) = 15 / (1 + 1.33) = 6.44 (9)
Finally, the speedup obtained by the ideal pipeline over the pipeline with branch hazards is 15/6.44 = 2.33. This shows that as pipelines get deeper, branch prediction becomes increasingly important for obtaining higher speedups.
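Both parts can be checked with one parametrized Python calculation that uses the stall-cycle counts derived above (the function and names are ours):

# Average branch stall cycles per instruction and the resulting speedups.
def branch_penalty(jump_stall, taken_stall, untaken_stall,
                   f_jump=0.01, f_cond=0.15, f_taken=0.60):
    return (jump_stall * f_jump
            + taken_stall * f_cond * f_taken
            + untaken_stall * f_cond * (1 - f_taken))

for depth, stalls in [(4, (1, 2, 1)), (15, (4, 9, 8))]:
    penalty = branch_penalty(*stalls)
    speedup = depth / (1 + penalty)                       # vs. an unpipelined machine
    print(depth, round(penalty, 2), round(speedup, 2),
          round(depth / speedup, 2))                      # ideal vs. with hazards
# Prints 4 0.25 3.2 1.25 and 15 1.33 6.44 2.33, matching parts (a) and (b).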