UNIVERSITY OF CALIFORNIA, LOS ANGELES (UCLA)

CS M151B / EE M116C
Final Exam

Before you start, make sure you have all 13 pages attached to this cover sheet.

All work and answers should be written directly on these pages, use the backs of pages if needed.

This is an open book, open notes final – but you cannot share books, notes, or calculators.

NAME: ________________________________________________________________

ID: ____________________________________________________________________

Problem 1 (10 points): ___________

Problem 2 (10 points): ___________

Problem 3 (10 points): ___________

Problem 4 (10 points): ___________

Problem 5 (20 points): ___________

Problem 6 (10 points): ___________

Problem 7 (30 points): ___________

Total: ________________ (out of 100 points)


BERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO

SANTA BARBARA • SANTA CRUZ

1. I Amdahl-ighted with Tradeoffs (10 points): Given the following problems, suggest one solution and give one
drawback of the solution. Be brief, but specific.

EXAMPLE
Problem: long memory latencies
Solution: Caches
Drawback: when the cache misses, the latency becomes worse due to the cache access latency
We would not accept solutions like “do not use memory”, “use a slower CPU”, “cache is hard to spell”, etc.

Problem: too many capacity misses in the data cache
Solution:

Drawback:

Problem: too many control hazards
Solution:

Drawback:

Solution:

Drawback:

Problem: we want to be able to use a larger immediate field in the MIPS ISA
Solution:

Drawback:

Problem: the execution time of our CPU with a single-cycle datapath is too high
Solution:

Drawback:

2. Hazard a Guess? (10 points): Assume you are using the 5-stage pipelined MIPS processor, with a three-cycle
branch penalty. Further assume that we always use predict not taken. Consider the following instruction
sequence, where the bne is taken once, and then not taken once (so 7 instructions will be executed total):

Loop: lw $t0, 512($t0)
      lw $t1, 64($t0)
      bne $s0, $t1, Loop
      sw $s1, 128($t0)

Assuming that the pipeline is empty before the first instruction:

a. Suppose we do not have any data forwarding hardware – we stall on data hazards. The register file is
still written in the first half of a cycle and read in the second half of a cycle, so there is no hazard from
WB to ID. Calculate the number of cycles that this sequence of instructions would take:

__________________
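For checking a by-hand cycle count, the stall rules above can be captured as two small helper formulas. This is a sketch, not part of the exam: it assumes the standard 5-stage behavior where the producer writes the register file early in WB and the consumer reads it late in ID (for part a), and forwarding from the end of MEM into EX (for the load-use case in part b).

```python
# Sketch: stall counts as a function of the distance (in instructions)
# between a producer and its dependent consumer in the 5-stage MIPS pipeline.

def stalls_no_forwarding(distance):
    """No forwarding: the producer writes the register file in WB (first
    half of the cycle) and the consumer reads in ID (second half), so an
    immediately dependent instruction (distance 1) stalls 2 cycles."""
    return max(0, 3 - distance)

def stalls_with_forwarding_load_use(distance):
    """Full forwarding: only load-use dependences still stall, because a
    load's value is not available until the end of MEM."""
    return max(0, 2 - distance)

print(stalls_no_forwarding(1))             # 2 stalls
print(stalls_with_forwarding_load_use(1))  # 1 stall
```

Applying these per dependence (and adding the 3-cycle penalty for the one taken branch) reproduces a by-hand count.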

b. How many cycles would this sequence of instructions take with data forwarding hardware:

_____________________

3. More $ More Problems (10 points): Find the data cache hit or miss stats for a given set of addresses.
The data cache is a 1KB, direct mapped cache with 64-byte blocks. Find the hit/miss behavior of the cache for
a given byte address stream, and label misses as compulsory, capacity, or conflict misses. All blocks in the
cache are initially invalid.

Byte Address         Hit or Miss    Miss Type
...0011011010000     ___________    ___________
...0001011100000     ___________    ___________
...0000011010000     ___________    ___________
...0011011100000     ___________    ___________
...0011011010000     ___________    ___________
...0001011100000     ___________    ___________
...0000011010000     ___________    ___________
...0011011100000     ___________    ___________
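A quick way to double-check a by-hand fill of this table is a small simulator. The sketch below assumes the elided high-order address bits ("...") are zero, and it classifies any non-compulsory miss as a conflict miss, which is valid here because only three distinct blocks are touched (far fewer than the 16 blocks the cache holds).

```python
# Sketch: direct-mapped cache simulation for the address stream above.
# 1 KB cache, 64 B blocks -> 16 blocks, 6-bit offset, 4-bit index.

CACHE_BLOCKS = 1024 // 64      # 16 blocks, direct mapped
OFFSET_BITS = 6                # 64-byte blocks

addresses = [
    0b0011011010000,
    0b0001011100000,
    0b0000011010000,
    0b0011011100000,
] * 2                          # the 4-address pattern repeats once

cache = [None] * CACHE_BLOCKS  # tag stored per set; None = invalid
seen_blocks = set()            # blocks referenced before (compulsory check)
results = []

for addr in addresses:
    block = addr >> OFFSET_BITS
    index = block % CACHE_BLOCKS
    tag = block // CACHE_BLOCKS
    if cache[index] == tag:
        results.append(("hit", None))
    else:
        # Working set (3 blocks) < cache size (16 blocks), so every
        # non-compulsory miss in this stream is a conflict miss.
        kind = "compulsory" if block not in seen_blocks else "conflict"
        results.append(("miss", kind))
        cache[index] = tag
    seen_blocks.add(block)

for addr, r in zip(addresses, results):
    print(f"{addr:013b}: {r}")
```

Note that the first and fourth addresses fall in the same 64-byte block, which is what makes the repeated pattern interesting.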

4. The Trouble with TLBs (10 points): Consider an architecture with 32-bit virtual addresses and 1 GB of
physical memory. Pages are 32KB and we have a TLB with 64 sets that is 8-way set associative. The data
and instruction caches are 8KB with 16B block sizes and are direct mapped – and they are both virtually
indexed and physically tagged. Assume that every page mapping (in the TLB or page table) requires 1 extra
bit for storing protection information. Answer the following:

a. How many pages of virtual memory can fit in physical memory at a time? ______________

b. How large (in bytes) is the page table? ________________

c. What fraction of the total number of page translations can fit in the TLB? ____________

d. What bits of a virtual address will be used for the index to the TLB? Specify this as a range of bits –
e.g., “bits 4 to 28 will be used as the index.” The least significant bit is labeled 0 and the most significant
bit is labeled 31.
___________________
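These sizing questions reduce to powers-of-two arithmetic. As a sketch (assuming each page table entry is rounded up to whole bytes and holds a physical page number plus the 1 protection bit):

```python
# Sketch: sizing arithmetic for the memory system described above.

VA_BITS = 32
PHYS_MEM = 1 << 30            # 1 GB of physical memory
PAGE_SIZE = 32 * 1024         # 32 KB pages -> 15-bit page offset
TLB_SETS, TLB_WAYS = 64, 8

offset_bits = PAGE_SIZE.bit_length() - 1          # 15
phys_pages = PHYS_MEM // PAGE_SIZE                # pages that fit in DRAM
virt_pages = (1 << VA_BITS) // PAGE_SIZE          # page table entries
ppn_bits = phys_pages.bit_length() - 1            # bits to name a phys page
pte_bits = ppn_bits + 1                           # +1 protection bit
pte_bytes = (pte_bits + 7) // 8                   # assume byte-rounded PTEs
page_table_bytes = virt_pages * pte_bytes
tlb_entries = TLB_SETS * TLB_WAYS
tlb_index_lo = offset_bits                        # index sits above offset
tlb_index_hi = offset_bits + 6 - 1                # 64 sets -> 6 index bits

print("pages in physical memory:", phys_pages)
print("page table size (bytes):", page_table_bytes)
print("TLB fraction:", tlb_entries, "/", virt_pages)
print("TLB index bits:", tlb_index_lo, "to", tlb_index_hi)
```

If your course counts page table size in bits rather than byte-rounded entries, the entry size changes but the entry count does not.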

5. Starting Some Static (Scheduling) (20 points): Consider the 2-way superscalar processor we covered in class
– a five stage pipeline where we can issue one ALU or branch instruction along with one load or store
instruction every cycle. Suppose that the branch delay penalty is two cycles and that we handle control
hazards with branch delay slots (since the penalty is two cycles, and this is a 2-way superscalar processor, that
would be four instructions that we need to place in delay slots). This processor has full forwarding hardware.
This processor is a VLIW machine. How long would the following code take to execute on this processor
assuming the loop is executed 200 times? Assume the pipeline is initially empty and give the time taken up
until the completed execution of the instruction sequence shown here. First you will need to schedule (i.e.
reorder) the code (use the table below) to reduce the total number of cycles required (but don’t unroll it…yet).

Total # of cycles for 200 iterations: _____________________________

(Hint – schedule the code first for one iteration, then figure out how long it will take the processor to run 200
iterations of this scheduled code)

Loop: lw $t0, 0($s0)
      lw $t1, 0($t0)
      sw $t1, 0($t0)    # you may assume that this store never goes to the same address as the first load
      bne $s0, $s2, Loop

Cycle 1st Issue Slot (ALU or Branch) 2nd Issue Slot (LW or SW)
1
2
3
4
5
6
7
8
9
10
11
12
13
Now unroll the loop once to make two copies of the loop body. Schedule it again and record the total # of cycles
for 200 iterations:
__________________________________

Cycle 1st Issue Slot (ALU or Branch) 2nd Issue Slot (LW or SW)
1
2
3
4
5
6
7
8
9
10
11
12
13
6. A Branch Too Far (10 points): One difficulty in designing a branch predictor is trying to avoid cases where
two PCs with very different branch behavior index to the same entry of a 2-bit branch predictor. This is called
destructive aliasing. One way around this is to use multiple 2-bit branch predictors with different sizes. This
way, if two PCs index to the same entry in one predictor, they will not likely index to the same entry in the
other predictor. We will evaluate a scheme with three 2-bit branch predictors – each with a different number
of entries. The three predictors will be accessed in parallel, and the majority decision of the predictors will be
chosen. So if two predictors say taken and the other predictor says not taken, the majority decision will be
taken. The scheme looks like this:

[Figure: the PC indexes three 2-bit predictors in parallel – a 1K-entry predictor, a 512-entry predictor, and a 768-entry predictor – and a majority vote over their three outputs produces the final Taken or Not Taken prediction.]

Each predictor has the following FSM:

[Figure: a standard 2-bit saturating counter. States 11 and 10 predict taken; states 01 and 00 predict not taken. Each taken outcome moves the state toward 11, and each not-taken outcome moves it toward 00.]
Evaluate the performance of this prediction scheme on the following sequence of PCs. The table shows the
address of each branch and its actual direction (taken or not taken); fill in whether the branch predictor
would guess correctly. Each node of the FSM is marked with the 2-bit value representing that state. Assume
that all predictors are initialized to 00. To find an index into a predictor, use the simplified branch indexing
formula: index = PC % predictor_size, where % is the modulo operator and predictor_size differs for each
predictor.

PC     Actual Direction    Correctly Predicted?
128    T                   ____________________
640    NT                  ____________________
1152   NT                  ____________________
128    T                   ____________________
640    T                   ____________________
1152   NT                  ____________________
128    T                   ____________________
640    NT                  ____________________
1152   NT                  ____________________
128    T                   ____________________
640    T                   ____________________
1152   NT                  ____________________
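A by-hand trace of this scheme is easy to get wrong, so a small simulator can serve as a check. The sketch below assumes counter values 2 and 3 (states 10 and 11) predict taken, all three predictors are trained on every branch outcome, and indexing uses PC % predictor_size as stated above:

```python
# Sketch: simulate the three-predictor majority-vote scheme on the PC trace.

SIZES = [1024, 512, 768]
tables = [{} for _ in SIZES]   # sparse counter tables; default state is 0 (00)

trace = [(128, True), (640, False), (1152, False),
         (128, True), (640, True), (1152, False),
         (128, True), (640, False), (1152, False),
         (128, True), (640, True), (1152, False)]

correct = []
for pc, taken in trace:
    votes = []
    for size, table in zip(SIZES, tables):
        state = table.get(pc % size, 0)
        votes.append(state >= 2)               # states 10/11 predict taken
    prediction = votes.count(True) >= 2        # majority of the three
    correct.append(prediction == taken)
    for size, table in zip(SIZES, tables):     # train all three predictors
        idx = pc % size
        state = table.get(idx, 0)
        table[idx] = min(3, state + 1) if taken else max(0, state - 1)

print(correct)
```

Note the aliasing this trace exercises: 1152 % 1024 = 128, and both 640 % 512 and 1152 % 512 equal 128, so the 512-entry predictor sees all three PCs in the same entry.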

7. With Friends Like These… (30 points): Consider the scalar pipeline we have explored in class:

a. (10 points) Suppose 10% of instructions are stores, 15% are branches, 25% are loads, and the rest are
R-type. 30% of all loads are followed by a dependent instruction. We have full forwarding hardware
on this architecture. We use a predict not taken branch prediction policy and there is a 2 cycle branch
penalty. This means that the PC is updated at the end of the EX stage – after the comparison is made
in the ALU. One third of all branches are taken. There is an instruction cache with a single-cycle latency
and a 10% miss rate, and a data cache with a single-cycle latency and a 20% miss rate. We have an L2 cache
with a 5% miss rate and a 10-cycle latency, and memory has a 100-cycle latency.
Find the TCPI for this architecture.

TCPI = __________________
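The TCPI accounting here depends on conventions that vary by course. As one possible sketch, the arithmetic below assumes a base CPI of 1, a 1-cycle load-use stall, the full 2-cycle penalty on each taken branch, an L1 miss costing the 10-cycle L2 latency, and the 5% L2 miss rate being local (so an L2 miss adds the 100-cycle memory latency on top):

```python
# Sketch: TCPI = base CPI + stall CPI + memory CPI, under the assumptions
# stated above.

f_load, f_store, f_branch = 0.25, 0.10, 0.15
load_use = 0.30                 # loads followed by a dependent instruction
taken = 1 / 3
icache_miss, dcache_miss = 0.10, 0.20
l2_miss, l2_lat, mem_lat = 0.05, 10, 100

base_cpi = 1.0
load_stall_cpi = f_load * load_use * 1          # 1-cycle load-use stall
branch_stall_cpi = f_branch * taken * 2         # 2-cycle taken-branch penalty
l1_miss_penalty = l2_lat + l2_miss * mem_lat    # 10 + 0.05 * 100
mem_cpi = icache_miss * l1_miss_penalty \
        + (f_load + f_store) * dcache_miss * l1_miss_penalty

tcpi = base_cpi + load_stall_cpi + branch_stall_cpi + mem_cpi
print(round(tcpi, 4))
```

If your course treats the L2 miss rate as global, or charges the L1 latency on a miss, the memory term changes but the structure of the accounting stays the same.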

[Figure: the standard 5-stage pipelined MIPS datapath, with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline latches, a hazard detection unit, and a forwarding unit.]
b. (5 points) Your friend has a flash of brilliance – “I know a way to get rid of stalls in this pipeline. The
reason we have to stall now is because a load can have a dependent instruction follow it through the
pipeline, and we cannot forward the load’s data until the end of the MEM stage – but the dependent
instruction needs it at the beginning of the EX stage. So what if we add another ALU that recomputes
what we did in EX if the instruction before it is a load and it is dependent on the load?” This ALU will
be in the memory stage of the pipeline as shown below in this simplified picture:

Is your friend right or wrong? If they are wrong, give an example of when we would still need to stall.

They are right: ___________________

Or

Counter example: ____________________________________________________________

[Figure: the simplified datapath with a second ALU placed in the MEM stage alongside the data memory, fed by muxes that select between registered values and forwarded results.]
c. (5 points) Another friend offers an alternative – using the original pipeline from part a, let's get rid of
effective address computation now. This will allow us to combine EX and MEM into one stage (called EM) and avoid the need to
stall entirely. Instructions will either use the ALU or memory – but not both. There is still forwarding
hardware, but now we only need to forward from the EM/WB latch to the EM stage ALU. The
pipeline will now be:

Suppose that four fifths of loads actually use base + displacement addressing (i.e. they have a non-zero
displacement), which means that these loads will need to have add instructions before them to do their
effective address computation. Half of stores use base + displacement addressing, and these will also
need to be replaced with an add plus the store instruction. This modification has no impact on the
branch penalty or the instruction cache miss rate.

Is your friend right or wrong – will this eliminate all stalls? If they are wrong, give an example of
when we would still need to stall.

They are right: ___________________

Or

Counter example: _____________________________________________________________

IF → ID → EM (EX or MEM) → WB
d. (10 points) A third friend has a different idea (it may be time for you to get new friends who don’t talk
about architecture all the time). Forget about trying to eliminate hazards – she says we should just use
superpipelining and get a win on cycle time. Take the original architecture from part a – ignore the
suggestions from b and c – and assume that the stages have the following latencies:

Stage Latency (in picoseconds)
IF 200
ID 100
EX 200
MEM 200
WB 100

Your friend suggests a way to cut the IF, EX, and MEM stages in half – just increase the pipeline depth
and make each of these stages into two stages. So your pipeline would now have IF1, IF2, ID, EX1,
EX2, MEM1, MEM2, and WB stages – each of which would have 100 picosecond latency. Your
friend also finds a way to do full forwarding between stages – even in the ALU – but loads are still a
problem. In fact, load stalls will increase now because of this increase in pipeline depth. To help you
figure out the new # of pipeline stalls from load data hazards, use the following table:

% of Loads    Distance of the next dependent instruction
30%           Exactly 1 cycle later
20%           Exactly 2 cycles later
20%           Exactly 3 cycles later
10%           Exactly 4 cycles later
10%           Exactly 5 cycles later
5%            Exactly 6 cycles later
5%            Exactly 7 or more cycles later

So this means that 30% of loads are immediately followed by a dependent (i.e. 1 cycle later), 20% of
loads have a dependent exactly 2 cycles later, 20% have a dependent 3 cycles later, and so on. These
classifications are completely disjoint – the 20% of loads that have a dependent 2 cycles later do NOT
have dependents 1 cycle later.
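As a sketch of the load-stall portion only: assuming a load's result leaves MEM2 (the 7th stage) and can be forwarded to a dependent entering EX1 (the 4th stage), a dependent d instructions after the load stalls max(0, 4 - d) cycles, and the table above gives the distribution of d:

```python
# Sketch: expected load-stall cycles per instruction in the 8-stage
# superpipeline, under the MEM2 -> EX1 forwarding assumption stated above.

f_load = 0.25
dependents = {1: 0.30, 2: 0.20, 3: 0.20, 4: 0.10,
              5: 0.10, 6: 0.05, 7: 0.05}   # 7 stands in for "7 or more"

def load_stalls(distance):
    """Stall cycles for a dependent `distance` instructions after a load."""
    return max(0, 4 - distance)

load_stall_cpi = f_load * sum(frac * load_stalls(d)
                              for d, frac in dependents.items())
print(load_stall_cpi)
```

The branch penalty and memory stall terms must also be redone for the deeper pipeline (the branch resolves later, and if the L2 and memory latencies are fixed in time, they cost more of the new, shorter cycles), so this term is only one piece of the new TCPI.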

Find the TCPI of this new architecture:_________________

Assume your target application will run 1M instructions. Find the execution time of this architecture
for that application:

ET: ________________ 