代写-F0 |学霸联盟

代写-F0

时间：2021-10-31

Problem Set 5
Q1. Assume that we use Speculative Tomasulo Algorithm with Reorder Buffer to run the following code sample:
L.D F0, 0(R1)
SUB.D F4, F2, F0
MUL.D F2, F4, F6
DADDIU R1, R1, #-8
BNE R1, <…>
L.D F8, 0(R2)
ADD.D F8, F8, F0
DADDIU R2, R2, #-8
BNE R2, <…>

Assume that L.D, SUB.D, and ADD.D need 2 cycle to execute, DADDIU and BNE 1 cycle, and MUL.D 10 cycles.
Also assume there are unlimited number of RS’s for each operation. Also, assume all instructions go through one
pipeline and the processor is a single-issue processor. According to our slides in Lesson 15, the reorder buffer is shown
in the following in the 1st cycle:

HINT: For the Done? column use: N: Instruction is not completed yet; Y: Instruction is completed but not committed
yet; C: Instruction is in commit cycle; E: Instruction is Committed and the entry is Empty. In each case include the
Completion clock cycle (for N and Y options) or Commit clock cycle (for C and E options) in the parenthesis. E.g.
N(7) = Instruction has not completed yet but will complete at the end of clock cycle 7 and E(5) = Instruction in this
entry is committed at the end of clock cycle 5 and the entry is empty.
Dest Value Instruction [Issue Cycle] Done?

F0 L.D F0, 0(R1) [1] N (4)
Please fill the reorder buffer when it is in the 10th cycle and 15th cycle (you could ignore the “value” column)
10th cycle:
Dest Value Instruction [Issue Cycle] Done?

15th cycle:
Dest Value Instruction [Issue Cycle] Done?

Case Study
Assume that we have the following code sample:

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop
R1 = R2 + 1024

Assume that the branch prediction is perfect. ADD.D, L.D and S.D need 2 cycles to execute, DADDUI and BNE
need 1 cycle to execute.

Q2. If we use Tomasulo algorithm with ROB to run this code sample, what is the CPI? Assume unlimited RS’s for
every operation and entries in ROB.

Q3. Repeat Q2 assuming one RS for each operation.

Q4. To reduce the stalls, we optimize the compiler to unroll the loop so we could run 4 iterations at once. What is
the new CPI? Assume unlimited RS’s for every operation and entries in ROB.

Q5. To further speed up the execution, we decide to use a VLIW to run the code sample. The VLIW could issue two
memory operations, two floating point operations and one integer operation/branch at once. If we still run 4
iterations at once, what is the new CPI for this VLIW machine? Assume the first clock cycle is for the issue of the
instructions and the following clocks for execution and write back operations (Consider simple Issue, Execute. WB
scheme for the CPU Microarchitecture). You could use the help of the following table.

Cycle Memory Operation 1 Memory Operation 2 FP operation 1 FP operation 2 Integer operation/branch
1 L.D F0, 0(R1) L.D F6, -8(R1)

Q6. Given a processor running a program, the base CPI is 1.2. Assume that the program has 20% memory operations
(load and store). Suppose the hit rate of instruction cache is 98% and 85% for data cache. If the miss penalty for
instruction and data caches are 80 and 50 clock cycles, respectively, calculate the CPI and average memory access
time.

学霸联盟

学霸联盟