
UNIVERSITY OF CALIFORNIA, LOS ANGELES UCLA

CS M151B / EE M116C

Final Exam

Before you start, make sure you have all 13 pages attached to this cover sheet.

All work and answers should be written directly on these pages, use the backs of pages if needed.

This is an open book, open notes final – but you cannot share books, notes, or calculators.

NAME: ________________________________________________________________

ID: ____________________________________________________________________

Problem 1 (10 points): ___________

Problem 2 (10 points): ___________

Problem 3 (10 points): ___________

Problem 4 (10 points): ___________

Problem 5 (20 points): ___________

Problem 6 (10 points): ___________

Problem 7 (30 points): ___________

Total: ________________ (out of 100 points)


BERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO

SANTA BARBARA • SANTA CRUZ

1

1. I Amdahl-ighted with Tradeoffs (10 points): Given the following problems, suggest one solution and give one

drawback of the solution. Be brief, but specific.

EXAMPLE

Problem: long memory latencies

Solution: Caches

Drawback: when the cache misses, the latency becomes worse due to the cache access latency

We would not accept solutions like: “do not use memory”, “use a slower CPU”, “cache is hard to spell”, etc.

Problem: too many capacity misses in the data cache

Solution:

Drawback:

Problem: too many control hazards

Solution:

Drawback:

Problem: our carry lookahead adder is too slow

Solution:

Drawback:

Problem: we want to be able to use a larger immediate field in the MIPS ISA

Solution:

Drawback:

Problem: the execution time of our CPU with a single-cycle datapath is too high

Solution:

Drawback:

2

2. Hazard a Guess? (10 points): Assume you are using the 5-stage pipelined MIPS processor, with a three-cycle

branch penalty. Further assume that we always use predict not taken. Consider the following instruction

sequence, where the bne is taken once, and then not taken once (so 7 instructions will be executed total):

Loop: lw $t0, 512($t0)

lw $t1, 64($t0)

bne $s0, $t1, Loop

sw $s1, 128($t0)

Assuming that the pipeline is empty before the first instruction:

a. Suppose we do not have any data forwarding hardware – we stall on data hazards. The register file is

still written in the first half of a cycle and read in the second half of a cycle, so there is no hazard from

WB to ID. Calculate the number of cycles that this sequence of instructions would take:

__________________
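For part a, the stall count follows one simple rule, sketched here in code (an illustrative sketch, not part of the original exam): with no forwarding, a consumer must wait until the cycle its producer is in WB, and because the register file is written in the first half of a cycle and read in the second half, the consumer's ID may overlap the producer's WB.

```python
def stalls_no_forwarding(distance):
    """Stall cycles for a consumer issued `distance` instructions after
    its producer, in a 5-stage pipeline with no forwarding.  The register
    file is written in the first half of a cycle and read in the second
    half, so the consumer's ID may overlap the producer's WB."""
    # The producer reaches WB 3 cycles after its own ID stage; the
    # consumer's ID naturally falls `distance` cycles later and must wait.
    return max(0, 3 - distance)

# Back-to-back dependents stall 2 cycles; 3 apart, the hazard vanishes.
print([stalls_no_forwarding(d) for d in (1, 2, 3)])
```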

3

b. How many cycles would this sequence of instructions take with data forwarding hardware:

_____________________

4

3. More $ More Problems (10 points): Find the data cache hit or miss stats for a given set of addresses.

The data cache is a 1KB, direct mapped cache with 64-byte blocks. Find the hit/miss behavior of the cache for

a given byte address stream, and label misses as compulsory, capacity, or conflict misses. All blocks in the

cache are initially invalid.

Address in Binary | Cache Hit or Miss | Cache Miss Type

...0011011010000

...0001011100000

...0000011010000

...0011011100000

...0011011010000

...0001011100000

...0000011010000

...0011011100000
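As a sanity check on the address breakdown (an illustrative sketch, not part of the original exam): with 1KB capacity and 64-byte blocks the cache has 16 lines, so an address splits into a 6-bit block offset, a 4-bit index, and a tag. A minimal direct-mapped lookup:

```python
CAPACITY, BLOCK = 1024, 64
NUM_LINES = CAPACITY // BLOCK              # 16 direct-mapped lines
OFFSET_BITS = BLOCK.bit_length() - 1       # 6-bit block offset
INDEX_BITS = NUM_LINES.bit_length() - 1    # 4-bit index

def lookup(tags, addr):
    """Return 'hit' or 'miss' for byte address `addr`, updating state."""
    block = addr >> OFFSET_BITS
    index = block & (NUM_LINES - 1)
    tag = block >> INDEX_BITS
    if tags.get(index) == tag:
        return "hit"
    tags[index] = tag                      # allocate on miss
    return "miss"

tags = {}
for addr in [0b0011011010000, 0b0001011100000,
             0b0000011010000, 0b0011011100000]:
    print(f"{addr:013b}: {lookup(tags, addr)}")
```

Note that classifying capacity versus conflict misses additionally requires replaying the stream through a same-size fully associative cache: a reference that would also miss there is a capacity miss.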

4. The Trouble with TLBs (10 points): Consider an architecture with 32-bit virtual addresses and 1 GB of

physical memory. Pages are 32KB and we have a TLB with 64 sets that is 8-way set associative. The data

and instruction caches are 8KB with 16B block sizes and are direct mapped – and they are both virtually

indexed and physically tagged. Assume that every page mapping (in the TLB or page table) requires 1 extra

bit for storing protection information. Answer the following:

a. How many pages of virtual memory can fit in physical memory at a time? ______________

b. How large (in bytes) is the page table? ________________

c. What fraction of the total number of page translations can fit in the TLB? ____________

d. What bits of a virtual address will be used for the index to the TLB? Specify this as a range of bits –

e.g. bits 4 to 28 would be used as the index. The least significant bit is labeled 0 and the most significant

bit is labeled 31.

___________________
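The bit arithmetic behind these questions can be laid out as follows (an illustrative sketch, not part of the original exam; it assumes a single-level page table with one entry per virtual page, and leaves the page-table entry size aside since that depends on the physical page number width plus the protection bit):

```python
VA_BITS = 32
PHYS_MEM = 1 << 30                 # 1 GB of physical memory
OFFSET_BITS = 15                   # 32KB pages
PAGE = 1 << OFFSET_BITS
TLB_SETS, TLB_WAYS = 64, 8

physical_pages = PHYS_MEM // PAGE                 # pages resident at once
virtual_pages = 1 << (VA_BITS - OFFSET_BITS)      # page table entries
tlb_entries = TLB_SETS * TLB_WAYS

# The TLB index comes from the low bits of the virtual page number,
# i.e. the bits just above the page offset.
tlb_index_lo = OFFSET_BITS
tlb_index_hi = OFFSET_BITS + (TLB_SETS.bit_length() - 1) - 1

print(physical_pages, virtual_pages, tlb_entries,
      f"TLB index: bits {tlb_index_lo} to {tlb_index_hi}")
```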

5

5. Starting Some Static (Scheduling) (20 points): Consider the 2-way superscalar processor we covered in class

– a five stage pipeline where we can issue one ALU or branch instruction along with one load or store

instruction every cycle. Suppose that the branch delay penalty is two cycles and that we handle control

hazards with branch delay slots (since the penalty is two cycles, and this is a 2-way superscalar processor, that

would be four instructions that we need to place in delay slots). This processor has full forwarding hardware.

This processor is a VLIW machine. How long would the following code take to execute on this processor

assuming the loop is executed 200 times? Assume the pipeline is initially empty and give the time taken up

until the completed execution of the instruction sequence shown here. First you will need to schedule (i.e.

reorder) the code (use the table below) to reduce the total number of cycles required (but don’t unroll it…yet).

Total # of cycles for 200 iterations: _____________________________

(Hint – schedule the code first for one iteration, then figure out how long it will take the processor to run 200

iterations of this scheduled code)

Loop: lw $t0, 0($s0)

lw $t1, 0($t0)

add $t1, $s1, $t1

sw $t1, 0($t0) # you may assume that this store never goes to the same address as the first load

addi $s0, $s0, 4

bne $s0, $s2, Loop

Cycle 1st Issue Slot (ALU or Branch) 2nd Issue Slot (LW or SW)

1

2

3

4

5

6

7

8

9

10

11

12

13
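The binding constraints when scheduling are the two load-use chains (lw $t0 feeds lw $t1, which feeds the add): with full forwarding, a load's result is usable by an instruction issued two or more cycles later, never in the very next cycle. A tiny checker for that single constraint (an illustrative sketch under that simplified latency assumption, not part of the original exam):

```python
def violates_load_use(schedule):
    """schedule: list of (cycle, op, dest, sources) tuples.
    Flags any instruction whose source is produced by a load
    issued fewer than 2 cycles earlier (simplified model)."""
    load_ready = {}   # register -> earliest cycle a consumer may issue
    bad = []
    for cycle, op, dest, srcs in schedule:
        for s in srcs:
            if s in load_ready and cycle < load_ready[s]:
                bad.append((cycle, op))
        if op == "lw":
            load_ready[dest] = cycle + 2
    return bad

sched = [(1, "lw", "$t0", ["$s0"]),
         (2, "lw", "$t1", ["$t0"])]   # consumer only 1 cycle later
print(violates_load_use(sched))
```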

6

Now unroll the loop once to make two copies of the loop body. Schedule it again and record the total # of cycles

for 200 iterations:

__________________________________

Cycle 1st Issue Slot (ALU or Branch) 2nd Issue Slot (LW or SW)

1

2

3

4

5

6

7

8

9

10

11

12

13

7

6. A Branch Too Far (10 points): One difficulty in designing a branch predictor is trying to avoid cases where

two PCs with very different branch behavior index to the same entry of a 2-bit branch predictor. This is called

destructive aliasing. One way around this is to use multiple 2-bit branch predictors with different sizes. This

way, if two PCs index to the same entry in one predictor, they will not likely index to the same entry in the

other predictor. We will evaluate a scheme with three 2-bit branch predictors – each with a different number

of entries. The three predictors will be accessed in parallel, and the majority decision of the predictors will be

chosen. So if two predictors say taken and the other predictor says not taken, the majority decision will be

taken. The scheme looks like this:

[Figure: the PC indexes three 2-bit predictors in parallel (1K-entry, 512-entry, and 768-entry); a majority vote of the three outputs gives the final taken / not-taken prediction.]

Each predictor has the following FSM:

[Figure: a 2-bit saturating counter with states 11 and 10 (predict taken) and 01 and 00 (predict not taken); a taken branch moves the state toward 11, and a not-taken branch moves it toward 00.]

8

Evaluate the performance of this prediction scheme on the following sequence of PCs. The table shows the

address of the branch and the actual direction of the branch (taken or not taken). You fill in whether the branch predictor guesses correctly. Each node of the FSM is marked with the 2-bit value

representing that state. Assume that all predictors are initialized to 00. To find an index into a predictor,

assume we use the simplified branch indexing formula: index = PC % predictor_size. The symbol %

represents the modulo operator. Predictor_size will be different according to the predictor.
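The FSM and majority vote translate directly to code (an illustrative simulation sketch, not part of the original exam; counters start at 00, i.e. 0, and saturate at 0 and 3):

```python
SIZES = [1024, 512, 768]

def make_predictors():
    return [[0] * size for size in SIZES]   # all counters start at 00

def predict(preds, pc):
    # States 11 and 10 (counter >= 2) predict taken; majority of 3 wins.
    votes = sum(1 for t, size in zip(preds, SIZES) if t[pc % size] >= 2)
    return votes >= 2

def update(preds, pc, taken):
    # Saturating increment on taken, saturating decrement on not taken.
    for t, size in zip(preds, SIZES):
        i = pc % size
        t[i] = min(3, t[i] + 1) if taken else max(0, t[i] - 1)

preds = make_predictors()
for pc, taken in [(128, True), (640, False), (1152, False)]:
    correct = predict(preds, pc) == taken
    update(preds, pc, taken)
    print(pc, "correct" if correct else "wrong")
```

Note how the aliasing plays out: 1152 % 1024 = 128 and 640 % 512 = 128, so some predictors share entries across these PCs while others do not.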

PC Actual Direction Correctly Predicted?

128 T

640 NT

1152 NT

128 T

640 T

1152 NT

128 T

640 NT

1152 NT

128 T

640 T

1152 NT

9

7. With Friends Like These… (30 points): Consider the scalar pipeline we have explored in class:

a. (10 points) Suppose 10% of instructions are stores, 15% are branches, 25% are loads, and the rest are

R-type. 30% of all loads are followed by a dependent instruction. We have full forwarding hardware

on this architecture. We use a predict not taken branch prediction policy and there is a 2 cycle branch

penalty. This means that the PC is updated at the end of the EX stage – after the comparison is made

in the ALU. One third of all branches are taken. There is an instruction cache with a single cycle

latency and a miss rate of 10% and a data cache with a single cycle latency and a miss rate of 20%.

We have an L2 cache that misses 5% – it has a 10 cycle latency – and memory has a 100 cycle latency.

Find the TCPI for this architecture.

TCPI = __________________
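One common decomposition is TCPI = base CPI + stall CPI + memory CPI. The sketch below (illustrative, not part of the original exam and not the graded answer key) assumes an L1 miss pays the full 10-cycle L2 latency and an L2 miss adds the full 100-cycle memory latency on top:

```python
loads, stores, branches = 0.25, 0.10, 0.15
base_cpi = 1.0

# Pipeline stalls: one bubble per load-use pair, two per taken branch.
load_stall_cpi = loads * 0.30 * 1
branch_stall_cpi = branches * (1 / 3) * 2

def l1_miss_cpi(l1_miss_rate):
    # Cycles beyond a pipelined L1 hit, per access: an L1 miss pays the
    # 10-cycle L2 latency, and 5% of those also pay the 100-cycle memory
    # latency (an assumption about how the latencies compose).
    return l1_miss_rate * (10 + 0.05 * 100)

# Every instruction is fetched; only loads and stores touch the D-cache.
mem_cpi = 1.0 * l1_miss_cpi(0.10) + (loads + stores) * l1_miss_cpi(0.20)

tcpi = base_cpi + load_stall_cpi + branch_stall_cpi + mem_cpi
print(tcpi)
```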

[Figure: the 5-stage pipelined MIPS datapath: PC, instruction memory, register file, ALU, and data memory, with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline latches plus the hazard detection and forwarding units.]

10

b. (5 points) Your friend has a flash of brilliance – “I know a way to get rid of stalls in this pipeline. The

reason we have to stall now is because a load can have a dependent instruction follow it through the

pipeline, and we cannot forward the load’s data until the end of the MEM stage – but the dependent

instruction needs it at the beginning of the EX stage. So what if we add another ALU that recomputes

what we did in EX if the instruction before it is a load and it is dependent on the load?” This ALU will

be in the memory stage of the pipeline as shown below in this simplified picture:

Is your friend right or wrong? If they are wrong, give an example of when we would still need to stall.

They are right: ___________________

Or

Counter example: ____________________________________________________________

[Figure: simplified pipeline datapath with a second ALU placed in the MEM stage, alongside data memory.]

11

c. (5 points) Another friend offers an alternative – using the original pipeline from part a, let’s get rid of

base + displacement addressing for loads and stores. Loads and stores can only use register addressing

now. This will allow us to combine EX and MEM into one stage (called EM) and avoid the need to

stall entirely. Instructions will either use the ALU or memory – but not both. There is still forwarding

hardware, but now we only need to forward from the EM/WB latch to the EM stage ALU. The

pipeline will now be:

Suppose that four fifths of loads actually use base + displacement addressing (i.e. they have a non-zero

displacement), which means that these loads will need to have add instructions before them to do their

effective address computation. Half of stores use base + displacement addressing, and these will also

need to be replaced with an add plus the store instruction. This modification has no impact on the

branch penalty or the instruction cache miss rate.

Is your friend right or wrong – will this eliminate all stalls? If they are wrong, give an example of

when we would still need to stall.

They are right: ___________________

Or

Counter example: _____________________________________________________________

[Figure: the new pipeline: IF → ID → EM (EX or MEM) → WB]

12

d. (10 points) A third friend has a different idea (it may be time for you to get new friends who don’t talk

about architecture all the time). Forget about trying to eliminate hazards – she says we should just use

superpipelining and get a win on cycle time. Take the original architecture from part a – ignore the

suggestions from b and c – and assume that the stages have the following latencies:

Stage Latency (in picoseconds)

IF 200

ID 100

EX 200

MEM 200

WB 100

Your friend suggests a way to cut the IF, EX, and MEM stages in half – just increase the pipeline depth

and make each of these stages into two stages. So your pipeline would now have IF1, IF2, ID, EX1,

EX2, MEM1, MEM2, and WB stages – each of which would have 100 picosecond latency. Your

friend also finds a way to do full forwarding between stages – even in the ALU – but loads are still a

problem. In fact, load stalls will increase now because of this increase in pipeline depth. To help you

figure out the new # of pipeline stalls from load data hazards, use the following table:

% of Loads Distance of the next dependent instruction

30% 1 cycle

20% Exactly 2 cycles later

20% Exactly 3 cycles later

10% Exactly 4 cycles later

10% Exactly 5 cycles later

5% Exactly 6 cycles later

5% Exactly 7 or more cycles later

So this means that 30% of loads are immediately followed by a dependent (i.e. 1 cycle later), 20% of

loads have a dependent exactly 2 cycles later, 20% have a dependent 3 cycles later, and so on. These

classifications are completely disjoint – the 20% of loads that have a dependent 2 cycles later do NOT

have dependents 1 cycle later.
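The expected load-stall count under the new depth can be tabulated directly (an illustrative sketch, not part of the original exam, under the assumption that a load's value is forwarded from the end of MEM2 and a dependent needs it entering EX1, so a dependent d instructions later stalls max(0, 4 − d) cycles):

```python
# (fraction of loads, distance in instructions to the nearest dependent)
dist = [(0.30, 1), (0.20, 2), (0.20, 3), (0.10, 4),
        (0.10, 5), (0.05, 6), (0.05, 7)]

def stalls(d):
    # Assumption: value ready leaving MEM2 (7th stage), needed entering
    # EX1 (4th stage), so up to 3 bubbles, fading with distance.
    return max(0, 4 - d)

expected = sum(p * stalls(d) for p, d in dist)
print(expected)   # expected stall cycles per load
```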

Find the TCPI of this new architecture:_________________

13

Assume your target application will run 1M instructions. Find the execution time of this architecture

for that application:

ET: ________________
