[CA] Lecture #25

leziwn.cs 2023. 11. 28. 18:11

Example: Intrinsity FastMATH

Large cache: 1 block = 64 bytes

In a large cache the block is large, but data still has to be sent to the CPU one word (= 4 bytes) at a time. So the 64-byte block is split into 16 pieces of 4 bytes each.
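The split can be sketched in a few lines of Python. The block and word sizes come from the notes; the helper names `split_block` and `word_index` and the sample address are hypothetical, chosen only for illustration.

```python
BLOCK_BYTES = 64  # block size stated in the notes
WORD_BYTES = 4    # 1 word = 4 bytes


def split_block(block: bytes) -> list:
    """Split one 64-byte cache block into 4-byte words."""
    assert len(block) == BLOCK_BYTES
    return [block[i:i + WORD_BYTES] for i in range(0, BLOCK_BYTES, WORD_BYTES)]


def word_index(byte_address: int) -> int:
    """Return which of the 16 words in its block a byte address falls in."""
    return (byte_address % BLOCK_BYTES) // WORD_BYTES


block = bytes(range(BLOCK_BYTES))  # dummy block contents: 0, 1, ..., 63
words = split_block(block)
print(len(words))          # 16
print(word_index(0x1234))  # 0x1234 % 64 = 52, and 52 // 4 = 13
```

The word index is what selects one of the 16 words to hand to the CPU after the block has been found in the cache.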

Main Memory Supporting Caches

▷ Miss penalty when 1 block = 4 words:

1: address transfer
4*15: DRAM access (the DRAM can be accessed only 1 word at a time, so multiply by 4)
4*1: data transfer (only 1 word can be transferred at a time, so multiply by 4)

--> Miss penalty = 1 + 4*15 + 4*1 = 65 cycles

Designing the Memory to Support Caches

a. One-word wide memory organization

: A cache block is 4 words, but memory can be accessed only 1 word at a time.

b. Wider memory organization

: Memory can be accessed 4 words at a time, so both DRAM access and data transfer become 4 times faster.

c. Interleaved memory organization: "bank"

: The 4 words can be read from the memory banks in parallel, but because the bus has not been widened, data transfer still has to be done 1 word at a time.
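As a rough comparison of the three organizations, the sketch below reuses the cycle counts from the miss penalty breakdown above (1 for the address, 15 per DRAM access, 1 per data transfer). The function name `miss_penalty` is hypothetical, and the interleaved case assumes the four bank accesses overlap completely.

```python
ADDRESS_CYCLES = 1    # send the address
DRAM_CYCLES = 15      # one DRAM access
TRANSFER_CYCLES = 1   # one transfer over the bus
WORDS_PER_BLOCK = 4


def miss_penalty(dram_accesses: int, bus_transfers: int) -> int:
    """Cycles to fetch one block, given sequential DRAM accesses and bus transfers."""
    return (ADDRESS_CYCLES
            + dram_accesses * DRAM_CYCLES
            + bus_transfers * TRANSFER_CYCLES)


# a. One-word-wide memory: 4 sequential accesses, 4 sequential transfers
one_word_wide = miss_penalty(WORDS_PER_BLOCK, WORDS_PER_BLOCK)
# b. Wider memory and bus: 1 access and 1 transfer cover all 4 words
wider = miss_penalty(1, 1)
# c. Interleaved banks: accesses overlap (counted once), transfers stay sequential
interleaved = miss_penalty(1, WORDS_PER_BLOCK)

print(one_word_wide, wider, interleaved)  # 65 17 20
```

Interleaving gets most of the benefit of the wider organization without widening the bus.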

DRAM Generations

: CPUs have become much faster than DRAM. In other words, from the CPU's point of view, DRAM has effectively become slower. So to improve performance, accesses to DRAM must be minimized.

Byte Addressable Memory

CPU: accesses 4 bytes at a time
Cache: accesses 1 block at a time
Memory: accesses 4 bytes at a time

Measuring and Improving Cache Performance

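▷ Reduce the cache miss rate, so that main memory is accessed less often.

▷ Even when a cache miss occurs, avoid going to main memory.

Multilevel caching: because there are several cache levels, a miss in an upper-level cache does not mean going to memory yet; the next cache level is tried first.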

Cache Performance

▶ CPU time = (CPU execution clock cycles + Memory stall clock cycles) * Clock cycle time

Read-stall: lw -- load word: read data from memory into a register.
Write-stall: sw -- store word: write data to memory.

Memory stall clock cycles = read-stall cycles + write-stall cycles

Memory Stall Cycles

Cache Performance Example

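Cache misses can occur in the i-cache (instruction memory) and in the d-cache (data memory).

Assume an i-cache miss rate of 2% (the instruction is not in the i-cache) and a d-cache miss rate of 4% (the data is not in the d-cache). The calculation also uses a miss penalty of 100 cycles, and loads and stores make up 36% of instructions.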

Instruction miss cycles: I*0.02*100 = 2*I
Data miss cycles: I*0.36*0.04*100 = 1.44*I

- For i-cache misses: I (the instruction count) x i-cache miss rate x miss penalty.

- For d-cache misses: I (the instruction count) x fraction of instructions that access memory x d-cache miss rate x miss penalty.

--> Total number of memory-stall cycles = 3.44*I

--> Total CPI = 2 (ideal CPI) + 3.44 = 5.44

Compared with the ideal CPI of 2, where there are no memory stalls at all, memory access time makes execution 2.72 times slower and raises the CPI to 5.44.

** CPI: Clock cycles Per Instruction
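The arithmetic of this example can be checked with a short script. It is only a sketch: the function name `stall_cycles_per_instruction` is hypothetical, and the numbers are the ones used in the example (2% and 4% miss rates, 36% memory-access instructions, 100-cycle miss penalty, ideal CPI of 2).

```python
def stall_cycles_per_instruction(i_miss_rate, d_miss_rate,
                                 mem_access_fraction, miss_penalty):
    """Return (instruction, data) memory-stall cycles per instruction."""
    instruction_stalls = i_miss_rate * miss_penalty
    data_stalls = mem_access_fraction * d_miss_rate * miss_penalty
    return instruction_stalls, data_stalls


IDEAL_CPI = 2.0
i_stalls, d_stalls = stall_cycles_per_instruction(
    i_miss_rate=0.02, d_miss_rate=0.04,
    mem_access_fraction=0.36, miss_penalty=100)

total_cpi = IDEAL_CPI + i_stalls + d_stalls
slowdown = total_cpi / IDEAL_CPI

print(round(i_stalls, 2), round(d_stalls, 2))    # 2.0 1.44
print(round(total_cpi, 2), round(slowdown, 2))   # 5.44 2.72
```

Multiplying `total_cpi` by the instruction count and the clock cycle time gives the CPU time from the formula in the Cache Performance section.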

Fully Associative Cache

: Memory block can be placed in any location in the cache.

--> All entries must be searched.

Set Associative Cache

: There is a fixed number of locations where each block can be placed.

Associative Cache Example

Example) With 8 entries:

Direct mapped cache: just a modulo 8 operation --> each index has exactly one slot.
2-way set associative: each index has two slots. (modulo 8/2, i.e., modulo 4)
Fully associative: there is no index. Any tag can go into any slot.
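The modulo rule for the three organizations can be written as one function. This is a minimal sketch for the 8-entry example; the name `set_index` and the sample block address 12 are hypothetical.

```python
NUM_ENTRIES = 8  # total cache entries, as in the example


def set_index(block_address: int, ways: int) -> int:
    """Index of the set a block maps to: block address modulo number of sets."""
    num_sets = NUM_ENTRIES // ways
    return block_address % num_sets


block_address = 12  # hypothetical block address

print(set_index(block_address, ways=1))  # direct mapped: 12 % 8 = 4
print(set_index(block_address, ways=2))  # 2-way: 12 % 4 = 0
print(set_index(block_address, ways=8))  # fully associative: one set, always 0
```

With `ways=8` there is a single set, which is another way of saying that a fully associative cache has no index.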

Position of a Memory Block

Example: Misses and Associativity in Cache

1) Direct Mapped Cache

2) 2-Way Set-Associative Cache

3) Fully Associative Cache
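To see how associativity changes the miss count, the three organizations can be simulated on a short access trace. Everything specific here is an assumption rather than something stated in the notes: a 4-block cache, the block-address trace 0, 8, 0, 6, 8, LRU replacement, and the function name `count_misses`.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, ways):
    """Count misses for a set-associative cache with LRU replacement (assumed)."""
    num_sets = num_blocks // ways
    # One OrderedDict per set; key order goes from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in block_addresses:
        current = sets[address % num_sets]
        if address in current:
            current.move_to_end(address)      # hit: mark as most recently used
        else:
            misses += 1
            if len(current) == ways:
                current.popitem(last=False)   # set is full: evict the LRU block
            current[address] = True
    return misses


trace = [0, 8, 0, 6, 8]  # assumed block-address trace

print(count_misses(trace, num_blocks=4, ways=1))  # direct mapped: 5
print(count_misses(trace, num_blocks=4, ways=2))  # 2-way set associative: 4
print(count_misses(trace, num_blocks=4, ways=4))  # fully associative: 3
```

In the direct mapped case blocks 0 and 8 keep evicting each other because both map to index 0; higher associativity lets them coexist.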

Effect of Associativity

Three Portions of an Address for Set Associative

As associativity increases, each index holds more sub-rooms (ways), so the number of indexes (sets) itself decreases.
--> The number of index bits decreases, and therefore the number of tag bits increases.
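The trade-off between index bits and tag bits can be made concrete with a small calculation. The cache parameters (1024 blocks, 16-byte blocks, 32-bit addresses) and the function name `address_fields` are hypothetical, and all sizes are assumed to be powers of two.

```python
def address_fields(num_blocks, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    num_sets = num_blocks // ways
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


for ways in (1, 2, 4, 1024):
    tag, index, offset = address_fields(num_blocks=1024, block_bytes=16, ways=ways)
    print(f"{ways:>4}-way: tag={tag} index={index} offset={offset}")
# 1-way (direct mapped): tag=18 index=10 offset=4
# 2-way:                 tag=19 index=9  offset=4
# 4-way:                 tag=20 index=8  offset=4
# 1024-way (fully):      tag=28 index=0  offset=4
```

Each doubling of associativity moves one bit from the index to the tag; the offset does not change because the block size is the same.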

4-Way Set Associative Cache

In a 4-way set associative cache, each index has 4 sub-rooms (ways).

--> After going to a particular index, i.e., a set, the 4 entries in that set must be searched to check whether the wanted tag is in the cache.
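The lookup described above can be sketched as follows. The cache geometry (256 sets, 4-byte blocks), the sample address, and the names `split_address` and `lookup` are hypothetical; the loop checks the four ways one after another only to keep the software simple.

```python
WAYS = 4
NUM_SETS = 256    # hypothetical number of sets
BLOCK_BYTES = 4   # hypothetical block size

# cache[index][way] holds (valid, tag); every entry starts out invalid.
cache = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]


def split_address(address):
    """Split a byte address into (tag, index)."""
    block_address = address // BLOCK_BYTES
    return block_address // NUM_SETS, block_address % NUM_SETS


def lookup(address):
    """Return the way that holds the address, or None on a miss."""
    tag, index = split_address(address)
    for way, (valid, stored_tag) in enumerate(cache[index]):
        if valid and stored_tag == tag:
            return way
    return None


address = 0x12345678  # hypothetical address
tag, index = split_address(address)
cache[index][2] = (True, tag)  # pretend the block was loaded into way 2

print(lookup(address))                           # 2 (hit in way 2)
print(lookup(address + NUM_SETS * BLOCK_BYTES))  # None (same index, different tag)
```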

Source: Computer Architecture, Prof. 윤명국, Ewha Womans University
