The Memory Hierarchy

Thanks to Chris Terman and MIT OpenCourseWare. These notes are from an MIT lecture found here

Timeline:

0:20 What we want in a memory.
2:25 Technologies for memories. (see table)
5:30 SRAM Memory Cell
10:23 1-T Dynamic RAM
16:00 Hard disk
17:40 Challenge to cope with Quality vs. Quantity
18:15 Key idea: Best of both worlds using Memory hierarchy
20:55 Memory reference patterns. Locality for program, stack and data
24:20 Exploiting the Memory Hierarchy
31:53 The Cache Idea: Program-Transparent Memory Hierarchy
34:16 How high of a Hit Ratio do we need?
36:15 The Cache Principle
46:16 Direct Mapped Cache
47:36 Contention Problem: Contention, Death and Taxes

The professor covers the low-level details of the memory interface: the address lines (addr) and the data-in/data-out (DIN/DOUT) ports.

Two kinds of memories:

  1. 2-port main memory: one port serves the program counter and returns an instruction; the other serves load and store instructions, which compute a memory address from a base plus an offset to access data.
  2. Register file: built into the CPU datapath, supplying two register operands for each instruction. Same organization as the 2-port memory.

Technologies for memories:

Technology   Capacity           Latency   Cost
Register     100's of bits      20 ps     $$$$
SRAM         100's of KBytes    1 ns      $$$
DRAM         1000's of MBytes   40 ns     $
Hard disk*   100's of GBytes    10 ms     cents
Desired      1's of GBytes      1 ns      cheap

The real bottleneck: if we have to fetch each instruction from main memory, every fetch pays latency that is orders of magnitude higher than the processor's cycle time, no matter how fast the processor is.

In the past, processor speed has improved steadily with CMOS technology. DRAM capacity has also increased as transistors get smaller and smaller, but DRAM latency, which is dictated by the size of the memory array, has not improved anywhere near as dramatically as processor speed.

SRAM

Static RAM – the technology used in our register file (one of the types of memory mentioned above). The professor walks through the low-level gates and transistors of the SRAM cell, which stores a bit in a static, bistable element built from cross-coupled inverters. During a write, the bit-line drivers "overpower" the cell to flip the stored value.

We can build multi-port SRAMs: the number of ports is increased by adding access transistors. By carefully sizing the inverter pair so that one is strong and the other weak, we ensure that the WRITE bus only has to fight the weaker inverter, while READs are driven by the stronger one – thus minimizing both access and write times.

1-T Dynamic Ram

DRAM is a high-capacity memory system with a much simpler cell. SRAM's six transistors per cell may not sound like much, but they add up quickly. What is the fewest number of transistors that can be used to store a bit? One access transistor plus a storage capacitor; how much charge the capacitor holds is determined by its plate area, the dielectric, and the film thickness (a better dielectric and a thinner film give more capacitance, per the parallel-plate formula C = εA/d).

The interesting idea here is that roughly every 10 ms your computer reads all the data in DRAM and writes it back again (refresh), so that the charge leaking off the cell capacitors does not get lost.
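A rough back-of-the-envelope for what refresh costs, as a sketch: the 10 ms period is from the lecture, but the row count and per-row refresh time below are illustrative assumptions, not lecture figures.

    # Fraction of time the DRAM spends refreshing itself rather than
    # serving requests. Row count and per-row time are assumed.
    rows = 8192                  # rows in a hypothetical DRAM array
    t_row_refresh = 60e-9        # seconds to refresh one row (assumed)
    refresh_period = 10e-3       # refresh every 10 ms, per the lecture

    overhead = rows * t_row_refresh / refresh_period
    print(f"refresh overhead: {overhead:.1%}")   # ~4.9%

Even a few percent of lost bandwidth is a price worth paying for the huge density win of the 1-T cell.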

A trick to increase throughput is pipelining: send the address over in a couple of chunks (row address, then column address) so that one transfer can overlap the next request.

Synchronous DRAM (SDRAM)

Double Data Rate Synchronous Memory (DDR)

The idea of DDR RAM is that data is transferred on both the rising and the falling edge of the clock, doubling throughput at the same clock rate. Even so, the reason the machine is slow is that fetching data from this memory system is slow.
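A quick sketch of why clocking both edges helps; the 100 MHz clock and 8-byte bus width are assumed numbers for illustration, not from the lecture.

    # Peak transfer rate of a double-data-rate bus: two transfers per
    # clock cycle instead of one. Clock rate and bus width are assumed.
    clock_hz = 100e6             # 100 MHz bus clock (assumed)
    bus_bytes = 8                # 64-bit data bus (assumed)

    sdr = clock_hz * bus_bytes          # one transfer per cycle
    ddr = 2 * clock_hz * bus_bytes      # transfers on both clock edges
    print(f"SDR peak: {sdr/1e9:.1f} GB/s, DDR peak: {ddr/1e9:.1f} GB/s")

This doubles peak bandwidth, but it does nothing for the latency of the first word – which is the number the processor is usually waiting on.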

Hard disk

  • Average latency = 4ms
  • Average seek time = 9ms
  • Transfer rate = 20Mbytes/sec
  • Capacity = 1TB
  • Cost <= $1/Gbytes
  • Spindle speed: 7000 – 15000 RPM

A drive holds a stack of platters; the set of tracks at the same radius across all the platters forms a cylinder. Each track is divided into sectors. The spindle and the read/write head form a mechanical device, which is why access is slow. Information is stored in concentric tracks, and related data is placed together to minimize random movement of the head.
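Plugging the figures listed above into a simple access-time model (seek + rotational latency + transfer) shows why small random reads are so expensive; the 4 KB block size is an assumption.

    # Time to read one block from disk, using the figures listed above.
    seek = 9e-3                  # average seek time, 9 ms
    latency = 4e-3               # average rotational latency, 4 ms
    rate = 20e6                  # transfer rate, 20 MB/s
    block = 4096                 # block size in bytes (assumed)

    t = seek + latency + block / rate
    print(f"one 4 KB random read: {t*1e3:.2f} ms")        # ~13.2 ms
    print(f"effective rate: {block/t/1e6:.2f} MB/s")      # ~0.3 MB/s

Almost all of the 13 ms is mechanical positioning; the actual data transfer is a rounding error, which is why the effective rate collapses to a tiny fraction of the 20 MB/s streaming rate.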

Quantity vs Quality

  • Your memory can be BIG and slow … or …
  • SMALL and FAST.

Is there an architectural solution to this DILEMMA?

We can nearly get our wish.

KEY: Use a hierarchy of memory technologies

Key Idea

  • Keep the most often-used data in a small, fast SRAM (often local to CPU chip)
  • Refer to Main Memory only rarely, for remaining data.
  • The reason this strategy works: LOCALITY

Statistically, researchers have found consistent memory reference patterns. See the diagram (21:03).

Program: instruction fetches proceed mostly sequentially through memory. The branching factor also affects the pattern; if-else statements occasionally fork the program path, usually to nearby addresses.

Stack: at any given moment we are using only a small part of the stack – the activation record of the current subroutine.

Data: Copying data from one data structure to another or performing computation on it.
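A toy sketch of the three patterns as hypothetical address traces (all addresses are made up for illustration): the program counter marches sequentially, the stack keeps reusing a few nearby words, and a copy sweeps linearly through the source and destination structures.

    # Hypothetical address traces for the three locality patterns.
    pc_trace    = [0x1000 + 4 * i for i in range(8)]        # sequential fetch
    stack_trace = [0x8000 - 4 * (i % 3) for i in range(8)]  # a few hot words
    copy_trace  = [a for i in range(4)                      # linear sweep of
                   for a in (0x2000 + 4 * i, 0x3000 + 4 * i)]  # src and dst

    for name, trace in [("program", pc_trace), ("stack", stack_trace),
                        ("data", copy_trace)]:
        print(name, [hex(a) for a in trace])

In every case the next reference lands on or near an address touched very recently – exactly the locality that the hierarchy exploits.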

Exploiting the Memory Hierarchy

Approach 1 (Cray, others): Expose the Hierarchy

Hardware types call this a SMOP ("simple matter of programming"): when the hardware folks get lazy, they push the programmer to write smarter programs. Until recently these were the fastest machines on earth, the Cray supercomputers. Seymour Cray's argument for this approach was that you cannot fake something you do not have – namely, a huge, fast memory.

    • Registers, Main Memory, Disk each available as storage alternatives
    • Tell programmers: "Use them cleverly"

Approach 2: Hide the Hierarchy

Here the idea is that the hardware looks over the program's shoulder and manages locality of reference automatically. It is a layer of abstraction that does the memory management for you.

    • Programming model: SINGLE kind of memory, single address space
    • Machine AUTOMATICALLY assigns location to fast or slow memory depending on usage patterns.

The CPU looks first in a small static cache (usually L1/L2), then in DRAM, and then on the hard disk. Most of what you buy in a processor is cache memory, so the size of the cache is important. Ideally you want most references to be found in the yellow box of the lecture's diagram (the small static cache).

The Cache Idea: Program-Transparent Memory Hierarchy

Cache contains “temporary copies” of selected main memory locations.

Challenge is to make hit ratio as high as possible.

Goals:

  • Improve the average access time
    • HIT RATIO : Fraction of refs found in CACHE
    • MISS RATIO: the remaining references (= 1 − hit ratio)
  • Transparency (compatibility, programming ease)

How High of a Hit Ratio?

Suppose we can easily build an on-chip static memory with a 4ns access time, but the fastest DRAM that we can buy for main memory has an average access time of 40ns. How high of a hit rate do we need to sustain an average speed of 5ns? (Only slightly slower than cache?)
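Working the numbers under one simple model, where a hit costs the 4 ns cache access and a miss costs the full 40 ns main-memory access: t_avg = h·4 + (1−h)·40 ≤ 5, solved for the hit ratio h below.

    # Solve 5 = 4*h + 40*(1 - h) for the required hit ratio h.
    h = (40 - 5) / (40 - 4)
    print(f"required hit ratio: {h:.3f}")   # 0.972 -> over 97%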

Over 97% of the time, the reference must be found in the small cache. Over any window of time there is a subset of instructions and data that the processor needs for its computation; if the cache is big enough to hold that subset, we can achieve this hit ratio. The time the CPU spends computing should be balanced against the time it takes to service the misses.

The Cache Principle

ALGORITHM: look nearby for the requested information first; if it's not there, check secondary storage.

Basic cache algorithm:

The cache knows two things: which addresses it currently holds and their contents. On a hit, the CPU can read or update the data in the cache, and it then becomes the cache's responsibility to update main memory. On a miss, the cache has to evict something and replace it with the requested data from main memory.
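A minimal sketch of this loop in Python, assuming a tiny fully associative cache with FIFO eviction; the cache size, the eviction policy, and the main_memory dict are assumptions for illustration, not details from the lecture.

    # Basic cache algorithm: return the cached copy on a hit; on a
    # miss, evict an entry and fetch the address from main memory.
    from collections import OrderedDict

    CACHE_LINES = 4
    cache = OrderedDict()            # address -> data, insertion-ordered

    def read(addr, main_memory):
        if addr in cache:            # HIT: found in the cache
            return cache[addr]
        if len(cache) >= CACHE_LINES:
            cache.popitem(last=False)        # MISS: evict the oldest line
        cache[addr] = main_memory[addr]      # copy in from main memory
        return cache[addr]

    main_memory = {a: a * 10 for a in range(16)}   # toy backing store
    for a in [0, 1, 2, 0, 5, 6, 0]:                # note the reuse of 0
        read(a, main_memory)

The write path (update the cached copy, then let the cache update main memory) is omitted to keep the sketch short.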

Associativity: Parallel Lookup

Look at every row (line) of the cache in parallel to see whether it holds what the CPU is looking for. Any data item can be located in any cache location. Fully associative caches are very expensive: roughly half the storage area goes to holding the addresses (tags) rather than the data.

Direct-Mapped Cache

A cheaper alternative to the associative cache. It is non-associative: part of the address is used as a table index, and only the single indexed line is checked (as opposed to a parallel search of every line). The basic idea is that a table lookup finds the location quickly, because doing the comparison in parallel across all lines is expensive.

Problem: contention and cache conflicts. The fix is to improve the mapping (index) function: use the low-order rather than the high-order bits of the address, since the high-order bits barely change given locality of reference, and indexing on them would pile nearby references onto the same lines.
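A sketch of the indexing function: the low-order bits of the address select the line, and the remaining high-order bits are kept as a tag so the cache can tell whose data currently occupies that line. The line count is an illustrative assumption.

    # Direct-mapped lookup: low-order address bits index the line table.
    NUM_LINES = 8                    # must be a power of two (assumed)
    lines = [None] * NUM_LINES       # each entry is (tag, data) or None

    def access(addr, main_memory):
        index = addr % NUM_LINES     # low-order bits pick the line
        tag = addr // NUM_LINES      # high-order bits identify the owner
        entry = lines[index]
        if entry is not None and entry[0] == tag:
            return entry[1]          # hit
        # Miss: whoever held this line is evicted. Addresses addr and
        # addr + NUM_LINES always collide here - that is contention.
        lines[index] = (tag, main_memory[addr])
        return main_memory[addr]

Two addresses that differ by a multiple of the table size fight over the same line forever, which is exactly the contention problem named above.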

L1 caches are very small but very fast: a few thousand entries long, responding in about 10 ps.

The next lecture deals with these cache issues and asks whether there is a happy middle ground.

Fully Associative

  • Expensive
  • Flexible: any address can be cached in any line

Direct Mapped

  • Cheap (ordinary SRAM)
  • Contention: Addresses compete for cache lines.
