Thanks to Chris Terman and MIT open courseware. These notes are from an MIT lecture found here
|0:20||What we want in a memory.|
|2:25||Technologies for memories. (see table)|
|5:30||SRAM Memory Cell|
|10:23||1-T Dynamic RAM|
|17:40||Challenge to cope with Quality vs. Quantity|
|18:15||Key idea: Best of both worlds using Memory hierarchy|
|20:55||Memory reference patterns. Locality for program, stack and data|
|24:20||Exploiting the Memory Hierarchy|
|31:53||The Cache Idea: Program-Transparent Memory Hierarchy|
|34: 16||How high of a Hit Ratio do we need?|
|36: 15||The Cache Principle|
|46: 16||Direct Mapped Cache|
|47: 36||Contention Problem: Contention, Death and Taxes|
Professor talks about the detailed low level details of memory, addr, DIN/DOUT.
Two kinds of memories:
- 2-port main memory: One port for program counter and get back an instruction, the other port is to use load and store instructions, computing a memory address with an offset to get data.
- Register file: Built into the CPU data, two register operands for each instruction. Same organization as 2-port memory.
Technologies for memories:
|Register||100’s of bits||20ps||$$$$|
|SRAM||100’s of Kbytes||1ns||$$$|
|DRAM||1000’s of Mbytes||40ns||$|
|Hard disk*||100’s of Gbytes||10ms||Cents|
The real bottleneck is if we have to fetch each instruction from the memory, there is a high order of latency even though the processor is very fast.
In past the speed of the processor has improved with CMOS technologies. The capacity of DRAM has increased, as the size of the transistors get smaller and smaller, but the latency in the DRAM which are dictated by the size of the memory, have not increased dramatically as compared to processor.
Static Ram – A technology that is used in our register file (one of the types of memory mentioned above). Professor talks about the low-level gates and transistors of SRAM that uses inverters. There is a static bi-stable storage element. The writes of bits “overpower” the reads.
We can build multi-port SRAMs. One can increase the number of SRAM ports by adding access transistors. By carefully sizing the inverter pair, so that one is strong and the other is weak, we can assure that our WRITE bus will only fight with the weaker one, and the READs are driven by the stronger one – thus minimizing both access and write times.
1-T Dynamic Ram
It is a high capacity memory system, is much simpler – involves six transistors/cell may not sound much, but they can add up quickly. What is the fewest number of transistors that can be used to store a bit? This is determined by area, better dielectric, thinner film, there is a formula to calculate that.
The interesting idea here is that every 10ms your computer is reading all the data in the memory and writing it back again so that it does not get lost.
A trick to increase throughput with the idea of pipelining. Send over the address in couple different chunks.
Synchronous DRAM (SDRAM)
Double-clocked Synchronous Memory (DDR)
The idea of DDR RAM is that it uses a clock transmission protocol. The reason the machine is slow because fetching the data from this memory system is slow.
- Average latency = 4ms
- Average seek time = 9ms
- Transfer rate = 20Mbytes/sec
- Capacity = 1TB
- Cost <= $1/Gbytes
- Spinning tracks: 7000 – 15000 RPM
There are cylinders with level of discs. Discs have tracks which are divided into sectors. The shaft and the read/write head is a mechanical device. Information is stored in concentric circles to minimize randomization of head.
Quantity vs Quality
- Your memory can be BIG and slow …. or …
- SMALL and FAST.
Is there an architectural solution to this DILEMMA.
We can nearly get our wish.
KEY: Use a hierarchy of memory technologies
- Keep the most often-used data in a small, fast SRAM (often local to CPU chip)
- Refer to Main Memory only rarely, for remaining data.
- The reason this strategy works: LOCALITY
Statistically researchers have found a memory reference pattern. See diagram (21:03).
Program: Branching factor also affects the speed, usually if-else statements that branch program paths out.
Stack: At any given moment we are using a small amount of the stack in a program – called the activation records for the current subroutine.
Data: Copying data from one data structure to another or performing computation on it.
Exploiting the Memory Hierarchy
Approach1: (Cray, others): Expose Hierarchy
Hardware types: SMOP – As hardware guys get lazy they push the programmer to write smarter programs. Until recently these were the fastest machines on earth, Cray super computers. The argument of this type by Seymour Cray was that you cannot fake something that you do not have. And that is fake a huge faster memory.
- Register, Main Memory
Disk each available as storage alternatives
- Tell programmers: “Use them cleverly”
Approach2: Hide Hierarchy
Here the idea – the hardware looks over the shoulder, and manages of locality of reference. This is a layer abstraction that does a memory management.
- Programming model: SINGLE kind of memory, single address space
- Machine AUTOMATICALLY assigns location to fast or slow memory depending on usage patterns.
CPU looks at small static cache (usually L1/L2) and then the DRAM and then Hard disk. Most of what you buy in a processor is the cache memory. The size of the cache is important. Ideally you want most information to be found in the yellow box (small static cache).
The Cache Idea: Program-Transparent Memory Hierarchy
Cache contains “temporary copies” of selected main memory locations.
Challenge is to make hit ratio as high as possible.
- Improve the average access time
- HIT RATIO : Fraction of refs found in CACHE
- MISS RATIO: Remaining references
- Transparency (compatibility, programming ease)
How High of a Hit Ratio?
Suppose we can easily build an on-chip static memory with a 4ns access time, but the fastest DRAM that we can buy for main memory has an average access time of 40ns. How high of a hit rate do we need to sustain an average speed of 5ns? (Only slightly slower than cache?)
Over 97% of the time the instruction should be in the small cache. Over a period of time there is a subset of instructions that the processor can process and will need for computation. If the cache is big enough to accommodate that then we can achieve our hit ratio. The amount of time the CPU takes to process that should be balanced with amount of time to load the misses.
The Cache Principle
ALGORITHM: Look nearby for the requested information first, if it’s not there, check secondary storage.
Basic cache algorithm:
Cache knows two things, which addresses it has and the contents of them. CPU if there is a hit for the data in the cache, it can update the data. And then it is the cache’s responsibility to update it in the main memory. If there is a miss, then cache has to replace something from cache and replace it with something from the main memory.
Associativity: Parallel Lookup
Look at every row or line of the cache, and see if it has what CPU is looking for, all in parallel. Any data item can be located in any cache location. Fully –associative cache are very expensive and we need half of the register area to store address.
A cheaper alternative to associative cache. This is non-associative, where it indexes the data and look-up serially (as opposed to parallel). The basic idea is to use a table-index to find memory location quickly, because parallel operation of the same is expensive.
Problem: Contention and cache conflicts. Improve the mapping indexing function. (So use low-order as opposed to higher order of the address) – Since high-order do not change much given locality of references.
L1 cache: Are very small but very fast cache. They are a few thousand entries long, and they respond in 10ps.
Next lecture deals with the cache issues, and if there is a happy middle ground.
- Flexible: any address can be cached in any line
- Cheap (ordinary SRAM)
- Contention: Addresses compete for cache lines.