Computer Organization Principles (bilingual teaching courseware)
William Stallings, Computer Organization and Architecture, 6th Edition
Chapter 4: Cache Memory

4.1 Characteristics
- Location
- Capacity
- Unit of transfer
- Access method
- Performance
- Physical type
- Physical characteristics
- Organisation

Location (page 97)
- CPU (registers, cache)
- Internal (cache, main memory)
- External (disk, tape, ...)

Capacity (page 97)
- Word size: the natural unit of organisation (the number of bits in a word)
- Number of words, or number of bytes
- A word is not a universally agreed unit: its size differs from machine to machine. The byte is the standard unit: 1 byte = 8 bits.

Unit of Transfer 1 (page 97)
- Internal: usually governed by the data bus width; equal to the word length (or XX bits)
- External: usually a block, which is much larger than a word (e.g. 4, 8, 16, ... words); 1 block = 2^n words

Unit of Transfer 2
- Addressable unit: the smallest location which can be uniquely addressed
- A word internally; a cluster on M$ disks
- If the address code is A bits, it can represent addresses from 0 to 2^A - 1
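The addressable-unit arithmetic above (an A-bit address code reaches 2^A locations) can be written as a one-line sketch; the 24-bit value is the address width used in the chapter's later 16 MB example:

```python
def address_range(bits):
    """An address code of `bits` bits represents addresses 0 .. 2**bits - 1."""
    return 0, 2**bits - 1

lo, hi = address_range(24)       # 24-bit address, as in the 16 MB main-memory example
print(hi + 1 == 16 * 2**20)      # True: 2^24 addressable units = 16M
```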
Access Methods (1) (page 98)
- Sequential: start at the beginning and read through in order. Access time depends on the location of the data and on the previous location. E.g. tape.
- Direct: individual blocks have a unique address. Access is by jumping to the vicinity plus a sequential search. Access time depends on location and previous location. E.g. disk.

Access Methods (2)
- Random: individual addresses identify locations exactly. Access time is independent of location and of previous accesses. E.g. RAM.
- Associative: data is located by a comparison with the contents of a portion of the store. Access time is independent of location and of previous accesses. E.g. cache.

Memory Hierarchy (page 99)
- Registers: in the CPU
- Internal or main memory: may include one or more levels of cache; "RAM"
- External memory: backing store

Memory Hierarchy - Diagram
- Going down the hierarchy, speed decreases, but capacity increases and cost per bit falls.

Two levels of memory
- Level 1: fast but small. Level 2: slow but large.
- Access method: first try to fetch the data from the L1 memory. If the data is there (a hit), it is delivered directly: fast. If it is not in L1 (a miss), fetch it from the L2 memory (where it certainly is) into L1 and into the processor: slow.
- Let H = hit ratio (as a percentage), T1 = access time of L1, T2 = access time of L2. The average time to access the data is
  T = H*T1 + (1-H)*(T1+T2) = T1 + (1-H)*T2
  e.g. 0.95*0.01 + 0.05*(0.01+0.1) = 0.0095 + 0.0055 = 0.015
- If H is near 100%, T is near T1; if H is near 0%, T is near T1+T2 (see Figure 4.2, page 101).

Performance (page 98)
- Access time: the time between presenting the address and getting the valid data
- Memory cycle time: time may be required for the memory to "recover" before the next access; cycle time = access time + recovery time
- Transfer rate: the rate at which data can be moved

Physical Types
- Semiconductor: RAM (DRAM & SRAM)
- Magnetic: disk & tape
- Optical: CD & DVD
- Others: bubble, hologram

Physical Characteristics
- Decay
- Volatility
- Erasable
- Power consumption

Organisation
- The physical arrangement of bits into words
- Not always obvious, e.g. interleaved memory

The Bottom Line
- How much? Capacity
- How fast? Time is money (speed)
- How expensive? Cost

Hierarchy List
- Registers
- L1 cache
- L2 cache
- Main memory
- Disk cache
- Disk
- Optical
- Tape

So you want fast?
- It is possible to build a computer which uses only static RAM (see later)
- This would be very fast, and it would need no cache (a simpler structure): how can you cache cache?
- But it would cost a very large amount

Locality of Reference
- During the course of the execution of a program, memory references tend to cluster, e.g. loops

4.2 Cache (page 103)
- A small amount of fast memory
- Sits between normal main memory and the CPU
- May be located on the CPU chip or module

Cache operation - overview (page 105)
- The CPU requests the contents of a memory location, sending a real address
- Check the cache for this data
- If present, get it from the cache (fast)
- If not present, read the required block from main memory into the cache (slow), then deliver it from the cache to the CPU
- The cache includes tags to identify which block of main memory is in each cache slot (line)

Cache Design
- Size
- Mapping function
- Replacement algorithm
- Write policy
- Block size
- Number of caches

Size does matter
- Cost: more cache is expensive
- Speed: more cache is faster (up to a point), but checking the cache for data takes time

Typical Cache Organization
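The two-level average-access-time formula from the memory-hierarchy discussion above, T = T1 + (1-H)*T2, can be checked numerically with a short sketch, using the slides' own example values (H = 0.95, T1 = 0.01, T2 = 0.1):

```python
def avg_access_time(h, t1, t2):
    """Average access time for a two-level memory.

    h  : hit ratio of level 1 (0..1)
    t1 : access time of level 1
    t2 : access time of level 2 (paid only on a miss)
    """
    # A hit costs t1; a miss costs t1 + t2, since L1 is probed first, then L2.
    return h * t1 + (1 - h) * (t1 + t2)

# Slide example: H = 95%, T1 = 0.01, T2 = 0.1  ->  T = 0.015
print(round(avg_access_time(0.95, 0.01, 0.1), 4))  # 0.015
```

Note that the same value falls out of the simplified form T1 + (1-H)*T2 = 0.01 + 0.05*0.1 = 0.015, confirming the algebra on the slide.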
Mapping Function
- Running example: a cache of 64 kBytes with cache blocks of 4 bytes, i.e. the cache has 16k (2^14) lines of 4 bytes; 16 MBytes of main memory; 24-bit address (2^24 = 16M)

Direct Mapping
- Each block of main memory maps to only one (fixed) cache line: if a block is in the cache, it must be in one specific place
- The address is divided into 3 parts: the rightmost bits identify a unique word in the block (word number, word#); the middle bits specify the cache line (line number, line#); the leftmost bits serve as a mark (tag)
- Tag + line# = block number (block#)

Direct Mapping - Address Structure
- Tag: 8 bits | Line#: 14 bits | Word#: 2 bits
- 24-bit address; 2-bit word# (4-byte block); 22-bit block#; 14-bit line#; 8-bit tag (= 24 - 2 - 14)
- No two blocks that map to the same line have the same tag field
- Check the contents of the cache by finding the line and checking the tag

Direct Mapping Cache Line Table
- Cache line 0 holds main memory blocks 0, m, 2m, 3m, ..., 2^s - m
- Cache line 1 holds blocks 1, m+1, 2m+1, ..., 2^s - m + 1
- ...
- Cache line m-1 holds blocks m-1, 2m-1, 3m-1, ..., 2^s - 1

Direct Mapping Cache Organization

Direct Mapping Example
- Address 16339C H = 0001 0110 0011 0011 1001 1100
- Tag (8 bits) = 16 H; line# (14 bits) = 0CE7 H; word# (2 bits) = 0
- 16K-line cache

Direct Mapping Summary
- Address length = n bits; number of addressable units = 2^n words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^n / 2^w = 2^(n-w)
- Number of lines in cache = 2^r; size of line# = r bits
- Size of tag = (n - w - r) bits

Direct Mapping pros & cons
- Simple and inexpensive
- Fixed location for a given block: if a program repeatedly accesses 2 blocks that map to the same line, the miss rate is very high

Associative Mapping
- A main memory block can load into any line of the cache
- The memory address is interpreted as a tag and a word#; the tag uniquely identifies a block of memory
- Every line's tag is examined for a match, so cache searching gets expensive

Fully Associative Cache Organization

Associative Mapping Example
- 16K-line cache; tag: 22 bits, word: 2 bits
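The direct-mapped address split described above (8-bit tag, 14-bit line#, 2-bit word#) is easy to verify programmatically; a minimal sketch that reproduces the 16339C H worked example:

```python
def direct_map_split(addr):
    """Split a 24-bit address into (tag, line, word) for the example
    cache: 2-bit word#, 14-bit line#, 8-bit tag."""
    word = addr & 0x3            # lowest 2 bits: byte within the 4-byte block
    line = (addr >> 2) & 0x3FFF  # next 14 bits: cache line number
    tag = addr >> 16             # top 8 bits: tag
    return tag, line, word

tag, line, word = direct_map_split(0x16339C)
print(hex(tag), hex(line), word)  # 0x16 0xce7 0
```

This matches the slide's decomposition: tag 16 H, line# 0CE7 H, word# 0.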
Associative Mapping - Address Structure
- A 22-bit tag is stored with each 32-bit block of data
- Compare the address's tag field with every tag entry in the cache to check for a hit
- The least significant 2 bits of the address identify which byte is required from the 32-bit (4-byte) data block
- e.g. address FFFFFC H: tag 3FFFFF, data 24682468, may sit in any cache line (0 to 3FFF); address 16339C H: tag 058CE7, data FEDCBA98, any line

Associative Mapping Summary
- Address length = (s + w) bits; number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
- Number of lines in cache = undetermined (any block can go in any line)
- Size of tag = s bits
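Under associative mapping the tag is simply the block number, i.e. the address with the 2 word bits dropped. A quick check of the two example addresses above:

```python
def assoc_tag(addr):
    """22-bit tag for the fully associative example cache:
    drop the 2 word bits; the rest of the 24-bit address is the tag."""
    return addr >> 2

print(hex(assoc_tag(0xFFFFFC)))  # 0x3fffff
print(hex(assoc_tag(0x16339C)))  # 0x58ce7
```

Both values agree with the tags 3FFFFF and 058CE7 given in the example.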
Set Associative Mapping
- The cache is divided into a number of sets; each set contains a number of lines
- A given block maps to any line within one fixed set: e.g. block B can be in any line of set i
- E.g. with 2 lines per set (2-way associative mapping), a given block can be in one of the 2 lines of exactly one set

Set Associative Mapping Example
- 13-bit set number (2^13 sets, 2-way)
- Set number = block number in main memory modulo 2^13
- Blocks whose byte addresses are 008000 H apart share a set: e.g. addresses 000000, 008000, 010000, 018000 all map to set 0

Two Way Set Associative Cache Organization

Set Associative Mapping - Address Structure
- Tag: 9 bits | Set#: 13 bits | Word#: 2 bits
- Use the set field to determine which cache set to look in, then compare the tag field to see if we have a hit
- e.g. address 16339C H: tag 02C, set number 0CE7, data FEDCBA98; address FFFFFF H: tag 1FF, set number 1FFF, data 11223344

Two Way Set Associative Mapping Example
- Address 16339C in the 16K-line cache

Set Associative Mapping Summary
- Address length = (s + w) bits; number of addressable units = 2^(s+w) words or bytes
- Block size = line size = 2^w words or bytes
- Number of blocks in main memory = 2^s
- Number of lines per set = k; number of sets v = 2^d; number of lines in cache = kv = k * 2^d
- Size of tag = (s - d) bits

Replacement Algorithms (1): Direct Mapping (page 115)
- When a new block is brought into the cache, one of the existing blocks must be replaced. Which one?
- No choice: each block maps to one fixed line, so that line is replaced

Replacement Algorithms (2): Associative & Set Associative
- Hardware-implemented algorithms (for speed)
- Least recently used (LRU) (see Problems 4.8 & 4.11). E.g. in a 2-way set associative cache, which of the 2 blocks is least recently used? Each line includes a USE bit. When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in the same set is set to 0. The line whose USE bit is 0 is the LRU line, and replacement takes place in that line.

Replacement Algorithms (3): Associative & Set Associative
- First in first out (FIFO): replace the block that has been in the cache longest, using a counter to record history
- Least frequently used (LFU): replace the block which has had the fewest hits, using a counter to record frequency
- Random
Write Policy (page 118)
- A cache block must not be overwritten (replaced) unless main memory is up to date
- Multiple CPUs may have individual caches
- I/O may address main memory directly

Write through
- All writes go to main memory as well as to the cache: whenever the data in a cache line is changed (written), the change is made in the corresponding memory block at the same time
- Multiple CPUs can monitor main memory traffic to keep their local caches up to date
- Drawbacks: lots of memory traffic; slows down writes
- Beware of bogus (fake) write-through caches!

Write back
- Updates are initially made in the cache only
- An update (dirty) bit for the cache line is set when the update occurs
- When a block is to be replaced, it is written back to main memory only if its update bit is set
- Drawbacks: other caches get out of sync, and I/O must access main memory through the cache
- N.B. roughly 15% of memory references are writes

Line size (page 119)
- Simply put, a larger block size (equal to the line size) may raise the hit ratio, but it also brings heavier bus traffic, and with too few, too-large blocks the hit ratio falls again

Number of caches (page 119)
- Multi-level caches: an on-chip cache plus an on-board cache
- Typical speed ratios, CPU : L1 cache : L2 cache : memory, are about 1 : 1-2 : 10 : 50-100
- Unified cache or split caches: an on-chip cache that holds both data & instructions is a unified cache; it dynamically adjusts the share of data versus instructions according to the hit ratio. In a split cache, data is written back to memory but instructions are not (they are read-only).

Pentium 4 Cache
- 80386: no on-chip cache
- 80486: 8 kB, using 16-byte lines and a four-way set associative organization
- Pentium (all versions): two (split) on-chip L1 caches, for data & instructions
- Pentium 4 L1 caches: 8 kB, 64-byte lines, four-way set associative
- L2 cache: feeds both L1 caches; 256 kB, 128-byte lines, 8-way set associative

Pentium 4 Diagram (Simplified)
- Note: now up to 32 MB / 2 MB

Pentium 4 Core Processor
- Fetch/decode unit: fetches instructions from the L2 cache, decodes them into micro-ops, and stores the micro-ops in the L1 cache
- Out-of-order execution logic: schedules micro-ops based on data dependences and resources; may execute speculatively
- Execution units: execute micro-ops, taking data from the L1 cache and leaving results in registers
- Memory subsystem: L2 cache and system bus

Pentium 4 Design Reasoning
- Instructions are decoded into RISC-like, fixed-length micro-ops before the L1 cache, which suits superscalar pipelining and scheduling
- Pentium instructions are long & complex; performance is improved by separating decoding from scheduling & pipelining (more later, ch. 14)
- The data cache is write back, but can be configured to write through
- The L1 cache is controlled by 2 bits in a register: CD (cache disable) and NW (not write-through)
- 2 instructions (INVD, WBINVD) invalidate (flush) the cache, and write back and then invalidate
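The write-back behavior just described for the Pentium 4 data cache (an update/dirty bit per line, with memory written only on eviction) can be sketched abstractly. This toy `WriteBackLine` is an illustration of the policy, not Intel's implementation:

```python
class WriteBackLine:
    """A single write-back cache line with an update (dirty) bit."""

    def __init__(self, tag, data):
        self.tag, self.data = tag, data
        self.dirty = False  # update bit: set on write, cleared on write-back

    def write(self, data):
        self.data = data
        self.dirty = True   # update the cache only; memory is now stale

    def evict(self, memory):
        # On replacement, write to main memory only if the update bit is set.
        if self.dirty:
            memory[self.tag] = self.data
            self.dirty = False

memory = {0x16: "old"}
line = WriteBackLine(0x16, "old")
line.write("new")     # memory still holds "old" here
line.evict(memory)    # the dirty line is written back on eviction
print(memory[0x16])   # new
```

The point of the policy is visible in the middle step: after `write`, main memory is stale until the line is evicted, which is exactly why other caches and I/O must be careful.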
PowerPC Cache Organization
- 601: a single 32 kB cache, 8-way set associative
- 603: 16 kB (2 x 8 kB), two-way set associative
- 604: 32 kB
- 620: 64 kB
- G3 & G4: 64 kB L1 cache, 8-way set associative; 256 kB, 512 kB or 1 MB L2 cache, two-way set associative

PowerPC G4

Comparison of Cache Sizes

Pentium 4 Cache
- 80386: no on-chip cache
- 80486: 8 kB, using 16-byte lines and a four-way set associative organization
- Pentium (all versions): two on-chip
  L1 caches (data & instructions)
- Pentium III: L3 cache added, off-chip
- Pentium 4: L1 caches of 8 kB, 64-byte lines, four-way set associative; L2 cache feeding both L1 caches, 256 kB, 128-byte lines, 8-way set associative; L3 cache on-chip

Intel Cache Evolution
- Problem: external memory is slower than the system bus. Solution: add an external cache using faster memory technology. (386)
- Problem: increased processor speed makes the external bus a bottleneck for cache access. Solution: move the external cache on-chip, operating at the same speed as the processor. (486)
- Problem: the internal cache is rather small, due to limited space on the chip. Solution: add an external L2 cache using faster technology than main memory. (486)
- Problem: contention occurs when the instruction prefetcher and the execution unit simultaneously require access to the cache; in that case, the prefetcher is stalled while the execution unit's data access takes place. Solution: create separate data and instruction caches. (Pentium)
- Problem: increased processor speed makes the external bus a bottleneck for L2 cache access. Solution: create a separate back-side bus (BSB) that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. (Pentium Pro) Then move the L2 cache onto the processor chip. (Pentium II)
- Problem: some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
  Solution: add an external L3 cache. (Pentium III) Then move the L3 cache on-chip. (Pentium 4)

Pentium 4 Block Diagram
- L2 and L3: 8-way set associative, 128-byte line size

PowerPC G5
- 32 kB instruction cache; 64 kB data cache

PowerPC G5 Block Diagram
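As a wrap-up of the three mapping schemes, the chapter's running example address 16339C H can be decomposed under each one. The field widths are those of the 64 kB / 4-byte-block example cache (direct: 8/14/2; associative: 22/2; 2-way set associative: 9/13/2):

```python
ADDR = 0x16339C  # the chapter's running 24-bit example address

# Direct mapping: 8-bit tag | 14-bit line# | 2-bit word#
direct = (ADDR >> 16, (ADDR >> 2) & 0x3FFF, ADDR & 0x3)

# Associative mapping: 22-bit tag | 2-bit word#
assoc = (ADDR >> 2, ADDR & 0x3)

# 2-way set associative: 9-bit tag | 13-bit set# | 2-bit word#
setassoc = (ADDR >> 15, (ADDR >> 2) & 0x1FFF, ADDR & 0x3)

print([hex(x) for x in direct])    # ['0x16', '0xce7', '0x0']
print([hex(x) for x in assoc])     # ['0x58ce7', '0x0']
print([hex(x) for x in setassoc])  # ['0x2c', '0xce7', '0x0']
```

All three results match the worked examples on the slides: tag 16 H / line 0CE7 H (direct), tag 058CE7 H (associative), and tag 02C H / set 0CE7 H (set associative). Note that the line number and the set number coincide here only because both are taken from the bits just above the word field.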