Scalar Processor Study (标量处理机学习.pptx)


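2023/3/25

5-1 Basic Structure of a Superscalar Processor

If the number of instructions a processor can have in execution at the same time is defined as its instruction-level parallelism (ILP), then a single k-stage pipeline has an ILP of k. A superscalar processor containing n such pipelines has an ILP of n × k.

[Figure 2-26: Organization of a typical superscalar processor (integer registers, floating-point registers, memory)]

5-2 Single Issue and Multiple Issue

The process by which a processor obtains instructions from the instruction store (or instruction dispatch unit) is called "issue". A processor that can fetch only one instruction for execution in a single clock cycle is a single-issue processor. A processor that can fetch several instructions in the same clock cycle is a multiple-issue processor.

[Figure 2-28: Single-issue versus multiple-issue operation: (a) single issue, (b) multiple issue; each instruction passes through the IF, ID, EX and WR stages]

5-3 Superscalar Pipelined Processors

Issue policies for superscalar pipelines

As noted earlier, three factors limit instruction-level parallelism: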

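1. Structural dependences, i.e. resource conflicts;
2. Control dependences;
3. Data dependences, i.e. WR (write-then-read, RAW), RW (read-then-write, WAR) and WW (write-then-write, WAW) dependences.

In a superscalar pipeline these dependences make matters more complicated. Scheduling of the superscalar pipeline, i.e. the policy for issuing and completing instructions, is therefore very important for exploiting instruction-level parallelism and raising the performance of a superscalar processor. An instruction issue policy covers two things: the order in which instructions are fetched, and the order in which the fetched instructions are executed.

5-4 Superscalar Pipelined Processors

Instruction issue is the process of starting an instruction into the execute stage. An instruction issue policy is the protocol or set of rules used for issuing. When instructions are issued in program order, this is called in-order issue. To improve pipeline performance, instructions that have dependences can be deferred and later, independent instructions issued ahead of them. Issuing instructions out of their original program order in this way is called out-of-order issue. Likewise, instruction completion can be in-order or out-of-order. In general, out-of-order issue always leads to out-of-order completion.

5-5 Superscalar Pipelined Processors

A superscalar pipeline has three scheduling policies:

1. In-order issue with in-order completion;
2. In-order issue with out-of-order completion;
3. Out-of-order issue with out-of-order completion.

Whichever policy is used, the final result of the program must be correct.

5-6 Superscalar Pipelined Processors

Assume a superscalar pipeline with a degree of parallelism of 2, structured as in Fig. 7(a). It has four stages: fetch (F), decode (D), execute (E) and write-back (W). The F, D and W stages each take 1 clock cycle. The E stage has several functional units: the LOAD/STORE unit completes a D-cache access in 1 clock cycle, the adder needs 2 clock cycles for an addition, and the multiplier needs 3 clock cycles for a multiplication. The adder and the multiplier are both pipelined. The F and D stages require instructions to enter in pairs. The E stage has internal data forwarding, so a result can be used as soon as it is produced.

5-7 Superscalar Pipelined Processors

The program used consists of the following six instructions:

I1  LOAD R1, M(A)   ; R1 ← M(A)
I2  ADD  R2, R1     ; R2 ← (R2) + (R1)
I3  ADD  R3, R4     ; R3 ← (R3) + (R4)
I4  MUL  R4, R5     ; R4 ← (R4) × (R5)
I5  LOAD R6, M(B)   ; R6 ← M(B)
I6  MUL  R6, R7     ; R6 ← (R6) × (R7)

Among these instructions, I1 and I2 have a WR dependence, I3 and I4 have an RW dependence, and I5 and I6 have both a WW and a WR dependence.

5-8 Superscalar Pipelined Processors

1. In-order issue

Fig. 7(b) shows how instructions advance through the decode, execute and write-back stages under in-order issue with in-order completion, and Fig. 7(c) gives the space-time diagram of the pipeline.

5-9 Superscalar Pipelined Processors

I5 is independent of I3 and I4. If its write-back is not postponed and instead takes place at clock 7, the semantics of the program are still correct. In that case I5 completes before I4, which is in-order issue with out-of-order completion; its space-time diagram is shown in Fig. 8. Although the total completion time is still 10 clock cycles, the I5 delay seen in Fig. 7(b) disappears and the utilization of the LOAD/STORE unit improves.

5-10 Superscalar Pipelined Processors

2. Out-of-order issue

With in-order issue, the decode stage only checks each arriving instruction for resource conflicts or data dependences. If there are none, the instruction is issued in order; otherwise it is held in the decode stage until the conflict or dependence disappears, as with I2 in Fig. 7(b). If the processor has look-ahead capability, later instructions may include independent ones that have no dependence on the instructions already in the pipeline. Such instructions should be decoded and executed early so that the multiple pipelines of the superscalar processor are fully used. This is the purpose of out-of-order issue.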
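The dependences stated for the six-instruction program (slide 5-7) can be cross-checked with a short script. This is a minimal sketch, not part of the original slides. It assumes that each two-operand instruction "OP Rd, Rs" reads both registers and writes Rd, and that LOAD only writes its register. The names PROGRAM and find_dependences are illustrative. RAW, WAR and WAW correspond to the WR, RW and WW dependences of the text.

```python
from itertools import combinations

# Each entry: (name, registers read, registers written).
# Assumption: "OP Rd, Rs" computes Rd <- Rd op Rs; LOAD only writes Rd.
PROGRAM = [
    ("I1", set(),        {"R1"}),   # LOAD R1, M(A)
    ("I2", {"R2", "R1"}, {"R2"}),   # ADD  R2, R1
    ("I3", {"R3", "R4"}, {"R3"}),   # ADD  R3, R4
    ("I4", {"R4", "R5"}, {"R4"}),   # MUL  R4, R5
    ("I5", set(),        {"R6"}),   # LOAD R6, M(B)
    ("I6", {"R6", "R7"}, {"R6"}),   # MUL  R6, R7
]


def find_dependences(program):
    """Return a sorted list of (earlier, later, kind, register) tuples.

    Naive pairwise check: it does not account for an intervening write
    that would kill a dependence, which does not occur in this program.
    """
    found = []
    # combinations() keeps program order, so `a` always precedes `b`.
    for (a, a_reads, a_writes), (b, b_reads, b_writes) in combinations(program, 2):
        for reg in a_writes & b_reads:
            found.append((a, b, "RAW", reg))   # write-then-read (WR)
        for reg in a_reads & b_writes:
            found.append((a, b, "WAR", reg))   # read-then-write (RW)
        for reg in a_writes & b_writes:
            found.append((a, b, "WAW", reg))   # write-then-write (WW)
    return sorted(found)


if __name__ == "__main__":
    for dep in find_dependences(PROGRAM):
        print(dep)
```

Under these assumptions the script reports exactly the four dependences named in the text: I1 to I2 (RAW on R1), I3 to I4 (WAR on R4), and I5 to I6 (RAW and WAW on R6).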

5-11 Superscalar Pipelined Processors

2. Out-of-order issue

To implement out-of-order issue, the decode and execute stages of the pipeline must be closely coupled. A common method is to use an instruction window, which is essentially a buffer. After the processor decodes an instruction it places it in the instruction window, and as long as the buffer is not full it keeps fetching and decoding subsequent instructions. Instructions are issued from the instruction window to the execute stage. An instruction can be issued, regardless of the order in which it was fetched or decoded, as soon as two conditions hold:

1. The functional unit the instruction needs is available;
2. No dependence blocks the execution of the instruction.

5-12 Superscalar Pipelined Processors

2. Out-of-order issue

A superscalar pipeline model that uses an instruction window is shown in Fig. 9(a). Note that the instruction window is only a buffering mechanism between the decode and execute stages, not an independent pipeline stage. Under out-of-order issue, the progress of the six instructions of the earlier program through the pipeline and the corresponding space-time diagram are shown in Fig. 9(b) and (c).

5-13 Superscalar Problems

We must now expand the potential problems that arise with a superscalar pipeline over an ordinary pipeline:
  RAW hazards could exist between the two instructions issued at the same time
  There are new potential WAW and WAR hazards
  We need to have twice as many register reads and writes as before; our register file must be expanded to accommodate this
  Loads and stores are integer operations even if they are dealing with floating-point registers: we might be reading floating-point registers for an FP operation and also reading/writing floating-point registers for an FP load or store
  Maintaining precise exceptions is difficult because an integer operation may have already completed
  Hardware must detect these problems (and quickly)

5-14 Cost of a Superscalar

We already had the multiple functional units, so there is no added cost in terms of having an int and an FP instruction issue and execute in parallel
There are added costs though for
  Hazard detection: the complexity here is increased because now instructions must be compared not only to instructions further down the pipeline, but to the instruction at the same stage, plus there is a potential for twice as many instructions being active at one time!
  Maintaining precise exceptions
  Two sets of buses
    integer operations from integer registers to integer ALU & data cache
    FP operations from FP registers to FP functional unit & data cache
  Ability to access the floating-point register file by up to 3 instructions during the same cycle (an FP load or store in the ID or WB stage, an FP instruction in ID and an FP instruction in WB)

5-15 Hardware-Based Speculation

In issuing multiple instructions per cycle, branch prediction may not be accurate enough to maintain a reasonable issue rate
  A high-issue processor may need to execute a branch every clock cycle!
To exploit further performance, we now look at hardware to promote speculative instruction issue
  Hardware will predict the next instruction and issue it before determining the branch result
  If predicting wrong, the instruction must be killed off before it can effect a change to the machine's state: it cannot update registers or memory
We add a new buffer called the reorder buffer
  This buffer stores the results of completed instructions that were speculated, until the speculation is proven true or false
  If true, we can allow the instruction's results to be written to registers/memory
  If false, we must remove it and all instructions that followed it, since they were speculated incorrectly
We add a new state to instruction execution called commit to our Tomasulo-based superscalar architecture
  Should the result be stored in the destination register? This becomes the final step for all instructions

5-16 The New Architecture

Will combine:
  Tomasulo-based approach of reservation stations for dynamic scheduling
  multi-issue superscalar
  separately controlled integrated fetch unit which will speculate on control dependences
  reorder buffer to temporarily store results before they are moved to registers
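The reorder-buffer behaviour described in slide 5-15 (hold results until the speculation is resolved, commit in order, discard everything after a mispredicted branch) can be illustrated with a toy model. This is a hedged sketch rather than a description of any real processor: the class ReorderBuffer, its method names, and the use of string tags to identify instructions are all assumptions made for the example.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results wait here until they may commit in order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()   # oldest instruction sits at the left
        self.registers = {}      # committed (architectural) register state

    def issue(self, tag, dest):
        """Allocate an entry; return False when the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return True

    def write_result(self, tag, value):
        """Record a finished result without touching the registers."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire the oldest entry if it has finished; return its tag or None."""
        if self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.registers[entry["dest"]] = entry["value"]
            return entry["tag"]
        return None

    def flush_after(self, tag):
        """Misprediction: drop every entry younger than `tag` (assumed present).

        Returns the dropped tags in program order.
        """
        dropped = []
        while self.entries and self.entries[-1]["tag"] != tag:
            dropped.append(self.entries.pop()["tag"])
        return dropped[::-1]


if __name__ == "__main__":
    rob = ReorderBuffer(capacity=4)
    rob.issue("LD", "R2")
    rob.issue("BNE", None)
    rob.issue("ADD", "R5")        # issued speculatively after the branch
    rob.write_result("ADD", 7)    # finishes early, but cannot commit yet
    print(rob.commit())           # oldest entry (LD) is not finished
    rob.write_result("LD", 42)
    print(rob.commit(), rob.registers)
    print(rob.flush_after("BNE")) # branch was mispredicted
```

The design point the sketch tries to show is that write_result never updates the register state; only commit does, and only for the oldest entry, which is what lets a wrongly speculated instruction be removed without side effects.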

5-17 Steps for Hardware

We must enhance our control hardware from Tomasulo's approach to include
  Instruction cannot issue if the reorder buffer is full
  Upon issue, update register status to include the reorder buffer entry number, and enter the reorder buffer entry number into the destination field of the reservation station; use this value to rename registers if needed
  Execution remains the same, although loads and stores are now being handled by a separate memory control unit
  Write result remains the same except that values are not written to registers here, but they are forwarded via the CDB
  In each cycle, commit the instruction at the front of the reorder buffer if it has reached the write result stage and the speculation for the instruction was correct
  Otherwise, if the speculation for the instruction was wrong, flush the instruction and all others in the reorder buffer until you reach the first instruction fetched after the branch condition was determined

5-18 Example

Here we take a brief look at another example of speculation
The code is given below
Assume there are separate integer units for effective address calculation, ALU operations, and branch condition evaluation
Notice that there are no FP operations here, so all instructions should execute in 1 cycle
We will look at the cycles at which each instruction issues, executes, and writes to the CDB without speculation, and issues, executes, writes and commits with speculation

Loop: LD     R2,0(R1)
      DADDIU R2,R2,#1
      SD     R2,0(R1)
      DADDIU R1,R1,#4
      BNE    R2,R3,Loop

5-19 Without Speculation

Iter  Instruction       Issue  Execute  Mem Acc  CDB  Comments
1     LD R2,0(R1)       1      2        3        4    First issue
1     DADDIU R2,R2,#1   1      5        -        6    Wait for LD
1     SD R2,0(R1)       2      3        7        -    Wait for add
1     DADDIU R1,R1,#4   2      3        -        4    Execute directly
1     BNE R2,R3,Loop    3      7        -        -    Wait for add
2     LD R2,0(R1)       4      8        9        10   Wait for BNE
2     DADDIU R2,R2,#1   4      11       -        12   Wait for LD
2     SD R2,0(R1)       5      9        13       -    Wait for add
2     DADDIU R1,R1,#4   5      8        -        9    Wait for 1st BNE
2     BNE R2,R3,Loop    6      13       -        -    Wait for add
3     LD R2,0(R1)       7      14       15       16   Wait for 2nd BNE
3     DADDIU R2,R2,#1   7      17       -        18   Wait for LD
3     SD R2,0(R1)       8      15       19       -    Wait for add
3     DADDIU R1,R1,#4   8      14       -        15   Wait for 2nd BNE
3     BNE R2,R3,Loop    9      19       -        -    Wait for add

5-20 With Speculation

Iter  Instruction       Issue  Exec  Mem Acc  CDB  Commit  Comments
1     LD R2,0(R1)       1      2     3        4    5       First issue
1     DADDIU R2,R2,#1   1      5     -        6    7       Wait for LD
1     SD R2,0(R1)       2      3     -        -    7       Wait for add
1     DADDIU R1,R1,#4   2      3     -        4    8       Commit in order
1     BNE R2,R3,Loop    3      7     -        -    8       Wait for add
2     LD R2,0(R1)       4      5     6        7    9       No delay
2     DADDIU R2,R2,#1   4      8     -        9    10      Wait for LD
2     SD R2,0(R1)       5      6     -        -    10      Wait for add
2     DADDIU R1,R1,#4   5      6     -        7    11      Commit in order
2     BNE R2,R3,Loop    6      10    -        -    11      Wait for add
3     LD R2,0(R1)       7      8     9        10   12      No delay
3     DADDIU R2,R2,#1   7      11    -        12   13      Wait for LD
3     SD R2,0(R1)       8      9     -        -    13      Wait for add
3     DADDIU R1,R1,#4   8      9     -        10   14      Commit in order
3     BNE R2,R3,Loop    9      13    -        -    14      Wait for add

5-21 Design Issues

Reorder buffer vs. more registers
  We could forego the reorder buffer by providing additional temporary storage; in essence, the two are the same solution, just a slightly different implementation
  Both require a good deal more memory than we needed with an ordinary pipeline, but both improve performance greatly
How much should we speculate?
  Other factors cause our multiple-issue superscalar to slow, cache issues or exceptions for instance, so a large amount of speculation is defeated by other hardware failings; we might try to speculate over a couple of branches, but not more
Speculating over multiple branches
  Imagine our loop has a selection statement; now we speculate over two branches
  Speculation over more than one branch greatly complicates matters and may not be worthwhile

5-22 Limitations/Difficulties

Inherent limitations to multiple issue are the limited amount of ILP of a program:
  How many instructions are independent of each other?
  How much distance is available between loading an operand and using it? Between using and saving it?
  Coupled with the multi-cycle latency for certain types of operations, these cause inconsistencies in the amount of issuing that can be simultaneous
Difficulties in building the underlying hardware
  Need multiple function units (cost grows linearly with the number of units)
  Need an increase (possibly very large) in memory and register-file bandwidth, which might take up significant space on the chip and may require larger system bus sizes, which turns into more pins
  Complexity of multiple fetches means a more complex memory system, possibly with independent banks for parallel accesses

5-23 Limitations on Issue Size

Ideally, we would like to issue as many independent instructions simultaneously as possible, but this is not practical because we would have to:
  Look arbitrarily far ahead to find an instruction to issue
  Rename all registers when needed to avoid WAR/WAW
  Determine all register and memory dependences
  Predict all branches
  Provide enough functional units to ensure all ready instructions can be issued
What is a possible maximum window size?
  To determine register dependences over n instructions requires n^2 - n comparisons
    2000 instructions: about 4,000,000 comparisons
    50 instructions: 2450 comparisons
  Window sizes have ranged between 4 and 32, with some recent machines having sizes of 2-8
  A machine with a window size of 32 achieves about 1/5 of the ideal speedup for most benchmarks

5-24 Other Effects

With infinite registers, register renaming can eliminate all WAW and WAR hazards
  With Tomasulo's approach, the reservation stations offer virtual registers
  Most machines today have only a few virtual registers and perhaps 32 Int and 32 FP registers available
  Figure 3.41 shows the resulting issues per cycle for different numbers of registers
  Surprisingly, the number of registers does have a dramatic impact, and 32 registers are desirable
Aside from register renaming, we have name dependencies on memory references
  Three models of analysis are:
    Global (perfect analysis of all global vars)
    Stack perfect (perfect analysis of all stack references)
      these offer some improvement, particularly in 2 benchmarks
    Inspection (examine accesses for interference at compile time)
    None (assume all references conflict)
      these have similar results, between 3-6 instructions/cycle

5-25 Example Processors

Let's compare three hypothetical processors and determine their MIPS rating for the gcc benchmark
  Processor 1: simple MIPS 2-issue superscalar pipeline with a clock rate of 1 GHz, CPI of 1.0, cache system with .01 misses per instruction
  Processor 2: deeply pipelined MIPS with a clock rate of 1.2 GHz, CPI of 1.2, smaller cache yielding .015 misses per instruction
  Processor 3: speculative superscalar with a 64-entry window that achieves 50% of its ideal issue rate with a clock rate of 800 MHz, a small cache yielding .02 misses per instruction (although 10% of the miss penalty is not visible due to dynamic scheduling)
Assume memory access time (miss penalty) is 100 ns

5-26 Solution

First, determine the CPI (including the impact of cache misses)
Processor 1:
  1 GHz clock = 1 ns per clock cycle
  memory access of 100 ns, so miss penalty = 100 / 1 = 100 cycles
  cache penalty = .01 * 100 = 1.0 cycles per instruction
  overall CPI = 1.0 + 1.0 = 2.0
Processor 2:
  1.2 GHz clock = .83 ns per clock cycle
  miss penalty = 100 / .83 = 120 cycles
  cache penalty = .015 * 120 = 1.8 cycles per instruction
  overall CPI = 1.2 + 1.8 = 3.0
Processor 3:
  800 MHz clock = 1.25 ns per clock cycle
  miss penalty takes effect only 90% of the time, so miss penalty = .90 * 100 / 1.25 = 72 cycles
  cache penalty = .02 * 72 = 1.44 cycles per instruction
  overall CPI to be computed next

5-27 Solution Continued

The CPI of processor 3 requires a bit more effort
  Since we were not given the CPI, we have to compute it by considering the number of instruction issues per cycle
  With a 64-entry window, the maximum number of instruction issues per cycle is 9; we are told that this processor averages 50% of its ideal rate, so this machine issues 4.5 instructions per cycle, giving it a processor CPI = 1 / 4.5 = .22
  overall CPI = .22 + 1.44 = 1.66
Now we can determine the MIPS rating for each
  Processor 1: 1 GHz / 2.0 = 500 MIPS
  Processor 2: 1.2 GHz / 3.0 = 400 MIPS
  Processor 3: 800 MHz / 1.66 = 482 MIPS
The 2-issue processor (proc 1) is a good compromise between speed of clock and issue rate, and yields the best performance

5-28 Superscalar Pipelined Processors

Typical processor structures

Motorola's MC88110 microprocessor and Intel's Pentium microprocessor are both typical superscalar pipeline designs. The former is a RISC machine; the latter has characteristics of both CISC and RISC. Only the superscalar pipeline of the Pentium is described below.

5-29 Superscalar Pipelined Processors

The Pentium can execute two instructions per clock cycle. Some of its instructions are implemented entirely in hard-wired logic and complete in one clock cycle (a RISC trait); others are implemented with microinstructions and may need 2-3 clock cycles to execute (a CISC trait). Compared with the superscalar pipeline of a RISC processor, the Pentium's superscalar pipeline is therefore both simpler and more complex. It is simple in that the superscalar technique it uses is simple and direct. It is complex in that passing variable-length instructions with different addressing modes and different implementations through an instruction pipeline with a degree of parallelism of 2 takes considerable effort.
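Returning briefly to the three-processor comparison (slides 5-25 to 5-27), the CPI and MIPS arithmetic can be re-checked with a few lines of Python. This is a sketch under the slide's stated assumptions (100 ns miss penalty, 10% of the penalty hidden on Processor 3, 4.5 issues per cycle on Processor 3). The function names overall_cpi and mips are illustrative, not from the source. For Processor 1 the miss penalty works out to 100 ns / 1 ns = 100 cycles, which is the value consistent with the slide's cache penalty of 1.0 cycles per instruction.

```python
def overall_cpi(base_cpi, clock_ghz, misses_per_instr,
                miss_penalty_ns=100.0, visible_fraction=1.0):
    """Base CPI plus the average cache-miss stall cycles per instruction."""
    cycle_ns = 1.0 / clock_ghz                      # clock period in ns
    miss_cycles = visible_fraction * miss_penalty_ns / cycle_ns
    return base_cpi + misses_per_instr * miss_cycles


def mips(clock_ghz, cpi):
    """Millions of instructions per second = clock rate (MHz) / CPI."""
    return clock_ghz * 1000.0 / cpi


# Processor 1: 1 GHz, CPI 1.0, .01 misses per instruction
cpi1 = overall_cpi(1.0, 1.0, 0.01)
# Processor 2: 1.2 GHz, CPI 1.2, .015 misses per instruction
cpi2 = overall_cpi(1.2, 1.2, 0.015)
# Processor 3: 800 MHz, 4.5 issues/cycle, .02 misses, 90% of penalty visible
cpi3 = overall_cpi(1.0 / 4.5, 0.8, 0.02, visible_fraction=0.9)

if __name__ == "__main__":
    for name, ghz, cpi in [("P1", 1.0, cpi1), ("P2", 1.2, cpi2), ("P3", 0.8, cpi3)]:
        print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips(ghz, cpi):.0f}")
```

Because the script keeps 1/4.5 unrounded instead of using .22, its Processor 3 figure comes out about one MIPS lower than the slide's 482; the ranking of the three processors is unchanged.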

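5-30 Superscalar Pipelined Processors

1. Structure of the Pentium instruction pipeline

The Pentium processor contains a floating-point unit (FPU). Floating-point operations are pipelined, and a floating-point instruction completes in 8 stages. The following mainly describes the integer instruction pipeline, whose structure is shown in Fig. 11.

5-31 Superscalar Pipelined Processors

As Fig. 11 shows, the Pentium has two 32-bit ALUs that perform all integer arithmetic and logical operations, so it can support parallel execution in the two instruction pipelines U and V. The separate on-chip instruction cache (I-cache) and data cache (D-cache), 8 KB each, give strong support to the pipelines. Two prefetch buffers, 32 bytes each, fetch instructions from the I-cache or main memory and buffer them. Besides decoding instructions, the instruction decoder also performs the instruction pairing check. When a branch instruction is encountered, its address is sent after decoding to the branch target buffer (BTB) for lookup. The control ROM stores the microinstructions that control the sequence of operations during instruction execution. These three units are shared by the U and V pipelines.

5-32 Superscalar Pipelined Processors

Two address generators produce (or compute) the addresses of memory operands. The logical addresses of every operating mode are ultimately translated into physical addresses to access the D-cache, and the translation lookaside buffer (TLB) speeds up this address translation. The D-cache is dual-ported and can access two 32-bit data items (or one 64-bit floating-point number) in one clock cycle. The general-purpose register file has eight 32-bit integer registers, used for address calculation and for holding the source and destination operands of the ALUs. Both 32-bit ALUs have a latency of one clock cycle. Only simple instructions, and arithmetic/logic instructions without register-memory or memory-register operations, complete in one clock cycle. Most simple instructions are hard-wired, and their execute stage needs only 1 clock cycle. A few arithmetic/logic instructions that involve register-memory or memory-register operations need 2-3 clock cycles to complete. Because the Pentium has sequencing hardware, however, these few exceptions can also be treated as simple instructions.

5-33 Superscalar Pipelined Processors

2. Pipeline scheduling policy

Through its U and V pipelines the Pentium can execute two integer instructions per clock cycle. Both pipelines consist of 5 stages, and the first two stages (PF, D1) are shared by U and V, as shown in Fig. 12(a). The stages are as follows.

Prefetch (PF) stage: fetches instructions from the I-cache. Instruction length is variable; the instructions are stored in a prefetch buffer.

Decode 1 (D1) stage: decodes the instruction to determine its opcode, addressing mode and related information. This stage performs the instruction pairing check and branch prediction. Two consecutive instructions I1 and I2 are decoded one after the other, and a decision is then made whether to issue the pair in parallel. To issue a pair, the following 4 conditions must hold:

1. Both instructions are simple instructions;
2. There is no WR (RAW) or WW (WAW) dependence between the two instructions, i.e. the destination register of I1 is neither a source register nor the destination register of I2. RW (WAR) dependences are avoided by the issue policy;
3. Neither instruction contains both an immediate and a displacement;
4. Only instruction I1 is allowed to carry an instruction prefix.

If these conditions are not met, only I1 is issued to the next stage of the U pipeline.

5-34 Superscalar Pipelined Processors

Decode 2 (D2) stage: computes the addresses of memory operands. If the TLB hits, this takes 1 clock cycle; otherwise it takes more than 1. Not every instruction has a memory operand, but every instruction must still pass through this stage.

Execute (EX) stage: mainly performs the specified operation in an ALU, the barrel shifter or another functional unit, and accesses the D-cache when needed.

Write-back (WB) stage: writes the result of the operation into the destination register and the flags register.

The U and V pipelines are not equivalent and cannot be used interchangeably. The U pipeline can execute all integer and floating-point instructions, whereas the V pipeline can execute only simple integer instructions and a few floating-point instructions such as floating-point exchange.

The U and V pipelines are scheduled with in-order issue and in-order completion. A pair of instructions that passes the check is issued simultaneously to the D2 stages of the U and V pipelines, and the pair must also leave D2 and enter EX together. If one instruction stalls in D2, the other must stall in D2 as well, as with I1 and I2 in Fig. 12(b) (clock 4). Once a pair has entered EX, it is best if both finish at the same time; otherwise the instruction in the U pipeline finishes first. For I3 and I4 in Fig. 12(b), I3 takes longer to execute, so I4 in the V pipeline must stall and wait for I3 to finish (clock 7). For I5 and I6 in Fig. 12(b), I5 in the U pipeline takes less time to execute, so it can finish first and enter the write-back stage (clock 9).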
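The four pairing conditions checked in the D1 stage (slide 5-33) can be written down as a small predicate. This is an illustrative sketch only: the Instr record and its field names are hypothetical and do not model the real Pentium decoder, and "simple" is taken as a given flag rather than derived from an opcode.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instr:
    """Hypothetical decoded-instruction record used only for this example."""
    name: str
    simple: bool = True
    sources: set = field(default_factory=set)
    dest: Optional[str] = None
    has_immediate: bool = False
    has_displacement: bool = False
    has_prefix: bool = False


def can_pair(i1, i2):
    """Return True if I1 (U pipeline) and I2 (V pipeline) may issue together."""
    # Condition 1: both instructions are simple instructions.
    if not (i1.simple and i2.simple):
        return False
    # Condition 2: I1's destination is neither a source of I2 (RAW)
    # nor the destination of I2 (WAW).
    if i1.dest is not None and (i1.dest in i2.sources or i1.dest == i2.dest):
        return False
    # Condition 3: neither instruction has both an immediate and a displacement.
    for ins in (i1, i2):
        if ins.has_immediate and ins.has_displacement:
            return False
    # Condition 4: only I1 may carry an instruction prefix.
    if i2.has_prefix:
        return False
    return True


if __name__ == "__main__":
    mov = Instr("MOV EAX, EBX", sources={"EBX"}, dest="EAX")
    add_dep = Instr("ADD ECX, EAX", sources={"ECX", "EAX"}, dest="ECX")
    add_ind = Instr("ADD ECX, EDX", sources={"ECX", "EDX"}, dest="ECX")
    print(can_pair(mov, add_dep))  # I2 reads EAX, which I1 writes
    print(can_pair(mov, add_ind))  # no dependence between the two
```

When can_pair returns False, the text's rule applies: only I1 proceeds to the next stage of the U pipeline, and I2 is considered again as the first instruction of the next candidate pair.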

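5-35 Superscalar Pipelined Processors

The Pentium's superscalar pipeline can execute two simple integer instructions per clock cycle, but in general only one floating-point instruction. This is because the floating-point instruction pipeline has 8 stages, of which the first 5 are shared with the 5 stages of the U and V pipelines, and because some floating-point operands are 64 bits wide. With a few exceptions (such as the floating-point exchange instruction), floating-point instructions therefore cannot execute at the same time as integer instructions.

5-36 Pentium II: RISC features

All RISC features are implemented on the execution of microinstructions instead of machine instructions
Microinstruction-level pipeline with dynamically scheduled microoperations
  Fetch machine instruction (3 stages)
  Decode machine instruction into microinstructions (2 stages)
  Issue microinstructions (2 stages, register renaming, reorder buffer allocation performed here)
  Execute microinstructions (1 stage, floating-point units pipelined, execution takes between 1 and 3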
