Chapter 2: Parallel Hardware and Parallel Software

An Introduction to Parallel Programming, Peter Pacheco
Copyright 2010, Elsevier Inc. All rights reserved.

Roadmap
- Some background
- Modifications to the von Neumann model
- Parallel hardware
- Parallel software
- Input and output
- Performance
- Parallel program design
- Writing and running parallel programs
- Assumptions

SOME BACKGROUND

Serial hardware and software
- Programs take input and produce output; the computer runs one program at a time.

The von Neumann Architecture (Figure 2.1)

Main memory
- A collection of locations, each of which is capable of storing both instructions and data.
- Every location consists of an address, which is used to access the location, and the contents of the location.

Central processing unit (CPU)
- Divided into two parts:
- Control unit - responsible for deciding which instruction in a program should be executed (the boss).
- Arithmetic and logic unit (ALU) - responsible for executing the actual instructions (the worker), e.g. add 2 + 2.

Key terms
- Register - very fast storage, part of the CPU.
- Program counter - stores the address of the next instruction to be executed.
- Bus - wires and hardware that connect the CPU and memory.

- The CPU fetches/reads instructions and data from memory, and writes/stores results back to memory.
- Because all instructions and data must cross the CPU-memory interconnect, this separation is known as the von Neumann bottleneck.

An operating system "process"
- An instance of a computer program that is being executed.
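The fetch/execute cycle described above can be sketched as a toy interpreter. This is a hypothetical illustration, not code from the book: the instruction encoding (three memory cells per instruction: opcode, operand address, destination address) is invented for the example, but it shows the essential von Neumann ideas - one memory holding both instructions and data, a program counter selecting the next instruction, and the ALU doing the arithmetic.

```c
/* A toy von Neumann machine.  One array is both program and data;
 * the control unit's job is the pc bookkeeping, the ALU's job is
 * the arithmetic in the switch.  Hypothetical 3-cell encoding:
 * {opcode, operand_address, unused}. */
enum { OP_HALT, OP_LOAD, OP_ADD, OP_STORE };

/* Run until HALT; returns the number of instructions executed. */
int run(int mem[], int *acc)
{
    int pc = 0;       /* program counter: address of next instruction */
    int count = 0;
    for (;;) {
        int op   = mem[pc];       /* fetch the instruction ...        */
        int addr = mem[pc + 1];   /* ... and its operand address      */
        pc += 3;                  /* advance to the next instruction  */
        count++;
        switch (op) {             /* decode and execute               */
        case OP_LOAD:  *acc = mem[addr];  break;
        case OP_ADD:   *acc += mem[addr]; break;  /* the ALU at work  */
        case OP_STORE: mem[addr] = *acc;  break;
        case OP_HALT:  return count;
        }
    }
}
```

For example, the slides' "add 2 + 2" becomes a four-instruction program - LOAD from address 20, ADD from address 21, STORE to address 22, HALT - with the operands 2 and 2 stored in the same memory as the instructions.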
Components of a process:
- The executable machine language program.
- A block of memory.
- Descriptors of resources the OS has allocated to the process.
- Security information.
- Information about the state of the process.

Multitasking
- Gives the illusion that a single-processor system is running multiple programs simultaneously.
- Each process takes turns running (a time slice).
- After its time is up, it waits until it has a turn again (it blocks).

Threading
- Threads are contained within processes.
- They allow programmers to divide their programs into (more or less) independent tasks.
- The hope is that when one thread blocks because it is waiting on a resource, another will have work to do and can run.

A process and two threads (Figure 2.2)
- The "master" thread.
- Starting a thread is called forking; terminating a thread is called joining.

MODIFICATIONS TO THE VON NEUMANN MODEL

Basics of caching
- A cache is a collection of memory locations that can be accessed in less time than some other memory locations.
- A CPU cache is typically located on the same chip as the CPU, or on one that can be accessed much faster than ordinary memory.

Principle of locality
- Accessing one location tends to be followed by an access of a nearby location.
- Spatial locality - accessing a nearby location.
- Temporal locality - accessing the same location again in the near future.

Example:

    float z[1000];
    sum = 0.0;
    for (i = 0; i < 1000; i++)
        sum += z[i];

Levels of cache
- L1 is the smallest and fastest; L2 and L3 are progressively larger and slower.

Cache hit
- The requested value (e.g. fetch x) is found in one of the cache levels.

Cache miss
- The requested value is in none of the cache levels and must be fetched from main memory.

Issues with cache
- When a CPU writes data to cache, the value in cache may be inconsistent with the value in main memory.
- Write-through caches handle this by updating the data in main memory at the time it is written to cache.
- Write-back caches mark data in the cache as dirty. When the cache line is replaced by a new cache line from memory, the dirty line is written to memory.
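The write-back policy can be sketched with a toy one-line cache. This is a hypothetical model for illustration (the `Line` struct and function names are mine, not the book's): a store only updates the cache and sets a dirty bit, and main memory is brought up to date when the line is evicted.

```c
#include <stdbool.h>

/* One line of a toy write-back cache; main_mem[] stands in for
 * main memory. */
typedef struct {
    int  addr;   /* which memory address the line currently holds */
    int  value;  /* the cached copy                               */
    bool dirty;  /* CPU wrote the line, memory not yet updated    */
    bool valid;
} Line;

/* CPU store under a write-back policy: update only the cache and
 * mark the line dirty.  Main memory is now stale - exactly the
 * inconsistency the slide describes. */
void write_back_store(Line *l, int addr, int value)
{
    l->addr  = addr;
    l->value = value;
    l->valid = true;
    l->dirty = true;
}

/* Evict the line to make room for a new one.  Only now is a dirty
 * line flushed back to main memory. */
void evict(Line *l, int main_mem[], int new_addr, int new_value)
{
    if (l->valid && l->dirty)
        main_mem[l->addr] = l->value;  /* write the dirty line back */
    l->addr  = new_addr;
    l->value = new_value;
    l->dirty = false;
    l->valid = true;
}
```

A write-through cache would instead assign `main_mem[addr] = value` inside the store itself, trading extra memory traffic for consistency at all times.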
Cache mappings
- Fully associative - a new line can be placed at any location in the cache.
- Direct mapped - each cache line has a unique location in the cache to which it will be assigned.
- n-way set associative - each cache line can be placed in one of n different locations in the cache.

n-way set associative
- When more than one line in memory can be mapped to several different locations in cache, we also need to be able to decide which line should be replaced or evicted.

Example (Table 2.1): assignments of a 16-line main memory to a 4-line cache.

Caches and programs

Virtual memory (1)
- If we run a very large program, or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
- Virtual memory functions as a cache for secondary storage.

Virtual memory (2)
- It exploits the principle of spatial and temporal locality.
- It keeps only the active parts of running programs in main memory.

Virtual memory (3)
- Swap space - the parts that are idle are kept in a block of secondary storage.
- Pages - blocks of data and instructions.
- Usually these are relatively large: most systems have a fixed page size that currently ranges from 4 to 16 kilobytes.
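With a fixed page size that is a power of two, splitting an address into a page number and a byte offset is just a shift and a mask. A small sketch, assuming 4-kilobyte (2^12-byte) pages and a hypothetical flat page table (the function names and the table layout are mine, for illustration):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                   /* 4 KB pages: 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Virtual page number: the high-order bits of the virtual address. */
uint32_t vpn(uint32_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Byte offset within the page: the low-order 12 bits. */
uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & (PAGE_SIZE - 1);
}

/* Translate a virtual address using a flat table that maps each
 * virtual page number to a physical page number. */
uint32_t translate(uint32_t vaddr, const uint32_t page_table[])
{
    return (page_table[vpn(vaddr)] << PAGE_SHIFT) | page_offset(vaddr);
}
```

This is the division shown in Table 2.2 below: the page number indexes the page table, and the offset is carried over unchanged into the physical address.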
Virtual memory (4)
- (Figure: pages of programs A, B, and C distributed between main memory and swap space.)

Virtual page numbers
- When a program is compiled, its pages are assigned virtual page numbers.
- When the program is run, a table is created that maps the virtual page numbers to physical addresses.
- A page table is used to translate the virtual address into a physical address.

Page table (Table 2.2): a virtual address divided into a virtual page number and a byte offset.

Translation-lookaside buffer (TLB)
- Using a page table has the potential to significantly increase each program's overall run-time.
- The TLB is a special address translation cache in the processor.

Translation-lookaside buffer (2)
- It caches a small number of entries (typically 16-512) from the page table in very fast memory.
- Page fault - attempting to access a valid physical address for a page in the page table, but the page is only stored on disk.

Instruction-level parallelism (ILP)
- Attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.

Instruction-level parallelism (2)
- Pipelining - functional units are arranged in stages.
- Multiple issue - multiple instructions can be simultaneously initiated.

Pipelining example (1)
- Add the floating point numbers 9.87 x 10^4 and 6.54 x 10^3.

Pipelining example (2)
- Assume each operation takes one nanosecond (10^-9 seconds).
- A loop of 1000 such additions then takes about 7000 nanoseconds.

Pipelining (3)
- Divide the floating point adder into 7 separate pieces of hardware, or functional units.
- The first unit fetches two operands, the second unit compares exponents, etc.
- The output of one functional unit is the input to the next.

Pipelining (4)
- Table 2.3: pipelined addition. Numbers in the table are subscripts of operands/results.

Pipelining (5)
- One floating point addition still takes 7 nanoseconds.
- But 1000 floating point additions now take 1006 nanoseconds!

Multiple issue (1)
- Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program.
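The pipelining numbers above are easy to verify: an unpipelined unit needs stages x n stage-times for n operations, while a pipelined one produces its first result after the pipeline fills and then one result per stage-time. A quick sketch of the arithmetic (the stage count and timing come from the slides; the function names are mine):

```c
/* Time, in stage-times, for n operations through a unit with
 * 'stages' stages, with no pipelining: every operation must pass
 * through all stages before the next one starts. */
long unpipelined_time(long n, long stages)
{
    return n * stages;
}

/* Pipelined: the first result appears after 'stages' steps, and one
 * further result is completed every step after that. */
long pipelined_time(long n, long stages)
{
    return stages + (n - 1);
}
```

With 7 stages at one nanosecond each, a single addition still takes 7 ns either way, but 1000 additions finish in 7 + 999 = 1006 ns instead of 7000 ns.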
- Ex.: in the loop

    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];

  adder #1 can compute z[1], z[3], ... while adder #2 computes z[2], z[4], ...

Speculation
- To make use of multiple issue, the system must find instructions that can be executed simultaneously. One technique is speculation: the compiler or processor makes a guess about an instruction and then executes on the basis of the guess. Ex.:

    z = x + y;
    if (z > 0)
        w = x;
    else
        w = y;

- Here the system might speculate that z will be positive and execute w = x.
- If the system speculates incorrectly, it must go back and recalculate w = y.

Hardware multithreading (1)
- There aren't always good opportunities for simultaneous execution of different threads.
- Hardware multithreading provides a means for systems to continue doing useful work when the task currently being executed has stalled.
- Ex.: the current task has to wait for data to be loaded from memory.

Hardware multithreading (2)
- Fine-grained - the processor switches between threads after each instruction, skipping threads that are stalled.
- Pros: potential to avoid wasted machine time due to stalls.
- Cons: a thread that's ready to execute a long sequence of instructions may have to wait to execute every instruction.

Hardware multithreading (3)
- Coarse-grained - only switches threads that are stalled waiting for a time-consuming operation to complete.
- Pros: switching threads doesn't need to be nearly instantaneous.
- Cons: the processor can be idled on shorter stalls, and thread switching will also cause delays.

Hardware multithreading (4)
- Simultaneous multithreading (SMT) - a variation on fine-grained multithreading.
- Allows multiple threads to make use of the multiple functional units.

PARALLEL HARDWARE
- Hardware that a programmer can write code to exploit.

Flynn's taxonomy
- SISD: single instruction stream, single data stream (classic von Neumann).
- SIMD: single instruction stream, multiple data stream.
- MISD: multiple instruction stream, single data stream (not covered).
- MIMD: multiple instruction stream, multiple data stream.

SIMD
- Parallelism achieved by dividing data among the processors.
- Applies the same instruction to multiple data items.
- Called data parallelism.

SIMD example
- One control unit broadcasts the instruction to ALU1, ..., ALUn; with n data items x[1], ..., x[n] and n ALUs:

    for (i = 0; i < n; i++)
        x[i] += y[i];

SIMD
- What if we don't have as many ALUs as data items?
- Divide the work and process iteratively.
- Ex.: m = 4 ALUs and n = 15 data items.

    Round   ALU1   ALU2   ALU3   ALU4
      1     x[0]   x[1]   x[2]   x[3]
      2     x[4]   x[5]   x[6]   x[7]
      3     x[8]   x[9]   x[10]  x[11]
      4     x[12]  x[13]  x[14]

SIMD drawbacks
- All ALUs are required to execute the same instruction, or remain idle.
- In the classic design, they must also operate synchronously.
- The ALUs have no instruction storage.
- Efficient for large data-parallel problems, but not for other, more complex parallel problems.

Vector processors (1)
- Operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars.
- Vector registers - capable of storing a vector of operands and operating simultaneously on their contents.

Vector processors (2)
- Vectorized and pipelined functional units - the same operation is applied to each element in the vector (or pairs of elements).
- Vector instructions - operate on vectors rather than scalars.

Vector processors (3)
- Interleaved memory - multiple "banks" of memory, which can be accessed more or less independently. Distributing the elements of a vector across multiple banks reduces or eliminates the delay in loading/storing successive elements.
- Strided memory access and hardware scatter/gather - the program accesses elements of a vector located at fixed intervals.

Vector processors - pros
- Fast.
- Easy to use.
- Vectorizing compilers are good at identifying code to exploit, and can also provide information about code that cannot be vectorized - this helps the programmer re-evaluate code.
- High memory bandwidth.
- Uses every item in a cache line.

Vector processors - cons
- They don't handle irregular data structures as well as other parallel architectures.
- There is a very finite limit to their ability to handle ever larger problems (scalability).

Graphics processing units (GPUs)
- Real-time graphics application programming interfaces, or APIs, use points, lines, and triangles to internally represent the surface of an object.
GPUs
- A graphics processing pipeline converts the internal representation into an array of pixels that can be sent to a computer screen.
- Several stages of this pipeline (called shader functions) are programmable - typically just a few lines of C code.

GPUs
- Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream.
- GPUs can often optimize performance by using SIMD parallelism.
- The current generation of GPUs use SIMD parallelism, although they are not pure SIMD systems.

MIMD
- Supports multiple simultaneous instruction streams operating on multiple data streams.
- Typically consists of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.

Shared memory system (1)
- A collection of autonomous processors is connected to a memory system via an interconnection network.
- Each processor can access each memory location.
- The processors usually communicate implicitly by accessing shared data structures.

Shared memory system (2)
- Most widely available shared memory systems use one or more multicore processors (multiple CPUs or cores on a single chip). (Figure 2.3)

UMA multicore system (Figure 2.5)
- The time to access all the memory locations will be the same for all the cores.

NUMA multicore system (Figure 2.6)
- A memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.

Distributed memory system
- Clusters (most popular): a collection of commodity systems connected by a commodity interconnection network.
- Nodes of a cluster are individual computation units joined by a communication network.
- a.k.a. hybrid systems.
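The contrast matters for programming: on a shared memory system the processors communicate implicitly through shared data structures, while the nodes of a cluster must exchange explicit messages. The shared memory side can be sketched with POSIX threads - a minimal illustration (the array name, problem size, and thread count are mine, not the book's):

```c
#include <pthread.h>

/* Shared data structure: both threads deposit their partial sums
 * here.  No messages are exchanged - this IS the communication. */
static long partial[2];
static const int N = 1000;

/* Each thread sums its half of 0 .. N-1 into its own slot. */
static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / 2), hi = lo + N / 2, s = 0;
    for (long i = lo; i < hi; i++)
        s += i;
    partial[id] = s;     /* implicit communication via shared memory */
    return NULL;
}

long shared_sum(void)
{
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);   /* "joining", as in Figure 2.2 */
    return partial[0] + partial[1];
}
```

On a distributed memory system the same computation would require each node to compute its partial sum locally and then send it over the communication network to a node that combines the results (the message-passing style taken up later in the book).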