MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean & Sanjay Ghemawat, OSDI '04

The Free Lunch Is Almost Over!
- "The density of transistors on a chip doubles every 18 months, for the same cost" (1965).
- The future is multi-core!
(Figures: a supercomputer (Janet E. Ward, 2000) vs. a cluster of desktops.)

The Future is Multi-core!
- Replace specialized, powerful supercomputers with large clusters of commodity hardware.
- But distributed programming is inherently complex.

Google's MapReduce Paradigm
- A platform for reliable, scalable parallel computing.
- Abstracts the issues of the distributed, parallel environment away from the programmer.
- Runs on top of the Google File System.

What is MapReduce?
- A programming model and an associated implementation (library) for processing and generating large data sets on large clusters.
- An abstraction that lets us express the simple computations we were trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

References
- Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
- Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan: Interpreting the Data: Parallel Analysis with Sawzall. Google Labs.

Google File System (GFS)
- A highly scalable distributed file system for large, data-intensive applications.
- Provides redundant storage of massive amounts of data on cheap, unreliable machines.
- Provides the platform on which systems such as MapReduce and BigTable operate.

GFS Architecture (figure)

MapReduce: Insight
- "Consider the problem of counting the number of occurrences of each word in a large collection of documents."
- How would you do it in parallel?

One Possible Solution (figure)

MapReduce Programming Model
- Inspired by the map and reduce operations commonly found in functional programming languages such as Lisp.
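As a concrete illustration of the model, the word-count problem from the slide above can be expressed as two small Python functions plus a driver that plays the role of the library's shuffle step. This is a minimal single-machine sketch; the function names (`map_fn`, `reduce_fn`, `map_reduce`) are illustrative and not part of any MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map: (key1, val1) -> list of (key2, val2)
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: (key2, list of val2) -> val3
    return (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    # "Shuffle": collect all intermediate pairs and group them by key,
    # as the library would do between the map and reduce phases.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))]

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
result = dict(map_reduce(docs, map_fn, reduce_fn))  # {"the": 2, ...}
```

Note that the user writes only the two pure functions; everything else (grouping, ordering, and in the real system distribution and fault tolerance) is the library's job.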
- Users implement an interface with two primary methods:
  1. Map: (key1, val1) -> list(key2, val2)
  2. Reduce: (key2, list(val2)) -> val3
- Many real-world tasks are expressible in this model.
- Assumption: the data items are uncorrelated, or the data set is small.

Big Picture (figure)

Map Operation
- Map, a pure function written by the user, takes an input key/value pair, e.g., (doc-id, doc-content), and produces a set of intermediate key/value pairs.
- Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.

Reduce Operation
- On completion of the map phase, all intermediate values for a given output key are combined into a list and given to a reducer.
- Reduce can be visualized as an aggregate function (e.g., average) computed over all rows with the same group-by attribute.

Example
- The problem: counting the number of occurrences of each word in a large collection of documents.

Pseudo-code:

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

More Examples
Distributed grep:
- Map: (key, whole document or a line) -> (matched line, key)
- Reduce: identity function

Count of URL access frequency:
- Map: logs of web-page requests -> (URL, 1)
- Reduce: -> (URL, total count)

Reverse web-link graph:
- Map: (source, target) -> (target, source)
- Reduce: -> (target, list(source))

MapReduce: Execution Overview (figure)

Master Data Structure
- Task state: idle, in-progress, or completed.
- Identity of the worker machine, for in-progress tasks.
- Locations of the intermediate file regions of map tasks: received from map tasks and pushed to reduce tasks.

Execution Overview
(1) The library splits the input files;
(2) the master and worker processes start up; (3) map workers execute the map tasks; (4) map results are buffered (and partitioned) on local disk; (5) reduce workers copy and sort the intermediate data; (6) reduce workers execute the reduce tasks; (7) the results are returned to the user code.

MapReduce: Execution Overview (figure)
MapReduce: Example (figure)
MapReduce in Parallel: Example (figure)
MapReduce: Runtime Environment (figure)

Fault Management
- Fault tolerance in a word: redo.
- The master pings workers and re-schedules failed tasks.
- Note: completed map tasks are also re-executed on failure, because their output is stored on the failed worker's local disk.
- Master failure: redo.
- Semantics in the presence of failures: when the map and reduce functions are deterministic, the system produces the same output as would have been produced by a non-faulting sequential execution of the entire program. It relies on atomic commits of map and reduce task outputs to achieve this property.

MapReduce: Fault Tolerance
- Handled via re-execution of tasks; task completion is committed through the master.
- What happens if a mapper fails? Re-execute its completed and in-progress map tasks.
- What happens if a reducer fails? Re-execute its in-progress reduce tasks.
- What happens if the master fails? Potential trouble!

MapReduce Refinements: Locality Optimization
- Leverage GFS to schedule a map task on a machine that holds a replica of the corresponding input data.
- Thousands of machines can then read input at local-disk speed.
- Without this, rack switches would limit the read rate.
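The seven-step execution overview described earlier (split, start master/workers, map, buffer/partition, copy/sort, reduce, return) can be sketched as a single-machine simulation. The hash partitioning into R regions mirrors how map output is divided among the reduce tasks; all names here are illustrative:

```python
from collections import defaultdict

def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

def run_job(splits, map_fn, reduce_fn, R=2):
    # (2)-(4): each "map worker" runs the map function on its split and
    # partitions the buffered output into R regions by hash(key) % R.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:                       # (1) pre-split input
        for key, value in split:
            for k2, v2 in map_fn(key, value):
                regions[hash(k2) % R][k2].append(v2)
    # (5)-(6): each "reduce worker" sorts its region by key, then reduces.
    output = []
    for region in regions:
        for k2 in sorted(region):
            output.append(reduce_fn(k2, region[k2]))
    return output                              # (7) back to user code

splits = [[("d1", "to be or not to be")], [("d2", "to do is to be")]]
counts = dict(run_job(splits, wc_map, wc_reduce))
```

The partition function (here a plain hash) is what guarantees that all values for one key land at the same reduce task, no matter which map worker produced them.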
MapReduce Refinements: Redundant Execution
- Slow workers ("stragglers") are a source of bottlenecks and may delay job completion.
- Near the end of a phase, spawn backup copies of the remaining tasks; whichever copy finishes first wins.
- This uses computing power effectively and significantly reduces job completion time (the paper reports its sort benchmark runs 44% longer with backup tasks disabled).

MapReduce Refinements: Skipping Bad Records
- Map/reduce functions sometimes fail deterministically on particular inputs, and fixing the bug may not be possible (e.g., it lies in a third-party library).
- On an error, the worker sends a signal to the master; if multiple errors occur on the same record, the master instructs workers to skip that record.

MapReduce Refinements: Miscellaneous
- Combiner function at the mapper.
- Sorting guarantees within each reduce partition.
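The combiner mentioned above performs partial aggregation on the map worker before the shuffle, shrinking the intermediate data shipped across the network. A minimal sketch for word count, where combining is legal because addition is associative and commutative (function names are illustrative):

```python
from collections import Counter

def wc_map(text):
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: partially sum counts on the map worker itself, before
    # any data crosses the network to the reduce workers.
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

raw = wc_map("the cat and the dog and the bird")  # 8 intermediate pairs
combined = combine(raw)                           # 5 pairs, same totals
```

The reducer's result is identical with or without the combiner; only the volume of intermediate data changes.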
- Local execution mode for debugging/testing.
- User-defined counters.

MapReduce: Walk-Through of One More Application

MapReduce: PageRank
- PageRank models the behavior of a "random surfer": PR(p) = (1 - d) + d * sum over pages t linking to p of PR(t)/C(t), where C(t) is the out-degree of t and (1 - d) is a damping factor (the random jump).
- The random surfer keeps clicking on successive links at random, not taking content into consideration, and distributes a page's rank equally among all the pages it links to.
- The damping factor models the surfer "getting bored" and typing an arbitrary URL.

Computing PageRank (figure)

PageRank: Key Insights
- The effect at each iteration is local: the (i+1)-th iteration depends only on the i-th.
- Within iteration i, the PageRank of individual nodes can be computed independently.

PageRank Using MapReduce
- Use a sparse-matrix representation (M).
- Map each row of M to a list of PageRank "credit" to assign to out-link neighbors.
- These credits are reduced to a single PageRank value for a page by aggregating over them.
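One iteration of this scheme can be sketched in Python: the map step turns each row of the sparse link structure into (target, credit) pairs, and the reduce step sums the credit per page. As an assumption not in the slides, the damping term here uses the normalized form (1 - d)/N so that ranks sum to 1; names are illustrative:

```python
def pagerank_iteration(ranks, links, d=0.85):
    # Map step: each page splits its current rank equally among its
    # out-links, emitting (target, credit) pairs (one sparse row of M).
    credits = []
    for page, rank in ranks.items():
        targets = links.get(page, [])
        for t in targets:
            credits.append((t, rank / len(targets)))
    # Reduce step: sum the credit arriving at each page, then apply
    # damping: PR(p) = (1 - d)/N + d * (sum of credits to p).
    n = len(ranks)
    new_ranks = {p: (1 - d) / n for p in ranks}
    for t, credit in credits:
        new_ranks[t] += d * credit
    return new_ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_iteration({p: 1 / 3 for p in "abc"}, links)
```

Because every page here has at least one out-link, all rank is redistributed each round and the total mass stays at 1.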
PageRank Using MapReduce
- Map: distribute PageRank "credit" to link targets.
- Reduce: gather up the PageRank "credit" from multiple sources to compute a new PageRank value.
- Iterate until convergence. (Source of image: Lin 2008.)

Phase 1: Process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PR_init, list-of-urls)), where PR_init is the "seed" PageRank for the URL and list-of-urls contains all pages pointed to by the URL.
- The reduce task is just the identity function.

Phase 2: PageRank Distribution
- The reduce task gets a (URL, url_list) pair and many (URL, val) values; it sums the vals, applies the damping factor d to get the new PageRank, and emits (URL, (new_rank, url_list)).
- Convergence is checked by a non-parallel component.

MapReduce: Some More Applications
- Distributed grep, count of URL access frequency, clustering (k-means), graph algorithms, indexing systems.

MapReduce Programs in the Google Source Tree (figure)
MapReduce Jobs Run in August 2004 (figure)

MapReduce: Extensions and Similar Systems
- Pig (Yahoo!), Hadoop (Apache), DryadLINQ (Microsoft).

Large-Scale Systems Architecture Using MapReduce (figure)

Take-Home Messages
- Although restrictive, the model is a good fit for many problems encountered in practice when processing large data sets.
- The functional programming paradigm can be applied to large-scale computation.
- It is easy to use: it hides the messy details of parallelization, fault tolerance, data distribution, and load balancing from programmers.
- And finally: if it works for Google, it should be handy!
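Tying the PageRank walk-through together, the two phases plus the non-parallel convergence check can be sketched as a sequential loop. This assumes a graph with no dangling pages and uses the normalized (1 - d)/N damping term, both assumptions beyond the slides; names are illustrative:

```python
def pagerank(links, d=0.85, tol=1e-8, max_iters=200):
    # Phase 1 analogue: seed every known page with a uniform initial rank.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iters):
        # Phase 2 analogue: distribute credit along out-links, sum it,
        # and fold in the damping factor d.
        new = {p: (1 - d) / len(pages) for p in pages}
        for page, targets in links.items():
            for t in targets:
                new[t] += d * ranks[page] / len(targets)
        # Convergence check runs outside the map/reduce job (non-parallel).
        if sum(abs(new[p] - ranks[p]) for p in pages) < tol:
            return new
        ranks = new
    return ranks

scores = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On this symmetric three-page cycle every page ends up with the same score, which is a quick sanity check for the implementation.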