Cloud Computing and Cloud Data Management
Jiaheng Lu, Renmin University of China
Advanced Data Management Frontier Workshop

Outline
- Overview of cloud computing
- Google cloud computing technologies: GFS, Bigtable and MapReduce
- Yahoo cloud computing technologies and Hadoop
- Challenges of cloud data management

New course at Renmin University: Distributed Systems and Cloud Computing
- Overview of distributed systems
- Survey of distributed cloud computing technologies
- Distributed cloud computing platforms
- Distributed cloud computing program development

Part I: Overview of Distributed Systems
- Chapter 1: Introduction to distributed systems
- Chapter 2: Client-server architecture
- Chapter 3: Distributed objects
- Chapter 4: Common Object Request Broker Architecture (CORBA)

Part II: Cloud Computing Overview
- Chapter 5: Introduction to cloud computing
- Chapter 6: Cloud services
- Chapter 7: Comparison of cloud-related technologies
  - 7.1 Grid computing and cloud computing
  - 7.2 Utility computing and cloud computing
  - 7.3 Parallel and distributed computing and cloud computing
  - 7.4 Cluster computing and cloud computing

Part III: Cloud Computing Platforms
- Chapter 8: The three major technologies of the Google cloud platform
- Chapter 9: Technologies of the Yahoo cloud platform
- Chapter 10: Technologies of the Aneka cloud platform
- Chapter 11: Technologies of the Greenplum cloud platform
- Chapter 12: Technologies of the Amazon Dynamo cloud platform

Part IV: Cloud Computing Platform Development
- Chapter 13: Development on Hadoop
- Chapter 14: Development on HBase
- Chapter 15: Development on Google Apps
- Chapter 16: Development on MS Azure
- Chapter 17: Development on Amazon EC2

Cloud computing

Why do we use cloud computing?
- Case 1: You write a file and save it. If the computer goes down, the file is lost. With cloud computing, files are always stored in the cloud and are never lost.
- Case 2: To use IE, QQ, or C++, you must download, install, and then use them. With cloud computing, you get the service directly from the cloud.

What is cloud and cloud computing?
- Cloud: resources or services on demand over the Internet, with the scale and reliability of a data center.
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the cloud that supports them.
Characteristics of cloud computing
- Virtual: software, databases, Web servers, operating systems, storage and networking as virtual servers.
- On demand: add and subtract processors, memory, network bandwidth, storage.

Types of cloud service
- IaaS: Infrastructure as a Service
- PaaS: Platform as a Service
- SaaS: Software as a Service
SaaS: software delivery model
- No hardware or software to manage
- Service delivered through a browser
- Customers use the service on demand
- Instant scalability

SaaS examples
- Your current CRM package is not managing the load, or you simply don't want to host it in-house: use a SaaS provider such as S…
- Your email is hosted on an Exchange server in your office and it is very slow: outsource this using Hosted Exchange.

PaaS: platform delivery model
- Platforms are built upon infrastructure, which is expensive
- Estimating demand is not a science!
- Platform management is not fun!
PaaS examples
- You need to host a large file (5 MB) on your website and make it available to 35,000 users for only two months: use CloudFront from Amazon.
- You want to start storage services on your network for a large number of files and you do not have the storage capacity: use Amazon S3.
IaaS: computer infrastructure delivery model
- A platform virtualization environment
- Computing resources, such as storage and processing capacity
- Virtualization taken a step further

IaaS examples
- You want to run a batch job but you don't have the infrastructure necessary to run it in a timely manner: use Amazon EC2.
- You want to host a website, but only for a few days: use Flexiscale.

Cloud computing and other computing techniques

The 21st Century Vision of Computing
- Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project, which seeded the Internet, said:
  "As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of computer utilities which, like present electric and telephone utilities, will service individual homes and offices across the country."
The 21st Century Vision of Computing
- Sun Microsystems co-founder Bill Joy also indicated: "It would take time until these markets mature to generate this kind of value. Predicting now which companies will capture the value is impossible. Many of them have not even been created yet."

Definitions: Cloud, Grid, Cluster, Utility
- Utility computing is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.
- A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer.
- Grid computing is the application of several computers to a single problem at the same time, usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.

Grid Computing & Cloud Computing
- They share a lot of commonality: intention, architecture and technology.
- They differ in programming model, business model, compute model, applications, and virtualization.
Grid Computing & Cloud Computing
- The problems are mostly the same:
  - manage large facilities;
  - define methods by which consumers discover, request and use resources provided by the central facilities;
  - implement the often highly parallel computations that execute on those resources.
Grid Computing & Cloud Computing: Virtualization
- Grid: does not rely on virtualization as much as clouds do; each individual organization maintains full control of its own resources.
- Cloud: virtualization is an indispensable ingredient for almost every cloud.
Any questions or comments?

Outline
- Overview of cloud computing
- Google cloud computing technologies: GFS, Bigtable and MapReduce
- Yahoo cloud computing technologies and Hadoop
- Challenges of cloud data management

Google cloud computing techniques

The Google File System (GFS)
- A scalable distributed file system for large, distributed, data-intensive applications.
- Multiple GFS clusters are currently deployed. The largest ones have 1000+ storage nodes and 300+ terabytes of disk storage, and are heavily accessed by hundreds of clients on distinct machines.
Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment.
Intro: Observations 1
1. Component failures are the norm: constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system.
2. Huge files (by traditional standards): multi-GB files are common; I/O operations and block sizes must be revisited.
Intro: Observations 2
3. Most files are mutated by appending new data: this is the focus of performance optimization and atomicity guarantees.
4. Co-designing the applications and the APIs benefits the overall system by increasing flexibility.
The Design
- A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.

The Master
- Maintains all file system metadata: namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
- Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state.
The Master
- Helps make sophisticated chunk placement and replication decisions, using global knowledge.
- For reading and writing, a client contacts the Master to get chunk locations, then deals directly with chunkservers.
- The Master is not a bottleneck for reads/writes.
Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle, assigned by the master at chunk creation.
- Chunk size is 64 MB.
- Each chunk is replicated on 3 (by default) servers.

Clients
- Linked to applications using the file system API.
- Communicate with the master and chunkservers for reading and writing: master interactions only for metadata, chunkserver interactions for data.
- Only cache metadata information; the data itself is too large to cache.
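To make the division of labor concrete, here is a minimal Python sketch (hypothetical names such as Master, lookup and locate; not the real GFS API) of the client-side step described above: translate a byte offset into a chunk index, ask the master only for that chunk's handle and replica locations, and then go to a chunkserver for the data.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide

class Master:
    """Toy master: holds only metadata, never the file data itself."""
    def __init__(self):
        # (file name, chunk index) -> (chunk handle, list of replica chunkserver ids)
        self.chunk_table = {("bigfile", 0): ("handle-0", ["cs1", "cs2", "cs3"]),
                            ("bigfile", 1): ("handle-1", ["cs2", "cs3", "cs4"])}

    def lookup(self, name, chunk_index):
        return self.chunk_table[(name, chunk_index)]

def locate(master, name, offset):
    """Client-side step: turn a byte offset into (chunk handle, replicas, offset in chunk)."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(name, chunk_index)
    return handle, replicas, offset % CHUNK_SIZE

# The client then reads the data directly from one of the replicas;
# the master is consulted only for this metadata.
print(locate(Master(), "bigfile", 70 * 1024 * 1024))
# ('handle-1', ['cs2', 'cs3', 'cs4'], 6291456)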
Chunk Locations
- The Master does not keep a persistent record of the locations of chunks and replicas.
- It polls chunkservers at startup, and whenever new chunkservers join or leave.
- It stays up to date by controlling the placement of new chunks and through HeartBeat messages (when monitoring chunkservers).
Operation Log
- A record of all critical metadata changes.
- Stored on the Master and replicated on other machines.
- Defines the order of concurrent operations.
- Also used to recover the file system state.
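The sketch below illustrates the write-ahead idea behind such an operation log (plain Python with hypothetical names and a JSON-lines log format; GFS's real log format is not described on the slide): every metadata change is appended to a durable log before it is applied, and replaying the log on startup recovers the metadata.

import json

class MetadataStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.files = {}                       # in-memory metadata: file name -> chunk handles
        self.replay()                         # recover state from the log on startup

    def apply(self, record):
        if record["op"] == "create":
            self.files[record["name"]] = []
        elif record["op"] == "add_chunk":
            self.files[record["name"]].append(record["handle"])

    def log_and_apply(self, record):
        with open(self.log_path, "a") as log:  # append the change to the log first ...
            log.write(json.dumps(record) + "\n")
        self.apply(record)                     # ... then apply it to the in-memory metadata

    def replay(self):
        try:
            with open(self.log_path) as log:   # replaying the log rebuilds the metadata
                for line in log:
                    self.apply(json.loads(line))
        except FileNotFoundError:
            pass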
System Interactions: Leases and Mutation Order
- Leases maintain a consistent mutation order across all chunk replicas.
- The Master grants a lease to one replica, called the primary.
- The primary chooses the serial mutation order, and all replicas follow this order.
- This minimizes management overhead for the Master.
Atomic Record Append
- The client specifies the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the offset to the client.
- Heavily used by Google's distributed applications; no need for a distributed lock manager.
- GFS chooses the offset, not the client.
Atomic Record Append: How?
- Follows a control flow similar to other mutations.
- The primary tells the secondary replicas to append at the same offset as the primary.
- If an append fails at any replica, it is retried by the client, so replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record.
Atomic Record Append: How?
- GFS does not guarantee that all replicas are bitwise identical; it only guarantees that the data is written at least once as an atomic unit.
- Data must be written at the same offset on all chunk replicas for success to be reported.
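Because record append is "at least once", readers must tolerate duplicate records left behind by retries. A common application-level coping strategy (illustrative only, not part of GFS itself; the record id is an application-chosen field assumed here) is to tag each record with an id when writing and to skip ids already seen when reading:

def read_unique(records):
    # Reader side: records carry an application-chosen id; skip ids already seen.
    seen = set()
    for record_id, payload in records:
        if record_id in seen:         # a duplicate left behind by a retried append
            continue
        seen.add(record_id)
        yield payload

# A retried append can leave the same record on a replica twice:
replica = [(1, "event-a"), (2, "event-b"), (2, "event-b"), (3, "event-c")]
print(list(read_unique(replica)))     # ['event-a', 'event-b', 'event-c']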
Detecting Stale Replicas
- The Master keeps a chunk version number to distinguish up-to-date and stale replicas.
- The version number is increased when a lease is granted.
- If a replica is unavailable, its version is not increased; the Master detects stale replicas when chunkservers report their chunks and versions.
- Stale replicas are removed during garbage collection.
Garbage Collection
- When a client deletes a file, the Master logs it like other changes and renames the file to a hidden name.
- The Master removes files that have been hidden for longer than 3 days when scanning the file system namespace; the metadata is erased at that point.
- During HeartBeat messages, each chunkserver sends the Master a subset of its chunks, and the Master tells it which chunks no longer have metadata. The chunkserver removes these chunks on its own.
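A minimal sketch of this lazy-deletion idea (plain Python with hypothetical names; the 3-day window comes from the slide, everything else is illustrative): deletion only renames the entry to a hidden, timestamped name, and a periodic namespace scan reclaims anything hidden long enough.

import time

HIDDEN = ".deleted."
RETENTION = 3 * 24 * 3600            # 3 days, as on the slide

def delete_file(namespace, name, now=None):
    # "Deleting" only renames the entry to a hidden, timestamped name.
    now = int(now if now is not None else time.time())
    namespace[f"{HIDDEN}{now}.{name}"] = namespace.pop(name)

def gc_scan(namespace, now=None):
    # Periodic namespace scan: erase metadata hidden longer than the retention window.
    now = now if now is not None else time.time()
    for hidden in [n for n in namespace if n.startswith(HIDDEN)]:
        hidden_at = int(hidden[len(HIDDEN):].split(".", 1)[0])
        if now - hidden_at > RETENTION:
            del namespace[hidden]

ns = {"logs/2009-01-01": ["handle-1", "handle-2"]}
delete_file(ns, "logs/2009-01-01", now=0)
gc_scan(ns, now=4 * 24 * 3600)        # four "days" later the entry is really gone
print(ns)                             # {}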
Fault Tolerance: High Availability
- Fast recovery: the Master and chunkservers can restart in seconds.
- Chunk replication.
- Master replication: "shadow" masters provide read-only access when the primary master is down; mutations are not done until they are recorded on all master replicas.
Fault Tolerance: Data Integrity
- Chunkservers use checksums to detect corrupt data.
- Since replicas are not bitwise identical, each chunkserver maintains its own checksums.
- For reads, the chunkserver verifies the checksum before sending the chunk.
- Checksums are updated during writes.
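As a rough illustration of checksum-on-read (plain Python using zlib.crc32; the 64 KB block size is an assumption for the sketch, not stated on the slide): data is checksummed in fixed-size blocks when written, and every block is verified before it is returned to a reader.

import zlib

BLOCK = 64 * 1024                     # assumed checksum granularity; the slide gives none

def checksum_blocks(data):
    # One CRC32 per fixed-size block, computed when the data is written.
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data, checksums):
    # Verify every block before returning it; a mismatch means this replica is corrupt.
    for i, expected in enumerate(checksums):
        block = data[i * BLOCK:(i + 1) * BLOCK]
        if zlib.crc32(block) != expected:
            raise IOError(f"block {i} is corrupt; read from another replica")
    return data

chunk = b"x" * (3 * BLOCK)
sums = checksum_blocks(chunk)
assert verified_read(chunk, sums) == chunk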
Introduction to MapReduce

MapReduce: Insight
- "Consider the problem of counting the number of occurrences of each word in a large collection of documents."
- How would you do it in parallel?

MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
- Users implement an interface with two primary methods:
  1. Map: (key1, val1) -> list of (key2, val2)
  2. Reduce: (key2, list of val2) -> val3
Map operation
- Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g. (docid, doc-content).
- Drawing an analogy to SQL, map can be visualized as the GROUP BY clause of an aggregate query.
Reduce operation
- On completion of the map phase, all the intermediate values for a given output key are combined together into a list and given to a reducer.
- Can be visualized as an aggregate function (e.g., average) that is computed over all the rows with the same GROUP BY attribute.
Pseudo-code (word count)

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
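For readers who want to run the idea locally, here is a self-contained Python sketch that mimics the two phases above, with an explicit in-memory "shuffle" step; it is only a toy model of the programming model, not a distributed implementation.

from collections import defaultdict

def map_phase(docid, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: sum all intermediate counts for one word.
    return word, sum(counts)

def word_count(documents):
    # Shuffle: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for docid, contents in documents.items():
        for word, count in map_phase(docid, contents):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count({"d1": "the quick brown fox", "d2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}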
40、ult=0;for each v in intermediate_values:result+=ParseInt(v);Emit(AsString(result);MapReduce:Execution overview MapReduce:Example MapReduce in Parallel:Example MapReduce:Fault TolerancelHandled via re-execution of tasks.Task completion committed through master lWhat happens if Mapper fails?lRe-execut
MapReduce: Walkthrough of One More Application

MapReduce: PageRank
- PageRank models the behavior of a "random surfer". The update for a page p is PR(p) = (1 - d) + d * sum over pages t linking to p of PR(t) / C(t), where C(t) is the out-degree of t, and (1 - d) is a damping factor (random jump).
- The "random surfer" keeps clicking on successive links at random, not taking content into consideration.
- A page distributes its PageRank equally among all the pages it links to.
- The damping factor models the surfer "getting bored" and typing an arbitrary URL.

PageRank: Key Insights
- The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th iteration.
- At iteration i, the PageRank of individual nodes can be computed independently.
PageRank using MapReduce
- Use a sparse matrix representation (M).
- Map each row of M to a list of PageRank "credit" to assign to its out-link neighbours.
- These prestige scores are reduced to a single PageRank value for a page by aggregating over them.
PageRank using MapReduce
- Map: distribute PageRank "credit" to link targets.
- Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value.
- Iterate until convergence. (Source of image: Lin 2008)
Phase 1: Process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls)), where PRinit is the "seed" PageRank for URL and list-of-urls contains all pages pointed to by URL.
- The reduce task is just the identity function.

Phase 2: PageRank Distribution (a toy sketch of both phases follows below)
- The reduce task gets one (URL, url_list) and many (URL, val) values.
- Sum the vals and fix up with d to get the new PR.
- Emit (URL, (new_rank, url_list)).
- Check for convergence using a non-parallel component.
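The following compact Python sketch mirrors these two phases with a toy in-memory shuffle; the helper names are hypothetical and the damping factor d = 0.85 is an assumption (the slides only call it d). It illustrates the map/reduce steps described above, not a production implementation.

from collections import defaultdict

D = 0.85  # assumed damping factor; the slides only call it d

def pagerank_map(url, state):
    # state = (rank, out_links): pass the link structure through and distribute credit.
    rank, out_links = state
    yield url, ("links", out_links)
    for target in out_links:
        yield target, ("credit", rank / len(out_links))

def pagerank_reduce(url, values):
    # Gather credit from all in-links and fix up with the damping factor.
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    return url, ((1 - D) + D * total, out_links)

def one_iteration(graph):
    # Toy shuffle: group map output by key, then reduce each group.
    grouped = defaultdict(list)
    for url, state in graph.items():
        for key, value in pagerank_map(url, state):
            grouped[key].append(value)
    return dict(pagerank_reduce(url, vals) for url, vals in grouped.items())

graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
for _ in range(20):                   # iterate until (approximate) convergence
    graph = one_iteration(graph)
print({u: round(r, 3) for u, (r, _) in graph.items()})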
MapReduce: Some More Apps
- Distributed grep
- Count of URL access frequency
- Clustering (k-means)
- Graph algorithms
- Indexing systems

MapReduce Programs in the Google Source Tree (figure)
MapReduce: Extensions and Similar Apps
- PIG (Yahoo)
- Hadoop (Apache)
- DryadLINQ (Microsoft)

Large-Scale Systems Architecture using MapReduce (figure)

BigTable: A Distributed Storage System for Structured Data

Introduction
- BigTable is a distributed storage system for managing structured data.
- Designed to scale to a very large size: petabytes of data across thousands of servers.
- Used for many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, ...
- A flexible, high-performance solution for all of Google's products.
Motivation
- Lots of (semi-)structured data at Google:
  - URLs: contents, crawl metadata, links, anchors, PageRank, ...
  - Per-user data: user preference settings, recent queries/search results, ...
  - Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is large:
  - Billions of URLs, many versions per page (~20 KB per version)
  - Hundreds of millions of users, thousands of queries/sec
  - 100 TB+ of satellite image data
Why not just use a commercial DB?
- Scale is too large for most commercial databases.
- Even if it weren't, the cost would be very high.
- Building it internally means the system can be applied across many projects for low incremental cost.
- Low-level storage optimizations h