清华云计算课件分布式集群.ppt

上传人:赵** 文档编号:82668094 上传时间:2023-03-26 格式:PPT 页数:54 大小:420KB
返回 下载 相关 举报
清华云计算课件分布式集群.ppt_第1页
第1页 / 共54页
清华云计算课件分布式集群.ppt_第2页
第2页 / 共54页
点击查看更多>>
资源描述

《清华云计算课件分布式集群.ppt》由会员分享,可在线阅读,更多相关《清华云计算课件分布式集群.ppt(54页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。

1、Distributed Computing SeminarLecture 4:Clustering an Overview and Sample MapReduce ImplementationChristophe Bisciglia,Aaron Kimball,&Sierra Michels-SlettvetGoogle,Inc.Summer 2007Except as otherwise noted,the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

2、OutlineClusteringIntuitionClustering Algorithms The Distance MeasureHierarchical vs.PartitionalK-Means ClusteringComplexityCanopy ClusteringMapReducing a large data set with K-Means and Canopy ClusteringClusteringWhat is clustering?Google NewsThey didnt pick all 3,400,217 related articles by handOr

3、A Or NetflixOther less glamorous things.Hospital RecordsScientific ImagingRelated genes,related stars,related sequencesMarket ResearchSegmenting markets,product positioningSocial Network AnalysisData miningImage segmentationThe Distance MeasureHow the similarity of two elements in a set is determine

4、d,e.g.Euclidean DistanceManhattan DistanceInner Product SpaceMaximum Norm Or any metric you define over the spaceHierarchical Clustering vs.Partitional ClusteringTypes of AlgorithmsHierarchical ClusteringBuilds or breaks up a hierarchy of clusters.Partitional ClusteringPartitions set into all cluste

5、rs simultaneously.Partitional Clustering Partitions set into all clusters simultaneously.K-Means Clustering Super simple Partitional ClusteringChoose the number of clusters,kChoose k points to be cluster centersThenK-Means Clusteringiterate Compute distance from all points to all k-centers Assign ea

6、ch point to the nearest k-center Compute the average of all points assigned to all specific k-centers Replace the k-centers with the new averagesBut!The complexity is pretty high:k*n*O(distance metric)*num(iterations)Moreover,it can be necessary to send tons of data to each Mapper Node.Depending on

7、your bandwidth and memory available,this could be impossible.FurthermoreThere are three big ways a data set can be large:There are a large number of elements in the set.Each element can have many features.There can be many clusters to discoverConclusion Clustering can be huge,even when you distribut

8、e it.Canopy ClusteringPreliminary step to help parallelize computation.Clusters data into overlapping Canopies using super cheap distance metric.EfficientAccurateCanopy ClusteringWhile there are unmarked points pick a point which is not strongly marked call it a canopy centermark all points within s

9、ome threshold of it as in its canopystrongly mark all points within some stronger threshold After the canopy clusteringResume hierarchical or partitional clustering as usual.Treat objects in separate clusters as being at infinite distances.MapReduce Implementation:Problem Efficiently partition a lar

10、ge data set(say movies with user ratings!)into a fixed number of clusters using Canopy Clustering,K-Means Clustering,and a Euclidean distance measure.The Distance MetricThe Canopy Metric($)The K-Means Metric($)Steps!Get Data into a form you can use(MR)Picking Canopy Centers(MR)Assign Data Points to

11、Canopies(MR)Pick K-Means Cluster CentersK-Means algorithm(MR)Iterate!Data MassageThis isnt interesting,but it has to be done.Selecting Canopy CentersAssigning Points to CanopiesK-Means MapIterating K-MeansElbow CriterionChoose a number of clusters s.t.adding a cluster doesnt add interesting informat

12、ion.Rule of thumb to determine what number of Clusters should be chosen.Initial assignment of cluster seeds has bearing on final model performance.Often required to run clustering several times to get maximal performanceConclusionsClustering is slickAnd it can be done super efficientlyAnd in lots of different ways

展开阅读全文
相关资源
相关搜索

当前位置:首页 > 教育专区 > 高考资料

本站为文档C TO C交易模式,本站只提供存储空间、用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。本站仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知淘文阁网,我们立即给予删除!客服QQ:136780468 微信:18945177775 电话:18904686070

工信部备案号:黑ICP备15003705号© 2020-2023 www.taowenge.com 淘文阁