Distributed Computing Seminar
Lecture 4: Clustering - an Overview and Sample MapReduce Implementation
Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
Google, Inc. Summer 2007
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Outline
- Clustering
  - Intuition
- Clustering Algorithms
  - The Distance Measure
  - Hierarchical vs. Partitional
  - K-Means Clustering
  - Complexity
  - Canopy Clustering
- MapReducing a large data set with K-Means and Canopy Clustering

Clustering
What is clustering?
- Google News: they didn't pick all 3,400,217 related articles by hand
- Or Amazon, or Netflix
- Other, less glamorous things:
  - Hospital records
  - Scientific imaging: related genes, related stars, related sequences
  - Market research: segmenting markets, product positioning
  - Social network analysis
  - Data mining
  - Image segmentation

The Distance Measure
How the similarity of two elements in a set is determined, e.g.:
- Euclidean distance
- Manhattan distance
- Inner product space
- Maximum norm
- Or any metric you define over the space
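To make those measures concrete, here is a minimal Python sketch; the function names are mine (the slides only name the measures), and the two vectors are assumed to have equal length:

```python
import math

def euclidean(a, b):
    # Straight-line distance: sqrt of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def inner_product(a, b):
    # Inner product as a similarity score (larger means more similar).
    return sum(x * y for x, y in zip(a, b))

def maximum_norm(a, b):
    # Maximum (Chebyshev) norm: the largest single coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))
```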
Hierarchical Clustering vs. Partitional Clustering
Types of algorithms:
- Hierarchical Clustering: builds or breaks up a hierarchy of clusters.
- Partitional Clustering: partitions the set into all clusters simultaneously.
K-Means Clustering
- Super simple partitional clustering
- Choose the number of clusters, k
- Choose k points to be cluster centers
- Then...

K-Means Clustering
- Iterate:
  - Compute the distance from all points to all k centers
  - Assign each point to its nearest k-center
  - Compute the average of all points assigned to each k-center
  - Replace the k-centers with the new averages
- (A single-machine sketch of this loop follows.)
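A minimal single-machine sketch of exactly that loop, assuming points are equal-length numeric tuples; the fixed iteration count stands in for a real convergence test:

```python
import math
import random

def euclidean(a, b):
    # Same Euclidean distance as in the earlier sketch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)      # choose k points as cluster centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]   # points assigned to each center
        for p in points:                    # distance from all points to all k centers
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)     # assign each point to its nearest center
        for i, members in enumerate(clusters):
            if members:                     # keep the old center if it got no points
                # replace the center with the average of its assigned points
                centers[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers
```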
But!
- The complexity is pretty high: k * n * O(distance metric) * num(iterations)
- Moreover, it can be necessary to send tons of data to each Mapper node. Depending on your bandwidth and the memory available, this could be impossible.
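For a sense of scale (these numbers are illustrative, not from the slides): with k = 100 clusters, n = 10,000,000 points, 100 features per point (so one distance computation costs about 100 operations), and 10 iterations, that formula gives 100 × 10,000,000 × 100 × 10 = 10^12 operations, before counting any data movement.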
Furthermore
There are three big ways a data set can be large:
- There are a large number of elements in the set.
- Each element can have many features.
- There can be many clusters to discover.
Conclusion: clustering can be huge, even when you distribute it.
8、e it.Canopy ClusteringPreliminary step to help parallelize computation.Clusters data into overlapping Canopies using super cheap distance metric.EfficientAccurateCanopy ClusteringWhile there are unmarked points pick a point which is not strongly marked call it a canopy centermark all points within s
After the canopy clustering
- Resume hierarchical or partitional clustering as usual.
- Treat objects in separate canopies as being at infinite distances.

MapReduce Implementation: Problem
Efficiently partition a large data set (say, movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
The Distance Metric
- The Canopy Metric ($)
- The K-Means Metric ($)

Steps!
1. Get data into a form you can use (MR)
2. Picking canopy centers (MR)
3. Assign data points to canopies (MR)
4. Pick K-Means cluster centers
5. K-Means algorithm (MR)
6. Iterate!
(A sketch of this pipeline as a driver program follows.)
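The step list reads naturally as a driver program. Below is a hypothetical skeleton: run_job, the job names, and the stopping test are all placeholders of mine, not an API from the slides.

```python
import random

def run_job(name, *inputs):
    # Placeholder: submit the named MapReduce job and return its output.
    raise NotImplementedError(name)

def cluster_pipeline(raw_input, k, max_iterations=10):
    vectors = run_job("data-massage", raw_input)               # step 1 (MR)
    canopy_centers = run_job("pick-canopy-centers", vectors)   # step 2 (MR)
    canopied = run_job("assign-to-canopies", vectors, canopy_centers)  # step 3 (MR)
    centers = random.sample(canopied, k)                       # step 4 (local)
    for _ in range(max_iterations):                            # step 6: Iterate!
        new_centers = run_job("kmeans-step", canopied, centers)  # step 5 (MR)
        if new_centers == centers:     # stop once the centers stabilize
            break
        centers = new_centers
    return centers
```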
Data Massage
- This isn't interesting, but it has to be done.

Selecting Canopy Centers

Assigning Points to Canopies

K-Means Map

Iterating K-Means
(A combined sketch of the map and reduce behind these steps follows.)
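Here is one hedged reading of the "K-Means Map" step as map and reduce functions. The key/value shapes and the canopy bookkeeping (point_canopies as a set of canopy ids for the point, center_canopies as a dict from center to its canopy ids) are my assumptions, and euclidean() is the helper from the distance sketch above.

```python
def kmeans_map(point, point_canopies, centers, center_canopies):
    # Map: emit (nearest center, point). Only centers that share a canopy
    # with the point are considered -- this is where canopies cut the cost.
    shared = [c for c in centers if center_canopies[c] & point_canopies]
    nearest = min(shared or centers, key=lambda c: euclidean(point, c))
    yield nearest, point

def kmeans_reduce(center, assigned_points):
    # Reduce: the replacement center is the average of its assigned points.
    pts = list(assigned_points)
    yield center, tuple(sum(dim) / len(pts) for dim in zip(*pts))
```

The "Iterating K-Means" step would rerun this job with the reducer's output as the next round's centers, as in the driver sketch above.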
Elbow Criterion
- Choose a number of clusters such that adding a cluster doesn't add interesting information.
- A rule of thumb to determine what number of clusters should be chosen.
- The initial assignment of cluster seeds has a bearing on final model performance.
- It is often required to run clustering several times to get maximal performance.

Conclusions
- Clustering is slick
- And it can be done super efficiently
- And in lots of different ways