
School of Electrical and Information Engineering
Foreign Literature Translation

English title: Data mining-clustering
Chinese title: 数据挖掘——聚类分析
Major: Automation
Name: *
Class and student number: *
Supervisor: *
Translation source: Data mining
April 26

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

- Set of like elements. Elements from different clusters are not alike.
- The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.
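To make Figure 5.1's point concrete, the sketch below clusters a few hypothetical home records twice, once on location and once on size. The data, the choice of k-means, and k = 2 are all illustrative assumptions rather than anything prescribed by the text.

```python
# Minimal sketch: the same records cluster differently depending on
# which attributes are used (location vs. size), as in Figure 5.1.
import math
import random

homes = [
    {"x": 1.0, "y": 1.2, "size": 80.0},
    {"x": 1.3, "y": 0.9, "size": 210.0},
    {"x": 8.7, "y": 9.1, "size": 85.0},
    {"x": 9.0, "y": 8.8, "size": 200.0},
]

def kmeans(points, k, iters=20):
    """A very small k-means; points are equal-length tuples of numbers."""
    random.seed(0)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the per-attribute mean of its cluster.
        centroids = [
            tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return clusters

# Clustered on location (x, y): geographically close homes group together.
print(kmeans([(h["x"], h["y"]) for h in homes], k=2))
# Clustered on size alone: small vs. large homes, regardless of location.
print(kmeans([(h["size"],) for h in homes], k=2))
```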

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:

- Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
- Dynamic data in the database implies that cluster membership may change over time.
- Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
- There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
- Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

- The (best) number of clusters is not known.
- There may not be any a priori knowledge concerning the clusters.
- Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as the creation of a set of clusters: K = {K1, K2, ..., Kk}.

DEFINITION 5.1. Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster Kj contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = Kj, 1 ≤ i ≤ n and ti ∈ D}.
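Definition 5.1 can be read directly as code: a clustering is nothing more than a mapping from tuples to cluster indices, from which each cluster can be recovered. In this sketch the database D and the mapping f are hypothetical placeholders.

```python
# Definition 5.1 as code: a clustering is a mapping f from tuples to cluster
# indices 1..k; each cluster K_j collects exactly the tuples mapped to it.
D = [(1.0, 2.0), (1.1, 1.9), (8.0, 8.2), (7.9, 8.4)]  # tuples t1..t4
k = 2

def f(t):
    """A hypothetical mapping f: D -> {1, ..., k} (here: by x-coordinate)."""
    return 1 if t[0] < 5.0 else 2

# Build the clusters K_1..K_k from the mapping.
K = {j: [t for t in D if f(t) == j] for j in range(1, k + 1)}
print(K)  # {1: [(1.0, 2.0), (1.1, 1.9)], 2: [(8.0, 8.2), (7.9, 8.4)]}
```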

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created.

Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms.
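As a rough illustration of the agglomerative, bottom-up style (this is not the chapter's own pseudocode), the sketch below starts with every item in its own cluster, the lowest level of the hierarchy, and repeatedly merges the two closest clusters until one cluster remains. Single-link distance, which the chapter defines formally in Section 5.2, and the sample points are assumptions.

```python
# Agglomerative (bottom-up) sketch: start with singleton clusters and merge
# the closest pair at each step, producing one level of the hierarchy per
# merge, from n clusters down to 1.
import math

def single_link(a, b):
    """Smallest pointwise distance between two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Yield each level of the hierarchy, from n clusters down to 1."""
    clusters = [[p] for p in points]          # lowest level: one item each
    yield [list(c) for c in clusters]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
        yield [list(c) for c in clusters]

for level in agglomerate([(0.0, 0.0), (0.5, 0.1), (5.0, 5.0), (5.2, 4.9)]):
    print(level)
```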

Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures. We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(ti, tl), defined between any two tuples ti, tl ∈ D. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

A distance measure, dis(ti, tj), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that given a cluster Km, ∀ tml, tmn ∈ Km and ti ∉ Km, dis(tml, tmn) ≤ dis(tml, ti). Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality.
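The text leaves the concrete dis() open; as a sketch, here are two common metric choices together with a check of the triangle inequality they satisfy. Both distance functions are assumptions, not measures prescribed by the chapter.

```python
# Two common distance measures; both are metrics, so they satisfy the
# triangle inequality dis(x, z) <= dis(x, y) + dis(y, z).
def euclidean(t1, t2):
    return sum((a - b) ** 2 for a, b in zip(t1, t2)) ** 0.5

def manhattan(t1, t2):
    return sum(abs(a - b) for a, b in zip(t1, t2))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)
assert manhattan(x, z) <= manhattan(x, y) + manhattan(y, z)
```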

The cluster can then be described by using several characteristic values. Given a cluster Km of N points {tm1, tm2, ..., tmN}, we make the following definitions [ZRL96] (sums taken over i, j = 1, ..., N):

    centroid:  Cm = (1/N) Σi tmi
    radius:    Rm = sqrt( (1/N) Σi (tmi − Cm)² )
    diameter:  Dm = sqrt( (1/(N(N−1))) Σi Σj (tmi − tmj)² )

Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation Mm to indicate the medoid for cluster Km.
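A small sketch of these characteristic values for a hypothetical 2-D cluster follows; the medoid is picked here as the member with the smallest total distance to the others, which is one reasonable reading of "centrally located object" rather than a definition the text fixes.

```python
# The [ZRL96] characteristic values plus a medoid, for a small 2-D cluster.
import math

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
N = len(cluster)

# Centroid: per-attribute mean; need not be an actual point in the cluster.
centroid = tuple(sum(dim) / N for dim in zip(*cluster))

# Radius: sqrt of the average squared distance from each point to the centroid.
radius = math.sqrt(sum(math.dist(t, centroid) ** 2 for t in cluster) / N)

# Diameter: sqrt of the average squared distance over all pairs of points
# (the i = j terms contribute zero, matching the double-sum formula).
diameter = math.sqrt(
    sum(math.dist(a, b) ** 2 for a in cluster for b in cluster) / (N * (N - 1))
)

# Medoid: the actual member with the smallest total distance to the others.
medoid = min(cluster, key=lambda t: sum(math.dist(t, u) for u in cluster))

print(centroid, radius, diameter, medoid)
```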

Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters Ki and Kj, there are several standard alternatives to calculate the distance between clusters. A representative list is:

- Single link: Smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.
- Complete link: Largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.
- Average: Average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.
- Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.
- Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(Ki, Kj) = dis(Mi, Mj).
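The list above translates almost mechanically into code. In this sketch, dis() is assumed to be Euclidean distance, and the medoid is again chosen as the most central member, as in the earlier sketch.

```python
# The five inter-cluster distance alternatives, with dis() = Euclidean.
import math
from statistics import mean

def single_link(Ki, Kj):
    return min(math.dist(a, b) for a in Ki for b in Kj)

def complete_link(Ki, Kj):
    return max(math.dist(a, b) for a in Ki for b in Kj)

def average_link(Ki, Kj):
    return mean(math.dist(a, b) for a in Ki for b in Kj)

def centroid_dist(Ki, Kj):
    ci = tuple(sum(d) / len(Ki) for d in zip(*Ki))
    cj = tuple(sum(d) / len(Kj) for d in zip(*Kj))
    return math.dist(ci, cj)

def medoid_dist(Ki, Kj):
    med = lambda K: min(K, key=lambda t: sum(math.dist(t, u) for u in K))
    return math.dist(med(Ki), med(Kj))

Ki = [(0.0, 0.0), (1.0, 0.0)]
Kj = [(4.0, 3.0), (5.0, 3.0)]
for f in (single_link, complete_link, average_link, centroid_dist, medoid_dist):
    print(f.__name__, f(Ki, Kj))
```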

5.3 OUTLIERS

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.

Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.

Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently, and when compared with the normal water level values they may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data because real-world data values may not follow well-defined data distributions. Also, most of these tests assume a single attribute value, whereas many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.
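As a sketch of the two styles just mentioned, the code below flags values far from the mean (a crude stand-in for a formal discordancy test, and it assumes roughly normal data) and, alternatively, values with too few nearby neighbors. The height data echo the 2.5-meter example, and all thresholds are assumptions.

```python
# Two simple outlier-detection styles: statistical and distance-based.
from statistics import mean, stdev

heights = [1.62, 1.70, 1.68, 1.75, 1.73, 2.50, 1.66]

# Statistical: flag values more than 2 standard deviations from the mean;
# meaningful only if the data roughly follow a known (normal) distribution.
m, s = mean(heights), stdev(heights)
stat_outliers = [x for x in heights if abs(x - m) > 2 * s]

# Distance-based: a point is an outlier if fewer than min_neighbors other
# points lie within radius r of it.
def distance_outliers(values, r=0.15, min_neighbors=2):
    return [
        x for i, x in enumerate(values)
        if sum(1 for j, y in enumerate(values) if i != j and abs(x - y) <= r)
        < min_neighbors
    ]

print(stat_outliers, distance_outliers(heights))  # both flag 2.50
```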

