《数据挖掘5学习.pptx》由会员分享,可在线阅读,更多相关《数据挖掘5学习.pptx(41页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。
1、2023/3/171What is concept description?What is concept description?Generate descriptions for characterization and comparison of the datathe simplest kind of descriptive data miningsometimes called class description when the concept to be described refers to a class of objectsCharacterization:provide
2、a concise and succinct summarization of the given collection of dataComparison(discrimination):provide descriptions comparing two or more collections of data第1页/共41页2023/3/172Data generalizationData generalization both characterization and discrimination are based on data generalization and summariz
3、ationData generalization a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels Data generalization approaches:data cube approach attribute-oriented induction approach第2页/共41页2023/3/173Data cube approachData cube a
4、pproach The data for analysis are stored in a multidimensional database,or data cube generalization and specialization can be performed on a data cube by roll-up and drill-down this is not an approach for concept description,only for data generalization Limitations:most commercial data cube implemen
5、tations confine the types of dimensions to simple nonnumeric data and of measures to simple aggregated numeric values concept hierarchies can be automatically generated from numeric data to form numeric dimensions,however,this is a result of recent data mining research and is not available in most c
6、ommercial systemscannot tell which dimensions should be used and what levels should the generalization reach第3页/共41页2023/3/174Is OLAP enough?Is OLAP enough?OLAPrestricted to certain kinds of attributes and measure typesuser-controlled processConcept descriptioncan handle complex data types of the at
7、tributes and their aggregationsa more automated process第4页/共41页2023/3/175Attribute-oriented inductionAttribute-oriented inductionproposed in 1989Y.Cai,N.Cercone,and J.Han,KDD Workshop at IJCAI-89in its initial proposal,AOI is a relational database query-oriented,generalization-based,online data anal
8、ysis techniquenow data cube and offline precomputation can also be usedcan be used for both characterization and discriminationgeneral idea:collect the task-relevant dataperform generalization by attribute removal or attribute generalizationapply aggregation by merging identical,generalized tuples a
9、nd accumulating their respective countsinteractive presentation with users第5页/共41页2023/3/176Sketch of AOISketch of AOIData focusing the specification of task-relevant data,whose result is the initial relationData generalization attribute removal if there is a large set of distinct values for an attr
10、ibute,but either(1)there is no generalization operator on the attribute,or(2)its higher level concepts are expressed in terms of other attributes attribute generalization if there is a large set of distinct values for an attribute,and there exists a set of generalization operators on the attributePr
11、esentation第6页/共41页2023/3/177How to control generalization?How to control generalization?the control of how high an attribute should be generalized is quite subjective the control of this process is called attribute generalization controlattribute generalization threshold controlif the number of dist
12、inct values in an attribute is greater than the threshold,then further attribute removal or generalization should be performedeither set a generalization threshold for all the attributes,or set one threshold for each attributetypically ranging from 2 to 8generalized relation threshold controlif the
13、number of distinct tuples in a generalized relation is greater than the threshold,then further generalization should be performedset a threshold for the size of the generalized relation typically ranging from 10 to 30第7页/共41页2023/3/178Example of AOIExample of AOIInitial relation Prime generalized re
14、lation第8页/共41页2023/3/179RDBMS implementation of AOIRDBMS implementation of AOIInitialRelquery processing of task-relevant data,deriving the initial relationPreGenbased on the analysis of the number of distinct values in each attribute,determine generalization plan for each attribute:removal?or how h
15、igh to generalize?PrimeGenbased on the PreGen plan,perform generalization to the appropriate level to derive a“prime generalized relation”,accumulating the countsPresentationuser interaction:(1)adjust levels by drilling,(2)pivoting,(3)mapping into rules,cross tabs,visualization presentations第9页/共41页
16、2023/3/1710Data cube implementation of AOI(I)Construct a data cube on-the-fly if either the task-relevant data set is too specific to match any predefined data cube,or it is not very large benefitfacilitate efficient drill-down analysis costincrease response time because of the computation of data c
17、ube balancecompute a cube-structured“subprime”relation in which each dimension of the generalized relation is a few levels deeper than the level of the prime relation第10页/共41页2023/3/1711Data cube implementation of AOI(II)Use a predefined data cube if the granularity of the task-relevant data can mat
18、ch that of the predefined data cube and the set of task-relevant data is quite large benefit facilitate attribute analysis,attribute-oriented induction,slicing and dicing,drill-down,and roll-up cost the cost of cube computation,and the nontrivial storage overhead第11页/共41页2023/3/1712Presentation of c
19、haracterization(I)Generalized relationsales in 1997第12页/共41页2023/3/1713Presentation of characterization(II)Crosstabsales in 1997第13页/共41页2023/3/1714Presentation of characterization(III)Bar chartsales in 1997第14页/共41页2023/3/1715Presentation of characterization(IV)Pie chartsales in 1997第15页/共41页2023/3
20、/1716Presentation of characterization(V)3-D cube第16页/共41页2023/3/1717Presentation of characterization(VI)Quantitative characteristic rule X,item(X)=“computer”(location(X)=“Asia”)t:25.00%location(X)=“Europe”)t:30.00%(location(X)=“North_America”)t:45.00%a logic rule that is associated with quantitative
21、 information is called a quantitative rulethe general form of a quantitative characteristic rule is:X,target_class(X)condition1(X)t:w1conditionn(X)t:wnwhere t-weight describes the typicality of each disjunct in the rulecharacteristic rule is necessary condition of the target class第17页/共41页2023/3/171
22、8Why attribute relevance analysis?it is difficult for users to determine which dimensions should be includedattribute relevance analysis is used tofilter out statistically irrelevant or weakly relevant attributesretain or even rank the most relevant attributesclass characterization which includes th
23、e analysis of attribute/dimension relevance is called analytical characterizationclass comparison which includes such analysis is called analytical comparison even within the same dimension,different levels of concepts may have different powers for distinguishing a class from others第18页/共41页2023/3/1
24、719How to perform attribute relevanceanalysis?Intuitively,an attribute is considered highly relevant to a given class if it is likely that the values of the attribute may be used to distinguish the class from othersData collectioncollect data for both target class and contrasting classfor class char
25、acterization,the contrasting class is taken to be the set of comparable data,i.e.data sharing similar attributes,of other classes in the databaseAnalytical generalizationperform attribute removal and attribute generalization based on the set of provided attribute analytical thresholdRelevant analysi
26、ssort and then select the most relevant attributes第19页/共41页2023/3/1720Relevance measures The general idea behind attribute relevance analysis is to compute some measure which is used to quantify the relevance of an attribute with respect to a given classpopular measures include:information gaingain
27、ratiogini indexuncertaintycorrelation coefficients第20页/共41页2023/3/1721Information gain measureS:training setSi:training instances of class Ci(i=1,m)aj:values of attribute A(j=1,v)the information needed to correctly classify the training set issuppose attribute A is selected to partition the training
28、 set into the subsets S1,S2,Sv,then the entropy of A,i.e.the information needed to classify all the instances in those subsets iswhere Sij is the instances of class Ci that are covered by Sjthen the information gain of selecting A isthe bigger the information gain,the more relevant the attribute A第2
29、1页/共41页2023/3/1722Example of information gain(I)Target class:Graduate students(=120)Contrasting class:Undergraduate students(=130)gendermajorbirth_countryage_rangegpacountMFMFMFScience Science Engineering Science Science Engineering Canada Foreign Foreign Foreign Canada Canada 20-25 25-30 25-30 25-3
30、0 20-25 20-25 Very_good ExcellentExcellent Excellent Excellent Excellent 16221825 2118gendermajorbirth_countryage_rangegpacountMFMFMFScience Business Business Science Engineering Engineering Foreign Canada Canada Canada Foreign Canada20 20 2020-2520-2520 Very_goodFairFairFair Very_good Excellent 182
31、02224 2224第22页/共41页2023/3/1723Example of information gain(I)Target class:16221825 2118countGSGSGSGSGSGSVery_good ExcellentExcellent Excellent Excellent Excellent 20-25 25-30 25-30 25-30 20-25 20-25 Canada Foreign Foreign Foreign Canada Canada Science Science Engineering Science Science Engineering M
32、FMFMFclassgpaage_rangebirth_countrymajorgender18202224 2224SSSSSSVery_goodFairFairFair Very_good Excellent 20 20 2020-2520-2520 Foreign Canada Canada Canada Foreign CanadaScience Business Business Science Engineering Engineering MFMFMF第23页/共41页2023/3/1724Example of information gain(II)the informatio
33、n needed to correctly classify the training set is suppose attribute major is selected to partition the training setfor major=“Science”:S11=84,S21=42for major=“Engineer”:S12=36,S22=46for major=“Engineer”:S13=0,S23=42 then the entropy of major is第24页/共41页2023/3/1725Example of information gain(III)the
34、n the information gain of major is now suppose we use an attribute relevance threshold of 0.1:gender and birth_country are removed as weakly relevant attributesmajor,gpa,and age_range are kept as strong relevant attributes we can also get the information gain of other attributes:第25页/共41页2023/3/1726
35、Mining class comparisonsmost of the techniques developed for characterization can also be used in comparisondata generalization(including attribute removal and attribute generalization)should be performed synchronously among all the classes paring sales in China on Nov.9 with sales in USA in year 20
36、00 is almost meaningless however,the user can over-write such an synchronous comparison with his own choices e.g.the user may want to compare sales in Shanghai with sales in Vietnam第26页/共41页2023/3/1727How to mine comparisons?Data collectioncollect data for both target class and contrasting classRele
37、vance analysisuse attribute relevance analysis for analytical class comparisonSynchronous generalizationcontrolled by user specified dimension thresholdsDrill-down,roll-up and other OLAP operationsadjust the level of abstraction for the resulted descriptionPresentationas the same forms as that for c
38、haracterization,except the rule form第27页/共41页2023/3/1728Example of mining comparisonsPrime generalized relation for the target class:Graduate studentsPrime generalized relation for the contrasting class:Undergraduate studentsbirth_countryage_rangegpacount%CanadaCanadaCanadaOther20-2525-30Over_30 Ove
39、r_30 Good Good Very_good Excellent 5.532.32 5.86 4.68 birth_countryage_rangegpacount%CanadaCanadaCanadaOther15-2015-20 25-30 Over_30 Fair Good Good Excellent5.53 4.535.02 0.68第28页/共41页2023/3/1729Quantitative discriminant rulethe general form of a quantitative discriminant rule is:X,Target_class(X)co
40、ndition1(X)d:w1conditionn(X)d:wnwhere d-weight describes the discriminability of each disjunct in the rulediscriminant rule is sufficient condition of the target classexample:statusbirth_countryage_rangegpacountGraduateCanada25-30Good90UndergraduateCanada 25-30 Good210 X,graduate_student(X)birth_cou
41、ntry(X)=“Canada”age_range(X)=“25-30”gpa(X)=“good”d:30%第29页/共41页2023/3/1730Characteristic vs.discriminant ruleCharacteristic rulenecessary condition of the target classt_weight1+t_weightn=100Discriminant rulesufficient condition of the target classcondport1 d_weight1+condportn d_weightn=tclass_port w
42、here condporti is the portion of the instances covered by the i-th antecedents of the rule,tclass_port is the portion of the instances belong to the target class第30页/共41页2023/3/1731Quantitative description ruleQuantitative description rulequantitative description rule can also be expressed as a cros
43、stabexample:X,target_class(X)condtion1(X)t:w1,d:w1condtionn(X)t:wn,d:wn X,Europe(X)(item(X)=“TV”)t:25%,d:40%(item(X)=“computer”)t:75%,d:30第31页/共41页2023/3/1732Descriptive statistical measuresDescriptive statistical measuresMotivation:to better understand the data:central tendency,dispersiondispersion
44、:the degree to which numeric data tend to spreadmeasures for central tendency:meanmedianmodemidrangemeasures for dispersion:quartilesvariancestandard deviation第32页/共41页2023/3/1733Measures for central tendencyMeasures for central tendencyMedianmiddle value if odd number of values,or average of the mi
45、ddle two values otherwise(holistic)Modevalue that occurs most frequently in the data set(holistic)unimodal,bimodal,trimodal,multimodal,no modeMidrangethe average of the min and max values(algebraic)Mean weighted arithmetic mean(algebraic)第33页/共41页2023/3/1734Measures for dispersionMeasures for disper
46、sionQuartilesQuartiles(holistic):Q1(25th percentile),Q3(75th percentile)Inter-quartile range(holistic):IQR=Q3-Q1 Five number summary(holistic):min,Q1,median,Q3,max Outlier(holistic):a value higher or lower than 1.5IQRVariance (algebraic)standard deviation the square root of s2(algebraic)第34页/共41页202
47、3/3/1735Graph displays of statistical descriptions(I)Boxplotthe ends of the box are at the quartiles the box length is IQRthe median is marked by a line within the boxwhiskers outside the box extend to the min and max第35页/共41页2023/3/1736Graph displays of statistical descriptions(II)Histogrambar char
48、t is a special case of histogram,where each rectangle is of uniform width第36页/共41页2023/3/1737Graph displays of statistical descriptions(III)Quantile plotfor data xi sorted in increasing order,fi indicates that approximately 100 fi%of the data are smaller than or equal to the value xi第37页/共41页2023/3/
49、1738Graph displays of statistical descriptions(IV)Q-Q plot(Quantile-Quantile plot)the quantiles of one univariate distribution against the corresponding quantiles of another one第38页/共41页2023/3/1739Incremental and parallel mining ofIncremental and parallel mining ofconcept descriptionconcept descript
50、ion Given the huge amounts of data in a database,it is highly preferred to update data mining results incrementally rather than mining from scratch on each database updatesIncremental miningrevision based on newly added data DBgeneralize DB to the same level of abstraction in the generalized relatio