《数据挖掘6学习.pptx》由会员分享,可在线阅读,更多相关《数据挖掘6学习.pptx(46页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。
1、2023/3/251What is association rule mining?What is association rule mining?Association rule mining searches for interesting relationships among items in a given data set association rule mining is motivated for market basket analysis,where the customer buying habits are analyzed by finding associatio
2、ns between the different items that customers place in their shopping baskets association rules are in the form of:BodyHead support,confidenceexample:buys(X,“diapers”)buys(X,“beers”)0.5%,60%major(X,“CS”)takes(X,“DB”)grade(X,“A”)1%,75%第1页/共46页2023/3/252Formal definition of association ruleI=i1,i2,im
3、is an itemset,i.e.a set of itemsD is the task relevant data,which is a set of transactions where each transaction T is an itemset such that T Iif A is an itemset,we say T contains A iff A T an association rule is of the form AB,where A I,B I,and A B=the rule holds in D with support(AB)=Prob(AB)confi
4、dence(AB)=Prob(B|A)rules satisfying both min_sup and min_conf are strongan itemset contains k items is a k-itemsetthe number of transactions that contain an itemset is the frequency(or support count)of the itemsetif an itemset satisfies min_sup,then it is a frequent itemset(or large itemset)第2页/共46页
5、2023/3/253Categories of association rulesCategories of association rulesBased on the types of valuesBoolean association rulebuys(X,“diapers”)buys(X,“beers”)0.5%,60%quantitative association ruleage(X,“30-39”)income(X,“42-48K”)buys(X,“computer”)1%,75%Based on the dimensions of datasingle-dimensional a
6、ssociation rulebuys(X,“diapers”)buys(X,“beers”)0.5%,60%multidimensional association ruleage(X,“30-39”)income(X,“42-48K”)buys(X,“computer”)1%,75%Based on the levels of abstractionssingle-level association ruleage(X,“30-34”)buys(X,“computer”)1%,75%multilevel association ruleage(X,“30-32”)buys(X,“lapto
7、p computer”)0.5%,80%age(X,“30-34”)buys(X,“computer”)1%,75%第3页/共46页2023/3/254Two-step process of association rule miningstep1:find all frequent itemsetsstep2:generate strong association rules from the frequent itemsets step2 is easier,therefore most works on association rule mining focus on step1 a s
8、olution for step2:for each frequent itemset l,generate all non-empty subsets of l for every non-empty subset s of l,output the rule“s (l-s)”if第4页/共46页2023/3/255Apriori(I)Apriori(I)in initial is for finding frequent itemsets for Boolean association rules Agrawal&Srikant,VLDB94level-wise search,where
9、k-itemsets are used to explore(k+1)-itemsetsthe key of Apriori is the a priori knowledge:all non-empty subsets of a frequent itemset must also be frequentfrom Lk to Lk+1:join:a set of candidate frequent(k+1)-itemsets,Ck+1,is generated by joining Lk with itself(the set of frequent k-itemsets is denot
10、ed Lk)Lk Lk=A B|A,BLk,|AB|=k-1prune:a scan of the database for determining the count of each candidate in Ck+1 would result in the determination of Lk+1 a priori knowledge is used to reduce the size of Ck+1第5页/共46页2023/3/256Apriori(II)Apriori(II)an example Suppose min_sup_count=2Database DI1,I2,I5I2
11、,I3,I4I3,I4I1,I2,I3,I4T100T200T300T400ItemsTIDScan D23331I1I2I3I4I5suppitemsetC12333I1I2I3I4suppitemsetL1I1,I2I1,I3I1,I4I2,I3I2,I4I3,I4suppitemsetC2Scan D2223I1,I2I2,I3I2,I4I3,I4suppitemsetL2I2,I3,I4suppitemsetC3Scan D2I2,I3,I4suppitemsetL32112232第6页/共46页2023/3/257Apriori(III)Apriori(III)Input:D:dat
12、abase of transactions;min_sup:minimum support thresholdOutput:L:frequent itemsets in DMethod:L1=find_frequent_1-itemsets(D);for(k=1;Lk ;k+)Ck+1=apriori_gen(Lk,min_sup);/generate candidate frequent(k+1)-itemset for each transaction t D /scan D for counting Ct=subset(Ck+1,t);/get the subsets of t that
13、 are candidates for each candidate c Ct c.count+Lk+1=c Ck+1|c.count min_supreturn L=kLk;第7页/共46页2023/3/258Apriori(IV)Apriori(IV)generate candidate frequent(k+1)-itemsetprocedure apriori_gen(Lk:frequent k-itemsets;min_sup:minimum support)for each itemset l1 Lk for each itemset l2 Lk if(l11=l21)(l12=l
14、22)(l1k-1=l2k-1)(l1k=minsup求得该段局部频繁项集;求得该段局部频繁项集;将各段的局部频繁项集作为候选项集,扫描数据库,求出将各段的局部频繁项集作为候选项集,扫描数据库,求出全局频繁项集。全局频繁项集。特点:仅对数据库扫描两遍特点:仅对数据库扫描两遍问题:如何分段?问题:如何分段?大大内存无法处理内存无法处理小小局部频繁项集太多局部频繁项集太多 In Proceedings of the 1995 International Conference on Very Large Database第13页/共46页2023/3/2514Hash-based itemset c
15、ountingHash-based itemset counting思想:由于频繁项集的主要计算是在生成思想:由于频繁项集的主要计算是在生成2-频繁项集频繁项集L2上,故引入上,故引入Hash技术改进产生技术改进产生2-频繁项集的方法。频繁项集的方法。例:例:Hash函数函数(10 x+y)mod 7 Park proposed in 1995TID Items 1 I1,I2,I5 2 I2,I4 3 I2,I3 4 I1,I2,I4 5 I1,I3 6 I2,I3 7 I1,I3 8 I1,I2,I3,I5 9 I1,I2,I3桶地址桶地址 桶计数桶计数 桶内容桶内容 1 2 I1,I4,
16、I3,I5 2 2 I1,I5,I1,I5 3 4 I2,I3,I2,I3,I2,I3,I2,I3 4 2 I2,I4,I2,I4 5 4 I1,I2,I1,I2,I1,I2,I1,I2 6 4 I1,I3,I1,I3,I1,I3,I1,I3第14页/共46页2023/3/2515Techniques for improving AprioriTechniques for improving AprioriHash-based itemset countinga k-itemset whose corresponding hashing bucket count is below the th
17、reshold cannot be frequentTransaction reductiona transaction that does not contain any frequent k-itemset is useless in subsequent scansPartitioningany itemset that is potentially frequent in a database must be frequent in at least one of the partitions of the databaseSamplingmining on a sampled sub
18、set of database with a lower support threshold,and using a method to determine the completenessDynamic itemset countingadding new candidate itemsets when all of their subsets are estimated to be frequent第15页/共46页2023/3/2516FP-Growth(I)FP-Growth(I)It adopts divide-and-conquer strategy as follows:comp
19、ress the database representing frequent items into a frequent-pattern tree,or FP-Tree,but retain the itemset association information,and then divide such a compressed database into a set of conditional databases(a special kind of projected database),such associated with one frequent item,and mine ea
20、ch such database separately.In Proceedings of the 2000 SIGMOD,by Han Jiawei,et al.第16页/共46页2023/3/2517FP-Growth(II)FP-Growth(II)Database DI1,I2,I5I2,I3,I4I3,I4I1,I2,I3,I4T100T200T300T400ItemsTID an example Suppose min_sup_count=2I2,I1I2,I3,I4I3,I4I2,I3,I4,I1T100T200T300T400ItemsTIDSort on supportNod
21、e linkitemrootI2:1I1:1I1I2第17页/共46页2023/3/2518FP-Growth(II)FP-Growth(II)Database DI1,I2,I5I2,I3,I4I3,I4I1,I2,I3,I4T100T200T300T400ItemsTID an example Suppose min_sup_count=2I2,I1I2,I3,I4I3,I4I2,I3,I4,I1T100T200T300T400ItemsTIDSort on supportNode linkitemrootI2:2I1:1I1I2I3:1I4:1I4I3I3:1I4:1第18页/共46页2
22、023/3/2519FP-Growth(II)FP-Growth(II)an example Suppose min_sup_count=2Database DI1,I2,I5I2,I3,I4I3,I4I1,I2,I3,I4T100T200T300T400ItemsTIDI2,I1I2,I3,I4I3,I4I2,I3,I4,I1T100T200T300T400ItemsTIDSort on supportNode linkitemrootI2:3I1:1I1I2I3:2I4:2I4I3I3:1I4:1I1:1第19页/共46页2023/3/2520FP-Growth(II)FP-Growth(
23、II)an example Suppose min_sup_count=2Node linkitemrootI2:3I1:1I1I2I3:2I4:2I4I3I3:1I4:1I1:1 item conditional pattern base I4 I2I3:1 I3 I2:1 I1 I2:2,I2I3I4:1 I2 Empty 第20页/共46页2023/3/2521Multilevel association ruleMultilevel association ruleRules generated from association rule mining with concept hie
24、rarchies are called multilevel association rulesstrong associations discovered at very high concept levels may represent common sense knowledgehowever,common sense knowledge to one person may be novel to another person第21页/共46页2023/3/2522Mining multilevel association rulesMining multilevel associati
25、on rulesA top-down,progressive deepening approach:Find high-level strong rules e.g.buys(X,“computer”)buys(X,“printer”)20%,60%Then find their lower-level“weaker”rules e.g.buys(X,“IBM computer”)buys(X,“HP printer”)6%,50%Variations:uniform support reduced support第22页/共46页2023/3/2523Mining multilevel as
26、sociation rulesMining multilevel association ruleswith uniform supportwith uniform supportDisadvantage:it is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstractionif the min_sup threshold is too high,interesting low level associations wil
27、l be missedif the min_sup threshold is too low,uninteresting high level associations will be generateduse a same min_sup for all levelsan exampleComputer support=10%Laptop computer support=6%home computer support=4%Level 1 min_sup=5%Level 2 min_sup=5%第23页/共46页2023/3/2524Mining multilevel association
28、 rulesMining multilevel association ruleswith reduced support(I)with reduced support(I)Level-by-level independent each node is examined,regardless of whether or not its parent node is found to be frequentan exampleComputer support=10%Laptop computer support=6%home computer support=4%Level 1 min_sup=
29、5%Level 2 min_sup=3%Advantage:wont miss any interesting associationsDisadvantage:examine numerous infrequent items at low levels,find associations between items of little importance第24页/共46页2023/3/2525Mining multilevel association rulesMining multilevel association ruleswith reduced support(II)with
30、reduced support(II)Level-cross filtering by k-itemsetk-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the(i-1)-th level is frequentan exampleComputer&printer support=10%Level 1 min_sup=5%Level 2 min_sup=2%Advantage:examine only the children of frequent k-i
31、temsets Disadvantage:many valuable patterns may be filtered outLaptop computer&b/w printersupport=1%Laptop computer&color printersupport=2%home computer&b/w printersupport=1%home computer&color printersupport=3%第25页/共46页2023/3/2526Mining multilevel association rulesMining multilevel association rule
32、swith reduced support(III)with reduced support(III)Level-cross filtering by single itema node at the i-th level is examined if and only if its parent node at the(i-1)-th level is frequentan exampleLevel 1 min_sup=12%Level 2 min_sup=3%Advantage:compromise between level-by-level independent and level-
33、cross filtering by k-itemsetDisadvantage:may miss associations between low level items that are frequent based on a reduced min_sup,but whose parent node are not frequentComputer support=10%Laptop computer(not examined)home computer(not examined)第26页/共46页2023/3/2527Mining multilevel association rule
34、sMining multilevel association ruleswith reduced support(IV)with reduced support(IV)Controlled level-cross filtering by single itema variant of level-cross filtering by single item.Each level has a level passage threshold which is used for passing down sub-frequent items to lower levelsan exampleLev
35、el 1 min_sup=12%Level-passage-sup=8%Level 2 min_sup=3%Advantage:allow the children of items that do not satisfy min_sup to be examined if these items satisfy the level passage thresholdDisadvantage:the setting of the level passage threshold is trickyComputer support=10%Laptop computer support=6%home
36、 computer support=4%第27页/共46页2023/3/2528Cross-level association ruleCross-level association ruleRules whose items do not belong to a same concept level are called cross-level association rules e.g.buys(X,“computer”)buys(X,“HP printer”)4%,70%Mining cross-level association rule:If mining associations
37、from concept levels i and j,where level j is more specific than i,then the reduced minimum support threshold of level j should be used overall第28页/共46页2023/3/2529Redundancy filteringRedundancy filteringwhen multilevel association rules are mined,some of the rules found may be redundant due to“ancest
38、or”relationships between itemsa rule R1 is an ancestor of a rule R2 if R1 can be obtained by replacing the items in R2 by their ancestors in a concept hierarchya rule can be considered redundant if its support and confidence are close to their“expected”values based on its ancestorexample:buys(X,“com
39、puter”)buys(X,“HP printer”)8%,70%buys(X,“IBM computer”)buys(X,“HP printer”)4%,72%if the sales of IBM computers is half of that of computers,then the second rule is redundant because it does not offer any additional information第29页/共46页2023/3/2530Multidimensional association ruleMultidimensional asso
40、ciation ruleSingle-dimensional association rulealso called intra-dimensional association rulean association rule contains a single predicate with multiple occurrences e.g.buys(X,“IBM computer”)buys(X,“printer”)Multidimensional association rulean association rule contains two or more predicates or di
41、mensionsinter-dimension association rule multidimensional association rule without repeated predicates e.g.age(X,“19-24”)occupation(X,“student”)buys(X,“computer”)hybrid-dimension association rule multidimensional association rule with repeated predicates e.g.age(X,“19-24”)buys(X,“computer”)buys(X,“p
42、rinter”)第30页/共46页2023/3/2531Mining multidimensional association ruleMultidimensional association rule mining searches for frequent predicatesets instead of frequent itemsetsk-predicateset is a set containing k conjunctive predicates multidimensional association rule mining techniques can be categori
43、zed by how quantitative attributes are treated:using static discretization of quantitative attributes quantitative attributes are statically discretized using predefined concept hierarchiesquantitative association rules quantitative attributes are dynamically discretized into“bins”based on the distr
44、ibution of the datadistance-based association rules quantitative attributes are clustered to capture the semantic meaning of data第31页/共46页2023/3/2532Static discretization of quantitativeStatic discretization of quantitativeAttributesAttributesDiscretization prior to mining using concept hierarchynum
45、eric values are replaced by rangesin relational database,finding all frequent k-predicatesets will require k or(k+1)scansevery subset of frequent predicatesets must also be frequentdata cube is well suited for miningthe cells of an n-dimensional cuboid correspond to the predicatesetsmining from data
46、 cubes can be more faster第32页/共46页2023/3/2533Quantitative association rule(I)Quantitative association rule(I)Quantitative attributes are dynamically discretized during the mining process so as to satisfy some mining criteriaa typical procedure(used in ARCS)map pairs of quantitative attributes onto a
47、 2-D grid for tuples satisfying a given categorical attribute condition.Then search the grid for clusters of points,from which the association rules are generatedstep1:binning partition the ranges of quantitative attributes into intervals that may be combined during the mining processstep2:finding f
48、requent predicatesets scan the grid to find frequent predicatesets that satisfy min_supp,then generate association rules from those predicatesetsstep3:clustering association rules map the strong association rules to the grid,and cluster the rules based on proximity第33页/共46页2023/3/2534Quantitative as
49、sociation rule(II)Quantitative association rule(II)age(X,“34”)income(X,“30-40K”)buys(X,“TV”)age(X,“35”)income(X,“30-40K”)buys(X,“TV”)age(X,“34”)income(X,“40-50K”)buys(X,“TV”)age(X,“35”)income(X,“40-50K”)buys(X,“TV”)age(X,“34-35”)income(X,“30-50K”)buys(X,“TV”)第34页/共46页2023/3/2535Distance-based associ
50、ation ruleDistance-based association rulebinning methods do not capture the semantics of interval datadistance-based disretization is more meaningful because it considers the density or number of points in an interval the“closeness”of points in an intervalan examplePrice($)Equi-width(width$10)Equi-d