《数据挖掘关联规则.ppt》由会员分享,可在线阅读,更多相关《数据挖掘关联规则.ppt(78页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。
1、Chapter 4:Mining Frequent Patterns,Association and CorrelationsnBasic concepts and a road mapnScalable frequent itemset mining methodsnMining various kinds of association rulesnConstraint-based association miningnFrom association to correlation analysisnMining colossal patternsnSummary2023/3/121Data
2、 Mining:Concepts and TechniquesWhat Is Frequent Pattern Analysis?nFrequent pattern:a pattern(a set of items,subsequences,substructures,etc.)that occurs frequently in a data set nFirst proposed by Agrawal,Imielinski,and Swami AIS93 in the context of frequent itemsets and association rule miningnMotiv
3、ation:Finding inherent regularities in datanWhat products were often purchased together?Beer and diapers?!nWhat are the subsequent purchases after buying a PC?nWhat kinds of DNA are sensitive to this new drug?nCan we automatically classify web documents?nApplicationsnBasket data analysis,cross-marke
4、ting,catalog design,sale campaign analysis,Web log(click stream)analysis,and DNA sequence analysis.2023/3/122Data Mining:Concepts and Techniques关联规则挖掘n关联规则挖掘的典型案例:购物篮问题n在商场中拥有大量的商品(项目),如:牛奶、面包等,客户将所购买的商品放入到自己的购物篮中。n通过发现顾客放入购物篮中的不同商品之间的联系,分析顾客的购买习惯n哪些物品经常被顾客购买?n同一次购买中,哪些商品经常会被一起购买?n一般用户的购买过程中是否存在一定的购
5、买时间序列?n具体应用:利润最大化n商品货架设计:更加适合客户的购物路径n货存安排:实现超市的零库存管理n用户分类:提供个性化的服务2023/3/123Data Mining:Concepts and Techniques关联规则挖掘简单的说,关联规则挖掘就是发现大量数据中项集之间有趣的关联在交易数据、关系数据或其他信息载体中,查找存在于项目集合或对象集合之间的频繁模式、关联、相关性、或因果结构。应用 购物篮分析、交叉销售、产品目录设计、聚集、分类等两种策略:1。商品放近,增加销量2。商品放远,增加其他商品的销量2023/3/124Data Mining:Concepts and Techni
6、quesWhy Is Freq.Pattern Mining Important?nFreq.pattern:An intrinsic and important property of datasets nFoundation for many essential data mining tasksnAssociation,correlation,and causality analysisnSequential,structural(e.g.,sub-graph)patternsnPattern analysis in spatiotemporal,multimedia,time-seri
7、es,and stream data nClassification:discriminative,frequent pattern analysisnCluster analysis:frequent pattern-based clusteringnData warehousing:iceberg cube and cube-gradient nSemantic data compression:fasciclesnBroad applications2023/3/125Data Mining:Concepts and Techniques关联规则挖掘形式化定义给定:设I=iI=i1 1,
8、i,i2 2,i im m 是项(item)(item)的的集合。若干项的集合,称为项集(Item SetsItem Sets)记D D为交易(transaction)T(transaction)T(或或事务)的集合,这里交易T T 是项的集合,并且T T I I。对应每一个交易有唯一的标识,如交易号,记作TID TID。设X X是一个I I中项的集合,如果X X T T,那么称交易T T包含X X。寻找:有趣的关联规则(强规则).2023/3/126Data Mining:Concepts and Techniques关联规则n所有形如X X Y Y 蕴涵式的称为关联规则,这里X X I,Y
9、 I,Y I I,并且X XY=Y=。n关联规则是有趣的,如果它满足最小支持度阈值与最小置信度阈值,并称之为强规则2023/3/127Data Mining:Concepts and Techniquesconfidence and supportnItemset X=i1,iknFind all the rules X Y with min confidence and supportCustomerbuys diaperCustomerbuys bothCustomerbuys beersupportsupport,s s,probabilityprobability that a tha
10、t a transaction contains Xtransaction contains XY Y support(X support(X support(X support(XY)=Y)=Y)=Y)=同时包含项目集同时包含项目集同时包含项目集同时包含项目集X X X X和和和和Y Y Y Y的交的交的交的交易数易数易数易数/总交易数总交易数总交易数总交易数 用于描述有用性用于描述有用性用于描述有用性用于描述有用性.confidenceconfidence,c,c,conditional probabilityconditional probability that a transacti
11、on having X also that a transaction having X also contains contains Y Y.confidence(Xconfidence(Xconfidence(Xconfidence(XY)=Y)=Y)=Y)=同时购买商品同时购买商品同时购买商品同时购买商品X X X X和和和和Y Y Y Y的交的交的交的交易数易数易数易数/购买商品购买商品购买商品购买商品X X X X的交易数的交易数的交易数的交易数用于描述确定性用于描述确定性,即即”值得信赖的程度值得信赖的程度”可靠性可靠性”2023/3/128Data Mining:Concepts
12、 and TechniquesMining Association Rulesan ExampleFor rule A C:support=support(AC)=50%confidence=support(AC)/support(A)=66.6%Min.support 50%Min.confidence 50%Transaction-idItems bought10A,B,C20A,C30A,D40B,E,FFrequent patternSupportA75%B50%C50%A,C50%2023/3/129Data Mining:Concepts and Techniques关联规则的基本
13、形式n关联规则的基本形式:前提条件结论 支持度,置信度 nbuys(x,buys(x,“diapersdiapers”)buys(x,buys(x,“beersbeers”)0.5%,)0.5%,60%60%nmajor(x,major(x,“CSCS”)takes(x,)takes(x,“DBDB”)grade(x,grade(x,“A A”)1%,75%)1%,75%n包含k k个项目的集合,称为k-k-项集n项集的出现频率是包含项集的事务个数,称为项集的频率、支持计数或者计数2023/3/1210Data Mining:Concepts and TechniquesBasic Conce
14、pts:Frequent Patternsnitemset:A set of one or more itemsnk-itemset X=x1,xkn(absolute)support,or,support count of X:Frequency or occurrence of an itemset Xn(relative)support,s,is the fraction of transactions that contains X(i.e.,the probability that a transaction contains X)nAn itemset X is frequent
15、if Xs support is no less than a minsup thresholdCustomerbuys diaperCustomerbuys bothCustomerbuys beerTidItems bought10Beer,Nuts,Diaper20Beer,Coffee,Diaper30Beer,Diaper,Eggs40Nuts,Eggs,Milk50Nuts,Coffee,Diaper,Eggs,Milk2023/3/1211Data Mining:Concepts and TechniquesBasic Concepts:Association RulesnFin
16、d all the rules X Y with minimum support and confidencensupport,s,probability that a transaction contains X Ynconfidence,c,conditional probability that a transaction having X also contains YLet minsup=50%,minconf=50%Freq.Pat.:Beer:3,Nuts:3,Diaper:4,Eggs:3,Beer,Diaper:3Customerbuys diaperCustomerbuys
17、 bothCustomerbuys beerNuts,Eggs,Milk40Nuts,Coffee,Diaper,Eggs,Milk50Beer,Diaper,Eggs30Beer,Coffee,Diaper20Beer,Nuts,Diaper10Items boughtTidnAssociation rules:(many more!)nBeer Diaper (60%,100%)nDiaper Beer (60%,75%)2023/3/1212Data Mining:Concepts and TechniquesClosed Patterns and Max-PatternsnA long
18、 pattern contains a combinatorial number of sub-patterns,e.g.,a1,a100 contains(1001)+(1002)+(110000)=2100 1=1.27*1030 sub-patterns!nSolution:Mine closed patterns and max-patterns insteadnAn itemset X is closed if X is frequent and there exists no super-pattern Y X,with the same support as X(proposed
19、 by Pasquier,et al.ICDT99)nAn itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y X(proposed by Bayardo SIGMOD98)nClosed pattern is a lossless compression of freq.patternsnReducing the#of patterns and rules2023/3/1213Data Mining:Concepts and TechniquesClosed Patt
20、erns and Max-PatternsnExercise.DB=,nMin_sup=1.nWhat is the set of closed itemset?n:1n:2nWhat is the set of max-pattern?n:1nWhat is the set of all patterns?n!2023/3/1214Data Mining:Concepts and TechniquesComputational Complexity of Frequent Itemset MiningnHow many itemsets are potentially to be gener
21、ated in the worst case?nThe number of frequent itemsets to be generated is senstive to the minsup thresholdnWhen minsup is low,there exist potentially an exponential number of frequent itemsetsnThe worst case:MN where M:#distinct items,and N:max length of transactionsnThe worst case complexty vs.the
22、 expected probabilitynEx.Suppose Walmart has 104 kinds of products nThe chance to pick up one product 10-4nThe chance to pick up a particular set of 10 products:10-40nWhat is the chance this particular set of 10 products to be frequent 103 times in 109 transactions?2023/3/1215Data Mining:Concepts an
23、d TechniquesChapter 4:Mining Frequent Patterns,Association and CorrelationsnBasic concepts and a road mapnScalable frequent itemset mining methodsnMining various kinds of association rulesnConstraint-based association miningnFrom association to correlation analysisnMining colossal patternsnSummary20
24、23/3/1216Data Mining:Concepts and Techniques使用Apriori方法挖掘关联规则n频繁项集的定义 如果项集满足最小支持度,则称之为频繁项集 (高频项集)n频繁项集的基本特征 任何频繁项集的非空子集均为频繁项集。例如:ABCABC是频繁项集,则ABAB、ACAC、BCBC均为频繁项集。反之:如ABAB不是不是频繁项集,则ABCABC不可能不可能是频繁项集2023/3/1217Data Mining:Concepts and TechniquesApriori方法n是一种称作逐层搜索的迭代方法。n用k-项集探求(k+1)-项集。n具体地:首先找出频繁1-项
25、集,该集合记为L1;用L1找出频繁2-项集的集合L2;如此继续下去,直到找到最大频繁项集n该方法,主要有连接和剪枝两步构成。2023/3/1218Data Mining:Concepts and TechniquesThe Apriori AlgorithmAn ExampleDatabase TDB1st scanC1L1L2C2C22nd scanC3L33rd scanTidItems10A,C,D20B,C,E30A,B,C,E40B,EItemsetsupA2B3C3D1E3ItemsetsupA2B3C3E3ItemsetA,BA,CA,EB,CB,EC,EItemsetsupA,
26、B1A,C2A,E1B,C2B,E3C,E2ItemsetsupA,C2B,C2B,E3C,E2ItemsetB,C,EItemsetsupB,C,E22023/3/1219Data Mining:Concepts and Techniques由频繁项集产生关联规则n两步n对每一个频繁项集u,产生u的所有非空真子集n对u的每一个非空真子集s,若support_count(u)/support_count(s)=min_conf则输出:s(u-s)n例如:频繁项集u=B,C,E,产生所有非空真子集6个,对应有6个可能的规则,分别计算每一条规则的confidenceconfidence2023/3
27、/1220Data Mining:Concepts and TechniquesApriori:A Candidate Generation&Test ApproachnApriori pruning principle:If there is any itemset which is infrequent,its superset should not be generated/tested!(Agrawal&Srikant VLDB94,Mannila,et al.KDD 94)nMethod:nInitially,scan DB once to get frequent 1-itemse
28、tnGenerate length(k+1)candidate itemsets from length k frequent itemsetsnTest the candidates against DBnTerminate when no frequent or candidate set can be generated2023/3/1221Data Mining:Concepts and TechniquesThe Apriori AlgorithmAn Example Database TDB1st scanC1L1L2C2C22nd scanC3L33rd scanTidItems
29、10A,C,D20B,C,E30A,B,C,E40B,EItemsetsupA2B3C3D1E3ItemsetsupA2B3C3E3ItemsetA,BA,CA,EB,CB,EC,EItemsetsupA,B1A,C2A,E1B,C2B,E3C,E2ItemsetsupA,C2B,C2B,E3C,E2ItemsetB,C,EItemsetsupB,C,E2Supmin=22023/3/1222Data Mining:Concepts and TechniquesThe Apriori Algorithm(Pseudo-Code)Ck:Candidate itemset of size kLk:
30、frequent itemset of size kL1=frequent items;for(k=1;Lk!=;k+)do begin Ck+1=candidates generated from Lk;for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 =candidates in Ck+1 with min_support endreturn k Lk;2023/3/1223Data Mining:Concepts
31、and TechniquesImplementation of ApriorinHow to generate candidates?nStep 1:self-joining LknStep 2:pruningnExample of Candidate-generationnL3=abc,abd,acd,ace,bcdnSelf-joining:L3*L3nabcd from abc and abdnacde from acd and acenPruning:nacde is removed because ade is not in L3nC4=abcd2023/3/1224Data Min
32、ing:Concepts and TechniquesCandidate Generation:An SQL ImplementationnSQL Implementation of candidate generationnSuppose the items in Lk-1 are listed in an ordernStep 1:self-joining Lk-1 insert into Ckselect p.item1,p.item2,p.itemk-1,q.itemk-1from Lk-1 p,Lk-1 qwhere p.item1=q.item1,p.itemk-2=q.itemk
33、-2,p.itemk-1 Wage:mean=$7/hr(overall mean=$9)2023/3/1257Data Mining:Concepts and TechniquesStatic Discretization of Quantitative AttributesnDiscretized prior to mining using concept hierarchy.nNumeric values are replaced by rangesnIn relational database,finding all frequent k-predicate sets will req
34、uire k or k+1 table scansnData cube is well suited for miningnThe cells of an n-dimensional cuboid correspond to the predicate setsnMining from data cubescan be much faster(income)(age)()(buys)(age,income)(age,buys)(income,buys)(age,income,buys)2023/3/1258Data Mining:Concepts and TechniquesQuantitat
35、ive Association RulesnProposed by Lent,Swami and Widom ICDE97nNumeric attributes are dynamically discretizednSuch that the confidence or compactness of the rules mined is maximizedn2-D quantitative association rules:Aquan1 Aquan2 AcatnCluster adjacent association rules to form general rules using a
36、2-D gridnExampleage(X,“34-35”)income(X,“30-50K”)buys(X,“high resolution TV”)2023/3/1259Data Mining:Concepts and TechniquesMining Other Interesting PatternsnFlexible support constraints(Wang,et al.VLDB02)nSome items(e.g.,diamond)may occur rarely but are valuable nCustomized supmin specification and a
37、pplicationnTop-K closed frequent patterns(Han,et al.ICDM02)nHard to specify supmin,but top-k with lengthmin is more desirablenDynamically raise supmin in FP-tree construction and mining,and select most promising path to mine2023/3/1260Data Mining:Concepts and TechniquesChapter 4:Mining Frequent Patt
38、erns,Association and CorrelationsnBasic concepts and a road mapnEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnConstraint-based association miningnMining colossal patternsnSummary2023/3/1261Data Mining:
39、Concepts and Techniques兴趣度的度量兴趣度的度量n客观度量n两个最为流行的度量:支持度和置信度(support and confidence)n主观度量(Silberschatz&Tuzhilin,KDD95)n一个规则(模式)是感兴趣的,如果n没有想到的(用户感到惊讶的);n可操作的(用户在得到结果后,可以在此之上做些什么)2023/3/1262Data Mining:Concepts and Techniques支持度支持度-置信度方法的不足置信度方法的不足nExample 1:(Aggarwal&Yu,PODS98)5000 个学生中n3000 喜欢打篮球n3750
40、 喜欢吃米饭n2000 同时喜欢打篮球和吃米饭n关联规则:play basketball eat cereal 40%,66.7%n该规则具有欺骗性,因为从整个学生情况来看,有75%的学生喜欢吃米饭,大大高于66.7%。n关联规则:play basketball not eat cereal 20%,33.3%n该规则虽然拥有较低的支持度和置信度,但是比较精确。BasketballNot basketballSum(row)Cereal200017503750Not cereal10002501250Sum(col.)3000200050002023/3/1263Data Mining:Con
41、cepts and Techniques提升:一种兴趣度的度量提升:一种兴趣度的度量n设A,B是两个项集,nP(AB)=P(B)*P(A),A 和B 是独立事件nA,B的相关性度量n取值小于1,A and B 负相关n取值大于1,A and B 正相关n如:上例corr=0.89nP(B|A)/P(B)称为规则A=B的“提升”n挖掘相关性关联规则,nX2检验是否有统计意义2023/3/1264Data Mining:Concepts and TechniquesInterestingness Measure:Correlations(Lift)nplay basketball eat cere
42、al 40%,66.7%is misleadingnThe overall%of students eating cereal is 75%66.7%.nplay basketball not eat cereal 20%,33.3%is more accurate,although with lower support and confidencenMeasure of dependent/correlated events:liftBasketballNot basketballSum(row)Cereal200017503750Not cereal10002501250Sum(col.)
43、3000200050002023/3/1265Data Mining:Concepts and TechniquesChapter 4:Mining Frequent Patterns,Association and CorrelationsnBasic concepts and a road mapnEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnCon
44、straint-based association miningnMining colossal patternsnSummary2023/3/1266Data Mining:Concepts and Techniques基于约束的挖掘基于约束的挖掘n使用约束的必要性 产生的多数规则是用户不感兴趣的,应在用户提供的各种约束的指导下进行挖掘n在数据挖掘中常使用的几种约束:n知识类型约束:指定要挖掘的知识类型,如关联知识n数据约束:指定与任务相关的数据集nFind product pairs sold together in Vancouverin Dec.98.n维/层次约束:指定所用的维或概念
45、结构中的层nin relevance to region,price,brand,customer category.n规则约束:指定要挖掘的规则形式(如规则模板)n单价(price$200).n兴趣度约束:指定规则兴趣度阈值或统计度量n如(min_support 3%,min_confidence 60%).2023/3/1267Data Mining:Concepts and TechniquesConstraint-based(Query-Directed)MiningnFinding all the patterns in a database autonomously?unreali
46、stic!nThe patterns could be too many but not focused!nData mining should be an interactive process nUser directs what to be mined using a data mining query language(or a graphical user interface)nConstraint-based miningnUser flexibility:provides constraints on what to be minednSystem optimization:ex
47、plores such constraints for efficient mining constraint-based mining:constraint-pushing,similar to push selection first in DB query processingnNote:still find all the answers satisfying constraints,not finding some answers in“heuristic search”2023/3/1268Data Mining:Concepts and TechniquesConstraints
48、 in Data MiningnKnowledge type constraint:nclassification,association,etc.nData constraint using SQL-like queries nfind product pairs sold together in stores in Chicago in Dec.02nDimension/level constraintnin relevance to region,price,brand,customer categorynRule(or pattern)constraintnsmall sales(pr
49、ice$200)nInterestingness constraintnstrong rules:min_support 3%,min_confidence 60%2023/3/1269Data Mining:Concepts and TechniquesConstraint-Based Frequent Pattern MiningnClassification of constraints based on their constraint-pushing capabilitiesnAnti-monotonic:If constraint c is violated,its further
50、 mining can be terminatednMonotonic:If c is satisfied,no need to check c againnData anti-monotonic:If a transaction t does not satisfy c,t can be pruned from its further miningnSuccinct:c must be satisfied,so one can start with the data sets satisfying cnConvertible:c is not monotonic nor anti-monot