©Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

A Tutorial on Support Vector Machines for Pattern Recognition

CHRISTOPHER J. C. BURGES
Bell Laboratories, Lucent Technologies

Abstract. The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

Keywords: Support Vector Machines, Statistical Learning Theory, VC Dimension, Pattern Recognition

Appeared in: Data Mining and Knowledge Discovery 2, 121-167, 1998

1. Introduction

The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995; Vapnik, 1998) contain excellent descriptions of SVMs, but they leave room for an account whose purpose from the start is to teach. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time appears suitable for an introductory review. The tutorial dwells entirely on the pattern recognition problem. Many of the ideas there carry directly over to the cases of regression estimation and linear operator inversion, but space constraints precluded the exploration of these topics here.
The tutorial contains some new material. All of the proofs are my own versions, where I have placed a strong emphasis on their being both clear and self-contained, to make the material as accessible as possible. This was done at the expense of some elegance and generality: however generality is usually easily added once the basic ideas are clear. The longer proofs are collected in the Appendix.

By way of motivation, and to alert the reader to some of the literature, we summarize some recent applications and extensions of support vector machines. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition (Cortes and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1996; Burges and Schölkopf, 1997), object recognition (Blanz et al., 1996), speaker identification (Schmidt, 1996), charmed quark detection¹, face detection in images (Osuna, Freund and Girosi, 1997a), and text categorization (Joachims, 1997). For the regression estimation case, SVMs have been compared on benchmark time series prediction tests (Müller et al., 1997; Mukherjee, Osuna and Girosi, 1997), the Boston housing problem (Drucker et al., 1997), and (on artificial data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied.
Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges and Vapnik, 1996; Schölkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al., 1998c). The reader may also find the thesis of (Schölkopf, 1997) helpful.
The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).

In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices. Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed².
2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine

There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance³. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly, as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.

The notation here will largely follow that of (Vapnik, 1995). Suppose we are given l observations. Each observation consists of a pair: a vector x_i ∈ R^n, i = 1, ..., l and the associated "truth" y_i, given to us by a trusted source. In the tree recognition problem, x_i might be a vector of pixel values (e.g. n = 256 for a 16x16 image), and y_i would be 1 if the image contains a tree, and -1 otherwise (we use -1 here rather than 0 to simplify subsequent formulae). Now it is assumed that there exists some unknown probability distribution P(x, y) from which these data are drawn, i.e., the data are assumed "iid" (independently drawn and identically distributed). (We will use P for cumulative probability distributions, and p for their densities.) Note that this assumption is more general than associating a fixed y with every x: it allows there to be a distribution of y for a given x. In that case, the trusted source would assign labels y_i according to a fixed distribution, conditional on x_i. However, after this Section, we will be assuming fixed y for given x.
Now suppose we have a machine whose task it is to learn the mapping x_i ↦ y_i. The machine is actually defined by a set of possible mappings x ↦ f(x, α), where the functions f(x, α) themselves are labeled by the adjustable parameters α. The machine is assumed to be deterministic: for a given input x, and choice of α, it will always give the same output f(x, α). A particular choice of α generates what we will call a "trained machine." Thus, for example, a neural network with fixed architecture, with α corresponding to the weights and biases, is a learning machine in this sense.

The expectation of the test error for a trained machine is therefore:

R(\alpha) = \int \tfrac{1}{2} |y - f(x, \alpha)| \, dP(x, y)    (1)

Note that, when a density p(x, y) exists, dP(x, y) may be written p(x, y) dx dy. This is a nice way of writing the true mean error, but unless we have an estimate of what P(x, y) is, it is not very useful.

The quantity R(α) is called the expected risk, or just the risk. Here we will call it the actual risk, to emphasize that it is the quantity that we are ultimately interested in. The "empirical risk" R_emp(α) is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations)⁴:

R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|.    (2)

Note that no probability distribution appears here. R_emp(α) is a fixed number for a particular choice of α and for a particular training set {x_i, y_i}.

The quantity (1/2)|y_i - f(x_i, α)| is called the loss.
For the case described here, it can only take the values 0 and 1. Now choose some η such that 0 ≤ η ≤ 1. Then for losses taking these values, with probability 1 - η, the following bound holds (Vapnik, 1995):

R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}    (3)

where h is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above. In the following we will call the right hand side of Eq. (3) the "risk bound." We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the "guaranteed risk", but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the "VC confidence."
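To make the quantities in Eqs. (2) and (3) concrete, here is a minimal Python sketch (my addition, not part of the original tutorial); the function names and the numbers in the final line are illustrative assumptions only, and natural logarithms are used.

import math

def empirical_risk(labels, predictions):
    # R_emp(alpha) of Eq. (2): the measured mean error rate on the training set,
    # with labels and predictions taking values in {-1, +1}.
    l = len(labels)
    return sum(abs(y - f) for y, f in zip(labels, predictions)) / (2.0 * l)

def vc_confidence(h, l, eta):
    # The "VC confidence", i.e. the second term on the right hand side of Eq. (3).
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def risk_bound(r_emp, h, l, eta):
    # Right hand side of Eq. (3); it holds with probability 1 - eta for 0/1 losses.
    return r_emp + vc_confidence(h, l, eta)

# Illustrative numbers only: l = 10,000 training points, VC dimension h = 200,
# empirical risk 2%, and eta = 0.05 (so the bound holds with probability 0.95).
print(risk_bound(r_emp=0.02, h=200, l=10000, eta=0.05))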
We note three key points about this bound. First, remarkably, it is independent of P(x, y). It assumes only that both the training data and the test data are drawn independently according to some P(x, y). Second, it is usually not possible to compute the left hand side. Third, if we know h, we can easily compute the right hand side. Thus given several different learning machines (recall that "learning machine" is just another name for a family of functions f(x, α)), and choosing a fixed, sufficiently small η, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6). Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.
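As a small illustration of this selection rule (again my addition, not from the paper), the following sketch ranks three hypothetical trained machines by the right hand side of Eq. (3); the (empirical risk, VC dimension) pairs are invented for the example.

import math

def bound(r_emp, h, l, eta=0.05):
    # Right hand side of Eq. (3) for a machine with empirical risk r_emp and VC dimension h.
    return r_emp + math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

l = 10000
# Hypothetical (empirical risk, VC dimension) pairs for three trained machines.
machines = {"small": (0.08, 50), "medium": (0.03, 500), "large": (0.01, 5000)}
for name, (r_emp, h) in machines.items():
    print(name, round(bound(r_emp, h, l), 3))
print("chosen:", min(machines, key=lambda name: bound(*machines[name], l)))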
2.1. The VC Dimension

The VC dimension is a property of a set of functions {f(α)} (again, we use α as a generic set of parameters: a choice of α specifies a particular function), and can be defined for various classes of function f. Here we will only consider functions that correspond to the two-class pattern recognition case, so that f(x, α) ∈ {-1, 1} ∀ x, α. Now if a given set of l points can be labeled in all possible 2^l ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. Note that, if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered.
2.2. Shattering Points with Oriented Hyperplanes in R^n

Suppose that the space in which the data live is R^2, and the set {f(α)} consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1, and all points on the other side, the class -1. The orientation is shown in Figure 1 by an arrow, specifying on which side of the line points are to be assigned the label 1. While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in R^2 is three.
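These two claims can be checked numerically by brute force. The sketch below (my addition) enumerates every labeling of a small point set and tests, as a linear-programming feasibility problem, whether some oriented line w·x + b = 0 realizes it; it assumes NumPy and SciPy are available, and the particular point sets are arbitrary examples.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility of y_i (w . x_i + b) >= 1 for all i: true exactly when some
    # oriented line puts every +1 point on its positive side and every -1 point
    # on its negative side.
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])   # unknowns (w1, w2, b)
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0   # status 0: a feasible point was found

def shattered(points):
    # A set of points is shattered if every one of the 2^m labelings is realized.
    return all(separable(points, labeling)
               for labeling in itertools.product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]               # three points, not collinear
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]    # no four points can be shattered
print(shattered(three))   # True
print(shattered(four))    # False (e.g. the "XOR" labeling is not realizable)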
Let us now consider hyperplanes in R^n. The following theorem will prove useful (the proof is in the Appendix):

Theorem 1 Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes⁵ if and only if the position vectors of the remaining points are linearly independent⁶.

Corollary: The VC dimension of the set of oriented hyperplanes in R^n is n + 1, since we can always choose n + 1 points, and then choose one of the points as origin, such that the position vectors of the remaining n points are linearly independent, but can never choose n + 2 such points (since no n + 1 vectors in R^n can be linearly independent).
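The condition of Theorem 1 is easy to test numerically: take one point as origin and check the rank of the matrix whose rows are the remaining position vectors. The short sketch below is my own illustration (the function name is invented) and assumes NumPy.

import numpy as np

def theorem_1_condition(points, origin_index=0):
    # With one point taken as origin, the remaining m - 1 position vectors must be
    # linearly independent for the m points to be shattered by oriented hyperplanes.
    P = np.asarray(points, dtype=float)
    rest = np.delete(P, origin_index, axis=0) - P[origin_index]
    return np.linalg.matrix_rank(rest) == len(rest)

# n + 1 = 3 points in R^2: the two remaining position vectors are independent.
print(theorem_1_condition([[0, 0], [1, 0], [0, 1]]))            # True
# n + 2 = 4 points in R^2: three vectors in R^2 cannot be independent.
print(theorem_1_condition([[0, 0], [1, 0], [0, 1], [1, 1]]))    # False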
An alternative proof of the corollary can be found in (Anthony and Biggs, 1995), and references therein.

Figure 1. Three points in R^2, shattered by oriented lines.

2.3. The VC Dimension and the Number of Parameters

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J. S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ R: {θ(x) = 1 ∀ x > 0; θ(x) = -1 ∀ x ≤ 0}. Consider the one-parameter family of functions, defined by

f(x, \alpha) \equiv \theta(\sin(\alpha x)), \qquad x, \alpha \in R.    (4)
You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be:

x_i = 10^{-i}, \quad i = 1, \ldots, l.    (5)

You specify any labels you like:

y_1, y_2, \ldots, y_l, \qquad y_i \in \{-1, 1\}.    (6)

Then f(α) gives this labeling if I choose α to be

\alpha = \pi \left( 1 + \sum_{i=1}^{l} \frac{(1 - y_i)\,10^{i}}{2} \right).    (7)

Thus the VC dimension of this machine is infinite.
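The construction of Eqs. (5)-(7) can be verified numerically for modest l (floating-point rounding limits how large l can be before exact arithmetic is needed). The following sketch is my own check, with an arbitrarily chosen label sequence:

import math

def alpha_for_labels(labels):
    # Eq. (7): the single parameter alpha that makes theta(sin(alpha * x)) reproduce
    # the requested labels at the points x_i = 10^{-i} of Eq. (5).
    return math.pi * (1.0 + sum((1 - y) * 10**i / 2.0
                                for i, y in enumerate(labels, start=1)))

def theta(z):
    # Step function used in Eq. (4): +1 for z > 0, -1 otherwise.
    return 1 if z > 0 else -1

labels = [1, -1, -1, 1, -1, 1, 1, -1]                       # any labels you like, Eq. (6)
alpha = alpha_for_labels(labels)
points = [10.0**(-i) for i in range(1, len(labels) + 1)]    # Eq. (5)
print([theta(math.sin(alpha * x)) for x in points] == labels)   # True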
Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2. This can be seen as follows: write the phase at x_1 as φ_1 = 2πn + δ. Then the choice of label y_1 = 1 requires 0 < δ < π.