PRTools Version 3.0
A Matlab Toolbox for Pattern Recognition

R.P.W. Duin
January 2000

An introduction to the setup, definitions and use of PRTools is given. Readers are assumed to be familiar with Matlab and should have a basic understanding of the field of statistical pattern recognition.

Pattern Recognition Group
Delft University of Technology
P.O. Box 5046, 2600 GA Delft, The Netherlands
tel: +31 15 2786143
fax: +31 15 2786740
email: duin@tn.tudelft.nl
http://www.ph.tn.tudelft.nl/prtools

1. Introduction

In statistical pattern recognition one studies techniques for the generalisation of decision rules to be used for the recognition of patterns in experimental data sets. This area of research has a strong computational character, demanding a flexible use of numerical programs for studying the data as well as for evaluating the data analysis techniques themselves. As new techniques keep being proposed in the literature, a programming platform is needed that enables a fast and flexible implementation. Pattern recognition is studied in almost all areas of applied science. Thereby the use of a widely available numerical toolset like Matlab may be profitable both for the use of existing techniques and for the study of new algorithms. Moreover, because of its general nature in comparison with more specialised statistical environments, it offers an easy integration with the preprocessing of data of any nature. This may certainly be facilitated by the large set of toolboxes available in Matlab.

The more than 100 routines offered by PRTools in its present state represent a basic set covering largely the area of statistical pattern recognition. In order to make the evaluation and comparison of algorithms easier, a set of data generation routines is included, as well as a small set of standard real-world datasets. Of course, many methods and proposals are not yet implemented. Anybody who would like to contribute is cordially invited to do so. The very important field of neural networks has been partially skipped, as Matlab already includes a very good toolbox in that area. At the moment just some basic routines based on that toolbox are included in order to facilitate a comparison with traditional techniques.

PRTools has a few limitations. Due to the heavy memory demands of Matlab, very large problems with learning sets of tens of thousands of objects cannot be handled on moderate machines. Moreover, some algorithms are slow as it appeared to be difficult to avoid nested loops. A fundamental drawback with respect to some applications is that PRTools as yet does not offer the possibility of handling missing data problems, nor the use of fuzzy or symbolic data. These areas demand their own sets of routines and are waiting for manpower.

In the next sections, first the area of statistical pattern recognition covered by PRTools is described. Following this, the toolbox is summarized and details are given on some specific implementations. Finally, some examples are presented.

2. The area of statistical pattern recognition
PRTools deals with sets of labeled objects and offers routines for generalising such sets into functions for data mapping and classification. An object is a k-dimensional vector of feature values. It is assumed that for all objects in a problem all values of the same set of features are given. The space defined by the actual set of features is called the feature space. Objects are represented as points or vectors in this space. A classification function assigns labels to new objects in the feature space. Usually, this is not done directly, but in a number of stages in which the initial feature space is successively mapped into intermediate stages, finally followed by a classification. The concept of mapping spaces and datasets is thereby important and constitutes the basis of many routines in the toolbox.

Sets of objects may be given externally or may be generated by one of the data generation routines of PRTools. Their labels may also be given externally or may be the result of a cluster analysis. By this technique similar objects within a larger set are grouped (clustered). The similarity measure is defined by the cluster technique in combination with the object representation in the feature space.

A fundamental problem is to find a good distance measure that agrees with the dissimilarity of the objects represented by the feature vectors. Throughout PRTools the Euclidean distance is used as default. However, scaling the features and transforming the feature spaces by different types of maps effectively changes the distance measure.
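As a small illustration of this point (a hedged sketch: the routine name scalem and its option are assumptions, not taken from this text; the A*W construction is explained in Section 4):

   A = gendath(100);          % two classes in a 2-dimensional feature space
   W = scalem(A,'variance');  % estimate a scaling map: unit variance per feature (assumed option)
   B = A*W;                   % mapped dataset; Euclidean distances in B correspond
                              % to rescaled distances in the original space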
The dimensionality of the feature space may be reduced by the selection of subsets of good features. Several strategies and criterion functions are possible for searching good subsets. Feature selection is important because it decreases the number of features that have to be measured and processed. In addition to the improved computational speed in lower dimensional feature spaces there might also be an increase in the accuracy of the classification algorithms. This is caused by the fact that with fewer features, fewer parameters have to be estimated.

Another way to reduce the dimensionality is to map the data on a linear or nonlinear subspace. This is called linear or nonlinear feature extraction. It does not necessarily reduce the number of features to be measured, but the advantage of an increased accuracy might still be gained. Moreover, as lower dimensional representations yield less complex classifiers, better generalisations can be obtained.
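Both routes can be sketched as follows; featself (forward feature selection) and klm (the Karhunen-Loeve mapping) are PRTools routines, but the exact signatures used here are assumptions:

   A = gendatd(100,8);        % two 'difficult' classes in 8 dimensions (assumed signature)
   W1 = featself(A,'NN',2);   % forward selection of 2 features, nearest neighbour criterion
   B = A*W1;                  % dataset restricted to the selected features
   W2 = klm(A,2);             % linear feature extraction onto 2 components
   C = A*W2;                  % extracted 2-dimensional dataset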
Using a learning set (or training set) a classifier can be trained such that it generalizes this set of examples of labeled objects into a classification rule. Such a classifier can be linear or nonlinear and can be based on two different kinds of strategies. The first one minimizes the expected classification error by using estimates of the probability density functions. In the second strategy this error is minimised directly by optimizing the performance of the classification function over the learning set. In this approach it has to be avoided that the classifier becomes entirely adapted to the learning set, including its noise, as this decreases its generalisation capability. This overtraining can be circumvented by several types of regularisation (often used in neural network training). Another technique is to simplify the classification function afterwards (e.g. the pruning of decision trees).

If the class probability density functions are known, as in simulations, the optimal classification function directly follows from the Bayes rule. In simulations this rule is often used as a reference.

Constructed classification functions may be evaluated by independent test sets of labeled objects. These objects have to be excluded from the learning set, otherwise the evaluation becomes biased. If they are added to the learning set, however, better classification functions may be expected. A solution to this dilemma is the use of cross validation and rotation methods by which a small fraction of objects is excluded from learning and used for testing. This fraction is rotated over the available set of objects and the results are averaged. The extreme case is the leave-one-out method for which the excluded fraction is as large as one object.
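A minimal sketch of such a scheme, built only from routines that appear later in this manual (gendat, ldc, testd); the loop is plain Matlab, not a built-in PRTools command, and the repeated random splits approximate rotation rather than implementing it strictly:

   A = gendath(100);          % 100 objects per class
   e = zeros(1,10);
   for i = 1:10
      [C,D] = gendat(A,80);   % 80 objects/class for training, the rest for testing
      W = ldc(C);             % train a linear classifier on C
      e(i) = testd(D*W);      % error estimate on the held-out objects
   end
   disp(mean(e))              % averaged error estimate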
The performance of classification functions can be improved by the following methods:
1. A reject option in which the objects close to the decision boundary are not classified. They are rejected and might be classified by hand or by another classifier.
2. The selection or averaging of classifiers.
3. A multi-stage classifier for combining classification results of several other classifiers.
For all these methods it is profitable or necessary that a classifier yields some distance measure or a posteriori probability in addition to the hard, unambiguous assignment of labels.
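For instance, the routines classc and classd listed in Section 4 can be combined with a trained classifier for this purpose. The sketch below assumes that classc converts classifier outputs into confidence values and that classd returns the hard labels; these roles are assumptions based on the routine names, not statements from this text:

   A = gendath(100);
   [C,D] = gendat(A,20);
   W = qdc(C);                % quadratic classifier
   conf = D*(W*classc);       % soft outputs: estimated class confidences (assumed usage)
   lab = classd(D*W);         % hard, unambiguous label assignment (assumed usage)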
3. References

Yoh-Han Pao, Adaptive pattern recognition and neural networks, Addison-Wesley, Reading, Massachusetts, 1989.
K. Fukunaga, Introduction to statistical pattern recognition, second edition, Academic Press, New York, 1990.
S.M. Weiss and C.A. Kulikowski, Computer systems that learn, Morgan Kaufmann Publishers, California, 1991.
C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
J. Schurmann, Pattern classification, a unified view of statistical and neural approaches, John Wiley & Sons, New York, 1996.
E. Gose, R. Johnsonbaugh and S. Jost, Pattern recognition and image analysis, Prentice-Hall, Englewood Cliffs, 1996.
S. Haykin, Neural Networks, a Comprehensive Foundation, second edition, Prentice-Hall, Englewood Cliffs, 1999.
S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.

4. A review of the toolbox

[Figure: scheme of the toolbox. Feature Measurement and Data Generation yield Unlabeled Data and Labeled Data; these feed Cluster Analysis, Visualisation (2D Projection), Nonlinear Mapping, Feature Selection, Classifier Training, Multistage Classifiers, Combining Classifiers, Classification, Error Estimation and Plot Results.]

PRTools makes use of the possibility offered by Matlab 5 to define Classes and Objects. These programmatic concepts should not be confused with the
classes and objects as defined in Pattern Recognition. Two Classes have been defined: dataset and mapping. A large number of operators (like *) and Matlab commands have been overloaded and thereby have a special meaning when applied to a dataset and/or a mapping.

The central data structure of PRTools is the dataset. It primarily consists of a set of objects represented by a matrix of feature vectors. Attached to this matrix is a set of labels, one for each object, and a set of feature names. Moreover, a set of apriori probabilities, one for each class, is stored. In most help files of PRTools, a dataset is denoted by A. In almost any routine this is one of the inputs. Almost all routines can handle multiclass object sets.
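A small sketch of constructing such a dataset by hand; the argument lists of genlab and dataset shown here are assumptions based on their one-line descriptions in the tables below:

   data = [randn(50,2); randn(50,2)+3];   % 100 objects with 2 features
   labs = genlab([50 50],[1 2]');         % 50 labels '1' and 50 labels '2' (assumed signature)
   A = dataset(data,labs);                % attach the labels to the feature matrix
   getlab(A)                              % retrieve the object labels again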
In the scheme above the relations between the various sets of routines are given. At the moment there are no commands for measuring features, so they have to be supplied externally. There are various ways to regroup the data, scale and transform the feature space, find good features, build classifiers, estimate the classification performances and compute (new) object labels.

Data structures of the Class mapping store trained classifiers, feature extraction results, data scaling definitions, nonlinear projections, etcetera. They are usually denoted by W. The result of the operation A*W is again a dataset. It is the classified, rescaled or mapped result of applying the mapping definition stored in W to A. A typical example is given below:

   A = gendath(100);          % Generate Highleyman's classes, 100 objects/class
   [C,D] = gendat(A,20);      % Training set C (20 objects/class)
                              % Test set D (80 objects/class)
   % Compute classifiers
   W1 = ldc(C);               % linear
   W2 = qdc(C);               % quadratic
   W3 = parzenc(C);           % Parzen
   W4 = bpxnc(C,3);           % Neural net with 3 hidden units
   % Compute and display errors
   disp([testd(D*W1),testd(D*W2),testd(D*W3),testd(D*W4)]);
   % Plot data and classifiers
   scatterd(A);               % scatter plot
   plotd(W1,'-');             % plot the 4 discriminant functions
   plotd(W2,'-.');
   plotd(W3,'--');
   plotd(W4,':');
This command file first generates by gendath two sets of labeled objects, both containing 100 two-dimensional object vectors, and stores them, their labels and their apriori probabilities in the dataset A. The distribution follows the so-called Highleyman classes. The next call, to gendat, takes this dataset and splits it at random into a dataset C, further on used for training, and a dataset D, used for testing. The training set C contains 20 objects from both classes. The remaining 2 x 80 objects are collected in D.

In the next lines four classification functions (discriminants) are computed, called W1, W2, W3 and W4. The linear and quadratic classifiers are both based on the assumption of normally distributed classes. The Parzen classifier estimates the class densities by Parzen density estimation and has a built-in optimization for the smoothing parameter. The fourth classifier uses a feedforward neural network with three hidden units. It is trained by the backpropagation rule using a varying stepsize.

Hereafter the results are displayed and plotted. The test dataset D is used in a routine testd (test discriminant) on each of the four discriminants. The estimated probabilities of error are displayed in the Matlab command window and look like:

0.1750 0.1062 0.1000 0.1562

Finally the classes are plotted in a scatter diagram together with the discriminants, see below. The plot routine plotd draws a vectorized straight line for the linear classifiers and computes the discriminant function values in all points of the plot grid (default 30 x 30) for the nonlinear discriminants. After that, the zero discriminant values are computed by interpolation and plotted.

[Figure: scatter plot of the two Highleyman classes together with the four discriminant functions.]

We will now shortly discuss the PRTools commands group by group.
The two basic structures of the toolbox can be defined by the constructors dataset and mapping. These commands can also be used to retrieve or redefine the data. It is thereby not necessary to use the general Matlab converter struct() for decomposing the structures. By getlab and getfeat the labels assigned to the objects and features can be found. The generation and handling of data is further facilitated by genlab and renumlab.

Datasets and Mappings

dataset    Define dataset from datamatrix and labels and retrieve
getlab     Retrieve object labels from dataset
getfeat    Retrieve feature labels from dataset
genlab     Generate dataset labels
renumlab   Convert labels to numbers
mapping    Define mapping and classifier from data and retrieve
getlab     Retrieve labels assigned by a classifier
Data Generation

gauss      Generation of multivariate Gaussian distributed data
gendat     Generation of subsets of a given dataset
gendatb    Generation of banana shaped classes
gendatc    Generation of circular classes
gendatd    Generation of two difficult classes
gendath    Generation of Highleyman classes
gendatk    Nearest neighbour data generation
gendatl    Generation of Lithuanian classes
gendatm    Generation of many Gaussian distributed classes
gendatp    Parzen density data generation
gendats    Generation of two Gaussian distributed classes
gendatt    Generation of testset from given dataset
prdata     Read data from file and convert into a dataset

There is a large set of routines for the generation of arbitrary normally distributed classes (gauss), and for various specific problems (gendatc, gendatd, gendath, gendatm and gendats). There are two commands for enriching classes by noise injection (gendatk and gendatp). These are used for the general testset generator gendatt. A given dataset can be split into a training set and a testset by gendat. The routine gendat splits the dataset at random into two sets.
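For instance (a short sketch; the size argument of gendatb is an assumption, while the gendat call follows the example of Section 4):

   A = gendatb(100);          % banana shaped classes, 100 objects per class (assumed)
   [C,D] = gendat(A,50);      % C: 50 objects/class for training; D: the remaining objects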
Linear and Higher Degree Polynomial Classifiers

klclc      Linear classifier by KL expansion of common cov matrix
kljlc      Linear classifier by KL expansion on the joint data
loglc      Logistic linear classifier
fisherc    Fisher's discriminant (minimum least square linear classifier)
ldc        Normal densities based linear classifier (Bayes rule)
nmc        Nearest mean classifier
nmsc       Scaled nearest mean classifier
perlc      Linear classifier by linear perceptron
persc      Linear classifier by nonlinear perceptron
pfsvc      Pseudo-Fisher support vector classifier
qdc        Normal densities based quadratic (multi-class)
udc
polyc
classc
classd
testd