Visual Categorization with Bags of Keypoints.pdf-淘文阁

资源描述

《Visual Categorization with Bags of Keypoints.pdf》由会员分享，可在线阅读，更多相关《Visual Categorization with Bags of Keypoints.pdf（16页珍藏版）》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。

1、Visual Categorization with Bags of KeypointsVisual Categorization with Bags of KeypointsGabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cdric BrayXerox Research Centre Europe6, chemin de Maupertuis38240 Meylan, Francegcsurka,Abstract.Abstract. We present a novel method for gene

2、ric visual catego-rization: the problem of identifying the object content of naturalimages while generalizing across variations inherent to the ob-ject class. Thisbag of keypoints method is based on vectorquantization of affine invariant descriptors of image patches.We propose and compare two altern

3、ative implementations usingdifferent classifiers: Nave Bayes and SVM. The main advan-tages of the method are that it is simple, computationally effi-cient and intrinsically invariant. We present results for simulta-neously classifying seven semantic visual categories. These re-sults clearly demonstr

4、ate that the method is robust to back-ground clutter and produces good categorization accuracy evenwithout exploiting geometric information.1. 1. IntroductionIntroductionThe proliferation of digital imaging sensors in mobile phones and consumer-levelcameras is producing a growing number of large dig

5、ital image collections. To managesuch collections it is useful to have access to high-level information about objectscontained in the image. Given an appropriate categorization of image contents, onemay efficiently search, recommend, react to or reason with new image instances.We are thus confronted

6、 with the problem ofgeneric visual categorization . Weshould like to identify processes that are sufficiently generic to cope with many objecttypes simultaneously and which are readily extended to new object types. At the sametime, these processes should handle the variations in view, imaging, light

7、ing and oc-clusion, typical of the real world, as well as the intra-class variations typical of seman-tic classes of everyday objects.The task-dependent and evolving nature of visual categories motivates an example-based machine learning approach. This paper presents a bag of keypoints approach tovi

8、sual categorization. A bag of keypoints corresponds to a histogram of the number ofoccurrences of particular image patterns in a given image. The main advantages of themethod are its simplicity, its computational efficiency and its invariance to affinetransformations, as well as occlusion, lighting

9、and intra-class variations.It is important to understand the distinction of visual categorization from three re-lated problems:Recognition: This concerns the identification of particular object instances. For in-stance, recognition would distinguish between images of two structurally distinctcups, w

10、hile categorization would place them in the same class.Content Based Image Retrieval: This refers to the process of retrieving images on thebasis of low-level image features, given a query image or manually constructed de-scription of these low-level features. Such descriptions frequently have littl

11、e rela-tion to the semantic content of the image.Detection: This refers to deciding whether or not a member ofone visual category ispresent in a given image. Most previous work on detection has centered on ma-chine learning approaches to detecting faces, cars or pedestrians 1-6 While itwould be poss

12、ible to perform generic categorization by applying a detector for eachclass of interest to a given image, this approach becomes inefficient given a largenumber of classes. In contrast to the technique proposed in this paper, most existingdetection techniques require precise manual alignment of the t

13、raining images andthe segregation of these images into different views 5, neither of which is neces-sary in our approach.Our bag of keypoints approach can be motivated by an analogy to learning methodsusing the bag-of-words representation for text categorization 7-10. The idea ofadapting text catego

14、rization approaches to visual categorization is not new. Zhu et al11 investigated the vector quantization of small square image windows, which theycalled keyblocks. They showed that these features produced more “semantics-oriented” results than color and texture based approaches, when combined with

15、ana-logues of the well-known vector-, histogram-, and n-gram-models of text retrieval. Incontrast to our approach, their keyblocks do not possess any invariance properties.The idea of clustering invariant descriptors of image patches has previously beenapplied to the problem of texture classificatio

16、n 12-14. Clearly the problem oftexture classification is rather different from that of generic categorization. Thereforeit is natural that these approaches differ from ours. While our method uses clusteringto obtain quite high-dimensional feature vectors for a classifier, these texture recog-nizers

17、use clustering to obtain relatively low-dimensional histograms and evaluate thesimilarity of these histograms to previously seen probability densities. In 12-13filter responses are clustered and the recognition is done using the closest modelmeasured by a 2 test. Lazebnik et al 14 cluster affine inv

18、ariant interest points ineach image individually and summarize the distribution of the descriptors in form of asignature composed of representative cluster members and weights proportional tocluster sizes. Signatures of different images are compared using the Earth MoversDistance 15.Recently Fergus

19、et al 16 proposed a visual categorization method based on in-variant descriptors of image patches. Their method exploits a probabilistic model thatcombines likelihoods for appearance, relative scale and position, as well as a model ofthe statistics of their patch detector. This elegant approach has

20、a number of limita-tions. Firstly the method is not efficient: even when models are restricted to 6 imagepatches and training images only contain up to 30 patches, days of CPU time are re-quired to learn several categories. Secondly, views of objects used for training must besegregated, for instance

21、 into cars (rear) and cars (side). This is unsurprising given theuse of an explicit 2D model of relative positions.In section 2 we explain the categorization algorithms and the choice of their com-ponents. In section 3 we present results from applying of the algorithms to the datasetof Fergus et al

22、and to a more challenging seven class dataset. We demonstrate that ourapproach is robust to the presence of background clutter and produces state-of-the-artrecognition performance.2. 2. The methodThe methodThe main steps of our method are:Detection and description of image patchesAssigning patch des

23、criptors to a set of predetermined clusters (avocabu-lary) with a vector quantization algorithmConstructing a bag of keypoints , which counts the number of patches as-signed to each clusterApplying a multi-class classifier, treating the bag of keypoints as the fea-ture vector, and thus determine whi

24、ch category or categories to assign tothe image.Ideally these steps are designed to maximize classification accuracy while minimiz-ing computational effort. Thus, the descriptors extracted in the first step should beinvariant to variations that are irrelevant to the categorization task (image transf

25、orma-tions, lighting variations and occlusions) but rich enough to carry enough informationto be discriminative at the category level. The vocabulary used in the second stepshould be large enough to distinguish relevant changes in image parts, but not so largeas to distinguish irrelevant variations

26、such as noise.We refer to the quantized feature vectors (cluster centres) as “keypoints” by anal-ogy with “keywords” in text categorization. However, in our case “words” do notnecessarily have a repeatable meaning such as “eyes”, or “car wheels”, nor is there anobvious best choice of vocabulary. Rat

27、her, our goal is to use a vocabulary that allowsgood categorization performance on a given training dataset. Therefore the steps in-volved in training the system allow consideration of multiple possible vocabularies:Detection and description of image patches for a set of labeled trainingimagesConstr

28、ucting a set of vocabularies: each is a set of cluster centres, with re-spect to which descriptors are vector quantized.Extracting bags of keypoints for these vocabulariesTraining multi-class classifiers using the bags of keypoints as feature vec-torsSelecting the vocabulary and classifier giving th

29、e best overall classifica-tion accuracy.We now discuss the choices made for each step in more detail.2.12.1Feature extractionFeature extractionIn computer vision, local descriptors (i.e. features computed over limited spatial sup-port) have proved well-adapted to matching and recognition tasks, as t

30、hey are robustto partial visibility and clutter. Such tasks require descriptors that are repeatable. Here,we mean repeatable in the sense that if there is a transformation between two instancesof an object, corresponding points are detected and (ideally) identical descriptor val-ues are obtained aro

31、und each. This has motivated the development of several scaleand affine invariant point detectors, as well as descriptors that are resistant to geomet-ric and illumination variations 17-21.It was shown in 21 that if we have affine transformation between two images ascale invariant point detector is

32、not sufficient to have the stability of the points loca-tion. Therefore we preferred to work with the Harris affine detector described in 21.However, the reader should be aware that the benefits of this choice are not clear cut:firstly because most real world objects have three-dimensional structure

33、s whose varia-tions are not well captured by affine transformations; secondly since attempts to in-crease the invariance of a feature typically result in a loss of discriminative informa-tion.Harris affine pointsare detected by an iterative process. Firstly, positions andscales of interest points ar

34、e determined as local maxima (in position) of a scale-adapted Harris function, and as local extrema in scale of the Laplacian operator. Thenan elliptical (i.e. affine) neighborhood is determined. This has a size given by theselected scale and a shape given by the eigenvalues of the images second mom

35、entmatrix. The selection of position/scale and the elliptical neighborhood estimation arethen iterated and the point is kept only if the process converges within a fixed numberof iterations.The affine region is then mapped to a circular region, so normalizing it for affinetransformations. Scale Inva

36、riant Feature Transform (SIFT) descriptors 18 are com-puted on that region. SIFT descriptors are multi-image representations of an imageneighborhood. They are Gaussian derivatives computed at 8 orientation planes over a4x4 grid of spatial locations, giving a 128-dimension vector. Fig. 1 shows an exa

37、mpleof the maps of gradient magnitude corresponding to the 8 orientations.Fig. 1.Fig. 1. (From left to right) A Harris affine region; the normalized region; and the 8 maps ofgradient magnitude constituting the SIFT descriptor.We prefer SIFT descriptors to alternatives such as steered Gaussian deriva

38、tives ordifferential invariants of the local jet for the following reasons:1.They are simple linear Gaussian derivatives. Hence we expect them to be morestable to typical image perturbations such as noise than higher Gaussian deriva-tives or differential invariants.2.The use of a simple Euclidean me

39、tric in the feature space seems justified. In thecase of differential invariants obtained by a combination of the components of thelocal jet, the use of a Mahalanobis distance is more appropriate. For instance, onewould expect a second derivative feature to have a higher variance than a first de-riv

40、ative. Selecting an appropriate Mahalanobis distance a priori seems challeng-ing. It would not be appropriate to use the covariance matrix of SIFT descriptorsover the entire dataset, since this is predominantly influenced by inter-class varia-tions (or more precisely, by variations between keypoints

41、 that we do not wish toignore). Measuring a Mahalanobis distance would probably requiring manualspecification of multiple homologous matching points between different imagesof objects of the same category, seriously working against our objective of pro-ducing a simple and automated categorization sy

42、stem.3.There are far more components to these feature vectors (128 rather than 12 to 16).Hence we have a richer and potentially more discriminative representation.Recently Mikolajczyk et al 22 have compared several descriptors for matchingand found that SIFT descriptors perform best.2.32.3Visual voc

43、abulary constructionVisual vocabulary constructionIn our method, the vocabulary is a way of constructing a feature vector for classifi-cation that relates “new” descriptors in query images to descriptors previously seen intraining. One extreme of this approach would be to compare each query descript

44、or toall training descriptors: this seems impractical given the huge number of training de-scriptors involved (over 600 000 for our data set). Another extreme would be to try toidentify a small number of large “clusters” that are good at discriminating a givenclass: for instance 16 operates with 6 p

45、arts per category. In practice we find that thebest tradeoffs of accuracy and computational efficiency are obtained for intermediatesizes of clustering.Most clustering or vector quantization algorithms are based on iterative square-error partitioning or on hierarchical techniques. Square-error parti

46、tioning algorithmsattempt to obtain the partition which minimizes the within-cluster scatter or maximizesthe between-cluster scatter. Hierarchical techniques organize data in a nested sequenceof groups which can be displayed in the form of a dendrogram or a tree. They needsome heuristics to form clu

47、sters and hence are less frequently used than square-errorpartitioning techniques in pattern recognition.We chose to use the simplest square-error partitioning method: k-means 23. Thisalgorithm proceeds by iterated assignments of points to their closest cluster centersand recomputation of the cluste

48、r centers. Two difficulties are that the k-means algo-rithm converges only to local optima of the squared distortion, and that it does notdetermine the parameter k. There exist methods allowing to automatically estimatingthe number of clusters. For example, Pelleg et al 24 use cluster splitting to d

49、o it,where the splitting decision is done by computing the Bayesian Information Criterion.However, in the present case we do not really know anything about the density or thecompactness of our clusters. Moreover, we are not even interested in a “correct clus-tering” in the sense of feature distribut

50、ions, but rather in accurate categorization. Wetherefore run k-means several times with different number of desired representativevectors (k) and different sets of initial cluster centers. We select the final clusteringgiving the lowest empirical risk in categorization 25.2.32.3CategorizationCategor

展开阅读全文