WEIGHTING SYSTEM AND METHOD FOR POOLING IMAGE DESCRIPTORS
Abstract:
A method and system for generating an image representation. The method comprises generating a set of integrated descriptors by, for each of a set of areas of an image, extracting (S106) an area descriptor that is representative of the pixels in the area and integrating (S108) the area descriptor into a multidimensional space to form an integrated descriptor. An image representation is generated by aggregating the set of integrated descriptors (S110). In the aggregation, each descriptor is weighted with a respective weight in a set of weights, the set of weights being calculated based on the area descriptors for the image. Information based on the image representation is output (S114). At least one of the extraction of the area descriptors, the integration of the area descriptors, and the generation of the image representation is performed with a computer processor.
Publication number: FR3016066A1
Application number: FR1462835
Filing date: 2014-12-18
Publication date: 2015-07-03
Inventors: Naila Murray; Florent C. Perronnin
Applicant: Xerox Corp
Patent description:
[0001] The present invention relates to a weighting system and method for pooling image descriptors. The exemplary embodiment relates to an image representation for tasks such as classification and retrieval, and finds particular application in a system and method for aggregating encoded local descriptors using a pooling function that places more weight on local descriptors which occur less frequently in the descriptor set. [0002] Conventional image classification methods include extracting areas from the image and generating a representation of each area, called a local descriptor or an area descriptor. The area descriptors (such as SIFT (scale-invariant feature transform) or color descriptors) are then encoded using an integration function φ that maps the descriptors in a non-linear fashion into a higher-dimensional space to form integrated area descriptors. The integrated descriptors are then aggregated into a fixed-length vector or image representation using a pooling function. Representations of this type include the Bag of Visual Words (BOV) (see, G. Csurka, et al., "Visual Categorization with Bags of Keypoints", ECCV Workshop SLCV 2004, hereinafter, Csurka 2004; Sivic, et al., "Video Google: A Text Retrieval Approach to Object Matching in Videos", ICCV 2003, and U.S. Publication No. 20080069456), the Fisher Vector (FV) (see, F. Perronnin, et al., "Fisher Kernels on Visual Vocabularies for Image Categorization", CVPR 2007, hereinafter, Perronnin 2007, and U.S. Publications No. 20070005356 and 20120076401), the Vector of Locally Aggregated Descriptors (VLAD) (see, H. Jégou, et al., "Aggregating Local Image Descriptors into Compact Codes", TPAMI 2012, hereinafter, Jégou 2012), the Super-Vector (SV) (see, Z. Zhou, et al., "Image Classification Using Super-Vector Coding of Local Image Descriptors", ECCV 2010, hereinafter, Zhou 2010) and the Efficient Match Kernel (EMK) (see, L.
Bo, et al., "Efficient Match Kernels between Sets of Features for Visual Recognition", NIPS 2009, hereinafter, Bo 2009). [0003] Pooling is the operation that involves aggregating several area integrations into a single representation. While pooling provides some invariance to descriptor perturbations, it may lead to a loss of information. To reduce this loss as much as possible, only close descriptors should be pooled together. To enforce pooling of descriptors that are close in geometric space, spatial pyramids can be used (see, S. Lazebnik, et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", CVPR 2006). In the descriptor space, the proximity constraint is obtained by choosing an appropriate integration φ. [0004] Pooling is typically achieved either by averaging/summing or by taking the maximum response. A common pooling mechanism involves averaging the descriptor integrations (see, Csurka 2004, Perronnin 2007, Jégou 2012, Zhou 2010, and Bo 2009). Given a set of area descriptors {x_1, ..., x_M}, the average-pooled representation is simply (1/M) Σ_{i=1}^{M} φ(x_i). An advantage of averaging is its generality, since it can be applied to any integration. A disadvantage of this method, however, is that frequent descriptors will be more influential in the final representation than rarely occurring descriptors. By "frequent descriptors" are meant descriptors which, although not necessarily identical, together form a mode in the descriptor space. However, such frequently occurring descriptors are not necessarily the most informative ones. [0005] For example, consider a fine-grained classification task whose purpose is to distinguish bird species. In a typical bird image, most areas may be background foliage or sky and thus carry little information about the bird category.
On the other hand, the most discriminative information may be very localized and thus correspond to only a few areas. It is therefore desirable to ensure that even these rare areas contribute significantly to the final representation. [0006] The problem of reducing the influence of frequent descriptors has received much attention in computer vision. This problem can be addressed at the pooling stage, or a posteriori by performing some normalization on the image-level pooled descriptor. Several approaches have been proposed to address the problem of frequent descriptors at the pooling stage. However, all of these solutions are heuristic in nature and/or limited to certain types of integrations. For example, one approach, referred to as max pooling (see, Y.-L. Boureau, et al., "A Theoretical Analysis of Feature Pooling in Visual Recognition", ICML 2010), is only applicable to descriptor integrations that can be interpreted as counts, as is the case with the BOV. It is not directly applicable to representations that compute higher-order statistics, such as the FV, VLAD, SV, or EMK. [0007] Several extensions to the standard average and max pooling frameworks have been proposed. For example, a smooth transition from average pooling to max pooling can be considered. It is also possible to add weights to obtain weighted pooling (see, T. de Campos, et al., "Images as Sets of Locally Weighted Features", CVIU, 116(1), pp. 68-85 (2012), hereinafter, de Campos 2012). The weights in de Campos 2012 are computed from a separate saliency model in an attempt to negate the influence of irrelevant descriptors, but such a model does not necessarily equalize the influence of frequent and rare descriptors. [0008] There remains a need for a pooling method that is generic and applicable to all aggregation-based representations.
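The dominance of frequent descriptors under average pooling, and the equalization obtained by max pooling in the count-based BOV case, can be sketched numerically (a toy illustration with invented one-hot BOV-style integrations, not the patented method itself):

```python
import numpy as np

# Toy BOV-style setup: a codebook of 4 visual words, so each area
# descriptor is integrated as a one-hot vector of dimension D = 4.
# Nine "frequent" areas fall on visual word 0 and a single "rare" but
# informative area falls on visual word 3 (all values are invented).
def one_hot(k, d=4):
    e = np.zeros(d)
    e[k] = 1.0
    return e

assignments = [0] * 9 + [3]                        # 10 areas in total
Phi = np.stack([one_hot(k) for k in assignments])  # M x D matrix

avg_pooled = Phi.mean(axis=0)   # (1/M) * sum_i phi(x_i): [0.9, 0, 0, 0.1]
max_pooled = Phi.max(axis=0)    # presence/absence:        [1, 0, 0, 1]
```

Under average pooling the rare area contributes only 0.1 against 0.9 for the frequent mode, whereas max pooling records both equally; the method described below seeks this equalization for integrations beyond counts.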
According to one aspect of the exemplary embodiment, a method for generating an image representation comprises generating a set of integrated area descriptors by, for each of a set of areas of an image, extracting an area descriptor that is representative of the pixels in the area and integrating the area descriptor into a multidimensional space to form an integrated area descriptor. An image representation is generated. This includes aggregating the set of integrated area descriptors. In the aggregation, each descriptor is weighted with a respective weight in a set of weights, the set of weights being calculated based on the integrated area descriptors for the image. Information based on the image representation is output. In another aspect, a system for generating an image representation includes a descriptor extractor that extracts a set of area descriptors, each area descriptor being representative of the pixels in an area of an image. An integration component integrates each of the area descriptors into a multidimensional space to form a respective integrated descriptor. A pooling component aggregates the set of integrated descriptors. In the aggregation, each area descriptor is weighted with a respective weight in a set of weights, the set of weights being calculated based on the integrated area descriptors for the image. A processor implements the descriptor extractor, the integration component, and the pooling component. According to another aspect, a method for generating an image representation comprises, for each of a set of M areas of an image, extracting an area descriptor which is representative of the pixels in the area and integrating the area descriptor into a multidimensional space with an integration function to form an integrated descriptor of dimension D. With a processor, an aggregated representation of the image is generated.
This includes aggregating the integrated descriptors as Ψ = Σ_{i=1}^{M} w_i φ(x_i), where Ψ is the aggregated representation, φ(x_i) represents one of the M integrated area descriptors and w_i represents a respective weight, the weights being selected by one of: a) finding a vector w = [w_1, ..., w_M] which minimizes the expression: ‖Φ^T Φ w − c_M‖² + λ‖w‖², where Φ is a D x M matrix which contains the integrated area descriptors of dimension D, c_M is a vector in which all the values are the same constant value, and λ is a non-negative regularization parameter; and b) finding the aggregated representation Ψ which minimizes the expression: ‖Φ^T Ψ − c_M‖² + λ‖Ψ‖² (Equation 11), where Φ is a D x M matrix which contains the integrated area descriptors of dimension D, c_M is a vector in which all the values are the same constant value, and λ is a non-negative regularization parameter. An image representation based on the aggregated representation Ψ is generated. FIGURE 1 is a block diagram of a system for calculating a representation of an image; FIGURE 2 is a flowchart illustrating a method for calculating a representation of an image; [0014] FIGURE 3 illustrates the effect of pooling a single descriptor integration with a set of tightly clustered descriptor integrations using average pooling; [0015] FIGURE 4 illustrates the effect of pooling a single descriptor integration with a set of tightly clustered descriptor integrations using the proposed weighted pooling (GMP); FIGURE 5 illustrates probability distributions formed by KDE with no weights (KDE), after exponentiation to the power ρ = 0.5 and renormalization (KDE^0.5), and with weights calculated with the proposed approach (weighted KDE). The KDEs were generated using 5 one-dimensional observations (marked as black dots in the graph) with values [11, -10, 7, 8, 9].
Embodiments of the exemplary embodiment relate to a system and method for generating an image representation that utilize a weighted pooling method for aggregating integrated area descriptors (also referred to as area integrations). The pooling method is applicable to a variety of integration processes. When the BOV integration function is used, the pooling method reduces to max pooling, and it is thus referred to herein as Generalized Max Pooling (GMP). This approach allows a set of weights w to be chosen to linearly reweight the integrated area descriptors so that the locations of the modes of their distribution remain the same, while the heights of their modes may differ. Thus, rather than attempting to flatten the global distribution, the process flattens the probability of each sample. [0019] With reference to FIGURE 1, a system 10 for generating an image representation 12 of an input image 14, such as a photographic image, is illustrated. The system takes as input an image 14 for which a statistical representation 12, such as a fixed-length vector, is desired. The illustrated system comprises a main memory 16 which stores instructions 18 for generating the representation and a processor 20, in communication with the memory, for executing the instructions. A data memory 22 stores the input image 14 during processing, as well as information generated during image processing. One or more network interface (input/output) devices 24, 26 allow the system to communicate with external devices, such as a source of images (not shown), a display device 28, such as a computer monitor or an LCD screen, and a user input device 30, such as a keyboard, a keypad, a touch screen, a cursor control device, or a combination thereof. Hardware components of the system may be communicatively connected by a data/control bus 32. The system may be hosted by one or more computing devices 34.
The illustrated instructions include an area extractor 40, a descriptor extractor 42, an integration component 44, an image representation generator 46, and a representation-employing component 48. Briefly, the area extractor 40 extracts a set of areas from the image, each area comprising a set of pixels. The descriptor extractor 42 generates an area descriptor 50 based on the pixels of the respective area. The integration component 44 integrates the area descriptor into an integration space using an integration function φ, generating an integrated descriptor 52 for each area. In the case of the BOV, the integration function may include assigning the area descriptor to the nearest visual word in a set of visual words (or codebook), where each of the visual words represents the centroid of a cluster of area descriptors extracted from a set of training images. The image representation generator 46 comprises a weighted pooling (GMP) component 54 which aggregates the integrated descriptors 52 to form an aggregation (denoted Ψ) which can serve as the image representation 12, or be first normalized or otherwise processed to form the image representation 12. The representation-employing component 48 uses the representation 12, for example, for image classification or image retrieval. Information 56 is output by the system, based on the image representation. The information 56 may comprise the representation 12 itself, a classification for the image, a set of similar images retrieved from an associated image database 58, a combination thereof, or the like. The computer system 10 may comprise one or more computing devices, such as a PC, such as a desktop computer, a laptop, a handheld computer, a personal digital assistant (PDA), a server computer, a cell phone, a tablet computer, a pager, a combination thereof, or another computing device capable of executing instructions for performing the exemplary method.
The memory 16 may represent any type of non-transitory computer-readable medium, such as a random access memory (RAM), a read-only memory (ROM), a magnetic disk or tape, an optical disk, a flash memory, or a holographic memory. In one embodiment, the memory 16 comprises a combination of random access memory and read-only memory. In some embodiments, the processor 20 and the memory 16 may be combined in a single chip. The memory 16 stores instructions for carrying out the exemplary method as well as the processed data 12, 50, 52. The network interfaces 24, 26 allow the computer to communicate with other devices via a computer network, such as a local area network (LAN), a wide area network (WAN), or the Internet, and may include a modulator/demodulator (MODEM), a router, a cable, and/or an Ethernet port. The digital processor 20 may be embodied in a variety of ways, such as by a single-core processor, a dual-core processor (or, more generally, a multi-core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like. The digital processor 20, in addition to controlling the operation of the computer 34, executes the instructions 18 stored in the memory 16 to perform the method described in FIGURE 2. The term "software" as used herein is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term "software" as used herein is also intended to encompass such instructions stored in a storage medium such as a RAM, a hard disk, an optical disk, or the like, and to encompass so-called "firmware", which is software stored on a ROM or the like.
Such software may be organized in a variety of ways, and may include software components organized as libraries, Internet-based programs stored on a remote server, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions. As will be appreciated, FIGURE 1 is a high-level functional diagram of only a portion of the components that are incorporated into the computer system 10. Since the configuration and operation of programmable computers are well known, they will not be described further. [0029] FIGURE 2 illustrates a method for generating an image representation. The method starts at S100. At S102, an input image is received by the system from an external device, or from an internal memory of the computing device 34. At S104, areas are extracted from the image by the area extractor 40. At S106, an area descriptor is extracted from each area by the descriptor extractor 42. At S108, each area descriptor is integrated using an integration function to form an integrated area descriptor. At S110, the integrated area descriptors are aggregated using weighted pooling (GMP) to form an image representation. Additional details of this step are discussed below. At S112, the image representation 12 may be used in a task, such as classification or retrieval, by the representation-employing component 48. To compute a similarity between images, a kernel K(X, Y) can be calculated as a dot product between the GMP representations of the two images. To classify the image, a classifier trained on image representations formed by the present GMP method can be used. At S114, information 56 is output, such as the image representation, a category label for the image, or a set of images with similar image representations. The method ends at S116.
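The similarity computation at S112 can be sketched as follows (a minimal illustration assuming the GMP representations have already been computed; the l2 normalization mirrors the normalization discussed later in the text):

```python
import numpy as np

def similarity(psi_x, psi_y):
    """Dot-product kernel K(X, Y) between two pooled representations,
    after l2-normalizing each one (an assumption for illustration)."""
    psi_x = psi_x / np.linalg.norm(psi_x)
    psi_y = psi_y / np.linalg.norm(psi_y)
    return float(psi_x @ psi_y)

# Toy pooled representations (invented values):
psi_a = np.array([1.0, 1.0, 0.0])
psi_b = np.array([1.0, 0.0, 1.0])
```

With this normalization, an image is maximally similar to itself (similarity 1), which is what allows linear kernel machines to be used for classification.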
[0033] The weighted pooling (GMP) component 54 employs a pooling mechanism which involves a reweighting of the area statistics (descriptor integrations). It achieves the same equalization effect as max pooling but is applicable beyond the BOV, and in particular to the Fisher Vector. It thus provides a generalized max pooling tool. In the Examples below, it is shown that the performance of the weighted pooling approach is equal to, and sometimes significantly better than, that of heuristic alternatives. The exemplary GMP approach thus handles the frequent descriptors discussed above (descriptors that are close together and form a mode in the descriptor space) in a way that is applicable to any descriptor integration, not just those that can be interpreted as counts. [0034] FIGURES 3 and 4 illustrate the effect of pooling a single descriptor integration with a set of tightly clustered descriptor integrations. Two pooled representations are shown. With average pooling (FIGURE 3), the descriptor cluster dominates the pooled representations, which are therefore very similar to each other. With the present GMP approach (FIGURE 4), both descriptors contribute significantly, leading to highly distinguishable pooled representations. [0035] The weights used to reweight the descriptor integrations are computed on a per-image basis to equalize the influence of frequent and rare integrated descriptors.
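This per-image equalization can be sketched with the kernel-matrix criterion developed below (equation (5), in which K w = 1_M); the toy one-hot integrations and the small ridge term are assumptions for illustration only:

```python
import numpy as np

def gmp_weights(Phi, lam=1e-9):
    """Solve K w = 1_M for the per-descriptor weights, where
    K = Phi^T Phi is the M x M kernel matrix between integrated
    descriptors (Phi is D x M). A tiny ridge term lam stabilizes
    the solve (an assumption, echoing the regularization below)."""
    M = Phi.shape[1]
    K = Phi.T @ Phi
    return np.linalg.solve(K + lam * np.eye(M), np.ones(M))

# Three identical "frequent" integrations and one "rare" one (D=2, M=4):
Phi = np.array([[1.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
w = gmp_weights(Phi)          # approx [1/3, 1/3, 1/3, 1]
psi = Phi @ w                 # weighted pooled representation, approx [1, 1]
```

Each frequent descriptor receives weight ≈ 1/3 while the rare one receives weight ≈ 1, so both modes contribute equally to the pooled vector Ψ = Φw, exactly the equalization illustrated in FIGURE 4.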
If w_i denotes the weight associated with a descriptor x_i, a weighted representation Ψ of the image can be written as the sum, over all M descriptors, of the product of the weight for the descriptor and the integrated descriptor: Ψ = Σ_{i=1}^{M} w_i φ(x_i). One advantage of this approach is that there is no need to quantize the descriptors in order to detect frequent descriptors (as is the case, for example, for BOV-type representations). As a result, the weighting is general and can be applied in combination with any integration function. For example, it is applicable to codebook-free representations such as the EMK and to representations based on higher-order statistics, such as the FV. [0038] A criterion for calculating the weights w_i is based on a kernel matrix of descriptor-to-descriptor kernels. According to one embodiment, the area weights are all first calculated and the weighted integrations are then combined. According to another embodiment, the weighted representation can be calculated directly and efficiently using a least squares formulation. The exemplary GMP mechanism, in the case of the BOV, produces the same result as max pooling. [0039] In the following, the terms "optimization", "minimization", and similar phraseology are to be interpreted broadly, as one skilled in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum, the absolute minimum, and so on. For example, minimizing a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated that the optimum or minimum value is a local optimum or local minimum value. Let X = {x_1, ..., x_M} and Y = {y_1, ..., y_N} denote two sets of area descriptors extracted from two images.
Let Ψ_X = (1/M) Σ_{i=1}^{M} φ(x_i) and Ψ_Y = (1/N) Σ_{j=1}^{N} φ(y_j) denote the average-pooled representations for these images. The scalar product K(X, Y) = Ψ_X^T Ψ_Y may be rewritten as a Sum Match Kernel (SMK), as follows: K(X, Y) = (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} k(x_i, y_j) (1), where k(x_i, y_j) = φ(x_i)^T φ(y_j) is, by definition, a positive semi-definite (PSD) kernel and T represents the transposition operator. [0041] For example, in the case of the BOV, φ(x) is a binary vector whose size is equal to the codebook size and with a single non-zero entry at the index of the centroid closest to the descriptor x, in which case k(x, y) = 1 if x and y fall in the same Voronoi region and 0 otherwise. In another example, if k is the Gaussian kernel k(x, y) ∝ exp(−‖x − y‖²/(2σ²)), the SMK is designated the Gaussian Match Kernel (GMK). In this case, the integration φ is obtained by combining random projections with cosine non-linearities, thus leading to the EMK (see, Bo 2009). In the following discussion, the GMK is used as an example. This is because the GMK has a probabilistic interpretation that is exploited to develop the reweighting system. A criterion for calculating the weights that depends only on the kernel k between individual descriptors, and not on the integrations of individual descriptors, is described first. A criterion for calculating the weights that depends only on the integrations of individual descriptors, and not on the kernel k between individual descriptors, is then described; this criterion is interpretable in a non-probabilistic setting (and is referred to as the Direct Solution). Thus, the described weight calculation algorithm can be extended to any PSD kernel k, even one that does not have a probabilistic interpretation. From the two sets X and Y, two Kernel Density Estimates (KDE) can be derived: p(x) = (1/M) Σ_{i=1}^{M} k(x, x_i) and q(x) = (1/N) Σ_{j=1}^{N} k(x, y_j). Given two probability distributions p and q, the Probability Product Kernel (PPK) (see, T. Jebara, et al., "Probability Product Kernels", JMLR, pp.
819-844 (2004)) measures their similarity: K_ppk(p, q) = ∫ p(x)^ρ q(x)^ρ dx (2), where ρ is a parameter of the kernel. When ρ = 1, the PPK is known as the expected likelihood kernel, and ρ = 1/2 leads to the Bhattacharyya kernel. The GMK between X and Y can be written as a PPK between p and q (with ρ = 1): K_ppk(p, q) = (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} ∫ k(x, x_i) k(x, y_j) dx ∝ (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} k_G(x_i, y_j) = K_gmk(X, Y) (3). This probabilistic view of the GMK provides a means of visualizing the impact of similar descriptors. Indeed, a group of similar descriptors in X will lead to a mode in the distribution p. FIGURE 5 illustrates this effect, showing two groups of descriptors leading to a two-mode probability distribution. One way to reduce the effect of frequent descriptors is to choose values ρ < 1 in the PPK, as shown in FIGURE 5. However, this solution faces two major problems. First, for ρ < 1, the PPK between two KDEs can no longer be reduced to an SMK. In such a case, the expensive kernel K(X, Y) cannot be rewritten as an efficient dot product. In the present method, being able to write K as a scalar product between pooled representations is advantageous because it allows efficient linear classifiers to be trained on these representations. Second, in order to perfectly equalize the modes, it would be necessary to set ρ → 0. In such a case, p^ρ becomes flat and thus uninformative. To deal with the problem of frequently occurring descriptors, the exemplary method reweights their integrations. For each descriptor x_i, a weight w_i is learned, and the weighted pooled representation is Ψ = Σ_{i=1}^{M} w_i φ(x_i).
This has two major advantages over the "raising to the power ρ" alternative discussed above. First, the kernel K(X, Y) can still be expressed as a scalar product between GMP representations, facilitating efficient linear classification. Second, the modes can be equalized without flattening the entire distribution. Instead of exactly equalizing the modes, which would first require mode detection, an expensive process, the exemplary method equalizes the distribution at the location of each sample x_i. As shown in FIGURE 5, this has a similar effect (see "weighted KDE"). Given a set of samples X = {x_1, ..., x_M}, a weight vector w = [w_1, ..., w_M] is learned in such a way that, for each descriptor x_i, the sum over all area descriptors x_j (including x_i) of the weighted kernel between x_i and x_j is equal to a constant value c: Σ_{j=1}^{M} w_j k(x_i, x_j) = c for i = 1 ... M (4), where c is a constant value. It should be noted that the resulting weighted function is not necessarily a distribution, in the sense that it may not integrate to one. However, the final image representation can be l2-normalized. This is consistent with scalar-product similarity (allowing the use of linear kernel machines for classification) since it ensures that an image is closest to itself. It has also been shown that this improves results (see, F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher Kernel for Large-Scale Image Classification", ECCV, pp. 143-156 (2010), hereinafter, Perronnin 2010). Thus, it is only of interest to calculate w up to a multiplicative factor, and the value c = 1 can be chosen arbitrarily. Then, if K is the M x M kernel matrix between the individual elements x_i, w is the M x 1 weight vector, and 1_M represents the M x 1 vector of all ones, equation (4) can be rewritten as: Kw = 1_M. (5) [0050] That is, the product of the M x M kernel matrix K and the weight vector w is equal to a vector in which each element has a value of 1. (The value 1 can be replaced by any other constant value c, to produce a vector c_M.)
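Equation (4) can be illustrated directly with a one-dimensional Gaussian kernel on the five observations of FIGURE 5 (the bandwidth σ = 1 is an assumed value for illustration):

```python
import numpy as np

# Learn weights w so that the weighted KDE takes the same value c = 1
# at every sample location (equation (4)): solve K w = 1_M, where
# K_ij = k(x_i, x_j) is a Gaussian kernel matrix.
x = np.array([11.0, -10.0, 7.0, 8.0, 9.0])   # observations from FIGURE 5
sigma = 1.0                                   # assumed bandwidth
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / sigma) ** 2)
w = np.linalg.solve(K, np.ones(len(x)))

kde_at_samples = K @ np.ones(len(x)) / len(x)  # unweighted KDE values
weighted_kde_at_samples = K @ w                # equalized: all equal to 1
```

The unweighted KDE is higher at the clustered samples 7, 8, 9 than at the isolated sample -10, whereas the weighted KDE is flat (equal to c = 1) at every sample location, which is the per-sample equalization shown in FIGURE 5.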
It should be noted that equation (5) (which depends only on k) is generic and can be applied to any PSD kernel k. However, this dual formulation has two major limitations. First, its interpretability is unclear when applied beyond the GMK, since there is generally no probabilistic interpretation of the SMK. Second, it requires the calculation of the kernel between all pairs of area descriptors. This would be computationally expensive when tens of thousands of area descriptors are extracted, as is often the case. An alternative formulation of equation (5) is now given, which depends only on the integrated descriptors φ(x_i). Since K is a PSD matrix, it can be rewritten as: K = Φ^T Φ, (6) where Φ is the D x M matrix which contains the area integrations of dimension D: Φ = [φ(x_1), ..., φ(x_M)]. Thus, equation (5) is rewritten as: Φ^T Φ w = 1_M. (7) Letting Ψ = Φw = Σ_{i=1}^{M} w_i φ(x_i), i.e., Ψ is the GMP representation to be calculated, the method finds Ψ which satisfies: Φ^T Ψ = 1_M. (8) An advantage of this formulation is that it offers a matching interpretation: matching a single area integration φ(x_i) with the weighted representation Ψ should lead to a similarity equal to 1, for all descriptors x_i. Another advantage is that, instead of first calculating a set of weights and then combining the integrations by area, the image representation is obtained directly. In general, equation (8) may have no solution or it may have multiple solutions. Therefore, equation (8) is converted into a least squares regression problem: the method searches for the value of Ψ, denoted Ψ*, which minimizes the norm of Φ^T Ψ − 1_M: Ψ* = arg min_Ψ ‖Φ^T Ψ − 1_M‖², (9) with the additional constraint, in the case of multiple solutions, that Ψ* have minimal norm.
Other norms could also be used. Equation (9) has a simple closed-form solution: Ψ* = (Φ^T)^+ 1_M = (ΦΦ^T)^+ Φ 1_M, (10) where + denotes the pseudo-inverse and the second equality follows from the property A^+ = (A^T A)^+ A^T. It should be noted that Φ 1_M = Σ_{i=1}^{M} φ(x_i) is the sum-pooled vector of integrations, which is equivalent, up to a constant, to the average-pooled vector, since the final image descriptors are normalized. Thus, the exemplary weighted pooling (GMP) mechanism involves projecting the sum-pooled vector Φ 1_M through the pseudo-inverse (ΦΦ^T)^+. [0053] Since pseudo-inversion is not a continuous operation, it is generally useful to add a regularization term to obtain a stable solution for Ψ. This regularized GMP representation, denoted Ψ_λ, is: Ψ_λ = arg min_Ψ ‖Φ^T Ψ − 1_M‖² + λ‖Ψ‖², (11) where the second term is the regularization term and λ is a regularization parameter, which in the exemplary embodiment is non-negative and/or non-zero. Equation (11) is a ridge regression problem whose solution is: Ψ_λ = (ΦΦ^T + λI)^{-1} Φ 1_M, (12) where I is the identity matrix. λ can be determined by cross-validation experiments. For very large values of λ, this gives Ψ_λ ≈ Φ 1_M / λ, and the result is (up to a scaling factor) average pooling. Thus, λ does not only play a regularizing role, but also allows a smooth transition between the solution of equation (10) (λ = 0) and average pooling (λ → ∞). Therefore, in the exemplary embodiment, λ is selected to provide some influence on pooling, but not so high that average pooling is approximated. [0054] In practice, equation (12) can be computed iteratively, for example using a gradient descent method, such as conjugate gradient descent (CGD), which is designed for PSD matrices, or stochastic gradient descent. This approach can be computationally intensive if the integration dimensionality D is large and the matrix ΦΦ^T is dense. However, the calculation can be faster if the individual area integrations φ(x_i) are block-sparse.
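A direct (non-iterative) evaluation of equation (12) can be sketched as follows, with a toy Φ made of three frequent and one rare one-hot integrations (all values invented for illustration); it also exhibits the smooth transition controlled by λ:

```python
import numpy as np

def gmp(Phi, lam):
    """Regularized GMP of equation (12): Psi = (Phi Phi^T + lam I)^{-1} Phi 1_M.
    Phi is the D x M matrix of integrated area descriptors."""
    D = Phi.shape[0]
    s = Phi.sum(axis=1)                  # sum-pooled vector Phi @ 1_M
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(D), s)

Phi = np.array([[1.0, 1.0, 1.0, 0.0],    # three frequent integrations
                [0.0, 0.0, 0.0, 1.0]])   # one rare integration
psi_small = gmp(Phi, lam=1e-9)   # lam -> 0: equalized, approx [1, 1]
psi_large = gmp(Phi, lam=1e9)    # large lam: proportional to sum pooling
```

With a near-zero λ the frequent and rare dimensions are equalized, while with a very large λ the result is proportional to the sum-pooled (equivalently, average-pooled) vector, whose dimensions keep their 3:1 imbalance.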
By block-sparse, it is meant that the indices of the integration can be partitioned into a set of groups, where activation of an entry in a group implies activation of all entries in the group. This is the case, for example, for VLAD and SV, where each group of indices corresponds to a given cluster centroid. This is also the case for the FV, if a hard assignment model is assumed, where each group corresponds to the gradients with respect to the parameters of a given Gaussian. In such a case, the matrix ΦΦ^T is block-diagonal, therefore ΦΦ^T + λI is block-diagonal, and equation (12) can be solved block by block, which is significantly less demanding than solving the problem directly. [0055] The proposed GMP mechanism can be related to max pooling. Let {φ(x_l), l = 1 ... L} denote the set of descriptor integrations of a given image. It is assumed that these integrations are drawn from a finite codebook of possible integrations, φ(x_l) ∈ {q_k, k = 1 ... K}. It should be noted that the codewords q_k can be binary or real-valued. Let Q denote the D x K codebook matrix of possible integrations, where D is the output integration dimensionality. Q is assumed to be orthonormal, i.e., Q^T Q = I_K, where I_K is the K x K identity matrix. For example, in the case of the BOV (with hard assignment), D = K and the q_k are binary with only the kth entry equal to 1, so that Q = I_K. Let η_k denote the proportion of occurrences of q_k in Φ. [0056] It can be shown that Ψ* does not depend on the proportions η_k, but only on the presence or absence of each q_k in Φ. This can be proved as follows. Let Π denote the K x K diagonal matrix which contains the values η_k on the diagonal. We can write Φ 1_L = L Q Π 1_K and ΦΦ^T = L Q Π Q^T. The latter expression is an SVD decomposition of ΦΦ^T, and so (ΦΦ^T)^+ = (1/L) Q Π^+ Q^T. Thus, equation (10) becomes Ψ* = Q Π^+ Q^T Q Π 1_K = Q (Π^+ Π) 1_K. Since Π is diagonal, its pseudo-inverse is diagonal, and the values on the diagonal are 1/η_k if η_k ≠ 0 and 0 if η_k = 0.
Thus, Π⁺Π is a diagonal matrix with the kth element on the diagonal equal to 1 if n_k ≠ 0 and 0 otherwise, so that:

ψ* = Σ_{k: n_k ≠ 0} q_k,    (13)

which does not depend on the proportions but only on the presence or absence of q_k in Φ. [0058] For the BOV, equation (13) shows that ψ* is a binary representation in which each dimension indicates the presence or absence of each codeword in the image. This is exactly the max-pooled representation. GMP pooling can therefore provide a generalization of maximum pooling beyond the BOV. [0059] In the regularized case of the BOV, assuming a hard assignment, φ(x_i) is binary with a unique entry corresponding to the codeword index. Thus, Φ 1_M corresponds to the (non-normalized) BOV histogram and ΦΦᵀ is a diagonal matrix with the BOV histogram on the diagonal. In such a case, equation (12) can be rewritten as:

ψ*_λ = Φ 1_M / (Φ 1_M + λ),    (14)

where the division should be understood as a termwise operation. With λ infinitesimally small, this corresponds to the standard maximum pooling mechanism. [0060] The method illustrated in FIG. 2 can be implemented in a computer program product that can be executed on a computer. The computer program product may comprise a non-transitory computer readable recording medium on which a control program is recorded (stored), such as a disk, a hard disk, or the like. Common forms of non-transitory computer readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium, a CD-ROM, a DVD, or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, or any other memory chip or cartridge, or any other non-transitory medium that a computer can read and use. The computer program product can be integral with the computer 18 (for example, an internal hard disk or RAM), or can be separate (for example, an external hard disk functionally connected to the computer 18)
, or can be remote and accessed via a digital data network such as a local area network (LAN) or the Internet (for example, as a redundant array of independent disks (RAID) or other network server storage that is indirectly accessed by the computer 18 via a digital network). [0061] Alternatively, the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal, using transmission media such as acoustic or light waves, including those generated during radio wave and infrared data communications, and the like. The exemplary method may be implemented on one or more general purpose computers, one or more special purpose computers, a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC (application-specific integrated circuit) or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD (programmable logic device), PLA (programmable logic array), FPGA (field-programmable gate array), GPU (graphics processing unit), or PAL (programmable array logic), or the like. In general, any device capable of implementing a finite state machine which is in turn capable of implementing the flowchart shown in FIG. 2 can be used to implement the method. As will be appreciated, while the steps of the method may all be computer implemented, in some embodiments one or more of the steps may be at least partially performed manually. As will also be appreciated, the steps of the method need not all be performed in the illustrated order, and fewer, more, or different steps may be performed. Additional details on the method and system are now described by way of example.
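Returning to the hard-assignment BOV case discussed above, the termwise form of equation (14) and its two limits (maximum pooling for small λ, average pooling up to scale for large λ) can be checked with a minimal sketch; the histogram values are illustrative:

```python
import numpy as np

def gmp_bov(histogram, lam):
    """Equation (14) for hard-assignment BOV: a termwise ratio of the histogram."""
    h = np.asarray(histogram, dtype=float)
    return h / (h + lam)

h = np.array([0.0, 1.0, 7.0, 120.0])   # unnormalized BOV word counts
small_lam = gmp_bov(h, 1e-6)           # ~ binary presence/absence, i.e. max pooling
large_lam = gmp_bov(h, 1e6)            # ~ h / lam, i.e. average pooling up to scale
```

Note that the ratio equalizes frequent and rare codewords: the count 120 and the count 1 both map to values near 1 when λ is small.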
Images [0065] Images 14 may be received by the system 10 in any convenient file format, such as JPEG, GIF, JBIG, BMP, TIFF, or the like, or another common file format used for images, and may optionally be converted to another suitable format before processing. Input images may be stored in data memory 22 during processing. Images 14 may be input from any suitable image source, such as a workstation, a database, a memory storage device such as a disk, an image capture device, retrieved from a memory of the computer 34 or a web server, or the like. In general, each input digital image comprises image data for an array of pixels forming the image. The images may be individual images, such as photographs, video images, combined images, or the like. In general, each image 14 may be a digital photograph. The image data of the image may comprise colorant values, such as gray level values, for each of a set of color separations, such as L*a*b* or RGB, or may be expressed in another color space in which different colors can be represented. In general, a "gray level" refers to the optical density value of any single color channel, however expressed (L*a*b*, RGB, YCbCr, etc.), and may include values for wavelength ranges outside the normal visible range, such as infrared or ultraviolet. The exemplary image representations 12 are of fixed dimensionality, i.e., each image representation has the same number of elements. In general, each image representation has at least 30, or at least 60, or at least 100, or at least 500 dimensions, and up to 1000 dimensions or more, each dimension having a respective characteristic value; the representation can be reduced to fewer dimensions, for example, by principal component analysis (PCA). The zone extractor 40 extracts and analyzes low-level visual features of areas of the image 14, such as shape, texture, or color features, or the like.
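One concrete color feature of the kind extracted by the zone extractor 40, the 96-dimensional color descriptor described below (4×4 subregions, with the mean and standard deviation of each of the three channels per subregion: 16 × 6 = 96 values), can be sketched as follows; the patch size is illustrative:

```python
import numpy as np

def color_descriptor(patch):
    """patch : (H, W, 3) RGB area with H and W divisible by 4.
    Returns a 96-dim vector: per 4x4 subregion, mean and std of each channel."""
    H, W, _ = patch.shape
    h, w = H // 4, W // 4
    feats = []
    for i in range(4):
        for j in range(4):
            sub = patch[i * h:(i + 1) * h, j * w:(j + 1) * w, :]
            feats.extend(sub.mean(axis=(0, 1)))   # 3 channel means
            feats.extend(sub.std(axis=(0, 1)))    # 3 channel standard deviations
    return np.array(feats)                        # 16 subregions x 6 stats = 96

patch = np.random.default_rng(3).random((32, 32, 3))
desc = color_descriptor(patch)                    # shape (96,)
```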
The areas can be obtained by image segmentation, by applying specific interest point detectors, by considering a regular grid, or simply by random sampling of image areas. In the exemplary method, the areas are extracted on a regular grid, optionally at multiple scales, over the entire image, or over at least a portion or a majority of the image. For example, at least 10, or at least 20, or at least 50, or at least 200, or at least 500, or at least 1000 areas are extracted from each image. Each area may comprise at least 40, or at least 100 pixels, and up to 1,000,000 pixels or more. The descriptor extractor 42 extracts a set of low-level features in the form of a zone descriptor, such as a vector or a histogram, from each area. For example, SIFT descriptors or other intensity gradient feature descriptors may be used as the area descriptors extracted from the areas. See, for example, Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV Volume 60 (2004). In an illustrative example employing SIFT features, the features are extracted from 32x32 pixel areas on regular grids (every 16 pixels) at five scales, using 128-dimensional SIFT descriptors. Other suitable local descriptors that can be extracted include simple 96-dimensional color features, in which an area is subdivided into 4x4 subregions and, in each subregion, the mean and standard deviation are computed for the three channels (R, G and B). These are merely illustrative examples, and additional and/or other features may be used. The number of features in each local descriptor is optionally reduced, for example, to 64 dimensions, using principal component analysis (PCA). [0069] As described above, the method is applicable to a variety of integration techniques. For example: 1. The Bag of Visual Words (BOV). In this method, the area descriptors of the areas of an image are assigned to clusters.
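This assignment-and-count step can be sketched as follows, assuming a vocabulary of centroids already learned offline; the sizes are illustrative:

```python
import numpy as np

def bov_histogram(descriptors, centroids):
    """Assign each area descriptor to its nearest visual word and count occurrences.

    descriptors : (M, d) area descriptors; centroids : (K, d) K-means vocabulary.
    """
    # Squared Euclidean distances between all descriptors and all centroids.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    return np.bincount(assignments, minlength=len(centroids))

rng = np.random.default_rng(4)
vocab = rng.standard_normal((8, 16))   # 8 visual words, learned offline by K-means
descs = rng.standard_normal((100, 16))
hist = bov_histogram(descs, vocab)     # one count per visual word; sums to 100
```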
For example, a visual vocabulary is previously obtained by clustering area descriptors extracted from training images, for example using K-means clustering analysis. Each area vector is then assigned to its nearest cluster (visual word) in the previously learned vocabulary, and a histogram of the assignments can be generated by accumulating the occurrences of each visual word. For further details on the BOV integration method, see US Publication No. 20070005356, entitled GENERIC VISUAL CATEGORIZATION METHOD AND SYSTEM, US Publication No. 20070258648, entitled GENERIC VISUAL CLASSIFICATION WITH GRADIENT COMPONENTS-BASED DIMENSIONALITY ENHANCEMENT, US Publication No. 20080069456, entitled BAGS OF VISUAL CONTEXT-DEPENDENT WORDS FOR GENERIC VISUAL CATEGORIZATION, and Csurka 2004. [0072] The BOV representation can be viewed as being computed from a matching kernel which counts 1 if two local features fall into the same region partitioned by the visual words and 0 otherwise. This quantization is sometimes too coarse, which has motivated research into the design of matching kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels on large data sets because of their significant computational cost. To address this problem, efficient match kernels (EMKs) have been proposed, which map local features into a low-dimensional feature space and average the resulting vectors to form a set-level feature. The local feature maps are learned so that their inner products preserve, to the extent possible, the values of the specified kernel function. See, Bo 2009. [0073] An EMK uses explicit integration functions z(x), chosen so that a kernel k(x_i, y_j) can be approximated as k(x_i, y_j) ≈ z(x_i)ᵀ z(y_j), in order to estimate SMKs using a single dot product. For example, given two sets of elements X = {x_i; i = 1, ... M} and Y = {y_j; j = 1, ...
N}, the sum match kernel can be estimated as:

K(X, Y) = (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} k(x_i, y_j) ≈ Φ̄(X)ᵀ Φ̄(Y), where Φ̄(X) = (1/M) Σ_{i=1}^{M} z(x_i).

2. The Fisher Vector (FV). When the Fisher vector is used for the integration, it is assumed that a generative model of areas exists (such as a Gaussian mixture model (GMM)) from which all the area descriptors are emitted, and the gradient of the log-likelihood of the descriptor is measured with respect to the model parameters. The exemplary mixture model is a Gaussian mixture model (GMM) comprising a set of Gaussian functions (Gaussians) to which weights are assigned in the parameter learning. Each Gaussian is represented by its mean vector and a covariance matrix. It can be assumed that the covariance matrices are diagonal. See, for example, Perronnin, et al., "Fisher kernels on visual vocabularies for image categorization," in CVPR (2007). Each area used for training can thus be characterized by a vector of weights, one weight for each of the Gaussian functions forming the mixture model. In this case, the visual vocabulary can be estimated using the expectation-maximization (EM) algorithm. The learned GMM is intended to describe the content of any image within a range of interest. Methods for computing Fisher vectors are described more fully in US Publication No. 20120076401, published March 29, 2012, entitled IMAGE CLASSIFICATION EMPLOYING IMAGE VECTORS COMPRESSED USING VECTOR QUANTIZATION, by Jorge Sanchez, et al., US Publication No. 20120045134, published February 23, 2012, entitled LARGE SCALE IMAGE CLASSIFICATION, by Florent Perronnin, et al., in Perronnin 2010, and in Jorge Sánchez and Florent Perronnin, "High-dimensional signature compression for large-scale image classification," in CVPR 2011. [0077] As described above, in one embodiment, the GMP method comprises: [0078] 1. Learning a set of weights w, one for each descriptor (w = [w_1, ...
, w_M]) so that the sum, over all the descriptors x_j, of a weighted kernel between the descriptor x_i and the descriptor x_j is equal to a constant value c, for example, c = 1:

Σ_{j=1}^{M} w_j k(x_i, x_j) = c, for i = 1, ... M.    (15)

[0079] Each integrated descriptor is then assigned its respective weight w_i. [0080] 2. Aggregating the integrated descriptors, such as FVs, for example, as a sum over all the integrated descriptors of the products of the respective weights and the integrated descriptors (ψ = Σ_{i=1}^{M} w_i φ(x_i)). [0081] In the direct process, the pooling involves finding an image representation which optimizes equation (8), for example, by finding the image representation, denoted ψ, which minimizes the expression ‖Φᵀψ − c_M‖² + λ‖ψ‖² (equation (11)), where Φ = [φ(x_1), ..., φ(x_M)] is the D × M matrix which contains the area integrations with D dimensions, c_M is a vector in which all the values are at 1 (or some other value of c), and λ is the regularization parameter. In some embodiments, λ ranges from 0.1 to 10,000. In one embodiment, λ ranges from 1 to 1000. [0082] Once again, the image representation is a sum of the weighted area integrations, ψ = Σ_i w_i φ(x_i). In the direct method, the weights are learned implicitly, since the image representation ψ is learned directly by minimizing equation (11). In order to include spatial information concerning the image in the representation, the image may be partitioned into regions, the area statistics aggregated at a region level, and the region-level representations then concatenated to form the image representation. See, for example, S. Lazebnik, et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," CVPR '06 Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2, pp. 2169-2178 (2006). According to one exemplary embodiment, the low-level features are gradient features, such as SIFT descriptors, one per area.
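The equivalence between the weight-space formulation of equation (15), solved in ridge-regularized form, and the direct formulation of equation (11) can be checked numerically; the sizes and λ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, lam, c = 32, 12, 10.0, 1.0
Phi = rng.standard_normal((D, M))        # M embedded area descriptors phi(x_i)

# Weight-space view: solve (K + lam I) w = c 1_M with K the M x M kernel matrix
# K_ij = phi(x_i)^T phi(x_j), then pool as psi = sum_i w_i phi(x_i).
K = Phi.T @ Phi
w = np.linalg.solve(K + lam * np.eye(M), c * np.ones(M))
psi_dual = Phi @ w

# Direct view (equations (11)-(12)): psi = (Phi Phi^T + lam I)^{-1} Phi (c 1_M).
psi_primal = np.linalg.solve(Phi @ Phi.T + lam * np.eye(D), c * Phi.sum(axis=1))
# The two solutions coincide, by the standard primal/dual ridge identity
# Phi (Phi^T Phi + lam I)^{-1} = (Phi Phi^T + lam I)^{-1} Phi.
```

Which form is cheaper depends on whether M (number of areas) or D (integration dimensionality) is smaller.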
The dimensionality of these descriptors can be reduced from 128 to 32 dimensions. A visual vocabulary of 16 or 64 Gaussians is used in the GMM and only the gradient with respect to the mean parameters is considered. The image 14 can be divided into 4 regions (1 for the whole image and 3 vertical bands). In the case of 64 Gaussians, this leads to an FV with 32 × 64 × 4 = 8192 dimensions. The image representation ψ can be indexed or compressed using conventional techniques (locality-sensitive hashing (LSH), product quantization, principal component analysis (PCA), etc.) to speed up the processing performed by the component employing the representation and/or to use less data storage. An exemplary classifier is a linear classifier that computes a kernel (e.g., a dot product) between the image representation and a trained classifier. On the basis of the computed kernel, the image is assigned to a respective class, or not (a binary decision), or is assigned a probability of being in the class. The classifier may be trained by a method that includes, for each of a set of labeled training images, extracting a set of zone descriptors, as described for S104. The zone descriptors are integrated, as described for S108, using the same integration function as that selected for the input image. An image representation in the form of a multidimensional vector is generated for each training image in a first multidimensional vector space, using the GMP method as described for S110. The classifier is trained on the image representations and their respective labels. Any suitable classifier learning method may be employed which is suited to learning linear classifiers, such as logistic regression, sparse linear regression, sparse multinomial logistic regression, support vector machines, or the like. The exemplary classifier is a binary classifier, although multiclass classifiers are also contemplated.
The outputs of a set of binary classifiers can be combined to assign the image to one of a number of classes, or probabilistically over all the categories. While a linear classifier is used in the exemplary embodiment, in other embodiments a nonlinear classifier may be trained. Since it is expected that the GMP process will be most beneficial on fine-grained tasks, where the most discriminative information may be associated with only a few areas, the method was evaluated on four fine-grained image classification data sets: CUB-2010, CUB-2011, Oxford Pets, and Oxford Flowers. The 2007 PASCAL VOC dataset was also used, since it is one of the most widely used benchmarks in the image classification literature. On all these data sets, the standard training, validation and testing protocols were used. The best previously reported results that were found are mentioned below. [0090] The 2007 PASCAL VOC dataset (VOC-2007) contains 9,963 images of 20 classes (see, M. Everingham, et al., "The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results"). Performance on this dataset is measured with mean average precision (mAP). A performance of 61.7% mAP using the FV descriptor with spatial pyramids has been reported for this set. See Perronnin 2010. [0091] The CalTech UCSD Birds 2010 dataset (CUB-2010) contains 6,033 images of 200 bird categories (see, P. Welinder, et al., "Caltech-UCSD Birds 200," Technical Report CNS-TR-2010-001, California Institute of Technology, pp. 1-15 (2010)). Performance is measured with top-1 accuracy. The reported performance for the CUB-2010 dataset is 17.5% (see, A. Angelova and S. Zhu, "Efficient object detection and segmentation for fine-grained recognition," CVPR, pp. 811-818 (June 2013), hereinafter "Angelova"). This process uses sparse coding in combination with object detection and segmentation before classification; without detection and segmentation, Angelova reports that the performance drops to 14.4%.
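For reference, the per-class average precision underlying the mAP figures quoted here can be computed as follows; this is the simple non-interpolated variant, whereas the official VOC protocol uses an interpolated version:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean precision over the ranks of the positive examples,
    with examples sorted by decreasing classifier score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.flatnonzero(labels == 1)                    # ranks of positives (0-based)
    precision_at_hits = np.cumsum(labels)[hits] / (hits + 1)
    return precision_at_hits.mean()

# Positives ranked 1st and 3rd out of 4: AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([0.9, 0.6, 0.2, 0.1], [1, 0, 1, 0])
```

The mAP is then the mean of this quantity over the 20 VOC classes.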
[0092] The CalTech UCSD Birds 2011 dataset (CUB-2011) is an extension of CUB-2010 that contains 11,788 images of the same 200 bird categories (see, C. Wah, et al., "The Caltech-UCSD Birds-200-2011 Dataset," Technical Report CNS-TR-2011-001, CalTech (2011)). Performance is measured with top-1 accuracy. A reported performance for CUB-2011 is 56.8%. This was obtained using ground-truth bounding boxes and part detection (see, T. Berg and P. N. Belhumeur, "POOF: Part-based one-vs-one features for fine-grained categorization, face verification, and attribute estimation," CVPR, pp. 955-962 (2013)). Without ground-truth annotations or object localization, the performance drops to 28.2% (see, J. A. Rodriguez and D. Larlus, "Predicting an Object Location Using a Global Image Representation," ICCV, 2013). [0093] The Oxford-IIIT Pet dataset (Pets) contains 7,349 images of 37 categories of cats and dogs (see, O. M. Parkhi, et al., "Cats and Dogs," CVPR, pp. 3498-3505 (2012)). Performance is measured with top-1 accuracy. Angelova reports a performance for Pets of 54.3%. Without detection and segmentation, Angelova reports that the performance drops to 50.8%. [0094] The Oxford 102 Flowers dataset (Flowers) contains 8,189 images of 102 flower categories (see, M.-E. Nilsback and A. Zisserman, "Automated Flower Classification over a Large Number of Classes," ICVGIP, pp. 722-729 (2008)). Performance is measured with top-1 accuracy. Angelova reports a performance for Flowers of 80.7%. Without detection and segmentation, the performance drops to 76.7%. Areas are densely extracted at multiple scales, leading to approximately 10,000 descriptors per image. Two types of low-level descriptors were evaluated: 128-dimensional SIFT descriptors (see, D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV (2004)), and 96-dimensional color descriptors (see, S.
Clinchant, et al., "XRCE's participation to ImagEval," ImagEval workshop at CVIR (2007)). In both cases, their dimensionality was reduced to 64 dimensions with PCA. As mentioned above, the GMP method is general and can be applied to any aggregated representation. Having shown, and verified experimentally, the formal equivalence between GMP and standard maximum pooling in the BOV case, results for the BOV are not reported. The evaluation focuses on two aggregated representations: the EMK (see, Bo 2009) and the FV (see, F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," CVPR, pp. 1-8 (2007)). [0100] In order to compute the EMK representations, the method of Bo 2009 was followed: the descriptors were projected onto random Gaussian directions, a cosine non-linearity was applied, and the responses were aggregated. The EMK is a vocabulary-free approach that does not perform any quantization and therefore preserves tiny and highly localized image details. The EMK is thus especially relevant for fine-grained problems. However, since all the integrations are pooled together rather than within Voronoi regions as in vocabulary-based approaches, the EMK is particularly sensitive to the effect of frequent descriptors. Thus, GMP is expected to have a significant positive impact on EMK performance. There is no other method that has been applied to the EMK to counter frequent descriptors. In particular, the power normalization heuristics that are used for vocabulary-based approaches such as the BOV or the FV are not applicable. The EMK representation has two parameters: the number of output dimensions D (the number of random projections) and the bandwidth σ of the Gaussian kernel from which the random directions are drawn. The dimension D was set to 2048 for all experiments, since there was negligible improvement in performance for larger values. σ was chosen by cross-validation.
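The EMK computation just described (random Gaussian directions, cosine non-linearity, averaging) can be sketched with a fixed random-feature map in the style of Rahimi and Recht; Bo 2009 learns the feature maps instead, so this is only an illustration of the principle, with D and σ as illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
d, D, sigma = 16, 2048, 2.0                 # input dim, output dim, kernel bandwidth

# Random feature map z(x) = sqrt(2/D) * cos(Wx + b), approximating the Gaussian
# kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
W = rng.standard_normal((D, d)) / sigma
b = rng.uniform(0, 2 * np.pi, D)
z = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.standard_normal((20, d))
Y = rng.standard_normal((30, d))

# Exact sum-match kernel (1/MN) sum_ij k(x_i, y_j) ...
k = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
exact = k(X, Y).mean()
# ... versus the single dot product of the two mean-pooled feature maps.
approx = z(X).mean(axis=0) @ z(Y).mean(axis=0)
```

The single D-dimensional dot product replaces the M × N pairwise kernel evaluations, which is what makes the EMK practical on large data sets.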
The choice of λ (the GMP regularization parameter) has a significant impact on the final performance and was chosen by cross-validation from the set {10¹, 10², 10³, 10⁴, 10⁵}. Spatial pyramids were not used. Results for the baseline EMK (no attenuation of frequent descriptors) and for the EMK with the exemplary GMP method are shown in Table 1.

TABLE 1: Results using the EMK on 5 classification data sets for SIFT descriptors, color descriptors, and a late fusion of SIFT and color.

Descriptor  VOC-2007        CUB-2010        CUB-2011        Pets            Flowers
            Baseline  GMP   Baseline  GMP   Baseline  GMP   Baseline  GMP   Baseline  GMP
SIFT        42.2      46.0  2.9       6.4   5.0       10.6  21.7      35.6  41.3      52.2
Color       31.7      34.8  2.8       12.1  3.5       22.0  13.7      28.4  41.8      58.7
Fusion      43.9      49.7  3.4       12.8  5.0       24.9  22.8      42.4  54.0      70.8

[0103] As shown in TABLE 1, a significant improvement in performance, between 3% and 27%, is obtained for all data sets when using GMP. This indicates that the suppression of frequent descriptors is beneficial when using the EMK. On the fine-grained data sets, the improvements are particularly impressive, averaging 15%. In order to construct the FV, for each descriptor, the gradient of the log-likelihood with respect to the parameters of a Gaussian mixture model (GMM) was computed and the gradients were pooled. For the FV, an increase in the number of Gaussians G counters the negative effects of frequent descriptors, since fewer and fewer descriptors are assigned to the same Gaussian. It was therefore predicted that GMP would have a smaller impact than for the EMK, especially as G increases. With the exception of the VOC-2007 data, the spatial pyramids were not used. Experiments were conducted for FVs with the number of Gaussians G set to 16 or 256, resulting in vectors of 2048 dimensions and 32,768 dimensions respectively.
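The FV construction just described can be sketched as follows, restricted to the gradient with respect to the Gaussian means with diagonal covariances; the normalization by M·sqrt(π_k) follows the usual FV formulas but is illustrative here, and the gradients with respect to the weights and variances are omitted:

```python
import numpy as np

def fisher_vector_means(X, pi, mu, sigma):
    """Gradient of the mean log-likelihood w.r.t. GMM means (diagonal covariances).

    X : (M, d) area descriptors; pi : (K,) mixture weights;
    mu, sigma : (K, d) per-Gaussian means and standard deviations.
    Returns a K*d vector.
    """
    M, d = X.shape
    K = len(pi)
    # Posterior (soft assignment) gamma_i(k) for each descriptor and Gaussian.
    log_p = (np.log(pi)[None, :]
             - 0.5 * (((X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]) ** 2
                      + np.log(2 * np.pi * sigma[None, :, :] ** 2)).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)     # for numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Per-Gaussian aggregation of normalized residuals (pooling over descriptors).
    G = np.stack([
        (gamma[:, k, None] * (X - mu[k]) / sigma[k]).sum(axis=0)
        / (M * np.sqrt(pi[k]))
        for k in range(K)
    ])
    return G.ravel()                              # dimensionality K * d

rng = np.random.default_rng(7)
K, d = 16, 32
fv = fisher_vector_means(rng.standard_normal((200, d)),
                         np.full(K, 1.0 / K),
                         rng.standard_normal((K, d)),
                         np.ones((K, d)))
# 16 Gaussians x 32 dimensions = 512 values, matching the 2048-dim figure
# above once the four spatial regions and 64 Gaussians are used.
```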
Values of G of 16 and 256 were chosen so as to have a dimensionality comparable to that of the EMK representation in the first case, and to have a state-of-the-art FV representation in the latter case. The value of λ was again cross-validated from the set {10¹, 10², 10³, 10⁴, 10⁵}. 1. Power normalization baseline: the baseline method uses power normalization, a state-of-the-art, post-hoc approach for improving the pooled FV representation (Perronnin 2010). The power α in earlier evaluations has generally been set to 0.5. Here, α = 0.5 was also found to be optimal on VOC-2007 for the SIFT descriptors. However, it has been shown, in the context of image retrieval, that a lower value of α often allows significant performance gains to be obtained. The same effect was observed here for classification. Thus, the value of the parameter α was cross-validated. The following set of 5 values was evaluated: {1.0, 0.75, 0.5, 0.25, 0.0}. It should be noted that for α = 0, the method of F. Perronnin, et al., "Large-scale image retrieval with compressed Fisher vectors," CVPR, pp. 3384-3391 (2010) was used and the power normalization was applied only to the non-zero entries. The best-performing value (the value that leads to the best results on the validation set) is designated α* in Table 2. α* was determined on a per-descriptor and per-dataset basis. Thus, the α* baseline is a very competitive baseline. For example, for CUB-2011, the performance with late fusion and G = 256 increases to 29.8% from 25.4% when α = α* as opposed to α = 0.5. It should be noted that α = 1 corresponds to an unmodified FV without any power normalization. [0107] 2. GMP without power normalization: the results are shown in Table 2. The GMP approach systematically achieves a significantly better performance than no normalization (α = 1) (10% better on average for late fusion and G = 256). The improvement is particularly impressive on several of the fine-grained data sets.
For example, for CUB-2011, GMP obtains a top-1 accuracy of 30.4%, compared to 13.2% with α = 1. 3. GMP with power normalization: GMP almost always surpasses power normalization on all data sets for G = 16. The average improvement for late fusion is 2.6%. As expected, when G increases to 256, GMP has less impact, but still outperforms power normalization by 0.7% on average with late fusion. [0109] On the Flowers data set with late fusion and G = 256, 83.5% and 82.2% respectively were obtained for α* and GMP. These exceed the best value reported earlier (80.7%, Angelova). Also, on the Pets dataset with late fusion and G = 256, GMP achieves a top-1 accuracy of 55.7%, compared to 54.0% with power normalization, a performance increase of 1.7%. This is, to our knowledge, the best reported result for this dataset, exceeding the best previously reported result (54.3%, Angelova). Thus, GMP matches or exceeds the performance of the ad-hoc power normalization technique, while being more principled and more general. 4. Effect of spatial pyramids: additional experiments were performed on the VOC-2007 data set to investigate the effect of the method when using spatial pyramids (SPs). A coarse pyramid was used and 4 FVs were extracted per image: one FV for the entire image and one FV for each of three horizontal bands corresponding to the top, middle and bottom regions of the image. With SPs, GMP once again yields improvements over power normalization. For example, with late fusion and G = 256, GMP obtains 62.0% compared to 60.2% for the α* baseline, a performance increase of 1.8%. 5. Effect of the number of Gaussians G: as expected, there is a consistent and significant positive impact on performance when G is increased from 16 to 256. The GMP approach is complementary to the increase in G, since the performance is generally improved when more Gaussians are used and GMP is applied. In addition, GMP is particularly attractive when small FVs are to be used.
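The power normalization baseline used in these comparisons, applied dimension-wise as sign(z)·|z|^α, can be sketched as:

```python
import numpy as np

def power_normalize(z, alpha):
    """Dimension-wise power normalization sign(z) * |z|**alpha, with alpha in [0, 1]."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.abs(z) ** alpha

v = power_normalize([4.0, -9.0, 0.0], 0.5)   # -> [2.0, -3.0, 0.0]
```

Smaller α compresses large (frequent-descriptor-dominated) entries more aggressively, which is the heuristic effect that GMP achieves in a principled way.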
Table 2 shows the results using the FV on 5 classification data sets for SIFT descriptors, color descriptors, and a late fusion of SIFT and color. Results are shown for a number of Gaussians G = 16 and G = 256, for α = 1 (i.e., no power normalization), for α = α* (best-performing power normalization), and for the GMP approach.

TABLE 2

Descriptor  VOC-2007              CUB-2010              CUB-2011
            α=1   α=α*  GMP       α=1   α=α*  GMP       α=1   α=α*  GMP
G = 16
SIFT        48.8  51.7  52.7      3.7   6.7   6.4       7.9   11.0  11.5
Color       39.7  43.6  45.5      5.6   9.2   13.6      7.2   16.8  21.6
Fusion      52.2  55.1  56.8      5.8   10.2  14.3      10.0  18.9  22.8
G = 256
SIFT        52.6  57.7  58.1      5.3   8.1   7.7       10.2  16.3  16.4
Color       39.4  49.3  50.0      4.1   13.8  15.1      9.0   26.4  27.0
Fusion      54.8  60.6  61.6      5.9   15.3  16.7      13.2  29.8  30.4

Descriptor  Pets                  Flowers
            α=1   α=α*  GMP       α=1   α=α*  GMP
G = 16
SIFT        29.3  32.1  35.1      58.3  63.8  63.8
Color       22.6  29.1  32.5      55.3  65.3  65.9
Fusion      33.6  39.8  42.9      69.9  77.5  78.8
G = 256
SIFT        38.0  46.9  47.9      67.7  73.0  72.8
Color       23.6  41.0  41.6      63.8  74.4  72.8
Fusion      40.5  54.0  55.7      77.2  83.5  82.2

From Tables 1 and 2, it is clear that the baseline EMK results are rather poor compared to the baseline FV results. However, for CUB-2010, CUB-2011, and Pets, the GMP approach improves the EMK performance to the point that the EMK results with GMP are comparable to the FV results with GMP when G = 16 (with G = 16, the FV and EMK representations are both of size 2048). In fact, for CUB-2011, the EMK with GMP is higher than the FV with GMP for G = 16 (24.9% vs. 22.8%). [0114] The principled and general exemplary method for pooling zone-level descriptors thus equalizes the influence of frequent and rare descriptors, preserving discriminative information in the resulting aggregated representation. The generalized max pooling (GMP) approach is applicable to any SMK, and can thus be seen as an extension of maximum pooling, which can only be applied to count-based representations such as the BOV.
In-depth experiments on several public data sets show that GMP equals, and sometimes significantly outperforms, the heuristic alternatives.
Claims (3) [0001] CLAIMS 1 - A system for generating an image representation, comprising: a descriptor extractor (42) which extracts a set of area descriptors, each area descriptor being representative of the pixels in an area of an image; an integration component (44) which integrates each of the area descriptors into a multidimensional space to form a respective integrated area descriptor; a pooling component (54) which aggregates the set of integrated descriptors, wherein in the aggregation each area descriptor is weighted with a respective weight in a set of weights, the set of weights being calculated on the basis of the area descriptors for the image; and a processor (20) which implements the descriptor extractor (42), the integration component (44) and the pooling component (54). [0002] 2 - A method for generating an image representation, comprising: for each of a set of M areas of an image, extracting an area descriptor that is representative of the pixels in the area and integrating the area descriptor into a multidimensional space with an integration function to form an integrated descriptor of dimension D; with a processor (20), generating a representation of the image, comprising aggregating the integrated descriptors as ψ = Σ_{i=1}^{M} w_i φ(x_i), where ψ is the aggregated representation, φ(x_i) is one of the M integrated area descriptors and w_i represents a respective weight, the weights being selected by one of: a) finding a vector w = [w_1, ..., w_M] which minimizes the expression: ‖ΦᵀΦw − c_M‖² + λ‖w‖², where Φ is the D × M matrix which contains the integrated area descriptors of dimension D, c_M is a vector in which all the values are the same constant value, and λ is a non-negative regularization parameter; and b) finding the aggregated representation ψ which minimizes the expression: ‖Φᵀψ − c_M‖² + λ‖ψ‖² (equation (11)), where Φ is the D × M matrix which contains the integrated area descriptors of dimension D, c_M is a vector in which all the values are the same constant value, and λ is a non-negative
regularization parameter; and generating an image representation on the basis of ψ. [0003] 3 - A computer program product comprising instructions stored on a non-transitory recording medium which, when executed on a computer, cause the computer to perform the method of claim 2.
Similar technologies:
Publication number | Publication date | Title
FR3016066A1 | 2015-07-03 | WEIGHTING SYSTEM AND METHOD FOR COLLECTING IMAGE DESCRIPTORS
Yang et al. 2017 | Canonical correlation analysis networks for two-view image recognition
US9633282B2 | 2017-04-25 | Cross-trained convolutional neural networks using multimodal images
US9147132B2 | 2015-09-29 | Classification of land based on analysis of remotely-sensed earth images
US8594385B2 | 2013-11-26 | Predicting the aesthetic value of an image
Ravichandran et al. 2012 | Categorizing dynamic textures using a bag of dynamical systems
US10296846B2 | 2019-05-21 | Adapted domain specific class means classifier
FR2990035A1 | 2013-11-01 | EXTRACTION SYSTEM AND METHOD USING CATEGORY LEVEL LABELS
Li et al. 2013 | SHREC'13 track: large scale sketch-based 3D shape retrieval
JP5926291B2 | 2016-05-25 | Method and apparatus for identifying similar images
US10354199B2 | 2019-07-16 | Transductive adaptation of classifiers without source data
FR2974433A1 | 2012-10-26 | EVALUATION OF IMAGE QUALITY
KR20130142191A | 2013-12-27 | Robust feature matching for visual search
FR2955681A1 | 2011-07-29 | SYSTEM FOR NAVIGATION AND EXPLORATION OF CREATION IMAGES
US8526728B2 | 2013-09-03 | Establishing clusters of user preferences for image enhancement
US8666992B2 | 2014-03-04 | Privacy preserving method for querying a remote public service
FR2968426A1 | 2012-06-08 | LARGE SCALE ASYMMETRIC COMPARISON CALCULATION FOR BINARY INTEGRATIONS
US9600738B2 | 2017-03-21 | Discriminative embedding of local color names for object retrieval and classification
US20160103900A1 | 2016-04-14 | Data structuring and searching methods and apparatus
Chiang et al. 2009 | Region-based image retrieval using color-size features of watershed regions
US20210334531A1 | 2021-10-28 | Generating shift-invariant neural network feature maps and outputs
Abouelaziz et al. 2018 | Blind 3D mesh visual quality assessment using support vector regression
AU2015218184A1 | 2016-09-29 | Processing hyperspectral or multispectral image data
EP2839410B1 | 2017-12-13 | Method for recognizing a visual context of an image and corresponding device
US20210327041A1 | 2021-10-21 | Image based novelty detection of material samples
Patent family:
Publication number | Publication date
US9424492B2 | 2016-08-23
US20150186742A1 | 2015-07-02
Cited documents:
Publication number | Filing date | Publication date | Applicant | Patent title
US7756341B2 | 2005-06-30 | 2010-07-13 | Xerox Corporation | Generic visual categorization method and system
US7680341B2 | 2006-05-05 | 2010-03-16 | Xerox Corporation | Generic visual classification with gradient components-based dimensionality enhancement
US7885466B2 | 2006-09-19 | 2011-02-08 | Xerox Corporation | Bags of visual context-dependent words for generic visual categorization
US7885794B2 | 2007-11-30 | 2011-02-08 | Xerox Corporation | Object comparison, retrieval, and categorization methods and apparatuses
US8249343B2 | 2008-10-15 | 2012-08-21 | Xerox Corporation | Representing documents with runlength histograms
US8463051B2 | 2008-10-16 | 2013-06-11 | Xerox Corporation | Modeling images as mixtures of image models
US8150858B2 | 2009-01-28 | 2012-04-03 | Xerox Corporation | Contextual similarity measures for objects and retrieval, classification, and clustering using same
US8774498B2 | 2009-01-28 | 2014-07-08 | Xerox Corporation | Modeling images as sets of weighted features
US8280828B2 | 2009-06-12 | 2012-10-02 | Xerox Corporation | Fast and efficient nonlinear classifier generated from a trained linear classifier
US8644622B2 | 2009-07-30 | 2014-02-04 | Xerox Corporation | Compact signature for unordered vector sets with application to image retrieval
US8380647B2 | 2009-08-14 | 2013-02-19 | Xerox Corporation | Training a classifier by dimension-wise embedding of training data
US20110137898A1 | 2009-12-07 | 2011-06-09 | Xerox Corporation | Unstructured document classification
US8532399B2 | 2010-08-20 | 2013-09-10 | Xerox Corporation | Large scale image classification
US8731317B2 | 2010-09-27 | 2014-05-20 | Xerox Corporation | Image classification employing image vectors compressed using vector quantization
US8370338B2 | 2010-12-03 | 2013-02-05 | Xerox Corporation | Large-scale asymmetric comparison computation for binary embeddings
US8699789B2 | 2011-09-12 | 2014-04-15 | Xerox Corporation | Document classification using multiple views
US9075824B2 | 2012-04-27 | 2015-07-07 | Xerox Corporation | Retrieval system and method leveraging category-level labels

Cited by:
CN106537379A | 2014-06-20 | 2017-03-22 | Google Inc. | Fine-grained image similarity
US9697439B2 | 2014-10-02 | 2017-07-04 | Xerox Corporation | Efficient object detection with patch-level window processing
US10387531B1 | 2015-08-18 | 2019-08-20 | Google LLC | Processing structured documents using convolutional neural networks
US10579688B2 | 2016-10-05 | 2020-03-03 | Facebook, Inc. | Search ranking and recommendations for online social networks based on reconstructed embeddings
CN106529569B | 2016-10-11 | 2019-10-18 | Beihang University | Three-dimensional model triangular facet feature learning classification method and device based on deep learning
US10032110B2 | 2016-12-13 | 2018-07-24 | Google LLC | Performing average pooling in hardware
IE20190119A1 | 2016-12-13 | 2020-09-30 | Google Inc. | Performing average pooling in hardware
US10147019B2 | 2017-03-20 | 2018-12-04 | SAP SE | Small object detection
US10740560B2 | 2017-06-30 | 2020-08-11 | Elsevier, Inc. | Systems and methods for extracting funder information from text
US10503978B2 | 2017-07-14 | 2019-12-10 | NEC Corporation | Spatio-temporal interaction network for learning object interactions
CN109783145B | 2018-12-18 | 2022-02-08 | Pan Runyu | Method for creating multi-image-based multifunctional embedded system
Legal status:
2015-11-23 | PLFP | Fee payment | Year of fee payment: 2
2016-11-21 | PLFP | Fee payment | Year of fee payment: 3
2018-09-28 | ST | Notification of lapse | Effective date: 2018-08-31
Priority:
Application number | Filing date | Patent title
US14/141,612 (granted as US9424492B2) | 2013-12-27 | Weighting scheme for pooling image descriptors