Tuesday, March 24, 2015

Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval

Yunchao Gong, Svetlana Lazebnik, Albert Gordo, Florent Perronnin 

Abstract

This paper proposes a method for learning similarity-preserving binary codes, which are widely used for similar-image search in large-scale datasets. When encoding images as binary codes, one aims to minimize the quantization error between the transformed point and the original point representing the image. The paper presents a method in which the transformed high-dimensional data points and the axes are iteratively "rotated" against each other to reach the best binary representation.

As in commonly used methods, PCA is first applied to the data points to reduce their dimensionality. After PCA, the initial axes are obtained, and they quantize each data point into a binary code. Then a novel algorithm called "Iterative Quantization" (ITQ) is applied to the initial binary codes to minimize the quantization error, that is, the distance to the original point.

Conceptually, ITQ alternates between two steps. With the axes (the rotation) fixed, each projected point is assigned the binary code that gives the minimum error. With the codes fixed, the axes are then rotated to find, again, the configuration with the minimum error. According to the paper, about 50 iterations bring the performance close to that of the converged state. Training takes longer than other methods but is still practical in real settings.
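The alternation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `itq`, its parameters, and the random orthogonal initialization are assumptions made for the example. The rotation update is the standard orthogonal Procrustes solution obtained from an SVD.

```python
import numpy as np

def itq(X, n_bits, n_iter=50, seed=0):
    """Illustrative ITQ sketch: PCA, then alternate code and rotation updates."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                        # zero-center the data

    # PCA: keep the n_bits directions of largest variance.
    cov = X.T @ X / X.shape[0]
    _, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :n_bits]              # top n_bits eigenvectors
    V = X @ W                                     # projected data

    # Assumed initialization: a random orthogonal matrix from a QR decomposition.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))

    for _ in range(n_iter):
        # Step 1: rotation fixed, pick the nearest binary code for each point.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Step 2: codes fixed, solve the orthogonal Procrustes problem
        # min_R ||B - V R||_F subject to R being orthogonal.
        U, _, Vt = np.linalg.svd(V.T @ B)
        R = U @ Vt

    B = np.where(V @ R >= 0, 1.0, -1.0)           # final codes in {-1, +1}
    return B, W, R
```

A new point would be encoded by subtracting the training mean, multiplying by `W` and `R`, and taking the sign. Each of the two steps can only lower the quantization error, which is why the error does not increase over iterations.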

Novelties and Discoveries
  1. The method performs especially well when the reduced dimensionality is small, that is, when the number of bits used to represent the original data points is small.
  2. It can be combined with a supervised method such as Canonical Correlation Analysis (CCA) to achieve better performance when ground-truth labels are available.
Questions
  1. Suppose the goal is to maximize the variance of each bit (according to the paper, exactly half of the data points fall on one side and the rest on the other). Why, then, is the variance of ITQ in Figure 3 the lowest, compared with RR and even PCA?

Wednesday, March 18, 2015

Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman 

Abstract

The goal of this paper is to improve the accuracy of searching for an object in a video. It proposes a method that is, in some sense, much like the bag-of-words model. However, tf-idf, which is commonly used in text search, is incorporated to account for the discriminative power of each word. The paper also takes spatial information into account through a method called "spatial consistency vote".

Specifically, several frames are randomly selected from a training video, and the regions detected in each frame are described by 128-dimensional SIFT descriptors, which, according to the paper, are designed to be invariant to slight shifts. Since such shifts often occur in video, SIFT is considered superior to several other descriptors. Once all the SIFT descriptors have been collected and turned into visual words, a mechanism borrowed from text retrieval, similar in spirit to tf-idf, is applied to the set to remove the words that appear frequently in all frames.

A given test image is then represented with the remaining visual words. To take spatial information into account, "spatial consistency voting" is carried out, the essence of which is support from other visual words in the nearby area. For instance, if a word is found in both the training and the test image, its k nearest visual words come into play: if any of these also appears in both images, the central visual word receives a support vote. In the experiments, this technique is shown to boost accuracy considerably by removing false positives.
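The removal of frequent words and the tf-idf weighting can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function name `tfidf_with_stop_list`, the `stop_fraction` parameter, and the frames-by-words count matrix are hypothetical, and the weighting uses the standard text-retrieval definition of tf-idf, which may differ in detail from the one used in the paper.

```python
import numpy as np

def tfidf_with_stop_list(counts, stop_fraction=0.05):
    """counts[d, i] = occurrences of visual word i in frame d (assumed input)."""
    counts = np.asarray(counts, dtype=float)
    n_frames, n_words = counts.shape

    # Stop list: drop the most frequent visual words over all frames.
    totals = counts.sum(axis=0)
    n_stop = int(round(stop_fraction * n_words))
    order = np.argsort(-totals, kind="stable")    # most frequent first
    keep = np.ones(n_words, dtype=bool)
    keep[order[:n_stop]] = False
    kept = counts[:, keep]

    # Term frequency: share of each word among the words left in the frame.
    words_per_frame = kept.sum(axis=1, keepdims=True)
    tf = np.divide(kept, words_per_frame,
                   out=np.zeros_like(kept), where=words_per_frame > 0)

    # Inverse document frequency: words found in few frames weigh more.
    frames_with_word = (kept > 0).sum(axis=0)
    idf = np.log(n_frames / np.maximum(frames_with_word, 1))

    return tf * idf, keep
```

The returned weight vectors can be compared between a query and each frame, for example by normalized dot product, to rank frames.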
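The voting idea described above can be sketched as follows. The function name, the inputs (feature positions and a list of word matches), and the default `k` are assumptions made for illustration; the paper's exact neighbourhood definition and scoring may differ.

```python
import numpy as np

def spatial_consistency_votes(query_xy, frame_xy, matches, k=15):
    """Count, for each match, the other matches that fall in its neighbourhood.

    query_xy, frame_xy: (n, 2) arrays of feature positions in each image.
    matches: list of (query_index, frame_index) pairs sharing a visual word.
    """
    query_xy = np.asarray(query_xy, dtype=float)
    frame_xy = np.asarray(frame_xy, dtype=float)

    def nearest(points, idx):
        # Indices of the k spatially nearest features, excluding idx itself.
        dist = np.linalg.norm(points - points[idx], axis=1)
        dist[idx] = np.inf
        neighbours = set(np.argsort(dist)[:k].tolist())
        neighbours.discard(idx)
        return neighbours

    votes = []
    for qi, fi in matches:
        q_near = nearest(query_xy, qi)
        f_near = nearest(frame_xy, fi)
        # A second match supports this one if it lies near it in both images.
        support = sum(1 for qj, fj in matches
                      if (qj, fj) != (qi, fi) and qj in q_near and fj in f_near)
        votes.append(support)
    return votes
```

Matches with zero votes are natural candidates for rejection as false positives, and the sum of votes can serve as a score for re-ranking frames.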


Contribution and Novelty
  1. Proposes a method, called "spatial consistency vote", that takes spatial consistency into account.
  2. Applies the concept of stop-word removal to effectively remove visual words with little or no discriminative power.



Saturday, March 14, 2015

Fine-Grained Crowdsourcing for Fine-Grained Recognition

Jia Deng, Jonathan Krause, Li Fei-Fei 
Computer Science Department, Stanford University 


Summary

This paper presents a game that collects the pivotal features for differentiating between two species of birds, together with an algorithm that harnesses the collected information ("bubbles", in this case) to boost classification accuracy.

During the game, players are shown two "clear" pictures that tell them what each species looks like, and a blurred image that they must classify as one of the two species. A player can uncover certain parts of the blurred image to reach a decision. However, each reveal reduces the total score, so players reveal only the truly essential parts, and in this way the most informative "bubbles" are obtained.

The bubbles collected in the "training phase", that is, from all the games played, mark the most discriminative features between pairs of species. For each bubble, the authors create a detector represented by one or more descriptors and apply it to the test images to classify them. They assume a spatial prior when applying the detectors: since a given feature appears in roughly the same region across pictures of the same species with fairly high probability, the detector can simply be applied to that area.
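One possible reading of applying a bubble detector under a spatial prior is sketched below: slide the bubble's template over patches near its expected location and keep the maximum response. Everything here is an assumption made for illustration, including the function name, the use of raw grayscale pixels as the descriptor, and cosine similarity as the response; the paper uses its own descriptors.

```python
import numpy as np

def bubble_response(image, template, expected_top_left, search_radius, stride=1):
    """Maximum template response over patches near the expected location."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    t = template.ravel()
    t = t / (np.linalg.norm(t) + 1e-12)           # unit-length template

    ey, ex = expected_top_left
    height, width = image.shape
    # Spatial prior: restrict the search to a window around the expected spot.
    y0, y1 = max(0, ey - search_radius), min(height - th, ey + search_radius)
    x0, x1 = max(0, ex - search_radius), min(width - tw, ex + search_radius)

    best = -np.inf
    for y in range(y0, y1 + 1, stride):           # densely sampled patches
        for x in range(x0, x1 + 1, stride):
            patch = image[y:y + th, x:x + tw].ravel()
            patch = patch / (np.linalg.norm(patch) + 1e-12)
            best = max(best, float(t @ patch))    # cosine similarity
    return best
```

Collecting one such response per bubble gives a feature vector for the test image, which a standard classifier could then use.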

Contributions and Novelties
  1. Presents an interactive game that not only gathers the features that matter when telling two species apart but is also entertaining.
  2. The game is domain agnostic: it can be applied to various fine-grained classification problems, and the quality of the results is backed by the mechanism and design of the game.
Confusions
  1. Why is it acceptable to assume the spatial prior? That is, why is it guaranteed that the bubbles appear in roughly the same location in different pictures, even of the same species?
  2. What does it mean to convolve the descriptor with densely sampled patches and take the maximum response? Does it mean extracting several patches from the test image, applying the detector to each of them, and taking the maximum score?