Wednesday, March 18, 2015

Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman 

Abstract

This paper aims to improve the accuracy of searching for an object in a video. It proposes a method that closely resembles the bag-of-words model from text retrieval. In addition, tf-idf weighting, which is widely used in text search, is incorporated to account for how discriminative each word is. The paper also takes spatial information into account through a method called "spatial consistency voting".

Specifically, several frames are randomly selected from a training video, and each region detected in these frames yields a 128-dimensional SIFT descriptor which, according to the paper, is designed to be invariant to a slight shift in position. Since such shifts often occur in video, SIFT is considered superior to several other descriptors. The collected SIFT descriptors are then quantized into clusters, each of which serves as a visual word. A mechanism similar to stop-word removal in text retrieval discards the visual words that appear frequently in all frames, and the remaining words are weighted by tf-idf.
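As a rough illustration of the stop-list and tf-idf steps, the Python sketch below treats each frame as a list of visual word ids. The function names, the `top_fraction` parameter and the toy frames are assumptions made for this example, and the weighting is the standard tf-idf form from text retrieval, not a transcription of the paper's implementation.

```python
import math
from collections import Counter


def build_stop_list(frames, top_fraction=0.05):
    """Return the most frequent visual words (by total count) as a stop list.

    top_fraction is an illustrative parameter, not a value taken from the post.
    """
    totals = Counter(word for frame in frames for word in frame)
    n_stop = int(len(totals) * top_fraction)
    return {word for word, _ in totals.most_common(n_stop)}


def tfidf_vectors(frames, stop_words=frozenset()):
    """Compute one sparse tf-idf vector (dict: word -> weight) per frame."""
    # Drop stop-listed words before any counting.
    frames = [[w for w in frame if w not in stop_words] for frame in frames]
    n_frames = len(frames)
    # Document frequency: in how many frames does each word occur?
    doc_freq = Counter(w for frame in frames for w in set(frame))
    vectors = []
    for frame in frames:
        counts = Counter(frame)
        n_words = len(frame)
        vectors.append({
            w: (c / n_words) * math.log(n_frames / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


# Toy data: three frames, visual words are small integers.
frames = [[1, 1, 2, 3], [1, 2, 2, 4], [1, 5, 5, 5]]
stop_words = build_stop_list(frames, top_fraction=0.2)
vectors = tfidf_vectors(frames, stop_words)
```

In the toy data, word 1 occurs in every frame, so its idf term is log(3/3) = 0 even without a stop list. The stop list simply removes such words outright so they never enter the index.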

A given test image is then represented with the remaining visual words. To take spatial information into account, "spatial consistency voting" is performed, the essence of which is support from other visual words in the nearby area. For instance, if a word is found in both the training and the test image, its k nearest visual words come into play: each neighbor that is also matched in the nearby area of the other image adds one support vote to the central visual word. In the experiments, this technique is shown to substantially boost accuracy by removing false positives.
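The voting idea can be sketched in a few lines of Python. This is a minimal version under stated assumptions: features are given as 2D positions, `matches` lists index pairs of features that share a visual word, and the helper names and the brute-force neighbor search are illustrative choices, not the paper's implementation.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def spatial_consistency_votes(query_pts, frame_pts, matches, k=15):
    """Count, for each match, the other matches that support it.

    matches is a list of (query_index, frame_index) pairs. Another match
    supports the current one if its query feature lies among the k nearest
    neighbors in the query image AND its frame feature lies among the
    k nearest neighbors in the frame.
    """
    votes = {}
    for qi, fi in matches:
        near_q = k_nearest(query_pts, qi, k)
        near_f = k_nearest(frame_pts, fi, k)
        votes[(qi, fi)] = sum(
            1 for qj, fj in matches if qj in near_q and fj in near_f
        )
    return votes


# Toy data: three matches form a consistent cluster, one is isolated.
query_pts = [(0, 0), (1, 0), (0, 1), (20, 0), (21, 0)]
frame_pts = [(5, 5), (6, 5), (5, 6), (30, 0), (31, 0), (0, 30)]
matches = [(0, 0), (1, 1), (2, 2), (3, 5)]
votes = spatial_consistency_votes(query_pts, frame_pts, matches, k=2)
# votes == {(0, 0): 2, (1, 1): 2, (2, 2): 2, (3, 5): 0}
```

Matches that receive zero votes, such as `(3, 5)` above, are the likely false positives and can be discarded before ranking frames.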


Contribution and Novelty
  1. Proposes "spatial consistency voting", a method that takes spatial consistency into account.
  2. Applies the concept of stop-word removal, effectively discarding visual words with little or no discriminative power.


