2015年7月8日 星期三

Latent Dirichlet Allocation

David M. Blei, Andrew Y. Ng and Michael I. Jordan

Abstract
This Paper proposed a new generative model that extrapolates how observed data could have similar distribution. Basically, it assumes that there exist several unseen topics or states that determines how a document or an article is formed, that is why certain words of similar semantics keep emerging. 

One might think of a pretty similar method, pLSA, which without further investigation, seems capable of achieving the same effect. However, the most differentiable point is that instead of generating words based on fixed topic ratio, LDA kinds of model how a "topic assortment" is generated, that is it produces a topic instance for example, "horror: 0.4, amusement: 0.2, sadness: 0.4" based on probability. And with these distribution, documents are then generated.

The main advantage of LDA over rather old-fashioned pLSA is that it possesses more flexibility and could fit better to a given set of training documents.

Contribution
  1. Proposed a novel and refined generative model compared to pLSA and LSA, which is more powerful modeling hidden topic distribution.
  2. Boosts Natural Language Processing, since it allows finer modeling of data.



2015年6月23日 星期二

Learning Everything about Anything: Webly-Supervised Visual Concept Learning

Santosh K. Divvala,, Ali Farhadi, Carlos Guestrin
University of Washington, The Allen Institute for AI 

Abstract
Recognition has ripped to be used on real-world application. However, how scalable and exhaustive could it be to cover all the aspects of a single concept? Also, to what point could human involvement be lessened? The author proposed a system capable of weakly supervised learning harnessing web data.

To learn a model of a concept, visual space of variance is first to retrieved. Then, a model is trained to deal with the intra-concept variance. So, to retrieve the visual space, the authors utilized Google Books Ngrmas to obtain possible variances of a concept, say, 'horse'. And to handle the unavoidable noiseness of the data, a weak classifier is trained for each variance with the intuition that meaningful aspect somehow processes saliency that could be recognized by the model. Thus, model trained on noisy ones will score relatively low, hence, be filtered out.

Moving on, within these left aspects, some are visually similar and thus training a model for each would be wasting. Thus the author constructed a graph, where each node represents a aspect and edge showing the similarity between the linked aspects. The edge weight Eij is the AP using the weak model trained on j to classify i. Through this procedure, several superngrams could be obtained each for which a strong model is trained. 


Contributions
  1. Propose a system that could learn every aspects given any concept with almost zero human involvement. To date, models for 50000 variations of 150 concepts are available.
  2. The performance is almost as high as the supervised method.
  3. Could be harnessed further to solve some major NLP problems such as coreference resolution, where two textual mentions are actually refer to the same entity.

2015年6月13日 星期六


Text Understanding from Scratch 
Xiang Zhang,  Yann LeCun

Abstract

Text understanding has always been a difficult problem due to the variability in language formation and traditionally researchers handled this in a statistical fashion. And when resorted to machine learning approach, several obstacles would be met such as word morphologism and ambiguous chunking, making effort made confined to that specific language. 

Inspired by recent successes made using word vector, that is representing each word with a fixed-length vector, the authors combined this concept with temporal convolution neural network and proposed a method claiming to achieve several tasks without any prior knowledge about words, phrases or sentences.

The authors first encoded each character in a sentence to a fixed-length vector using one-hot method, that is the author chose say, m characters including 26 English letters and set the corresponding dimension of that m-length vector to be one while others zero. That means the input vector being a matrix of length m * n, where n is the number of characters.
After system overview, experiment results on four tasks, including ontology classification, sentiment analysis, answer topic classification and news category are presented, showing that this method outperformed several methods. Also, the authors showed that this technique could be applied on Chinese as well by representing Chinese character with PinYin, that is to use their audio information instead of the from itself.

Contributions
  1. Proposed a method using Temporal Convolutional Neural Network to text-understanding tasks, showing that building a system from scratch without understanding words, phrases and sentences priorly is viable. 
  2. Showed that the method could be applied on not just English and suggested several future works that could be based on the paper, such as chunking, Named Entity Recognition and POS tagging... 

2015年6月7日 星期日


Deep Neural Networks for Acoustic Modeling
in Speech Recognition 
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, et. al

Abstract

This paper discussed about how artificial neural network could and has been used in speech related task and showed that in many cases it has already outperformed the conventional GMM-HMM method by a huge margin. 

Briefing the traditional GMM way, the authors stated the shortcoming of it, that is its incompetence of modeling data lying on a manifold. And researchers believe that neural networks has better performance on modeling this kind of data space. Following, the way of training neural network is introduced, starting from using Restricted Boltzmann Machine or noise tolerant auto-encoder to pre-train each layer and stacked those layers to form deep network. 

The authors then talked about how DNN could be used with HMM by either simply use output of it as a new kind of input features to HMM or take those as probability of certain state given the input features, whether it's MFCC, a common used features or others. Further, some groups even tried Convolution Neural Network directly on spectrogram or output of mel filters. 

Next, the authors examined some real cases mainly from several groups and stated that in all sorts of tasks, neural networks has been shown to be the state-of-the-art method and could be further exploited to future leap. However, the authors also listed several obstacles that should be fixed before we could make the full use of NN's power including its limit when it comes to parallelization. 

Contributions
1. Summarizing the methods now largely used in speech-related tasks when it comes to harnessing neural networks
2. Discussed about the merits and shortcomings of NN when applied on these tasks and the hinder awaiting solutions for further exploiting its power 

2015年6月2日 星期二


Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick Jeff Donahue Trevor Darrell Jitendra Malik 
UC Berkeley 


Abstract

Conventionally, visual recognition, no matter it's object recognition, classification or so, is primarily based on SIFT and HOG, which aims at finding local parts that are rather different or variant, hoping these would aid the tasks performed. However, biological visual system is more like a sequential or hierarchical process, inspiring the authors to harness neural network, termed R-CNN(Regional CNN) to facilitate object detection in this work.

The proposed system is composed of several components. First, given a image, around 2000 object proposals are produced harnessing selective search. Next, each proposal is fed into convolution neural network to get features, which would then be used as input to several SVMs, each trained for classifying a specific object, say, airplane. Combining output from all SVMs, a proposal could be classified as one of the objects or background.


After introduction to the overall design, experiment results are given. The authors showed that when training data are scarce, a pre-trained NN with similar domain could be fine-tuned using these scarce data, yielding significant performance boost. Even, one could simply use the pre-trained NN without fine-tuning it by taking the output from the last convolution layer as input features to SVM (taking that of fcs yields worst result), demonstrating that conv layers are like feature extractors and fcs classifiers. And by fine-tuning, it's like we're teaching NN to apply its generality of convs to the targeted task. 

The experiments showed a higher accuracy(54%) is reached compared to that using pre-defined features(around 35%). Also, not only the storage required is largely reduced due to a more compact feature representation but the computation time is two-orders of magnitude faster. 


Contributions
  1. show that NN could be harnessed on object proposals to accomplish object detection and segmentation
  2. When training data is scarce, one could fine-tune auxiliary pre-trained model to obtain a significant performance enhancement

2015年5月30日 星期六


ImageNet Classification with Deep Convolutional Neural Networks 

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton 

Abstract
This paper basically documented how the recently extremely popular topic arose -- how by using large scaled neural network with the aid of fast-growing computing power of GPU Alex managed to achieved an error rate around 20% lower compared to the then best model in ILSVRC. 

The author then went through all the important features and topics associated with this net. Starting from ReLU, which compared to the traditional activation function, is non-saturating, and thus could speed up the training speed several times due to not confine the gradient in a range. Following is about local response normalization, observed in and inspired by  the real neuron, is to take values of neighbors of a certain kernel into consideration and refine its value, which aids generalization and lowers the error rate by 1%. The next one is about overlapped pooling, which instead of applying kernel non-overlappingly, the authors moved the kernel by a step s, which is smaller than the kernel size, say z, which is proved to be less prone to overfitting.

And then, the author presented with the overall structure with five convolution layers followed by three fully connected layer with the last one equipped with a softmax layer to transform dimensionality from 4096 to 1000, the number of categories. The author also gave advices on how to further prevent overfitting by using data augmentation (producing more training data by cropping and flipping a given image) and dropout (randomly setting the value of a neuron to zero with a given probability, which is quite like ensembling several models). 

Finally the author presented the experiment results on ILSVRC 2010, and 2012, which is shown to significantly lower the error rate. 

Contributions
  1. Trained the largest NN then and applied it on image classification
  2. Proposed several methods to deal with the highly possible overfitting problems 
  3. The network contains several new features, which hugely accelerate the training process


2015年5月12日 星期二

Story-Driven Summarization for Egocentric Video 
Zheng Lu and Kristen Grauman 

Abstract

As cameras and media storage continues to grow, the author stated, the usage will become more and more ubiquitous. With these upwelling recorded videos whose length is ever increasing, it becomes impossible for human to view every single detail from start to end and hence techniques to analyze and summarize these videos become more significant than ever. 

This paper targets at summarizing eco-centric videos taken by wearable devices or as the authors stated, robots, producing a shortened clip given a long video without losing much information and context. Traditionally, papers handling this issue often focuses on selecting high-quality subshots, while putting little effort on inter-shot relationship and hence sometimes loses the context as how one shot transitions to the next and often includes too many redundant subshots. 

This paper hence focuses on "telling the story" out of the clip by selecting the best chain of subshots that maximizes a three-part objective function, consisting of story, importance and diversity. Of the three part, story is the essence of this paper -- through using relationship between objects to analyze the intimacy of subshots. Briefly, objects are first detected in subshots and a bipartite directed graph is constructed connecting subgraphs and objects with the weights denoting the probability the object given a certain scene or vice versa.  With the graph, random walk is initiated to get the closeness of a pair of subshots based on an intuitive assumption that if two subshots A and B are highly correlated, walks starting from A will be highly likely to end at B. 

With the story part and the other two, the best chain of subshots could be retrieved. And through experiments the author showed that subjects preferred their summarization to other three baselines in a blind test where each subject is given summarizations derived using different methods. 

Contributions
  1. Inspired by a previous work targeting at summarizing news using text, the authors successfully transform it and use the concept on video ego-centric summarization.
  2. Proposed a objective function that considers the preserved context of a video in a generated summarizing clip.
  3. Proposed a segmentation method tailored to ego-centric videos which deals with lack of sharp distinction often used to get subshots.

2015年5月5日 星期二

Probabilistic Latent Semantic Analysis

Thomas Hofmann

Abstract
Learning from text is not only the most challenging task but also of crucial significance in Machine Learning and AI, which any breakthrough would make a huge leap in the sphere. And to extract information from text or natural language, understanding the actual or semantic meaning is a must. Conventionally, Latent Semantic Analysis is applied to do the job, which harnesses SVD to reduce the data dimensionality, hence mapping them to a new feature space, in which hopefully, axises are of semantic meanings.

This paper proposes a new way of handling the issue. Viewing from a different angle, the author sees the problem from a statistical point of view, harnessing probabilistic model to tackle the problem. The method is called Probabilistic Latent Semantic Analysis, which different to LSA, it assumes that given the hidden latent topic, the probability of a document is independent to the probability of a certain word. 

Harnessing typical EM procedure, a new space could be "learnt" and hence a lower dimensional representation of a data could be obtained. Conceptually, a document could be represented by basis, meaning the probability of this document is of this latent topic. Combining with the probability of a word given a certain topic, we could obtain the probability of a word given a certain document.



Contributions 
  1. Proposed a new method to extract semantic components from text and natural language.
  2. Not only is the new method more compact, the accuracy is also higher.
  3. Combining with annealing, which is often used in Machine Learning area, the performance could be further enhanced.

2015年4月20日 星期一

Online Dictionary Learning for Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce and Guillermo Sapiro

Abstract

In sparse coding, where each training example is represented by sparse linear combination of basic elements, which forms a dictionary, a learned dictionary, is shown to have better performance over a pre-defined one. Though, evidences have demonstrated that learned dictionaries are crucial to achieve state-of-art results, the optimization problem one has to deal with is often the bottle neck and limits the more ubiquitous utilization. 

The authors proposed a online learning method, which hopefully solved problems often encountered in several dictionary learning algorithms such as too big a training set and a dynamic changing one. Basically, it harnesses the stochastic gradient descent, which is shown to often be the fastest in terms of convergence though not good conventionally, by taking an example (or a tiny set) a time. 


In a typical dictionary learning, one tries to minimizes the differences between training example and its sparse representation as shown in the first equation, which is equivalent to solving the one below, where D is the dictionary and alpha a slack variable ensuring sparsity. In the paper, the authors proposed a method iteratively solving D and alpha given the other fixed, which in the paper is proven of convergence. The authors also suggested several side methods to boost the performance.

Contributions
  1. Propose a method that is faster in both small and large training set than previous methods and show it could be used on real case such as removing imprinting of images.
  2. Cast dictionary learning problem as a optimization of a non-convex objective function over a convex set

2015年4月13日 星期一

To aggregate or not to aggregate: Selective match kernels for image search

Giorgos Tolias, Yannis Avrithis, Herve ́ Je ́gou

  Abstract

This paper focused on improving visual recognition accuracy, no matter it's objects, locations or scenes. The ubiquitous approach targeted at this problem is local descriptor and most of them are derived from BoW, which seeks to quantize the obtained descriptors into several discrete visual words and hence represent each image as a vector, which could be further combined with machine learning techniques such as SVM to achieve fair results. 

These methods, however, could be categorized into two "genres", of which are matching based and aggregated approaches. The authors hence analyzed them and proffered a new method, which is unprecedented in that it took the merits from these two and combine them to yield a result whose performance is even higher than that of state-of-art approaches. 

The matching based methods, in some sense, utilizes the descriptors independently, while the aggregated approach, indicated by its name, coalesced several descriptors, which if properly set not only preserves some individual information, but also has the advantage of using less memory. 

In the experiment, through setting different parameters, the authors showed that it's worth combing matching based and aggregated approaches to yield superior results.

Contribution
  1. Analyze the difference between these two seemingly different approaches 
  2. Combine flawlessly these two, proposing a method that is not only efficient in terms of memory consumption but also yields a better results.
Questions
  1. Not sure what it meant to have multiple assignments. And to "replicate each descriptor vector and assign each instance to a different visual word" ?
  2. The meaning of selectivity and threshold? Is it parameters to dictate how local descriptors are chosen or rendered?

2015年4月11日 星期六

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Sam T. Roweis1 and Lawrence K. Saul

Abstract

In this paper, the authors proposed a method, which could help dimension reduction. They begun saying since in lots of different spheres, professions often encounter tons of data, which if no further processing, are normally of hundreds if not thousands of dimensions, hindering scientists to efficiently deal with them. Of this reason, the importance of dimension reduction can't be overemphasized. However, during this process, similarities, distances or certain characteristics are apt to lose or wrongly preserved. Hence, the method proffered in this paper utilized local areas that are overlapped to preserve the traits shown in the original dimensionality after reduced to a lower one.

First, several common methods harnessed in dimension reduction are discussed, during which, the author stated that several methods simply use the Euclidian distance, which could produce misleading result when dealing with distribution like manifold. Though, some methods such as one employing the shortest path could in some way avoid this nuisance, their problem is the time-consuming dynamic programming required to compute the shortest-path distance.  

The authors hence proposed the method which harnesses local information to grasp the global one. Through representing each data point by the K nearest neighbors, each with a weight controlling how important that neighbor is when reconstructing the point, the authors managed to preserve the correlations between points and their neighbors after applying dimension reduction. Besides, by using overlapped local information, this method not only gets away with the rather heavy computation in shortest path distance, but also successfully model the non-linear characteristics of the whole distribution via these linear representations. 

Contributions

  1. Proposing a method preserving meaningful information in each axises of its reduced dimensionality, which has the merits of not only efficiency but also efficacy.
  2. Combining with other feature processing and learning schemas, the method will yield even better performance. 


2015年3月24日 星期二

Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval

Yunchao Gong, Svetlana Lazebnik, Albert Gordo, Florent Perronnin 

Abstract

This paper proposes a method to learn similarity-preserving binary codes which is largely used in similar images search in large-scale dataset. When applying binary encoding on images, one aims to minimizes the error or quantization error between the transformed and the original point representing the image. This paper presents a method through which the transformed high-dimensional data points and the axises are iteratively "rotated" to achieve the optimum binary representation. 

With no big difference to commonly applied methods, data points are first applied upon PCA, which serves to reduce the dimensionality of the data points. After PCA operation, initial axises are obtained, which quantized each data point into a binary code. Then, an novel algorithm called "Iterative Quantization" are applied on the incipient binary codes to minimize the error, or distance to the original point. 

Conceptually, what ITQ does is to try find a transformation matrix that will transform the binary codes to achieve the minimum error given the axises are fixed. After acquiring the new points, the next step comes into play, in which those points are fixed while axises are shifted to find, again, the optimum configuration. By iteratively conducting such operation, according to the paper, for about 50 rounds, the performance will be close to that in the convergent state, which is shown to take, though, much time compared to other methods, still applicable in real settings. 

Novelties and Discoveries
  1. These methods performs well especially when the reduced dimensionality is small, that is, when number of bits used to represent the original data points are rather small. 
  2. It could be combined with supervised method such as Canonical Correlation Analysis(CCA) to achieve better performance if the ground-truth labels are provided.
Questions
  1. Say, the target is to maximize the variances of each bits(According to the paper, exactly half of the data points are on one side while the others the other side), then how come Figure 3. that of ITQ is the lowest compared to RR and even PCA?

2015年3月18日 星期三

Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman 

Abstract

The target of this paper is to enhance the accuracy concerning search an object inside a video. It proposed a method which is, in some sense, pretty much like bag of words model. However, tf-idf, which is often used in text search is fused in to further consider the differentiating power of a certain word. Also, this paper also takes spatial information into consideration by a method called "spatial consistency vote".

Specifically, several frames are randomly selected from a training video, each of which yields a SIFT descriptor of 128 vector, which according to the paper, is designed to be invariant to a slight shift. Since this kind of shifts often happen in video, SIFT is considered superior to several other descriptors. Having acquired all the SIFT descriptors, namely visual word, mechanism similar to tf-idf is applied to the set removing those frequently appearing in all frames. 

With all the left visual words, a given testing image is represented. However, to take the spatial information into consideration, "spatial consistency voting" is conducted, of which the essence is support of other visual words within the nearby area. For instance, if a word is found both in training and testing image, the k-nearest visual words would come into play. if any of these also appears in both images then the center visual word get a support count. This technique, in the experiments, is showed to largely boost the accuracy by removing false positives.


Contribution and Novelty
  1. Propose a method taking spatial consistency into consideration called "spatial consistency vote"
  2. Apply the concept of stop-word removal, effectively remove visual words of low or even useless differentiating ability.



2015年3月14日 星期六

Fine-Grained Crowdsourcing for Fine-Grained Recognition

Jia Deng, Jonathan Krause, Li Fei-Fei 
Computer Science Department, Stanford University 


Summary

This paper presents a game capable of collecting the pivotal features when differentiating between two species of birds and an algorithm which further harness the gleaned information, that is, "bubble"s, in this case, to help boost success classification rate.

During the game, players are presented with two "clear" pictures telling them how a certain species looks and a blurred image which they have to classify into one of the two species. The user could uncover certain parts of the picture to ascertain the result. However, each revelation diminish the total score, making the player parsimoniously reveal the truly essential parts, during which the most substantial "bubbles" are obtained. 

Now, having a bunch of bubbles collected from the "training phase", that is, all the games played, concerning the most differentiating features between pairs of two species, they then create a detector represented by one or more descriptors for each bubble and apply it on the testing images to classify a given image. Here, they assume spatial prior when applying detectors, that is, since the probability of an arbitrary feature appears in roughly the same region in pictures of the same species with quite high probability, they could simply apply the detector to that area.

Contributions and Novelties
  1. Present a interactive game that can not only be utilized to gather consequential features when telling two species apart, but also fulfill entertaining and recreational purpose.
  2. The game is domain agnostic, that is, it could be applied on sorts of different fine-grained classification problems and get results that are warranted by the mechanism and design of the game.
Confusions
  1. Why is it ok to assume the spatial prior, that is, why is it assured that the bubbles will appear in roughly the same location in different pictures of even same species?
  2. What does it mean to convolve the descriptor with densely sampled patches and take the maximum response? Does it mean to in some way, retrieve several parts of the testing image and apply the detector on each of them and get the maximum score?