Wednesday, July 8, 2015

Latent Dirichlet Allocation

David M. Blei, Andrew Y. Ng and Michael I. Jordan

Abstract
This paper proposes a generative model that explains why observed data share similar distributions. It assumes that several unobserved topics, or states, determine how a document is formed, which is why words with similar semantics keep appearing together.

One might think of a very similar method, pLSA, which at first glance seems capable of achieving the same effect. The key difference is that instead of generating words from a fixed topic ratio, LDA models how a "topic assortment" is itself generated: for each document it draws a topic mixture, for example "horror: 0.4, amusement: 0.2, sadness: 0.4", from a probability distribution. Documents are then generated from these mixtures.
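
The generative story above can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the Dirichlet parameter `alpha`, the topic-word table `beta`, the fixed document length and the function name `generate_document` are all made up here and are not taken from the paper.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document following the generative story described above.

    alpha: (K,) Dirichlet parameter over K topics (assumed values).
    beta:  (K, V) word distribution of each topic; every row sums to 1.
    """
    # Draw this document's own topic mixture, e.g. [0.4, 0.2, 0.4].
    theta = rng.dirichlet(alpha)
    # Pick a topic for every word position according to that mixture.
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # Pick each word from the word distribution of its topic.
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])            # hypothetical prior over 3 topics
beta = np.array([                            # hypothetical 6-word vocabulary
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
])
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
print(theta, topics, words)
```

Because `theta` is redrawn for every document, two documents from the same model can have very different topic ratios, which is the point that separates LDA from a fixed-ratio model.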

The main advantage of LDA over the older pLSA is that it is more flexible and could fit a given set of training documents better.

Contribution
  1. Proposed a novel, refined generative model that is more powerful than pLSA and LSA at modeling hidden topic distributions.
  2. Benefits natural language processing, since it allows finer modeling of data.



Tuesday, June 23, 2015

Learning Everything about Anything: Webly-Supervised Visual Concept Learning

Santosh K. Divvala, Ali Farhadi, Carlos Guestrin
University of Washington, The Allen Institute for AI 

Abstract
Recognition has ripened enough to be used in real-world applications. However, how scalable and exhaustive can it be in covering all the aspects of a single concept? And how far can human involvement be reduced? The authors propose a system capable of weakly supervised learning that harnesses web data.

To learn a model of a concept, the visual space of its variance is first retrieved. Then, a model is trained to deal with the intra-concept variance. To retrieve the visual space, the authors use Google Books Ngrams to obtain possible variations of a concept, say, 'horse'. To handle the unavoidable noise in the data, a weak classifier is trained for each variation, with the intuition that a meaningful aspect possesses some visual salience that the model can recognize. Models trained on noisy variations therefore score relatively low and are filtered out.

Among the remaining aspects, some are visually similar, so training a model for each would be wasteful. The authors therefore construct a graph in which each node represents an aspect and each edge shows the similarity between the linked aspects. The edge weight Eij is the AP (average precision) obtained when the weak model trained on j is used to classify i. Through this procedure, several superngrams are obtained, and a strong model is trained for each.


Contributions
  1. Proposed a system that can learn every aspect of any given concept with almost zero human involvement. To date, models for 50000 variations of 150 concepts are available.
  2. The performance is almost as high as that of supervised methods.
  3. Could be harnessed further to solve major NLP problems such as coreference resolution, which determines whether two textual mentions refer to the same entity.

Saturday, June 13, 2015


Text Understanding from Scratch 
Xiang Zhang,  Yann LeCun

Abstract

Text understanding has always been a difficult problem due to the variability of language, and researchers have traditionally handled it statistically. Machine learning approaches meet several obstacles, such as word morphology and ambiguous chunking, which confine the resulting work to one specific language.

Inspired by recent successes with word vectors, which represent each word as a fixed-length vector, the authors combine this concept with temporal convolutional neural networks and propose a method that claims to handle several tasks without any prior knowledge of words, phrases or sentences.

The authors first encode each character in a sentence as a fixed-length vector using the one-hot method: they choose an alphabet of, say, m characters including the 26 English letters, and for each character set the corresponding dimension of an m-length vector to one and all others to zero. The input is therefore a matrix of size m * n, where n is the number of characters.
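
The encoding step can be sketched as follows. The alphabet below is an illustrative choice, not the one used in the paper, and the function name `one_hot_encode` and the decision to map unknown characters to an all-zero column are assumptions of this sketch.

```python
import numpy as np

# Illustrative alphabet: 26 letters, 10 digits and the space character.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "

def one_hot_encode(text, alphabet=ALPHABET):
    """Encode text as an (m, n) matrix: m alphabet symbols, n characters."""
    text = text.lower()
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(alphabet), len(text)), dtype=np.float32)
    for pos, ch in enumerate(text):
        row = index.get(ch)
        if row is not None:          # characters outside the alphabet stay all-zero
            encoded[row, pos] = 1.0
    return encoded

matrix = one_hot_encode("ab!")
print(matrix.shape)                  # (m, n) = (len(ALPHABET), 3)
```

Each column is the one-hot vector of one character, so a temporal convolution can slide along the second axis.
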
After the system overview, experimental results on four tasks (ontology classification, sentiment analysis, answer topic classification and news categorization) are presented, showing that this method outperforms several other methods. The authors also show that the technique can be applied to Chinese by representing Chinese characters with Pinyin, that is, by using their pronunciation instead of their written form.

Contributions
  1. Proposed a method that applies temporal convolutional neural networks to text-understanding tasks, showing that building a system from scratch without prior understanding of words, phrases and sentences is viable.
  2. Showed that the method can be applied to languages other than English and suggested several directions for future work based on the paper, such as chunking, named entity recognition and POS tagging.

Sunday, June 7, 2015


Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, et al.

Abstract

This paper discusses how artificial neural networks can be, and have been, used in speech-related tasks and shows that in many cases they already outperform the conventional GMM-HMM method by a large margin.

After briefly reviewing the traditional GMM approach, the authors state its shortcoming: it is inefficient at modeling data that lie on or near a manifold. Researchers believe that neural networks model this kind of data space better. The training procedure is then introduced: each layer is pre-trained with a Restricted Boltzmann Machine or a denoising auto-encoder, and the layers are stacked to form a deep network.

The authors then discuss how a DNN can be used with an HMM, either by simply using its output as a new kind of input feature for the HMM, or by treating its output as the probability of each HMM state given the input features, whether these are MFCCs, a commonly used feature, or others. Some groups have even applied convolutional neural networks directly to the spectrogram or to the output of mel filters.
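
When the network output is read as a state posterior p(s | x), an HMM decoder still needs something proportional to p(x | s). By Bayes' rule this is p(s | x) / p(s), so hybrid systems commonly divide by the state prior. The sketch below shows only that conversion; the function name and the numbers are made up for illustration.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors):
    """Turn DNN state posteriors p(s | x) into log p(x | s) up to a constant.

    posteriors:   (n_frames, n_states) softmax outputs of the network.
    state_priors: (n_states,) relative state frequencies in the training data.
    """
    return np.log(posteriors) - np.log(state_priors)

posteriors = np.array([[0.5, 0.5], [0.9, 0.1]])  # made-up network outputs
state_priors = np.array([0.25, 0.75])            # made-up state frequencies
scores = scaled_log_likelihoods(posteriors, state_priors)
print(scores)
```

The dropped term log p(x) is the same for every state in a frame, so it does not change which state sequence the decoder prefers.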

Next, the authors examine real cases from several groups and state that across all sorts of tasks, neural networks have been shown to be the state-of-the-art method and could be exploited further for future leaps. However, the authors also list several obstacles that must be overcome before the full power of NNs can be used, including the limits on parallelizing their training.

Contributions
1. Summarizes the methods now widely used to harness neural networks in speech-related tasks.
2. Discusses the merits and shortcomings of NNs when applied to these tasks and the hindrances that must be resolved to exploit their power further.

Tuesday, June 2, 2015


Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley 


Abstract

Conventionally, visual recognition, whether object recognition, classification or similar, has been primarily based on SIFT and HOG, which aim at finding distinctive local parts in the hope that these aid the task at hand. However, the biological visual system is more of a sequential, hierarchical process, which inspired the authors to harness a neural network, in a system termed R-CNN (Regions with CNN features), for object detection in this work.

The proposed system is composed of several components. First, given an image, around 2000 object proposals are produced using selective search. Next, each proposal is fed into a convolutional neural network to extract features, which are then used as input to several SVMs, each trained to classify a specific object, say, airplane. Combining the output of all SVMs, a proposal is classified as one of the objects or as background.
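
The last step, combining the per-class SVM outputs, might look like the sketch below. It assumes the proposals and their CNN features already exist; the function name, the tiny feature vectors and the rule "best positive score wins, otherwise background" are simplifications chosen for illustration, and any further post-processing of overlapping detections is left out.

```python
import numpy as np

BACKGROUND = -1  # label used in this sketch for "no object"

def classify_proposals(features, svm_weights, svm_biases):
    """Score every proposal with one linear SVM per class.

    features:    (n_proposals, d) CNN features, one row per proposal.
    svm_weights: (n_classes, d) weight vector of each class SVM.
    svm_biases:  (n_classes,) bias of each class SVM.
    """
    scores = features @ svm_weights.T + svm_biases      # (n_proposals, n_classes)
    best_class = scores.argmax(axis=1)
    # A proposal that no SVM scores above zero is treated as background.
    labels = np.where(scores.max(axis=1) > 0, best_class, BACKGROUND)
    return labels, scores

features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # toy 2-d features
svm_weights = np.array([[1.0, -1.0], [-1.0, 1.0]])          # toy SVMs for 2 classes
svm_biases = np.array([-0.5, -0.5])
labels, scores = classify_proposals(features, svm_weights, svm_biases)
print(labels)
```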


After the introduction to the overall design, experimental results are given. The authors show that when training data are scarce, a NN pre-trained on a similar domain can be fine-tuned on the scarce data, yielding a significant performance boost. One could even use the pre-trained NN without fine-tuning by taking the output of the last convolutional layer as input features to the SVMs (taking the output of the fully connected layers, or fcs, yields worse results), suggesting that conv layers act as feature extractors and fcs as classifiers. Fine-tuning is then like teaching the NN to apply the generality of its conv layers to the target task.

The experiments show that a higher accuracy (54%) is reached compared with that of pre-defined features (around 35%). In addition, not only is the required storage greatly reduced thanks to a more compact feature representation, but the computation is also two orders of magnitude faster.


Contributions
  1. Showed that a NN can be applied to object proposals to accomplish object detection and segmentation.
  2. When training data are scarce, one can fine-tune an auxiliary pre-trained model to obtain a significant performance gain.

Saturday, May 30, 2015


ImageNet Classification with Deep Convolutional Neural Networks 

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton 

Abstract
This paper documents how a recently very popular topic arose: by using a large-scale neural network with the aid of the fast-growing computing power of GPUs, the authors managed to achieve an error rate around 20% lower than that of the then best model in ILSVRC.

The authors then go through the important features of the network. The first is ReLU, which, unlike traditional activation functions, is non-saturating and therefore speeds up training several times because the gradient is not confined to a small range. The second is local response normalization, inspired by behavior observed in real neurons, which adjusts the response of a kernel using the responses of neighboring kernels; this aids generalization and lowers the error rate by about 1%. The third is overlapping pooling: instead of applying the pooling window without overlap, the authors move it by a stride s that is smaller than the window size z, which is found to make the network less prone to overfitting.
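
Two of these ideas are easy to show in isolation. The sketch below implements ReLU and a pooling routine with window size z and stride s; it works on a 1-D signal for brevity (the network pools 2-D feature maps), and the function names and input values are made up.

```python
import numpy as np

def relu(x):
    """Non-saturating activation: negative inputs become 0, positive ones pass through."""
    return np.maximum(x, 0.0)

def max_pool_1d(x, z, s):
    """Max pooling with window size z and stride s; windows overlap when s < z."""
    starts = range(0, len(x) - z + 1, s)
    return np.array([x[i:i + z].max() for i in starts])

signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 6.0])
overlapping = max_pool_1d(signal, z=3, s=2)      # windows share one element
non_overlapping = max_pool_1d(signal, z=3, s=3)  # windows are disjoint
print(relu(np.array([-1.0, 0.0, 2.0])), overlapping, non_overlapping)
```

With s < z, neighboring outputs share part of their input window, which is the only difference from the usual non-overlapping scheme.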

The authors then present the overall structure: five convolutional layers followed by three fully connected layers, the last of which maps the 4096-dimensional features to 1000 outputs, the number of categories, and feeds a softmax. The authors also give advice on further preventing overfitting through data augmentation (producing more training data by cropping and flipping a given image) and dropout (randomly setting the output of a neuron to zero with a given probability, which is somewhat like ensembling several models).
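
Both regularizers can be sketched with NumPy. This is a training-time illustration only: the function names, the image size and the crop size are assumptions, and whatever rescaling is applied at test time to compensate for dropped units is not shown.

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Zero each activation independently with probability p_drop."""
    keep_mask = rng.random(activations.shape) >= p_drop
    return activations * keep_mask

def random_crop_and_flip(image, crop, rng):
    """Take a random crop x crop patch and mirror it horizontally half the time."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop + 1)
    left = rng.integers(0, width - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]       # horizontal flip
    return patch

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, p_drop=0.5, rng=rng)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)    # toy 8x8 RGB image
patch = random_crop_and_flip(image, crop=6, rng=rng)
print(dropped.mean(), patch.shape)
```

Every call produces a different mask or patch, so the network rarely sees exactly the same input or the same active units twice.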

Finally, the authors present experimental results on ILSVRC 2010 and 2012, where the network significantly lowers the error rate.

Contributions
  1. Trained the largest NN at the time and applied it to image classification.
  2. Proposed several methods to deal with the highly likely overfitting problem.
  3. The network contains several new features that greatly accelerate training.


Tuesday, May 12, 2015

Story-Driven Summarization for Egocentric Video 
Zheng Lu and Kristen Grauman 

Abstract

As cameras and media storage continue to grow, the authors state, their usage will become more and more ubiquitous. With recorded videos surging in number and length, it becomes impossible for a human to view every detail from start to end, and hence techniques to analyze and summarize these videos become more significant than ever.

This paper targets summarizing egocentric videos taken by wearable devices or, as the authors note, robots, producing a shortened clip from a long video without losing much information and context. Traditionally, papers handling this issue focus on selecting high-quality subshots while putting little effort into inter-shot relationships; they therefore sometimes lose the context of how one shot transitions to the next and often include too many redundant subshots.

This paper hence focuses on "telling the story" of the clip by selecting the chain of subshots that maximizes a three-part objective function consisting of story, importance and diversity. Of the three parts, story is the essence of the paper: it uses the relationships between objects to measure how closely subshots are linked. Briefly, objects are first detected in the subshots, and a bipartite directed graph is constructed connecting subshots and objects, with weights denoting the probability of an object given a subshot or vice versa. On this graph, a random walk is run to measure the closeness of a pair of subshots, based on the intuitive assumption that if two subshots A and B are highly correlated, walks starting from A are highly likely to end at B.
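
The random-walk intuition can be illustrated with a small matrix computation. This is a simplified sketch, not the paper's exact formulation: the edge weights are derived here from a made-up table of object detection counts, every subshot and object is assumed to appear at least once, and closeness is read off as the probability of landing on a subshot after a fixed number of subshot-object-subshot round trips.

```python
import numpy as np

def subshot_closeness(counts, n_round_trips=1):
    """Probability that a walk starting at subshot A ends at subshot B.

    counts: (n_subshots, n_objects) how often each object appears in each subshot.
    """
    # Edge weights subshot -> object: P(object | subshot), rows sum to 1.
    p_obj_given_shot = counts / counts.sum(axis=1, keepdims=True)
    # Edge weights object -> subshot: P(subshot | object), rows sum to 1.
    p_shot_given_obj = (counts / counts.sum(axis=0, keepdims=True)).T
    # One round trip: subshot -> object -> subshot.
    round_trip = p_obj_given_shot @ p_shot_given_obj
    return np.linalg.matrix_power(round_trip, n_round_trips)

counts = np.array([
    [2.0, 1.0, 0.0],   # subshot 0 shows objects 0 and 1
    [1.0, 2.0, 0.0],   # subshot 1 shows the same objects
    [0.0, 0.0, 3.0],   # subshot 2 shows a different object
])
closeness = subshot_closeness(counts)
print(closeness)
```

Subshots 0 and 1 share objects, so a walk from one often ends at the other, while subshot 2 is unreachable from either.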

With the story term and the other two, the best chain of subshots can be retrieved. Through experiments the authors show that subjects preferred their summaries to those of three baselines in a blind test in which each subject was given summaries produced by different methods.

Contributions
  1. Inspired by previous work on summarizing news using text, the authors successfully adapt the concept to egocentric video summarization.
  2. Proposed an objective function that accounts for how well a generated summary preserves the context of the video.
  3. Proposed a segmentation method tailored to egocentric videos, which copes with the lack of the sharp shot boundaries usually used to obtain subshots.