Tuesday, June 23, 2015

Learning Everything about Anything: Webly-Supervised Visual Concept Learning

Santosh K. Divvala, Ali Farhadi, Carlos Guestrin
University of Washington, The Allen Institute for AI 

Abstract
Recognition has matured enough to be used in real-world applications. However, how scalable and exhaustive can it be in covering all aspects of a single concept? And how far can human involvement be reduced? The authors propose a weakly supervised learning system that harnesses web data.

To learn a model of a concept, the visual space of its variations must first be retrieved; a model is then trained to handle the intra-concept variance. To retrieve this space, the authors use Google Books Ngrams to obtain the possible variations of a concept, say, 'horse'. To handle the unavoidable noise in the data, a weak classifier is trained for each variation, on the intuition that a meaningful aspect is visually salient enough for a model to recognize it. Models trained on noisy variations therefore score relatively low and are filtered out.

Among the remaining aspects, some are visually similar, so training a separate model for each would be wasteful. The authors therefore construct a graph in which each node represents an aspect and each edge captures the similarity between the two aspects it links. The edge weight Eij is the AP (average precision) obtained when the weak model trained on j is used to classify i. This procedure yields several superngrams, and a strong model is trained for each.
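The edge weights and the grouping step can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the AP threshold, and the use of simple threshold-based merging (union-find) are all assumptions made here, and the paper's actual selection procedure may differ.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP of a ranking: mean of precision@k taken at each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def group_aspects(edge_ap: Dict[Tuple[str, str], float],
                  threshold: float) -> List[List[str]]:
    """Merge aspects joined by an edge whose AP is at least `threshold`.

    Hypothetical stand-in for forming superngrams: a plain union-find
    over the similarity graph, chosen here only for illustration.
    """
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for (i, j), ap in edge_ap.items():
        root_i, root_j = find(i), find(j)
        if ap >= threshold and root_i != root_j:
            parent[root_j] = root_i

    groups: Dict[str, List[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())
```

For example, with a threshold of 0.5, an edge of AP 0.9 between "jumping horse" and "horse jumping" merges the two into one group, while an edge of AP 0.2 to "horse head" leaves that aspect on its own.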


Contributions
  1. Proposes a system that can learn every aspect of any given concept with almost no human involvement. To date, models for 50000 variations of 150 concepts are available.
  2. Performance is almost as high as that of the supervised method.
  3. Could be harnessed further to solve some major NLP problems such as coreference resolution, that is, deciding whether two textual mentions actually refer to the same entity.

Saturday, June 13, 2015


Text Understanding from Scratch 
Xiang Zhang, Yann LeCun

Abstract

Text understanding has always been a difficult problem because of the variability of language, and researchers have traditionally handled it statistically. Machine learning approaches run into obstacles such as word morphology and ambiguous chunking, which confine the resulting effort to one specific language.

Inspired by recent successes with word vectors, which represent each word as a fixed-length vector, the authors combine this idea with temporal convolutional neural networks and propose a method that they claim handles several tasks without any prior knowledge of words, phrases or sentences.

The authors first encode each character in a sentence as a fixed-length vector using one-hot encoding: they choose an alphabet of, say, m characters, including the 26 English letters, and set the dimension of the m-length vector that corresponds to the character to one and all others to zero. The input is therefore a matrix of size m * n, where n is the number of characters.
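The encoding above can be sketched as follows. This is only an illustration: the alphabet (lowercase letters plus digits), the fixed `length` argument, and the choice to map unknown characters to all-zero rows are assumptions made here, not details taken from the summary. The matrix is stored with one row per character, so it is the transpose of the m * n layout described above.

```python
import string
from typing import List

# Assumed alphabet for illustration only: 26 lowercase letters plus digits.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text: str, length: int) -> List[List[int]]:
    """Return a `length` x m matrix with one one-hot row per character.

    Text is lowercased, then truncated or zero-padded to `length` rows.
    Characters outside the alphabet become all-zero rows (an assumption).
    """
    m = len(ALPHABET)
    rows = []
    for ch in text.lower()[:length]:
        row = [0] * m
        index = CHAR_INDEX.get(ch)
        if index is not None:
            row[index] = 1
        rows.append(row)
    # Pad short inputs with all-zero rows so every input has the same shape.
    rows.extend([0] * m for _ in range(length - len(rows)))
    return rows
```
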
After the system overview, experimental results are presented on four tasks: ontology classification, sentiment analysis, answer topic classification and news categorization. They show that the method outperforms several other methods. The authors also show that the technique can be applied to Chinese by representing Chinese characters in Pinyin, that is, by using their pronunciation instead of their written form.

Contributions
  1. Proposed a method that applies a temporal convolutional neural network to text-understanding tasks, showing that building a system from scratch, without prior knowledge of words, phrases and sentences, is viable.
  2. Showed that the method is not limited to English and suggested several directions for future work based on the paper, such as chunking, named entity recognition and POS tagging.

Sunday, June 7, 2015


Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, et al.

Abstract

This paper discusses how artificial neural networks can be, and have been, used in speech-related tasks, and shows that in many cases they already outperform the conventional GMM-HMM method by a large margin.

After briefly reviewing the traditional GMM approach, the authors state its main shortcoming: it is poor at modeling data that lie on a manifold. Researchers believe that neural networks are better at modeling this kind of data space. The paper then introduces how such networks are trained, starting with pre-training each layer as a Restricted Boltzmann Machine or a noise-tolerant auto-encoder and then stacking those layers to form a deep network.

The authors then discuss how a DNN can be combined with an HMM in two ways: either its output is simply used as a new kind of input feature for the HMM, or its output is taken as the probability of each HMM state given the input features, whether these are MFCCs (a commonly used feature) or something else. Some groups have even applied convolutional neural networks directly to the spectrogram or to the output of mel filters.
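For the second option, an HMM decoder needs the likelihood of the features given a state rather than the state posterior. A common way to bridge the two, sketched below, is Bayes' rule: divide the posterior by the state prior, which gives the likelihood up to a factor that does not depend on the state. The function name and the `floor` value are assumptions for illustration; the priors are assumed to be state frequencies estimated from training data.

```python
import math
from typing import List, Sequence


def scaled_log_likelihoods(posteriors: Sequence[float],
                           priors: Sequence[float],
                           floor: float = 1e-10) -> List[float]:
    """Return log p(s|x) - log p(s) for each HMM state s.

    By Bayes' rule this equals log p(x|s) minus log p(x), a constant
    that is the same for every state within one frame.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    # `floor` (hypothetical value) guards against log(0).
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]
```

Rare states receive a boost under this scaling: a state with posterior 0.5 and prior 0.25 scores higher than one with the same posterior and prior 0.75.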

Next, the authors examine real cases from several research groups and state that, across all sorts of tasks, neural networks have been shown to be the state-of-the-art method and could be exploited further for future gains. However, the authors also list several obstacles that must be overcome before the full power of neural networks can be used, including their limits when it comes to parallelization.

Contributions
  1. Summarized the methods now widely used to harness neural networks in speech-related tasks.
  2. Discussed the merits and shortcomings of neural networks when applied to these tasks, and the obstacles that must be solved before their power can be exploited further.

Tuesday, June 2, 2015


Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley 


Abstract

Conventionally, visual recognition, whether object recognition, classification or a related task, has been based primarily on SIFT and HOG, which aim to find distinctive local parts in the hope that these will help the task at hand. The biological visual system, however, works more like a sequential or hierarchical process, which inspired the authors to harness a neural network for object detection in this work. They call the approach R-CNN (Regions with CNN features).

The proposed system is composed of several components. First, given an image, around 2000 object proposals are produced using selective search. Next, each proposal is fed into a convolutional neural network to extract features, which are then used as input to several SVMs, each trained to classify one specific object, say, airplane. By combining the outputs of all SVMs, a proposal can be classified as one of the objects or as background.
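The final step, combining the per-class SVM outputs for one proposal, can be sketched as below. This is a simplified illustration under stated assumptions: the SVMs are taken to be linear, represented as hypothetical (weights, bias) pairs, and a proposal is labeled background when no class scores above zero. Feature extraction and any post-processing of overlapping detections are left out.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical representation of one linear SVM: (weights, bias).
LinearSVM = Tuple[Sequence[float], float]


def svm_score(features: Sequence[float], svm: LinearSVM) -> float:
    """Signed score of a linear SVM: w . x + b."""
    weights, bias = svm
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify_proposal(features: Sequence[float],
                      svms: Dict[str, LinearSVM]) -> str:
    """Return the best-scoring class, or "background" if none is positive."""
    if not svms:
        return "background"
    scores = {name: svm_score(features, svm) for name, svm in svms.items()}
    best = max(scores, key=scores.get)
    # Assumed decision rule: a non-positive best score means no object.
    return best if scores[best] > 0 else "background"
```

Here `features` stands for the CNN feature vector of a single proposal, so the function would be called once for each of the roughly 2000 proposals in an image.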


After introducing the overall design, the authors give experimental results. They show that when training data are scarce, a network pre-trained on a similar domain can be fine-tuned on the scarce data, yielding a significant performance boost. One can even use the pre-trained network without fine-tuning, taking the output of the last convolutional layer as input features to the SVMs (taking the output of the fully connected layers yields worse results). This suggests that the convolutional layers act as feature extractors and the fully connected layers as classifiers. Fine-tuning can then be seen as teaching the network to apply the generality of its convolutional layers to the target task.

The experiments show that the method reaches a higher accuracy (54%) than methods using pre-defined features (around 35%). In addition, the storage required is greatly reduced thanks to a more compact feature representation, and computation is two orders of magnitude faster.


Contributions
  1. Showed that a neural network can be applied to object proposals to accomplish object detection and segmentation.
  2. Showed that when training data are scarce, fine-tuning an auxiliary pre-trained model yields a significant performance gain.