Tuesday, May 5, 2015

Probabilistic Latent Semantic Analysis

Thomas Hofmann

Abstract
Learning from text is one of the most challenging tasks in machine learning and AI, and also one of the most important: any breakthrough would be a major leap for the field. Extracting information from text or natural language requires understanding its semantic meaning. Conventionally, Latent Semantic Analysis (LSA) is used for this. It applies SVD to reduce the dimensionality of the data, mapping it to a new feature space whose axes, one hopes, carry semantic meaning.

This paper proposes a new way of handling the problem. The author takes a statistical point of view and tackles it with a probabilistic model. The method is called Probabilistic Latent Semantic Analysis (PLSA). Unlike LSA, it assumes that a document and a word are conditionally independent given the hidden latent topic.
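To make the independence assumption concrete, the model can be written out as below. The symbols are my own shorthand rather than anything defined in this summary: d is a document, w a word, and z a latent topic.

```latex
% Conditional independence of d and w given the latent topic z:
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

% Equivalent form, obtained with Bayes' rule P(z) P(d \mid z) = P(d) P(z \mid d):
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)
```

The second form is the one that reads most naturally as "a document is a mixture of topics".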

Using a standard EM procedure, the new space can be "learnt", which yields a lower-dimensional representation of the data. Conceptually, a document is represented by its coordinates over the latent topics, that is, the probability of each latent topic given the document. Combining these with the probability of a word given a topic gives the probability of a word given a document.
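The EM procedure described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: the function name `plsa`, the toy `counts` matrix, and the choice to parameterize the model by P(z|d) and P(w|z) are all assumptions of mine.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a toy PLSA model with plain EM (hypothetical helper).

    counts: (n_docs, n_words) matrix of word counts n(d, w).
    Returns P(z|d) with shape (n_docs, n_topics)
    and P(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialization, normalized so each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior[d, w, z] = P(z | d, w), proportional to P(z|d) P(w|z).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        posterior = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, :, None] * posterior  # n(d, w) * P(z | d, w)
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z

# Toy corpus: 4 documents over a 6-word vocabulary (made-up counts).
counts = np.array([
    [4, 3, 2, 0, 0, 1],
    [3, 5, 1, 0, 1, 0],
    [0, 1, 0, 4, 3, 2],
    [0, 0, 1, 5, 2, 3],
], dtype=float)

p_z_d, p_w_z = plsa(counts, n_topics=2)

# P(w|d) = sum_z P(w|z) P(z|d): each row is a word distribution for one document.
p_w_d = p_z_d @ p_w_z
```

Each row of `p_z_d` is the low-dimensional representation of one document, and `p_w_d` is the smoothed word distribution obtained by mixing the topics.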



Contributions 
  1. Proposes a new method for extracting semantic components from text and natural language.
  2. The resulting representation is more compact, and the reported accuracy is higher than that of LSA.
  3. Combined with annealing, a technique widely used in machine learning, the performance can be improved further.
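
As I understand the annealing in item 3, it amounts to tempering the E-step of EM with an inverse temperature β ≤ 1, so that the topic posteriors are smoothed and overfitting is reduced. A sketch of that tempered posterior, in my own notation (d a document, w a word, z a latent topic), would be:

```latex
% Tempered E-step: \beta = 1 recovers the standard EM posterior,
% while \beta < 1 flattens it.
P_{\beta}(z \mid d, w) =
  \frac{P(z)\,\bigl[P(d \mid z)\, P(w \mid z)\bigr]^{\beta}}
       {\sum_{z'} P(z')\,\bigl[P(d \mid z')\, P(w \mid z')\bigr]^{\beta}}
```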
