Wednesday, July 8, 2015

Latent Dirichlet Allocation

David M. Blei, Andrew Y. Ng and Michael I. Jordan

Abstract
This paper proposes a generative model that explains why observed documents share similar word distributions. It assumes that a set of unobserved (latent) topics determines how a document is formed, which is why words with related meanings keep appearing together.

One might think of a closely related method, pLSA, which at first glance seems capable of achieving the same effect. The key difference is that pLSA treats the topic proportions of each training document as fixed parameters, whereas LDA also models how the "topic assortment" itself is generated: for each document it draws a topic mixture, for example "horror: 0.4, amusement: 0.2, sadness: 0.4", from a Dirichlet distribution. The words of the document are then generated from that mixture.
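The generative story above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the vocabulary, the values of `alpha` and `beta`, and the helper name `generate_document` are all made up for the example, and the document length is fixed here instead of being drawn from a Poisson distribution as in the paper.

```python
import numpy as np

# Hypothetical toy setup: three topics and a six-word vocabulary.
topic_names = ["horror", "amusement", "sadness"]
vocab = ["ghost", "scream", "clown", "laugh", "tears", "grief"]

# alpha: Dirichlet parameter controlling how topic mixtures are drawn (assumed values).
alpha = np.array([0.5, 0.5, 0.5])

# beta[k] is the word distribution of topic k; each row sums to 1 (assumed values).
beta = np.array([
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],  # horror
    [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],  # amusement
    [0.025, 0.025, 0.025, 0.025, 0.45, 0.45],  # sadness
])


def generate_document(alpha, beta, n_words, rng):
    """Sample one document following the LDA generative process."""
    # 1. Draw the per-document topic mixture theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    # 2. For each word position, draw a topic z_n ~ Multinomial(theta).
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # 3. For each position, draw a word w_n from the chosen topic's distribution.
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
theta, topics, words = generate_document(alpha, beta, n_words=10, rng=rng)
print("topic mixture:", dict(zip(topic_names, theta.round(2))))
print("document:", " ".join(vocab[w] for w in words))
```

Each call draws a fresh topic mixture, so two documents generated with the same `alpha` and `beta` can still have very different topic proportions. That per-document draw is the step pLSA does not model.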

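The main advantage of LDA over the older pLSA is flexibility: because the topic mixture is a random variable rather than a parameter tied to each training document, LDA can assign probability to documents it has not seen, and its number of parameters does not grow with the size of the training corpus, which makes it less prone to overfitting.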
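The difference between the two models can be made concrete by writing out their likelihoods. The notation follows the LDA paper: $\theta$ is the per-document topic mixture, $z_n$ the topic of the $n$-th word $w_n$, $\alpha$ the Dirichlet parameter, and $\beta$ the per-topic word distributions. In pLSA, $d$ is the index of a training document.

```latex
% pLSA: the topic proportions p(z | d) are parameters attached to each training document d
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)

% LDA: the topic mixture \theta is drawn from a Dirichlet prior and integrated out
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```

In the pLSA expression the term $p(z \mid d)$ exists only for documents in the training set, while the LDA expression depends only on $\alpha$ and $\beta$ and therefore applies to any new document.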

Contribution
  1. Proposes a novel generative model that refines pLSA and LSA and is more powerful at modeling hidden topic distributions.
  2. Benefits natural language processing, since it allows finer-grained modeling of text data.