Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.
LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words.
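The Dirichlet distribution is a distribution over distributions: each sample drawn from it is itself a probability vector, with non-negative entries that sum to 1. In LDA, documents are distributions over topics, and topics are distributions over words.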
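To make the "distribution over distributions" idea concrete, here is a minimal sketch that draws one sample from a Dirichlet distribution using only the Python standard library. It relies on the standard construction of normalizing independent Gamma draws. The helper name `sample_dirichlet`, the three-topic setup, and the concentration values are assumptions chosen for illustration, not part of the original text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Uses the standard construction: draw Gamma(alpha_i, 1) for each
    component, then normalize so the components sum to 1.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical example: three topics with equal concentration 0.5.
# Each sample is itself a distribution over the three topics.
topic_mixture = sample_dirichlet([0.5, 0.5, 0.5], rng)
print(topic_mixture)
```

Every call returns a different point, but each point is a valid probability vector over the three topics.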
The purpose of LDA is to map each document in our corpus to a set of topics that accounts for most of the words in the document.
To build this mapping, LDA assigns a topic to each word in each document.
LDA assumes that each document k is generated by:
- From our Dirichlet distribution for k, sample a random distribution over topics. That is, pick a point on that triangle; each point is associated with a certain probability of generating each topic. If we pick a point very close to the “sports article” corner, we have a higher probability of picking “sports article”. The probability of picking a particular point on the triangle is described by the pdf of the Dirichlet distribution (the placement of the purple mound).
- For each topic, sample a distribution over words for that topic from the Dirichlet for that topic.
- For each word in document k:
  - From the distribution over topics selected for k, sample a topic, like “sports article”.
  - From the distribution over words selected for that topic, sample the current word.
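The generative steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a reference implementation: the vocabulary, the number of topics, the concentration values, and the names `sample_dirichlet`, `generate_document`, `topic_word`, and `theta` are all hypothetical. The sketch also draws the per-topic word distributions once and reuses them for the document, which is a common way to set up the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, vocab, topic_word, doc_alpha, rng):
    """Generate one document following the steps listed above."""
    n_topics = len(topic_word)
    # Sample this document's distribution over topics.
    theta = sample_dirichlet([doc_alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # Sample a topic for this word position from the document's topic mixture.
        z = rng.choices(range(n_topics), weights=theta)[0]
        # Sample the word from that topic's distribution over the vocabulary.
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical toy vocabulary and topic count (assumptions for illustration).
vocab = ["goal", "team", "vote", "law", "stock", "bank"]
n_topics = 3

# For each topic, sample a distribution over words.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

theta, words = generate_document(20, vocab, topic_word, 0.5, rng)
print("topic mixture:", theta)
print("document:", " ".join(words))
```

Fitting LDA to real data runs this process in reverse: given only the observed words, it infers the topic mixtures and per-topic word distributions that plausibly produced them.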