What even is Latent Dirichlet Allocation?
I’m going to try to explain what’s going on, but a) I’m going super high level and aiming for the big idea b) I’m mostly basing this on the Wikipedia entry, so you may just want to read that.
Our Model is: every document is a mixture of multiple topics. (The sum of the weights of the topics = 1.) Within each topic, each word has a certain probability of appearing next. Some words are going to appear in all topics (“the”), but we think of a topic as being defined by which words are most likely to appear in it. We pick the number of topics.
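To make that generative story concrete, here's a toy sketch. The topics, words, and probabilities below are all made up for illustration; it just shows "pick a topic by the document's weights, then pick a word by that topic's weights":

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up topics; each maps words to probabilities that sum to 1.
topics = {
    "payments": {"paypal": 0.5, "account": 0.3, "the": 0.2},
    "shipping": {"shipping": 0.6, "costs": 0.2, "the": 0.2},
}

# A document is a mixture of topics; the mixture weights sum to 1.
doc_weights = {"payments": 0.7, "shipping": 0.3}

def sample_word(topics, doc_weights, rng):
    # First pick a topic according to the document's mixture weights...
    names = list(doc_weights)
    topic = rng.choice(names, p=[doc_weights[n] for n in names])
    # ...then pick a word according to that topic's word probabilities.
    words = list(topics[topic])
    return rng.choice(words, p=[topics[topic][w] for w in words])

# "Generate" a ten-word document from the mixture.
doc = [sample_word(topics, doc_weights, rng) for _ in range(10)]
print(doc)
```

Fitting the model is this process run in reverse: we see only the words and have to infer the topics and weights.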
We only see which words appear in each document – the topics, the probability of each topic, and the probability of each word in a topic are all unknown and we estimate them.
“Latent” because we can’t directly observe any of these – we can only infer them from the words we do see.
We can describe Our Model as the interaction of a bunch of different probability distributions. We tell the computer the shape of Our Model, and what data we saw, and then have it try lots of things until it finds a good fit that agrees with both of those.
The Beta distribution is what you assume you have when you know that something has to be between 0 and 1, but you don’t know much else about it. The Beta is super flexible.
Turns out, the Dirichlet distribution is the multi-dimensional version of this, so it’s a logical fit for both the distribution of words in a topic and the distribution of topics in a document.
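You can see the relationship with numpy: a Beta draw is a single proportion in (0, 1), while a Dirichlet draw is a whole vector of proportions that sums to 1 – like the topic weights for one document. The concentration parameters below are just example values:

```python
import numpy as np

rng = np.random.default_rng(42)

# One Beta draw: a single proportion between 0 and 1.
p = rng.beta(a=2.0, b=5.0)
print(p)

# One Dirichlet draw: a vector of proportions summing to 1 --
# e.g. the topic weights for a single document with 4 topics.
weights = rng.dirichlet(alpha=[1.0, 1.0, 1.0, 1.0])
print(weights, weights.sum())
```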
This is a pretty common/well-understood/standard model, so the hard part (describing the shape and telling the computer to search for fits) is already done for us in sklearn for Python, and in libraries for many other languages; I’ve definitely done this in R before.
Getting the Computer to Allocate some Latent Dirichlets
1. we turn each document into a vector of words
2. we drop super common words (since we know “the” won’t tell us anything, just drop it)
3. we transform it to use term frequency-inverse document frequency as the vector weights
4. we choose a number of topics
5. we hand that info over to sklearn to make estimates
6. we get back: a matrix with num topics rows by num words columns. Entry i, j is the probability that word j comes up in topic i.
7. which we can use to: describe the topics and make sure they make sense to us, and see how each document breaks down as a mixture of topics
Let’s step through doing all this in Python.
```python
import pandas as pd
import os, os.path, codecs
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# In older sklearn versions this lived in sklearn.feature_extraction.stop_words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn import decomposition
```
ENGLISH_STOP_WORDS is a known list of super common English words that are probably useless.
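A quick sanity check of what's in that list:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("the" in ENGLISH_STOP_WORDS)     # common filler word: in the list
print("paypal" in ENGLISH_STOP_WORDS)  # domain word: not in the list
print(len(ENGLISH_STOP_WORDS))         # a few hundred words
```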
I have a “data” dataframe that has an “id” column and a “clean” column with my text in it.
With the help of these libraries, we do steps 1-3:
```python
tfidf = TfidfVectorizer(
    stop_words=ENGLISH_STOP_WORDS,
    lowercase=True,
    strip_accents="unicode",
    use_idf=True,
    norm="l2",
    min_df=5,
)
A = tfidf.fit_transform(data['clean'])
```
A has a row for each conversation I’m looking at, and a column for each word. Entry i,j is the tf-idf weight of word j in conversation i: roughly, how often word j appears in conversation i, down-weighted by how many conversations it appears in overall (and each row is normalized).
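Here's the same transformation on a made-up three-document corpus, standing in for the real "clean" column (min_df is left at its default so the tiny vocabulary survives):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy stand-in for the "clean" column.
docs = [
    "paypal account help",
    "shipping costs help",
    "paypal shipping",
]

tfidf = TfidfVectorizer(lowercase=True, norm="l2")
A = tfidf.fit_transform(docs)

print(A.shape)                    # 3 documents by 5 distinct words
print(sorted(tfidf.vocabulary_))  # the columns, one per word
# "help" appears in two documents, so its idf (and hence its weight)
# is lower than that of a word appearing in only one document.
```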
```python
model = decomposition.NMF(init="nndsvd", n_components=9, max_iter=200)
W = model.fit_transform(A)
H = model.components_
```
fit_transform is where we actually tell it to use the data in A to choose good parameters for our model. (One caveat: NMF is non-negative matrix factorization, a related technique, not LDA itself. It finds the same kind of topics-by-words and documents-by-topics structure, and it tends to work well with tf-idf weights; sklearn’s actual LDA is decomposition.LatentDirichletAllocation, which expects raw word counts.)
H is the topics-by-words matrix from step 6. We can look at the largest values in any row of H to see which words matter most to the topic that row represents.
```python
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[tfidf.vocabulary_[term]] = term
```
then we look at what appears in H
```python
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index, :])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic %d: %s" % (topic_index, ", ".join(term_ranking)))
```
Topic 0: com, bigcartel, https, rel, nofollow, href, target, _blank, http, www
Topic 1: paypal, account, business, verified, express, required, steps, login, isn, trouble
Topic 2: plan, billing, store, amp, admin, clicking, close, corner, gold, downgrade
Topic 3: shipping, products, product, options, scroll, select, costs, set, admin, add
Topic 4: domain, custom, domains, www, provider, instructions, use, need, cartel, big
Topic 5: help, hi, basics, thing, sure, need, https, store, instructions, close
Topic 6: stripe, account, payment, checkout, bank, paypal, payments, transfer, order, orders
Topic 7: know, thanks, page, hi, let, br, code, add, just, like
Topic 8: duplicate, people, thread, error, service, yup, diamond, shows, able, thats
I started with 4 topics and kept increasing the number until some of them started to look too close together. Yep, picking the number is an art. Also, you may get a different (but hopefully similar) set of topics each time you generate these.
I don’t want to go into toooo much detail about what these are about, but when I inspected conversations that matched each topic strongly, they were very similar, so I feel pretty good about this set of topics. I especially like that, even from just the terms, you can tell there are two distinct main types of payment support conversations: getting PayPal Express set up correctly, and general how-do-payments questions. We can also clearly see a topic for helping people set up their custom domains.
What might I use this for? So far I’m thinking:
- See which topics come up a lot and use that to decide which documentation to polish.
- Look at topics over time, especially as we make relevant changes – do the custom domain questions slack off after we partnered with Google to set them up right in your shop admin?
- How long does it take to answer questions that come from a certain topic?