Topic Modeling Example with Support Conversations

Big ups to Kara Woo for pointing me to this blog post tutorial. I followed it very closely; I’m mostly just adding more words around it.

What even is Latent Dirichlet Allocation?

I’m going to try to explain what’s going on, but a) I’m going super high level and aiming for the big idea, and b) I’m mostly basing this on the Wikipedia entry, so you may just want to read that.

Our Model is: every document is a mixture of multiple topics. (The sum of the weights of the topics = 1.) Within each topic, each word has a certain probability of appearing next. Some words are going to appear in all topics (“the”), but we think of a topic as being defined by which words are most likely to appear in it. We pick the number of topics.

We only see which words appear in each document – the topics, the probability of each topic, and the probability of each word in a topic are all unknown and we estimate them.
“Latent” because we can’t directly observe any of these.

We can describe Our Model as the interaction of a bunch of different probability distributions. We tell the computer the shape of Our Model, and what data we saw, and then have it try lots of things until it finds a good fit that agrees with both of those.
The Beta distribution is what you assume you have when you know that something has to be between 0 and 1, but you don’t know much else about it. The Beta is super flexible.
Turns out, the Dirichlet distribution is the multi-dimensional version of this, so it’s a logical fit for both the distribution of words in a topic and the distribution of topics in a document.
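
If you want a concrete feel for that, here’s a tiny sketch (mine, not from the tutorial) of what draws from a Dirichlet look like – each draw is a vector of topic weights that sums to 1:

import numpy as np

# three example "documents", each a mixture over 5 topics
rng = np.random.default_rng(0)
topic_mixtures = rng.dirichlet(alpha=[0.5] * 5, size=3)
print(topic_mixtures)              # rows of topic weights
print(topic_mixtures.sum(axis=1))  # each row sums to 1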

This is a pretty common/well-understood/standard model, so all the hard work of describing the shape and telling the computer to look for fits is already done in sklearn for Python (and in many other languages; I’ve definitely done this in R before).

Getting the Computer to Allocate some Latent Dirichlets

High Level:
1. we turn each document into a vector of words
2. we drop super common words (since we know “the” won’t tell us anything, just drop it)
3. we transform it to use term frequency-inverse document frequency as the vector weights
4. we choose a number of topics
5. we hand that info over to sklearn to make estimates
6. we get back: a matrix with num topics rows by num words columns. Entry i, j is the probability that word j comes up in topic i.
7. which we can: describe the topics and make sure they make sense to us, see how each document breaks down as a mixture of topics

Let’s step through doing all this in python.

Get you some libraries:
import pandas as pd
import os, os.path, codecs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
import numpy as np

ENGLISH_STOP_WORDS is a known list of super common English words that are probably useless.

I have a “data” dataframe that has an “id” column and a “clean” column with my text in it.

With the help of these libraries, we do steps 1-3:

tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, lowercase=True, strip_accents="unicode", use_idf=True, norm="l2", min_df = 5) 
A = tfidf.fit_transform(data['clean'])

A has a row for each conversation I’m looking at, and a column for each word. Entry i,j is how often word j appears in conversation i, downweighted by how many conversations word j appears in (that’s the tf-idf part), with each row L2-normalized.
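
If you want to poke at A yourself, here’s a quick sketch (get_feature_names_out is the newer sklearn spelling; older versions call it get_feature_names):

print(A.shape)  # (number of conversations, number of terms kept)

# peek at the heaviest-weighted terms for the first conversation
vocab = tfidf.get_feature_names_out()
row = A[0].toarray().ravel()
top = row.argsort()[::-1][:5]
print([(vocab[i], round(row[i], 3)) for i in top])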

 Steps 4 and 5 – we set the number of topics in the “n_components” argument:
model = decomposition.NMF(init="nndsvd", n_components=9, max_iter=200)
W = model.fit_transform(A)
H = model.components_    

fit_transform is where we actually tell it to use the data in A to choose good parameters for our model

H is that topics by words matrix in step 6. We can look at the largest values of any row in H to see which words are most important to the topic represented by that row.

In Python:

first we save the actual words from our tfidf transform
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[ tfidf.vocabulary_[term] ] = term

then we look at what appears in H

for topic_index in range( H.shape[0] ):
    top_indices = np.argsort( H[topic_index,:] )[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print ("Topic %d: %s" % ( topic_index, ", ".join( term_ranking ) ))
 This prints out the top terms by topic:
Topic 0: com, bigcartel, https, rel, nofollow, href, target, _blank, http, www
Topic 1: paypal, account, business, verified, express, required, steps, login, isn, trouble
Topic 2: plan, billing, store, amp, admin, clicking, close, corner, gold, downgrade
Topic 3: shipping, products, product, options, scroll, select, costs, set, admin, add
Topic 4: domain, custom, domains, www, provider, instructions, use, need, cartel, big
Topic 5: help, hi, basics, thing, sure, need, https, store, instructions, close
Topic 6: stripe, account, payment, checkout, bank, paypal, payments, transfer, order, orders
Topic 7: know, thanks, page, hi, let, br, code, add, just, like
Topic 8: duplicate, people, thread, error, service, yup, diamond, shows, able, thats

I started with 4 topics and kept increasing the number until it looked like some of them were too close together. Yep, picking the number is an art. Also, you’ll get a different (but hopefully similar) set of topics every time you generate these.

I don’t want to go into toooo much detail about what these are about, but when I inspected conversations that matched each topic strongly, they were very similar, so I feel pretty good about this set of topics. I especially like that, even from just the terms, you can tell there are two distinct main types of payment support conversations: getting Paypal Express set up correctly, and general how-do-payments questions. We can also clearly see a topic for helping people set up their custom domains.
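
(In case it’s useful, here’s roughly how I pulled the strongest conversations for each topic – a sketch using the W matrix that came back from fit_transform and the data dataframe from above.)

# W has a row per conversation and a column per topic;
# bigger values mean the conversation leans harder on that topic.
for topic_index in range(W.shape[1]):
    top_docs = np.argsort(W[:, topic_index])[::-1][:3]
    print("Topic %d:" % topic_index)
    for doc_index in top_docs:
        print("  ", data['clean'].iloc[doc_index][:80])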

What might I use this for? So far I’m thinking:

  1. See which topics come up a lot and use that to decide which documentation to polish.
  2. Look at topics over time, especially as we make relevant changes – do the custom domain questions slack off after we partnered with Google to set them up right in your shop admin?
  3. How long does it take to answer questions that come from a certain topic?

Data Infrastructure Services, Part 5: Web Analytics

We feed both our web analytics tools, Google Analytics and GoSquared, via Segment. This should mean that things like page views and definitions of new users match up across tools. It also means we’re not running separate tracking code for the web analytics tools – just the one set of javascript for Segment.

Google Analytics is fine. I mean, it’s big and confusing but it probably does most everything you need somewhere in there, and if there’s a huge change in your traffic, you’ll be able to quickly see it.

Here’s one cool thing it does that I’ve seen people try to build independently:

ga user flow.png

That particular view is Acquisition -> Social -> Users Flow. People are often interested to see both how people enter their sites and where they go after the first page.

Behavior -> Behavior Flow gives a similar view but lets you choose the starting segment from things like landing page and country.

You can scroll further right and add more levels of depth if you have really long user sessions that cover many pages.

Here’s a wacky one that I learned about at this job: when you go to Acquisition -> Search Console -> Queries you’ll get something like this

ga queries

(Oh man, it looks like that one is actually from Acquisition -> All Traffic -> Channels -> click Organic Search in the table. … This is fine. And a nice example of how complex the tool is.)

ANYWAY you’ll notice that your top search term is (not provided). If I understand correctly, when someone is logged into a Google account, their search terms don’t get passed along to all our analytics.

Google also has a tool called Webmaster Tools, which is all about the Search Console. When you look at your queries there, you’ll get something like:

goog sc queries

what? all the searches are in this one. No big surprises in our case, but it was good to see that. You can get a few other search-related metrics in the Webmaster Tools view, but the full queries are the main thing that you can’t find in Google Analytics. (Having access to one tool doesn’t magically get you access to the other.)

By the way, I realized recently that the timezone for your Google Analytics is set at the “View” level:

Screenshot 2016-08-17 13.40.26.png

I get the difference between Account and Property, but I don’t really understand why you would have “View”s under that. But you do, and your timezone is set in the View Settings link right there.

(I found this out because our numbers weren’t matching with GoSquared despite both getting data from Segment. Pretty sure it was just that GoSquared was on Mountain Time and Google Analytics was on Eastern.)

Hey let’s talk about this other web analytics tool we use: GoSquared. I think they also want to be in the customer communication and analysis space, but we’re just using them for the web analytics.

They only do a few views, but they’re nice to look at and easy to read

gosquared screen

they’re all a variation on this for different time frames. It does quickly show you comparative traffic, referrals, pages viewed, user stats. The Now and Today views can be really fun to watch right after you post a tweet or send a newsletter – seeing the traffic roll in real time is a thrill and incidentally validates that you did get the links right.

They also send Daily, Weekly, Monthly emails to show how you’re doing

gosquared email.png

Nitpicky statistician here: the red and green numbers are just about direction, not change size – I’m not really excited about a +0.1% change in pages per visit or worried about a +0.1% change in bounce rate. There’s also no consideration for variation – I’d love to see some std. error bars around that trend line! Comparing to the previous period is great, but it’s not enough.
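
For the record, the kind of comparison I’d rather see is cheap to compute. Here’s a toy sketch with made-up numbers (not real GoSquared data):

import numpy as np

# pages per visit, one number per day, for two made-up weeks
last_week = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 1.9, 2.2])
this_week = np.array([2.2, 2.0, 2.3, 2.1, 2.4, 2.2, 2.1])

def mean_and_se(x):
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

for label, week in [("last week", last_week), ("this week", this_week)]:
    m, se = mean_and_se(week)
    print("%s: %.2f +/- %.2f pages/visit" % (label, m, 2 * se))

A change that sits well inside those intervals isn’t a trend, it’s noise.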

That said, I do skim all these emails – looking for major changes in the shape of the trend, or any weird referrers that brought us traffic. (It’s fun to see someone linked to you in a blog post!)

We’ve recently added their Ecommerce emails which show in a similar way how much money you’re making at what time and from what sources – definitely nice for understanding the dynamics of your business.

What do you find most useful in your web analytics? What other tools should I be checking out?

 

Data Infrastructure Services, Part 4: Monitoring

tl;dr – We use New Relic, and we have ELK but aren’t relying on it much.

Despite having been to several meetups at the New Relic offices, I didn’t really understand what they do until I took this job. New Relic is a suite of tools that takes data from little agent programs running on the servers that host your Ruby app(s) and tells you how your app(s) are performing. This results in lots and lots of kinds of graphs, from a weekly email that tells you how all your apps are performing and when they are busiest to things like this:

new relic overview.png

I feel like there’s a lot you can DO with the New Relic data, but we basically use it for two things: Investigating a perceived slow page, and receiving alerts on performance issues.

new relic events

All of our apps have timeout and error-rate thresholds that they alert on. New Relic is always getting the data from our servers, so it can notice when these things happen and page/alert in Slack/send smoke signals as appropriate.

When we do have a slowdown for whatever reason, the New Relic alerts are usually tied with the support emails for letting us know quickly – and they’re more likely to be noticed in the middle of the night.

elk-logo.png

We also have the ELK stack collecting data for most of our apps. (ELK: ElasticSearch, Logstash, Kibana. Logstash sits on your servers and sends things to ElasticSearch, ElasticSearch indexes em, Kibana queries ElasticSearch and makes graphs and stuff.)

Unlike the web event logging, the ELK stack data is near real-time, allowing us to look into weirdness and debug a little quicker. We haven’t been relying on it much, and I have barely used it at all. It’s definitely been a nice-to-have investigative tool.

I played with it a bit this week, so while it’s on my mind, here’s what I’ve learned:

  • you can’t download search results from Kibana
  • you CAN download graph data from Kibana
  • you can directly query the ElasticSearch instance using curl(?) and get json
  • you can post json to the endpoint or do simple queries in the uri (see the sketch after this list)
  • and if you’re me, then you do terrible things to json in your Python Notebook
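
Here’s that sketch – assuming Elasticsearch is reachable at localhost:9200 and the Logstash indices match “logstash-*” (swap in your own host, index pattern, and field names):

import requests

url = "http://localhost:9200/logstash-*/_search"
query = {"size": 10, "query": {"match": {"message": "timeout"}}}

resp = requests.post(url, json=query)
for hit in resp.json()["hits"]["hits"]:
    # "message" is the usual Logstash field; yours may be named differently
    print(hit["_source"].get("message"))

# the simple-query-in-the-uri version of the same search:
# curl "http://localhost:9200/logstash-*/_search?q=message:timeout"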

 

Data Infrastructure Services, Part 3: Other Data Sources, especially Web Events

This is an area where I feel like there’s GOT to be a ton of other solutions out there, but I only know of the one we happen to be using. (1) Please fill me in on other tools that do some of this and what you like/hate about them!

We use a tool called Segment for web events – getting a record of everything people do on your site. (I remember when we’d just look at web requests, but it’s all javascript these days. And ideally you want to see both together. ) As well as giving us libraries to log anything we want to Redshift, they also integrate with a bazillionity other services as sources and destinations of data. The only other source we are using is Intercom, our customer service platform. (We’re also sending data to it, which is real cool for people answering questions so they can have context on the account.) But we’re also sending the data to our web analytics tools. (I promise to talk more about them later.)

Segment lets us log events from the Ruby backend of our app and from the Javascript that’s running locally in the user’s browser. They provide a nice debugger view into the real-time events they’re receiving:

segment debugger

this is key for seeing if you have correctly added logging for a new event.

They also provide logging for our iOS app. To be nitpicky, it is occasionally frustrating that many of the web events I look at are in the “pages” table, but every iOS event is in its own table, each with exactly the same schema. (Write ALLLL the UNIONs) (Maybe I need to get the devs to use a screen_viewed event that has which screen as an attribute? )
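
(To make the write-all-the-UNIONs thing concrete, here’s a sketch with hypothetical table and column names – I’d rather generate the SQL than type it:)

# hypothetical per-screen iOS event tables; yours will be named differently
ios_tables = ["ios_screen_home", "ios_screen_product", "ios_screen_checkout"]

# every table shares the same schema, so stitch them into one result
# (received_at and user_id are assumed column names here)
union_sql = "\nUNION ALL\n".join(
    "SELECT '%s' AS source_table, received_at, user_id FROM %s" % (table, table)
    for table in ios_tables
)
print(union_sql)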

Because of realistic engineering limits and Redshift loadtimes, Segment only promises to get your data to you within 24 hours. This is totally fine for me, but maybe you need something fancier? (Someone please tell me a really useful story of what you’re doing with real-time data?)

Big Caveat for those who haven’t fought this yet: because a lot of web event logging is in Javascript, you’re going to have missing data. Ad blockers will keep out your handy cookies and stop the Javascript from running. Slower computers or slower network connections will make it so that the Javascript doesn’t get a chance to actually send the info back to you. You will lose data and that data loss will be biased to people on mobile, and people with slower computers.

Maybe I just have too much trauma from past work, but I’m really happy to not own this system. There’s nothing I can do if somehow data gets lost – but there generally isn’t when it’s an internal system either. It’s just one more huge set of details to get right that I don’t want to specialize in.

(1) I have heard of precisely one other answer for web events, but haven’t yet found anyone who use(s|d) it: Snowplow.

Data Infrastructure Services, Part 2: Getting Data into the Warehouse

That data warehouse isn’t too useful until we fill it up with stuff. One of the key sources of data is your transactional database. We happen to be using a tool called Xplenty to fill this role for us.

Xplenty lets you create jobs called “packages”. Choose from a number of sources

xplenty sources.png

optionally perform some simple operations on them, and then choose from several destinations for your data.

xplenty destinations

Most of my packages look like this:

xplenty simple package.png

Once you have some packages, you can schedule them:

xplenty schedule

and you can see the history of your package runs or manually run one from your dashboard

xplenty dash

This is all pretty straightforward, but it adds up to a LOT of infrastructure we don’t have to own ourselves. Xplenty gives us a shared place to define the jobs and schedule them to run. It handles spinning up a cluster on AWS to run the jobs – right now we only need a small one, but if that changes, it’s very easy to ask for more nodes. They have really great customer support – chat in app that goes to email if you wander off. I get such friendly and helpful replies that I probably pester them _more_ because I know I’ll actually get a real response. We have Xplenty set up to tie into PagerDuty, so I get emails about job fails (and if we really needed to, I could get paged.)

(Old person aside – I have owned a web-based scheduler tool before and I would really rather not have to again.)

The upsides of this managed infrastructure led me to an interesting corner case – sometimes I want to compute a number and store it regularly. Usually once a day. What’s my tool to store things? Xplenty. What’s my tool to compute things? Uhm, well, usually it’s SQL. In this case, I had a fairly involved SQL query that I knew worked for the computation. AND I already had Xplenty sending the data it needed to Redshift. I was realllllly tempted to just put the query in a repo and then cron up the psql command to Redshift somewhere.
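
To make the temptation concrete: the cron’d job would have been something like this, sketched here in Python with psycopg2 instead of raw psql, and with made-up table and connection names.

import datetime
import os
import psycopg2  # Redshift speaks the Postgres protocol

# hypothetical connection details – not our real setup
conn = psycopg2.connect(
    host="my-cluster.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="reporter",
    password=os.environ["REDSHIFT_PASSWORD"],
)

# stand-in for the fairly involved query that computes the daily number
DAILY_METRIC_SQL = """
    SELECT count(*) FROM some_table
    WHERE created_at::date = current_date - 1
"""

with conn, conn.cursor() as cur:
    cur.execute(DAILY_METRIC_SQL)
    (value,) = cur.fetchone()
    cur.execute(
        "INSERT INTO daily_metrics (day, value) VALUES (%s, %s)",
        (datetime.date.today() - datetime.timedelta(days=1), value),
    )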

But:

  1. where would I run it?
  2. what would watch cron to let me know it successfully ran?
  3. what makes sure that box continues to have access to Redshift?
  4. what if someone else needs to see all the things running on Redshift?
  5. how would someone else find it to update it?

Running it via Xplenty answers all of these in a consistent way. It felt very weird to build a package this complex

xplenty complex package.png

but it was worth it to have all of my data transfer run in a consistent and easy-to-find way. What would you have done? Why?

What are some other tools in this particular niche: moving data out of your backend transactional databases to a data warehouse?

Data Infrastructure Services Intro & Part 1: Redshift

Did you know that you don’t have to build every part of your data infrastructure yourself? You don’t even have to debug and maintain it all. This was a big surprise to me when I started at a smaller company, so I’m going to share what little I know about the options.

This inspired my talk at Data Day Seattle, because I realized I hadn’t seen any talks on the tools available. From my abstract: “The goal is to give the audience a feel for what tools are available and generate discussion about how attendees address these needs at their companies.”

Since I want to generate discussion, I decided I should do a blog post series on each of the sections of tools I talked about so I could reach more people and have something to refer to. Here we go!

Data Warehouses – or the one I know about, Redshift

I use Amazon’s Redshift at work. I can mostly treat it like Postgres or Vertica, so that’s fine by me. I feel like this is a pretty popular choice – would love to hear from people who use something else and what they think of it.

The only thing I’ve run into is that there’s not a default window frame for windowing functions like you get with Vertica. I searched for my “require a frame clause” error, and this Quora answer got me set up. The corresponding Redshift docs have all the info to create one, but don’t mention what the default is on other databases.

Here’s the window frame clause:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

and here’s an example query where I use it to generate a cumulative sum:

select day, start, change,
sum(change) OVER (PARTITION BY start ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulate
from daily_change

Mostly I never have to think about Redshift, and that’s how I like it.

Next time: how we get data INTO Redshift.

The Saga of the Facebook CSV file

New Punchline: first thing totally works if I used the right file. Whoooooops.

Try: main2 = pd.read_csv("Nov2015Mar2016.csv", usecols=goodcols, encoding="latin1")

(Thanks to Andrew Mullins for the encoding fix)

If you have a Facebook Page, you can get metrics on it. The web view shows you the last 28 days and a selection of graphs. But you can do an export on a selected date range and download a whole bunch of metrics. Sounds great, right?

You know how spreadsheet columns are labeled A-Z, and then AA-AZ, BA-BZ? I didn’t know it could go to three letters. This spreadsheet goes to CCW.

Maybe you see why this is going to be fun. For starters, if your computer default opens it in Numbers and you save it from there, you get cut off at IU.

I spent a few fascinated hours going through the column headers and googling the more obscure ones to decide which ones I wanted. I got it down to 9 main ones(1), plus the hourly numbers on how many people who liked your page are online at that time. (2)

Complication – Some of the Metrics are broken down by city: “Lifetime Likes by City – Birmingham, England, United Kingdom” for example. It looks like you get a column for every relevant city for your data. Awesome for getting tons of data, but it means that the number of columns probably varies across each date range, and so does the index of any columns past the generated city ones.

Ok, fine. I’ll make a little python script that reads in the csv, selects the columns I want, writes out a new csv. This should be easy!

>>> import pandas as pd

>>> first = pd.read_csv("June2015Oct2015.csv")

>>> first.shape

(153, 255)

womp. Pandas gets me 255 columns. And I didn’t notice at first until

>>> main = pd.DataFrame(first, columns=goodcols)

ran with no error. It just shoved NaN in all the remaining columns. Silently.

I start complaining on Twitter at this point. This magically gives me the ability to read the docs and I am momentarily excited about the “usecols” option in Pandas.

>>> first = pd.read_csv("June2015Oct2015.csv", usecols=goodcols)

Traceback (most recent call last): (snip)

ValueError: Usecols do not match names.

Return to public complaining. Andrew Mullins reasonably suggests trying std lib’s csv DictReader.

>>> import csv

>>> test = open("June2015Oct2015.csv")

>>> reader = csv.DictReader(test)

>>> len(reader.fieldnames)

255

I think this is where I stopped for the day. Later that night I realized I should try R, but I put it off for a few days. For one thing – I had to install R.

> data <- read.csv("June2015Oct2015.csv")

> dim(data)

[1] 153 255

From googling, Stack Overflow suggested data.table. It was also happy to accept my list of columns on the import, but then just left off the ones past the first 255 columns.

For my next weird trick, I tried SQLite

sqlite> .mode csv

sqlite> .import June2015Oct2015.csv June2015Oct2015

CREATE TABLE June2015Oct2015(…) failed: duplicate column name: Daily Logged-in Page Views

… oh cmon, Facebook. Unique column names are a pretty reasonable assumption. (FYI, Weekly Logged-in Page Views also appears to be in there twice. Probably some other things.) This would have been a decent idea for a reasonable file.

Props to Bob Rudis for coming back at this point with some R ideas that seemed plausible, even though none of them worked for me.

And then it’s time to hang out for Holden Karau’s Spark Office hours… and in my complaining, I nerdsnipe her into looking at it.

Here’s where we got:

scala> val df = spark.read.format("csv").option("header", "false").option("inferSchema", "false").load("/Users/melissa/repos/data/social_media/example_facebook.csv")

16/07/25 15:07:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

16/07/25 15:07:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string … 2125 more fields]

scala> val miniHeader = df.take(1)

16/07/25 15:07:43 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting ‘spark.debug.maxToStringFields’ in SparkEnv.conf.

miniHeader: Array[org.apache.spark.sql.Row] = Array([Date,Lifetime Total Likes,Daily New Likes,Daily Unlikes,Daily Page Engaged Users,Weekly Page Engaged Users,28 Days Page Engaged Users,Daily Like Sources – On Your Page,Daily Total Reach,Weekly Total Reach,28 Days Total Reach,Daily Organic Reach,Weekly Organic Reach,28 Days Organic Reach,Daily Total Impressions,Weekly Total Impressions,28 Days Total Impressions,Daily Organic impressions,Weekly Organic impressions,28 Days Organic impressions,Daily Reach of page posts,Weekly Reach of page posts,28 Days Reach of page posts,Daily Organic Reach of Page posts,Weekly Organic Reach of Page posts,28 Days Organic Reach of Page posts,Daily Total Impressions of your posts,Weekly Total Impressions of your posts,28 Days Total Impressions of your pos…

scala> val colheads = miniHeader(0).toSeq

colheads: Seq[Any] = WrappedArray(Date, Lifetime Total Likes, Daily New Likes, Daily Unlikes, Daily Page Engaged Users, Weekly Page Engaged Users, 28 Days Page Engaged Users, Daily Like Sources – On Your Page, Daily Total Reach, Weekly Total Reach, 28 Days Total Reach, Daily Organic Reach, Weekly Organic Reach, 28 Days Organic Reach, Daily Total Impressions, Weekly Total Impressions, 28 Days Total Impressions, Daily Organic impressions, Weekly Organic impressions, 28 Days Organic impressions, Daily Reach of page posts, Weekly Reach of page posts, 28 Days Reach of page posts, Daily Organic Reach of Page posts, Weekly Organic Reach of Page posts, 28 Days Organic Reach of Page posts, Daily Total Impressions of your posts, Weekly Total Impressions of your posts, 28 Days Total Impressions of…

scala> val goodcols = List("Date", "Daily New Likes", "Daily Unlikes", "Daily Page Engaged Users", "Daily Total Reach", "Daily Total Impressions", "Daily Negative feedback", "Daily Positive Feedback from Users – comment", "Daily Positive Feedback from Users – like", "Daily Daily count of fans online")

goodcols: List[String] = List(Date, Daily New Likes, Daily Unlikes, Daily Page Engaged Users, Daily Total Reach, Daily Total Impressions, Daily Negative feedback, Daily Positive Feedback from Users – comment, Daily Positive Feedback from Users – like, Daily Daily count of fans online)

scala> val indices = goodcols.map(colheads.indexOf(_)).map("_c" + _)

indices: List[String] = List(_c0, _c2, _c3, _c4, _c8, _c14, _c1740, _c1790, _c1791, _c1848)

scala> val out = df.select(indices(0),indices.tail:_*)

out: org.apache.spark.sql.DataFrame = [_c0: string, _c2: string … 8 more fields]

scala> out.write.format("csv").save("/Users/melissa/repos/data/social_media/fb_main.csv")

I’m sure it was TOTALLY REASONABLE to install a JDK and Spark for this.

Apologies if I missed any steps or something seems not quite right. Please do not try to argue with me and my code snippets. I am mostly documenting this because I still find it hilarious.

(1) Here are my main columns that I want:
“Daily New Likes”,
“Daily Unlikes”,
“Daily Page Engaged Users”,
“Daily Total Reach”,
“Daily Total Impressions”,
“Daily Negative feedback”,
“Daily Positive Feedback from Users – comment”,
“Daily Positive Feedback from Users – like”,
“Daily Daily count of fans online”

(2) Daily Liked and online – X, X in 0-23. I had to google to find out what this meant, but here’s a post that explains it.
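
For what it’s worth, once those hourly columns make it into a dataframe, reshaping them into one long “date, hour, fans online” table is quick. A sketch, assuming the columns really are named “Daily Liked and online - 0” through “- 23” and that the Date column survived the column-picking:

import pandas as pd

hourly_cols = ["Daily Liked and online - %d" % h for h in range(24)]
fans_online = main2.melt(id_vars=["Date"], value_vars=hourly_cols,
                         var_name="hour", value_name="fans_online")
fans_online["hour"] = fans_online["hour"].str.extract(r"(\d+)$", expand=False).astype(int)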