Topic models are great for categorizing WHAT a text is about. It is pretty easy as well: get an off-the-shelf LDA, train it on your corpus and you are set to go. But there are even more insights you can get about your texts. Modifying your corpus in a certain way (mostly removing everything but verb phrases) allows you to gain a deeper understanding of WHY a certain text was written.
I tested it on Stack Overflow (SO) questions within the scope of a bigger media mining project.
I used Python with nltk and gensim for this project. To play around with the code posted below, make sure you have Python installed, then use easy_install to install the packages the code imports: nltk and gensim.
Furthermore, if you want to use the SO questions as well, download them from https://archive.org/details/stackexchange.
Creation of a corpus and dictionary
Before we can start to train a topic model we need a dictionary of all the tokens that are part of our corpus (in our case the corpus consists of all SO questions).
    from gensim import corpora


    class SOQuestionCorpus(corpora.TextCorpus):
        def __init__(self, question_file, tokenizer):
            # The Stack Overflow questions are stored in a file, one question per line.
            self.question_file = question_file
            # A tokenizer is a function that takes a text (possibly multiple sentences) and
            # returns all the tokens it contains as a list of strings. A token is the unit we
            # train the LDA on: a single word, a word stem or a phrase.
            self.tokenizer = tokenizer
            # The `TextCorpus` class builds a dictionary of all tokens of all documents.
            # The tokens of every document are provided by the `get_texts` function.
            super(SOQuestionCorpus, self).__init__(input=True)
            # Drop tokens that occur in fewer than 3 documents or in more than 20% of all
            # documents. The latter removes common stop words like 'the' or 'is'.
            self.dictionary.filter_extremes(no_below=3, no_above=0.2)

        # Stack Overflow questions contain a lot of content we don't want in our topic
        # model, like code snippets or other markup. `remove_code` and `remove_tags` are
        # helpers that have to be defined elsewhere.
        @staticmethod
        def pre_process(body):
            return remove_tags(remove_code(body))

        # Yields one list of tokens per document.
        # Example: let the documents be
        #   `["Hello world. I am doc1.", "Nice code! I like it."]`
        # In that case the function yields two lists, each holding the tokens of one document:
        #   `[["hello", "world", ".", "I", "am", "doc1", "."], ["Nice", "code", "!", "I", "like", "it", "."]]`
        def get_texts(self):
            with open(self.question_file) as questions:
                for question in questions:
                    yield list(self.tokenizer(SOQuestionCorpus.pre_process(question)))
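The corpus class calls `remove_code` and `remove_tags`, which are not part of the listing. The following is only a minimal sketch of what they could look like, assuming each question body is HTML in which code appears inside `<pre>` or `<code>` elements. The regular expressions are my own assumption, not the original implementation, and a real HTML parser would be more robust.

```python
import re
from html import unescape

# Assumption: code snippets are wrapped in <pre>...</pre> or <code>...</code>
CODE_BLOCK = re.compile(r"<pre\b.*?</pre>|<code\b.*?</code>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def remove_code(body):
    """Replace every code element with a space so neighbouring words stay separated."""
    return CODE_BLOCK.sub(" ", body)


def remove_tags(body):
    """Strip the remaining HTML tags, decode entities and collapse whitespace."""
    text = unescape(TAG.sub(" ", body))
    return " ".join(text.split())


body = ("<p>Why does <code>foo()</code> fail?</p>"
        "<pre><code>foo()\n</code></pre>"
        "<p>It &amp; bar work.</p>")
print(remove_tags(remove_code(body)))
# Why does fail? It & bar work.
```

Removing the code first matters: once the tags are gone, code and prose can no longer be told apart.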
Creating the corpus will take some time since all the documents need to be processed. You should call self.dictionary.save('mydictionary.dict') after the creation (for example at the end of __init__) to store the dictionary for later use.
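To make the `filter_extremes(no_below=3, no_above=0.2)` call in the corpus class more tangible, here is a small pure-Python sketch of the document-frequency pruning it performs. It is a simplified stand-in and not gensim's code: the function name `surviving_tokens` is made up for this illustration, and gensim additionally caps the vocabulary size (its `keep_n` parameter), which is ignored here.

```python
from collections import Counter


def surviving_tokens(documents, no_below=3, no_above=0.2):
    """Return the tokens whose document frequency lies within the given bounds.

    no_below is an absolute number of documents, no_above a fraction of all documents.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(documents)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["the", "http", "request"],
    ["the", "http", "header"],
    ["the", "cuda", "gpu"],
    ["the", "gpu", "kernel"],
]
# With only four toy documents the thresholds are loosened for the demo.
print(sorted(surviving_tokens(docs, no_below=2, no_above=0.5)))
# ['gpu', 'http']
```

"the" appears in every document and is dropped as too common, while "request", "header", "cuda" and "kernel" appear only once and are dropped as too rare.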
As you might have noticed, we have not defined the tokenizer function yet.
A standard topic model
For a standard topic model it is sufficient to use utils.simple_preprocess as a tokenizer. It will lowercase the input and use a regex to split the text into single words:
    from gensim import utils

    tokenizer = utils.simple_preprocess
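If you want a feeling for what this tokenizer produces without installing gensim, the following is a rough pure-Python approximation. It assumes the default settings as I understand them (tokens are runs of non-digit word characters, kept only if they are 2 to 15 characters long) and skips details such as accent removal, so treat it as a sketch rather than a drop-in replacement.

```python
import re

# Runs of word characters that are not digits
WORD = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, split it into words and drop very short or very long tokens."""
    tokens = WORD.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


print(approx_simple_preprocess("Hello World! I am doc1."))
# ['hello', 'world', 'am', 'doc']
```

Note that punctuation, digits and one-letter words such as "I" disappear, which is fine for a bag-of-words topic model.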
We can now use this tokenizer and our corpus to train a model:
    from gensim import models

    # 'questions.txt' is a placeholder path: the file with one SO question per line
    corpus = SOQuestionCorpus('questions.txt', tokenizer)
    model = models.LdaMulticore(corpus=corpus, iterations=50, chunksize=5000, num_topics=100,
                                id2word=corpus.dictionary, eval_every=3, workers=5)
The model was trained to fit 100 topics to the corpus. After the training we can use this model to predict topics for new questions. Here are some examples of the topics the trained LDA model came up with:

| # | Contained words sorted by frequency |
|---|---|
| 50 | request, response, requests, header, server, with, http, headers, proxy, get |
| 62 | ffmpeg, enable, gpu, cuda, sensitive, retina, with, sdl, for, opencl |
| 73 | feature, features, for, training, classification, dataset, lat, naive, predict, race |

As you can see, topic #50 is mostly about HTTP communication, #62 is about GPU computing and #73 seems to be about machine learning. But none of the topics reveals why the questioner is asking the question.
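Predicting topics for a new question could look roughly like the sketch below. The helper `top_topics` is a name I made up for this illustration. It assumes that `dictionary` offers gensim's `doc2bow` method and that `model` offers `get_document_topics`, as gensim's dictionary and LDA models do, and that the same tokenizer is used as during training.

```python
def top_topics(model, dictionary, tokenizer, text, top_n=3):
    """Return the top_n (topic_id, probability) pairs for a new text."""
    # Tokenize exactly as during training, then map the tokens to (token_id, count) pairs.
    bow = dictionary.doc2bow(list(tokenizer(text)))
    # The model answers with a list of (topic_id, probability) pairs.
    distribution = model.get_document_topics(bow)
    return sorted(distribution, key=lambda pair: pair[1], reverse=True)[:top_n]


# Usage with the objects from this post (not executed here):
# top_topics(model, corpus.dictionary, tokenizer, "How do I set a custom HTTP header?")
```

Sorting by probability puts the dominant topic first, which is usually the one you want to inspect.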
Verb-phrase tokenization of documents
Let’s fiddle around with the tokenization step to find a way to extract topics that correspond to the intentions of questioners. Nouns and noun phrases (NP) often contain information about the WHAT and, as you can see in the above examples, the LDA mostly focuses on them. In contrast to nouns, verb phrases (VP) often contain information about the intentions of a questioner.
So let’s try to get rid of the noun phrases and train the LDA on the remaining phrases. To do so, we obviously need to figure out which parts of a sentence correspond to noun phrases. NLTK provides us with a neat base class for that task: ChunkParserI. We can build upon that and implement a simple bigram chunker (which can later be easily replaced by a more sophisticated model).
    from nltk import BigramTagger, ChunkParserI
    from nltk.chunk.util import conlltags2tree, tree2conlltags


    # Underlying chunker we are going to train
    class BigramChunker(ChunkParserI):
        def __init__(self, train_sentences):
            # Reduce every training sentence to (POS tag, chunk tag) pairs
            train_data = [[(t, c) for w, t, c in tree2conlltags(sent)] for sent in train_sentences]
            # Create a bigram tagger and use the supplied training data to create the model
            self.tagger = BigramTagger(train_data)

        # Extracts chunks from a sentence using the CoNLL tree format. The incoming sentence
        # needs to be tokenized and annotated with POS tags.
        def parse(self, sentence):
            pos_tags = [pos for (word, pos) in sentence]
            tagged_pos_tags = self.tagger.tag(pos_tags)
            chunk_tags = [chunk_tag for (pos, chunk_tag) in tagged_pos_tags]
            conll_tags = [(word, pos, chunk_tag)
                          for ((word, pos), chunk_tag) in zip(sentence, chunk_tags)]
            return conlltags2tree(conll_tags)
As a training data set for the chunker, the conll2000 data can be used. Now this chunker can be plugged into a pipeline that first tokenizes, POS tags and chunks the sentences. Afterwards we can throw out the chunks we do not want to use.
    from nltk import sent_tokenize, word_tokenize, pos_tag, BigramTagger, ChunkParserI
    from nltk.chunk.util import conlltags2tree, tree2conlltags
    from nltk.corpus import conll2000


    # This is an implementation of a text chunker. It tries to split sentences into their
    # phrases. To do that there are several steps: sentence splitting, word tokenization,
    # POS tagging. After that a trained chunker will group the tokens.
    class TextChunker:
        def __init__(self):
            # Load the training data for the chunker
            train_sents = conll2000.chunked_sents('train.txt', chunk_types=['VP', 'NP'])
            self.chunker = BigramChunker(train_sents)

        # Given a document of sentences, yield the chunks contained in each sentence
        def chunk_text(self, rawtext):
            if self.chunker is None:
                raise Exception("Text chunker needs to be trained before it can be used.")
            sentences = sent_tokenize(rawtext.lower())  # NLTK default sentence segmenter
            tokenized = [word_tokenize(sent) for sent in sentences]  # NLTK word tokenizer
            postagged = [pos_tag(tokens) for tokens in tokenized]  # NLTK POS tagger
            for tagged in postagged:
                for chunk in self._extract_chunks(self.chunker.parse(tagged),
                                                  exclude=["NP", ".", ":", "(", ")"]):
                    if len(chunk) >= 2:
                        yield chunk

        # A leaf of the parse tree is a (token, POS tag) pair
        def _token_of(self, tree):
            return tree[0]

        def _tag_of(self, tree):
            return tree[1]

        # The chunker will produce a parse tree. We need to analyse the parse tree and
        # extract and combine the tags we want.
        def _extract_chunks(self, tree, exclude):
            def traverse(tree):
                try:
                    # Let's check if we are at a leaf node containing a token
                    tree.label()
                except AttributeError:
                    # We want to exclude all POS tags in `exclude` and furthermore we want to
                    # ignore special characters. The POS tag of a special character is equal
                    # to the character. The only other token for which this is true is `to`,
                    # so we need to make sure to exclude everything but `to`.
                    if self._tag_of(tree) in exclude \
                            or self._token_of(tree) in exclude \
                            or (self._token_of(tree) != "to" and self._token_of(tree) == self._tag_of(tree)):
                        return []
                    else:
                        # Return the token of the node
                        return [self._token_of(tree)]
                else:
                    node = tree.label()
                    if node in exclude:
                        return []
                    else:
                        return [word for child in tree for word in traverse(child)]

            for child in tree:
                traversed = traverse(child)
                if len(traversed) > 0:
                    # Chunks get connected again using whitespace
                    yield " ".join(traversed)
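The filtering idea can also be tried without training a chunker. The sketch below works directly on hand-written (word, POS tag, chunk tag) triples in the CoNLL IOB style and drops the noun phrases. It is a simplified illustration, not the class above: the function name `non_np_chunks` and the example tags are my own assumptions, written by hand rather than produced by a tagger.

```python
def non_np_chunks(conll_tags, exclude=("NP", ".", ":", "(", ")")):
    """Group IOB-tagged tokens into chunks and drop the excluded chunk types."""
    chunks = []  # list of (chunk_type, [words]) in sentence order
    for word, pos, chunk_tag in conll_tags:
        # Skip punctuation: excluded POS tags and tokens that equal their own tag
        if pos in exclude or word in exclude or (word != "to" and word == pos):
            continue
        prefix, _, chunk_type = chunk_tag.partition("-")
        if prefix == "I" and chunks and chunks[-1][0] == chunk_type:
            chunks[-1][1].append(word)           # continue the open chunk
        else:
            chunks.append((chunk_type, [word]))  # "B-..." or "O" starts a new chunk
    return [" ".join(words) for chunk_type, words in chunks if chunk_type not in exclude]


# Hand-annotated example sentence (tags are illustrative)
tags = [
    ("i", "PRP", "B-NP"),
    ("do", "VBP", "B-VP"), ("not", "RB", "I-VP"), ("understand", "VB", "I-VP"),
    ("why", "WRB", "O"),
    ("it", "PRP", "B-NP"),
    ("is", "VBZ", "B-VP"), ("not", "RB", "I-VP"), ("working", "VBG", "I-VP"),
    (".", ".", "O"),
]
print(non_np_chunks(tags))
# ['do not understand', 'why', 'is not working']
```

Each surviving chunk becomes a single token for the LDA, so multi-word phrases such as "is not working" end up as one dictionary entry.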
Instead of the default tokenizer utils.simple_preprocess we can now use this chunk-based tokenizer to split our sentences and train our LDA model. The result is a model that mainly relies on verb phrases.
    chunker = TextChunker()
    # 'questions.txt' is a placeholder path: the file with one SO question per line
    corpus = SOQuestionCorpus('questions.txt', chunker.chunk_text)
    model = models.LdaMulticore(corpus=corpus, iterations=50, chunksize=5000, num_topics=100,
                                id2word=corpus.dictionary, eval_every=3, workers=5)
As a result, there will be topics that look somewhat like this:
| # | Contained words sorted by frequency |
|---|---|
| 12 | why, here, not, does, am getting, get, do not understand, is not working, works |
| 32 | compiled, using, by, get, is driving, compiling, invalid, signing, compile, is configured |
| 43 | can i do, using, want to show, want, not, only, now, can i achieve, here, am using |
The topics can be improved using stemming and a more sophisticated chunker.
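One way to add stemming is to wrap the chunk tokenizer so that every word inside a chunk is reduced to its stem before it reaches the LDA. The sketch below is only an illustration under that assumption: `with_stemming` is a made-up helper, and the toy tokenizer and toy stemmer merely stand in for `chunker.chunk_text` and a real stemmer so the example runs on its own.

```python
def with_stemming(tokenizer, stem):
    """Wrap a chunk tokenizer so that every word inside a chunk gets stemmed."""
    def tokenize(text):
        for chunk in tokenizer(text):
            yield " ".join(stem(word) for word in chunk.split())
    return tokenize


# Toy stand-ins so the sketch runs without NLTK
def toy_tokenizer(text):
    return text.split(", ")


def toy_stem(word):
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


stemmed_tokenizer = with_stemming(toy_tokenizer, toy_stem)
print(list(stemmed_tokenizer("is compiling, compiled, want to compile")))
# ['is compil', 'compil', 'want to compil']
```

With NLTK installed, something like `with_stemming(chunker.chunk_text, PorterStemmer().stem)` (with `PorterStemmer` imported from `nltk.stem`) should merge variants such as "compiled", "compiling" and "compile", which show up as separate entries in topic 32.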
In this post I tried to explain how to train a simple text chunker using the Python library nltk in combination with gensim. Filtering out tokens before training a topic model, or going even further and combining several tokens into a single one, will result in topics with a different meaning.
Let me know if you try this or have tried a similar approach. I would appreciate hearing about your insights and results.
The idea is inspired by Miltiadis Allamanis and Charles Sutton, who published it in "Why, When and What: Analyzing Stack Overflow Questions by Topic, Type & Code".