Topic modeling - finding topics for a dataset¶
In this tutorial, we show how to extract topics for a previously unknown dataset. A topic model is a domain-unspecific model which works on any dataset without extra training. It is a useful tool for getting a first, quick overview of an unfamiliar dataset and for deciding which classes to use for class and/or classlabel tasks.
Import the library¶
The first step is to import the necessary class.
[1]:
from autonlu.topic_model import TopicModel
Provide a dataset¶
Usually, we would load some sort of dataset. However, for this tutorial, we create our own mini-dataset of hotel reviews containing just four sentences. A dataset has to be provided as a list of strings. For longer documents, it is a good idea to perform sentence splitting first (e.g. using spacy or nltk; see the sketch after the next cell) rather than passing whole documents to the system.
[2]:
demo_dataset = ["Best beach I've ever seen !!!.",
"Our kids loved the pool.",
"The selection at the buffet was not very large but everything was delicious.",
"Blue ocean, warm water and sometimes waves. Perfect!"]
Create a topic model¶
A topic model is an instance of the class TopicModel (imported above). When we create an instance of TopicModel, we have to pass the dataset and set the language parameter (in our case "en" for English).
[3]:
topic_model = TopicModel(dataset_or_path=demo_dataset, language="en")
Creating sentence-transformer ... Calculate vector representations for the samples
Create main vocabulary from dataset
Calculate vector representations for the vocabulary
Search for topics¶
Now, we can already search for topics. To do so, we pass the number of topics we would like to have as an argument to the method search_topics. For our small example, 3 topics are sufficient. The results are returned in the form of a Python dictionary.
[4]:
return_dict = topic_model.search_topics(nb_topics=3)
Optimizing...
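The raw results are now in return_dict. If you would like to peek at it before using the convenience printer in the next step, here is a purely exploratory sketch; the dictionary's exact keys are not documented in this tutorial, so we only list them:
[ ]:
# Exploratory only: return_dict is a plain Python dictionary, so standard
# dict operations apply. Its exact contents may vary between library versions.
print(type(return_dict))
print(list(return_dict.keys()))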
Show the results¶
Instead of digesting the return_dict ourselves, we call the method show_results to get a quick overview.
[5]:
topic_model.show_results(print_explanation=True)
According to the internal scoring system, the following 3 topic achieve 95.36% of the score points which could have been achieved by selecting all 19 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+----------+-----------+--------------+----------------------
1 | beach | 47.52% | 56.94% | 1.20
| |...........!..............!......................
| | * Best beach I've ever seen !!!.
| | * Blue ocean, warm water and sometimes waves. Perfect!
------+----------+-----------+--------------+----------------------
2 | buffet | 27.63% | 42.89% | 1.55
| |...........!..............!......................
| | * The selection at the buffet was not very large but everything was delicious.
------+----------+-----------+--------------+----------------------
3 | pool | 24.85% | 45.93% | 1.85
| |...........!..............!......................
| | * Our kids loved the pool.
-------------------------------------------------------------------
* score: The algorithm distributes score-points for the quality of the topic-sample-match. However, the algorithm also tries to find topics which describe different aspects. When two topics describe (partially) a similar subject, the score points are reduced. Hence, the topics are in competition with each other.
* solo score: The score a topic would achieve when it were the sole topic - i.e. when the topics were not in competition for score points. Hence, solo score >= score.
* solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.
The role of the vocabulary¶
The topic model algorithm matches the sample sentences against all words in its vocabulary. However, the topic model has no built-in vocabulary; it derives one from the dataset. In our example, the total vocabulary (after filtering) comprises only 19 words. Although these 19 words did a great job here, we might like to have more control over the vocabulary. The class TopicModel offers several ways to influence the vocabulary and, with it, the topic selection. The most direct way is by setting the argument main_vocabulary, as we are going to do below. In case you have a suitable vocabulary in the form of a string list at hand, you can use it directly.
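For illustration, such a custom vocabulary is nothing more than a plain list of strings; the words below are a hypothetical, hand-picked example:
[ ]:
# Hypothetical hand-picked vocabulary for hotel reviews.
custom_vocab = ["beach", "pool", "buffet", "room", "staff", "breakfast", "location"]
# It could then be passed as main_vocabulary, exactly as shown below for small_vocab.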
Vocabulary tools¶
In this demo, we use the class VocabularyTools to generate our main vocabulary. VocabularyTools provides the two functions get_small_english_vocab and get_large_english_vocab.
[6]:
from autonlu.topic_model import VocabularyTools
small_vocab = VocabularyTools.get_small_english_vocab()
large_vocab = VocabularyTools.get_large_english_vocab()
print(f"Number of words in the small vocabulary: {len(small_vocab)}")
print(f"Number of words in the large vocabulary: {len(large_vocab)}")
Number of words in the small vocabulary: 10382
Number of words in the large vocabulary: 235758
Setting the main vocabulary¶
We choose the small vocabulary with its roughly 10,000 words as the new main vocabulary of our topic model. The argument main_vocabulary can be set either when creating a new model
[7]:
topic_model = TopicModel(dataset_or_path=demo_dataset, language="en", main_vocabulary=small_vocab)
Creating sentence-transformer ... Calculate vector representations for the samples
Calculate vector representations for the vocabulary
or by passing the argument to the method change_settings of an already existing instance of TopicModel
[8]:
topic_model.change_settings(main_vocabulary=small_vocab)
With the method change_settings, we can change all arguments of the class. Now, we repeat the search for topics with the newly set main vocabulary
[10]:
topic_model.search_topics(nb_topics=3)
topic_model.show_results(print_explanation=True)
Optimizing...
According to the internal scoring system, the following 3 topic achieve 91.67% of the score points which could have been achieved by selecting all 10214 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+---------+-----------+--------------+----------------------
1 | beach | 40.08% | 56.00% | 1.40
| |...........!..............!......................
| | * Best beach I've ever seen !!!.
------+---------+-----------+--------------+----------------------
2 | pool | 30.94% | 30.94% | 1.00
| |...........!..............!......................
| | * Our kids loved the pool.
------+---------+-----------+--------------+----------------------
3 | ocean | 28.98% | 39.97% | 1.38
| |...........!..............!......................
| | * Blue ocean, warm water and sometimes waves. Perfect!
------------------------------------------------------------------
* score: The algorithm distributes score-points for the quality of the topic-sample-match. However, the algorithm also tries to find topics which describe different aspects. When two topics describe (partially) a similar subject, the score points are reduced. Hence, the topics are in competition with each other.
* solo score: The score a topic would achieve when it were the sole topic - i.e. when the topics were not in competition for score points. Hence, solo score >= score.
* solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.
Comparing vocabularies¶
When you run the code above, you find that the identified topics have changed. Before, with the 19-word vocabulary derived from the dataset, we found the topics “beach”, “buffet” and “pool”; with the 10,000-word vocabulary, we find “beach”, “pool” and “ocean”. The first selection, based on only 19 words, is better! The reason is that the 10,000-word vocabulary doesn’t contain the word “buffet”.
Surely, if we had chosen the large vocabulary defined above, the word “buffet” would have been part of the topic model’s vocabulary. On the other hand, the large vocabulary also contains words which can be considered “rare”. In more complex datasets, these rare words could show up as strange topic selections. Still, there is nothing wrong with trying out different vocabulary settings. However, if you run the code on a CPU, processing the large vocabulary might take a few minutes.
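If you want to try it anyway, switching to the large vocabulary is a one-line change. Treat this as an optional experiment, and remember to switch back afterwards so the rest of the tutorial behaves as described:
[ ]:
# Optional experiment: use the ~235,000-word vocabulary instead.
# On a CPU, computing the vector representations can take a few minutes.
topic_model.change_settings(main_vocabulary=large_vocab)
topic_model.search_topics(nb_topics=3)
topic_model.show_results()
# Switch back to the small vocabulary before continuing with the tutorial.
topic_model.change_settings(main_vocabulary=small_vocab)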
Meanwhile, we proceed with the small vocabulary and solve the problem of the missing “buffet” by other means.
Additional vocabulary¶
Next to the argument main_vocabulary, we can also use the argument additional_vocabulary to pass further words to the vocabulary of the topic model. As the name indicates, the words of the additional vocabulary are added to the main vocabulary. The core difference between the main_vocabulary and the additional_vocabulary is that a missing main_vocabulary causes the topic model to extract the main vocabulary from the dataset, while a missing additional_vocabulary doesn’t trigger such an action. The general idea is that the main_vocabulary should comprise several thousand words, while the additional_vocabulary should consist of only a few selected words or expressions. We demonstrate this with the code below, where we add the single word “buffet” to the vocabulary already set above.
[11]:
# We assume that the main_vocabulary was already set, as is the case when you run this tutorial from top to bottom.
topic_model.change_settings(additional_vocabulary=["buffet"])
# Let's see the impact of our change
topic_model.search_topics(nb_topics=3)
topic_model.show_results(print_explanation=True)
Calculate vector representations for the vocabulary
Optimizing...
According to the internal scoring system, the following 3 topic achieve 90.50% of the score points which could have been achieved by selecting all 10215 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+----------+-----------+--------------+----------------------
1 | beach | 44.64% | 44.64% | 1.00
| |...........!..............!......................
| | * Best beach I've ever seen !!!.
| | * Blue ocean, warm water and sometimes waves. Perfect!
------+----------+-----------+--------------+----------------------
2 | buffet | 30.68% | 30.68% | 1.00
| |...........!..............!......................
| | * The selection at the buffet was not very large but everything was delicious.
------+----------+-----------+--------------+----------------------
3 | pool | 24.68% | 24.68% | 1.00
| |...........!..............!......................
| | * Our kids loved the pool.
-------------------------------------------------------------------
* score: The algorithm distributes score-points for the quality of the topic-sample-match. However, the algorithm also tries to find topics which describe different aspects. When two topics describe (partially) a similar subject, the score points are reduced. Hence, the topics are in competition with each other.
* solo score: The score a topic would achieve when it were the sole topic - i.e. when the topics were not in competition for score points. Hence, solo score >= score.
* solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.
Now, we obtain (once again) the topics “beach”, “buffet” and “pool”.
The overlap vocabulary¶
For our artificial four-sentence demo dataset, extracting the vocabulary from the dataset worked fine. However, a “dirty” real-world dataset might be contaminated with spelling mistakes and strange character combinations which are not words at all. These contaminations might spread to the extracted vocabulary and end up as topics. To clean the main vocabulary of wrong entries, TopicModel allows you to set the argument overlap_vocabulary. Any word of the main vocabulary which is not found in the overlap vocabulary is removed. More mathematically speaking, the topic model takes the intersection (overlap) of the main_vocabulary with the overlap_vocabulary. If you decide to use an overlap vocabulary, it should be a large one, such as the large_vocab defined above with roughly 235,000 words. However, for demonstration purposes, we use a small vocabulary here. Further, we clear the arguments main_vocabulary and additional_vocabulary by setting them to the empty list [].
[12]:
# UNWISE CODE - JUST FOR DEMONSTRATION - YOU SHOULD USE A LARGER OVERLAP_VOCABULARY
# Clear all vocabularies
topic_model.change_settings(main_vocabulary=[], additional_vocabulary=[], overlap_vocabulary=[])
# With main_vocabulary=[], the topic model derives its vocabulary from the 4 sentences in the demo dataset.
# The result is stored in the variable "word_list". Let's print it!
print(f"The vocabulary consists of the following {len(topic_model.word_list)} words:")
print(topic_model.word_list)
# Now, we include a small overlap_vocabulary and print the word_list again
topic_model.change_settings(overlap_vocabulary=small_vocab) # UNWISE !!!
print(f"With the overlap_vocabulary, only the following {len(topic_model.word_list)} words remain:")
print(topic_model.word_list)
Create main vocabulary from dataset
Calculate vector representations for the vocabulary
The vocabulary consists of the following 19 words:
["'", 'be', 'beach', 'blue', 'buffet', 'delicious', 'everything', 'good', 'kid', 'large', 'love', 'ocean', 'pool', 'see', 'selection', 've', 'warm', 'water', 'wave']
Calculate vector representations for the vocabulary
With the overlap_vocabulary, only the following 14 words remain:
['beach', 'blue', 'delicious', 'everything', 'good', 'large', 'love', 'ocean', 'pool', 'see', 'selection', 'warm', 'water', 'wave']
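Conceptually, this filtering is a plain set intersection, which you can reproduce yourself using the 19-word list printed above:
[ ]:
# The 19 words that were extracted from the demo dataset (see the output above).
extracted_vocab = ["'", "be", "beach", "blue", "buffet", "delicious", "everything",
                   "good", "kid", "large", "love", "ocean", "pool", "see",
                   "selection", "ve", "warm", "water", "wave"]
# Keep only the words that also occur in the overlap vocabulary.
filtered = sorted(set(extracted_vocab) & set(small_vocab))
print(filtered)  # should match topic_model.word_list printed above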
Excluding words¶
Imagine that, instead of our artificial four-sentence demo dataset, we have a large real dataset of hotel reviews. It is highly likely that the word “hotel” becomes one of the top topics. Yet, we might be more interested in less obvious topics. One way to deal with this would be to simply ignore the topic “hotel”. However, you should keep in mind that the topic model algorithm tries to find topics that cover different areas. That is, once a topic is found that covers a certain area, other topics that cover the same or a similar area will be suppressed. Therefore, the better approach is to tell the topic model that you do not want a certain word to be a topic. To this end, the TopicModel class has the argument excluded_words. We demonstrate this by excluding the word “beach” from the topics.
[13]:
# Restore the vocabulary and undo the previous changes
topic_model.change_settings(main_vocabulary=small_vocab, additional_vocabulary=["buffet"], overlap_vocabulary=[])
# Now we exclude "beach"
topic_model.change_settings(excluded_words=["beach"])
# Let's see the impact of our change
topic_model.search_topics(nb_topics=3)
topic_model.show_results()
Calculate vector representations for the vocabulary
Calculate vector representations for the vocabulary
Optimizing...
According to the internal scoring system, the following 3 topic achieve 98.01% of the score points which could have been achieved by selecting all 10214 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+----------+-----------+--------------+----------------------
1 | ocean | 36.46% | 36.46% | 1.00
| |...........!..............!......................
| | * Blue ocean, warm water and sometimes waves. Perfect!
| | * Best beach I've ever seen !!!.
------+----------+-----------+--------------+----------------------
2 | buffet | 35.21% | 35.21% | 1.00
| |...........!..............!......................
| | * The selection at the buffet was not very large but everything was delicious.
------+----------+-----------+--------------+----------------------
3 | pool | 28.33% | 28.33% | 1.00
| |...........!..............!......................
| | * Our kids loved the pool.
-------------------------------------------------------------------
As you can see, since we explicitly excluded "beach", the word "ocean" is now taking over.
Mandatory topics¶
We might also imagine a situation where we have already decided on a set of topics and would like the topic model to find some more. Again, since the algorithm tries to find topics that cover different areas, it is advisable to tell the topic model about this choice by using the argument mandatory_topics.
[14]:
# Undo the last change
topic_model.change_settings(excluded_words=[])
# Force the word "meals" to become a topic
topic_model.change_settings(mandatory_topics=["meals"])
# Let's see the impact of our change
topic_model.search_topics(nb_topics=3)
topic_model.show_results()
Calculate vector representations for the vocabulary
Calculate vector representations for the vocabulary
Optimizing...
According to the internal scoring system, the following 3 topic achieve 68.56% of the score points which could have been achieved by selecting all 10216 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+---------+-----------+--------------+----------------------
1 | beach | 58.45% | 58.45% | 1.00
| |...........!..............!......................
| | * Best beach I've ever seen !!!.
| | * Blue ocean, warm water and sometimes waves. Perfect!
------+---------+-----------+--------------+----------------------
2 | pool | 32.45% | 32.45% | 1.00
| |...........!..............!......................
| | * Our kids loved the pool.
------+---------+-----------+--------------+----------------------
3 | meals | 9.10% | 9.10% | 1.00
| |...........!..............!......................
| | * The selection at the buffet was not very large but everything was delicious.
------------------------------------------------------------------
As you can see, the mandatory topic "meals" now overshadows the originally found topic "buffet".
Saving and Loading¶
A topic model can be saved by calling its save method with a string argument that specifies the path plus filename.
[15]:
# We save the model in the current directory
filename = "dummy_save_demo.pickle" # internally, the data is pickled
topic_model.save(filename)
To load the data of a topic model into an already existing topic model, we can either use the load method
[16]:
topic_model.load(filename)
or, alternatively, we can use the change_settings method
[17]:
topic_model.change_settings(dataset_or_path=filename)
We can also use the saved data to initialize a new topic model
[18]:
topic_model_2 = TopicModel(dataset_or_path=filename)
# By loading the data of the old topic model, the new topic model knows the dataset, all vocabulary settings and the
# results of the "search_topics" run. Let's check the results.
topic_model_2.show_results()
According to the internal scoring system, the following 3 topic achieve 68.56% of the score points which could have been achieved by selecting all 10216 words in the vocabulary as topics.
rank | topic | score | solo score | solo score / score
------+---------+-----------+--------------+----------------------
1 | beach | 80.48% | 80.48% | 1.00
| |...........!..............!......................
| | * Best beach I've ever seen !!!.
| | * Blue ocean, warm water and sometimes waves. Perfect!
------+---------+-----------+--------------+----------------------
2 | pool | 44.69% | 44.69% | 1.00
| |...........!..............!......................
| | * Our kids loved the pool.
------+---------+-----------+--------------+----------------------
3 | meals | 12.53% | 12.53% | 1.00
| |...........!..............!......................
| | * The selection at the buffet was not very large but everything was delicious.
------------------------------------------------------------------
A word on real-life datasets¶
In our demo dataset, each sample was a single sentence. If possible, we would advise you to split your dataset into sentences as well. However, if this is not possible, or if you feel that it makes no sense for your data, you can also process longer documents. Still, be aware that the quality of the found topics might drop when the samples become too long.
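If you do split longer documents, here is a minimal sketch using spacy, complementing the nltk example at the top of this tutorial. It assumes spacy and its en_core_web_sm model are installed; the single document below is hypothetical:
[ ]:
# Requires: pip install spacy, then python -m spacy download en_core_web_sm.
import spacy

nlp = spacy.load("en_core_web_sm")

def documents_to_sentences(documents):
    """Split a list of documents into a flat list of sentence strings."""
    return [sent.text.strip() for doc in nlp.pipe(documents) for sent in doc.sents]

dataset = documents_to_sentences(["Blue ocean, warm water and sometimes waves. Perfect!"])
print(dataset)  # one sample per sentence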