{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic modeling - finding topics for a dataset\n", "In this tutorial, we show how to extract topics from/for a previously unknown dataset. A `topic model` is a domain-unspecific model which works on any dataset without extra training. It's a useful tool for getting a first, quick overview of an unfamiliar dataset and for deciding which classes to use for class and/or classlabel tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the library\n", "\n", "The first step is to import the necessary class." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from autonlu.topic_model import TopicModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Provide a dataset\n", "\n", "Usually, we would load some sort of dataset. However, for this tutorial, we create our own mini-dataset of hotel reviews containing just four sentences. A dataset has to be provided as a list of strings. For longer documents, it is a good idea to perform sentence splitting first (e.g. using [spacy](https://spacy.io/api/sentencizer) or [nltk](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize)) rather than passing whole documents to the system." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "demo_dataset = [\"Best beach I've ever seen !!!.\", \n", " \"Our kids loved the pool.\",\n", " \"The selection at the buffet was not very large but everything was delicious.\",\n", " \"Blue ocean, warm water and sometimes waves. Perfect!\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a topic model\n", "\n", "A topic model is an instance of the class `TopicModel` (imported above). When we create the instance, we have to pass the dataset and set the `language` parameter (in our case `\"en\"` for English)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating sentence-transformer ... Calculate vector representations for the samples\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2d35575565824fc59e8021938845a374", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/4 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The role of the vocabulary\n", "\n", "The topic model algorithm matches the sample sentences against all words in its vocabulary. However, the topic model has no built-in vocabulary; it derives one from the dataset. In our example, the total vocabulary (after filtering) comprises only 19 words.\n", "Although the 19 words derived from the dataset did a great job in our example, we might want more control over the vocabulary. \n", "The class `TopicModel` offers several ways to influence the vocabulary and, with it, the topic selection. The most direct way is to set the argument `main_vocabulary`, as we are going to do below. If you have a suitable vocabulary at hand in the form of a list of strings, you can use it directly."
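, "\n", "As a rough illustration, such a call could look as follows. This is only a sketch: `my_vocabulary` is a made-up word list, and we assume the dataset is passed as the first argument of the constructor, as in the model creation above, while `main_vocabulary` and `language` are the keyword arguments described in this tutorial:\n", "\n", "```python\n", "# Hypothetical, hand-picked vocabulary - any list of strings can be used here\n", "my_vocabulary = [\"beach\", \"pool\", \"buffet\", \"ocean\", \"room\", \"staff\"]\n", "topic_model = TopicModel(demo_dataset, language=\"en\", main_vocabulary=my_vocabulary)\n", "```"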
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocabulary tools\n", "\n", "In this demo, we use the class `VocabularyTools` to generate our main vocabulary. `VocabularyTools` provides the two functions `get_small_english_vocab` and `get_large_english_vocab`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of words in the small vocabulary: 10382\n", "Number of words in the large vocabulary: 235758\n" ] } ], "source": [ "from autonlu.topic_model import VocabularyTools\n", "small_vocab = VocabularyTools.get_small_english_vocab()\n", "large_vocab = VocabularyTools.get_large_english_vocab()\n", "print(f\"Number of words in the small vocabulary: {len(small_vocab)}\")\n", "print(f\"Number of words in the large vocabulary: {len(large_vocab)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the main vocabulary\n", "\n", "We choose the small vocabulary with its roughly 10,000 words as the new main vocabulary of our topic model. \n", "The argument `main_vocabulary` can either be set when creating a new model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating sentence-transformer ... Calculate vector representations for the samples\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5cff882b6faf4610aa9a7acb8291289b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/4 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "topic_model.search_topics(nb_topics=3)\n", "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing vocabularies\n", "\n", "When you run the code above, you find that the identified topics have changed. Before, with the 19-word vocabulary derived from the dataset, we found the topics \"beach\", \"buffet\" and \"pool\"; with the 10,000-word vocabulary, we find \"beach\", \"pool\" and \"ocean\". The first selection, based on only 19 words, is better! The reason is that the 10,000-word vocabulary doesn't contain the word \"buffet\" - you can verify this with the small check below.\n", "\n", "Of course, if we had chosen the large vocabulary defined above, the word \"buffet\" would have been part of the topic model's vocabulary. On the other hand, the large vocabulary also contains words which can be considered \"rare\". So, on more complex datasets, these rare words could show up as strange topic selections. Still, there is nothing wrong with trying out different vocabulary settings. However, if you run the code on a CPU, processing the large vocabulary might take a few minutes.\n", "\n", "For now, we proceed with the small vocabulary and solve the problem of the missing \"buffet\" by other means." ] },
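{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to check the claim about the missing word yourself, a quick membership test is enough. This is a small, optional sketch; it assumes that `small_vocab` and `large_vocab` (created above) behave like ordinary Python collections of lowercase strings, as the `len()` calls above suggest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: is \"buffet\" part of the two vocabularies?\n", "# According to the discussion above, it should be missing from the small\n", "# vocabulary but present in the large one.\n", "print(\"buffet\" in small_vocab)\n", "print(\"buffet\" in large_vocab)" ] },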
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Additional vocabulary\n", "\n", "Besides the argument `main_vocabulary`, we can also use the argument `additional_vocabulary` to pass further words to the vocabulary of the topic model. As the name indicates, the words of the additional vocabulary are added to the main vocabulary. The core difference between the `main_vocabulary` and the `additional_vocabulary` is that a missing `main_vocabulary` causes the vocabulary to be extracted from the dataset, while a missing `additional_vocabulary` doesn't trigger any such action. The general idea is that the `main_vocabulary` should comprise several thousand words, while the `additional_vocabulary` should consist of only a few selected words or expressions. We demonstrate this with the code below, where we add the single word \"buffet\" to the vocabulary we already set above. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculate vector representations for the vocabulary\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "aa77d37dfb4a46199fa8d4d6e76a70b8", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "# We assume that the main_vocabulary was already set, as is the case when you run this tutorial from top to bottom.\n", "topic_model.change_settings(additional_vocabulary=[\"buffet\"])\n", "# Let's see the impact of our change\n", "topic_model.search_topics(nb_topics=3)\n", "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we obtain (once again) the topics \"beach\", \"buffet\" and \"pool\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The overlap vocabulary\n", "\n", "For our artificial four-sentence demo dataset, extracting the vocabulary from the dataset worked fine. However, a \"dirty\" real-world dataset might be contaminated with spelling mistakes and strange character combinations which are not words at all. These contaminations might spread to the extracted vocabulary and end up as topics. \n", "To clean such wrong entries out of the main vocabulary, `TopicModel` allows you to set the argument `overlap_vocabulary`. Any word of the main vocabulary which is not found in the overlap vocabulary is removed. More mathematically speaking, the topic model takes the intersection (overlap) of the `main_vocabulary` with the `overlap_vocabulary`.\n", "If you decide to use an overlap vocabulary, it should be a large one, such as the `large_vocab` defined above with its roughly 235,000 words. However, for demonstration purposes, we use a small vocabulary here. Further, we clear the arguments `main_vocabulary` and `additional_vocabulary` by setting them to the empty list `[]`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Create main vocabulary from dataset\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "db82d16c93eb47318329c11953f141cf", "version_major": 2, "version_minor": 0 }, "text/plain": [ "First of two runs:: 0%| | 0/4 [00:00