{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic modeling - finding topics for a dataset\n", "In this tutorial, we show how to extract topics from/for a previously unknown dataset. A `topic model` is a domain-unspecific model which works on any dataset without extra training. It's a useful tool for getting a first, quick overview of an unfamiliar dataset and for deciding which classes to use for class and/or classlabel tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the library\n", "\n", "The first step is to import the necessary class." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from autonlu.topic_model import TopicModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Provide a dataset\n", "\n", "Usually, we would load some sort of dataset. However, for this tutorial, we create our own mini-dataset of hotel reviews containing just four sentences. A dataset has to be provided as a list of strings. For longer documents, it is a good idea to perform sentence splitting first (e.g. using [spacy](https://spacy.io/api/sentencizer) or [nltk](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize)) rather than passing whole documents to the system." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "demo_dataset = [\"Best beach I've ever seen !!!.\", \n", " \"Our kids loved the pool.\",\n", " \"The selection at the buffet was not very large but everything was delicious.\",\n", " \"Blue ocean, warm water and sometimes waves. Perfect!\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a topic model\n", "\n", "A topic model is an instance of the class `TopicModel` (imported above). When we create the instance, we have to pass the dataset and set the `language` parameter (in our case `\"en\"` for English)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating sentence-transformer ... Calculate vector representations for the samples\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2d35575565824fc59e8021938845a374", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/4 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The role of the vocabulary\n", "\n", "The topic model algorithm matches the sample sentences against all words in its vocabulary. However, the topic model has no built-in vocabulary; it derives one from the dataset. In our example, the total vocabulary (after filtering) comprises only 19 words.\n", "Although the 19 words derived from the dataset did a great job in our example, we might want more control over the vocabulary. \n", "The class `TopicModel` offers several ways to influence the vocabulary and, with it, the topic selection. The most direct way is to set the argument `main_vocabulary`, as we are going to do below. If you have a suitable vocabulary at hand in the form of a list of strings, you can use it directly."
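, "\n", "As a rough illustration, such a call could look as follows. This is only a sketch: `my_vocabulary` is a made-up word list, and we assume the dataset is passed as the first argument of the constructor, as in the model creation above, while `main_vocabulary` and `language` are the keyword arguments described in this tutorial:\n", "\n", "```python\n", "# Hypothetical, hand-picked vocabulary - any list of strings can be used here\n", "my_vocabulary = [\"beach\", \"pool\", \"buffet\", \"ocean\", \"room\", \"staff\"]\n", "topic_model = TopicModel(demo_dataset, language=\"en\", main_vocabulary=my_vocabulary)\n", "```"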
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocabulary tools\n", "\n", "In this demo, we use the class `VocabularyTools` to generate our main vocabulary. `VocabularyTools` provides the two functions `get_small_english_vocab` and `get_large_english_vocab`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of words in the small vocabulary: 10382\n", "Number of words in the large vocabulary: 235758\n" ] } ], "source": [ "from autonlu.topic_model import VocabularyTools\n", "small_vocab = VocabularyTools.get_small_english_vocab()\n", "large_vocab = VocabularyTools.get_large_english_vocab()\n", "print(f\"Number of words in the small vocabulary: {len(small_vocab)}\")\n", "print(f\"Number of words in the large vocabulary: {len(large_vocab)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the main vocabulary\n", "\n", "We choose the small vocabulary with its roughly 10,000 words as the new main vocabulary of our topic model. \n", "The argument `main_vocabulary` can either be set when creating a new model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating sentence-transformer ... Calculate vector representations for the samples\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5cff882b6faf4610aa9a7acb8291289b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/4 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "topic_model.search_topics(nb_topics=3)\n", "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing vocabularies\n", "\n", "When you run the code above, you find that the identified topics have changed. Before, with the 19-word vocabulary derived from the dataset, we found the topics \"beach\", \"buffet\" and \"pool\"; with the 10,000-word vocabulary, we find \"beach\", \"pool\" and \"ocean\". The first selection, based on only 19 words, is better! The reason is that the 10,000-word vocabulary doesn't contain the word \"buffet\" - you can verify this with the small check below.\n", "\n", "Of course, if we had chosen the large vocabulary defined above, the word \"buffet\" would have been part of the topic model's vocabulary. On the other hand, the large vocabulary also contains words which can be considered \"rare\". So, on more complex datasets, these rare words could show up as strange topic selections. Still, there is nothing wrong with trying out different vocabulary settings. However, if you run the code on a CPU, processing the large vocabulary might take a few minutes.\n", "\n", "For now, we proceed with the small vocabulary and solve the problem of the missing \"buffet\" by other means." ] },
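{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to check the claim about the missing word yourself, a quick membership test is enough. This is a small, optional sketch; it assumes that `small_vocab` and `large_vocab` (created above) behave like ordinary Python collections of lowercase strings, as the `len()` calls above suggest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: is \"buffet\" part of the two vocabularies?\n", "# According to the discussion above, it should be missing from the small\n", "# vocabulary but present in the large one.\n", "print(\"buffet\" in small_vocab)\n", "print(\"buffet\" in large_vocab)" ] },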
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Additional vocabulary\n", "\n", "Besides the argument `main_vocabulary`, we can also use the argument `additional_vocabulary` to pass further words to the vocabulary of the topic model. As the name indicates, the words of the additional vocabulary are added to the main vocabulary. The core difference between the `main_vocabulary` and the `additional_vocabulary` is that a missing `main_vocabulary` causes the vocabulary to be extracted from the dataset, while a missing `additional_vocabulary` doesn't trigger any such action. The general idea is that the `main_vocabulary` should comprise several thousand words, while the `additional_vocabulary` should consist of only a few selected words or expressions. We demonstrate this with the code below, where we add the single word \"buffet\" to the vocabulary we already set above. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculate vector representations for the vocabulary\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "aa77d37dfb4a46199fa8d4d6e76a70b8", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00= score.\n", " * solo score / score: The ratio between both scores explained above. A high value means the topic touches a subject with strong competition - most likely, you find a similar topic which touches the same subject.\n", "\n" ] } ], "source": [ "# We assume that the main_vocabulary was already set, as is the case when you run this tutorial from top to bottom.\n", "topic_model.change_settings(additional_vocabulary=[\"buffet\"])\n", "# Let's see the impact of our change\n", "topic_model.search_topics(nb_topics=3)\n", "topic_model.show_results(print_explanation=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we obtain (once again) the topics \"beach\", \"buffet\" and \"pool\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The overlap vocabulary\n", "\n", "For our artificial four-sentence demo dataset, extracting the vocabulary from the dataset worked fine. However, a \"dirty\" real-world dataset might be contaminated with spelling mistakes and strange character combinations which are not words at all. These contaminations might spread to the extracted vocabulary and end up as topics. \n", "To clean such wrong entries out of the main vocabulary, `TopicModel` allows you to set the argument `overlap_vocabulary`. Any word of the main vocabulary which is not found in the overlap vocabulary is removed. More mathematically speaking, the topic model takes the intersection (overlap) of the `main_vocabulary` with the `overlap_vocabulary`.\n", "If you decide to use an overlap vocabulary, it should be a large one, such as the `large_vocab` defined above with its roughly 235,000 words. However, for demonstration purposes, we use a small vocabulary here. Further, we clear the arguments `main_vocabulary` and `additional_vocabulary` by setting them to the empty list `[]`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Create main vocabulary from dataset\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "db82d16c93eb47318329c11953f141cf", "version_major": 2, "version_minor": 0 }, "text/plain": [ "First of two runs:: 0%| | 0/4 [00:00