Topic Modelling

AutoNLU can automatically extract good topics from a corpus of texts. This can be very helpful when deciding which classes to use for class or classlabel tasks.

Have a look at the tutorial “Topic modeling - finding topics for a dataset” to see how this can be used in practice.

TopicModel

class autonlu.topic_model.TopicModel(dataset_or_path, language=None, sentence_transformer=None, main_vocabulary=None, additional_vocabulary=None, overlap_vocabulary=None, excluded_words=None, mandatory_topics=None, verbose=None, device=None)

The class TopicModel provides the tools to find topics for any given dataset. Ideally, the samples in the dataset are single sentences. Longer documents work as well, but the quality of the topics might drop.

The main method of this class is search_topics. Here is a simplified demo of how to use the class:

>>> from autonlu.topic_model import TopicModel
>>> # We need a dataset (list of strings). For the demonstration, we use a dummy dataset.
>>> dummy_dataset = ["Best beach I've ever seen !!!.",
>>>                  "Our kids loved the pool.",
>>>                  "The selection at the buffet was not very large but everything was delicious.",
>>>                  "Blue ocean, warm water and sometimes waves. Perfect!"]
>>> # Instantiate the class
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en")
>>> return_dict = topic_model.search_topics(nb_topics=3)
>>> # Instead of "digesting" the return_dict, we call the "show_results" method of the class to get a quick overview.
>>> topic_model.show_results(print_explanation=True)
>>> # Prints the three topics "beach", "buffet" and "pool" nicely together with some extra information
Parameters
  • dataset_or_path (Union[str, List[str]]) – Either a string containing the path and name of a saved model, or a dataset for which the topics should be found, given as a list of strings. In the latter case, the samples are ideally single sentences. Longer documents work as well, but the quality of the topics might drop. If you pass the filename of a saved model together with further arguments, the model is loaded first and the given arguments then replace the loaded settings.

  • language (Optional[str]) – The language of the samples. Currently, only “en” and “de” are supported. If dataset_or_path is a dataset, language is mandatory.

  • sentence_transformer (Union[str, List[str], None]) – One or more model names of sentence-transformers. The sentence-transformers form the heart of the entire algorithm. If no name(s) are provided, a suitable standard sentence-transformer is chosen automatically.

  • main_vocabulary (Optional[List[str]]) – Samples are matched with all words found in the vocabulary. Different vocabularies with different functionalities can be defined, but the main_vocabulary is the core vocabulary. If no main_vocabulary is provided, it is derived from the dataset.

  • overlap_vocabulary (Optional[List[str]]) – If an overlap_vocabulary is provided, it is used to filter the words in the main_vocabulary: only words found in both vocabularies are kept. An overlap_vocabulary should be large and contain virtually all words of the language. Using an overlap_vocabulary makes sense, for example, when your main_vocabulary might contain incorrect words due to spelling mistakes or other reasons, which is likely when the main_vocabulary is derived from the dataset.

  • additional_vocabulary (Optional[List[str]]) – Allows adding extra words/expressions that are not in the main_vocabulary. Vocables from the additional_vocabulary are not filtered by the overlap_vocabulary, so you can also add longer expressions consisting of several words as a single vocable, which would usually not be found in a dictionary (e.g. “Fun with Flags”).

  • excluded_words (Optional[List[str]]) – Allows explicitly excluding certain words as potential topics. For example, in a dataset about hotels the algorithm might find that the word “hotel” is a very good topic. However, since this is a trivial topic, one might be interested in less obvious topics and hence exclude “hotel” as a topic.

  • mandatory_topics (Optional[List[str]]) – Words in the mandatory_topics list have to be among the chosen topics. If you already know that you would like certain words as topics, we recommend using mandatory_topics to inform the algorithm about your choice instead of simply adding your topics by hand. Topics are in competition with each other, i.e. the algorithm tries to avoid selecting two topics that describe the same subject (see the sketch after this parameter list).

  • device (Optional[str]) – Either "cpu" or "cuda". If not specified, the best device will be determined automatically.

  • verbose (Optional[bool]) – If True (the default), progress information is printed during the calculations.
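
For illustration, a hypothetical setup combining several of these parameters might look as follows (hotel_reviews is an assumed list of review strings; the vocabulary choices are made up for this sketch):

>>> topic_model = TopicModel(dataset_or_path=hotel_reviews,  # assumed list of review strings
>>>                          language="en",
>>>                          excluded_words=["hotel"],       # suppress the trivial topic "hotel"
>>>                          mandatory_topics=["pool"])      # "pool" must appear among the topics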

change_settings(dataset_or_path=None, language=None, sentence_transformer=None, main_vocabulary=None, additional_vocabulary=None, overlap_vocabulary=None, excluded_words=None, mandatory_topics=None, verbose=None, device=None)

This method allows changing most of the settings initialized during the instantiation of the class (see the sketch below).

Parameters
  • dataset_or_path (Union[str, List[str], None]) – Either a string containing the path and name of a saved model, or a dataset for which the topics should be found, given as a list of strings. In the latter case, the samples are ideally single sentences. Longer documents work as well, but the quality of the topics might drop. If you pass the filename of a saved model together with further arguments, the model is loaded first and the given arguments then replace the loaded settings.

  • language (Optional[str]) – The language of the samples. Currently, only “en” and “de” are supported.

  • sentence_transformer (Union[str, List[str], None]) – One or more model names of sentence-transformers. The sentence-transformers form the heart of the entire algorithm. If no name(s) are provided, a suitable standard sentence-transformer is chosen automatically.

  • main_vocabulary (Optional[List[str]]) – Samples are matched with all words found in the vocabulary. Different vocabularies with different functionalities can be defined, but the main_vocabulary is the core vocabulary. If no main_vocabulary is provided, it is derived from the dataset.

  • overlap_vocabulary (Optional[List[str]]) – If an overlap_vocabulary is provided, it is used to filter the words in the main_vocabulary: only words found in both vocabularies are kept. An overlap_vocabulary should be large and contain virtually all words of the language. Using an overlap_vocabulary makes sense, for example, when your main_vocabulary might contain incorrect words due to spelling mistakes or other reasons, which is likely when the main_vocabulary is derived from the dataset.

  • additional_vocabulary (Optional[List[str]]) – Allows adding extra words/expressions that are not in the main_vocabulary. Vocables from the additional_vocabulary are not filtered by the overlap_vocabulary, so you can also add longer expressions consisting of several words as a single vocable, which would usually not be found in a dictionary (e.g. “Fun with Flags”).

  • excluded_words (Optional[List[str]]) – Allows explicitly excluding certain words as potential topics. For example, in a dataset about hotels the algorithm might find that the word “hotel” is a very good topic. However, since this is a trivial topic, one might be interested in less obvious topics and hence exclude “hotel” as a topic.

  • mandatory_topics (Optional[List[str]]) – Words in the mandatory_topics list have to be among the chosen topics. If you already know that you would like certain words as topics, we recommend using mandatory_topics to inform the algorithm about your choice instead of simply adding your topics by hand. Topics are in competition with each other, i.e. the algorithm tries to avoid selecting two topics that describe the same subject.

  • device (Optional[str]) – Either "cpu" or "cuda". If not specified, the best device will be determined automatically.

  • verbose (Optional[bool]) – If True (the default), progress information is printed during the calculations.

Return type

None

Returns

Nothing; all new settings are stored on the instance.
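
A minimal sketch of adjusting a setting after instantiation, reusing the topic_model instance from the demo above:

>>> # Exclude a word that turned out to be a trivial topic, then search again
>>> topic_model.change_settings(excluded_words=["beach"])
>>> return_dict = topic_model.search_topics(nb_topics=3)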

search_topics(nb_topics=20)

search_topics is the main method of the class TopicModel. It searches for topics in a given dataset. A core element of such a search are language models known as “sentence-transformers”. Such a sentence-transformer is used to find similarities between the samples in the dataset and the words of a given vocabulary. The algorithm behind search_topics tries to balance two demands:

1. Find topic-words which have high similarities with the samples.

2. Find topic-words which cover dissimilar subjects, i.e. avoid having two words describing the same subject.

Hence, potential topic-words are in competition with each other, since similar words suppress each other. A sketch of how the returned dictionary can be consumed follows the return-value description below.

Parameters

nb_topics (int) – The number of topics to be found. If not specified, 20 topics will be returned.

Return type

dict

Returns

A dictionary with the following entries

  • topics: list of words (strings) which were selected as topics

  • topic_gain: 1D array of the gain of each topic (the algorithm distributes a kind of score to the words to decide which word is a good topic)

  • independent_topic_gain: 1D array of floats. The independent gain is the gain a topic would achieve without the competition of the other topics (or if it were the sole topic)

  • total_gain: Float. Sum of the gains of all topics

  • max_possible_gain: Float. The maximal possible gain, which is achieved when all words in the vocabulary become topics

  • topic_sample_id: list of arrays, one array for each topic. The array contains the ids (positions) of the samples which are described by the topic in question.

  • topic_sample_match: list of arrays, one array for each topic. The array contains a similarity measure between the topic in question and the samples given by topic_sample_id.
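
As a sketch of how the returned dictionary might be consumed (continuing the demo from above; the keys are the ones listed here):

>>> result = topic_model.search_topics(nb_topics=3)
>>> for topic, ids in zip(result["topics"], result["topic_sample_id"]):
>>>     print(f"{topic}: matched by {len(ids)} samples")
>>>     for i in ids[:2]:  # show up to two matching samples per topic
>>>         print("   ", dummy_dataset[i])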

show_results(nb_examples=5, print_explanation=False)

Prints the found topics with statistics and optional example samples.

Parameters
  • nb_examples (int) – Number of examples shown per topic

  • print_explanation (bool) – If True, an explanatory text for the table header is printed

Return type

None
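
For instance, to print three example samples per topic together with an explanation of the table header:

>>> topic_model.show_results(nb_examples=3, print_explanation=True)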

VocabularyTools

class autonlu.topic_model.VocabularyTools

A collection of static methods used to obtain and manipulate vocabulary.

static get_large_english_vocab()

This method uses the word corpus of nltk and might require downloading it first.

Return type

List[str]

Returns

A list of over 235,000 English words.
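
A brief sketch of how this large vocabulary could serve as the overlap_vocabulary of a TopicModel (assuming the nltk word corpus is available and reusing the dummy_dataset from the demo above):

>>> from autonlu.topic_model import TopicModel, VocabularyTools
>>> large_vocab = VocabularyTools.get_large_english_vocab()
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en",
>>>                          overlap_vocabulary=large_vocab)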

static get_small_english_vocab(large_vocab=None)

A small English vocabulary of roughly 10,000 words is returned as a list of strings. The vocabulary is extracted from the BERT tokenizer vocabulary and a large_vocab. The latter is needed since the BERT tokenizer vocabulary also contains word fragments, which need to be filtered out.

Parameters

large_vocab (Optional[List[str]]) – A larger control vocabulary. If None, the method get_large_english_vocab is used to create one.

Return type

List[str]

Returns

A list of roughly 10,000 English words.
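
For example, the small vocabulary can be built with an explicitly provided control vocabulary:

>>> large_vocab = VocabularyTools.get_large_english_vocab()
>>> small_vocab = VocabularyTools.get_small_english_vocab(large_vocab=large_vocab)
>>> # small_vocab now holds roughly 10,000 common English words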

static extract_vocabulary_from_dataset(dataset, language, verbose=True)

All words found in the dataset are extracted, lemmatized and returned as a word list.

Parameters
  • dataset (List[str]) – A list of sample strings, from which the words are extracted

  • language (str) – Currently, only “en” and “de” are supported values

  • verbose (bool) – If True (the default), progress information is printed during the calculations.

Return type

List[str]

Returns

A list of words (strings)
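
A sketch of extracting a vocabulary from the demo dataset and reusing it as the main_vocabulary of a TopicModel:

>>> vocab = VocabularyTools.extract_vocabulary_from_dataset(dummy_dataset, language="en")
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en",
>>>                          main_vocabulary=vocab)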

static filter_case_duplicates(word_set)

A set of words might contain some words which appear both capitalized and in lower case. This method removes such duplicates: if a word appears in both lower and upper case, only the lower-case version is kept.

Remark: This method is not perfect: if we have the three words “USA”, “Usa” and “usa”, only “usa” survives. However, if we just have “USA” and “Usa”, both are kept, since the dominant lower-case version is missing. If you don’t need to distinguish between lower and upper case, a simpler and more effective approach is to use a lower-case set, which definitely removes all duplicates:

>>> lower_set = set(word.lower() for word in word_set)
Parameters

word_set (Set[str]) – A set of words (i.e. string)

Return type

Set[str]

Returns

A set of words, without trivial duplicates.
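
A short example of the behaviour described above:

>>> words = {"USA", "usa", "Beach", "beach", "Ocean"}
>>> filtered = VocabularyTools.filter_case_duplicates(words)
>>> # filtered == {"usa", "beach", "Ocean"}: the lower-case variants win where both
>>> # forms are present, while "Ocean" is kept since no lower-case twin exists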