Topic Modelling¶
AutoNLU offers the possibility to automatically extract good topics from a corpus of texts. This can be very helpful when you have to decide which classes to use for class or classlabel tasks.
Have a look at the tutorial “Topic modeling - finding topics for a dataset” to see how this can be used in practice.
TopicModel¶
- class autonlu.topic_model.TopicModel(dataset_or_path, language=None, sentence_transformer=None, main_vocabulary=None, additional_vocabulary=None, overlap_vocabulary=None, excluded_words=None, mandatory_topics=None, verbose=None, device=None)¶
The class TopicModel provides the tool to find topics for any given dataset. Ideally, the samples in the dataset are single sentences. Longer documents work as well, but the quality of the topics might drop.
The main function of this class is search_topics. Here is a simplified demo of how to use the class:

>>> from autonlu.topic_model import TopicModel
>>> # We need a dataset (list of strings). For the demonstration, we use a dummy dataset.
>>> dummy_dataset = ["Best beach I've ever seen!!!",
>>>                  "Our kids loved the pool.",
>>>                  "The selection at the buffet was not very large but everything was delicious.",
>>>                  "Blue ocean, warm water and sometimes waves. Perfect!"]
>>> # Instantiate the class
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en")
>>> return_dict = topic_model.search_topics(nb_topics=3)
>>> # Instead of "digesting" the return_dict, we call the "show_results" method of the class to get a quick overview.
>>> topic_model.show_results(print_explanation=True)
>>> # Prints the three topics "beach", "buffet" and "pool" nicely, together with some extra information
- Parameters
  - dataset_or_path (Union[str, List[str]]) – Either a string containing the path and name of a saved model, or a dataset for which the topics should be found, given as a list of strings. In the latter case, the samples are ideally single sentences. Longer documents work as well, but the quality of the topics might drop. If you pass the filename of a saved model plus further arguments, the model is loaded first and the given arguments then replace the loaded settings.
  - language (Optional[str]) – The language of the samples. Currently, only "en" and "de" are supported. If dataset_or_path is a dataset, language is mandatory.
  - sentence_transformer (Union[str, List[str], None]) – One or more model names of sentence-transformers. The sentence-transformers build the heart of the entire algorithm. If no name(s) are provided, a suitable standard sentence-transformer is chosen automatically.
  - main_vocabulary (Optional[List[str]]) – Samples are matched with all words found in this vocabulary. Different vocabularies with different functionalities can be defined; the main_vocabulary is the core vocabulary. If no main_vocabulary is provided, it is derived from the dataset.
  - overlap_vocabulary (Optional[List[str]]) – If an overlap_vocabulary is provided, it is used to filter the words in the main_vocabulary: only words found in both vocabularies are kept. An overlap_vocabulary should be large and contain virtually all words of the language. Using an overlap_vocabulary makes sense, for example, when your main_vocabulary might contain incorrect words due to spelling mistakes or other reasons, which is likely when the main_vocabulary is derived from the dataset.
  - additional_vocabulary (Optional[List[str]]) – Allows adding extra words or expressions that are not in the main_vocabulary. Vocables from the additional_vocabulary are not filtered by the overlap_vocabulary, so you can also add longer expressions consisting of several words as a single vocable, which would usually not be found in a dictionary (e.g. "Fun with Flags").
  - excluded_words (Optional[List[str]]) – Allows explicitly excluding certain words as potential topics. For example, in a dataset about hotels the algorithm might find that the word "hotel" is a very good topic. Since this is a trivial topic, one might be interested in less obvious topics and hence exclude "hotel" as a topic (see the sketch after this list).
  - mandatory_topics (Optional[List[str]]) – Words in the mandatory_topics list have to be among the chosen topics. If you already know that you want certain words as topics, we recommend using mandatory_topics to inform the algorithm about your choice instead of simply adding your topics by hand, since topics are in competition with each other, i.e. the algorithm tries to avoid selecting two topics which describe the same subject.
  - device (Optional[str]) – Either "cpu" or "cuda". If not specified, the best available device is determined automatically.
  - verbose (Optional[bool]) – If True (the default), progress information is printed during the calculations.
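For illustration, here is a hedged sketch of how several of these options might be combined; the dataset and word lists are made up for demonstration purposes:

>>> from autonlu.topic_model import TopicModel
>>> hotel_reviews = ["Our kids loved the pool.",
>>>                  "The hotel staff was friendly.",
>>>                  "Best beach I've ever seen!"]
>>> topic_model = TopicModel(dataset_or_path=hotel_reviews,
>>>                          language="en",
>>>                          excluded_words=["hotel"],    # suppress the trivial topic
>>>                          mandatory_topics=["pool"])   # "pool" must appear among the topics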
- change_settings(dataset_or_path=None, language=None, sentence_transformer=None, main_vocabulary=None, additional_vocabulary=None, overlap_vocabulary=None, excluded_words=None, mandatory_topics=None, verbose=None, device=None)¶
This function allows changing the settings of most variables initialized during the instantiation of the class.
- Parameters
  - dataset_or_path (Union[str, List[str], None]) – Either a string containing the path and name of a saved model, or a dataset for which the topics should be found, given as a list of strings. In the latter case, the samples are ideally single sentences. Longer documents work as well, but the quality of the topics might drop. If you pass the filename of a saved model plus further arguments, the model is loaded first and the given arguments then replace the loaded settings.
  - language (Optional[str]) – The language of the samples. Currently, only "en" and "de" are supported.
  - sentence_transformer (Union[str, List[str], None]) – One or more model names of sentence-transformers. The sentence-transformers build the heart of the entire algorithm. If no name(s) are provided, a suitable standard sentence-transformer is chosen automatically.
  - main_vocabulary (Optional[List[str]]) – The algorithm tries to match the samples with all words found in this vocabulary. Different vocabularies with different functionalities can be defined; the main_vocabulary is the core vocabulary. If no main_vocabulary is provided, it is derived from the dataset.
  - overlap_vocabulary (Optional[List[str]]) – If an overlap_vocabulary is provided, it is used to filter the words in the main_vocabulary: only words found in both vocabularies are kept. An overlap_vocabulary should be large and contain virtually all words of the language. Using an overlap_vocabulary makes sense, for example, when your main_vocabulary might contain incorrect words due to spelling mistakes or other reasons, which is likely when the main_vocabulary is derived from the dataset.
  - additional_vocabulary (Optional[List[str]]) – Allows adding extra words or expressions that are not in the main_vocabulary. Vocables from the additional_vocabulary are not filtered by the overlap_vocabulary, so you can also add longer expressions consisting of several words as a single vocable, which would usually not be found in a dictionary (e.g. "Fun with Flags").
  - excluded_words (Optional[List[str]]) – Allows explicitly excluding certain words as potential topics. For example, in a dataset about hotels the algorithm might find that the word "hotel" is a very good topic. Since this is a trivial topic, one might be interested in less obvious topics and hence exclude "hotel" as a topic.
  - mandatory_topics (Optional[List[str]]) – Words in the mandatory_topics list have to be among the chosen topics. If you already know that you want certain words as topics, we recommend using mandatory_topics to inform the algorithm about your choice instead of simply adding your topics by hand, since topics are in competition with each other, i.e. the algorithm tries to avoid selecting two topics which describe the same subject.
  - device (Optional[str]) – Either "cpu" or "cuda". If not specified, the best available device is determined automatically.
  - verbose (Optional[bool]) – If True (the default), progress information is printed during the calculations.
- Return type
None
- Returns
Nothing; all new information is stored on the instance (self).
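As a sketch, changing a setting after instantiation and rerunning the search might look like this (the excluded words are made up):

>>> topic_model.change_settings(excluded_words=["hotel", "room"])
>>> return_dict = topic_model.search_topics(nb_topics=3)  # rerun with the new settings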
- search_topics(nb_topics=20)¶
search_topics is the main function of the class TopicModel. It searches for topics in a given dataset. A core element of such a search is a language model known as a "sentence-transformer". Such a sentence-transformer is used to find similarities between the samples in the dataset and the words of a given vocabulary. The algorithm behind search_topics tries to balance two demands:

1. Find topic-words which have high similarities with the samples.
2. Find topic-words which cover dissimilar subjects, i.e. avoid having two words describing the same subject.

Hence, potential topic-words are in competition with each other, since similar words suppress each other.
- Parameters
  - nb_topics (int) – The number of topics to be found. If not specified, 20 topics are returned.
- Return type
  dict
- Returns
  A dictionary with the following entries:
  - topics: list of words (strings) which were selected as topics
  - topic_gain: 1D array with the gain of each topic (the algorithm distributes a kind of score to the words to decide which word is a good topic)
  - independent_topic_gain: 1D array of floats. The independent gain is the gain a topic would achieve without the competition of the other topics (or if it were the sole topic)
  - total_gain: float. Sum of the gains of all topics
  - max_possible_gain: float. The maximal possible gain, which is achieved when all words in the vocabulary become topics
  - topic_sample_id: list of arrays, one array for each topic. Each array contains the ids (positions) of the samples which are described by the topic in question
  - topic_sample_match: list of arrays, one array for each topic. Each array contains a similarity measure between the topic in question and the samples given by topic_sample_id
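Based on the entries documented above, here is a hedged sketch of how the returned dictionary could be inspected (the printed values are illustrative only):

>>> return_dict = topic_model.search_topics(nb_topics=3)
>>> for i, topic in enumerate(return_dict["topics"]):
>>>     sample_ids = return_dict["topic_sample_id"][i]
>>>     print(f"{topic}: gain={return_dict['topic_gain'][i]:.3f}, covers {len(sample_ids)} samples")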
- show_results(nb_examples=5, print_explanation=False)¶
Prints the found topics with statistics and optional example samples.
- Parameters
  - nb_examples (int) – Number of examples shown per topic
  - print_explanation (bool) – If True, an explanatory text for the table header is printed
- Return type
None
VocabularyTools¶
- class autonlu.topic_model.VocabularyTools¶
A collection of static methods used to obtain and manipulate vocabularies.
- static get_large_english_vocab()¶
This method uses the word corpus of nltk, which might need to be downloaded first.

- Return type
  List[str]
- Returns
  A list with over 235,000 English words.
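For example, the returned list could serve as an overlap_vocabulary when instantiating a TopicModel. A sketch, assuming both classes are imported from autonlu.topic_model as their qualified names above suggest:

>>> from autonlu.topic_model import TopicModel, VocabularyTools
>>> large_vocab = VocabularyTools.get_large_english_vocab()
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en",
>>>                          overlap_vocabulary=large_vocab)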
- static get_small_english_vocab(large_vocab=None)¶
A small English vocabulary of roughly 10,000 words is returned as a list of strings. The vocabulary is extracted from the BERT tokenizer vocabulary and a large_vocab. The latter is needed since the BERT tokenizer also contains word fractions, which have to be filtered out.

- Parameters
  - large_vocab (Optional[List[str]]) – A larger control vocabulary. If None, the method get_large_english_vocab is used to create one.
- Return type
  List[str]
- Returns
  A list with over 10,000 English words.
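Such a small vocabulary could, for instance, be passed as a restricted main_vocabulary (a sketch under the same import assumption as above):

>>> small_vocab = VocabularyTools.get_small_english_vocab()
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en",
>>>                          main_vocabulary=small_vocab)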
- static extract_vocabulary_from_dataset(dataset, language, verbose=True)¶
All words found in the dataset are extracted, lemmatized and returned as a word list.

- Parameters
  - dataset (List[str]) – A list of sample strings from which the words are extracted
  - language (str) – Currently, only "en" and "de" are supported
  - verbose (bool) – If True (the default), progress information is printed during the calculations.
- Return type
  List[str]
- Returns
  A list of words (strings)
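A hedged usage sketch, reusing the dummy dataset from the demo above; the extracted words can be inspected or edited before being used:

>>> vocab = VocabularyTools.extract_vocabulary_from_dataset(dummy_dataset, language="en")
>>> topic_model = TopicModel(dataset_or_path=dummy_dataset, language="en",
>>>                          main_vocabulary=vocab)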
- static filter_case_duplicates(word_set)¶
A set of words might contain words which appear both with capital letters and in lower case. We want to remove these duplicates: if a word appears in both lower and upper case, only the lower-case variant is kept.

Remark: This method is not perfect. Given the three words "USA", "Usa" and "usa", only "usa" survives. However, given just "USA" and "Usa", both are kept, since the dominant lower-case form is missing. If you do not need to distinguish between lower and upper case, a simpler and more effective method is to use a lower-case set, which removes all duplicates:
>>> lower_set = set(word.lower() for word in word_set)
- Parameters
  - word_set (Set[str]) – A set of words (i.e. strings)
- Return type
  Set[str]
- Returns
  A set of words, without trivial case duplicates.
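A short sketch illustrating the behavior described in the remark (the expected result is inferred from that description):

>>> words = {"USA", "Usa", "usa", "Beach", "beach"}
>>> filtered = VocabularyTools.filter_case_duplicates(words)
>>> # Following the remark above, the result should be {"usa", "beach"}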