Model

Model implements the easiest-to-use interface to our framework and uses SimpleModel internally for training, inference, etc. Currently, three different text classification tasks and two sequence labeling tasks are supported.

Text classification

Label

Exactly one label from a set of possible labels is assigned to each segment of text. For example, this might be used for sentiment detection, where each sentence either has a sentiment (e.g. negative, neutral, or positive) or not, indicated by the label none. You have this task if the training target is a list of strings, e.g. ["A", "B", "A"]

Class

An arbitrary number of classes (from none to all possible classes) is assigned to each segment of text. For example, this might be used for topic detection, where each sentence can have no, one, or multiple topics. You have this task if the training target is a list of lists of strings, e.g. [["A", "B"], [], ["A"]]

Class Label

An arbitrary number of classes (from none to all possible classes) is assigned to each segment of text, and each detected class is assigned exactly one label from a set of possible labels. For example, this can be used for aspect-based sentiment detection, where each sentence can have multiple topics/aspects, each with its own sentiment. You have this task if the training target is a list of lists of pairs of strings, e.g. [[["A", "1"], ["B", "1"]], [], [["A", "2"]]]

Sequence labeling

token_classification

Each word in a given text can have its own label, e.g. person, location, organization, etc. Training samples are provided as texts in a simple markup language.

question_answering

Samples consist of question-context tuples. The model searches for words in the context which qualify as an answer to the question. For training, the context is provided in a markup language in which the correct words are tagged.
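
For illustration, training samples for the two sequence labeling tasks might look as follows (the samples are taken from the examples later in this document):

>>> # token_classification: words carry labels via markup tags
>>> X_token = ["<person>Tom Miller</person> was in <location>London</location>."]
>>> # question_answering: (question, context) pairs, answers tagged in the context
>>> X_qa = [("What color do bananas have?",
>>>          "Tomatoes are red and bananas are <answer>yellow</answer>.")]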

Every trained model that can be loaded has (implicitly) one of these tasks associated with it. For text classification, the specific task is automatically deduced from the data. More precisely, it depends on whether a list of all classes is given and on the format of the list of labels in the meta.json file in the model folder. As a user, you do not have to concern yourself with this. For the sequence labeling tasks, it suffices to instantiate the model with the argument task="token_classification" or task="question_answering".
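
For example (the base model name is illustrative):

>>> m_ner = Model("bert-base-uncased", task="token_classification")
>>> m_qa = Model("bert-base-uncased", task="question_answering")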

This makes for a very easy-to-use interface to our whole system, and many tasks can be solved with fewer than 5 lines of code.
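
As a minimal sketch (model name and data are illustrative):

>>> from autonlu import Model
>>> m = Model("bert-base-uncased")
>>> m.train(X=["Great room!", "Horrible staff."], Y=["POS", "NEG"])
>>> print(m.predict(["The food was lovely."]))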

Model types

Currently, three different model types are supported by AutoNLU:

Standard Models

Standard models support all tasks and are the main model type for label tasks and sequence labeling tasks. A standard model is used if no postfix is attached to the base model name, e.g. Model("bert-base-uncased") creates a standard model. We recommend this model type for label tasks.

OMI Models

OMI models are a newly introduced model type which is highly effective for class and classlabel tasks. For these tasks, they generally train and predict much faster than standard models and achieve slightly higher accuracies. You are using this model type if you append #omi to a base model name (e.g. Model("bert-base-uncased#omi")). OMI models do not support, and are not useful for, label tasks. They also have some additional disadvantages: for example, standard models generalize to unseen classes to a certain degree (i.e. they can be used to predict classes they have not seen during training), which is not possible for OMI models. We recommend this model type in most cases for class as well as classlabel tasks.

CNN Models

CNN models are a highly efficient model type which supports all three text classification tasks. A CNN model can be more than 10 times faster than even an OMI model for prediction. You are using this model type if you append #cnn to a base model name (e.g. Model("albert-base-v2#cnn")). The base model here mainly specifies which embeddings will be used for the tokens; we recommend using albert-base-v2 in most cases. The speed comes at a price: CNN models usually achieve slightly lower accuracies than the two other model types. In addition, some functionality is currently not supported for CNN models (e.g. autonlu.Model.finetune(), autonlu.Model.select_to_label()). We recommend CNN models if prediction speed is very important, and as the student model for distillation.
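
In code, the postfix convention looks as follows:

>>> standard = Model("bert-base-uncased")  # no postfix: standard model
>>> omi = Model("bert-base-uncased#omi")   # OMI model
>>> cnn = Model("albert-base-v2#cnn")      # CNN model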

The Model Class

class autonlu.Model(model_folder, all_labels=None, standard_label=None, task=None, **kwargs)

The class Model implements a versatile model that can be trained on (one of) various tasks. Prediction results are returned in an easy-to-use format that depends on the task the model was trained for.

To determine which task the model shall learn, you can set the task argument. If the argument is not set, a "classification" task is assumed (for more information, see the task argument).

Most of the arguments are forwarded directly to the constructor of SimpleModel.

Parameters
  • model_folder (str) –

    A path or name of the model that should be used. Will be sent through get_model_dir() and can therefore be:

    • The path to a model

    • The name of a model available in Studio

    • The name of a model available in the Huggingface model repo

  • task (Optional[str]) –

    Determines which task the model should learn. This is only relevant when loading a base model to be trained. When an already trained model is loaded this value is not needed! Possible values are:

    "classification"

    The standard value, which is also chosen when task is not set. The "classification" task comprises the subtasks classlabel, class and label, which are automatically derived from the target labels during training.

    "token_classification"

    Each word in a sample has a label. A typical example is named entity recognition (NER).

    "question_answering"

    Each sample consists of a question-context tuple. The model seeks one or more passages in the context which qualify as answers to the question.

  • all_labels (Optional[List[str]]) – A list of all labels the model should use for a "classification" task. If None, the list of labels will be determined automatically from the training and evaluation data. Setting this manually is mostly useful if a new model is to be trained and the training data does not contain an example of each possible label.

  • standard_label (Optional[str]) – The standard label to use for classlabel (sub)tasks. If the standard label is set, each class that is not explicitly mentioned for a sample of a classlabel task will be assumed to have this label. During prediction, classes that get assigned this label will also not be mentioned explicitly in the results.
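
    For illustration, a hypothetical classlabel target with and without a standard label (class and label names are invented):

    >>> # Without a standard label, every class/label pair is explicit:
    >>> Y = [[["room", "POS"], ["staff", "NONE"]]]
    >>> # With Model(..., standard_label="NONE"), "staff" can be omitted;
    >>> # unmentioned classes are assumed to carry the label "NONE":
    >>> Y = [[["room", "POS"]]]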

  • key – A JSON web token which is used for authentication. If no key is given, the key is alternatively taken from the environment variable DO_PRODUCT_KEY.

  • baseurl – Base URL of the Studio instance. If None, the environment variable DO_BASEURL will be used. If DO_BASEURL is not defined, the standard base URL of the official DeepOpinion Studio server will be used. In most cases, this does not have to be changed unless you are working with an on-premise version of Studio.

  • state_callback

    Something callable (function or class with __call__ member function) taking one keyword argument progress. The callback is called with the current progress in percent after each batch. E.g.:

    >>> # Print current progress after each batch
    >>> def callback(progress):
    >>>     print(f"Current progress = {progress}")
    

  • stop_callback

    Something callable (function or class with __call__ member function) taking no arguments. It is called after each batch; prediction or training is stopped if it returns True. E.g.:

    >>> # Stop after 10 batches
    >>> i = 0
    >>> def callback():
    >>>     global i
    >>>     i += 1
    >>>     return i >= 10
    

  • encrypt – If True, the model is encrypted on save.

  • device – Which device the model should be used on ("cpu" or "cuda"). If None, a device will be automatically selected: If a CUDA capable GPU is available, it will automatically be used, otherwise the CPU will be used. This behavior can be overwritten by specifically setting the environment variable DO_DEVICE to either "cpu" or "cuda".

  • log_dir – Specifies in which directory Tensorboard logs should be written. Default is tensorboard_logs. If None, no logs will be written. The logs for individual runs will be put into subdirectories named after the current timestamp.

  • use_samplehash – If True (the default), a hash for all trained samples will be saved. These hashes are used during active learning (select_to_label()) to exclude sentences that were already seen during training. If False, these hashes will not be saved, which speeds up some processes, saves memory, and reduces the size of the saved model. It can be useful to disable the sample hash if huge amounts of training data are being used.

  • trial – A trial represents a single setup for automatic hyperparameter optimization. See also https://optuna.readthedocs.io/en/stable/reference/trial.html

modeltype()

Returns the model type of the currently loaded model.

Return type

str

Returns

One of "base", "label", "class", "classlabel", "token_classification", and "question_answering".
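
For example (using the model from the prediction example below; the exact return value depends on the loaded model):

>>> m = Model("DeepOpinion/hotels_absa_en")
>>> m.modeltype()
'classlabel'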

predict(X, classes_to_analyze=None, return_extras=False, recommend_manual_check=False, **kwargs)

Predicts the correct results for a list of samples X, depending on the task the model was trained for.

All arguments from SimpleModel.predict() can also be used and are included in the following list of arguments

Parameters
  • X (List) –

    A list of samples for which we do the prediction. The format of the list elements depends on the specific task the model was trained on. Generally, one should use the same data format that was used for training. In the following, we review the data format as determined by Model.task:

    "classification"

    X is a list of strings.

    "token_classification"

    X is a list of strings. The strings might also contain label information in the form of markup language tags (in case one likes to test the prediction of samples for which the correct labels are already known). This label information is ignored and the markup language tags are removed before the prediction is done.

    "question_answering"

    X is a list of pairs of strings, consisting of a question string and a context string in which the answer might be found. As for "token_classification", the context string may contain markup tags, which are ignored for the prediction.

  • classes_to_analyze (Optional[List[str]]) – Used for the "classification" task in case of a class or classlabel subtask. Specifies a list of classes that should be considered during prediction. If None, the full list of classes learned by the model will be used.

  • batchsize – Number of samples to predict in one inference step. A higher value generally means higher throughput, but the system might run out of memory if this is set too high. In cases where the GPU memory is running out, the system will automatically switch to smaller batch sizes without loss of data, but the switching takes time. Values higher than 128 (the default) usually do not increase the performance by much.

  • dynamic_quantization – If True, forward propagation is done with lower precision to speed up predictions. This is only supported on the CPU. Warning: This feature could reduce the accuracy of your model.

  • return_extras (bool) – If True, the system returns, in addition to the processed results, extra information about the prediction. Currently, this entails the raw samples and results of the underlying SimpleModel, the label probabilities and entropies for all the predictions, and information about whether a sample should be checked manually and the probability of the prediction being incorrect in case the human correction system is in use. In case of a "token_classification" or "question_answering" task, word-lists that show how the text was split into words are returned, in addition to the labels of all words.

  • recommend_manual_check (bool) – If True, the predictions consist of (label, bool) pairs where bool indicates whether this sample should be checked manually if the human correction system is in use.
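
    For illustration, a sketch for a label task (values are invented):

    >>> res = m.predict(segments, recommend_manual_check=True)
    >>> # res might look like [("POS", False), ("NEG", True)], where True
    >>> # marks a sample that should be checked manually if the human
    >>> # correction system is in use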

  • verbose – If True, a progress bar is shown during prediction.

Returns

"classification"
  • A list of strings for a label subtask, e.g.

    ["POS", "NEG", "NEG", "POS"]

  • A list of lists of strings for a class subtask, e.g.

    [["service"], [], ["support", "sales"]]

  • A list of lists of lists of two strings (class and label) for a classlabel subtask, e.g.

    [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]

"token_classification"

A list of markup language texts, e.g.

["<person>Tom</person> was in <location>London</location>.", "<person>Lisa</person> loves <location>Paris</location>."]

"question_answering"

A list of markup language texts.

>>> X = [("What color do bananas have?", "Tomatoes are red and bananas are yellow."),
>>>      ("What color do tomatoes have?", "Tomatoes are red and bananas are yellow.")]
>>> prediction = model.predict(X=X)

The result (prediction) should look as follows: ["Tomatoes are red and bananas are <answer>yellow</answer>.", "Tomatoes are <answer>red</answer> and bananas are yellow."]

If return_extras is True, a tuple (result, extras) is returned, where result is in the previously described format and extras is a dictionary containing additional information. The dictionary contains the following keys and values:

"raw_samples"

Contains the actual samples that were sent through the SimpleModel.

"label_probabilities"

Contains the probabilities for all possible labels.

"entropies"

Contains the entropies of all predictions (a high entropy indicates that the system was less sure of its prediction).

"mistake_probabilities"

If the human correction system is set up (i.e. Model.calculate_human_correction_data() was called), contains a probability for each prediction, giving an estimated upper bound on the probability (range [0, 1]) that the prediction is incorrect. For class and classlabel tasks, the probabilities relate to the individual samples found in "raw_samples".

"word_lists"

Only for the tasks "token_classification" and "question_answering". Returns a list (over all samples) of lists of words showing how the model split the text into words (tokens).

"word_labels"

Only for the tasks "token_classification" and "question_answering". Returns a list of lists. The outer list is over all samples; the inner list contains the predicted label of each word in word_lists.

Return type

The output format is determined by the task the current model was trained for

Raises

ValueError – If the product key could not be authorized

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> m = Model(model_folder="DeepOpinion/hotels_absa_en")
>>> segments = ["The room was nice, but the staff was unfriendly"]
>>> res, extras = m.predict(segments, return_extras=True)
res = [[['Room', 'POS'], ['Staff', 'NEG']]]
extras["raw_samples"] = [[('The room was nice, but the staff was unfriendly', 'Activities'),
                          ('The room was nice, but the staff was unfriendly', 'Ambiance'),
                          ('The room was nice, but the staff was unfriendly', 'Amenities'),
                          ...
                        ]
extras["label_probabilities"] = [array([[9.9923420e-01, 2.1814957e-04, 2.0768745e-04, 3.3991490e-04],
                                        [9.9462909e-01, 3.1983077e-03, 1.2891486e-03, 8.8348583e-04],
                                        [9.9342984e-01, 3.9202035e-03, 1.5014511e-03, 1.1485954e-03],
                                        ...
                               ]
extras["entropies"] = [[['Activities', 0.0070808344],
                        ...
                        ['View', 0.010186324],
                        ['WiFi', 0.005671744]]]
extras["manual_check_recommended"] = [[['Activities', False],
                                       ['Ambiance', False],
                                       ['Amenities', False],
                                       ...
                                       ['WiFi', False]]]
train(X, Y=None, valX=None, valY=None, valsplit=0.1, do_evaluation=True, label_probabilities={}, all_classes=None, val_all_classes=None, all_labels=None, learning_rate=None, mindatasetsize=None, patience_epochs=None, lr_reduction_patience=None, lr_reduction_factor=None, epsilon=None, rawX=None, rawY=None, do_early_stopping=None, decay_func_name=None, nb_opti_steps=None, total_lr_decay=None, *, calculate_human_correction_data=True, **kwargs)

Trains a model on a specific task. If you did not specify a task during the initialization, the standard task is "classification". In this case, the model can be trained on one of the three subtasks classlabel, class or label. The subtask is automatically deduced from the format of the training data. For more details, see explanations for the parameters X and Y.

Model.train() offers two different methods of training, which differ in the way the learning rate is adjusted and in the conditions under which training is stopped. The method is selected with the argument do_early_stopping. When set to True, the model is tested on the validation data at regular intervals; depending on the results, the learning rate might be reduced, and training is stopped once the model does not improve anymore. When set to False, the training runs for nb_opti_steps optimization steps, independently of the evaluation, and the learning rate is slightly reduced after each optimization step. If do_early_stopping is not specified by the user, it is set to False for all OMI models and for label tasks, and to True for class and classlabel tasks that do not use an OMI model. Both training methods come with specific arguments.
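
As a sketch, both methods can be selected explicitly (argument values are illustrative; the individual arguments are described in the list below):

>>> # Early stopping: evaluate regularly, stop once the model stops improving
>>> m.train(X=X, Y=Y, do_early_stopping=True, patience_epochs=3)
>>> # Fixed schedule: a set number of steps with a decaying learning rate
>>> m.train(X=X, Y=Y, do_early_stopping=False, nb_opti_steps=2000,
>>>         decay_func_name="linear")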

All arguments from SimpleModel.train() can also be used and are included in the following list of arguments

Parameters
  • X (List) –

    A list of training samples. The format depends on the specific training task:

    "classification"

    X contains the input text samples as a list of strings.

    "token_classification"

    X is a list of strings in a simple markup language format. Words can be associated with a label. Example:

    >>> X = ["<person>Tom Miller</person> was in <location>London</location>.",
    >>>      "<person>Lisa</person> loves <location>Paris</location>."]

    "question_answering"

    X is a list of pairs of strings, consisting of a question string and a context markup language string in which the correct answer(s) are marked by start and end tags. Example:

    >>> X = [("What color do bananas have?",
    >>>       "Tomatoes are red and bananas are <answer>yellow</answer>."),
    >>>      ("What color do tomatoes have?",
    >>>       "Tomatoes are <answer>red</answer> and bananas are yellow.")]
    

  • Y (Optional[List]) –

    Training targets. Only needed for the "classification" task. For "token_classification" and "question_answering", the training targets are already contained in X and Y can be set to None (or ignored). The "classification" task knows the three subtasks class, label and classlabel. The correct subtask is automatically derived from the format of Y, which can be:

    label subtask

    A list of strings, e.g.

    >>> Y = ["POS", "NEG", "NEG", "POS"]
    
    class subtask

    A list of lists of strings, e.g.

    >>> Y = [["service"], [], ["support", "sales"]]
    
    classlabel subtask

    A list of lists of lists of two strings (class and label), e.g.

    >>> Y = [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]
    

  • valX (Optional[List[Union[str, Tuple[str, str]]]]) – Input samples used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as X. If None, a part (10%) of X will be split off.

  • valY (Optional[List[str]]) – Training target used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as Y. A part (10%) of Y will be split off if None and we are training a classification task.

  • rawX (Optional[List[str]]) – Input text samples that should be used for raw training targets. Currently only supported for classification tasks.

  • rawY (Optional[List[Union[str, Tuple[str, str], Tuple[str, bool]]]]) – Training targets for rawX. They consist of single targets per sample in rawX. I.e. a string for a label task, a tuple of (class, label) for a classlabel task and a tuple (class, bool) for a class task where bool indicates whether class is contained in the text or not. Note that for the classlabel task the label always has to be given explicitly since standard_label is ignored for raw data. This raw format is mainly intended to be used when data is obtained during active learning, or as a side product of manually checking samples suggested by the human correction system. Currently only supported for classification tasks.
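
    For illustration, a sketch of raw targets for a classlabel task (the data is invented; note the explicit labels):

    >>> rawX = ["The room was nice", "The room was nice"]
    >>> rawY = [("room", "POS"), ("staff", "NONE")]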

  • valsplit (float) – If valX or valY is not given, specifies how much of the training data should be split off for validation. Default is 10%.

  • do_evaluation (bool) – If set to False no evaluation is done. This also means that early stopping is automatically deactivated.

  • label_probabilities (Dict[str, float]) – A dictionary, mapping label names to the probability (a number between 0.0 and 1.0) of that label being used for training. All labels not mentioned in label_probabilities are assumed to have a probability of 1.0. Can be used to subsample certain labels if they are overrepresented. Currently only supported for classification tasks.
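
    For example, to subsample an overrepresented label (label name and probability are illustrative):

    >>> # Samples labeled "NONE" are used with only 30% probability;
    >>> # all other labels keep probability 1.0
    >>> m.train(X=X, Y=Y, label_probabilities={"NONE": 0.3})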

  • all_classes (Union[List[str], List[List[str]], None]) – Only used for class or classlabel tasks. Either a list of all possible classes, or a list of lists of all possible classes if the possible classes are different for each sample. The latter is useful when using the standard_label and certain classes should not be generated for specific samples, which happens when using active learning via select_to_label(). Alternatively, use the rawX and rawY arguments. If None, the list of possible classes will be determined automatically from Y and valY.

  • val_all_classes (Union[List[str], List[List[str]], None]) – Same as all_classes, just for the validation data

  • all_labels (Optional[List[str]]) – A list of all possible labels. If None, the list of possible labels will be determined from Y and valY. Can be set explicitly in cases where not all possible labels do occur in the training and validation set (e.g. because they will only be used in a later training session)

  • seed – Fix the random seed to make training deterministic (i.e. with the same seed and the same input data in the same order, the resulting model should be identical). Warning! Setting a seed can slow down training.

  • learning_rate (Optional[float]) – The learning rate to be used at the start of training. Higher learning rates will lead to faster convergence, but might lead to worse overall accuracy, and if the learning rate is set too high, the system might not learn anything. If None, an appropriate learning rate for the given task is selected: 2e-4 for label tasks and 2e-5 for class and classlabel tasks.

  • batchsize – The number of samples to use in one training step. This also sets the number of samples to accumulate for one weight update if the number is bigger than 32 (at a minimum, 32 samples are always accumulated). A higher value generally means higher throughput, but the system might run out of memory if this is set too high. In cases where the GPU memory is running out, the system will automatically switch to smaller batch sizes without loss of data.

  • autobatchsize – Deprecated! This option should not be used anymore and will be removed. With the new dynamic batch size, which is lowered on CUDA memory errors, it is not needed anymore. If True, the batchsize will be determined automatically, and the parameter batchsize gives the maximal batchsize to use.

  • metric_callback

    Something callable (function or class with __call__ function) that takes two keyword arguments Y_true (containing the true label numbers from the validation dataset) and Y_pred (containing the label numbers predicted by the currently trained model) and returns a metric, which will be passed as an argument to score_callback. The format of Y_true and Y_pred is the one used by SimpleModel(). Used to define the metric (e.g. accuracy) for the reported score. E.g.:

    >>> # Return accuracy as a metric
    >>> import numpy as np
    >>> def callback(Y_true, Y_pred):
    >>>     return np.sum(Y_true == Y_pred) / len(Y_true)
    

  • score_callback

    Something callable (function or class with __call__ function) taking one keyword argument score that is filled with the output of metric_callback and evaluated in regular intervals during training. E.g.:

    >>> # Print current score
    >>> def callback(score):
    >>>     print(f"Current score = {score}")
    

  • verbose – If True, information about the training progress will be shown on the terminal.

  • do_early_stopping (Optional[bool]) –

    If True, early stopping will be used, i.e. the model will be tested on the validation data at regular intervals and training will be stopped if the model does not improve anymore. If False, a preset schedule of nb_opti_steps optimization steps is used, combined with a decaying learning rate. do_early_stopping is False by default, with the exception of class and classlabel tasks that are trained without an OMI model (i.e. without the #omi postfix in the base model name).

    Arguments used when do_early_stopping is False:
    • decay_func_name: Describes the kind of learning rate decay to use. Options are "linear", "exp", and "exp_sqr".

    • nb_opti_steps: The number of optimization steps after which the training is stopped.

    • total_lr_decay: Sets the factor by which the initial learning_rate will be reduced by the end of the training.

    Arguments used when do_early_stopping is True:
    • epochs: The maximum number of epochs used for training.

    • mindatasetsize: Early stopping assumes the dataset size to be at least mindatasetsize. A large mindatasetsize in essence means that the patience for early stopping will be increased. Default is 4,000 for label tasks, 0 for class tasks, and 70,000 for classlabel tasks, to train small datasets longer since this works better in practice. A value of 70,000 in essence means that datasets with fewer than 70,000 samples will be trained for as long as a dataset with 70,000 samples.

    • maxdatasetsize: Early stopping assumes the dataset size to be at most maxdatasetsize. A small maxdatasetsize in essence means that the patience for early stopping will be decreased. Default is 200,000, to train large datasets for a shorter time. A value of 200,000 in essence means that datasets with more than 200,000 samples will only be trained for as long as a dataset with 200,000 samples. This does NOT mean that only 200,000 of the samples will be used. All the data is still being utilized. This only influences at which point early stopping decides that a model does not improve anymore!

    • val_metric: The validation metric used to determine whether one model is better than another. Generally, this should not be changed from val_accuracy unless you know exactly what you are doing.

    • val_maximize: If True, a higher value of val_metric is considered better, if False, a smaller value is considered better. Has to fit the specific metric used in val_metric.

    • patience_epochs: Defines how many epochs are waited without the model improving before the training is stopped.

    • lr_reduction_patience: Proportion of one epoch to wait without improvement until the learning rate is reduced.

    • lr_reduction_factor: The factor with which the learning rate is multiplied if the patience runs out.

    • epsilon: The maximal difference in the metric for which two values should still be considered identical.

  • calculate_human_correction_data (bool) – If True, the human correction system is automatically set up using the validation data (if present)

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> # The standard value for Model.task is "classification". Hence, instead of
>>> # m = Model("albert-base-v2", standard_label = "NONE", task="classification")
>>> # it suffices to do:
>>> m = Model("albert-base-v2", standard_label = "NONE")
>>> segments = ["The room was nice, but the staff was unfriendly!",
>>>             "They served great food and the drinks were ok."]
>>> Y = [[["room", "POS"], ["staff", "NEG"]],
>>>      [["food", "POS"], ["drinks", "NEU"]]]
>>> m.train(X = segments, Y=Y, valX=segments, valY=Y)
evaluate(X, Y=None, all_classes=None, batchsize=128, dynamic_quantization=False, verbose=False, **kwargs)

Evaluates a model on given data and returns different performance metrics.

Parameters
  • X (List) –

    A list of samples to be evaluated. The format of the list elements depends on the specific task the model was trained on and is determined by the value Model.task (set in initialization). Model.task can have the following values:

    "classification"

    X is a list of strings, where each string is a sample of text.

    "token_classification"

    X is a list of strings in a simple markup language format, where words can be associated with a label. Example:

    >>> X = ["<person>Tom Miller</person> was in <location>London</location>.",
    >>>      "<person>Lisa</person> loves <location>Paris</location>."]

    "question_answering"

    X is a list of pairs of strings, consisting of a question string and a context markup language string in which the correct answer(s) are marked by start and end tags. Example:

    >>> X = [("What color do bananas have?",
    >>>       "Tomatoes are red and bananas are <answer>yellow</answer>."),
    >>>      ("What color do tomatoes have?",
    >>>       "Tomatoes are <answer>red</answer> and bananas are yellow.")]
    

  • Y (Optional[List]) –

    Correct answers. Only needed for the "classification" task. For "token_classification" and "question_answering", the correct answers are already contained in X; hence, Y can be set to None (or simply omitted, since None is the default value). The "classification" task knows the three subtasks class, label and classlabel, and data has to be provided in the following format:

    label subtask

    A list of strings, e.g.

    >>> Y = ["POS", "NEG", "NEG", "POS"]
    
    class subtask

    A list of lists of strings, e.g.

    >>> Y = [["service"], [], ["support", "sales"]]
    
    classlabel subtask

    A list of lists of lists of two strings (class and label), e.g.

    >>> Y = [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]
    

  • all_classes (Union[List[str], List[List[str]], None]) – Only for the "classification" task. Either a list of all possible classes or a list of lists of all possible classes if the possible classes are different for each sample. This is useful when using the standard_label and certain classes should not be generated for specific samples, which happens when using active learning via select_to_label(). If None, the list of possible classes associated with the trained model will be used automatically.

  • batchsize (int) – Number of samples to predict in one inference step. A higher value generally means higher throughput, but the system might run out of memory if this is set too high. In cases where the GPU memory is running out, the system will automatically switch to smaller batch sizes without loss of data, but the switching takes time. Values higher than 128 (the default) usually do not increase the performance by much.

  • dynamic_quantization (bool) – If True, forward propagation is done with lower precision to speed up predictions. This is only supported on the CPU. Warning: This feature could reduce the accuracy of your model.

  • verbose (bool) – If True, information about the evaluation progress will be printed to the terminal.

Returns

A dictionary containing accuracy, f1_weighted, precision_weighted, and recall_weighted. For token classification, question answering, and class tasks, as well as for classlabel tasks with a set standard_label, f1_binary is returned as well.

Example

>>> X = ["The room was very nice, but the staff was bad."]
>>> Y = [[["Room", "POS"], ["Staff", "NEG"]]]
>>> model = autonlu.Model("DeepOpinion/hotels_absa_en")
>>> metrics = model.evaluate(X, Y)
>>> # metrics == {'accuracy': 1.0, 'f1_weighted': 1.0, 'f1_binary': 1.0, 'precision_weighted': 1.0,
>>> #             'recall_weighted': 1.0}
distill(student_model, X, Y=None, unlabelledX=None, unlabelled_epochs=2, chunk_size=5000, valX=None, valY=None, valsplit=0.1, label_probabilities={}, all_classes=None, val_all_classes=None, learning_rate=None, mindatasetsize=None, patience_epochs=None, lr_reduction_patience=None, lr_reduction_factor=None, epsilon=None, verbose=False, **kwargs)

Distills the model into a student model. This requires a labeled training dataset and optionally an unlabelled dataset (it is highly recommended to use a large unlabelled dataset). Distillation currently only works for "classification" tasks, i.e. Model.task == "classification" (the standard value; see autonlu.Model).

All arguments from SimpleModel.distill() can also be used and are included in the following list of arguments

Parameters
  • student_model (Union[str, Model]) – Either a string representing a name or path of a model, or an instance of Model.

  • X – Input text samples as a list of strings

  • Y

    Training target. List containing the correct output. The format depends on the subtask:

    label subtask

    A list of strings, e.g.

    >>> Y = ["POS", "NEG", "NEG", "POS"]
    
    class subtask

    A list of lists of strings, e.g.

    >>> Y = [["service"], [], ["support", "sales"]]
    
    classlabel subtask

    A list of lists of lists of two strings (class and label), e.g.

    >>> Y = [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]
    

  • unlabelledX (Optional[List[str]]) – Input text samples as a list of strings, optional. Default: None.

  • chunk_size (int) – Size of the chunks to use for unlabelled distillation

  • unlabelled_epochs (int) – Number of epochs to train on the unlabelled dataset. In unlabelled distillation, do_early_stopping is False and the learning rate scheduler defaults to a linear decay

  • valX (Optional[List[Union[str, Tuple[str, str]]]]) – Input samples used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as X. If None, a part of X will be split off.

  • valY (Optional[List[str]]) – Training target used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as Y. If None, a part of Y will be split off.

  • valsplit (float) – If valX or valY is not given, specifies how much of the training data should be split off for validation. Default is 10%.

  • label_probabilities (Dict[str, float]) – A dictionary, mapping label names to the probability (number between 0 and 1) of that label being used for training. All labels not mentioned in label_probabilities are assumed to have a probability of 1. Can be used to subsample certain labels if they are overrepresented.

  • all_classes (Union[List[str], List[List[str]], None]) – Either a list of all possible classes, or a list of lists of all possible classes if the possible classes are different for each sample. This is useful when using the standard_label and certain classes should not be generated for specific samples, which happens when using active learning via select_to_label(). If None, the list of possible classes will be determined automatically from Y and valY.

  • val_all_classes (Union[List[str], List[List[str]], None]) – Same as all_classes, just for the validation data

  • all_labels – A list of all possible labels. If None, the list of possible labels will be determined from Y and valY. Can be set explicitly in cases where not all possible labels do occur in the training and validation set (e.g. because they will only be used in a later training session)

  • temperature – Defines the temperature factor used in the distillation loss calculation (student and teacher logits are divided by temperature before being passed to softmax functions). Therefore, the higher the temperature, the smoother the probability distributions get. Typically temperatures between 1.0 and 5.0 give the best results. Defaults to 1.0

  • alpha – A factor determining the relative proportion of CrossEntropy in the total distillation loss. Has to be between 0.0 and 1.0.
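
    As an illustrative sketch of how temperature and alpha enter the loss (pseudo-code, not necessarily the exact internal implementation):

    >>> # soft_teacher = softmax(teacher_logits / temperature)
    >>> # soft_student = softmax(student_logits / temperature)
    >>> # loss = alpha * cross_entropy(student_logits, labels)
    >>> #        + (1 - alpha) * distillation_loss(soft_student, soft_teacher)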

  • seed – Fix the random seed to make training deterministic (i.e. with the same seed and the same input data in the same order, the resulting model should be identical). Warning! Setting a seed can slow down training.

  • learning_rate (Optional[float]) – The learning rate to be used during training. Higher learning rates will lead to faster convergence, but might lead to worse overall accuracy, and if the learning rate is set too high, the system might not learn anything. If None, an appropriate learning rate for the given task is selected: 2e-4 for label tasks and 2e-5 for class and classlabel tasks.

  • batchsize – The number of samples to use in one training step. This also sets the number of samples to accumulate for one weight update if the number is bigger than 32 (at a minimum, 32 samples are always accumulated). A higher value generally means higher throughput, but the system might run out of memory if this is set too high. In cases where the GPU memory is running out, the system will automatically switch to smaller batch sizes without loss of data. The wrong batch size might also inhibit proper training.

  • mindatasetsize (Optional[int]) – Early stopping assumes the dataset size to be at least mindatasetsize. A large mindatasetsize in essence means that the patience for early stopping will be increased. Default is 0 for label tasks and 70,000 otherwise, to train small datasets longer since this works better in practice. A value of 70,000 in essence means that datasets with fewer than 70,000 samples will be trained for as long as a dataset with 70,000 samples.

  • maxdatasetsize – Early stopping assumes the dataset size to be at most maxdatasetsize. A small maxdatasetsize in essence means that the patience for early stopping will be decreased. Default is 200,000, to train large datasets for a shorter time. A value of 200,000 in essence means that datasets with more than 200,000 samples will only be trained for as long as a dataset with 200,000 samples. This does NOT mean that only 200,000 of the samples will be used. All the data is still being utilized. This only influences at which point early stopping decides that a model does not improve anymore!

  • val_metric – The validation metric to use for the BestModelKeeper (i.e. which metric should be used to determine if a model is better than another one). Generally this should not be changed from val_loss.

  • val_maximize – If True, a higher value of val_metric is considered better, if False, a smaller value is considered better. Has to fit the specific metric in val_metric

  • cache_dir – Directory used to cache the teacher logits (if the teacher model is saved on disk, a subdirectory named precomp_logits in the model folder will be used)

  • verbose – If True, information about the training progress will be shown on the terminal.

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> m = Model("albert-base-v2", standard_label = "NONE")
>>> segments = ["The room was nice, but the staff was unfriendly!",
>>>             "They served great food and the drinks were ok."]
>>> Y = [[["room", "POS"], ["staff", "NEG"]],
>>>      [["food", "POS"], ["drinks", "NEU"]]]
>>> student = m.distill("albert-base-v2#cnn", X = segments, Y=Y, valX=segments, valY=Y)
select_to_label(X, classes_to_analyze=None, **kwargs)

Selects sentences that the current model would like to see as additional training data in order to maximally improve performance. Currently only supported for "classification" tasks.

All arguments from SimpleModel.select_to_label() can also be used and are included in the following list of arguments

Parameters
  • X – A list of segments or segment pairs the system can select to be added to the training data. Usually this is data that is available, but not yet labelled.

  • classes_to_analyze (Optional[List[str]]) – Used in case of a class or classlabel task. Specifies a list of classes that should be considered when selecting sentences for labeling. If None, the list of all known classes is used automatically. This can be useful if certain classes are underrepresented in the training data and we would like to concentrate our selection on those classes.

  • acquisitionsize – The number of samples the system should select. The higher the number, the more data can be labelled in one go. More iterations with smaller acquisition sizes will, however, learn more from fewer manually labelled samples. Values from 50 to 100 are generally a good compromise.

  • modelsamples – How often different variants from the current model should be used to sample the given segments. Higher numbers will lead to more accurate results, but will also take more time.

  • al_samples – During selection of the requested segments, a probability distribution has to be approximated. al_samples specifies how many samples should be taken from this distribution as an approximation. Higher values lead to more accurate results, but the runtime increases.

  • preselectionsize – Especially when X is getting very big, the selection process can become slow. preselectionsize specifies how many samples should be pre-selected using a much faster method. Higher values lead to a better selection, but increase the runtime. If None, the preselectionsize is 10 * acquisitionsize

  • verbose – If True, information about the active learning process is shown, including progress bars

Returns

A tuple (samples_to_label, scores), where samples_to_label are the samples that the system would like to see labeled. In case of a class or classlabel task, the samples are (segment, class) tuples, and score indicates how unsure the model was about a given sample. The score is not the only criterion used to select samples, so the scores are not necessarily monotonically decreasing.

Example:

>>> m = Model("DeepOpinion/hotels_absa_en")
>>> X = ["The room was horrible", "The food was quite nice", ...]
>>> samples_to_label, scores = m.select_to_label(X=X, acquisitionsize=2)
samples_to_label = [("The room was horrible", "room"), ("We really enjoyed the stay", "satisfaction")]
scores = [1.34, 0.561]
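
Building on this, a sketch of a complete active learning loop (label_manually stands for a hypothetical, application-specific labeling step):

>>> labeled_X, labeled_Y = [], []
>>> for _ in range(5):
>>>     samples, scores = m.select_to_label(X=X, acquisitionsize=50)
>>>     newX, newY = label_manually(samples)  # hypothetical helper
>>>     labeled_X += newX
>>>     labeled_Y += newY
>>>     m.train(X=labeled_X, Y=labeled_Y)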
save(model_dir)

Saves the current model.

If only a language model is present (meaning only finetuning was called), it will be saved in the appropriate format so it can be used as a base model for training of an actual task. A base model can also be loaded and finetuning can be continued.

Parameters

model_dir – The path where the model should be saved. If the folder does not exist yet, it will be created.

Raises

autonlu.core.ModelSaveException – If saving the model fails

finetune(corpus_filename, batchsize=4, burnin_epochs=0.01, burnin_timelimit=None, burnin_lr=0.002, training_epochs=1, training_timelimit=None, training_lr=2e-05, lm_tasks=['NSP', 'combinedMLM'], loss_weights=[], length=500, teacher=None, verbose=False)

Performs language model finetuning on a given text corpus. Only available for "classification" tasks (the standard value, if not set otherwise during initialization).

This command will also automatically generate a TensorBoard log visualizing the different losses over time. The logs are saved in a "runs" directory and can be displayed with tensorboard --logdir=runs

Parameters
  • corpus_filename (str) – The text file to be used for language model fine tuning. This should be a standard text file where documents are separated by two new-lines.

  • batchsize (int) – The number of sequences to be used in one pass during finetuning. The batchsize for the burn-in phase is automatically four times higher. If multiple GPUs are being used, the batchsize is multiplied by the number of available GPUs. If the batch size is too big, the system will automatically halve the batch size until the batches fit on the GPUs, without loss of data.

  • burnin_epochs (float) – Number of epochs to be used for the burn-in phase. In the burn-in phase, the language model is kept fixed and only the prediction heads are trained. This lets the whole system stabilize without messing up the actual language model. The number of epochs can be given as a floating point number; when set to 1.0, on average, the whole text of the training corpus will have been seen once by the model. The number of burn-in epochs should be selected so this phase takes around 10 minutes; more is usually not necessary.

  • burnin_timelimit (Optional[float]) – Number of seconds after which the burn-in phase will be ended. If the number of epochs is reached first, the burn-in phase will end earlier. If None, the burn-in will proceed until the epochs are finished.

  • burnin_lr (float) – Learning rate to be used for the burnin phase

  • training_epochs (float) – Number of epochs to be used for language model finetuning. The number of epochs can be given as floating point numbers. When set to 1.0, on average, the whole text of the training corpus will have been seen once by the model.

  • training_timelimit (Optional[float]) – Number of seconds after which the training will be ended. If the number of epochs is reached first, the training will end earlier. If None, the training will proceed until the epochs are finished.

  • training_lr (float) – Learning rate to be used for the language model fine tuning

  • lm_tasks (List[str]) –

    Describes the tasks to be learned. Possible list elements are:

    • SO: Sentence Ordering

    • NSP: Next Sentence Prediction

    • SONSP: SO & NSP

    • prelabeled: uses a teacher to label sentences

    • prelabeled_words: uses a teacher to label sentences, where "sentences" are just consecutive words (i.e. not sentences in the grammatical sense)

    • soloMLM: an independent masked language model task

    • combinedMLM: an MLM task which is trained together with the other tasks on the same data

  • loss_weights (List[float]) – Gives a particular weight to the losses of the lm_tasks. If empty, each loss has the weight 1

  • length (Union[int, List[int]]) – Determines the number of tokens per sentence in a batch. If length is a list of two integers, the number of tokens per sentence in a batch takes a random value within the two integers ([low, high]). If length is an integer, this is the number of tokens per sentence. Remark: currently, for all lm_tasks except prelabeled, a "sentence" is just a sequence of consecutive words/tokens of a given length. For prelabeled, grammatical sentences are used and the length is defined by the sentence itself.

  • teacher (Optional[LMTeacher]) – An instance of autonlu.finetuning.LMTeacher. Needed for the tasks prelabeled and prelabeled_words, where labels are provided by a teacher.

  • verbose (bool) – If True, progress bars with additional information will be shown during training
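
A typical finetuning workflow might look as follows (corpus and folder names are illustrative):

>>> m = Model("bert-base-uncased")
>>> m.finetune("domain_corpus.txt", training_epochs=1, verbose=True)
>>> m.save("my_finetuned_base")  # can later be loaded as a base model for training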

upload(name, display_name=None, short_description='', long_description='', language='en', verbose=False)

Uploads this model to Studio

Parameters
  • name (str) – The internal name that should be used for the model in Studio (e.g. this is the name you can use to later download the model from Studio again). Has to be unique. If a model with the same name already exists on Studio, a ModelNameExists exception will be thrown.

  • display_name (Optional[str]) – The name which should be displayed for this model in Studio. Does not have to be unique.

  • short_description (str) – The description that is shown below the model name in the model list. If empty, the content of long_description will be used.

  • long_description (str) – The description that is shown when the model is opened. If empty, the content of short_description will be used.

  • language (str) – A language identifier (e.g. "en", "de"). https://en.wikipedia.org/wiki/ISO_639-1

  • verbose (bool) – If True, some information is printed, e.g. when compression of the model has finished.

Raises

ModelNameExists – If the chosen name is already used in Studio
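
A sketch of an upload call (names and descriptions are illustrative):

>>> m.upload(name="hotels_absa_en_v2",
>>>          display_name="Hotels ABSA (English)",
>>>          short_description="Aspect-based sentiment for hotel reviews",
>>>          language="en")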

prune(layers_to_prune)

Sets the layers of a model which should be pruned (i.e. removed and not used during training).

Only call this function if you want to prune specific layers and know their number. In most cases you will want to use auto_prune().

Parameters

layers_to_prune (List[int]) – A list of integers containing all layer_ids that should be pruned, where layer_id ∈ [0, num_hidden_layers].
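
For example (the layer ids are illustrative):

>>> m.prune(layers_to_prune=[0, 2, 5])  # layers 0, 2, and 5 will be removed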

auto_prune(X, Y, valX=None, valY=None, valsplit=0.1, num_layers_to_prune=6, always_prune=None, max_num_samples=40000, epochs=3, verbose=False)

Only for "classification" tasks (the standard value, if not set otherwise during initialization). Automatically selects the best layers to prune from the current model using a greedy search strategy: each layer is left out in turn, and the layer whose removal yields the highest accuracy after pruning is selected. For pruning more layers, this previous selection is assumed to also be a good starting point. Note that this method internally remembers the layers to be pruned, so calling train() after auto_prune() is sufficient for the pruned layers to be ignored.

Parameters
  • X (List[Union[str, Tuple[str, str]]]) – Input samples. Either a list of strings for text classification or a list of pairs of strings for text pair classification.

  • Y (List[str]) – Training target. List containing the correct labels as strings.

  • valX (Optional[List[Union[str, Tuple[str, str]]]]) – Input samples used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as X. If None, a part of X will be split off.

  • valY (Optional[List[str]]) – Training target used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback. Same format as Y. If None, a part of Y will be split off.

  • valsplit (float) – If valX or valY is not given, specifies how much of the training data should be split off for validation. Default is 10%.

  • num_layers_to_prune (int) – How many layers should be pruned from the given architecture.

  • always_prune (Optional[List[int]]) – A list of layer ids that should always be pruned, independent of what the greedy heuristic selects

  • max_num_samples (int) – The maximum number of training-samples to use by the greedy heuristic for training different candidates. Decreasing this number increases the speed, but decreases the accuracy of the final pruned model.

  • epochs (int) – Number of epochs to train a candidate before evaluating the accuracy. Decreasing this value increases the speed to find layers to prune, but also decreases the accuracy of the final pruned model.

  • verbose (bool) – If True, information about the overall progress of finding layers to prune is shown.
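
A sketch of auto-pruning followed by training (model name and data are illustrative):

>>> m = Model("albert-base-v2")
>>> X = ["The room was nice.", "The staff was unfriendly."]
>>> Y = ["POS", "NEG"]
>>> m.auto_prune(X=X, Y=Y, num_layers_to_prune=2)
>>> m.train(X=X, Y=Y)  # the pruned layers are remembered and ignored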