SimpleModel
Warning! SimpleModel is only documented for reference purposes and should not be used directly for production anymore!
- class autonlu.SimpleModel(model_folder, key=None, baseurl=None, state_callback=None, stop_callback=None, encrypt=True, device=None, log_dir='tensorboard_logs', use_samplehash=True, trial=None, task=None)
A class implementing a versatile model that can be trained on one of various tasks. To determine which task the model shall learn, set the task argument. If the argument is not set, a "classification" task is assumed (for more information, see the list of parameters).
The model is loaded the first time an operation is performed on it (e.g. predict(), train(), finetune(), …).
- Parameters
model_folder (str) – A path or name of the model that should be used. Will be sent through get_model_dir() and can therefore be:
The path to a model
The name of a model available in Studio
The name of a model available in the Hugging Face model repo
task (Optional[str]) – Determines which task the model shall learn. Possible values are:
- "classification"
The default value, which is chosen when task is not set otherwise. The "classification" task comprises the subtasks classlabel, class and label, which are automatically derived from the target labels during training.
- "token_classification"
Each word in a sample has a label. A typical example is named entity recognition (NER).
- "question_answering"
Each sample consists of a question-context tuple. The model seeks one or more passages in the context which qualify as an answer to the question.
key (Optional[str]) – A JSON web token which is used for authentication. If no key is given, the key is taken from the environment variable DO_PRODUCT_KEY instead.
baseurl (Optional[str]) – Base URL of the Studio instance used for this call. If None, the environment variable DO_BASEURL will be used. If DO_BASEURL is not defined, the standard base URL will be used. In most cases this does not need to be changed unless you are working with an on-premise version of Studio.
state_callback (Optional[Callable]) – Something callable (function or class with a __call__ member function) taking one keyword argument progress, which is called with the current progress in percent after each batch. E.g.:
>>> # Print current progress after each batch
>>> def callback(progress):
>>>     print(f"Current progress = {progress}")
stop_callback (Optional[Callable]) – Something callable (function or class with a __call__ member function) taking no arguments. It is called after each batch, and prediction or training is stopped if True is returned. E.g.:
>>> # Stop after 10 batches
>>> i = 0
>>> def callback():
>>>     global i  # i is defined at module level, so it must be declared global
>>>     i += 1
>>>     if i >= 10:
>>>         return True
>>>     return False
encrypt (bool) – If True, the model is encrypted on save.
device (Optional[str]) – Which device the model should be used on ("cpu" or "cuda"). If None, a device will be selected automatically: if a CUDA-capable GPU is available, it will be used, otherwise the CPU. This behavior can be overridden by setting the environment variable DO_DEVICE to either "cpu" or "cuda". autonlu.utils.get_best_device() is used to select the device.
log_dir (Optional[str]) – Specifies in which directory Tensorboard logs should be written. If None, no logs will be written. The logs for individual runs will be put into subdirectories named after the current timestamp.
use_samplehash (bool) – If True (the default), a hash for all trained samples will be saved. These hashes are used during active learning (select_to_label()) to exclude sentences that have already been trained on. If False, these hashes will not be saved, which speeds up some processes, saves memory, and reduces the size of the saved model. It can be useful to disable the sample hash if huge amounts of training data are being used.
trial (Optional[Trial]) – A trial represents a single setup for automatic hyperparameter optimization. See also https://optuna.readthedocs.io/en/stable/reference/trial.html
- Variables
device – The device the model is running on ("cpu", or "cuda" when running on a GPU)
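Both callbacks may be classes with a __call__ member function, which avoids module-level counters. The following is a minimal sketch of that calling convention only: the class names StopAfter and ProgressRecorder are hypothetical, and the loop is a stand-in for the library's internal batch loop, not AutoNLU code.

```python
class StopAfter:
    """Stop callback: returns True once it has been called `limit` times."""

    def __init__(self, limit):
        self.limit = limit
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self.calls >= self.limit


class ProgressRecorder:
    """State callback: stores every reported progress value (in percent)."""

    def __init__(self):
        self.history = []

    def __call__(self, progress):
        self.history.append(progress)


stop_callback = StopAfter(limit=10)
state_callback = ProgressRecorder()

# Stand-in for the batch loop (an assumption for illustration): progress is
# passed as a keyword argument after each batch, then the stop callback is asked.
num_batches = 25
for batch in range(1, num_batches + 1):
    state_callback(progress=100 * batch / num_batches)
    if stop_callback():
        break

print(len(state_callback.history))  # batches processed before stopping
```

Instances like these would be passed as state_callback=... and stop_callback=... when constructing the model.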
- predict(X, batchsize=128, verbose=False, dynamic_quantization=False, markup_dict=None, is_markup=False)
Predicts the correct results for a list of samples X, depending on the task the model was trained for.
- Parameters
X (List) – A list of samples for which predictions are made. The format of the list elements depends on the task the model was trained on, i.e. on SimpleModel.task (set at initialization). Generally, one should use the same data format that was used for training. The data formats, as determined by SimpleModel.task, are:
- "classification"
X is a list of samples, where each sample is a string or a tuple of strings.
- "token_classification"
Each single word is associated with its own label. It is therefore important to know what qualifies as a word (is e.g. “555 2131” one word or two?). To provide clarity, individual samples and the returned results can have two distinct formats:
lists of words
>>> X = [["Tom", "was", "in", "London", "."], ["Lisa", "loves", "Paris", "."]]
markup language texts
>>> X = ["<person>Tom</person> was in <place>London</place>.",
>>>      "<person>Lisa</person> loves <place>Paris</place>."]
You have to choose one method. Mixtures of both methods are not allowed. A few more words on the usage of markup language: if you decide to use markup language, the argument is_markup has to be set to True. However, at prediction time the correct labels are usually unknown. Therefore, a plain text (string) without label tags counts as markup text input. If the input contains label tags anyway, they are ignored. The results are returned as markup language texts, this time with label tags. Further, prediction with markup language text requires a markup_dict. If the model was trained with markup language, the model remembers the markup_dict used in training. Otherwise, you have to provide it explicitly as an argument.
- "question_answering"
Each sample is a tuple consisting of a question and a context in which the answer can be found. As for "token_classification", questions and contexts can be given either entirely as lists of words or as markup language texts. If markup language is used, the argument is_markup has to be set to True. Further, if you use markup language here but did not train the model on markup language, a markup_dict has to be provided. Examples:
list of words
>>> X = [(["What", "color", "do", "bananas", "have", "?"],
>>>       ["Tomatoes", "are", "red", "and", "bananas", "are", "yellow", "."])]
markup language texts
>>> X = [("What color do bananas have?",
>>>       "Tomatoes are red and bananas are <answer>yellow</answer>.")]
batchsize (int) – Number of samples to predict in one inference step. A higher value generally means higher throughput, but the system might run out of memory if this is set too high. If the GPU memory runs out, the system will automatically switch to smaller batch sizes without loss of data, but the switching takes time. Values higher than 128 (the default) usually do not increase the performance by much.
dynamic_quantization (bool) – If enabled, forward propagation is executed with lower precision (int8) to speed up predictions. This is only supported on the CPU. Warning: This feature could reduce the accuracy of your model.
is_markup (bool) – For the tasks "token_classification" and "question_answering", results can be returned in the form of markup language texts (see explanation for X). In this case, is_markup has to be set to True. The default value is False.
markup_dict (Optional[dict]) – Optional. A markup_dict is only needed when X contains markup language texts. When the model was trained on markup language, the model remembers the markup_dict used in training. You only need to provide it explicitly if you want to use a markup_dict other than the one used for training. A markup_dict is a dictionary whose keys are the label numbers > 0 (i.e. 1, 2, 3, …). The 0 label corresponds to the unmarked label, that is, the label which includes everything with no special label (hence, it needs no translation into markup tags). The values of the dictionary are tuples of the start and end tags of the label. Example:
>>> markup_dict = {1: ("<person>", "</person>"), 2: ("<place>", "</place>"), 3: ("<animal>", "</animal>")}
verbose (bool) – If True, a progress bar is shown during prediction
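To make the relationship between word lists, label numbers and a markup_dict concrete, here is a small sketch that renders a labelled word list as markup text. The helper to_markup is hypothetical and is not part of AutoNLU. It tags every labelled word individually and joins words with single spaces, so it does not reproduce the original whitespace or merge adjacent words with the same label.

```python
def to_markup(words, labels, markup_dict):
    """Wrap each labelled word in the tags that markup_dict assigns to its label."""
    pieces = []
    for word, label in zip(words, labels):
        if label == 0:
            # 0 is the unmarked label and gets no tags.
            pieces.append(word)
        else:
            start_tag, end_tag = markup_dict[label]
            pieces.append(f"{start_tag}{word}{end_tag}")
    return " ".join(pieces)


markup_dict = {1: ("<person>", "</person>"), 2: ("<place>", "</place>")}
words = ["Tom", "was", "in", "London", "."]
labels = [1, 0, 0, 2, 0]

print(to_markup(words, labels, markup_dict))
# <person>Tom</person> was in <place>London</place> .
```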
- Return type
Dict
- Returns
A dictionary containing probabilities, labels, entropies and logits. If the human correction system was set up (i.e. SimpleModel.calculate_human_correction_data was called), the dictionary also contains mistake_probabilities, which gives an estimated upper bound on the probability (range [0, 1]) that the prediction might be incorrect. In case of a "token_classification" or "question_answering" task, word_lists and markup_text are included, too.
probabilities and logits are 2D numpy arrays containing the probabilities/logits for all possible labels (in the order of self.all_labels), and entropies is a 1D numpy array containing the entropies for all the predicted samples. labels contains the label with the highest probability for each predicted sample. manual_check_recommended indicates whether the prediction should ideally be checked by a human for maximal accuracy. If no threshold was set using the human correction system, AutoNLU will not recommend any samples for manual checkup.
- Raises
ValueError – If the product key could not be authorized
Example
Assumes the environment variable DO_PRODUCT_KEY is correctly set
>>> m = SimpleModel(model_folder="DeepOpinion/hotels_absa_en")
>>> segments = [("The room was nice, but the staff was unfriendly", "Room"),
>>>             ("The room was nice, but the staff was unfriendly", "Staff")]
>>> res = m.predict(segments)
res = {'labels': ['POS', 'NEG'],
       'probabilities': array([[0.00974869, 0.00836688, 0.09056191, 0.89132255],
                               [0.01662647, 0.9558573 , 0.01873435, 0.00878185]], dtype=float32),
       'logits': array([[-1.8422663, -1.9951179, 0.3866345, 2.6733072],
                        [-1.0473781, 3.004235 , -0.9280151, -1.6856858]], dtype=float32),
       'entropies': array([0.40521544, 0.227365 ], dtype=float32),
       'manual_check_recommended': [False, False]}
m.all_labels = ['NONE', 'NEG', 'NEU', 'POS']
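The numbers in this example are consistent with probabilities being a softmax over logits and entropies being the Shannon entropy in nats. The sketch below checks this for the first sample using only the standard library. It is a plausibility check under that assumption, not a description of AutoNLU's internals, and the helper names softmax and entropy are illustrative.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probabilities):
    """Shannon entropy using the natural logarithm."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


all_labels = ["NONE", "NEG", "NEU", "POS"]
logits = [-1.8422663, -1.9951179, 0.3866345, 2.6733072]  # first row of res['logits']

probs = softmax(logits)
label = all_labels[probs.index(max(probs))]  # columns follow the order of all_labels

print(label, round(entropy(probs), 3))
```

The label with the highest probability is the fourth entry of all_labels, which matches the first entry of res['labels'].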
- prune(layers_to_prune)
Set the layers of a model which should be pruned (i.e. not used and removed during training).
Only call this function if you want to prune specific layers and know their numbers. In most cases you will want to use auto_prune().
- Parameters
layers_to_prune (List[int]) – A list of integers containing all layer_ids that should be pruned, where layer_id ∈ [0, num_hidden_layers].
- auto_prune(X, Y, valX=None, valY=None, valsplit=0.1, num_layers_to_prune=6, always_prune=None, max_num_samples=40000, epochs=3, verbose=False)
Only for "classification" tasks (the default, if not set otherwise during initialization). Automatically selects the best layers to prune from the current model by using a greedy search strategy: each layer is left out in turn, and the layer whose removal gives the highest accuracy is selected. When pruning further layers, it is assumed that the previous selection is also a good starting point. Note that this method internally remembers the layers to be pruned, so calling train() after auto_prune() is sufficient for the pruned layers to be ignored.
- Parameters
X (List[Union[str, Tuple[str, str]]]) – Input samples. Either a list of strings for text classification or a list of pairs of strings for text pair classification.
Y (List[str]) – Training target. List containing the correct labels as strings.
valX (Optional[List[Union[str, Tuple[str, str]]]]) – Input samples used for validation of the model during training, e.g. for stopping training early if there is no more progress or to report the current score via the score_callback. Same format as X. If None, a part of X will be split off.
valY (Optional[List[str]]) – Training target used for validation of the model during training, e.g. for stopping training early if there is no more progress or to report the current score via the score_callback. Same format as Y. If None, a part of Y will be split off.
valsplit (float) – If valX or valY is not given, specifies how much of the training data should be split off for validation. Default is 10%.
num_layers_to_prune (int) – How many layers should be pruned from the given architecture.
always_prune (Optional[List[int]]) – A list of layer ids that should always be pruned, independent of what the greedy heuristic selects.
max_num_samples (int) – The maximum number of training samples used by the greedy heuristic for training different candidates. Decreasing this number increases the speed, but decreases the accuracy of the final pruned model.
epochs (int) – Number of epochs to train a candidate before evaluating the accuracy. Decreasing this value increases the speed of finding layers to prune, but also decreases the accuracy of the final pruned model.
verbose (bool) – If True, information about the overall progress of finding layers to prune is shown.
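The greedy strategy described for auto_prune() can be sketched in a few lines. This is an illustration of the idea only: score_fn is a hypothetical stand-in for "train a candidate with these layers removed and return its validation accuracy", and counting always_prune towards num_layers_to_prune is an assumption of this sketch, since the text does not say how the two interact.

```python
def greedy_prune(num_layers, num_layers_to_prune, score_fn, always_prune=None):
    """Greedily choose layers to prune, one layer per round."""
    pruned = list(always_prune or [])
    while len(pruned) < num_layers_to_prune:
        candidates = [layer for layer in range(num_layers) if layer not in pruned]
        # Leave out each remaining layer in turn and keep the best-scoring choice.
        best = max(candidates, key=lambda layer: score_fn(pruned + [layer]))
        pruned.append(best)
    return sorted(pruned)


def toy_score(pruned):
    """Toy accuracy: pretend that removing higher layers costs less accuracy."""
    return 1.0 - sum(0.01 * (12 - layer) for layer in pruned)


print(greedy_prune(num_layers=12, num_layers_to_prune=3,
                   score_fn=toy_score, always_prune=[0]))
# [0, 10, 11]
```

Each round evaluates every remaining layer once, so the number of candidate trainings grows with both the model depth and num_layers_to_prune. This is why max_num_samples and epochs trade speed against accuracy.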
- train(X, Y=None, valX=None, valY=None, valsplit=0.1, do_evaluation=True, label_probabilities={}, epochs=2000, do_early_stopping=None, seed=None, learning_rate=2e-05, batchsize=32, autobatchsize=False, metric_callback=None, score_callback=None, mindatasetsize=70000, maxdatasetsize=200000, val_metric='val_accuracy', val_maximize=True, patience_epochs=2, lr_reduction_patience=1, lr_reduction_factor=0.1, epsilon=0.0001, decay_func_name='exp_sqr', nb_opti_steps=625, total_lr_decay=0.0625, verbose=False, is_markup=False, markup_dict=None, *, calculate_human_correction_data=True)
Trains a model on a specific task determined by SimpleModel.task. If you did not specify SimpleModel.task at initialization, it is set to "classification". SimpleModel.train() offers two different methods of training, which differ in the way the learning rate is adjusted and in the conditions under which the training is stopped.
The “switch” to choose between the two methods is the argument do_early_stopping. When set to True, the model will be tested on the validation data at regular intervals. Depending on the test results, the learning rate might be reduced or the training might be stopped if the model does not improve anymore. If do_early_stopping is set to False, the training runs nb_opti_steps optimization steps and proceeds independently of the evaluation. After each optimization step, the learning_rate is slightly reduced. If do_early_stopping is not specified by the user, it is set to False for OMI models and True for other models. Both training methods come with specific arguments.
- Parameters
X (List) – A list of training samples. The format of the list elements depends on the specific training task given by SimpleModel.task (set at initialization). SimpleModel.task can have the following values:
- "classification"
X is a list of samples, where each sample is a string or a pair of strings.
- "token_classification"
Each single word is associated with its own label. It is therefore important to know what qualifies as a word (is e.g. “555 2131” one word or two?). To provide clarity, individual training samples can be provided in two distinct ways:
lists of words
>>> X = [["Tom", "was", "in", "London", "."], ["Lisa", "loves", "Paris", "."]]
markup language texts
>>> X = ["<person>Tom</person> was in <place>London</place>.",
>>>      "<person>Lisa</person> loves <place>Paris</place>."]
You have to choose one method. Mixtures of both methods are not allowed. If you decide to use markup language, the argument is_markup has to be set to True.
- "question_answering"
List of samples, where each sample is a tuple consisting of a question and a context in which the answer can be found. As for "token_classification", questions and contexts can be given either entirely as lists of words or as markup language texts. If markup language is used, the argument is_markup has to be set to True. Examples:
list of words
>>> X = [(["What", "color", "do", "bananas", "have", "?"],
>>>       ["Tomatoes", "are", "red", "and", "bananas", "are", "yellow", "."])]
markup language texts
>>> X = [("What color do bananas have?",
>>>       "Tomatoes are red and bananas are <answer>yellow</answer>.")]
Y (Optional[List]) – Training targets. List containing the correct answers. As for X, the format of Y depends on the value of SimpleModel.task. SimpleModel.task can have the following values:
- "classification"
Y is a list containing the correct labels as strings.
- "token_classification"
If X is given as a list of markup language texts, the label information is already encoded in X. Hence, no Y is needed (i.e. Y = None). If the samples in X are provided as lists of words, Y must be provided as lists of word labels. A word label can be a number or a string, e.g.:
Y = [[1, 0, 0, 2, 0], [1, 0, 2, 0]]
Y = [["1", "0", "0", "2", "0"], ["1", "0", "2", "0"]]
Y = [["person", "", "", "place", ""], ["person", "", "place", ""]]
- "question_answering"
If X is given as markup language (i.e. a list of tuples of markup language texts), Y has to be None. If the question(s) and context(s) in X are provided as lists of words, the answer(s) must be provided as list(s) of word labels with exactly one label for each word in the context. Each label can take one of two values denoting “word is part of the answer” or “word is not part of the answer”. You can name these two labels as you like, but you have to stick to these names for all samples, e.g.:
Y = [[0, 0, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 1, 0, 0]]
Y = [["0", "0", "0", "0", "0", "0", "0", "1", "1", "0"], ["0", "0", "1", "0", "0"]]
Y = [["", "", "", "", "", "", "", "answer", "answer", ""], ["", "", "answer", "", ""]]
Remark: In case your dataset contains several answers, you can also provide a list of answers to each question (so Y is a list (over samples) of lists (over answers) of lists of labels). When you provide a list of answers, you need to do it for every sample (a list with one element (list) is okay, too).
valX (Optional[List]) – Input samples used for validation of the model during training, e.g. for stopping training early if there is no more progress or to report the current score via the score_callback. Same format as X. If None and do_evaluation is True, a part of X will be split off.
valY (Optional[List]) – Training target used for validation of the model during training, e.g. for stopping training early if there is no more progress or to report the current score via the score_callback. Same format as Y. If None and do_evaluation is True, a part of Y will be split off.
valsplit (float) – If do_evaluation is True and valX or valY is not given, specifies how much of the training data should be split off for validation. Default is 10%.
do_evaluation (bool) – If set to False, no evaluation is done. This also means that early stopping is automatically deactivated.
is_markup (bool) – For the tasks "token_classification" and "question_answering", the training data can be provided as markup language text (see explanation for X). In this case, is_markup has to be set to True. The default value is False.
markup_dict (Optional[dict]) – Optional. A markup_dict is only needed when X contains markup language texts. But even in that case, the markup_dict is automatically generated from the input data if not provided explicitly. markup_dict is a dictionary whose keys are the label numbers > 0 (i.e. 1, 2, 3, …). The 0 label corresponds to the unmarked label, that is, the label which includes everything with no special label (hence, it needs no translation into markup tags). The values of the dictionary are tuples of the start and end tags of the label. Example:
>>> markup_dict = {1: ("<person>", "</person>"), 2: ("<place>", "</place>"), 3: ("<animal>", "</animal>")}
When you train a model with markup language and also want to be able to use the model with word lists, you need to know which label number (key) corresponds to which label tags (value). In this case, it is recommended to provide an explicit markup_dict where you control the key-value mapping. Providing an explicit markup_dict also allows you to omit certain label types by simply not mentioning them in the markup_dict. This can be useful when your markup text contains more label types than you want to train. Labels which are not mentioned in the markup_dict are not trained.
label_probabilities (Dict[str, float]) – Only for the "classification" task (the default value of SimpleModel.task). A dictionary mapping label names to the probability (number between 0 and 1) of that label being used for training. All labels not mentioned in label_probabilities are assumed to have a probability of 1. Can be used to subsample certain labels if they are overrepresented.
seed (Optional[int]) – Fix the random seed to make training deterministic (i.e. with the same seed and the same input data in the same order, the resulting model should be identical). Warning! Setting a seed can slow down training.
learning_rate (float) – The learning rate to be used during training. Higher learning rates will lead to faster convergence, but might lead to worse overall accuracy, and if the learning rate is set too high, the system might not learn anything.
batchsize (int) – The number of samples to use in one training step. This also sets the number of samples to accumulate for one weight update if the number is bigger than 32 (at a minimum, 32 samples are always accumulated). A higher value generally means higher throughput, but the system might run out of memory if this is set too high. If the GPU memory runs out, the system will automatically switch to smaller batch sizes without loss of data. The wrong batch size might also inhibit proper training.
autobatchsize (bool) – Deprecated! This option should not be used anymore and will be removed. With the new dynamic lowering of the batch size on CUDA memory errors, it is no longer needed. If True, the batch size will be determined automatically, and the parameter batchsize gives the maximal batch size to use.
metric_callback (Optional[Callable]) – Something callable (function or class with a __call__ function) that takes two keyword arguments Y_true (containing true label numbers from the validation dataset) and Y_pred (containing the label numbers predicted by the currently trained model) and returns a metric, which will be passed as an argument to score_callback. Used to define the metric (e.g. accuracy) to use for the reported score. E.g.:
>>> # Return accuracy as a metric
>>> import numpy as np
>>> def callback(Y_true, Y_pred):
>>>     return np.sum(Y_true == Y_pred) / len(Y_true)
score_callback (Optional[Callable]) – Something callable (function or class with a __call__ function) taking one keyword argument score that is filled with the output of metric_callback and evaluated at regular intervals during training. E.g.:
>>> # Print current score
>>> def callback(score):
>>>     print(f"Current score = {score}")
verbose (bool) – If True, information about the training progress will be shown on the terminal.
do_early_stopping (Optional[bool]) – If True, early stopping will be used, i.e. the model will be tested on the validation data at regular intervals and training will be stopped if the model does not improve anymore. If False, a preset schedule of nb_opti_steps optimization steps is used, combined with a decaying learning rate.
- Arguments used when do_early_stopping is True:
epochs: The maximum number of epochs used for training.
mindatasetsize: Early stopping assumes the dataset size to be at least mindatasetsize. A large mindatasetsize in essence means that the patience for early stopping will be increased. Default is 70,000 to train small datasets longer, since this works better in practice. A value of 70,000 in essence means that datasets with fewer than 70,000 samples will be trained for as long as a dataset with 70,000 samples.
maxdatasetsize: Early stopping assumes the dataset size to be at most maxdatasetsize. A small maxdatasetsize in essence means that the patience for early stopping will be decreased. Default is 200,000 to train large datasets for a shorter time. A value of 200,000 in essence means that datasets with more than 200,000 samples will only be trained for as long as a dataset with 200,000 samples. This does NOT mean that only 200,000 of the samples will be used. All the data is still being utilized. This only influences at which point early stopping decides that a model does not improve anymore!
val_metric: The validation metric to use for the BestModelKeeper (i.e. which metric should be used to determine if a model is better than another one). Generally this should not be changed from val_accuracy.
val_maximize: If True, a higher value of val_metric is considered better; if False, a smaller value is considered better. Has to fit the specific metric in val_metric.
patience_epochs: Defines how many epochs are waited without the model improving before the training is stopped.
lr_reduction_patience: Proportion of one epoch to wait without improvement until the learning rate is reduced.
lr_reduction_factor: The factor by which the learning rate is multiplied if the patience runs out.
epsilon: Maximal difference in the metric for early stopping that should be considered identical.
- Arguments used when do_early_stopping is False:
decay_func_name: Describes the kind of learning rate decay. Options are: “linear”, “exp”, “exp_sqr”.
nb_opti_steps: The number of optimization steps after which the training is stopped.
total_lr_decay: Describes the factor by which the initial learning_rate has been reduced when the training ends. This total learning rate reduction is the result (product) of many small reductions made after each optimization step. These small reductions are calculated from total_lr_decay, nb_opti_steps and decay_func_name.
calculate_human_correction_data (bool) – If True, the human correction system is automatically set up using the validation data (if present).
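To illustrate how total_lr_decay, nb_opti_steps and decay_func_name fit together when do_early_stopping is False, the sketch below computes a schedule for the “exp” and “linear” options. The formulas are assumptions chosen so that the final learning rate equals learning_rate * total_lr_decay. The exact decay functions used by AutoNLU are not given in this text, and “exp_sqr” is deliberately left out because its definition is not stated here. The function name lr_schedule is hypothetical.

```python
def lr_schedule(learning_rate, total_lr_decay, nb_opti_steps, decay_func_name="exp"):
    """Return the learning rate before step 0, 1, ..., nb_opti_steps."""
    rates = []
    for step in range(nb_opti_steps + 1):
        fraction = step / nb_opti_steps
        if decay_func_name == "exp":
            # Same multiplicative reduction after every step; the product of
            # all reductions is total_lr_decay.
            factor = total_lr_decay ** fraction
        elif decay_func_name == "linear":
            # Straight line from 1.0 down to total_lr_decay.
            factor = 1.0 - (1.0 - total_lr_decay) * fraction
        else:
            raise ValueError(f"not covered by this sketch: {decay_func_name}")
        rates.append(learning_rate * factor)
    return rates


# Default values from the train() signature.
rates = lr_schedule(learning_rate=2e-05, total_lr_decay=0.0625, nb_opti_steps=625)
print(rates[0], rates[-1])
```

With the defaults, the learning rate starts at 2e-05 and ends at one sixteenth of that value after 625 optimization steps.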
Example
Assumes the environment variable DO_PRODUCT_KEY is correctly set
>>> m = SimpleModel(model_folder="albert-base-v2")
>>> segments = [("The room was nice, but the staff was unfriendly", "Room"),
>>>             ("The room was nice, but the staff was unfriendly", "Staff")]
>>> valsegments = [("The room was beautiful, but the staff was rude", "Room"),
>>>                ("The room was beautiful, but the staff was rude", "Staff")]
>>> m.train(X=segments, Y=["POS", "NEG"], valX=valsegments, valY=["POS", "NEG"])
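The effect of label_probabilities can be pictured as a per-sample coin flip: a sample is kept with the probability given for its label, and labels that are not mentioned are always kept. The sketch below shows this idea on plain lists. The function subsample is hypothetical, and whether AutoNLU draws the subsample once or anew in every epoch is not stated in this text.

```python
import random


def subsample(X, Y, label_probabilities, seed=None):
    """Keep each sample with the probability assigned to its label."""
    rng = random.Random(seed)
    kept_X, kept_Y = [], []
    for x, y in zip(X, Y):
        # Labels that are not mentioned have probability 1 and are always kept.
        if rng.random() < label_probabilities.get(y, 1.0):
            kept_X.append(x)
            kept_Y.append(y)
    return kept_X, kept_Y


X = ["great room", "rude staff", "lovely view", "cold food"]
Y = ["POS", "NEG", "POS", "NEG"]

# Probability 0 drops every NEG sample; POS is not mentioned and is kept.
print(subsample(X, Y, {"NEG": 0.0}))
# (['great room', 'lovely view'], ['POS', 'POS'])
```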
- evaluate(X, Y=None, batchsize=128, dynamic_quantization=False, verbose=False, markup_dict=None, is_markup=False, none_label=None)
Evaluates a model on given data and returns different performance metrics
- Parameters
X (List) – A list of samples to be evaluated. The format of the list elements depends on the specific task the model was trained on and is determined by the value of SimpleModel.task (set at initialization). SimpleModel.task can have the following values:
- "classification"
X is either a list of strings for text classification or a list of pairs of strings for text pair classification.
- "token_classification"
Each single word is associated with its own label. It is therefore important to know what qualifies as a word (is e.g. “555 2131” one word or two?). To provide clarity, individual samples can be provided in two distinct ways:
lists of words
>>> X = [["Tom", "was", "in", "London", "."], ["Lisa", "loves", "Paris", "."]]
markup language texts
>>> X = ["<person>Tom</person> was in <place>London</place>.",
>>>      "<person>Lisa</person> loves <place>Paris</place>."]
You have to choose one method. Mixtures of both methods are not allowed. If you decide to use markup language, the argument is_markup has to be set to True. Further, evaluation with markup language text requires a markup_dict. If the model was trained with markup language, the model remembers the markup_dict used in training. Otherwise, you have to provide it explicitly as an argument.
- "question_answering"
Each sample is a tuple consisting of a question and a context in which the answer can be found. As for "token_classification", questions and contexts can be given either entirely as lists of words or as markup language texts. If markup language is used, the argument is_markup has to be set to True. Further, if you use markup language here but did not train the model on markup language, a markup_dict has to be provided. Examples:
list of words
>>> X = [(["What", "color", "do", "bananas", "have", "?"],
>>>       ["Tomatoes", "are", "red", "and", "bananas", "are", "yellow", "."])]
markup language texts
>>> X = [("What color do bananas have?",
>>>       "Tomatoes are red and bananas are <answer>yellow</answer>.")]
Y (Optional[List]) – List containing the correct answers. As for X, the format of Y depends on the value of SimpleModel.task. SimpleModel.task can have the following values:
- "classification"
Y is a list containing the correct labels as strings.
- "token_classification"
If X is given as a list of markup language texts, the label information is already encoded in X. Hence, no Y is needed (i.e. Y = None). If the samples in X are provided as lists of words, Y must be provided as lists of word labels. A word label can be a number or a string, e.g.:
Y = [[1, 0, 0, 2, 0], [1, 0, 2, 0]]
Y = [["1", "0", "0", "2", "0"], ["1", "0", "2", "0"]]
Y = [["person", "", "", "place", ""], ["person", "", "place", ""]]
- "question_answering"
If X is given as markup language (i.e. a list of tuples of markup language texts), Y has to be None. If the question(s) and context(s) in X are provided as word lists, the answer(s) must be provided as list(s) of word labels with exactly one label for each word in the context. Each label can take one of two values denoting “word is part of the answer” or “word is not part of the answer”. You can name these two labels as you like, but you have to stick to these names for all samples, e.g.:
Y = [[0, 0, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 1, 0, 0]]
Y = [["0", "0", "0", "0", "0", "0", "0", "1", "1", "0"], ["0", "0", "1", "0", "0"]]
Y = [["", "", "", "", "", "", "", "answer", "answer", ""], ["", "", "answer", "", ""]]
Remark: In case your dataset contains several answers, you can also provide a list of answers to each question (so Y is a list (over samples) of lists (over answers) of lists of labels). When you provide a list of answers, you need to do it for every sample (a list with one element (list) is okay, too).
is_markup (
bool
) – For the tasks “token_classification” and “question_answering” the data can be provided as markup language text (see explanation forX
). In this case,is_markup
has to be set to True. The standard value is False.markup_dict (
Optional
[dict
]) –Optional. A markup_dict is only needed when
X
contains markup language texts. When the model was trained on markup language, the model remembers the markup_dict of the training. Only when you like to use another markup_dict than the one used for training, you need to provide it explicitly. A markup_dict is a dictionary whose keys are the label numbers > 0 (i.e. 1, 2, 3, …). The “0” label corresponds to theunmarked
label - that is the label which includes everything with no special label (hence, it needs no translation into markup tags). The values of the dictionary are tuples of the start and end tags of the label. Example:>>> markup_dict = {1: ("<person>", "</person>"), 2: ("<place>", "</place>"), 3: ("<animal>", "</animal>")}
none_label (
Optional
[str
]) – Optional. Defines which label should be considered as the NONE label. Will be used forf1_binary
. If this isNone
(the default), Thef1_binary
metric does not make much sense and will not be returned.batchsize (
int
) – Number of samples to predict in one inference step. A higher value generally means higher throughput, but the system might run out of memory if this is set too high. If the GPU runs out of memory, the system automatically switches to smaller batch sizes without loss of data, but the switching takes time. Values higher than 128 (the default) usually do not improve performance by much.
dynamic_quantization (
bool
) – If enabled, forward propagation is executed with lower precision (int8) to speed up predictions. This is only supported on the CPU. Warning: This feature could reduce the accuracy of your model.
verbose (
bool
) – If True, information about the evaluation progress will be shown on the terminal.
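The word-label format described for Y in the “question_answering” task can be built from answer spans. The following is a minimal sketch: the helper `span_labels`, the example contexts and the label names `"answer"` / `""` are illustrative assumptions and not part of autonlu.

```python
from typing import List, Tuple


def span_labels(context: List[str], span: Tuple[int, int],
                inside: str = "answer", outside: str = "") -> List[str]:
    """Return one label per word; `span` is a (start, end) word range, end exclusive.

    Hypothetical helper for illustration, not an autonlu function.
    """
    start, end = span
    if not 0 <= start < end <= len(context):
        raise ValueError(f"span {span} does not fit a context of {len(context)} words")
    return [inside if start <= i < end else outside for i in range(len(context))]


# Two contexts given as word lists (10 and 5 words).
contexts = [
    ["The", "meeting", "was", "moved", "from", "Monday", "to", "next", "Friday", "afternoon"],
    ["She", "ordered", "pizza", "for", "lunch"],
]

# One answer per sample: Y is a list (samples) of lists of labels.
Y = [span_labels(contexts[0], (7, 9)), span_labels(contexts[1], (2, 3))]

# Several answers per sample: Y is a list (samples) of lists (answers) of
# lists of labels. Every sample is wrapped, even if it has only one answer.
answer_spans = [[(7, 9), (5, 6)], [(2, 3)]]
Y_multi = [[span_labels(ctx, span) for span in spans]
           for ctx, spans in zip(contexts, answer_spans)]
```

The same two label names are used for every sample, as the format requires.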
- Return type
Dict
- Returns
A dictionary containing accuracy, f1_weighted, precision_weighted, and recall_weighted. If none_label is not None, f1_binary is returned as well.
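To make the returned keys more concrete, the sketch below computes accuracy and support-weighted precision, recall and F1 in plain Python. It assumes the usual “weighted” averaging (per-class scores weighted by the number of true samples of each class); whether autonlu computes its metrics exactly this way is an assumption, and `weighted_metrics` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall and F1 (illustrative only)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    result = {"accuracy": sum(correct.values()) / total,
              "f1_weighted": 0.0, "precision_weighted": 0.0, "recall_weighted": 0.0}
    for label, n_true in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n_true / total    # classes are weighted by their support
        result["precision_weighted"] += weight * precision
        result["recall_weighted"] += weight * recall
        result["f1_weighted"] += weight * f1
    return result


perfect = weighted_metrics(["POS", "NEG"], ["POS", "NEG"])
partial = weighted_metrics(["POS", "POS", "NEG", "NEG"], ["POS", "NEG", "NEG", "NEG"])
```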
Example
>>> X = [("The room was very nice, but the staff was bad.", "Room"),
...      ("The room was very nice, but the staff was bad.", "Staff")]
>>> Y = ["POS", "NEG"]
>>> model = autonlu.SimpleModel("DeepOpinion/hotels_absa_en")
>>> metrics = model.evaluate(X, Y)
>>> # metrics == {'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}
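The meaning of a markup_dict, as described in the parameters above, can be illustrated by rendering a labelled word list as markup text. This is a sketch under assumptions: `to_markup` is a hypothetical helper, not autonlu's own converter, and the whitespace handling is chosen only for illustration.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def to_markup(words: List[str], labels: List[int],
              markup_dict: Dict[int, Tuple[str, str]]) -> str:
    """Wrap each run of equally labelled words in the tags of its label.

    Label 0 is the unmarked label and gets no tags. Note that two directly
    adjacent entities with the same label are merged into one tagged run.
    """
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    pieces = []
    for label, group in groupby(zip(words, labels), key=lambda pair: pair[1]):
        text = " ".join(word for word, _ in group)
        if label != 0:
            start_tag, end_tag = markup_dict[label]
            text = f"{start_tag}{text}{end_tag}"
        pieces.append(text)
    return " ".join(pieces)


markup_dict = {1: ("<person>", "</person>"), 2: ("<place>", "</place>")}
words = ["Maria", "Schmidt", "moved", "to", "Vienna"]
labels = [1, 1, 0, 0, 2]
text = to_markup(words, labels, markup_dict)
# text == "<person>Maria Schmidt</person> moved to <place>Vienna</place>"
```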
- select_to_label(X, acquisitionsize=50, modelsamples=3, al_samples=100, preselectionsize=None, verbose=False)¶
Selects sentences that the current model would like as additional training data to maximally improve performance
- Parameters
X (
List
[Union
[str
,Tuple
[str
,str
]]]) – A list of segments or segment pairs the system can select to be added to the training data. Usually this is data that is available, but not yet labelled.
acquisitionsize (
int
) – The number of samples the system should select. The higher the number, the more data can be labelled in one go. However, more iterations with smaller acquisition sizes let the model learn more from fewer manually labelled samples. Values from 50 to 100 are generally a good compromise.
modelsamples (
int
) – How many different variants of the current model should be used to sample the given segments. Higher numbers lead to more accurate results, but also take more time.
al_samples (
int
) – During selection of the requested segments, a probability distribution has to be approximated. al_samples specifies how many samples should be taken from this distribution as an approximation. Higher values lead to more accurate results, but the runtime increases.
preselectionsize (
Optional
[int
]) – Especially when X is getting very big, the selection process can become slow. preselectionsize specifies how many samples should be pre-selected using a much faster method. Higher values lead to a better selection, but increase the runtime. If None, the preselectionsize is 10 * acquisitionsize.
verbose (
bool
) – If True, information about the active learning process is shown, including progress bars
- Return type
Tuple
[List
[int
],List
[float
]]
- Returns
A tuple (idxs, scores) where idxs are the indices of the selected samples and scores indicates how unsure the model was about those samples. The score is not the only criterion used to select samples, so the scores are not necessarily monotonically decreasing.
Example:
>>> m = SimpleModel(model_folder="DeepOpinion/hotels_absa_en")
>>> X = [("The room was horrible", "Room"), ("The room was horrible", "Staff"), ...]
>>> idxs, scores = m.select_to_label(X=X, acquisitionsize=3)
idxs = [1234, 453, 112]
scores = [3.45, 2.67, 2.55]
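The returned indices refer to positions in X, so one way to continue an active learning round is to move the selected samples out of the unlabelled pool. The sketch below shows only this bookkeeping step; `split_pool` is a hypothetical helper and the index list is a stand-in for the first element returned by select_to_label.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pool(pool: Sequence[T], idxs: Sequence[int]) -> Tuple[List[T], List[T]]:
    """Split the unlabelled pool into samples to label now and samples to keep.

    The selected samples are returned in the order given by `idxs`, the
    remaining samples keep their original order.
    """
    chosen = set(idxs)
    to_label = [pool[i] for i in idxs]
    remaining = [sample for i, sample in enumerate(pool) if i not in chosen]
    return to_label, remaining


pool = [
    ("The room was horrible", "Room"),
    ("The room was horrible", "Staff"),
    ("Breakfast was great", "Food"),
    ("Check-in took an hour", "Staff"),
    ("Nice view from the balcony", "Room"),
]
idxs = [3, 0]  # stand-in for the indices returned by select_to_label
to_label, remaining = split_pool(pool, idxs)
```

After the selected samples have been labelled manually, they can be added to the training data and the selection can be repeated on the remaining pool.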
- save(modeldir)¶
Saves the current model.
If only a language model is present (meaning only finetuning was called), it will be saved in the appropriate format so it can be used as a base model for training of an actual task. A base model can also be loaded and finetuning can be continued.
- Parameters
modeldir (
str
) – The path where the model should be saved. If the folder does not exist yet, it will be created
- Raises
autonlu.core.ModelSaveException – If saving the model fails
- finetune(corpus_filename, batchsize=4, burnin_epochs=0.01, burnin_timelimit=None, burnin_lr=0.002, training_epochs=1, training_timelimit=None, training_lr=2e-05, lm_tasks=['NSP', 'combinedMLM'], loss_weights=[], length=500, teacher=None, verbose=False)¶
Performs language model fine tuning on a given text corpus. Only available for "classification" tasks (the default, if not set otherwise in the initialization).
This command will also automatically generate a tensorboard log, visualizing the different losses over time. The logs are saved in a “runs” directory and can be displayed by using
tensorboard --logdir=runs
- Parameters
corpus_filename (
str
) – The text file to be used for language model fine tuning. This should be a standard text file where documents are separated by two newlines.
batchsize (
int
) – The number of sequences to be used for one pass during fine tuning. The batchsize for the burn-in phase is automatically four times higher. If multiple GPUs are being used, the batchsize is multiplied by the number of available GPUs. If the batch size is too big, the system will automatically halve the batch size until the batches fit on the GPUs, without loss of data.
burnin_epochs (
float
) – Number of epochs to be used for the burn-in phase. In the burn-in phase, the language model is kept fixed and only the prediction heads are trained. This lets the whole system stabilize without disturbing the actual language model. The number of epochs can be given as a floating point number. When set to 1.0, on average, the whole text of the training corpus will have been seen once by the model. The number of burn-in epochs should be selected so that this phase takes around 10 minutes. More is usually not necessary.
burnin_timelimit (
Optional
[float
]) – Number of seconds after which the burn-in phase is ended. If the number of epochs is reached first, the burn-in phase ends earlier. If None, the burn-in proceeds until the epochs are finished.
burnin_lr (
float
) – Learning rate to be used for the burn-in phase
training_epochs (
float
) – Number of epochs to be used for language model fine tuning. The number of epochs can be given as a floating point number. When set to 1.0, on average, the whole text of the training corpus will have been seen once by the model.
training_timelimit (
Optional
[float
]) – Number of seconds after which the training is ended. If the number of epochs is reached first, the training ends earlier. If None, the training proceeds until the epochs are finished.
training_lr (
float
) – Learning rate to be used for the language model fine tuning
lm_tasks (
List
[str
]) – Describes the tasks to be learned. Possible list elements are:
SO: Sentence Ordering
NSP: Next Sentence Prediction
SONSP: SO & NSP
prelabeled: uses a trainer to label sentences
prelabeled_words: uses a trainer to label sentences, where “sentences” are just consecutive words (i.e. not sentences in the grammatical sense)
soloMLM: independent Masked Language Model
combinedMLM: an MLM task which is trained together with the other tasks on the same data
loss_weights (
List
[float
]) – Gives a particular weight to the losses of the lm_tasks. If empty, each loss has the weight 1.
length (
Union
[int
,List
[int
]]) – Determines the number of tokens per sentence in a batch. If length is an integer, this is the number of tokens per sentence. If length is a list of two integers ([low, high]), the number of tokens per sentence in a batch takes a random value between the two. Remark: Currently, for all lm_tasks except prelabeled, a “sentence” is just a sequence of consecutive words/tokens of a given length. For prelabeled, grammatical sentences are used, and the length is defined by the sentence itself.
teacher (
Optional
[LMTeacher
]) – An instance of autonlu.finetuning.LMTeacher. Needed for the tasks prelabeled and prelabeled_words, where labels are provided by a teacher.
verbose (
bool
) – If True, progress bars with additional information will be shown during training
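The corpus format and the batch size rules described above can be sketched as follows. Both helpers are hypothetical and only illustrate one reading of the parameter descriptions: `write_corpus` separates documents by two newline characters, and `effective_batchsizes` assumes that the GPU multiplier applies to both phases and that a CPU-only run counts as one device.

```python
import os
import re
import tempfile
from typing import Dict, Iterable


def write_corpus(documents: Iterable[str], path: str) -> int:
    """Write documents to a text file, separated by two newlines.

    Blank lines inside a document are collapsed so that they are not
    mistaken for a document boundary. Returns the number of documents written.
    """
    cleaned = []
    for doc in documents:
        doc = re.sub(r"\n\s*\n", "\n", doc.strip())
        if doc:
            cleaned.append(doc)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n\n".join(cleaned) + "\n")
    return len(cleaned)


def effective_batchsizes(batchsize: int, n_gpus: int = 1) -> Dict[str, int]:
    """Batch sizes per phase: burn-in is four times higher, scaled by GPU count."""
    devices = max(n_gpus, 1)  # assumption: CPU-only runs count as one device
    return {"burnin": 4 * batchsize * devices, "training": batchsize * devices}


documents = [
    "The room was clean.\n\nThe staff was friendly.",
    "  Breakfast was served late.  ",
    "",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "corpus.txt")
    n_docs = write_corpus(documents, corpus_path)
    with open(corpus_path, encoding="utf-8") as handle:
        corpus_text = handle.read()
    # With autonlu installed, the file could then be passed on, e.g.:
    # model.finetune(corpus_filename=corpus_path, batchsize=4)

sizes = effective_batchsizes(batchsize=4, n_gpus=2)
```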
- upload(name, display_name=None, short_description='', long_description='', language='en', verbose=False)¶
Uploads this model to Studio
- Parameters
name (
str
) – The internal name that should be used for the model in Studio (e.g. this is the name you can later use to download the model from Studio again). Has to be unique. If a model with the same name already exists in Studio, a ModelNameExists exception is raised.
display_name (
Optional
[str
) – The name which should be displayed for this model in Studio. Does not have to be unique.
short_description (
str
) – The description that is shown below the model name in the model list. If empty, the content of long_description will be used.
long_description (
str
) – The description that is shown when the model is opened. If empty, the content of short_description will be used.
language (
str
) – A language identifier (e.g. "en", "de"). See https://en.wikipedia.org/wiki/ISO_639-1
verbose (
bool
) – If True, progress information is printed, for example when compression of the model has finished.
- Raises
ModelNameExists – If the chosen name is already used in Studio