DocumentModel

The Document Format

The autonlu.DocumentModel class is built around the concept of documents, which are in essence dictionaries with a specified structure.

Supported Document Format for Prediction

During inference/prediction, the document will be annotated to look like the format which is also expected for training (this format will be explained in the following section). In addition to the required key/value pairs, the dictionary can also contain arbitrary, additional key/value pairs which are simply ignored and preserved during annotation.

[
    {
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
            },
            {
                "text": "The bathroom was dirty.",
            }
        ]
    }
]
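
As a minimal sketch, the input structure above can be built from plain strings with a small helper. The name texts_to_documents is hypothetical and not part of autonlu; it only produces the dictionaries shown above.

```python
from typing import Dict, List


def texts_to_documents(texts_per_document: List[List[str]]) -> List[Dict]:
    """Wrap raw segment texts into the document structure used for prediction.

    Each inner list holds the segment texts of one document.
    """
    return [
        {"segments": [{"text": text} for text in texts]}
        for texts in texts_per_document
    ]


documents = texts_to_documents(
    [["The room was very nice, but the staff was bad.", "The bathroom was dirty."]]
)
```

The resulting list has the same shape as the example above and could be extended with additional keys before prediction.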

The following structure would for example also be allowed and the additional key/value pairs will be preserved during the prediction process.

[
    {
        "document_text": "The room was very nice, but the staff was bad. \
              The bathroom was dirty.",
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "id": 12234,
            },
            {
                "text": "The bathroom was dirty.",
                "id": 12234,
            }
        ]
    }
]

After sending this document through the prediction process, we might get back an annotated document that looks like the following:

[
    {
        "document_text": "The room was very nice, but the staff was bad. \
              The bathroom was dirty.",
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "id": 12234,
                "tags": [{"class": "Room", "label": "POS"}, {"class": "Staff", "label": "NEG"}],
                "classes": ["Cleanliness", "Room", "Staff"],
                "standard_label": "NONE",
            },
            {
                "text": "The bathroom was dirty.",
                "id": 12234,
                "tags": [{"class": "Cleanliness", "label": "NEG"}],
                "classes": ["Cleanliness", "Room", "Staff"],
                "standard_label": "NONE",
            }
        ]
    }
]
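
An annotated document like the one above can be flattened into rows for further processing. The following is only a sketch; flatten_annotations is a hypothetical helper, not an autonlu function, and it relies solely on the keys shown above.

```python
from typing import Dict, List, Optional, Tuple

# (document index, segment text, class or None, label or None)
Row = Tuple[int, str, Optional[str], Optional[str]]


def flatten_annotations(documents: List[Dict]) -> List[Row]:
    """Return one row per tag found in the annotated documents."""
    rows: List[Row] = []
    for doc_index, doc in enumerate(documents):
        for segment in doc.get("segments", []):
            for tag in segment.get("tags", []):
                # Depending on the task, "class" or "label" may be absent.
                rows.append(
                    (doc_index, segment["text"], tag.get("class"), tag.get("label"))
                )
    return rows


annotated = [
    {
        "segments": [
            {
                "text": "The bathroom was dirty.",
                "id": 12234,
                "tags": [{"class": "Cleanliness", "label": "NEG"}],
                "classes": ["Cleanliness", "Room", "Staff"],
                "standard_label": "NONE",
            }
        ]
    }
]
rows = flatten_annotations(annotated)
```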

Document Format Supported for Training and Produced from Prediction

Depending on the task, the document format supported by DocumentModel looks slightly different.

Class Task

[
    {
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "tags": [{"class": "Room"}, {"class": "Staff"}],
                "classes": ["Cleanliness", "Room", "Staff"],
            },
            {
                "text": "The bathroom was dirty.",
                "tags": [{"class": "Cleanliness"}],
                "classes": ["Cleanliness", "Room", "Staff"],
            }
        ]
    }
]

For training, classes can be missing, in which case a global list of all classes, specified in train(), will be used.
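
If a global class list has to be assembled by hand, it can be collected from the documents themselves. This is a sketch under the assumption that every class of interest appears in at least one "tags" or "classes" entry; collect_all_classes is a hypothetical helper, and how autonlu derives the list internally is not described here.

```python
from typing import Dict, List


def collect_all_classes(documents: List[Dict]) -> List[str]:
    """Collect every class mentioned in "classes" or "tags" of any segment."""
    found = set()
    for doc in documents:
        for segment in doc.get("segments", []):
            found.update(segment.get("classes", []))
            found.update(
                tag["class"] for tag in segment.get("tags", []) if "class" in tag
            )
    # Sorting is a choice of this sketch to get a stable order.
    return sorted(found)


class_documents = [
    {
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "tags": [{"class": "Room"}, {"class": "Staff"}],
            },
            {"text": "The bathroom was dirty.", "tags": [{"class": "Cleanliness"}]},
        ]
    }
]
all_classes = collect_all_classes(class_documents)
```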

Label Task

[
    {
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "tags": [{"label": "NEU"}],
            },
            {
                "text": "The bathroom was dirty.",
                "tags": [{"label": "NEG"}],
            }
        ]
    }
]

Class-Label Task

[
    {
        "segments": [
            {
                "text": "The room was very nice, but the staff was bad.",
                "tags": [{"class": "Room", "label": "POS"}, {"class": "Staff", "label": "NEG"}],
                "classes": ["Cleanliness", "Room", "Staff"],
                "standard_label": "NONE",
            },
            {
                "text": "The bathroom was dirty.",
                "tags": [{"class": "Cleanliness", "label": "NEG"}],
                "classes": ["Cleanliness", "Room", "Staff"],
                "standard_label": "NONE",
            }
        ]
    }
]

For training, classes can be missing, in which case a global list of all classes, specified in train(), will be used. standard_label is also allowed to be missing for training and can instead be supplied when calling train(). If no standard_label is given at all, classes is ignored as well, and no class/label pairs are generated other than those explicitly mentioned in tags.
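
The role of standard_label described above can be illustrated with a small sketch. expand_segment is a hypothetical function written for this illustration; it mirrors the documented behaviour and is not the library's implementation.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Tuple[str, str], str]  # ((segment text, class), label)


def expand_segment(
    segment: Dict,
    all_classes: Optional[List[str]] = None,
    standard_label: Optional[str] = None,
) -> List[Pair]:
    """Generate class/label pairs for one segment of a class-label task."""
    explicit = {tag["class"]: tag["label"] for tag in segment.get("tags", [])}
    pairs: List[Pair] = [
        ((segment["text"], cls), label) for cls, label in explicit.items()
    ]
    if standard_label is not None:
        # Per-segment "classes" take precedence over the global list.
        classes = segment.get("classes", all_classes or [])
        pairs += [
            ((segment["text"], cls), standard_label)
            for cls in classes
            if cls not in explicit
        ]
    return pairs


segment = {
    "text": "The bathroom was dirty.",
    "tags": [{"class": "Cleanliness", "label": "NEG"}],
}
with_standard = expand_segment(segment, ["Cleanliness", "Room", "Staff"], "NONE")
without_standard = expand_segment(segment, ["Cleanliness", "Room", "Staff"])
```

With a standard label, the unmentioned classes Room and Staff receive "NONE"; without one, only the explicitly tagged pair remains.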

DocumentModel

class autonlu.DocumentModel(model_folder, **kwargs)

A class, representing a segment classification model, built around the concept of documents

Currently only supports the classlabel task. Arguments are forwarded directly to the constructor of Model.

Parameters
  • model_folder (str) –

    A path or name of the model that should be used. Will be sent through get_model_dir() and can therefore be:

    • The path to a model

    • The name of a model available in Studio

    • The name of a model available in the Huggingface model repo

  • all_labels – A list of all labels the model should use. If None, the list of labels will be determined automatically from the training and evaluation data. Is mostly useful if a new model is to be trained and the training data does not contain an example of each possible label.

  • standard_label – The standard label to use for classlabel tasks. If the standard label is set, each class that is not explicitly mentioned for a sample of a classlabel task will be assumed to have this label. During prediction, classes that get assigned this label will also not be mentioned explicitly in the results.

  • key – A JSON web token which is used for authentication. If no key is given, the key is alternatively taken from the environment variable DO_PRODUCT_KEY.

  • state_callback

    Something callable (function or class with __call__ member function) taking one keyword argument progress, which is called with the current progress in percent after each batch. E.g.:

    >>> # Print current progress after each batch
    >>> def callback(progress):
    ...     print(f"Current progress = {progress}")
    

  • stop_callback

    Something callable (function or class with __call__ member function) taking no arguments. Is called after each batch and prediction or training is stopped if True is returned. E.g.:

    >>> # Stop after 10 batches
    >>> i = 0
    >>> def callback():
    ...     global i  # i lives at module level, so global (not nonlocal) is needed
    ...     i += 1
    ...     if i >= 10:
    ...         return True
    ...     return False
    

  • encrypt – If True, the model is encrypted on save.

  • device – Which device the model should be used on ("cpu" or "cuda"). If None, a device will be automatically selected: if a CUDA-capable GPU is available, it will be used, otherwise the CPU. This behavior can be overridden by setting the environment variable DO_DEVICE to either "cpu" or "cuda". autonlu.utils.get_best_device() is used to select the device.

  • log_dir – Specifies in which directory Tensorboard logs should be written. If None, no logs will be written. The logs for individual runs will be put into subdirectories named after the current timestamp.
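
Since state_callback and stop_callback accept any callable, including a class with a __call__ member function, state can be kept on an instance instead of in a module-level variable. The class names below are hypothetical and only sketch this idea.

```python
class ProgressRecorder:
    """Callable that stores every reported progress value."""

    def __init__(self):
        self.history = []

    def __call__(self, progress):
        # Called with the keyword argument "progress" after each batch.
        self.history.append(progress)


class StopAfterBatches:
    """Callable that requests a stop once max_batches calls have been made."""

    def __init__(self, max_batches):
        self.max_batches = max_batches
        self.seen = 0

    def __call__(self):
        self.seen += 1
        return self.seen >= self.max_batches


recorder = ProgressRecorder()
recorder(progress=12.5)
stopper = StopAfterBatches(max_batches=3)
decisions = [stopper() for _ in range(4)]
```

Instances could then be passed as state_callback=ProgressRecorder() and stop_callback=StopAfterBatches(10) when constructing the model.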

Variables

device – The device the model is running on ("cpu" or "cuda" (when running on a GPU))

predict(documents, **kwargs)

Annotates given documents with predicted results

Parameters
  • documents (List[Dict]) – A list of documents to annotate with predictions

  • classes_to_analyze – Used in case of a class or classlabel task. Specifies a list of classes that should be considered during prediction. If None, all classes (listed in self.all_classes) will be used.

  • batchsize – Number of samples to predict in one inference step. A higher value generally means higher throughput, but the system might run out of memory if this is set too high. If the GPU memory runs out, the system will automatically switch to smaller batch sizes without loss of data, but the switching takes time. Values higher than 128 (the default) usually do not increase performance by much.

  • verbose – If True, a progress bar is shown during prediction

Returns

Documents, annotated with detected classes and/or labels (see the full documentation for a description of the format)

Raises

ValueError – If the product key could not be authorized

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> m = DocumentModel(model_folder="DeepOpinion/hotels_absa_en")
>>> X = [{"segments": [{"text": "The room was very nice, but the staff was bad."}]}]
>>> res = m.predict(X)
res = [{'segments': [{'classes': ['Activities',
                'Ambiance',
                'Amenities',
                ...
                'WiFi'],
    'standard_label': 'NONE',
    'tags': [{'class': 'Room', 'label': 'POS'},
             {'class': 'Staff', 'label': 'NEG'}],
    'text': 'The room was very nice, but the staff was bad.'}]}]

train(documents, validation_documents, all_classes=None, metric_callback_constructor=None, **kwargs)

Trains a model with given, annotated documents.

Most arguments from Model.train() can also be used and are included in the following list of arguments

Parameters
  • documents (List[Dict]) – Documents to be used for training. For the exact format, have a look at the full documentation

  • validation_documents (List[Dict]) – Input samples used for validation of the model during training. E.g. for stopping training early if there is no progress anymore or to report the current score via the score_callback.

  • label_probabilities – A dictionary, mapping label names to the probability (number between 0 and 1) of that label being used for training. All labels not mentioned in label_probabilities are assumed to have a probability of 1. Can be used to subsample certain labels if they are overrepresented.

  • all_classes (Optional[List[str]]) – A list of all possible classes. If None, the list of possible classes will be determined automatically from the documents and validation_documents.

  • all_labels – A list of all possible labels. If None, the list of possible labels will be determined from documents and validation_documents. Can be set explicitly in cases where not all possible labels do occur in the training and validation set (e.g. because they will only be used in a later training session).

  • epochs – The maximum number of epochs used for training if do_early_stopping is True, otherwise the exact number of epochs to be trained. One epoch means that the system has seen each sample exactly once.

  • do_early_stopping – If True, early stopping will be used. I.e. the model will be tested on the validation data in regular intervals and training will be stopped if the model does not improve anymore. If False, a sensible amount of epochs should be specified.

  • seed – Fix the random seed to make training deterministic (i.e. with the same seed and the same input data in the same order, the resulting model should be identical). Warning! Setting a seed can slow down training.

  • learning_rate – The learning rate to be used during training. Higher learning rates will lead to faster convergence, but might lead to worse overall accuracy and if the learning rate is set too high, the system might not learn anything.

  • batchsize – The number of samples to use in one training step. This also sets the number of samples to accumulate for one weight update if the number is bigger than 32 (at a minimum, 32 samples are always accumulated). A higher value generally means higher throughput, but the system might run out of memory if this is set too high. In cases where the GPU memory is running out, the system will automatically switch to smaller batch sizes without loss of data. The wrong batch size might also inhibit proper training.

  • autobatchsize – Deprecated! This option should not be used anymore and will be removed. With the new dynamic batchsize lowering on CUDA memory error, this is not needed anymore! If True the batchsize will be determined automatically. If True, the parameter batchsize gives the maximal batchsize to use.

  • metric_callback

    Something callable (function or class with __call__ function) that takes two keyword arguments Y_true (containing true label numbers from the validation dataset) and Y_pred (containing the label numbers predicted by the currently trained model) and returns a metric, which will be passed as an argument to score_callback. Used to define the metric (e.g. accuracy) to use for the reported score. E.g.:

    >>> # Return accuracy as a metric
    >>> import numpy as np
    >>> def callback(Y_true, Y_pred):
    ...     return np.sum(Y_true == Y_pred) / len(Y_true)
    

  • score_callback

    Something callable (function or class with __call__ function) taking one keyword argument score that is filled with the output of metric_callback and evaluated in regular intervals during training. E.g.:

    >>> # Print current score
    >>> def callback(score):
    >>>     print(f"Current score = {score}")
    

  • mindatasetsize – Early stopping assumes the dataset size to be at least mindatasetsize. A large mindatasetsize in essence means that the patience for early stopping will be increased. Default is 70,000 to train small datasets longer, since this works better in practice. A value of 70,000 in essence means that datasets with fewer than 70,000 samples will be trained for as long as a dataset with 70,000 samples.

  • maxdatasetsize – Early stopping assumes the dataset size to be at most maxdatasetsize. A small maxdatasetsize in essence means that the patience for early stopping will be decreased. Default is 200,000 to train large datasets for a shorter time. A value of 200,000 in essence means that datasets with more than 200,000 samples will only be trained for as long as a dataset with 200,000 samples. This does NOT mean that only 200,000 of the samples will be used. All the data is still being utilized. This only influences at which point early stopping decides that a model does not improve anymore!

  • val_metric – The validation metric to use for the BestModelKeeper (i.e. which metric should be used to determine if a model is better than another one). Generally this should not be changed from val_loss.

  • val_maximize – If True, a higher value of val_metric is considered better, if False, a smaller value is considered better. Has to fit the specific metric in val_metric

  • verbose – If True, information about the training progress will be shown on the terminal.
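
The interplay of mindatasetsize and maxdatasetsize described above amounts to clamping the dataset size that early stopping works with. The function below is only a sketch of that clamping; effective_dataset_size is a hypothetical name, and how the library turns the resulting value into a patience is not covered here.

```python
def effective_dataset_size(
    n_samples: int,
    mindatasetsize: int = 70_000,
    maxdatasetsize: int = 200_000,
) -> int:
    """Return the dataset size assumed by early stopping (clamped to the bounds)."""
    return max(mindatasetsize, min(n_samples, maxdatasetsize))


small = effective_dataset_size(1_000)      # treated like 70,000 samples
medium = effective_dataset_size(100_000)   # unchanged
large = effective_dataset_size(500_000)    # treated like 200,000 samples
```

All samples are still used for training in every case; only the size assumed for early stopping changes.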

Raises

UnsupportedDocumentFormat – if the documents do not match the format expected for one of the three tasks

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> m = DocumentModel("albert-base-v2", standard_label="NONE")
>>> documents = [{
...         "segments": [
...             {
...                 "text": "The room was very nice, but the staff was bad.",
...                 "tags": [{"class": "Room", "label": "POS"}, {"class": "Staff", "label": "NEG"}],
...             },
...             {
...                 "text": "The bathroom was dirty.",
...                 "tags": [{"class": "Cleanliness", "label": "NEG"}],
...             }
...         ]
...     }]
>>> m.train(documents=documents, validation_documents=documents)

evaluate(documents, batchsize=128, classes_to_analyze=None)

Evaluates the model on already annotated documents and returns performance metrics

Parameters
  • documents (List[Dict]) – Annotated documents (in the same format as is expected for training)

  • batchsize (int) – Number of samples to predict in one inference step.

  • classes_to_analyze (Optional[List[str]]) – Classes to be analyzed for classlabel and class tasks. If None, all known classes will be used

Return type

Dict

Returns

A dictionary, containing all calculated evaluation metrics. Which metrics are returned depends on the task

Raises

UnsupportedDocumentFormat – if the documents do not match the format expected for one of the three tasks

save(model_dir)

Saves the current model.

If only a language model is present (meaning only finetuning was called), it will be saved in the appropriate format so it can be used as a base model for training of an actual task. A base model can also be loaded and finetuning can be continued.

Parameters

model_dir – The path where the model should be saved. If the folder does not exist yet, it will be created

Raises

autonlu.core.ModelSaveException – If saving the model fails

finetune(corpus_filename, batchsize=4, burnin_epochs=0.01, burnin_timelimit=None, burnin_lr=0.002, training_epochs=1, training_timelimit=None, training_lr=2e-05, lm_tasks=['NSP', 'combinedMLM'], loss_weights=[], length=500, teacher=None, verbose=False)

Performs language model fine tuning on a given text corpus. Only available for "classification" tasks (the standard value, if not set otherwise in the initialization).

This command will also automatically generate a TensorBoard log, visualizing the different losses over time. The logs are saved in a “runs” directory and can be displayed by using tensorboard --logdir=runs

Parameters
  • corpus_filename (str) – The text file to be used for language model fine tuning. This should be a standard text file where documents are separated by two new-lines.

  • batchsize (int) – The number of sequences to be used for one pass during fine tuning. The batchsize for the burn in phase is automatically four times higher. If multiple GPUs are being used, the batchsize is multiplied by the number of available GPUs. If the batch size is too big, the system will automatically halve the batch size until the batches fit on the GPUs without loss of data.

  • burnin_epochs (float) – Number of epochs to be used for the burn in phase. In the burn in phase, the language model is kept fixed and only the prediction heads are trained. This lets the whole system stabilize without messing up the actual language model. The number of epochs can be given as floating point numbers. When set to 1.0, on average, the whole text of the training corpus will have been seen once by the model. The number of burn in epochs should be selected so this phase takes around 10 minutes. More is usually not necessary.

  • burnin_timelimit (Optional[float]) – Number of seconds after which the burnin phase will be ended. If the number of epochs is reached before, the burnin phase will end earlier than that. If None, the burnin will proceed until the epochs are finished.

  • burnin_lr (float) – Learning rate to be used for the burnin phase

  • training_epochs (float) – Number of epochs to be used for language model finetuning. The number of epochs can be given as floating point numbers. When set to 1.0, on average, the whole text of the training corpus will have been seen once by the model.

  • training_timelimit (Optional[float]) – Number of seconds after which the training will be ended. If the number of epochs is reached before, the training will end earlier than that. If None, the training will proceed until the epochs are finished.

  • training_lr (float) – Learning rate to be used for the language model fine tuning

  • lm_tasks (List[str]) –

    Describes the tasks to be learned. Possible list elements are:

    • SO: Sentence Ordering

    • NSP: Next Sentence Prediction

    • SONSP: SO & NSP

    • prelabeled: uses a trainer to label sentences

    • prelabeled_words: uses a trainer to label sentences, where “sentences” are just consecutive words (i.e. not sentences in the grammatical sense)

    • soloMLM: independent Mask Language Model

    • combinedMLM: an MLM task which is trained together with the other tasks on the same data

  • loss_weights (List[float]) – Gives a particular weight to the losses of the lm_tasks. If empty, each loss has the weight 1

  • length (Union[int, List[int]]) – Determines the number of tokens per sentence in a batch. If length is a list of two integers, the number of tokens per sentence in a batch takes a random value between the two integers ([low, high]). If length is an integer, this is the number of tokens per sentence. Remark: Currently, for all lm_tasks except prelabeled, a “sentence” is just a sequence of consecutive words/tokens of a given length. For prelabeled, grammatical sentences are used. Here, the length is defined by the sentence itself.

  • teacher (Optional[LMTeacher]) – An instance of autonlu.finetuning.LMTeacher. Needed for the tasks prelabeled and prelabeled_words, where labels are provided by a teacher.

  • verbose (bool) – If True, progress bars with additional information will be shown during training
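
A corpus file in the layout expected by corpus_filename can be written with a few lines of Python. This is a sketch that assumes "two new-lines" means a single blank line between documents; write_corpus is a hypothetical helper, not part of autonlu.

```python
from pathlib import Path
from typing import Iterable


def write_corpus(documents: Iterable[str], path: str) -> None:
    """Write documents to a text file, separated by one blank line each."""
    cleaned = []
    for doc in documents:
        # Drop blank lines inside a document so they are not read as separators.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if lines:
            cleaned.append("\n".join(lines))
    Path(path).write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
```

The resulting path could then be passed as corpus_filename to finetune().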

select_to_label(X, classes_to_analyze=None, **kwargs)

Selects sentences that the current model would like as additional training data to maximally improve performance. Currently only supported for “classification” tasks.

All arguments from SimpleModel.select_to_label() can also be used and are included in the following list of arguments

Parameters
  • X – A list of segments or segment pairs the system can select to be added to the training data. Usually this is data that is available, but not yet labelled.

  • classes_to_analyze (Optional[List[str]]) – Used in case of a class or classlabel task. Specifies a list of classes that should be considered when selecting sentences for labeling. If None, the list of all known classes is used automatically. This can be useful if certain classes are underrepresented in the training data and we would like to concentrate our selection on those classes.

  • acquisitionsize – The number of samples the system should select. The higher the number, the more data can be labelled in one go. More iterations, with smaller acquisitionsizes will be able to learn more from fewer manually labelled samples though. Values from 50 to 100 are generally a good compromise.

  • modelsamples – How often different variants from the current model should be used to sample the given segments. Higher numbers will lead to more accurate results, but will also take more time.

  • al_samples – During selection of the requested segments, a probability distribution has to be approximated. al_samples specifies how many samples should be taken from this distribution as an approximation. Higher values lead to more accurate results, but the runtime increases.

  • preselectionsize – Especially when X is getting very big, the selection process can become slow. preselectionsize specifies how many samples should be pre-selected using a much faster method. Higher values lead to a better selection, but increase the runtime. If None, the preselectionsize is 10 * acquisitionsize

  • verbose – If True, information about the active learning process is shown, also shows progress bars

Returns

A tuple (samples_to_label, scores) where samples_to_label are the samples that the system would like to see labeled. In case of a class or classlabel task, the samples are (segment, class) tuples, and scores indicates how unsure the model was about the given samples. The score is not the only criterion used to select samples, so the scores are not necessarily monotonically decreasing.
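
Because the returned scores are not guaranteed to be ordered, a caller who wants to label the most uncertain samples first can sort the result. The following sketch works on plain lists shaped like the return value; rank_by_uncertainty is a hypothetical helper and the values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def rank_by_uncertainty(
    samples_to_label: Sequence, scores: Sequence[float]
) -> List[Tuple[object, float]]:
    """Pair samples with their scores and sort by descending score."""
    if len(samples_to_label) != len(scores):
        raise ValueError("samples_to_label and scores must have the same length")
    return sorted(zip(samples_to_label, scores), key=lambda pair: pair[1], reverse=True)


samples = [
    ("The room was horrible", "room"),
    ("We really enjoyed the stay", "satisfaction"),
]
ranked = rank_by_uncertainty(samples, [0.561, 1.34])
```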

Example:

>>> m = Model("DeepOpinion/hotels_absa_en")
>>> X = ["The room was horrible", "The food was quite nice", ...]
>>> samples_to_label, scores = m.select_to_label(X=X, acquisitionsize=2)
samples_to_label = [("The room was horrible", "room"), ("We really enjoyed the stay", "satisfaction")]
scores = [1.34, 0.561]

modeltype()

Returns which modeltype we have loaded.

Return type

str

Returns

One of "base", "label", "class", "classlabel", "token_classification", and "question_answering".

Helper Functions

autonlu.get_segment_class_pairs_doc(documents, all_classes)

Returns segment/class pairs from given documents and a list of all possible classes

Parameters
  • documents (List[Dict]) – Documents for which the segment/class pairs should be generated

  • all_classes (List[str]) – List of all possible classes that will be used to generate segment/class pairs.

Return type

List[Tuple[str, str]]

Returns

List of (segment, class) tuples
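
Conceptually, the result pairs every segment text with every class. The sketch below reproduces that idea in plain Python; it is an illustration under the assumption that only all_classes is used, not the library's implementation, and segment_class_pairs is a hypothetical name.

```python
from typing import Dict, List, Tuple


def segment_class_pairs(
    documents: List[Dict], all_classes: List[str]
) -> List[Tuple[str, str]]:
    """Pair each segment text with each class, in document order."""
    return [
        (segment["text"], cls)
        for doc in documents
        for segment in doc.get("segments", [])
        for cls in all_classes
    ]


pair_documents = [{"segments": [{"text": "The bathroom was dirty."}]}]
pairs = segment_class_pairs(pair_documents, ["Cleanliness", "Room"])
```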

autonlu.documents2simplemodelformat(documents, all_classes=None, standard_label=None, label_probabilities={})

Takes a list of documents and returns the format expected by SimpleModel

The different document formats for the three supported tasks are automatically accounted for

Parameters
  • documents (List[Dict]) – Documents that should be converted into the SimpleModel format

  • all_classes (Optional[List[str]]) – List of all possible classes. Only used in case the document does not give specific classes for each segment and only used for class and classlabel tasks.

  • standard_label (Optional[str]) – If given, all classes are assumed to have this label if no specific label was given in classlabels. To work, also needs all_classes. Only used in classlabel tasks.

  • label_probabilities (Dict[str, float]) – Dictionary mapping label names to the probability of that label occurring in the generated data. Only used for classlabel and label tasks.

Return type

Tuple[List[Tuple[str, str]], List[Union[str, int]]]

Returns

A tuple (X, Y) that can be directly passed to a SimpleModel

Raises

UnsupportedDocumentFormat – if the documents do not match the format expected for one of the three tasks
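
The effect of label_probabilities can be pictured as random subsampling of (X, Y) by label. The sketch below is an assumption about the general idea, not the library's code; subsample_by_label and its seed parameter are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def subsample_by_label(
    X: Sequence,
    Y: Sequence,
    label_probabilities: Dict[str, float],
    seed: Optional[int] = None,
) -> Tuple[List, List]:
    """Keep each sample with the probability configured for its label.

    Labels missing from label_probabilities are kept with probability 1.
    """
    rng = random.Random(seed)
    kept = [
        (x, y)
        for x, y in zip(X, Y)
        if rng.random() < label_probabilities.get(y, 1.0)
    ]
    return [x for x, _ in kept], [y for _, y in kept]


X = [("a", "Room"), ("b", "Room"), ("c", "Staff")]
Y = ["NONE", "POS", "NEG"]
X_kept, Y_kept = subsample_by_label(X, Y, {"NONE": 0.0})
```

A probability of 0.0 removes a label entirely, while values between 0 and 1 thin out an overrepresented label on average.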

autonlu.get_all_labels_from_documents(documents)

Returns a list of all labels that occur in tags of a list of documents

Parameters

documents (List[Dict]) – List of documents to extract the labels from

Return type

List[str]

Returns

An alphabetically sorted list of all labels found in the documents
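
The described behaviour can be sketched in a few lines of plain Python. labels_from_documents is a hypothetical stand-in written for illustration; it assumes labels are stored under the "label" key of each tag, as in the document formats above.

```python
from typing import Dict, List


def labels_from_documents(documents: List[Dict]) -> List[str]:
    """Return all labels found in the tags of the documents, sorted alphabetically."""
    labels = {
        tag["label"]
        for doc in documents
        for segment in doc.get("segments", [])
        for tag in segment.get("tags", [])
        if "label" in tag  # class-task tags carry no label
    }
    return sorted(labels)


label_documents = [
    {
        "segments": [
            {"text": "Nice room, bad staff.", "tags": [{"class": "Room", "label": "POS"}, {"class": "Staff", "label": "NEG"}]},
            {"text": "The bathroom was dirty.", "tags": [{"class": "Cleanliness", "label": "NEG"}]},
        ]
    }
]
found_labels = labels_from_documents(label_documents)
```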