Getting Started

When using AutoNLU in a script, or from Jupyter, it is recommended to use the autonlu.Model class, which supports all functionality and settings that autonlu.SimpleModel offers, but with a much easier and more comfortable interface. If you are integrating AutoNLU into an existing software system, you might want to have a look at autonlu.DocumentModel, which offers a document-centric API.

We will only give an overview of the most important features and use cases of AutoNLU in this document. For more in-depth information and executable examples, please have a look at the API reference and the tutorials section respectively.

A Minimal Working Example

Before being able to use AutoNLU at all, you will need to prove that you have a valid license. The easiest way to do this is to call autonlu.login(), but it can also be achieved by setting the environment variable DO_PRODUCT_KEY. See Before First Use for more information. A minimal working example might therefore look like this:

import autonlu
autonlu.login()  # Will prompt for user and password
model = autonlu.Model("DeepOpinion/hotels_absa_en")
model.predict(["The room was nice, but the staff was unfriendly"])
# Returns [[['Room', 'POS'], ['Staff', 'NEG']]]

Environment Variables

AutoNLU uses several environment variables to influence its behavior. They can be set with %env VARNAME=VALUE in a Jupyter notebook, or with os.environ["VARNAME"] = "VALUE" in scripts and notebooks; an example follows the list below.

  • DO_PRODUCT_KEY will be used as the bearer authentication token if no other key is provided on function calls. The token is needed to verify that the user has a valid license for AutoNLU and serves as login credentials for interacting with Studio (downloading models, downloading annotation data, …)

  • DO_LOGGING can be used to switch certain logging messages on or off. By default, all logging messages are generated (the default logging level is DEBUG). Other possible values are WARN and INFO (all the normal Python logging levels can be used). For day-to-day operation it is recommended to set this to WARN

  • DO_DEVICE can be used to overrule the automatic device selection of AutoNLU. In general, AutoNLU will use the GPU if one is found and the CPU otherwise. If a CUDA-capable GPU is present, but you want to use the CPU regardless, set this environment variable to cpu. Setting it to cuda when no CUDA-capable GPU is present will likely lead to errors and crashes

  • CUDA_VISIBLE_DEVICES is not an AutoNLU-specific environment variable, but changes which devices are visible to CUDA and therefore to AutoNLU. If this variable is not set, all available GPUs will be visible to CUDA, and all will be used by AutoNLU in parallel. If only certain GPUs should be used, set this variable to either a single number or a comma-separated list of device numbers. To see which GPUs are available on your system and what IDs they have, call nvidia-smi on the command line. E.g. export CUDA_VISIBLE_DEVICES=0 will force AutoNLU to only use the first GPU.
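For example, in a script these variables could be set before AutoNLU is used (the values shown are purely illustrative):

import os

# Set AutoNLU-related environment variables before the library is used
os.environ["DO_PRODUCT_KEY"] = "<your product key>"  # license / Studio authentication token
os.environ["DO_LOGGING"] = "WARN"                    # recommended level for day-to-day operation
os.environ["DO_DEVICE"] = "cpu"                      # force CPU even if a GPU is present
os.environ["CUDA_VISIBLE_DEVICES"] = "0"             # only expose the first GPU to CUDA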

Supported Tasks

AutoNLU currently supports three different text classification tasks:

  • Label task: You want to predict exactly one label for each piece of text. A typical use case would be sentiment analysis where you would like to assign one sentiment to each piece of text.

  • Class task: You want to predict an arbitrary number of classes (from a given list) for each piece of text. A typical use case would be topic detection, since a text can have none of the possible topics, but it can also have multiple topics.

  • ClassLabel task: You want to predict one label from a list of possible labels for each of a list of classes. A typical use case would be aspect based sentiment analysis, where we want to predict exactly one sentiment for a number of different aspects.
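The three tasks differ in the format of the prediction results and training targets Y. As a purely illustrative sketch (all class and label names are made up; the exact formats are described in Training a Model):

# Label task: exactly one label per text
Y_label = ["positive", "negative"]

# Class task: zero or more classes per text
Y_class = [["billing"], ["billing", "tech support"], []]

# ClassLabel task: one label for each of a number of classes, per text
Y_classlabel = [[["room", "POS"], ["staff", "NEG"]]]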

Loading Models

Models can be loaded by specifying which model to load when creating the autonlu.Model class. You can either specify a base model to be used as a starting point for training, or a model that was already trained on a specific task and can be used to predict new data. The following options are supported:

  1. A path to a folder that contains a valid huggingface or Studio model

  2. A name of a Studio model that your user account has access to. You can get a list of models that are available to you with autonlu.list_models()

  3. A name of a base-model that is available on the Huggingface model store (https://huggingface.co/models)

Be aware that the model is only loaded once you call predict, train, or finetune on it.

Examples

  • Path to model: model = autonlu.Model("/path/to/my/own/model")

  • Name of Studio model: model = autonlu.Model("DeepOpinion/hotels_absa_en")

  • Name of huggingface model: model = autonlu.Model("albert-base-v2")

Predicting Data

Once an already trained model has been loaded, or a model has been trained, new texts can be predicted using the autonlu.Model.predict() method. A model automatically knows which task it was trained on and returns its results in the same format that is used as the training target Y for that task.

X = ["The room was very clean.",
     "The food was good, but the guys at the reception were bad."]
model = autonlu.Model("DeepOpinion/hotels_absa_en")
res = model.predict(X)
# res == [[['Cleanliness', 'POS'], ['Room', 'POS']],
#         [['Food', 'POS'], ['Reception', 'NEG'], ['Staff', 'NEG']]]

To show a progress bar during prediction, the option verbose=True can be passed to predict (and to almost all other methods of autonlu.Model).

If you would like to explicitly know the type of a model, you can use the autonlu.check_model.modeltype() function.
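For example (a sketch; the exact argument expected by autonlu.check_model.modeltype() is an assumption, see the API reference):

res = model.predict(X, verbose=True)  # shows a progress bar during prediction

# Query which task a stored model was trained for
task = autonlu.check_model.modeltype("DeepOpinion/hotels_absa_en")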

Training a Model

A model can be trained for a specific task by loading a base model and calling autonlu.Model.train(). The train command takes at minimum two parameters (X and Y), where X is a list of texts to train on and Y is the training target.

The training target can have three different formats, depending on what task (label-, class-, or classlabel-task) you want to solve:

Label Task

You have a label task if you want to predict exactly one label for each piece of text. A typical use case would be sentiment analysis where you would like to assign one sentiment to each text.

You are training for a label task if the given Y consists of a list of strings, giving one label name for each piece of text in X.

from autonlu import Model

model = Model("albert-base-v2")
X = ["This was bad.", "This was great!"]
Y = ["negative", "positive"]
model.train(X, Y)
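Since predict returns results in the same format as the training target, a label task yields one label per text (output is illustrative):

model.predict(["Absolutely terrible."])
# e.g. ["negative"]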

Class Task

You have a class task if you want to predict an arbitrary number of classes (from a given list) for each piece of text. A typical use case would be topic detection. A text can have none of the possible topics, but it can also have multiple topics.

You are training for a class task if the given Y consists of a list of lists of strings, giving a list of class names for each piece of text in X.

from autonlu import Model

model = Model("albert-base-v2")
X = ["I want a refund!",
     "The bill I got is not correct and I also have technical issues",
     "All good"]
Y = [["billing"],
     ["billing", "tech support"],
     []]
model.train(X, Y)

Class Label Task

You have a class label task if you want to predict one label from a list of possible labels for each of a list of classes for each piece of text. A typical use case would be aspect based sentiment analysis, where we want to predict a sentiment for a number of different aspects.

You are training for a class label task if the given Y consists of a list of lists of [class, label] pairs, one list of pairs for each text in X.

from autonlu import Model

model = Model("albert-base-v2")
X = ["The room was nice.",
     "The food was great, but the staff was unfriendly.",
     "The room was horrible, but the waiters were welcoming"]
Y = [[["room", "POS"], ["food", "NONE"], ["staff", "NONE"]],
     [["room", "NONE"], ["food", "POS"], ["staff", "NEG"]],
     [["room", "NEG"], ["food", "NONE"], ["staff", "POS"]]]
model.train(X, Y)

Since it is very often the case that a certain label should be selected if a class is not mentioned at all (e.g. the NONE label in the previous example), you can specify a standard_label when creating an autonlu.Model. This standard label will be used for all classes that are not explicitly mentioned in Y. In addition, during inference, predictions with this label will be omitted from the result. With a standard label, the previous example can be rewritten as follows:

from autonlu import Model

model = Model("albert-base-v2", standard_label="NONE")
X = ["The room was nice.",
     "The food was great, but the staff was unfriendly.",
     "The room was horrible, but the waiters were welcoming"]
Y = [[["room", "POS"]],
     [["food", "POS"], ["staff", "NEG"]],
     [["room", "NEG"], ["staff", "POS"]]]
model.train(X, Y)
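Because predictions with the standard label are omitted, inference output only contains the explicitly predicted labels (output is illustrative):

model.predict(["The room was nice."])
# e.g. [[["room", "POS"]]] (classes predicted as "NONE" do not appear)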

Training and Evaluation Set

By default, AutoNLU measures the current performance of the trained model on an evaluation dataset at regular intervals and uses this information to decide when to stop training and which model should be used in the end. For this, the system needs an evaluation dataset containing data that is not used for training. If only X and Y are given, AutoNLU randomly splits off 10% of the training data for evaluating the model.

If you would like more control, you can manually specify an evaluation dataset using the valX and valY parameters of the train method.

To split a dataset, the function autonlu.split_dataset() can be used. Given the same input data, the dataset is guaranteed to always be split in the same way. If both an evaluation and a test set are needed, the function can be called multiple times, as shown below.

import autonlu
from autonlu import Model

X = [...]  # training texts
Y = [...]  # training targets
X, Y, valX, valY = autonlu.split_dataset(X, Y, split_at=0.1)
model = Model("albert-base-v2")
model.train(X, Y, valX, valY)
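If an additional test set is needed, the function can simply be applied twice (a sketch using the same split_at parameter as above):

# Split off a test set first, then an evaluation set from the remainder
X, Y, testX, testY = autonlu.split_dataset(X, Y, split_at=0.1)
X, Y, valX, valY = autonlu.split_dataset(X, Y, split_at=0.1)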

Visualization of the Training Process

During training, AutoNLU automatically produces tensorboard logs that can be visualized. In a terminal, go to the directory you started your script/notebook from. You should see a directory called tensorboard_logs, containing the tensorboard logs. Run tensorboard --logdir=tensorboard_logs and you will be given a url that you can open in your browser (usually http://localhost:6006/). While the model is training, you will be able to see different metrics and how they change over time (e.g. training loss, validation loss and accuracy, used GPU memory, …).

[Screenshot: Tensorboard during training]

Language Model Fine Tuning

If there is enough text data available from the domain (or a closely related domain) of the task to be solved, an existing language model can be fine tuned further on this data. The data must be provided as a simple text file in which individual documents (e.g. reviews) are separated by two newlines.

A model can be fine tuned on this text file using the autonlu.Model.finetune() method. Fine tuning proceeds in multiple steps:

  1. The given text file is tokenized and a .tokens file is produced in the same directory as the original text file

  2. In a burn-in phase, the language model is kept fixed and the prediction heads for the different language model losses are trained on the new data to initialize them and adapt them to the new domain. This phase is usually relatively short (a few minutes to half an hour)

  3. The actual language model is trained. This phase can take quite a long time if a lot of training data is available (up to multiple days)

The duration of the burn-in phase and of the language model training can be controlled via a number of epochs (one epoch means that, on average, each part of the text has been seen by the model once). The number of epochs can be a decimal value: 0.1 epochs, for example, means that on average around 10% of the training text will have been seen by the model.

Alternatively, one can specify a maximal time the model should be trained. This can be especially useful when using large text corpora and the time for even one epoch would exceed what is feasible.
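A minimal sketch of such a call is shown below; the file name is hypothetical and the parameter name epochs is an assumption, so check the documentation of autonlu.Model.finetune() for the actual signature:

from autonlu import Model

model = Model("albert-base-v2")
# Fine tune on a domain corpus; documents in the file are separated by two newlines
model.finetune("domain_reviews.txt", epochs=1.0)  # epochs may be fractional, e.g. 0.1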

Visualization of the Finetuning Progress

Tensorboard logs are also written during fine tuning, with the difference that they are saved in the directory runs. You will be able to see how the language modeling loss changes over time. As long as the loss in the graph (especially the Cumulative_Loss) is decreasing, the language model is probably still improving (provided the amount of text is sufficient and the model does not overfit).

[Screenshot: Tensorboard during language model fine tuning]

Active Learning

AutoNLU supports a process called active learning. Using active learning, you can ask an already trained model to select the pieces of text that would provide the maximal amount of information to the system if labels for them were available.

The general workflow using active learning looks like this:
  1. Train a model on existing training data

  2. Use the trained model to select data that should be labeled from a corpus of currently unlabeled data

  3. Label the selected data and add it to the training dataset

  4. Repeat steps 2 and 3 until the model becomes good enough

Active learning is supported via the autonlu.Model.select_to_label() method. Have a look at the docstring of this method for a more in-depth explanation of all the parameters.
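A sketch of what step 2 could look like (the argument and return format are assumptions; see the docstring of autonlu.Model.select_to_label() for the actual interface):

# unlabeled: a list of texts that have not been labeled yet (hypothetical variable)
selected = model.select_to_label(unlabeled)
# Label the selected texts, add them to the training data, and retrain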

Saving a Model

A model can be saved at any time using its save method. If a model was only fine tuned, the saved model will be a base model that can be used to train a variety of tasks. If a model is saved after training, a task-specific model will be saved, which can be trained further on the same task or used to predict data.
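For example (assuming save takes a target directory; see the API reference):

model.save("my_model_directory")
# The saved model can later be loaded again with Model("my_model_directory")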

Getting Annotations from Studio

Sometimes one would like to train a model using AutoNLU, testing different settings and base models, but use Studio for labeling the data.

To get annotations from Studio, you have to know the id of the project that contains the annotations. The easiest way to find out the project id is to click on your custom project in Studio and look at the url. It will in part look like http://studio.deepopinion.ai/projects/33/, and the number after /projects/ is the project id.

You can get a list of all available annotations for this project by calling autonlu.list_annotations() and giving the project id as an argument.

To get annotated data from Studio, use the autonlu.get_annotations() function and pass the project id. The data is returned in a format that can directly be used to train an autonlu.Model.

If you would like to exclude some annotations, you can pass a list of the ids to exclude with the exclude parameter. The annotation ids can be determined using autonlu.list_annotations().
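Putting these pieces together, a possible workflow could look like this (a sketch; the exact signatures of autonlu.list_annotations() and autonlu.get_annotations() are assumptions, see the API reference):

import autonlu
from autonlu import Model

autonlu.login()
print(autonlu.list_annotations(33))  # inspect the annotations available for project 33
X, Y = autonlu.get_annotations(33)   # returned in a format Model.train() accepts directly
model = Model("albert-base-v2")
model.train(X, Y)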