Token classification tasks

In this tutorial, we look at token classification, where each word can carry its own label. As a demonstration, we teach a rudimentary model to identify the words in a given text that refer to persons or locations. In the literature, this task is also known as named entity recognition (NER).

Markup language

Token classification places new demands on the representation of our data, which we meet by using a form of markup language. Let us look at an example: “Today <person>Mary Louise</person> flies to <location>Vienna</location>.” Here, “Mary Louise” is marked as a person and “Vienna” is marked as a location by placing them between a start tag <{label_name}> and an end tag </{label_name}>. For this to work properly, every “<” or “>” in the original text has to be escaped (i.e. replaced by “\<” and “\>”, respectively).

Be aware that nested labels are not allowed: a start tag must be closed by its end tag before the next start tag appears.
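The two rules above (matching start and end tags, no nesting) can be checked before training with a few lines of plain Python. The following is only a minimal sketch: the helper name check_markup is hypothetical and not part of autonlu, and it assumes that label names consist of word characters only.

```python
import re

# Matches a start or end tag whose "<" is not escaped by a preceding backslash.
# Assumption: label names consist of word characters (letters, digits, "_").
TAG_PATTERN = re.compile(r"(?<!\\)<(/?)(\w+)>")


def check_markup(text):
    """Raise ValueError if tags are nested, mismatched, or left unclosed."""
    open_label = None  # label of the currently open tag, if any
    for match in TAG_PATTERN.finditer(text):
        is_end_tag = match.group(1) == "/"
        label = match.group(2)
        if not is_end_tag:
            if open_label is not None:
                raise ValueError(f"<{label}> is nested inside <{open_label}>")
            open_label = label
        else:
            if open_label != label:
                raise ValueError(f"unexpected end tag </{label}>")
            open_label = None
    if open_label is not None:
        raise ValueError(f"<{open_label}> is never closed")
    return True


check_markup("Today <person>Mary Louise</person> flies to <location>Vienna</location>.")
```

A sentence such as "<person>Mary <location>Vienna</location></person>" would raise a ValueError here, while escaped characters such as “\<” are skipped by the pattern.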

Import model

[ ]:
from autonlu import Model

Initialize the model for token classification

We have to tell the model that we intend to do token classification by passing the argument task="token_classification" to the constructor of Model. (Remark: the default value for task is "classification", which is used if task is not set explicitly. This default covers all “tasks” shown in the previous tutorials: label, class, and classlabel tasks.)

[7]:
model = Model(model_folder="albert-base-v2", task="token_classification")

Prepare data and train the model

For this demonstration, we restrict ourselves to a very simple model, which we train on only four sentences in markup language, collected in a list. Because the training set is tiny, we reduce nb_opti_steps to 30 so that training finishes quickly.

[ ]:
Xtrain = ["Today <person>Mary Louise</person> flies to <location>Vienna</location>.",
          "<person>Anna</person> loves the volcanos in <location>Iceland</location>.",
          "<location>London</location> was visited by <person>Tom</person>.",
          "<person>John Doe</person> lives in <location>Germany</location>."]
model.train(X=Xtrain, do_evaluation=False, learning_rate=1e-3, nb_opti_steps=30)

Test the model

We test our simple model with a single sentence. For prediction, we can use either plain text or markup language text with label information (which is ignored for prediction). For evaluation, the correct labels are needed, so the samples must be given in markup language.

[4]:
Xtest_predict = ["Yesterday, George Miller came back from Denver."]
Xtest_eval = ["Yesterday, <person>George Miller</person> came back from <location>Denver</location>."]

Prediction

[10]:
prediction = model.predict(Xtest_predict)
print(prediction[0])
Yesterday, <person>George Miller</person> came back from <location>Denver</location>.
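Since the prediction is returned as a markup string, it is often convenient to turn it into a list of (label, text) pairs. The following is a minimal sketch using only the standard library: extract_entities is a hypothetical helper, not an autonlu function, and it assumes that label names consist of word characters only.

```python
import re

# A start tag, the shortest possible content, and the matching end tag.
# The lookbehinds skip "<" characters that are escaped with a backslash.
ENTITY_PATTERN = re.compile(r"(?<!\\)<(\w+)>(.*?)(?<!\\)</\1>", re.DOTALL)


def extract_entities(markup_text):
    """Return a list of (label, text) pairs found in a markup string."""
    return [(m.group(1), m.group(2)) for m in ENTITY_PATTERN.finditer(markup_text)]


prediction_text = "Yesterday, <person>George Miller</person> came back from <location>Denver</location>."
print(extract_entities(prediction_text))
# [('person', 'George Miller'), ('location', 'Denver')]
```

Escaped characters inside an entity (e.g. “\<”) are returned as they appear in the markup, so they still have to be unescaped if the original text is needed.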

Evaluation

[12]:
result_dict = model.evaluate(Xtest_eval)
print(result_dict)
{'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}

Final reminder

In our artificial data, the characters “<” and “>” did not appear. In real data, however, they might. As mentioned above, these two characters have to be replaced by “\<” and “\>”. In an ordinary Python string literal, a single backslash is written as “\\”, so the replacement strings are written as “\\<” and “\\>”. This replacement has to be done before we set the markup tags, because the “<” and “>” of the markup tags themselves must not be escaped.
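The escape-then-tag order can be sketched as follows. This is an illustration under assumptions, not autonlu code: escape_brackets and to_markup are hypothetical helpers, and the entities are assumed to be given as (start, end, label) character spans into the unescaped text. How backslashes that already occur in the raw text should be treated is not covered here.

```python
def escape_brackets(text):
    # Replace "<" by "\<" and ">" by "\>"; in Python source these
    # replacement strings are written "\\<" and "\\>".
    return text.replace("<", "\\<").replace(">", "\\>")


def to_markup(text, spans):
    """Build a markup string from plain text and (start, end, label) spans.

    Spans are character offsets into the unescaped text, end exclusive.
    Each piece of text is escaped first; the tags are added afterwards.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or end <= start or end > len(text):
            raise ValueError("spans must be non-empty, in range, and must not overlap")
        pieces.append(escape_brackets(text[cursor:start]))
        pieces.append(f"<{label}>{escape_brackets(text[start:end])}</{label}>")
        cursor = end
    pieces.append(escape_brackets(text[cursor:]))
    return "".join(pieces)


raw = "Mary flies to Vienna if x < 3."
city_start = raw.index("Vienna")
print(to_markup(raw, [(0, 4, "person"), (city_start, city_start + 6, "location")]))
# <person>Mary</person> flies to <location>Vienna</location> if x \< 3.
```

Escaping each piece before the tags are added keeps the tag brackets intact, and rejecting overlapping spans keeps the result free of nested labels.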