In this tutorial, we will have a look at token classification, where each word can have a label. As a demonstration, we will teach a rudimentary model to identify all words which are persons or locations in a given text. In the literature, such a task is also referred to as named entity recognition (NER).

## Markup language

For token classification, we face new demands on the representation of our data, which we meet by using a form of markup language. Let us look at an example: “Today <person>Mary Louise</person> flies to <location>Vienna</location>.” Here, “Mary Louise” is marked as a person and “Vienna” as a location by placing them between a start tag <{label_name}> and an end tag </{label_name}>. For this to work properly, every literal “<” or “>” in the original text has to be escaped (i.e. replaced by “\<” and “\>”).

Be aware that nested labels (a tag placed inside another tag) are not allowed.
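Markup sentences can of course be written by hand, but if your annotations come as character spans, a small helper can build the markup text. A minimal sketch (`add_tags` is a hypothetical helper, not part of autonlu; it assumes non-overlapping spans and text that contains no literal “<” or “>”):

```python
def add_tags(text: str, spans) -> str:
    """Wrap character spans of `text` in <label>...</label> markup tags.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    Assumes the spans do not overlap (nested labels are not allowed)
    and that `text` contains no literal '<' or '>'.
    """
    # Insert from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:end] + f"</{label}>" + text[end:]
        text = text[:start] + f"<{label}>" + text[start:]
    return text

sentence = "Today Mary Louise flies to Vienna."
print(add_tags(sentence, [(6, 17, "person"), (27, 33, "location")]))
# Today <person>Mary Louise</person> flies to <location>Vienna</location>.
```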

## Import model

[ ]:

from autonlu import Model


## Initialize the model for token classification

We have to tell the model that we intend to do token classification by passing the argument task="token_classification" to the constructor of Model. (Remark: the default value of task is "classification", which is used if the argument is not set explicitly; it corresponds to all the “tasks” shown in the previous tutorials, i.e. label, class, and classlabel tasks.)

[7]:

model = Model(model_folder="albert-base-v2", task="token_classification")


## Prepare data and train the model

For this demonstration, we restrict ourselves to a very simple model, which we train on only four markup language sentences collected in a list. For such a tiny training set, we reduce nb_opti_steps to 30 so that training finishes quickly.

[ ]:

Xtrain = ["Today <person>Mary Louise</person> flies to <location>Vienna</location>.",
          "<person>Anna</person> loves the volcanos in <location>Iceland</location>.",
          "<location>London</location> was visited by <person>Tom</person>.",
          "<person>John Doe</person> lives in <location>Germany</location>."]
model.train(X=Xtrain, do_evaluation=False, learning_rate=1e-3, nb_opti_steps=30)


## Test the model

We test our simple model with a single sentence. For prediction, we can use either plain text or markup language text with label information (the labels are ignored during prediction). For evaluation, the correct label information is needed, so samples in markup language are required.

[4]:

Xtest_predict = ["Yesterday, George Miller came back from Denver."]
Xtest_eval = ["Yesterday, <person>George Miller</person> came back from <location>Denver</location>."]
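If you only have markup language samples, the plain text version for prediction can be recovered by stripping the tags. A minimal sketch (`strip_tags` is a hypothetical helper, not part of autonlu; it assumes literal angle brackets in the text are escaped as “\<” and “\>”):

```python
import re

def strip_tags(markup: str) -> str:
    """Remove <label> and </label> markup tags, keeping the text.

    The negative lookbehind skips escaped brackets ("\<"), so only
    real markup tags are removed.
    """
    return re.sub(r"(?<!\\)</?\w+>", "", markup)

print(strip_tags("Yesterday, <person>George Miller</person> came back "
                 "from <location>Denver</location>."))
# Yesterday, George Miller came back from Denver.
```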


## Prediction

[10]:

prediction = model.predict(Xtest_predict)
print(prediction[0])

Yesterday, <person>George Miller</person> came back from <location>Denver</location>.


## Evaluation

[12]:

result_dict = model.evaluate(Xtest_eval)
print(result_dict)

{'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}


## Final reminder

In our artificial data, the characters “<” and “>” did not appear, but in real data they might. As mentioned above, these two characters have to be replaced by “\<” and “\>”. In a Python string literal, a single backslash is written as “\\”, so the correct replacements are “\\<” and “\\>”. Obviously, this replacement has to be done before the markup tags are inserted (the “<” and “>” of the markup tags themselves must not be escaped).
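A minimal escaping helper could look as follows (`escape_markup` is a hypothetical name, not part of autonlu):

```python
def escape_markup(text: str) -> str:
    """Escape literal '<' and '>' so they are not parsed as markup tags.

    Apply this to the raw text BEFORE inserting any <label> tags,
    since the tags themselves must keep unescaped angle brackets.
    """
    # In a Python string literal a single backslash is written as "\\",
    # so "\\<" denotes the two characters backslash and '<'.
    return text.replace("<", "\\<").replace(">", "\\>")

print(escape_markup("The result of 3 < 5 is True."))
# The result of 3 \< 5 is True.
```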