{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Token classification tasks\n", "In this tutorial, we will have a look at `token classification`, where **each word** can have a label. As a demonstration, we will teach a rudimentary model to identify all words which are *persons* or *locations* in a given text. In the literature, such a task is also referred to as `named entity recognition` (NER).\n", "\n", "### Markup language\n", "Token classification places new demands on the representation of our data, which we meet by using a simple `markup language`. Let us look at an example:\n", "\"*Today \\<person\\>Mary Louise\\</person\\> flies to \\<location\\>Vienna\\</location\\>.*\"\n", "Here, \"*Mary Louise*\" is marked as a **person** and \"*Vienna*\" is marked as a **location** by placing them between a start tag `<{label_name}>` and an end tag `</{label_name}>`. For this to work properly, every \"<\" or \">\" in the original text has to be escaped (i.e. replaced by \"\\\\<\" and \"\\\\>\").\n", "\n", "Be aware that nested labels are not allowed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from autonlu import Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the model for token classification\n", "We have to tell the model that we intend to do token classification, so we pass the argument `task=\"token_classification\"` to the constructor of `Model`. (Remark: the default value of `task` is `\"classification\"`, which is used when the argument is not set explicitly; it corresponds to all tasks shown in the previous tutorials, i.e. label, class, and classlabel tasks.)"
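,
"\n",
"As an aside, markup language samples like the ones used below can also be built programmatically, with the escaping applied *before* the tags are inserted. A minimal sketch in plain Python (the helpers `escape_markup` and `tag` are our own illustration, not part of `autonlu`):\n",
"```python\n",
"def escape_markup(text):\n",
"    # Escape characters that would otherwise be read as tag delimiters\n",
"    return text.replace(\"<\", \"\\\\<\").replace(\">\", \"\\\\>\")\n",
"\n",
"def tag(text, label):\n",
"    # Wrap an (already escaped) span in <label>...</label> markup tags\n",
"    return f\"<{label}>{escape_markup(text)}</{label}>\"\n",
"\n",
"sample = escape_markup(\"Today \") + tag(\"Mary Louise\", \"person\") + escape_markup(\" flies to \") + tag(\"Vienna\", \"location\") + escape_markup(\".\")\n",
"```"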
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = Model(model_folder=\"albert-base-v2\", task=\"token_classification\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare data and train the model\n", "For this demonstration, we restrict ourselves to a very simple model, which we train with **only 4** markup language sentences collected in a list. For our tiny training set, we reduce the value of `nb_opti_steps` to 30 to finish the training quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Xtrain = [\"Today <person>Mary Louise</person> flies to <location>Vienna</location>.\",\n", "          \"<person>Anna</person> loves the volcanos in <location>Iceland</location>.\",\n", "          \"<location>London</location> was visited by <person>Tom</person>.\",\n", "          \"<person>John Doe</person> lives in <location>Germany</location>.\"]\n", "model.train(X=Xtrain, do_evaluation=False, learning_rate=1e-3, nb_opti_steps=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test the model\n", "We test our simple model with a single sentence. For prediction, we can use either a plain text or a markup language text with label information (which is ignored during prediction). For evaluation, the correct label information is needed, hence samples in markup language are required.
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "Xtest_predict = [\"Yesterday, George Miller came back from Denver.\"]\n", "Xtest_eval = [\"Yesterday, <person>George Miller</person> came back from <location>Denver</location>.\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yesterday, <person>George Miller</person> came back from <location>Denver</location>.\n" ] } ], "source": [ "prediction = model.predict(Xtest_predict)\n", "print(prediction[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}\n" ] } ], "source": [ "result_dict = model.evaluate(Xtest_eval)\n", "print(result_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final reminder\n", "The characters \"<\" and \">\" did not appear in our artificial data, but they might well appear in real data. As mentioned above, these two characters have to be replaced by \"\\\\<\" and \"\\\\>\". In Python, a simple \"\\\\\" is represented by \"\\\\\\\\\", so the correct replacement is \"\\\\\\\\<\" and \"\\\\\\\\>\". Note that this replacement has to be done before the markup tags are set (the \"<\" and \">\" of the markup tags themselves must not be escaped)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 5 }