{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Token classification tasks\n", "In this tutorial, we will have a look at `token classification`, where **each word** can have a label. As a demonstration, we will teach a rudimentary model to identify all words which are *persons* or *locations* in a given text. In the literature, such a task is also referred to as `named entity recognition` (NER).\n", "\n", "### Markup language\n", "Token classification places new demands on the representation of our data, which we meet by using a simple `markup language`. Let us look at an example:\n", "\"*Today \\<person\\>Mary Louise\\</person\\> flies to \\<location\\>Vienna\\</location\\>.*\"\n", "Here, \"*Mary Louise*\" is marked as a **person** and \"*Vienna*\" is marked as a **location** by placing them between a start tag `<{label_name}>` and an end tag `</{label_name}>`. For this to work properly, every \"<\" or \">\" in the original text has to be escaped (i.e. replaced by \"\\\\<\" and \"\\\\>\").\n", "\n", "Be aware that nested labels are not allowed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from autonlu import Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the model for token classification\n", "We have to tell the model that we intend to do token classification, so we pass the argument `task=\"token_classification\"` to the constructor of `Model`. (Remark: the default value of `task` is `\"classification\"`, which is used when the argument is not set explicitly; it corresponds to all tasks shown in the previous tutorials, i.e. label, class, and classlabel tasks.)"
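,
"\n",
"As an aside, markup language samples like the ones used below can also be built programmatically, with the escaping applied *before* the tags are inserted. A minimal sketch in plain Python (the helpers `escape_markup` and `tag` are our own illustration, not part of `autonlu`):\n",
"```python\n",
"def escape_markup(text):\n",
"    # Escape characters that would otherwise be read as tag delimiters\n",
"    return text.replace(\"<\", \"\\\\<\").replace(\">\", \"\\\\>\")\n",
"\n",
"def tag(text, label):\n",
"    # Wrap an (already escaped) span in <label>...</label> markup tags\n",
"    return f\"<{label}>{escape_markup(text)}</{label}>\"\n",
"\n",
"sample = escape_markup(\"Today \") + tag(\"Mary Louise\", \"person\") + escape_markup(\" flies to \") + tag(\"Vienna\", \"location\") + escape_markup(\".\")\n",
"```"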
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = Model(model_folder=\"albert-base-v2\", task=\"token_classification\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare data and train the model\n", "For this demonstration, we restrict ourselves to a very simple model, which we train with **only 4** markup language sentences collected in a list. For our tiny training set, we reduce the value of `nb_opti_steps` to 30 to finish the training quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Xtrain = [\"Today <person>Mary Louise</person> flies to <location>Vienna</location>.\",\n", "          \"<person>Anna</person> loves the volcanos in <location>Iceland</location>.\",\n", "          \"<location>London</location> was visited by <person>Tom</person>.\",\n", "          \"<person>John Doe</person> lives in <location>Germany</location>.\"]\n", "model.train(X=Xtrain, do_evaluation=False, learning_rate=1e-3, nb_opti_steps=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test the model\n", "We test our simple model with a single sentence. For prediction, we can use either a plain text or a markup language text with label information (which is ignored during prediction). For evaluation, the correct label information is needed, hence samples in markup language are required.
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "Xtest_predict = [\"Yesterday, George Miller came back from Denver.\"]\n", "Xtest_eval = [\"Yesterday, <person>George Miller</person> came back from <location>Denver</location>.\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yesterday, <person>George Miller</person> came back from <location>Denver</location>.\n" ] } ], "source": [ "prediction = model.predict(Xtest_predict)\n", "print(prediction[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}\n" ] } ], "source": [ "result_dict = model.evaluate(Xtest_eval)\n", "print(result_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final reminder\n", "The characters \"<\" and \">\" did not appear in our artificial data, but they might well appear in real data. As mentioned above, these two characters have to be replaced by \"\\\\<\" and \"\\\\>\". In Python, a simple \"\\\\\" is represented by \"\\\\\\\\\", so the correct replacement is \"\\\\\\\\<\" and \"\\\\\\\\>\". Note that this replacement has to be done before the markup tags are set (the \"<\" and \">\" of the markup tags themselves must not be escaped)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 5 }