Question answering¶

In this tutorial, we will have a look at the question answering task. As for token classification tasks, in question answering the data is given in the form of markup text, as explained in the previous tutorial about token classification.

General structure¶

In question answering, each sample is a tuple of two strings. The first string is the question and the second string is a given context in which the answer might be found as a small text passage. For example, your context might be the entire Wikipedia article about the Sun King Louis XIV and you ask the question: "When was the Sun King born?". This question should be answerable with a small passage of text from the longer article. The question answering model will return the given context with additional markup language tags around his birth date: “… Louis XIV (Louis Dieudonné; <answer>5 September 1638</answer> – 1 September 1715), also known as Louis the Great …”.

Alas, training such a full question answering model requires quite a lot of data and time. So, instead of training a model for the standard question answering task, we demonstrate the versatility of this approach by training an alternative, simpler task. Instead of training the model with questions, we can train the model with simple commands. We will train our model with the three simple commands “Find persons.”, “Find locations.” and “Find animals.”

[ ]:

from autonlu import Model

Initialize the model for question answering¶

In case of question answering, we have to set the argument task="question_answering" in the constructor of Model.

[8]:

model = Model(model_folder="albert-base-v2", task="question_answering")

Prepare data and train the model¶

For our demonstration, we train the model with 15 samples, which are based on 5 artificial context sentences and three commands (questions). To speed up the training, nb_opti_steps is set to 100.

[9]:

Xtrain = [("Find persons.", "In Egypt <answer>Tom Smith</answer> saw dolphins."),
          ("Find persons.", "<answer>John Doe</answer> has an eagle from Germany."),
          ("Find persons.", "Today, <answer>Mary Louise</answer> flies to Vienna."),
          ("Find persons.", "<answer>Anna</answer> loves the volcanos in Iceland. She also loves horses."),
          ("Find persons.", "<answer>Mike</answer> lives with two dogs in a small flat in Berlin."),

          ("Find locations.", "In <answer>Egypt</answer> Tom Smith saw dolphins."),
          ("Find locations.", "John Doe has an eagle from <answer>Germany</answer>."),
          ("Find locations.", "Today, Mary Louise flies to <answer>Vienna</answer>."),
          ("Find locations.", "Anna loves the volcanos in <answer>Iceland</answer>. She also loves horses."),
          ("Find locations.", "Mike lives with two dogs in a small flat in <answer>Berlin</answer>."),

          ("Find animals.", "In Egypt Tom Smith saw <answer>dolphins</answer>."),
          ("Find animals.", "John Doe has an <answer>eagle</answer> from Germany."),
          ("Find animals.", "Today, Mary Louise flies to Vienna."),
          ("Find animals.", "Anna loves the volcanos in Iceland. She also loves <answer>horses</answer>."),
          ("Find animals.", "Mike lives with two <answer>dogs</answer> in a small flat in Berlin.")]

model.train(X=Xtrain, do_evaluation=False, learning_rate=1e-3, nb_opti_steps=100, seed=5)

/home/paethon/git/py39env/lib/python3.9/site-packages/torch/cuda/memory.py:384: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(

Test the model¶

We test our simple model with 4 samples based on two sentences. For prediction, we can either provide the context as plain text or as markup language text with label information (which are ignored for prediction). For evaluation, we need the correct label information and hence, the context has to be in markup language. The questions (commands) are always in plain text.

[5]:

Xtest_predict = [("Find persons.", "Yesterday, George Miller came back from England with a cat."),
                 ("Find locations.", "Yesterday, George Miller came back from England with a cat."),
                 ("Find animals.", "Yesterday, George Miller came back from England with a cat."),
                 ("Find animals.", "Linda loves cats but she is afraid of lions.")]

Xtest_eval = [("Find persons.", "Yesterday, <answer>George Miller</answer> came back from  England with a cat."),
              ("Find locations.", "Yesterday, George Miller came back from <answer> England</answer> with a cat."),
              ("Find animals.", "Yesterday, George Miller came back from  England with a <answer>cat</answer>."),
              ("Find animals.", "Linda loves <answer>cats</answer> but she is afraid of <answer>lions</answer>.")]

Prediction¶

[10]:

prediction = model.predict(Xtest_predict, is_markup=True)
# print all four questions with their answers
for i in range(4):
    print(f"{Xtest_predict[i][0]}:  {prediction[i]}")

Find persons.:  Yesterday, <answer>George Miller</answer> came back from England with a cat.
Find locations.:  Yesterday, George Miller came back from <answer>England</answer> with a cat.
Find animals.:  Yesterday, George Miller came back from England with a <answer>cat</answer>.
Find animals.:  Linda loves <answer>cats</answer> but she is afraid of <answer>lions</answer>.

Evaluation¶

[11]:

result_dict = model.evaluate(Xtest_eval)
print(result_dict)

{'f1_binary': 1.0, 'accuracy': 1.0, 'f1_weighted': 1.0, 'precision_weighted': 1.0, 'recall_weighted': 1.0}

Final remarks¶

Standard question answering models mostly mark the start and the end of an answer. As a consequence, these models can only identify one answer. We deviated from this approach, which allows AutoNLU to identify more than just one answer (see the last test sample, where the correct answers are “cats” and “lions”).
As in the tutorial for token classification, we like to remind you to escape all “<” and “>” in the original context with “\<” and “\>”.
For this example, the quality of the results depends on the random seed. We picked seed=5 in model.train(), which resulted in perfect predictions (with the current torch version). If you change the seed, you might see the model fail certain tasks. Alternatively, you could try more complicated test sentences. A question answering model trained on only 15 samples has its limitations…