Train a model to automatically detect tags of StackOverflow questions

In this tutorial we use the AutoNLU engine to predict tags for questions asked on StackOverflow. In contrast to the previous example, each question can now carry an arbitrary number of tags, e.g. the question “what is the difference between java and javascript” can be tagged with both “java” and “javascript”. We will now demonstrate how simple it is to train an NLP model on this task using the AutoNLU engine. Let’s start by importing the libraries we need:

%load_ext tensorboard

!pip install pandas -q
import autonlu
from autonlu import Model
import pandas as pd
import numpy as np

Download data and prepare dataset

A dataset that contains StackOverflow questions and their corresponding tags already exists. We download this dataset to train our model:

df = pd.read_csv("", index_col=0)

Text Tags
2 aspnet site maps has anyone got experience cre... ['sql', '']
4 adding scripting functionality to net applicat... ['c#', '.net']
5 should i use nested classes in this case i am ... ['c++']
6 homegrown consumption of web services i have b... ['.net']
8 automatically update version number i would li... ['c#']

Before we can use this dataset, we have to convert the tags into real Python lists, because in the downloaded file each list of tags is stored as a single string:

def to_tags(tags):
    tags = tags.replace("[", "").replace("]", "").replace("'", "").replace(" ", "")
    return tags.split(",")

Y = [to_tags(tags) for tags in df.Tags]
X = [text for text in df.Text]

distinct_Y = set(np.concatenate(Y).ravel().tolist())
print(f"Found {str(len(distinct_Y))} classes: {str(distinct_Y)}")
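print("\n---- Example x/y pair ----------- ")
print(X[0][:100])  # first 100 characters of the first question
print(Y[0])        # tags of the first question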
Found 20 classes: {'python', 'android', 'jquery', 'c', '.net', 'c#', 'ruby', 'sql', 'objective-c', '', 'java', 'c++', 'html', 'javascript', 'css', 'iphone', 'ruby-on-rails', 'mysql', 'ios', 'php'}

---- Example x/y pair -----------
aspnet site maps has anyone got experience creating sqlbased aspnet sitemap providersi have got the
['sql', '']
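
Stripping characters by hand works for simple tags but would mangle a tag that contains a space, comma, or quote. As a hedged alternative, the sketch below parses the same string format with `ast.literal_eval` from the standard library and counts how often each tag occurs. The helper name `parse_tags` and the three sample strings are illustrative assumptions, not part of the original notebook.

```python
import ast
from collections import Counter


def parse_tags(raw):
    """Parse a string such as "['c#', '.net']" into a list of tag strings."""
    parsed = ast.literal_eval(raw)  # safely evaluates Python literals only
    return [str(tag) for tag in parsed]


# Illustrative strings in the same format as the Tags column (assumed sample)
raw_tags = ["['c#', '.net']", "['c++']", "['c#']"]
parsed = [parse_tags(raw) for raw in raw_tags]

# Count how often each tag occurs across all questions
tag_counts = Counter(tag for tags in parsed for tag in tags)
print(parsed)
print(tag_counts.most_common())
```

With the real data, the same helper could be applied as `Y = [parse_tags(tags) for tags in df.Tags]`.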

The output shows that each question is tagged with at least one of 20 different tags. Let’s now continue and train our model. Since the dataset is huge and we don’t want to wait too long for training to finish, we only use the first 10,000 questions for training:

X = X[:10000]
Y = Y[:10000]
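
Taking the first rows is the simplest option, but it inherits whatever ordering the file happens to have. A minimal sketch of drawing a random subset instead, with a fixed seed so the result is reproducible, might look as follows. The function `sample_subset` and the toy data are hypothetical and only serve as illustration.

```python
import random


def sample_subset(texts, labels, n, seed=0):
    """Return n randomly chosen texts with their labels, keeping pairs aligned."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    rng = random.Random(seed)  # local generator, does not touch global state
    indices = rng.sample(range(len(texts)), min(n, len(texts)))
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Toy data standing in for the real X and Y (assumed for illustration)
toy_X = [f"question {i}" for i in range(10)]
toy_Y = [[f"tag{i}"] for i in range(10)]

sub_X, sub_Y = sample_subset(toy_X, toy_Y, n=4)
for text, tags in zip(sub_X, sub_Y):
    print(text, tags)
```

For the tutorial data this would be called as `X, Y = sample_subset(X, Y, 10000)`.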

Training of the model

The AutoNLU engine will automatically detect that we now have a multi-label problem (each question can carry several tags) and adjust all parameters, augment the data etc. accordingly. To show how easy it is to use different Hugging Face transformer models, we will use ALBERT to solve this task. We additionally use stop_callback to set a maximum number of training steps, purely to demonstrate how it works. In practice, you would generally not do this and would use the epochs argument to control the length of training instead:

%tensorboard --logdir tensorboard_logs
steps = 10000
def stop_after_n_steps():
    global steps
    steps -= 1
    return steps <= 0

model = Model("albert-base-v2", stop_callback=stop_after_n_steps)
model.train(X, Y, verbose=True)
model.save("stackoverflow_labeling")
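
The stop function above keeps its counter in a global variable, so it has to be reset by hand before every new training run. One possible alternative is a small factory that returns a fresh counter each time. The sketch assumes, as the code above implies, that the callback is called without arguments and stops training once it returns True. The name `make_step_limit` is hypothetical.

```python
def make_step_limit(max_steps):
    """Return a zero-argument callback that returns True from call max_steps on."""
    remaining = max_steps

    def should_stop():
        nonlocal remaining  # counter lives in the closure, not in a global
        remaining -= 1
        return remaining <= 0

    return should_stop


# Each call to the factory yields an independent counter
stop = make_step_limit(3)
results = [stop() for _ in range(4)]
print(results)  # [False, False, True, True]
```

It could then be passed in the same way as before, e.g. `Model("albert-base-v2", stop_callback=make_step_limit(10000))`.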

Although the task is quite different from the one in the previous tutorial, the AutoNLU code is very similar and just as easy to use! Let’s now test some questions by hand to see how the model performs on new data:

prediction_model = Model("stackoverflow_labeling")
questions = ["what is typescript",
    "can pytorch also be used for ios or android apps",
    "when should I use javascript and when java"]
tags = prediction_model.predict(questions)

for i, q in enumerate(questions):
    print(f"{q} | {str(tags[i])}")
what is typescript | ['javascript']
can pytorch also be used for ios or android apps | ['android', 'ios']
when should I use javascript and when java | ['java', 'javascript']

In the first example, the trained model associated typescript with javascript and tagged the question correctly.
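
Checking a handful of questions by eye gives a first impression. For a larger held-out set, predicted tag lists can be compared with reference tags automatically. The following is a minimal sketch of micro-averaged precision, recall and F1 for multi-label predictions in plain Python. The function `micro_f1` and the reference tags are made up for illustration; only the predictions are taken from the output above.

```python
def micro_f1(true_tags, predicted_tags):
    """Micro-averaged precision, recall and F1 over lists of tag lists."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth_set, pred_set = set(truth), set(pred)
        tp += len(truth_set & pred_set)  # tags predicted and correct
        fp += len(pred_set - truth_set)  # tags predicted but not in the reference
        fn += len(truth_set - pred_set)  # reference tags the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Predictions as printed above; the reference tags are hypothetical
predicted = [["javascript"], ["android", "ios"], ["java", "javascript"]]
reference = [["javascript"], ["android", "ios", "python"], ["java", "javascript"]]

precision, recall, f1 = micro_f1(reference, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging pools all tag decisions before computing the scores, so frequent tags weigh more than rare ones. A macro average per tag would be the alternative if rare tags matter equally.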

In this tutorial, we have shown how to train a model that automatically tags StackOverflow questions with only a few lines of code and without expert knowledge in machine learning 😀