Fine-tune your model on unlabeled data to improve performance

In this tutorial, we show another method for improving the performance of your model. Often, lots of unlabeled data is available, but only a small portion of it is labeled. In this case, the model can first be fine-tuned on the large, unlabeled, domain-specific data before training on the labeled data starts. This usually improves the performance of your model further. We will demonstrate this again on the Google Play Store dataset that we also used in tutorial-02 and tutorial-05.

[ ]:
%load_ext tensorboard

!pip install gdown -q
!pip install pandas -q
!pip install google-play-scraper -q

import autonlu
from autonlu import Model
import pandas as pd
import numpy as np
import gdown
from google_play_scraper import reviews
[ ]:
autonlu.login()

Fine-tune your model

Let’s start by downloading unlabeled data that can be used to fine-tune our model on app reviews:

[2]:
apps=["com.instagram.android",
    "com.facebook.katana",
    "com.whatsapp",
    "com.king.candycrush4",
    "com.android.chrome",
    "com.google.android.apps.wallpaper",
    "com.linkedin.android",
    "com.twitter.android",
    "com.wolfram.android.alpha",
    "com.microsoft.math",
    "com.spotify.music"]

for app in apps:
    result, continuation_token = reviews(app, count=5000, lang="en")
    with open(".cache/data/googleplay/goolge_play_reviews.txt", "a") as f:
        f.writelines([f"{r}\n\n" for r in [r["content"] for r in result]])

print(f"Downloaded unlabeled reviews from the google play store.")
Downloaded unlabeled reviews from the google play store.
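
If you want a quick sanity check of the corpus we just wrote (reviews separated by blank lines), a few lines of plain Python are enough. Note that reviews which themselves contain blank lines will be counted as several entries here:

[ ]:
# Peek at the fine-tuning corpus: count entries and show a couple of examples
with open(".cache/data/googleplay/google_play_reviews.txt") as f:
    corpus = [r.strip() for r in f.read().split("\n\n") if r.strip()]
print(f"{len(corpus)} reviews in the corpus")
print(corpus[:2])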

Next, we can use this unlabeled data to fine-tune our model. Usually, gigabytes of unlabeled data are available, since it is easy and cheap to crawl unlabeled text data from web interfaces. In this tutorial, we download “only” about 60k samples to demonstrate how easy it is to fine-tune your model with AutoNLU. To compensate for this comparatively small dataset, we also increase the number of epochs during training so that each sample is used several times during the fine-tuning process. Fine-tuning a model can take days if a lot of unlabeled data is available, but it can be worth it to get the last bit of performance. AutoNLU also supports training on multiple GPUs at once, which can be very helpful for language model fine-tuning.

[ ]:
# Load the model we want to use as a base
model = Model("albert-base-v2")
# Finetune on an in-domain text file
model.finetune(".cache/data/googleplay/goolge_play_reviews.txt",
    verbose=True,
    burnin_epochs=5,
    training_epochs=20)
# Save our finetuned model to be used as a base model for tasks at a later point
model.save("albert-google-play-reviews-base")

That’s it! Your model already has some basic knowledge about our target domain. We now train this model in a supervised fashion on our Google Play Store dataset, similar to tutorial 02.

Train our fine-tuned model in a supervised fashion

Now we have a model that is already fine-tuned on in-domain app reviews. We can continue to train it on our labeled dataset, as we have already shown in tutorial 02.

[ ]:
gdown.download("https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV", ".cache/data/googleplay/")
gdown.download("https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv", ".cache/data/googleplay/")


df = pd.read_csv(".cache/data/googleplay/reviews.csv")

def to_label(score):
    return "negative" if score <= 2 else \
           "neutral" if score == 3 else "positive"

X = list(df.content)
Y = [to_label(score) for score in df.score]
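
# Optional sanity check: inspect the label distribution before training;
# heavily imbalanced classes can make accuracy numbers misleading
from collections import Counter
print(Counter(Y))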

model.train(X, Y, verbose=True)
model.save("googleplay_labeling_finetuned")
[9]:
prediction_model = Model("googleplay_labeling_finetuned")
ret = prediction_model.predict([
    "This app is really cool and helpful.",
    "The app is quite ok, some things could be improved.",
    "The app does not work at all."])
print(ret)
['positive', 'positive', 'negative']
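
If you want to quantify how much the unsupervised fine-tuning actually helped, one option is to hold out part of the labeled data and train two models: one from the plain albert-base-v2 base and one from our fine-tuned base. The following is only a sketch (the 90/10 split, the shuffling, and the accuracy helper are our own additions, and training two models from scratch will take a while):

[ ]:
import random

# Shuffle before splitting so the held-out set is representative
data = list(zip(X, Y))
random.seed(0)
random.shuffle(data)
split = int(0.9 * len(data))
Xtrain, Ytrain = map(list, zip(*data[:split]))
Xtest, Ytest = map(list, zip(*data[split:]))

def accuracy(base_model_name):
    # Train a classifier starting from the given base model and
    # measure its accuracy on the held-out reviews
    m = Model(base_model_name)
    m.train(Xtrain, Ytrain)
    predictions = m.predict(Xtest)
    return sum(p == y for p, y in zip(predictions, Ytest)) / len(Ytest)

print("Plain base model:     ", accuracy("albert-base-v2"))
print("Fine-tuned base model:", accuracy("albert-google-play-reviews-base"))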