Train a model to label reviews from the Google Play store

In this tutorial, we show you how to train a model on a custom dataset using AutoNLU. More precisely, we train a model on Google Play store reviews. This dataset contains reviews from many different users, each with a star rating out of five possible stars. Our goal is to predict the sentiment (positive, negative, or neutral) of a given review.

You can also compare training with AutoNLU to training with other frameworks such as Hugging Face, which is shown for the same dataset in https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/. As you can see, we achieve the same results with only 20 lines of code. In addition, no expert machine learning knowledge is needed, as hyperparameters are automatically selected by our AutoNLU engine.

Note: We recommend using a machine with an Nvidia GPU for this tutorial.
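
If you are unsure whether a GPU is actually visible to your Python environment, you can check via PyTorch before starting (a quick sanity check; this assumes PyTorch is installed, which is the case in a typical AutoNLU setup):

[ ]:
# Quick check whether an Nvidia GPU is usable (assumes PyTorch is installed,
# as it is in a typical AutoNLU environment).
import torch
print(torch.cuda.is_available())  # True if a CUDA-capable GPU can be used
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))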

[1]:
%load_ext tensorboard

!pip install pandas gdown -q
[2]:
import autonlu
from autonlu import Model
import pandas as pd
import numpy as np
import gdown
[3]:
autonlu.login()
User name/Email: admin
Password: ········

Download and prepare dataset

First, we automatically download and prepare the Google Play app reviews dataset using gdown, which we installed into the pip environment above.

[4]:
gdown.download("https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV", ".cache/data/googleplay/")
gdown.download("https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv", ".cache/data/googleplay/")

df = pd.read_csv(".cache/data/googleplay/reviews.csv")
df.head()[["content", "score"]]
Downloading...
From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/apps.csv
100%|██████████| 134k/134k [00:00<00:00, 2.02MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/reviews.csv
7.17MB [00:00, 23.6MB/s]
[4]:
content score
0 Update: After getting a response from the deve... 1
1 Used it for a fair amount of time without any ... 1
2 Your app sucks now!!!!! Used to be good but no... 1
3 It seems OK, but very basic. Recurring tasks n... 1
4 Absolutely worthless. This app runs a prohibit... 1

Great, we have now downloaded the Google Play reviews dataset and displayed the first entries. For this tutorial, we are interested in predicting the sentiment of a review (score) based on its content. So let’s first convert the dataset into classes. More precisely, let’s convert the 5-star ratings into negative, neutral, and positive labels.

[5]:
def to_label(score):
    # Map 1-2 stars to "negative", 3 stars to "neutral", and 4-5 stars to "positive"
    return "negative" if score <= 2 else \
           "neutral" if score == 3 else "positive"

X = [x for x in df.content]
Y = [to_label(score) for score in df.score]

print(f"{Y[10]}: {X[10][0:100]}...")
print(f"{Y[920]}: {X[920][0:100]} ...")
negative: This App is useless , it works as a notepad with your tasks in it , it does not notify you of tasks ...
positive: Very helpful for me to list out my works ...
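
Before training, it can also be helpful to look at how the three classes are distributed, since a strong class imbalance makes accuracy numbers harder to interpret. A quick sketch using the Y list we just created:

[ ]:
# Count how often each of the three labels occurs (uses Y and pandas from above).
label_counts = pd.Series(Y).value_counts()
print(label_counts)
print((label_counts / label_counts.sum()).round(3))  # relative class frequencies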

Training and babysitting the model

Before we start training our network, we launch TensorBoard to monitor the training progress, which is supported out of the box by our AutoNLU engine. Unfortunately, the TensorBoard output is not preserved in the static versions of this notebook, so you will have to execute it yourself to see the visualization. The train/validation split, hyperparameter selection, etc. are handled internally. Because of this, training, including visualization, can easily be started with the following four lines of code:

[17]:
%tensorboard --logdir tensorboard_logs
Reusing TensorBoard on port 6006 (pid 15051), started 4 days, 18:53:58 ago. (Use '!kill 15051' to kill it.)
[18]:
model = Model("bert-base-uncased")
model.train(X, Y)
model.save("googleplay_labeling")
Model bert-base-uncased loaded from Huggingface successfully.
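
The call to model.save stores the fine-tuned model in the googleplay_labeling directory. If you come back to this notebook later, you should be able to load it again by passing that directory to Model, just like the Hugging Face model name was passed above (a sketch; treat the local-path loading behavior as an assumption and check the AutoNLU documentation for your version):

[ ]:
# Sketch: reload the fine-tuned model from the directory written by model.save.
# Assumes Model accepts a local path in the same way it accepts a model name.
reloaded = Model("googleplay_labeling")
print(reloaded.predict(["Works great, five stars from me!"]))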

Predict new data with your trained model

That’s it, we trained our model without any manual hyperparameter tuning and still achieved an accuracy of 87.62% on this dataset. Now we can use this model to predict new sentences:

[19]:
ret = model.predict([
    "I really love this app.",
    "The app is ok but it neets improvement.",
    "Its crashing all the time I can't use it."])
print(ret)
['positive', 'neutral', 'negative']
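
The 87.62% accuracy mentioned above is the number AutoNLU reports from its internal evaluation during training. If you want to estimate the accuracy yourself, you can hold out part of the data, train a fresh model on the rest, and score its predictions manually. A minimal sketch in plain Python (the 90/10 split and the fixed seed are arbitrary choices for illustration, not part of the original tutorial):

[ ]:
# Sketch: estimate accuracy on a manually held-out split.
# A fresh model is trained on 90% of the data and scored on the remaining 10%,
# since evaluating on examples the model was trained on would be overly optimistic.
import random

random.seed(0)
idx = list(range(len(X)))
random.shuffle(idx)
split = int(0.9 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]

X_train, Y_train = [X[i] for i in train_idx], [Y[i] for i in train_idx]
X_test, Y_test = [X[i] for i in test_idx], [Y[i] for i in test_idx]

eval_model = Model("bert-base-uncased")
eval_model.train(X_train, Y_train)
preds = eval_model.predict(X_test)
accuracy = sum(p == y for p, y in zip(preds, Y_test)) / len(Y_test)
print(f"Held-out accuracy: {accuracy:.2%}")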
[ ]: