# Improve our Google Play Store example with manual labeling

In this tutorial, we extend tutorial 02 by manually labeling more reviews to get a better classifier. Deciding which samples to label so that each label adds as much information as possible is challenging. Usually you would just label random sentences and hope that the selection happens to be informative. In all likelihood, though, most of the samples you label this way are ones the system is already fairly sure how to process correctly. To make the selection more efficient, AutoNLU provides a method that selects the (currently unlabeled) samples that should be labeled in order to improve the training set.

Note: We assume that tutorial 02 has already been executed and that the trained and saved model is available!

In [1]:
%load_ext tensorboard

!pip install gdown -q
!pip install pandas -q
!pip install google-play-scraper -q

import autonlu
from autonlu import Model
import pandas as pd
import numpy as np
import gdown
from google_play_scraper import app

In [None]:
autonlu.login()

### Load dataset and the trained model (tutorial 02)
First, we load the data that we already have available. Note: For more information about the dataset, please see tutorial 02.

In [2]:
gdown.download("https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV", ".cache/data/googleplay/")
gdown.download("https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv", ".cache/data/googleplay/")

Downloading...
From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/apps.csv
100%|██████████| 134k/134k [00:00<00:00, 1.72MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/reviews.csv
7.17MB [00:00, 19.9MB/s]


'.cache/data/googleplay/reviews.csv'

In [3]:
df = pd.read_csv(".cache/data/googleplay/reviews.csv")

def to_label(score):
    # Map a star rating to a sentiment label.
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"

X = [x for x in df.content]
Y = [to_label(score) for score in df.score]

### Active learning

Let's start extending our dataset. First, we download additional reviews from the Google Play Store.

In [4]:
data = app("com.instagram.android", lang="en")
comments = data["comments"]

print("Successfully downloaded reviews for app.")

Successfully downloaded reviews for app.


We have access to the comments but not to the corresponding ratings, so we have to label the reviews manually. To save time and reduce the manual labeling effort, we want to label only the five reviews that will improve our dataset most once labeled. To find those five reviews, we use the AutoNLU function `select_to_label`.

In [24]:
# Load the model that was trained and saved in tutorial 02
model = Model("googleplay_labeling")
ret = model.select_to_label(comments)
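
How `select_to_label` ranks samples internally is not described in this tutorial. For intuition only, here is a minimal sketch of one common active learning heuristic, entropy-based uncertainty sampling. This is an assumption for illustration and not necessarily what AutoNLU does; the function names and the toy probabilities are hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(texts, probabilities):
    """Return texts and their entropies, ordered from most to least uncertain."""
    scored = [(prediction_entropy(p), t) for t, p in zip(texts, probabilities)]
    # sorted() is stable, so ties keep their original order
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored], [s for s, _ in scored]


# Hypothetical class probabilities in the order (negative, neutral, positive)
toy_texts = ["great app", "it is okay I guess", "crashes constantly"]
toy_probs = [
    [0.02, 0.03, 0.95],  # confident: positive
    [0.30, 0.40, 0.30],  # unsure: a label here is the most informative
    [0.90, 0.07, 0.03],  # confident: negative
]

ranked_texts, ranked_scores = rank_by_uncertainty(toy_texts, toy_probs)
print(ranked_texts[0])  # prints "it is okay I guess"
```

Under this heuristic, the review whose class probabilities are spread most evenly comes first, because the model learns the most from a label it could not predict confidently.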

In [11]:
print("Top 5 reviews")
# Keep only the first five entries of each returned list; ret[0] holds the review texts
ret = [ret[0][:5], ret[1][:5]]
for review in ret[0]:
    print(f"\n{review}")


Top 5 reviews

Instagram is an amazing social networking app formally and informally as well, but today since afternoon I am facing an issue. I cannot login as a face scan mode is getting turned on to check if it's me or no. After the face scan is successfully done it still does not allow me to login to my account by giving an issue " instagram is facing crash issue " If the team can help me to solve this issue it will be great


I do enjoy this app however the so call updates that you have made, being the theme changing on a chat and ect seems to simply not be there or as if they do not exist at all. I have updated instagram to see if there was an issue but overall it says your latest features are not there or if there weren't any at all. It can get a bit annoying sometimes and I would like for this to get fix quickly please. Have a good day.

Hi, this is my personal account and reels option is visible for me but when I'm trying to upload a reel after recording, I'm getting an error "

AutoNLU returned a list of reviews ordered by importance, and we printed the five most important ones. We could now manually label those samples, add them to our training set, and train the model again to improve our classifier. This would give us much better results than labeling five random samples. The process can be repeated several times until the model is good enough for our needs. The more unlabeled data you give the active learning method to choose from, the better the selected texts become.
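
The step of adding manually labeled samples to the training set can be sketched as follows. This is a minimal illustration in plain Python: the helper `add_manual_labels`, the toy reviews, and the labels assigned to them are hypothetical. The label set matches the one produced by `to_label`. Retraining itself would then proceed as in tutorial 02 and is not shown here.

```python
VALID_LABELS = {"negative", "neutral", "positive"}


def add_manual_labels(X, Y, new_texts, new_labels):
    """Return extended copies of X and Y with the manually labeled samples appended."""
    if len(new_texts) != len(new_labels):
        raise ValueError("Each new text needs exactly one label.")
    unknown = set(new_labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    # Copy instead of mutating, so the original training set stays intact
    return list(X) + list(new_texts), list(Y) + list(new_labels)


# Toy stand-ins for the existing training set and the selected reviews
X_demo = ["Love it", "Terrible update"]
Y_demo = ["positive", "negative"]
selected_reviews = ["Login fails after the face scan", "The theme feature is missing"]
manual_labels = ["negative", "neutral"]  # labels assigned by hand

X_extended, Y_extended = add_manual_labels(X_demo, Y_demo, selected_reviews, manual_labels)
print(len(X_extended), Y_extended[-2:])  # prints: 4 ['negative', 'neutral']
```

Validating the labels before merging catches typos such as "postive" early, before they silently become a new class in the training data.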