{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Improve our Google Play store example with manual labeling\n", "\n", "This tutorial extends tutorial 02: we manually label additional reviews to get a better classifier. Deciding which samples to label so that each label adds as much information as possible is challenging. Usually you would label random sentences and hope that the selection contains useful information for the system, but in all likelihood most of those samples will be ones that the system already knows how to process correctly. To make the selection more efficient, AutoNLU provides a method that selects the (currently unlabeled) samples that should be labeled in order to optimize the training set.\n", "\n", "Note: This tutorial assumes that tutorial 02 has already been executed and that its trained and saved model is available." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext tensorboard\n", "\n", "!pip install gdown -q\n", "!pip install pandas -q\n", "!pip install google-play-scraper -q\n", "\n", "import autonlu\n", "from autonlu import Model\n", "import pandas as pd\n", "import numpy as np\n", "import gdown\n", "from google_play_scraper import app" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "autonlu.login()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset and the trained model (tutorial 02)\n", "First, we load the data that we have available. Note: For more information about the dataset, please see tutorial 02." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading...\n", "From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\n", "To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/apps.csv\n", "100%|██████████| 134k/134k [00:00<00:00, 1.72MB/s]\n", "Downloading...\n", "From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv\n", "To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/reviews.csv\n", "7.17MB [00:00, 19.9MB/s]\n" ] }, { "data": { "text/plain": [ "'.cache/data/googleplay/reviews.csv'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdown.download(\"https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\", \".cache/data/googleplay/\")\n", "gdown.download(\"https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv\", \".cache/data/googleplay/\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\".cache/data/googleplay/reviews.csv\")\n", "\n", "# Map a star rating to a sentiment label\n", "def to_label(score):\n", "    return \"negative\" if score <= 2 else \\\n", "        \"neutral\" if score == 3 else \"positive\"\n", "\n", "X = [x for x in df.content]\n", "Y = [to_label(score) for score in df.score]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Active learning\n", "\n", "Let's start extending our dataset. First, we download additional reviews from the Google Play store." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully downloaded reviews for app.\n" ] } ], "source": [ "data = app(\"com.instagram.android\", lang=\"en\")\n", "comments = data[\"comments\"]\n", "\n", "print(\"Successfully downloaded reviews for app.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have access to the comments, but not to the corresponding ratings, so we have to label the reviews manually. To save time and reduce the manual labeling effort, we want to label only the five reviews that will improve our dataset the most. To find those reviews, we use the AutoNLU function ```select_to_label```." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Load the model that was trained and saved in tutorial 02\n", "model = Model(\"googleplay_labeling\")\n", "ret = model.select_to_label(comments)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 reviews\n", "\n", "Instagram is an amazing social networking app formally and informally as well, but today since afternoon I am facing an issue. I cannot login as a face scan mode is getting turned on to check if it's me or no. After the face scan is successfully done it still does not allow me to login to my account by giving an issue \" instagram is facing crash issue \" If the team can help me to solve this issue it will be great\n", "\n", "For the most part it is awesome. There needs to be a warning for how long you get restricted and what for when following and unfollowing a lot of people in a short time. I'm grounded and have no idea for how long or what for. I am guessing it's because of me trying to get new. Followers for my business. Idk. But a warning and some follow up info would have been appreciated so I would know what not to do in the future. 
At this point it's a guessing game for me.\n", "\n", "I do enjoy this app however the so call updates that you have made, being the theme changing on a chat and ect seems to simply not be there or as if they do not exist at all. I have updated instagram to see if there was an issue but overall it says your latest features are not there or if there weren't any at all. It can get a bit annoying sometimes and I would like for this to get fix quickly please. Have a good day.\n", "\n", "Hi, this is my personal account and reels option is visible for me but when I'm trying to upload a reel after recording, I'm getting an error \"oops something went wrong\" I've reported the problem so please fix this issue ASAP and make my reel option working. I'm giving 3 starts but I'll change it to 5 once the issue is being resolved. Thanks.\n", "\n", "I had reals but suddenly they disappeared. I had music for stories, now suddenly it's not an option. My app version is current, my best guess is a bug in the latest version, but it makes the experience... frustrating, especially when trying to grow a business. I can't pace my peers.\n" ] } ], "source": [ "print(\"Top 5 reviews\")\n", "# Keep only the first five entries of both returned lists\n", "ret = [ret[0][:5], ret[1][:5]]\n", "# ret[0] holds the selected review texts\n", "for review in ret[0]:\n", "    print(f\"\\n{review}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "AutoNLU returned a list of reviews ordered by importance, and we printed the five most important ones. We could now manually label those samples, add them to our training set, and train the model again to improve the performance of our classifier, which would give us much better results than labeling five random samples. This can be repeated until the model is good enough for our needs. The more unlabeled data you give the active learning method to choose from, the better the selected texts will be." 
] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 2 }