{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train a model to label reviews from the GooglePlay store\n", "\n", "In this tutorial, we will show you how you can train a model using AutoNLU on a custom dataset.\n", "More precisely, we train a model to predict reviews of the Google Play store. This dataset contains reviews by many different users with a star rating out of five possible stars. Our goal is to predict the sentiment (positive, negative or neutral) of a given review.\n", "\n", "You can also compare training using AutoNLU to training with other frameworks such as HuggingFace which is shown in https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/ for the same dataset. As you can see, we achieve the same results with only 20 lines of code. Also, no expert machine learning knowledge is needed, as hyperparameters are automatically selected by our AutoNLU engine.\n", "\n", "Note: We recommend using a machine with an Nvidia GPU for this tutorial." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.1.1; however, version 21.1.2 is available.\r\n", "You should consider upgrading via the '/home/paethon/git/py39env/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\r\n" ] } ], "source": [ "%load_ext tensorboard\n", "\n", "!pip install pandas -q" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import autonlu\n", "from autonlu import Model\n", "import pandas as pd\n", "import numpy as np\n", "import gdown" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User name/Email: admin\n", "Password: ········\n" ] } ], "source": [ "autonlu.login()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download and prepare dataset\n", "At first, we automatically download and prepare the google play app reviews dataset. Note that this installs gdown in your pip environment." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading...\n", "From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\n", "To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/apps.csv\n", "100%|██████████| 134k/134k [00:00<00:00, 2.02MB/s]\n", "Downloading...\n", "From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv\n", "To: /home/paethon/git/autonlu/tutorials/.cache/data/googleplay/reviews.csv\n", "7.17MB [00:00, 23.6MB/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
contentscore
0Update: After getting a response from the deve...1
1Used it for a fair amount of time without any ...1
2Your app sucks now!!!!! Used to be good but no...1
3It seems OK, but very basic. Recurring tasks n...1
4Absolutely worthless. This app runs a prohibit...1
\n", "
" ], "text/plain": [ " content score\n", "0 Update: After getting a response from the deve... 1\n", "1 Used it for a fair amount of time without any ... 1\n", "2 Your app sucks now!!!!! Used to be good but no... 1\n", "3 It seems OK, but very basic. Recurring tasks n... 1\n", "4 Absolutely worthless. This app runs a prohibit... 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdown.download(\"https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\", \".cache/data/googleplay/\")\n", "gdown.download(\"https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv\", \".cache/data/googleplay/\")\n", "\n", "df = pd.read_csv(\".cache/data/googleplay/reviews.csv\")\n", "df.head()[[\"content\", \"score\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, we now downloaded the GooglePlay reviews dataset and displayed the first entries. For this tutorial, we are interested in predicting whether a review was positive or negative (score) based on the content. So lets at first convert the dataset into classes. More precisely, let's convert our 5-star review into negative, neutral, and positive." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "negative: This App is useless , it works as a notepad with your tasks in it , it does not notify you of tasks ...\n", "positive: Very helpful for me to list out my works ...\n" ] } ], "source": [ "def to_label(score):\n", " return \"negative\" if score <= 2 else \\\n", " \"neutral\" if score == 3 else \"positive\"\n", "\n", "X = [x for x in df.content]\n", "Y = [to_label(score) for score in df.score]\n", "\n", "print(f\"{Y[10]}: {X[10][0:100]}...\")\n", "print(f\"{Y[920]}: {X[920][0:100]} ...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training and babysitting of the model\n", "Before we start to train our network, we plot the training progress within TensorBoard which is supported out-of-the-box in our AutoNLU engine. Unfortunately, the output of TensorBoard is not preserved with the static versions of the notebook, so you will have to execute it yourself to see the visualization. The train/validation split, hyperparameter selection etc. is done internally. Because of this, the training, including visualization, can easily be started with the following four lines of code:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Reusing TensorBoard on port 6006 (pid 15051), started 4 days, 18:53:58 ago. 
(Use '!kill 15051' to kill it.)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%tensorboard --logdir tensorboard_logs" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model bert-base-uncased loaded from Huggingface successfully.\n" ] } ], "source": [ "model = Model(\"bert-base-uncased\")\n", "model.train(X, Y)\n", "model.save(\"googleplay_labeling\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict new data with your trained model\n", "That's it! We trained our model without any manual hyperparameter tuning and still achieved an accuracy of 87.62% on this dataset. Now we can use this model to predict new sentences:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['positive', 'neutral', 'negative']\n" ] } ], "source": [ "ret = model.predict([\n", "    \"I really love this app.\",\n", "    \"The app is ok but it needs improvement.\",\n", "    \"It's crashing all the time, I can't use it.\"])\n", "print(ret)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 2 }