{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Cleaning\n", "\n", "In this tutorial, you will see how to use the `DataCleaner` class to identify samples in a dataset that might be problematic because of labeling errors or generally poor quality." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import autonlu\n", "import datasets # If you have not done so yet, install it with `pip install datasets`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Dataset\n", "First, we load the Banking 77 dataset and bring it into the form that `DataCleaner` expects: a list of texts `X` and a list of label names `Y`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = datasets.load_dataset(\"banking77\")\n", "X = dataset[\"train\"][\"text\"]\n", "Y = dataset[\"train\"][\"label\"]\n", "# Convert label numbers into readable names\n", "Y = [dataset[\"train\"].features[\"label\"].int2str(y) for y in Y]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at a few examples to get a feel for the dataset." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am still waiting on my card? :: card_arrival\n", "What can I do if my card still hasn't arrived after 2 weeks? :: card_arrival\n", "I have been waiting over a week. Is the card still coming? :: card_arrival\n", "Can I track my card while it is in the process of delivery? :: card_arrival\n", "How do I know if I will get my card, or if it is lost? :: card_arrival\n", "When did you send me my new card? :: card_arrival\n", "Do you have info about the card on delivery? :: card_arrival\n", "What do I do if I still have not received my new card? :: card_arrival\n", "Does the package with my card have tracking? 
:: card_arrival\n", "I ordered my card but it still isn't here :: card_arrival\n" ] } ], "source": [ "for x, y in zip(X[:10], Y[:10]):\n", "    print(x, \"::\", y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Determining which Samples to Check for Data Cleaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we create an instance of the `DataCleaner` class, passing in our dataset and the base model we want to use." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clean_training_data: Detected task is label\n" ] } ], "source": [ "dc = autonlu.DataCleaner(X=X, Y=Y, basemodel=\"roberta-base\", verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can start to collect train/test samples for our dataset by calling `sample_data`. In addition to the arguments used directly by `sample_data`, such as `repeats` (how many repeated train/test runs are performed over all splits of the dataset), we can pass arbitrary other arguments, which are forwarded to the internal call of `train`. We use this to turn off early stopping and to reduce the number of training steps. This speeds up the whole process and results in slightly undertrained models.\n", "\n", "Since `sample_data` performs many training runs, it can take quite some time, depending on the size of the dataset and on how many `repeats` and `folds` are specified." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dc.sample_data(\n", "    folds=2, repeats=30,\n", "    # The following arguments are forwarded directly to train.\n", "    # We deactivate early stopping and set nb_opti_steps so that roughly one epoch\n", "    # is trained (assuming a batch size of 32), which speeds up the DataCleaner.\n", "    do_evaluation=False,\n", "    nb_opti_steps=len(X) // 32\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now obtain the samples recommended for checking, both for labeling errors (`check_label`) and for general data quality (`check_data`):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "res = dc.get_splits()\n", "res.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to know what to expect from `DataCleaner`. Although the recommended samples have a higher probability of containing mistakes, the percentage of mislabeled samples among them will in most cases still be rather low. From our measurements, we expect to detect around 60% of labeling errors on average by checking around 5% of the original data.\n", "\n", "`get_splits` also returns the indices that the recommended samples have in the original dataset, which makes it easy to merge the samples back into the original dataset once they have been corrected.\n", "\n", "Since the Banking 77 dataset is very clean, there are only a few labeling mistakes to correct." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8224 -> I think I was charged a different exchange rate than what was posted at the time. ::: wrong_exchange_rate_for_cash_withdrawal\n", "8339 -> The exchange rate applied was incorrect when I was traveling outside the country. 
::: wrong_exchange_rate_for_cash_withdrawal\n", "8368 -> You applied the wrong exchange rate while I was traveling outside the country. ::: wrong_exchange_rate_for_cash_withdrawal\n", "8302 -> I believe my exchange rate is incorrect ::: wrong_exchange_rate_for_cash_withdrawal\n", "6839 -> I just got refunded for my purchase over two weeks ago ::: reverted_card_payment?\n", "8313 -> The exchange rate for case abroad is applied wrong. ::: wrong_exchange_rate_for_cash_withdrawal\n", "8262 -> I don't think the exchange rate was right. ::: wrong_exchange_rate_for_cash_withdrawal\n", "8290 -> The exchange rate was wrong? ::: wrong_exchange_rate_for_cash_withdrawal\n", "6750 -> I was refunded the money for something I bought already. ::: reverted_card_payment?\n", "7297 -> I noticed that there was an extra charge fee on my account. Could You explain to me why? ::: transfer_fee_charged\n" ] } ], "source": [ "for i, x, y in zip(res[\"check_label\"][\"idxs\"][:10], res[\"check_label\"][\"X\"][:10], res[\"check_label\"][\"Y\"][:10]):\n", "    print(i, \"->\", x, \":::\", y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, the dataset seems to contain quite a few samples of generally poor quality. Examples include ambiguous sentences and samples whose text does not contain enough information to decide which label should be assigned. It would be best to remove such samples from the dataset altogether." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1272 -> WHAT IS THE REASON FOR THAT ::: card_not_working\n", "6971 -> How can I change my Rowlock ? ::: change_pin\n", "4655 -> what is the matter? ::: direct_debit_payment_not_recognised\n", "8802 -> What is this witdrawal ::: cash_withdrawal_not_recognised\n", "4633 -> what is the word? ::: direct_debit_payment_not_recognised\n", "146 -> Was there a number to track that I could get? 
::: card_arrival\n", "9898 -> As far as courtries go which ones are supported? ::: country_support\n", "1796 -> Help me to set up contactless payments. ::: contactless_not_working\n", "1793 -> Can I make a contactless payments? ::: contactless_not_working\n", "6513 -> What kind of security protects my money? ::: verify_source_of_funds\n" ] } ], "source": [ "for i, x, y in zip(res[\"check_data\"][\"idxs\"][:10], res[\"check_data\"][\"X\"][:10], res[\"check_data\"][\"Y\"][:10]):\n", "    print(i, \"->\", x, \":::\", y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" } }, "nbformat": 4, "nbformat_minor": 4 }