In this tutorial, you will see how to use the
DataCleaner class to identify samples from a dataset that might be problematic because of labeling errors or generally poor quality.
import autonlu
import datasets  # If you have not done so yet, you have to install with `pip install datasets`
First, we are going to load the Banking 77 dataset and bring it into the correct form
dataset = datasets.load_dataset("banking77")
X = dataset["train"]["text"]
Y = dataset["train"]["label"]
# Convert label numbers into readable names
Y = [dataset["train"].features["label"].int2str(y) for y in Y]
Let’s have a look at a few examples to get a feel for the dataset
for x, y in zip(X[:10], Y[:10]):
    print(x, "::", y)
I am still waiting on my card? :: card_arrival
What can I do if my card still hasn't arrived after 2 weeks? :: card_arrival
I have been waiting over a week. Is the card still coming? :: card_arrival
Can I track my card while it is in the process of delivery? :: card_arrival
How do I know if I will get my card, or if it is lost? :: card_arrival
When did you send me my new card? :: card_arrival
Do you have info about the card on delivery? :: card_arrival
What do I do if I still have not received my new card? :: card_arrival
Does the package with my card have tracking? :: card_arrival
I ordered my card but it still isn't here :: card_arrival
Determining which Samples to Check for Data Cleaning
First, we are creating an instance of the
DataCleaner class, passing in our dataset and the base model we want to use.
dc = autonlu.DataCleaner(X=X, Y=Y, basemodel="roberta-base", verbose=True)
clean_training_data: Detected task is label
Now we can start to collect train/test samples for our dataset by calling
sample_data. In addition to the arguments used directly by
sample_data, such as folds and repeats (indicating how many repeated train/test runs should be performed over all splits of the dataset), we can also pass in arbitrary other arguments, which are forwarded to the internal call of
train. We make use of this to turn off early stopping and to use a lower number of training steps. This speeds up the whole process at the cost of
slightly undertrained models.
Since sample_data performs many training runs, this can take quite some time, depending on the size of the dataset and on how many
folds and repeats are specified.
dc.sample_data(
    folds=2,
    repeats=30,
    # The following arguments are directly forwarded to train.
    # We are deactivating early stopping and are setting nb_opti_steps so one epoch
    # will be trained, to speed up the training part of the DataCleaner
    do_evaluation=False,
    nb_opti_steps=len(X)//32
)
We can now obtain the recommended samples to be checked for labeling errors (
check_label) and for general data quality (check_data).
res = dc.get_splits()
res.keys()
It is important to know what to expect from
DataCleaner. Although the recommended samples have a higher probability of containing mistakes, the overall percentage of mislabeled samples is in most cases still going to be rather low. From our measurements, we expect on average to be able to detect around 60% of labeling errors by checking around 5% of the original data.
get_splits also returns the indices that the recommended samples have in the original dataset, which makes it easy to merge corrected samples back into the original dataset after review.
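As a minimal sketch of this merge step (the `res` dictionary below is simulated toy data standing in for the real output of get_splits, and the corrected labels are hypothetical):

```python
# Toy dataset standing in for the real X and Y
X = ["text a", "text b", "text c", "text d"]
Y = ["label1", "label2", "label1", "label2"]

# Simulated get_splits() result: samples 1 and 3 were recommended for a label check
res = {"check_label": {"idxs": [1, 3], "X": [X[1], X[3]], "Y": [Y[1], Y[3]]}}

# Suppose manual review decided both samples should actually carry "label1".
# Map each original index to its corrected label ...
corrected = {1: "label1", 3: "label1"}

# ... and write the corrections back into the original label list
for idx, new_label in corrected.items():
    Y[idx] = new_label
```

The returned indices make this a simple in-place update; no searching or re-matching of texts is needed.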
Since the Banking 77 dataset is very clean, there are not a lot of labeling mistakes to be corrected.
for i, x, y in zip(res["check_label"]["idxs"][:10], res["check_label"]["X"][:10], res["check_label"]["Y"][:10]):
    print(i, "->", x, ":::", y)
8224 -> I think I was charged a different exchange rate than what was posted at the time. ::: wrong_exchange_rate_for_cash_withdrawal
8339 -> The exchange rate applied was incorrect when I was traveling outside the country. ::: wrong_exchange_rate_for_cash_withdrawal
8368 -> You applied the wrong exchange rate while I was traveling outside the country. ::: wrong_exchange_rate_for_cash_withdrawal
8302 -> I believe my exchange rate is incorrect ::: wrong_exchange_rate_for_cash_withdrawal
6839 -> I just got refunded for my purchase over two weeks ago ::: reverted_card_payment?
8313 -> The exchange rate for case abroad is applied wrong. ::: wrong_exchange_rate_for_cash_withdrawal
8262 -> I don't think the exchange rate was right. ::: wrong_exchange_rate_for_cash_withdrawal
8290 -> The exchange rate was wrong? ::: wrong_exchange_rate_for_cash_withdrawal
6750 -> I was refunded the money for something I bought already. ::: reverted_card_payment?
7297 -> I noticed that there was an extra charge fee on my account. Could You explain to me why? ::: transfer_fee_charged
On the other hand, the dataset does seem to contain quite a few samples of generally bad quality: for example, samples where the text itself does not contain enough information to decide which label should be given, or ambiguous sentences. It would be best to remove such samples from the dataset altogether.
for i, x, y in zip(res["check_data"]["idxs"][:10], res["check_data"]["X"][:10], res["check_data"]["Y"][:10]):
    print(i, "->", x, ":::", y)
1272 -> WHAT IS THE REASON FOR THAT ::: card_not_working
6971 -> How can I change my Rowlock ? ::: change_pin
4655 -> what is the matter? ::: direct_debit_payment_not_recognised
8802 -> What is this witdrawal ::: cash_withdrawal_not_recognised
4633 -> what is the word? ::: direct_debit_payment_not_recognised
146 -> Was there a number to track that I could get? ::: card_arrival
9898 -> As far as courtries go which ones are supported? ::: country_support
1796 -> Help me to set up contactless payments. ::: contactless_not_working
1793 -> Can I make a contactless payments? ::: contactless_not_working
6513 -> What kind of security protects my money? ::: verify_source_of_funds
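Removing such samples is straightforward with the returned indices. A minimal sketch (the toy lists and indices below are hypothetical stand-ins for the real X, Y, and res["check_data"]["idxs"]):

```python
# Toy dataset standing in for the real X and Y
X = ["good one", "???", "another good one", "what"]
Y = ["a", "b", "a", "b"]

# Simulated indices of samples judged too ambiguous or uninformative to keep
bad_idxs = set([1, 3])

# Keep only the samples whose index was not flagged
X_clean = [x for i, x in enumerate(X) if i not in bad_idxs]
Y_clean = [y for i, y in enumerate(Y) if i not in bad_idxs]
```

Using a set for the flagged indices keeps the membership test fast even for large datasets, and filtering X and Y with the same condition keeps texts and labels aligned.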