Data Cleaning¶
In this tutorial, you will see how to use the `DataCleaner` class to identify samples in a dataset that might be problematic because of labeling errors or generally poor quality.
[1]:
import autonlu
import datasets  # If you have not done so yet, install it with `pip install datasets`
The Dataset¶
First, we are going to load the Banking 77 dataset and bring it into the correct form.
[ ]:
dataset = datasets.load_dataset("banking77")
X = dataset["train"]["text"]
Y = dataset["train"]["label"]
# Convert label numbers into readable names
Y = [dataset["train"].features["label"].int2str(y) for y in Y]
Let’s have a look at a few examples to get a feel for the dataset
[16]:
for x, y in zip(X[:10], Y[:10]):
    print(x, "::", y)
I am still waiting on my card? :: card_arrival
What can I do if my card still hasn't arrived after 2 weeks? :: card_arrival
I have been waiting over a week. Is the card still coming? :: card_arrival
Can I track my card while it is in the process of delivery? :: card_arrival
How do I know if I will get my card, or if it is lost? :: card_arrival
When did you send me my new card? :: card_arrival
Do you have info about the card on delivery? :: card_arrival
What do I do if I still have not received my new card? :: card_arrival
Does the package with my card have tracking? :: card_arrival
I ordered my card but it still isn't here :: card_arrival
Determining which Samples to Check for Data Cleaning¶
First, we create an instance of the `DataCleaner` class, passing in our dataset and the base model we want to use.
[4]:
dc = autonlu.DataCleaner(X=X, Y=Y, basemodel="roberta-base", verbose=True)
clean_training_data: Detected task is label
Now we can start to collect train/test samples for our dataset by calling `sample_data`. In addition to the arguments used directly by `sample_data`, such as `repeats` (how many repeated train/test runs should be performed for all splits of the dataset), we can pass in arbitrary additional arguments, which are forwarded to the internal call of `train`. We use this to turn off early stopping and reduce the number of training steps. This speeds up the whole process, at the cost of slightly undertrained models.
Since `sample_data` performs many training runs, this can take quite some time, depending on the size of the dataset and on how many `repeats` and `folds` are specified.
[ ]:
dc.sample_data(
folds=2, repeats=30,
    # The following arguments are directly forwarded to train.
    # We deactivate early stopping and set nb_opti_steps so that only one epoch
    # is trained, which speeds up the training part of the DataCleaner
do_evaluation=False,
nb_opti_steps=len(X)//32
)
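To get an intuition for how `folds` and `repeats` interact, the following is a minimal, self-contained illustration of repeated k-fold splitting (this is only a sketch of the general scheme, not the internal implementation of `sample_data`): with `folds` splits repeated `repeats` times, every sample is held out exactly `repeats` times, and `folds * repeats` train/test runs are performed in total.

```python
import random
from collections import Counter

def repeated_kfold_indices(n, folds, repeats, seed=0):
    """Yield (train_idxs, test_idxs) for each of folds * repeats runs."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idxs = list(range(n))
        rng.shuffle(idxs)                  # fresh shuffle for every repeat
        for f in range(folds):
            test = idxs[f::folds]          # every folds-th index is held out
            test_set = set(test)
            train = [i for i in idxs if i not in test_set]
            yield train, test

runs = list(repeated_kfold_indices(n=8, folds=2, repeats=3))
print(len(runs))  # 6 runs: 2 folds * 3 repeats

# Every sample is held out exactly `repeats` times
held_out = Counter(i for _, test in runs for i in test)
print(set(held_out.values()))  # {3}
```

This also shows why the runtime grows quickly: `folds=2, repeats=30` already means 60 separate training runs.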
We can now obtain the recommended samples to be checked for labeling errors (`check_label`) and for general data quality (`check_data`):
[9]:
res = dc.get_splits()
res.keys()
It is important to know what to expect from `DataCleaner`. Although the recommended samples have a higher probability of containing mistakes, the overall percentage of mislabeled samples among them is in most cases still going to be rather low. From our measurements, we expect to be able to detect around 60% of labeling errors on average by checking around 5% of the original data.
`get_splits` also returns the indices that the recommended samples have in the original dataset, which can be used to easily merge the corrected samples back into the original dataset once they have been checked.
Since the Banking 77 dataset is very clean, there are not a lot of labeling mistakes to be corrected.
[18]:
for i, x, y in zip(res["check_label"]["idxs"][:10], res["check_label"]["X"][:10], res["check_label"]["Y"][:10]):
    print(i, "->", x, ":::", y)
8224 -> I think I was charged a different exchange rate than what was posted at the time. ::: wrong_exchange_rate_for_cash_withdrawal
8339 -> The exchange rate applied was incorrect when I was traveling outside the country. ::: wrong_exchange_rate_for_cash_withdrawal
8368 -> You applied the wrong exchange rate while I was traveling outside the country. ::: wrong_exchange_rate_for_cash_withdrawal
8302 -> I believe my exchange rate is incorrect ::: wrong_exchange_rate_for_cash_withdrawal
6839 -> I just got refunded for my purchase over two weeks ago ::: reverted_card_payment?
8313 -> The exchange rate for case abroad is applied wrong. ::: wrong_exchange_rate_for_cash_withdrawal
8262 -> I don't think the exchange rate was right. ::: wrong_exchange_rate_for_cash_withdrawal
8290 -> The exchange rate was wrong? ::: wrong_exchange_rate_for_cash_withdrawal
6750 -> I was refunded the money for something I bought already. ::: reverted_card_payment?
7297 -> I noticed that there was an extra charge fee on my account. Could You explain to me why? ::: transfer_fee_charged
On the other hand, the dataset seems to contain quite a few samples of generally bad quality: for example, samples where the text itself does not contain enough information to decide which label should be given, or sentences that are ambiguous. It would be best to remove such samples from the dataset altogether.
[19]:
for i, x, y in zip(res["check_data"]["idxs"][:10], res["check_data"]["X"][:10], res["check_data"]["Y"][:10]):
    print(i, "->", x, ":::", y)
1272 -> WHAT IS THE REASON FOR THAT ::: card_not_working
6971 -> How can I change my Rowlock ? ::: change_pin
4655 -> what is the matter? ::: direct_debit_payment_not_recognised
8802 -> What is this witdrawal ::: cash_withdrawal_not_recognised
4633 -> what is the word? ::: direct_debit_payment_not_recognised
146 -> Was there a number to track that I could get? ::: card_arrival
9898 -> As far as courtries go which ones are supported? ::: country_support
1796 -> Help me to set up contactless payments. ::: contactless_not_working
1793 -> Can I make a contactless payments? ::: contactless_not_working
6513 -> What kind of security protects my money? ::: verify_source_of_funds
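Removing such samples is a simple filter over the original lists. A minimal sketch, assuming `res["check_data"]["idxs"]` indexes into the original `X`/`Y` (here replaced by toy data, with `bad_idxs` standing in for the indices you decided to drop after review):

```python
# Toy dataset standing in for the original X/Y lists
X = ["good example", "WHAT IS THE REASON FOR THAT", "another good example"]
Y = ["label_a", "card_not_working", "label_a"]

bad_idxs = {1}  # e.g. set of res["check_data"]["idxs"] entries confirmed as bad

# Keep only samples whose index was not flagged
X_clean = [x for i, x in enumerate(X) if i not in bad_idxs]
Y_clean = [y for i, y in enumerate(Y) if i not in bad_idxs]

print(len(X_clean))  # 2
```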