Data Cleaning

AutoNLU offers a way to efficiently clean training data with the help of the DataCleaner class.

The cleaning itself still has to be performed manually, but DataCleaner can make the process far more efficient by recommending which samples of the training dataset to check.

The workflow looks as follows:

  1. Create a DataCleaner for the dataset to be cleaned

  2. Call sample_data() to analyze the dataset. At this step, models will be trained on parts of the data and the remaining data will be predicted. During this process, information about the samples of the dataset will be recorded

  3. Call get_splits() to get a dictionary which contains information about which data is probably ok and which data is worth checking

DataCleaner can detect two different kinds of possible problems with the data:

  1. Labeling Problems: check_label from get_splits() contains samples where the system has detected possible labeling errors, i.e. the labels/classes of these samples might not be correct. This generally works well for samples which were labeled incorrectly by mistake. It works far less well for systematic mistakes (e.g. because different people understood the meaning of certain labels/classes differently, or because there are overlaps in the meaning of labels/classes). Samples from this set are candidates for manually correcting the labels/classes.

  2. Data Problems: check_data from get_splits() contains samples where the system has detected possible problems with the data itself, for example sentences which can’t be classified without context, which are ambiguous in general, or which are completely nonsensical. Samples from this set are candidates for complete removal from the dataset. In our experience, check_data is more relevant for label tasks than it is for class and classlabel tasks.

Ideally, the data should be checked in the order it is returned by the system since returned samples are sorted by their probability of being problematic. Be aware that samples can occur in both check_label as well as check_data!
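
A minimal sketch of such a review pass (the printing and the overlap bookkeeping are purely illustrative; res is the dictionary returned by get_splits() as described below):

>>> res = ds.get_splits()
>>> overlap = set(res["check_label"]["idxs"]) & set(res["check_data"]["idxs"])
>>> for x, y, i in zip(res["check_label"]["X"], res["check_label"]["Y"], res["check_label"]["idxs"]):
...     note = " (also in check_data)" if i in overlap else ""
...     print(f"{i}: {y} -> {x}{note}")  # inspect and correct the label manually if needed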

Also, have a look at the tutorial about “Data Cleaning” to see the whole system in action.

DataCleaner

class autonlu.DataCleaner(X, Y, basemodel, standard_label=None, verbose=False)

Analyze training data and detect samples which are worth checking for labeling errors and generally poor data quality.

Parameters
  • X – Input text samples as a list of strings

  • Y

    Training target. List containing the correct output. The input format can be:

    • A list of strings for a label task. e.g. ["POS", "NEG", "NEG", "POS"]

    • A list of lists of strings for a class task e.g. [["service"], [], ["support", "sales"]]

    • A list of lists of lists of two strings (class and label) for a classlabel task e.g. [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]

  • basemodel (str) – The base model that should be used for the training part of the procedure. Don’t forget to add prefixes if you want to use them (e.g. #omi is definitely recommended for class and classlabel tasks)

  • standard_label (Optional[str]) – The standard label to use (if one should be used) when the data is in the classlabel format

  • verbose (bool) – If True, more verbose output is produced during sampling (e.g. progress bars)

Examples:

>>> ds = autonlu.DataCleaner(X=X, Y=Y, basemodel="roberta-base", verbose=True)
>>> ds.sample_data(folds=2, repeats=100, do_early_stopping=False, nb_opti_steps=len(X)//32)
>>> res = ds.get_splits()

sample_data(folds=2, repeats=20, checkpoint_filename=None, **kwargs)

Obtain information about samples in the dataset which will be used to decide whether they should be checked or not.

This is achieved by training on part of the data (folds determines the number of parts the dataset is split into for this), predicting the left-out part, recording samples which would have been mispredicted, and also recording additional information (like the entropies, etc.). This process is repeated repeats times. More recorded samples generally lead to more reliable results.

sample_data() can be called multiple times and the results of all these calls will be aggregated.

Parameters
  • folds (int) – Number of pieces the dataset should be split into. Implicitly also specifies the number of training runs which will be performed per repeat. The default is 2, which works well for most datasets, a higher number can be beneficial for very small datasets.

  • repeats (int) – Number of times the training/testing procedure should be repeated. A higher number takes longer, but makes the cleaning process more reliable. In all, folds * repeats training/testing runs will be performed. We have observed improved results even above 500 repeats, although improvements tend to slow down after around 20 repeats (the default).

  • checkpoint_filename (Optional[str]) – Optional filename for a checkpoint of the DataCleaner class. A checkpoint will be saved after each repeat if checkpoint_filename is not None. The checkpoint can later be loaded with load(). This can be helpful for very long runs which might be interrupted.

  • **kwargs – Arbitrary additional arguments are passed along to the train function of the model. They can be used to train the model for data analysis in the same way as the final model will be.

Return type

None
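
For long runs it can make sense to combine checkpointing with additional training arguments. The call below is only a sketch: the checkpoint filename is made up, and the extra keyword arguments (do_early_stopping, nb_opti_steps) are taken from the Examples block above and are simply forwarded to the model’s train function.

>>> ds.sample_data(folds=2, repeats=50,
...                checkpoint_filename="cleaner_checkpoint.dc",
...                do_early_stopping=False, nb_opti_steps=len(X)//32)
>>> ds.sample_data(folds=2, repeats=10)  # results of both calls are aggregated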

get_splits()

Returns which samples of the dataset are recommended for manual checking.

Two separate sets of samples to check will be returned by the system. One set contains samples which might contain labeling errors (i.e. samples which were given the wrong label or class). The other set contains samples which might generally be of bad quality (e.g. samples that can’t be labeled without context, nonsensical or ambiguous sentences, etc.) and might be candidates to be removed from the dataset altogether. These two sets are NOT mutually exclusive (i.e. a sample can occur in both of them).

Returns

A dictionary of the following structure: {"check_label": {"X", "Y", "idxs"}, "check_data": {"X", "Y", "idxs"}} where check_label should be checked for labeling errors, and check_data should be checked for general data quality and might contain samples which should be removed from the dataset altogether. X, Y, and idxs contain the input samples, training targets, and indices of the selected samples.

Return type

Dictionary
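
As an illustration, the idxs entries can be used to map flagged samples back to the original X and Y lists, for example to drop check_data samples that a manual review confirmed to be unusable. The confirmed_bad set below is a hypothetical placeholder for the outcome of that review:

>>> res = ds.get_splits()
>>> confirmed_bad = set(res["check_data"]["idxs"][:10])  # hypothetical: indices confirmed bad during manual review
>>> X_clean = [x for i, x in enumerate(X) if i not in confirmed_bad]
>>> Y_clean = [y for i, y in enumerate(Y) if i not in confirmed_bad]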

save(filename)

Saves a checkpoint which can later be loaded again with load().

Warning: Saving a checkpoint is not intended for longer term storage. Compatibility of checkpoints is not ensured across different versions of AutoNLU.

Parameters

filename (str) – Filename under which the checkpoint should be saved.

Return type

None

static load(filename)

Loads a previously saved checkpoint

Parameters

filename (str) – Filename of the checkpoint to load

Return type

DataCleaner

Returns

Returns the DataCleaner instance, reconstructed from the specified checkpoint.
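
A typical checkpoint round trip might look as follows (the filename is only an example). Since sample_data() aggregates results across calls, sampling can simply be continued after loading:

>>> ds.save("datacleaner_checkpoint.dc")
>>> # ... later, possibly in a new process ...
>>> ds = autonlu.DataCleaner.load("datacleaner_checkpoint.dc")
>>> ds.sample_data(folds=2, repeats=10)
>>> res = ds.get_splits()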