Data Cleaning¶
AutoNLU offers a way to efficiently clean training data with the help
of the DataCleaner
class.
The cleaning itself still has to be performed manually, but
DataCleaner
can make the process far more efficient
by recommending which samples of the training dataset to check.
The workflow looks as follows:
1. Create a DataCleaner for the dataset to be cleaned
2. Call sample_data() to analyze the dataset. At this step, models will be trained on parts of the data and the remaining data will be predicted. During this process, information about the samples of the dataset will be recorded
3. Call get_splits() to get a dictionary which contains information about which data is probably OK and which data is worth checking
DataCleaner can detect two different kinds of possible problems with the data:

Labeling Problems: check_label from get_splits() contains samples where the system has detected possible labeling errors, i.e. the labels/classes of these samples might not be correct. This generally works well for samples which were labeled incorrectly by mistake. It works far less well for systematic mistakes (e.g. because different persons understood the meaning of certain labels/classes differently, or because there are overlaps in the meaning of labels/classes, …). Samples from this set are candidates for manually correcting the labels/classes.

Data Problems: check_data from get_splits() contains samples where the system has detected possible problems with the data itself, for example sentences which can't be classified without context, which are ambiguous in general, or which are completely nonsensical. Samples from this set are candidates for complete removal from the dataset. In our experience, check_data is more relevant for label tasks than it is for class and classlabel tasks.
Ideally, the data should be checked in the order it is returned by the system, since returned samples are sorted by their probability of being problematic. Be aware that samples can occur in both check_label as well as check_data!
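For illustration, here is a sketch of how the returned splits might be reviewed in order, using a hypothetical result dictionary (the samples, labels, and indices below are made up, not produced by AutoNLU):

```python
# Hypothetical get_splits()-style result; the structure follows the
# documented {"check_label": {...}, "check_data": {...}} layout.
splits = {
    "check_label": {
        "X": ["great room", "terrible food", "nice view"],
        "Y": ["NEG", "POS", "NEG"],
        "idxs": [4, 9, 1],
    },
    "check_data": {
        "X": ["asdf qwer", "nice view"],
        "Y": ["POS", "NEG"],
        "idxs": [7, 1],
    },
}

# Review candidates in the returned order (most problematic first).
for text, label, idx in zip(splits["check_label"]["X"],
                            splits["check_label"]["Y"],
                            splits["check_label"]["idxs"]):
    print(f"sample {idx}: {text!r} labeled {label!r}")

# A sample can appear in both sets, so deduplicate by index if you
# only want to look at each sample once.
overlap = set(splits["check_label"]["idxs"]) & set(splits["check_data"]["idxs"])
print("in both sets:", sorted(overlap))  # [1]
```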
Also, have a look at the tutorial about “Data Cleaning” to see the whole system in action.
DataCleaner¶
- class autonlu.DataCleaner(X, Y, basemodel, standard_label=None, verbose=False)¶
Analyze training data and detect samples which are worth checking for labeling errors and generally poor data quality.
- Parameters
X – Input text samples as a list of strings
Y – Training target. List containing the correct output. The input format can be:
- A list of strings for a label task, e.g. ["POS", "NEG", "NEG", "POS"]
- A list of lists of strings for a class task, e.g. [["service"], [], ["support", "sales"]]
- A list of lists of lists of two strings (class and label) for a classlabel task, e.g. [[["room", "POS"], ["service", "NEG"]], [["cleanliness", "NEU"]]]
basemodel (str) – The base model that should be used for the training part of the procedure. Don't forget to add prefixes if you want to use them (e.g. #omi is definitely recommended for class and classlabel tasks)
standard_label (Optional[str]) – The standard label to be used (if it should be used) for the dataset if the data is in the classlabel format
verbose (bool) – If True, more verbose output is produced during sampling (e.g. progress bars)
Examples:
>>> ds = autonlu.DataCleaner(X=X, Y=Y, basemodel="roberta-base", verbose=True)
>>> ds.sample_data(folds=2, repeats=100, do_early_stopping=False, nb_opti_steps=len(X)//32)
>>> res = ds.get_splits()
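Since Y determines the task type, it can help to sanity-check its nesting before constructing a DataCleaner. The helper below is a hypothetical illustration (not part of AutoNLU) that distinguishes the three documented formats:

```python
def infer_task_type(Y):
    """Guess the task type from the nesting of the training target.

    Returns "label", "class", or "classlabel", matching the three
    Y formats accepted by DataCleaner.
    """
    if all(isinstance(y, str) for y in Y):
        return "label"                        # ["POS", "NEG", ...]
    if all(isinstance(y, list) for y in Y):
        inner = [e for y in Y for e in y]     # flatten one level
        if all(isinstance(e, str) for e in inner):
            return "class"                    # [["service"], [], ...]
        if all(isinstance(e, list) and len(e) == 2 for e in inner):
            return "classlabel"               # [[["room", "POS"]], ...]
    raise ValueError("Y does not match any supported format")

print(infer_task_type(["POS", "NEG", "NEG", "POS"]))                  # label
print(infer_task_type([["service"], [], ["support", "sales"]]))       # class
print(infer_task_type([[["room", "POS"]], [["cleanliness", "NEU"]]])) # classlabel
```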
- sample_data(folds=2, repeats=20, checkpoint_filename=None, **kwargs)¶
Obtain information about samples in the dataset which will be used to decide whether they should be checked or not.
This is achieved by training on part of the data (folds determines the number of parts the dataset should be split into for this), predicting the left-out part, recording samples which would have been mispredicted, and also recording additional information (like the entropies, etc.). This process is done repeats times. More recorded samples generally result in more reliable results. sample_data() can be called multiple times and the results of all these calls will be aggregated.
- Parameters
folds (int) – Number of pieces the dataset should be split into. Implicitly also specifies the number of training runs which will be performed per repeat. The default is 2, which works well for most datasets; a higher number can be beneficial for very small datasets.
repeats (int) – Number of times the training/testing procedure should be repeated. A higher number takes longer, but makes the cleaning process more reliable. In all, folds * repeats training/testing runs will be performed. We have observed improved results even above 500 repeats, although improvements tend to slow down after around 20 repeats (the default).
checkpoint_filename (Optional[str]) – Optional filename for a checkpoint of the DataCleaner class. A checkpoint will be saved after each repeat if checkpoint_filename is not None. The checkpoint can later be loaded with load(). This can be helpful for very long runs which might be interrupted.
**kwargs – Arbitrary additional arguments are passed along to the train function of the model. They can be used to train the model for data analysis in the same way as the final model will be.
- Return type
None
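Conceptually, the schedule behind folds and repeats is repeated k-fold cross-validation: each repeat reshuffles the data, splits it into folds pieces, and runs one train/predict pass per piece. A simplified, framework-free sketch of that schedule (the function name is hypothetical, not AutoNLU API):

```python
import random

def cross_validation_runs(n_samples, folds=2, repeats=20, seed=0):
    """Yield (repeat, fold, train_idxs, heldout_idxs) for every run.

    Mirrors the folds * repeats schedule described for sample_data():
    each repeat reshuffles the data, splits it into `folds` pieces,
    trains on all but one piece and predicts the held-out piece.
    """
    rng = random.Random(seed)
    idxs = list(range(n_samples))
    for repeat in range(repeats):
        rng.shuffle(idxs)
        for fold in range(folds):
            heldout = idxs[fold::folds]        # every folds-th sample
            heldout_set = set(heldout)
            train = [i for i in idxs if i not in heldout_set]
            yield repeat, fold, train, heldout

runs = list(cross_validation_runs(n_samples=10, folds=2, repeats=3))
print(len(runs))  # folds * repeats = 6 training/testing runs
```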
- get_splits()¶
Returns which samples of the dataset are recommended for manual checking.
Two separate classes of labels to check will be returned by the system. One set contains samples which might contain labeling errors (i.e. samples which were given the wrong label or class). The other set contains samples which might generally be of bad quality (e.g. samples that can't be labeled without context, nonsensical or ambiguous sentences, etc.) and might be candidates to be removed from the dataset altogether. These two sets are NOT mutually exclusive (i.e. a sample can occur in both of them).
- Returns
A dictionary of the following structure:
{"check_label": {"X", "Y", "idxs"}, "check_data": {"X", "Y", "idxs"}}
where check_label should be checked for labeling errors, and check_data should be checked for general data quality and might contain samples which should be removed from the dataset altogether. X, Y, and idxs contain the input samples, training target, and indices of the selected samples.
- Return type
Dictionary
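The idxs entries make it easy to act on the recommendations, e.g. to drop flagged samples from the original dataset. A sketch on made-up data (the flagged indices below are assumed for illustration, not computed by AutoNLU):

```python
X = ["good", "bad", "???", "fine", "gibberish zx"]
Y = ["POS", "NEG", "POS", "POS", "NEG"]

# Hypothetical outcome: suppose the check_data split flagged
# indices 2 and 4 for removal after manual review.
to_remove = {2, 4}

X_clean = [x for i, x in enumerate(X) if i not in to_remove]
Y_clean = [y for i, y in enumerate(Y) if i not in to_remove]
print(X_clean)  # ['good', 'bad', 'fine']
print(Y_clean)  # ['POS', 'NEG', 'POS']
```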
- save(filename)¶
Saves a checkpoint which can later be loaded again with load().
Warning: Saving a checkpoint is not intended for longer-term storage. Compatibility of checkpoints is not ensured across different versions of AutoNLU.
- Parameters
filename (str) – Filename under which the checkpoint should be saved.
- Return type
None
- static load(filename)¶
Loads a previously saved checkpoint
- Parameters
filename (str) – Filename of the checkpoint to load
- Return type
DataCleaner
- Returns
The DataCleaner instance, reconstructed from the specified checkpoint.