===============
 Data Cleaning
===============

AutoNLU offers a way to efficiently clean training data with the help
of the :class:`~autonlu.DataCleaner` class.

The cleaning itself still has to be performed manually, but
:class:`~autonlu.DataCleaner` can make the process far more efficient
by recommending which samples of the training dataset to check.

The workflow looks as follows:

1. Create a :class:`~autonlu.DataCleaner` for the dataset to be cleaned.
2. Call :func:`~autonlu.DataCleaner.sample_data` to analyze the
   dataset. In this step, models are trained on parts of the data
   and used to predict the remaining data. During this process,
   information about the samples of the dataset is recorded.
3. Call :func:`~autonlu.DataCleaner.get_splits` to get a dictionary
   which contains information about which data is probably ok and
   which data is worth checking.
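
The call order of steps 2 and 3 can be sketched as follows. This is only an
illustrative sketch: it assumes that both methods can be called without
arguments, which may not match the real signatures, and ``FakeCleaner`` is a
hypothetical stand-in used so the snippet runs on its own. It is not the real
:class:`~autonlu.DataCleaner`; see the class reference below for the actual
constructor and method parameters.

.. code-block:: python

   def find_samples_to_check(cleaner):
       """Run the analysis step, then fetch the recommendations."""
       # Step 2: analyze the dataset (assumption: no arguments required).
       cleaner.sample_data()
       # Step 3: retrieve the dictionary with the samples worth checking.
       return cleaner.get_splits()


   class FakeCleaner:
       """Hypothetical stand-in for illustration, NOT autonlu.DataCleaner."""

       def __init__(self):
           self.sampled = False

       def sample_data(self):
           # The real class trains models here; the stand-in only sets a flag.
           self.sampled = True

       def get_splits(self):
           if not self.sampled:
               raise RuntimeError("call sample_data() before get_splits()")
           # Only the two keys discussed in this document are modeled here.
           return {"check_label": [], "check_data": []}


   splits = find_samples_to_check(FakeCleaner())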

:class:`~autonlu.DataCleaner` can detect two different kinds of
possible problems with the data:

1. **Labeling Problems**: The ``check_label`` entry returned by
   :func:`~autonlu.DataCleaner.get_splits` contains samples where the
   system has detected possible labeling errors, i.e. the
   labels/classes of these samples might not be correct. This
   generally works well for samples which were labeled incorrectly by
   mistake. It works far less well for systematic mistakes (e.g.
   because different people understood the meaning of certain
   labels/classes differently, or because the meanings of
   labels/classes overlap). Samples from this set are
   candidates for manually correcting the labels/classes.
2. **Data Problems**: The ``check_data`` entry returned by
   :func:`~autonlu.DataCleaner.get_splits` contains samples where the
   system has detected possible problems with the data itself, for
   example sentences which can't be classified without context, which
   are ambiguous in general, or which are completely nonsensical.
   Samples from this set are candidates for complete removal from the
   dataset. In our experience, ``check_data`` is more relevant for label
   tasks than it is for class and classlabel tasks.

Ideally, the data should be checked in the order it is returned by the
system, since returned samples are sorted by their probability of being
problematic. Be aware that samples can occur in both ``check_label``
and ``check_data``!
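
Because a sample can show up in both sets, it can help to group the flagged
samples before reviewing them. The helper below is a minimal sketch, not part
of AutoNLU. It assumes that ``check_label`` and ``check_data`` are ordered
sequences of hashable items (for example sample texts or indices), which is an
assumption about the returned dictionary and should be checked against the
real output. The name ``group_flagged_samples`` is hypothetical.

.. code-block:: python

   def group_flagged_samples(splits):
       """Group flagged samples by the set(s) in which they appear.

       Assumption: both entries are ordered sequences of hashable items.
       The original order (most likely problematic first) is preserved
       within each group.
       """
       check_label = list(splits["check_label"])
       check_data = list(splits["check_data"])
       label_set = set(check_label)
       data_set = set(check_data)
       return {
           # Flagged for both reasons; ordered as in check_label.
           "both": [s for s in check_label if s in data_set],
           # Candidates for correcting the label only.
           "label_only": [s for s in check_label if s not in data_set],
           # Candidates for removal only.
           "data_only": [s for s in check_data if s not in label_set],
       }


   # Toy input with made-up sample identifiers.
   example = {"check_label": ["a", "b", "c"], "check_data": ["c", "d", "a"]}
   groups = group_flagged_samples(example)
   # groups == {"both": ["a", "c"], "label_only": ["b"], "data_only": ["d"]}

Whether samples in the ``both`` group should be relabeled or removed still has
to be decided manually for each sample.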
   
Also, have a look at the tutorial about "Data Cleaning" to see the
whole system in action.
   
-----------
DataCleaner
-----------

.. autoclass:: autonlu.DataCleaner
   :members: sample_data, get_splits, save, load