===============
Data Cleaning
===============

AutoNLU offers a way to efficiently clean training data with the help of the :class:`~autonlu.DataCleaner` class. The cleaning itself still has to be performed manually, but :class:`~autonlu.DataCleaner` can make the process far more efficient by recommending which samples of the training dataset to check.

The workflow looks as follows:

1. Create a :class:`~autonlu.DataCleaner` for the dataset to be cleaned.
2. Call :func:`~autonlu.DataCleaner.sample_data` to analyze the dataset. At this step, models are trained on parts of the data and used to predict the remaining data. During this process, information about the samples of the dataset is recorded.
3. Call :func:`~autonlu.DataCleaner.get_splits` to get a dictionary which contains information about which data is probably ok and which data is worth checking.

:class:`~autonlu.DataCleaner` can detect two different kinds of possible problems with the data:

1. **Labeling Problems**: ``check_label`` from :func:`~autonlu.DataCleaner.get_splits` contains samples where the system has detected possible labeling errors, i.e. the labels/classes of these samples might not be correct. This generally works well for samples which were labeled incorrectly by mistake. It works far less well for systematic mistakes (e.g. because different people understood the meaning of certain labels/classes differently, or because there are overlaps in the meaning of labels/classes). Samples from this set are candidates for manually correcting the labels/classes.
2. **Data Problems**: ``check_data`` from :func:`~autonlu.DataCleaner.get_splits` contains samples where the system has detected possible problems with the data itself, for example sentences which can't be classified without context, which are ambiguous in general, or which are completely nonsensical. Samples from this set are candidates for complete removal from the dataset.

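
Because a sample may be flagged for both reasons, it can help to merge the two sets into a single review queue before checking by hand. The following is only a minimal sketch and does not call AutoNLU: it assumes, purely for illustration, that ``check_label`` and ``check_data`` each map to an ordered list of sample indices. The actual structure returned by :func:`~autonlu.DataCleaner.get_splits` may differ, and ``build_review_queue`` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Tuple


def build_review_queue(splits: Dict[str, List[int]]) -> List[Tuple[int, List[str]]]:
    """Merge flagged samples into one queue, keeping the order of each list.

    ASSUMPTION: ``splits`` maps "check_label" and "check_data" to ordered
    lists of sample indices. This layout is hypothetical and only mirrors
    the two keys named in the text.
    """
    reasons: Dict[int, List[str]] = {}
    for key in ("check_label", "check_data"):
        for sample_index in splits.get(key, []):
            # A sample flagged in both lists collects both reasons
            # but appears only once in the queue.
            reasons.setdefault(sample_index, []).append(key)
    # Dicts preserve insertion order, so label candidates come first,
    # followed by samples that were flagged only for data problems.
    return list(reasons.items())


# Hand-made example input, not real DataCleaner output.
example_splits = {"check_label": [7, 2, 9], "check_data": [9, 4]}
for sample_index, why in build_review_queue(example_splits):
    print(sample_index, why)
```

Samples that carry both reasons are natural candidates to look at first, since a sample that is removed entirely no longer needs its label corrected.
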
In our experience, ``check_data`` is more relevant for label tasks than it is for class and classlabel tasks.

Ideally, the data should be checked in the order in which it is returned by the system, since the returned samples are sorted by their probability of being problematic.

Be aware that samples can occur in both ``check_label`` and ``check_data``!

Also, have a look at the tutorial about "Data Cleaning" to see the whole system in action.

-----------
DataCleaner
-----------

.. autoclass:: autonlu.DataCleaner
   :members: sample_data, get_splits, save, load