Utils¶

autonlu.status(key=None)¶

Return status information about the current system

Parameters: key (Optional[str]) – A JSON web token which is used for authentication. If no key is given, the key is alternatively taken from the environment variable DO_PRODUCT_KEY.
Returns: A dictionary containing information about the currently used key, reachable sites (huggingface, Studio, and Google as a general indicator of internet connectivity), package versions, and general system information. If no key was given or set as an environment variable, the “key” entry will be None.

Examples:

>>> import autonlu, pprint
>>> pprint.pprint(autonlu.status())
{'key': {'info': {'exp': datetime.datetime(2030, 8, 3, 10, 50, 1),
                  'functionality': ['Analysis/*', 'Training/*', 'Models/*'],
                  'iat': datetime.datetime(2020, 8, 5, 10, 50, 1),
                  'iss': 'deepopinion.ai',
                  'languages': ['*'],
                  'sub': 3},
         'message': 'OK',
         'verified': True},
 'package_versions': {'autonlu': '0.2.0',
                      'numpy': '1.18.5',
                      'pynvml': '8.0.4',
                      'requests': '2.24.0',
                      'security': '0.3.5',
                      'sklearn': '0.23.2',
                      'tensorboard': '2.3.0',
                      'torch': '1.7.0',
                      'torch_optimizer': '0.0.1a16',
                      'tqdm': '4.48.1',
                      'transformers': '3.3.1'},
 'reachable': {'api.deepopinion.ai': True,
               'google.com': True,
               'huggingface.co': True},
 'sysinfo': {'architecture': 'x86_64',
             'hostname': '127.0.0.1localhost.localdomainlocalhost',
             'ip-address': '192.168.0.124',
             'mac-address': '60:45:cb:86:22:51',
             'platform': 'Linux',
             'platform-release': '5.9.10-arch1-1',
             'platform-version': '#1 SMP PREEMPT Sun, 22 Nov 2020 14:16:59 '
                                 '+0000',
             'processor': '',
             'python_build': ('default', 'Sep 30 2020 04:00:38'),
             'python_version': '3.8.6',
             'ram': '31 GB'}}

>>> print("Key expires at:", autonlu.status()["key"]["info"]["exp"])
Key expires at: 2030-08-03 10:50:01
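
The returned dictionary can also be inspected programmatically. The following is a minimal sketch, assuming the layout shown in the example output above. `days_until_expiry` is a hypothetical helper, not part of AutoNLU, and the `status` dictionary is written by hand so the snippet runs without a key.

```python
import datetime


def days_until_expiry(status, now):
    """Return the whole days left until the key expires, or None if no key is set.

    Hypothetical helper: assumes `status` has the layout shown in the example output.
    """
    key = status.get("key")
    if key is None:  # no key was given or set as an environment variable
        return None
    return (key["info"]["exp"] - now).days


# Hand-written stand-in for the result of autonlu.status()
status = {"key": {"info": {"exp": datetime.datetime(2030, 8, 3, 10, 50, 1)},
                  "message": "OK",
                  "verified": True}}

remaining = days_until_expiry(status, now=datetime.datetime(2030, 8, 1, 10, 50, 1))
print("Days until the key expires:", remaining)
```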

autonlu.split_dataset(*sets, split_at, seed=42)¶

Takes lists of dataset elements and splits off a random selection, e.g. for train/validation splits

Parameters

sets – An arbitrary number of lists of the same length that should be split together. Elements that have the same position before splitting will be in the same split and will also have the same position afterwards
split_at (float) – Value between 0 and 1. The fraction of the data to be used for the validation/test set
seed (int) – Seed to use for the random part of the dataset splitting. If the same seed is provided, it is guaranteed that you will get the same split of the dataset every time. A fixed seed is used by default to prevent errors (a fixed seed is usually what you want). If you would like to not use a fixed seed, set seed to None.

Return type

Tuple

Returns

A tuple containing each of the splits of all lists that were given to the function

Examples

>>> X = ["The room was nice", "We did not enjoy our stay", ...]
>>> Y = ["POS", "NEG", ...]
>>> trainX, trainY, testX, testY = split_dataset(X, Y, split_at=0.2)
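
To illustrate what "split together" means, here is a minimal sketch of a joint split. It is not AutoNLU's implementation: `split_together` is a hypothetical name, and the return order (all training parts first, then all test parts) is an assumption taken from the unpacking order in the example above.

```python
import random


def split_together(*sets, split_at, seed=42):
    """Illustrative joint split; not AutoNLU's actual implementation."""
    n = len(sets[0])
    if any(len(s) != n for s in sets):
        raise ValueError("All lists must have the same length")
    indices = list(range(n))
    # One permutation is shared by all lists, so elements stay aligned.
    # seed=None makes random.Random seed itself from system entropy.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * split_at))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = tuple([s[i] for i in train_idx] for s in sets)
    test = tuple([s[i] for i in test_idx] for s in sets)
    return train + test


X = [f"text {i}" for i in range(10)]
Y = [f"label {i}" for i in range(10)]
trainX, trainY, testX, testY = split_together(X, Y, split_at=0.2)
```

With the default seed, calling the function again on the same data yields the same split, which mirrors the guarantee described for the `seed` parameter.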

autonlu.check_model.is_valid(modelname)¶

Returns True if the specified model is a valid model that can be loaded by AutoNLU

Parameters: modelname (str) – Model to be checked. Can be anything that can also be used by SimpleModel to load a model.
Return type: bool
Returns: True if the specified model is valid, False otherwise

Examples:

>>> if autonlu.check_model.is_valid("/path/to/uploaded/model"):
...     pass  # Add model to database

autonlu.check_model.modeltype(modelname)¶

Returns the model type of a specified model

Parameters: modelname (str) – Model to be checked. Can be anything that can also be used by SimpleModel to load a model.
Return type: str
Returns: base, class, label, classlabel, or invalid

autonlu.get_best_device()¶

Returns best available device to use for classifier.

Can be overridden by the environment variable DO_DEVICE

Return type: str
Returns: 'cuda' if CUDA is available and 'cpu' otherwise. If the environment variable DO_DEVICE is set, its content will be returned instead.

Example

>>> device = get_best_device()
>>> if device == "cuda":
...     print("CUDA is available")
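
The precedence between the environment variable and the CUDA check can be sketched as follows. This is an illustration only: `pick_device` is a hypothetical name, the CUDA availability is passed in as a plain boolean (in practice it would presumably come from `torch.cuda.is_available()`), and treating an empty DO_DEVICE as unset is an assumption.

```python
import os


def pick_device(cuda_available, environ=os.environ):
    """Illustrative device selection; not AutoNLU's actual implementation."""
    override = environ.get("DO_DEVICE")
    if override:  # assumption: an empty value counts as "not set"
        return override
    return "cuda" if cuda_available else "cpu"


print(pick_device(True, environ={}))                     # no override
print(pick_device(True, environ={"DO_DEVICE": "cpu"}))   # override wins
```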

autonlu.get_config(config_path)¶

Loads config.json given a filepath (file or directory)

Parameters: config_path (str) – Path to the config.json or a directory containing config.json
Return type: Dict
Returns: Dictionary with the content of config.json
Raises: ValueError – In case the given filename does not exist or the given path does not contain a file named config.json

Example

>>> config = get_config("/path/to/model/config.json")
>>> config = get_config("/path/to/model")
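
The file-or-directory handling described above can be sketched with the standard library. `load_config` is a hypothetical stand-in, not AutoNLU's code, and the content of the demo config.json is made up for the illustration.

```python
import json
import os
import tempfile


def load_config(config_path):
    """Illustrative loader; not AutoNLU's actual implementation."""
    if os.path.isdir(config_path):
        # A directory was given: look for config.json inside it
        config_path = os.path.join(config_path, "config.json")
    if not os.path.isfile(config_path):
        raise ValueError(f"No config.json found at {config_path}")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


# Demonstrate both call styles with a throwaway model directory
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"model_type": "bert"}, f)
    from_dir = load_config(model_dir)
    from_file = load_config(os.path.join(model_dir, "config.json"))
```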

autonlu.get_model_dir(model_name, organization=None, key=None, baseurl=None)¶

Returns the folder for a given model name or returns the name itself if the model seems to be a huggingface model

This ensures that SimpleModel, Model, and DocumentModel can simply use the returned value to load a model from different sources. In practice this method does not have to be called separately since SimpleModel, Model, and DocumentModel use it internally to resolve model names.

Parameters

model_name (str) –
Name of the model to be loaded. Can be:
- A direct path to a model
- The name of a model in Studio
- The name of a model in the Huggingface model repo
organization (Optional[str]) – Name of organization (if any) to use
key (Optional[str]) – Token to authenticate a user. If not given, a token will be looked up in the environment variable DO_PRODUCT_KEY.
baseurl (Optional[str]) – Base url of the studio instance used for this call. If None, the environment variable DO_BASEURL will be used. If DO_BASEURL is not defined, the standard base-url will be used.

Return type

str

Returns

If model_name is a directory, the directory itself.
If model_name is the name of a model in Studio, the model is downloaded if necessary and the directory is returned. If the model was loaded previously, the path to the cached model will be returned
If model_name is the name of a model in the Huggingface model repo, the name itself is returned.

Raises

ValueError – If model is found in DO download API but could not be downloaded

Examples

Load model from directory:

>>> modeldir = get_model_dir("path/to/my/private/model")
>>> model = SimpleModel(modeldir)

Load model from Studio:

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> modeldir = get_model_dir("DeepOpinion/hotels_absa_en")
>>> model = SimpleModel(modeldir)

Load model from Huggingface model repo:

>>> modeldir = get_model_dir("bert-base-uncased")
>>> model = SimpleModel(modeldir)

autonlu.get_classes(sourcedir)¶

Returns a list of all classes a model was trained on

Parameters: sourcedir (str) – Path to a model containing a meta.json, a config.json, a classes.json, or an aspects.json file. If multiple files are available they will be used in the given order.
Return type: List[str]
Returns: List of all classes, sorted alphabetically. If the model does not use classes (e.g. for a label-task) None will be returned
Raises: ValueError – If no classes could be found in any of the mentioned files

Example

>>> modeldir = get_model_dir("en-base", key=TOKEN)
>>> classes = get_classes(modeldir)

autonlu.get_all_labels(sourcedir)¶

Returns a list of all labels a model was trained on

Parameters: sourcedir (str) – Path to a model containing either a meta.json or a config.json. If both are available, meta.json will be preferred.
Return type: List[str]
Returns: List of labels. The order is important, since the model itself is internally trained on label numbers and those are the index of the label in the list. Returns None if no labels were found.
Raises: KeyError – If the labels given in config.json do not have contiguous indices.

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> modeldir = get_model_dir("en-base")
>>> all_labels = get_all_labels(modeldir)
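
Because the model works on label numbers, the position of a label in the list is what maps a prediction back to a name. A minimal sketch of that mapping, using a hand-written label list and hypothetical model output rather than a real model:

```python
# Hand-written stand-in for the result of get_all_labels(modeldir)
all_labels = ["NONE", "NEG", "NEU", "POS"]

# The index in the list is the number the model is trained on
label_to_index = {label: index for index, label in enumerate(all_labels)}
index_to_label = dict(enumerate(all_labels))

predicted_indices = [3, 0, 1]  # hypothetical raw model output
predicted_labels = [index_to_label[i] for i in predicted_indices]
print(predicted_labels)
```

Reordering `all_labels` would silently change the meaning of every prediction, which is why the order must be preserved.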

autonlu.get_standard_label(sourcedir)¶

Returns the standard label a model was trained on

It will be assumed that all classes without a specific label implicitly have the standard label (if the standard label is not None).

A standard label will generally be used in cases where one label occurs at a much higher frequency than others. E.g. when solving aspect based sentiment analysis, the labels will generally be ["NONE", "NEG", "NEU", "POS"], but since most aspects will not occur in most sentences, the NONE label will be predominant (often more than 90%) and it makes sense to make NONE the standard label. I.e. if a class/aspect is not mentioned during training or after prediction in the annotated document, it is assumed to have the label NONE (i.e. it did not occur).

The standard label is actually not trained into the model and can be switched after training without any ill effects, but generally it will be fixed for a use case and is therefore associated with the model.

Parameters: sourcedir (str) – Path to a model containing a meta.json
Return type: Optional[str]
Returns: The name of the standard label (which has to occur in the list of all labels)

Example

Assumes the environment variable DO_PRODUCT_KEY is correctly set

>>> modeldir = get_model_dir("en-hotels-absa")
>>> standard_label = get_standard_label(modeldir)

autonlu.get_segment_class_pairs(segments, all_classes)¶

Takes segments and all possible classes and returns a list of tuples containing all segment/class combinations

Parameters

segments (List[str]) – List of all segments
all_classes (List[str]) – List of all possible classes

Return type

List[Tuple[str, str]]

Returns

List of (segment, class) tuples, containing all possible segment/class combinations

Example

>>> segments = ["Room was clean", "Staff was unfriendly"]
>>> all_classes = ["Room", "Staff"]
>>> segclspair = get_segment_class_pairs(segments, all_classes)
>>> segclspair == [("Room was clean", "Room"), ("Room was clean", "Staff"),
...                ("Staff was unfriendly", "Room"), ("Staff was unfriendly", "Staff")]
True
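
The result is the Cartesian product of segments and classes, grouped by segment. An equivalent can be sketched with `itertools.product`; `segment_class_pairs` is a hypothetical name and this is not necessarily how AutoNLU computes it.

```python
from itertools import product


def segment_class_pairs(segments, all_classes):
    """Illustrative equivalent; not AutoNLU's actual implementation."""
    # product varies its last argument fastest, so pairs are grouped by segment
    return list(product(segments, all_classes))


pairs = segment_class_pairs(["Room was clean", "Staff was unfriendly"],
                            ["Room", "Staff"])
```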

autonlu.get_segment_class_pairs_with_labels(segments, classlabels, standard_label=None, all_classes=None, label_probabilities={})¶

Generates segment class pairs with an associated list of target labels from a list of segments and a list of classlabels

Parameters

segments (List[str]) – Pieces of text for which the segment/class pairs should be generated
classlabels (List[List[Tuple[str, str]]]) – A list of (class, label) tuples that assigns labels to classes for each segment
standard_label (Optional[str]) – If given, all classes are assumed to have this label if no specific label was given in classlabels. To work, also needs all_classes.
all_classes (Optional[List[str]]) – List of all possible classes that will be used to generate segment/class pairs for the standard label if no specific classlabel was given.
label_probabilities (Dict[str, float]) – A dictionary, mapping label names to the probability (number between 0 and 1) of that label occurring in the generated data. All labels not mentioned in label_probabilities are assumed to have a probability of 1. Can be used to subsample certain labels if they are overrepresented.

Return type

Tuple[List[Tuple[str, str]], List[int]]

Returns

A tuple (X, Y) where X is a list of segment/class pairs and Y a list of corresponding labels

Examples

Without standard_label and all_classes:

>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]])
X = [("Hello", "C1"), ("Hello", "C2"), ("World", "C2")]
Y = ["L1", "L3", "L4"]

With standard_label and all_classes:

>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]],
...     standard_label="L2",
...     all_classes=["C1", "C2", "C3"])
X = [("Hello", "C1"), ("Hello", "C2"), ("Hello", "C3"), ("World", "C1"), ("World", "C2"), ("World", "C3")]
Y = ["L1", "L3", "L2", "L2", "L4", "L2"]

With standard_label, all_classes and label_probabilities:

>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]],
...     standard_label="L2",
...     all_classes=["C1", "C2", "C3"],
...     label_probabilities={"L2": 0.2})
Possible output (Samples with a label of "L2" will occur with a probability of approx 20%):
X = [("Hello", "C1"), ("Hello", "C2"), ("World", "C2"), ("World", "C3")]
Y = ["L1", "L3", "L4", "L2"]
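
The interplay of the three optional parameters can be sketched in plain Python. This is an illustration under assumptions, not AutoNLU's implementation: `pairs_with_labels` and its `seed` parameter are hypothetical, and the sketch assumes every class mentioned in `classlabels` also appears in `all_classes`.

```python
import random


def pairs_with_labels(segments, classlabels, standard_label=None,
                      all_classes=None, label_probabilities=None, seed=None):
    """Illustrative sketch; not AutoNLU's actual implementation."""
    label_probabilities = label_probabilities or {}
    rng = random.Random(seed)  # `seed` is a hypothetical extra for repeatability
    X, Y = [], []
    for segment, pairs in zip(segments, classlabels):
        if standard_label is not None and all_classes is not None:
            given = dict(pairs)
            # Every class gets its explicit label or falls back to the standard label
            pairs = [(cls, given.get(cls, standard_label)) for cls in all_classes]
        for cls, label in pairs:
            # Keep the sample with the probability configured for its label (default 1)
            if rng.random() < label_probabilities.get(label, 1.0):
                X.append((segment, cls))
                Y.append(label)
    return X, Y


X, Y = pairs_with_labels(["Hello", "World"],
                         [[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]],
                         standard_label="L2",
                         all_classes=["C1", "C2", "C3"])
```

Passing `label_probabilities={"L2": 0.2}` would then keep each "L2" sample with a probability of 0.2, so the exact output varies from run to run unless a seed is fixed.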

autonlu.fix_seed(seed=None, cudnn_deterministic=True)¶

Fixes the seed for pytorch, numpy, python, etc.

Follows recommendations from https://pytorch.org/docs/stable/notes/randomness.html

Warning: Even all this does not ensure perfect determinism in all cases, since there is no way to make atomic operations from CUDA deterministic!

Parameters

seed (Optional[int]) – The seed that should be used for all random number generators (pytorch, numpy, python). If seed is None, a random seed will be set, used, and returned from the function. This is for example useful if you want to search for a seed for which a bug occurs.
cudnn_deterministic (bool) – If set to True, it also makes cuDNN deterministic (if it is used). Warning: Might slow down training/inference.

Return type

int

Returns

The seed which was used
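
The "draw a seed, use it, and return it" behavior for `seed=None` can be sketched for Python's own `random` module. `fix_python_seed` is a hypothetical, reduced stand-in: the real function is documented to cover pytorch and numpy as well, which this sketch leaves out so that it runs without those packages.

```python
import random


def fix_python_seed(seed=None):
    """Illustrative sketch covering only Python's `random` module."""
    if seed is None:
        # Draw a fresh seed so it can be reported and reused later
        seed = random.SystemRandom().randrange(2**32)
    random.seed(seed)
    return seed


used = fix_python_seed()          # random seed, returned for later reuse
first = [random.random() for _ in range(3)]

fix_python_seed(used)             # replay the same run
second = [random.random() for _ in range(3)]
```

Logging the returned seed makes it possible to reproduce a run in which a bug occurred, which is the use case described for `seed=None`.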

autonlu.utils.get_basemodel(model)¶

Returns the base model of a huggingface transformer model (i.e. the pytorch model without the prediction heads etc.)