Utils¶
- autonlu.status(key=None)¶
Return status information about the current system
- Parameters
key (Optional[str]) – A JSON web token used for authentication. If no key is given, the key is taken from the environment variable DO_PRODUCT_KEY.
- Returns
A dictionary containing information about the currently used key, reachable sites (Hugging Face, Studio, and Google as a general indicator of internet connectivity), package versions, and general system information. If no key was given or set as an environment variable, the “key” entry will be None.
Examples:
>>> import autonlu, pprint
>>> pprint.pprint(autonlu.status())
{'key': {'info': {'exp': datetime.datetime(2030, 8, 3, 10, 50, 1),
                  'functionality': ['Analysis/*', 'Training/*', 'Models/*'],
                  'iat': datetime.datetime(2020, 8, 5, 10, 50, 1),
                  'iss': 'deepopinion.ai',
                  'languages': ['*'],
                  'sub': 3},
         'message': 'OK',
         'verified': True},
 'package_versions': {'autonlu': '0.2.0',
                      'numpy': '1.18.5',
                      'pynvml': '8.0.4',
                      'requests': '2.24.0',
                      'security': '0.3.5',
                      'sklearn': '0.23.2',
                      'tensorboard': '2.3.0',
                      'torch': '1.7.0',
                      'torch_optimizer': '0.0.1a16',
                      'tqdm': '4.48.1',
                      'transformers': '3.3.1'},
 'reachable': {'api.deepopinion.ai': True,
               'google.com': True,
               'huggingface.co': True},
 'sysinfo': {'architecture': 'x86_64',
             'hostname': '127.0.0.1localhost.localdomainlocalhost',
             'ip-address': '192.168.0.124',
             'mac-address': '60:45:cb:86:22:51',
             'platform': 'Linux',
             'platform-release': '5.9.10-arch1-1',
             'platform-version': '#1 SMP PREEMPT Sun, 22 Nov 2020 14:16:59 '
                                 '+0000',
             'processor': '',
             'python_build': ('default', 'Sep 30 2020 04:00:38'),
             'python_version': '3.8.6',
             'ram': '31 GB'}}
>>> print("Key expires at:", autonlu.status()["key"]["info"]["exp"])
Key expires at: 2030-08-03 10:50:01
- autonlu.split_dataset(*sets, split_at, seed=42)¶
Takes lists of elements of a dataset and splits off a random selection. E.g. usable for train/validation splits
- Parameters
sets – An arbitrary number of lists of the same length that should be split together. Elements that have the same position before splitting will be in the same split and will also have the same position afterwards
split_at (float) – Value between 0 and 1. The fraction of the data to be used for the validation/test set
seed (int) – Seed to use for the random part of the dataset splitting. If the same seed is provided, you are guaranteed to get the same split of the dataset every time. A fixed seed is used by default to prevent errors (a fixed seed is usually what you want). If you would rather not use a fixed seed, set seed to None.
- Return type
Tuple
- Returns
A tuple containing each of the splits of all lists that were given to the function
Examples
>>> X = ["The room was nice", "We did not enjoy our stay", ...]
>>> Y = ["POS", "NEG", ...]
>>> trainX, trainY, testX, testY = split_dataset(X, Y, split_at=0.2)
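The behaviour described for split_dataset can be approximated in plain Python. The following is only a sketch under assumptions, not the autonlu implementation: the name sketch_split_dataset, the rounding rule for the split size, and the use of random.Random are all hypothetical. It illustrates how several lists can be shuffled with one shared permutation so that aligned elements stay aligned, and how the result is ordered (training parts first, then validation/test parts, as in the example above).

```python
import random
from typing import List, Optional, Sequence, Tuple


def sketch_split_dataset(*sets: Sequence, split_at: float,
                         seed: Optional[int] = 42) -> Tuple[List, ...]:
    """Hypothetical re-implementation for illustration; NOT the autonlu source."""
    if not sets:
        raise ValueError("at least one list is required")
    n = len(sets[0])
    if any(len(s) != n for s in sets):
        raise ValueError("all lists must have the same length")
    if not 0.0 <= split_at <= 1.0:
        raise ValueError("split_at must be between 0 and 1")

    # One shared permutation keeps elements of all lists aligned.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed=None gives a different split each run

    n_test = int(round(n * split_at))  # rounding rule is an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train_parts = [[s[i] for i in train_idx] for s in sets]
    test_parts = [[s[i] for i in test_idx] for s in sets]
    # Order mirrors the documented example: trainX, trainY, testX, testY
    return tuple(train_parts + test_parts)


X = [f"text {i}" for i in range(10)]
Y = [f"label {i}" for i in range(10)]
trainX, trainY, testX, testY = sketch_split_dataset(X, Y, split_at=0.2)
```

With the default seed, repeated calls on the same input return the same split, which is the property the seed parameter is documented to guarantee.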
- autonlu.check_model.is_valid(modelname)¶
Returns True if the specified model is a valid model that can be loaded by AutoNLU
- Parameters
modelname (str) – Model to be checked. Can be anything that can also be used by SimpleModel to load a model.
- Return type
bool
- Returns
True if the specified model is valid, False otherwise
Examples:
>>> from autonlu.check_model import is_valid
>>> if is_valid("/path/to/uploaded/model"):
...     pass  # Add model to database
- autonlu.check_model.modeltype(modelname)¶
Returns the model type of a specified model
- Parameters
modelname (str) – Model to be checked. Can be anything that can also be used by SimpleModel to load a model.
- Return type
str
- Returns
base, class, label, classlabel, or invalid
- autonlu.get_best_device()¶
Returns the best available device to use for the classifier.
Can be overridden by the environment variable DO_DEVICE
- Return type
str
- Returns
'cuda' if CUDA is available and 'cpu' otherwise. If the environment variable DO_DEVICE is set, its content will be returned instead.
Example
>>> device = get_best_device()
>>> if device == "cuda":
...     print("CUDA is available")
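The selection rule described above (environment override first, then CUDA availability) can be sketched as follows. This is an illustrative approximation, not the autonlu source: the function name sketch_best_device and its cuda_available parameter are hypothetical, and using torch.cuda.is_available() for the CUDA check is an assumption.

```python
import os
from typing import Optional


def sketch_best_device(cuda_available: Optional[bool] = None) -> str:
    """Hypothetical illustration of the documented rule; NOT the autonlu source."""
    # The environment variable wins whenever it is set.
    override = os.environ.get("DO_DEVICE")
    if override is not None:
        return override

    # Assumption: CUDA availability is detected via PyTorch when it is installed.
    if cuda_available is None:
        try:
            import torch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    return "cuda" if cuda_available else "cpu"


os.environ["DO_DEVICE"] = "cpu"
forced = sketch_best_device(cuda_available=True)   # override takes precedence
del os.environ["DO_DEVICE"]
detected = sketch_best_device(cuda_available=True)
```

Setting DO_DEVICE is a convenient way to force CPU execution on a machine that has a GPU, for example when debugging.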
- autonlu.get_config(config_path)¶
Loads config.json given a filepath (file or directory)
- Parameters
config_path (str) – Path to the config.json or a directory containing config.json
- Return type
Dict
- Returns
Dictionary with the content of config.json
- Raises
ValueError – In case the given filename does not exist or the given path does not contain a file named config.json
Example
>>> config = get_config("/path/to/model/config.json")
>>> config = get_config("/path/to/model")
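The file-or-directory handling described for get_config could look roughly like the sketch below. It is an assumption-laden illustration, not the library code: sketch_get_config is a hypothetical name, and the error messages and the example config content are made up.

```python
import json
import os
import tempfile
from typing import Any, Dict


def sketch_get_config(config_path: str) -> Dict[str, Any]:
    """Hypothetical stand-in for get_config; NOT the autonlu source."""
    # A directory is resolved to the config.json inside it.
    if os.path.isdir(config_path):
        config_path = os.path.join(config_path, "config.json")
    if not os.path.isfile(config_path):
        raise ValueError(f"No config file found at {config_path}")
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Usage with a temporary directory standing in for a model directory
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"model_type": "bert"}, f)  # made-up config content
    from_dir = sketch_get_config(model_dir)
    from_file = sketch_get_config(os.path.join(model_dir, "config.json"))
```

Both call styles return the same dictionary, which matches the two forms shown in the example above.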
- autonlu.get_model_dir(model_name, organization=None, key=None, baseurl=None)¶
Returns the folder for a given model name or returns the name itself if the model seems to be a huggingface model
This ensures that SimpleModel, Model, and DocumentModel can simply use the returned value to load a model from different sources. In practice, this method does not have to be called separately since SimpleModel, Model, and DocumentModel use it internally to resolve model names.
- Parameters
model_name (str) – Name of the model to be loaded. Can be:
A direct path to a model
The name of a model in Studio
The name of a model in the Huggingface model repo
organization (Optional[str]) – Name of organization (if any) to use
key (Optional[str]) – Token to authenticate a user. If not given, a token will be looked up in the environment variable DO_PRODUCT_KEY.
baseurl (Optional[str]) – Base URL of the Studio instance used for this call. If None, the environment variable DO_BASEURL will be used. If DO_BASEURL is not defined, the standard base URL will be used.
- Return type
str
- Returns
If model_name is a directory, the directory itself.
If model_name is the name of a model in Studio, the model is downloaded if necessary and the directory is returned. If the model was loaded previously, the path to the cached model will be returned.
If model_name is the name of a model in the Huggingface model repo, the name itself is returned.
- Raises
ValueError – If the model is found in the DO download API but could not be downloaded
Examples
Load model from directory:
>>> modeldir = get_model_dir("path/to/my/private/model")
>>> model = SimpleModel(modeldir)
Load model from Studio:
Assumes the environment variable DO_PRODUCT_KEY is correctly set
>>> modeldir = get_model_dir("DeepOpinion/hotels_absa_en")
>>> model = SimpleModel(modeldir)
Load model from Huggingface model repo:
>>> modeldir = get_model_dir("bert-base-uncased")
>>> model = SimpleModel(modeldir)
- autonlu.get_classes(sourcedir)¶
Returns a list of all classes a model was trained on
- Parameters
sourcedir (str) – Path to a model containing a meta.json, a config.json, a classes.json, or an aspects.json file. If multiple files are available, they will be used in the given order.
- Return type
List[str]
- Returns
List of all classes, sorted alphabetically. If the model does not use classes (e.g. for a label task), None will be returned
- Raises
ValueError – If no classes could be found in any of the mentioned files
Example
>>> modeldir = get_model_dir("en-base", key=TOKEN)
>>> classes = get_classes(modeldir)
- autonlu.get_all_labels(sourcedir)¶
Returns a list of all labels a model was trained on
- Parameters
sourcedir (str) – Path to a model containing either a meta.json or a config.json. If both are available, meta.json will be preferred.
- Return type
List[str]
- Returns
List of labels. The order is important, since the model itself is internally trained on label numbers and those are the index of the label in the list. Returns None if no labels were found.
- Raises
KeyError – If the labels given in config.json do not have contiguous indices.
Example
Assumes the environment variable DO_PRODUCT_KEY is correctly set
>>> modeldir = get_model_dir("en-base")
>>> all_labels = get_all_labels(modeldir)
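Because the position of a label in the returned list is the number the model works with internally, the list can be used to convert between label names and indices. The snippet below is a small sketch of that idea. The label list is a made-up example value, not the output of a real model, and the helper name decode is hypothetical.

```python
from typing import Dict, List

# Made-up example of what get_all_labels might return for a sentiment model
all_labels: List[str] = ["NONE", "NEG", "NEU", "POS"]

# Label name -> index used internally by the model
label_to_index: Dict[str, int] = {label: i for i, label in enumerate(all_labels)}


def decode(indices: List[int]) -> List[str]:
    """Map model output indices back to label names."""
    return [all_labels[i] for i in indices]


encoded = [label_to_index[name] for name in ["POS", "NONE", "NEG"]]
decoded = decode(encoded)
```

Reordering the list would silently change the meaning of every index, which is why the order must be preserved.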
- autonlu.get_standard_label(sourcedir)¶
Returns the standard label a model was trained on
It will be assumed that all classes without a specific label implicitly have the standard label (if the standard label is not None).
A standard label will generally be used in cases where one label occurs at a much higher frequency than others. E.g. when solving aspect based sentiment analysis, the labels will generally be ["NONE", "NEG", "NEU", "POS"], but since most aspects will not occur in most sentences, the NONE label will be predominant (often more than 90%) and it makes sense to make NONE the standard label. I.e. if a class/aspect is not mentioned during training or after prediction in the annotated document, it is assumed to have the label NONE (i.e. it did not occur).
The standard label is actually not trained into the model and can be switched after training without any ill effects, but generally it will be fixed for a use case and is therefore associated with the model.
- Parameters
sourcedir (str) – Path to a model containing a meta.json
- Return type
Optional[str]
- Returns
The name of the standard label (which has to occur in the list of all labels)
Example
Assumes the environment variable DO_PRODUCT_KEY is correctly set
>>> modeldir = get_model_dir("en-hotels-absa")
>>> standard_label = get_standard_label(modeldir)
- autonlu.get_segment_class_pairs(segments, all_classes)¶
Takes segments and all possible classes and returns a list of tuples containing all segment/class combinations
- Parameters
segments (List[str]) – List of all segments
all_classes (List[str]) – List of all possible classes
- Return type
List[Tuple[str, str]]
- Returns
List of (segment, class) tuples, containing all possible segment/class combinations
Example
>>> segments = ["Room was clean", "Staff was unfriendly"]
>>> all_classes = ["Room", "Staff"]
>>> segclspair = get_segment_class_pairs(segments, all_classes)
segclspair == [("Room was clean", "Room"), ("Room was clean", "Staff"),
               ("Staff was unfriendly", "Room"), ("Staff was unfriendly", "Staff")]
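The result shown above is the Cartesian product of segments and classes, with segments varying slowest. A minimal equivalent in plain Python, given only as a sketch (sketch_segment_class_pairs is a hypothetical name, and the real function may be implemented differently), is:

```python
from itertools import product
from typing import List, Tuple


def sketch_segment_class_pairs(segments: List[str],
                               all_classes: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical equivalent for illustration; NOT the autonlu source."""
    # product() iterates the last argument fastest, so all classes are
    # emitted for the first segment before moving on to the next segment.
    return list(product(segments, all_classes))


pairs = sketch_segment_class_pairs(
    ["Room was clean", "Staff was unfriendly"], ["Room", "Staff"]
)
```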
- autonlu.get_segment_class_pairs_with_labels(segments, classlabels, standard_label=None, all_classes=None, label_probabilities={})¶
Generates segment class pairs with an associated list of target labels from a list of segments and a list of classlabels
- Parameters
segments (List[str]) – Pieces of text for which the segment/class pairs should be generated
classlabels (List[List[Tuple[str, str]]]) – A list of (class, label) tuples that assigns labels to classes for each segment
standard_label (Optional[str]) – If given, all classes are assumed to have this label if no specific label was given in classlabels. To work, also needs all_classes.
all_classes (Optional[List[str]]) – List of all possible classes that will be used to generate segment/class pairs for the standard label if no specific classlabel was given.
label_probabilities (Dict[str, float]) – A dictionary mapping label names to the probability (number between 0 and 1) of that label occurring in the generated data. All labels not mentioned in label_probabilities are assumed to have a probability of 1. Can be used to subsample certain labels if they are overrepresented.
- Return type
Tuple[List[Tuple[str, str]], List[int]]
- Returns
A tuple (X, Y) where X is a list of segment/class pairs and Y a list of corresponding labels
Examples
Without standard_label and all_classes:
>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]])
X = [("Hello", "C1"), ("Hello", "C2"), ("World", "C2")]
Y = ["L1", "L3", "L4"]
With standard_label and all_classes:
>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]],
...     standard_label="L2",
...     all_classes=["C1", "C2", "C3"])
X = [("Hello", "C1"), ("Hello", "C2"), ("Hello", "C3"), ("World", "C1"), ("World", "C2"), ("World", "C3")]
Y = ["L1", "L3", "L2", "L2", "L4", "L2"]
With standard_label, all_classes and label_probabilities:
>>> segments = ["Hello", "World"]
>>> get_segment_class_pairs_with_labels(segments,
...     classlabels=[[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]],
...     standard_label="L2",
...     all_classes=["C1", "C2", "C3"],
...     label_probabilities={"L2": 0.2})
Possible output (Samples with a label of "L2" will occur with a probability of approx 20%):
X = [("Hello", "C1"), ("Hello", "C2"), ("World", "C2"), ("World", "C3")]
Y = ["L1", "L3", "L4", "L2"]
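The three behaviours above (explicit labels only, filling in the standard label, and subsampling by label) can be reproduced with the sketch below. It is an illustration under assumptions, not the autonlu implementation: the name sketch_pairs_with_labels and the rng parameter are hypothetical, labels are returned as strings as in the examples rather than as the integers suggested by the return type, and classes in classlabels that are missing from all_classes are simply ignored here.

```python
import random
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def sketch_pairs_with_labels(
    segments: List[str],
    classlabels: List[List[Pair]],
    standard_label: Optional[str] = None,
    all_classes: Optional[List[str]] = None,
    label_probabilities: Optional[Dict[str, float]] = None,
    rng: Optional[random.Random] = None,  # hypothetical: makes sampling testable
) -> Tuple[List[Pair], List[str]]:
    """Hypothetical illustration; NOT the autonlu source."""
    probs = label_probabilities or {}
    if rng is None:
        rng = random.Random()
    X: List[Pair] = []
    Y: List[str] = []
    for segment, seg_labels in zip(segments, classlabels):
        if standard_label is not None and all_classes is not None:
            # Every class gets a pair; unlisted classes fall back to the standard label.
            explicit = dict(seg_labels)
            candidates = [(c, explicit.get(c, standard_label)) for c in all_classes]
        else:
            candidates = list(seg_labels)
        for cls, label in candidates:
            # Keep the sample with the label's probability (default 1 keeps everything).
            if rng.random() < probs.get(label, 1.0):
                X.append((segment, cls))
                Y.append(label)
    return X, Y


labels = [[("C1", "L1"), ("C2", "L3")], [("C2", "L4")]]
X1, Y1 = sketch_pairs_with_labels(["Hello", "World"], labels)
X2, Y2 = sketch_pairs_with_labels(["Hello", "World"], labels,
                                  standard_label="L2",
                                  all_classes=["C1", "C2", "C3"])
```

Passing label_probabilities={"L2": 0.0} to this sketch drops every standard-label sample, while a value such as 0.2 keeps each one with roughly that probability.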
- autonlu.fix_seed(seed=None, cudnn_deterministic=True)¶
Fixes the seed for pytorch, numpy, python, etc.
Follows recommendations from https://pytorch.org/docs/stable/notes/randomness.html
Warning: Even all this does not ensure perfect determinism in all cases, since there is no way to make atomic operations from CUDA deterministic!
- Parameters
seed (Optional[int]) – The seed that should be used for all random number generators (pytorch, numpy, python). If seed is None, a random seed will be set, used, and returned from the function. This is useful, for example, if you want to search for a seed for which a bug occurs.
cudnn_deterministic (bool) – If set to True, it also makes cuDNN deterministic (if it is used). Warning: Might slow down training/inference.
- Return type
int
- Returns
The seed which was used
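A seed-fixing helper along the lines described above might look like the following sketch. It is not the autonlu implementation: sketch_fix_seed is a hypothetical name, the way the random seed is drawn is an assumption, and NumPy and PyTorch are seeded only if they happen to be installed. The cuDNN flags follow the PyTorch reproducibility notes linked above.

```python
import random
from typing import Optional


def sketch_fix_seed(seed: Optional[int] = None,
                    cudnn_deterministic: bool = True) -> int:
    """Hypothetical illustration; NOT the autonlu source."""
    if seed is None:
        # Assumption: draw a fresh 32-bit seed so it is also valid for NumPy.
        seed = random.SystemRandom().randrange(2**32)

    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        if cudnn_deterministic:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    return seed


used_seed = sketch_fix_seed()        # random seed, returned for later reuse
first = random.random()
sketch_fix_seed(used_seed)           # re-seeding reproduces the same sequence
second = random.random()
```

Logging the returned seed makes it possible to replay a run in which a bug occurred.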
- autonlu.utils.get_basemodel(model)¶
Returns the base model of a huggingface transformer model (i.e. the pytorch model without the prediction heads etc.).