Data Point Settings
Available since: v26.1.0 (Apr 1, 2024) | Status: Active
Table of Contents
- 1 Overview
- 1.1 Description
- 1.1.1 Lightweight
- 1.1.2 Deep Extract
- 1.1.3 Deep Extract v2
- 1.2 Prerequisites
Overview
Description
The settings dialog of the data points step offers configuration options for the models used for training.
The data points step supports three different model types (also known as "Classifier Types"):
- Lightweight
- Deep Extract
- Deep Extract v2
For each of these, there is a different set of parameters to adjust its behavior. The complete overview of all settings can be found in the table below:
Name | Only Available for | Possible Values | Description | Default |
---|---|---|---|---|
Label Name As Custom Tag | – | Auto, Yes, No | Defines whether the class attributes of tagged elements in the Export Step carry the export key values or the label names of the data point schema. | Auto → No |
Classifier Type | – | Lightweight, Deep Extract, Deep Extract v2 | Specifies which machine learning model type is used for predicting the data points. | Deep Extract v2 |
Learning Rate | Classifier Type = Lightweight | | The learning rate of the algorithm. There is no learning rate scheduler; the same value is used in every epoch. | 0.08 |
Sigmoid Parameter | Classifier Type = Lightweight | | Parameter for the sigmoid function. | 0.5 |
Seed | Classifier Type = Lightweight | | A seed to shuffle the training set. | 42 |
Unbalanced Sets | Classifier Type = Lightweight | Yes, No | Tells the model that the classes in the data are unbalanced in frequency. | Yes |
Featurization Type | Classifier Type = Lightweight | Table Aware, Basic | The type of featurization, i.e. the way text is translated into a machine-understandable format. Choose Table Aware if it is relevant whether a word is inside or outside a table. Choose Basic if that distinction is irrelevant; the model then cannot distinguish the two cases, but typically needs less training data to reach good accuracy. | Table Aware |
Use Coordinates Only | Classifier Type = Lightweight, Featurization Type = Basic | Yes, No, Auto | Specifies whether the model should depend on nothing but the position of a word, making it even faster. Choose Yes to use only the position of the word, No to use the standard feature set of a word, or Auto to use the recommended setting. | No |
Use Training Set Vocabulary | Classifier Type = Lightweight | Yes, No | Specifies whether the model assumes a fixed vocabulary. Choose Yes if only words that occur in the training set are to be predicted, e.g. if a label can only take the values "true" and "false" and both cases are annotated in the training set; the algorithm then uses the frequency of each word, which results in very efficient training times. Choose No if this assumption is not guaranteed. | No |
Use Stop Words | Classifier Type = Lightweight, Use Training Set Vocabulary = Yes | Yes, No | Specifies whether the standard set of stop words is used for tokenization, i.e. whether words with little semantic value such as "a", "the", … are ignored during training. Choose Yes to use the standard set of stop words, or No to use no stop words. | Yes |
Vocabulary Size | Classifier Type = Lightweight, Use Training Set Vocabulary = Yes | | The maximum size of the vocabulary. | 5000 |
Vocabulary Source | Classifier Type = Lightweight, Use Training Set Vocabulary = Yes | Contextual, Global | Specifies how the vocabulary is built. Contextual creates a vocabulary per label from the labelled words and the words within the word distance limit around them; Global creates a vocabulary from all words in the training documents (see the sketch in the Lightweight section below). | Contextual |
Word Distance Limit | Classifier Type = Lightweight, Use Training Set Vocabulary = Yes, Vocabulary Source = Contextual | | The maximum word distance from a labelled word for a word to be included in the vocabulary. | 3 |
Number Of Iterations | Classifier Type = Lightweight | | The number of iterations of the ensemble learning algorithm. | 100 |
Samples Per Leaf | Classifier Type = Lightweight | | The tree can only branch if the resulting branch contains more than this threshold number of examples. | 1 |
Early Stop Max Rounds | Classifier Type = Lightweight | | If the loss function does not improve for this number of iterations, the algorithm stops regardless of the configured number of iterations. If no value is given, there is no early stopping. | 0 |
Epochs | Classifier Type = Deep Extract | | The number of epochs (i.e. complete iterations over the training set) used to create the model. More epochs fit the model better to the training set, but with a risk of overfitting. Suggested values are below 30. | 3 |
Learning Rate | Classifier Type = Deep Extract | | The learning rate of the algorithm. | 0.00006 |
Seed | Classifier Type = Deep Extract | | A seed to shuffle the training set. | 42 |
Language | Classifier Type = Deep Extract v2 | Multilingual, … | The language expected in the documents. Choose Multilingual if multiple languages are expected; over 100 languages are supported. | Multilingual |
Hyperparametrization | Classifier Type = Deep Extract v2 | Low Rank Auto Adaptation, … | Specifies which set of hyper-parameters is used. Choose Low Rank Auto Adaptation if you have a relatively small training set; it typically also performs well for large training sets. | Low Rank Auto Adaptation |
Learning Rate | Classifier Type = Deep Extract v2, Hyperparametrization = … | | The learning rate for the first epoch. | 0.001 |
Learning Rate Scheduler | Classifier Type = Deep Extract v2, Hyperparametrization = … | Adaptive, Cosine, Constant | The learning rate scheduler type, which defines how the learning rate varies from epoch to epoch. Adaptive derives the scheduler from the other parameters; at the moment it uses cosine for the low rank adaptation algorithm and constant for the rest. | Adaptive |
Warmup | Classifier Type = Deep Extract v2, Hyperparametrization = …, Learning Rate Scheduler is either of (…) | | For the specified amount of steps, the learning rate is lowered (see the scheduler sketch following this table). | 0.1 |
Epochs | Classifier Type = Deep Extract v2, Hyperparametrization = … | | The number of epochs (i.e. complete iterations over the training set) used to create the model. More epochs fit the model better to the training set, but with a risk of overfitting. Suggested values are below 30. | 15 |
Weight Decay | Classifier Type = Deep Extract v2, Hyperparametrization = … | | The weight decay parameter, which reduces the risk of overfitting. | 0.01 |
Chunk Distance Limit | Classifier Type = Deep Extract v2, Hyperparametrization = … | | The input is split into chunks that are recombined into pairs of chunks. This parameter controls how far apart two chunks may be in order to be recombined into a training sample. | 5 |
Gradient Accumulation Steps | Classifier Type = Deep Extract v2, Hyperparametrization = … | | Accumulates gradients over n steps before performing backpropagation, i.e. simulates a larger batch size (see the sketch following this table). | N/A |
Max Tokens Per Word | Classifier Type = Deep Extract v2, Hyperparametrization = … | | The maximum number of tokens kept per word after tokenization. | 30 |
Low Rank Adaptation | Classifier Type = Deep Extract v2, Hyperparametrization = … | Yes, No | Controls whether the low rank adaptation (LoRA) algorithm is used for model creation; No does not use the low rank adaptation algorithm. | No |
Scaling Factor | Classifier Type = Deep Extract v2, Hyperparametrization = …, Low Rank Adaptation = Yes | | The scaling factor of the low rank adaptation, controlling the magnitude of the low-rank update. | 1 |
Rank | Classifier Type = Deep Extract v2, Hyperparametrization = …, Low Rank Adaptation = Yes | | The number of dimensions of the trainable parameters within the adapter. When a model is fine-tuned using LoRA, the weight update is represented not by a full-rank matrix, but by the product of two lower-rank matrices (see the sketch in the Deep Extract v2 section below). | 16 |
Alpha | Classifier Type = Deep Extract v2, Hyperparametrization = …, Low Rank Adaptation = Yes | | A scaling factor that influences how the outputs of the adapter matrices are combined with the original model. | 16 |
Dropout | Classifier Type = Deep Extract v2, Hyperparametrization = …, Low Rank Adaptation = Yes | | The probability that a trainable parameter is artificially set to zero for a given training batch; used to help prevent overfitting the model to your data. | 0.1 |
Seed | Classifier Type = Deep Extract v2 | | A seed to shuffle the training set. | 42 |
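To illustrate how Warmup and the Learning Rate Scheduler interact, here is a minimal sketch in Python. It assumes that Warmup is interpreted as the fraction of total training steps and that the cosine scheduler decays the rate towards zero; the function name and these interpretations are illustrative, not the product's documented implementation:

```python
import math

def learning_rate_at(step: int, total_steps: int, base_lr: float = 0.001,
                     warmup: float = 0.1) -> float:
    """Cosine schedule with linear warmup (illustrative).

    For the first `warmup` fraction of all steps the learning rate is
    lowered, ramping up linearly to `base_lr`; afterwards it decays
    towards zero along a cosine curve.
    """
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# With the table defaults (Learning Rate = 0.001, Warmup = 0.1, Epochs = 15):
steps_per_epoch = 200                      # hypothetical training set size
total = 15 * steps_per_epoch
print(learning_rate_at(0, total), learning_rate_at(total // 2, total))
```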
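The Gradient Accumulation Steps setting can be pictured as follows; a minimal sketch assuming a generic training loop, where `compute_grad` and `apply_update` are hypothetical placeholders for the framework's loss gradient and optimizer step:

```python
def train_with_accumulation(batches, compute_grad, apply_update, accum_steps=4):
    """Accumulate gradients over `accum_steps` mini-batches before a single
    optimizer update, simulating a batch size `accum_steps` times larger."""
    accumulated = None
    for i, batch in enumerate(batches, start=1):
        grads = compute_grad(batch)                 # gradients of the loss on this mini-batch
        grads = [g / accum_steps for g in grads]    # scale so the sum averages over the larger batch
        accumulated = (grads if accumulated is None
                       else [a + g for a, g in zip(accumulated, grads)])
        if i % accum_steps == 0:                    # after every accum_steps mini-batches...
            apply_update(accumulated)               # ...apply the accumulated gradient once
            accumulated = None
```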
Lightweight
This model type is optimized for speed, resulting in fast training and evaluation.
For setting details, refer to the table above.
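The vocabulary settings of this model type (Use Training Set Vocabulary, Vocabulary Size, Vocabulary Source, Word Distance Limit, Use Stop Words) can be illustrated with a short sketch. It is a simplified reading of the descriptions in the table, not the actual implementation; tokenization details are omitted:

```python
from collections import Counter

def build_vocabulary(doc_words, labelled_positions=(), source="Contextual",
                     word_distance_limit=3, vocabulary_size=5000,
                     stop_words=frozenset({"a", "the"})):
    """Illustrative version of the two Vocabulary Source strategies.

    doc_words:          the words of a training document, in reading order
    labelled_positions: indices of the words annotated for one label
    """
    if source == "Global":
        candidates = list(doc_words)       # all words in the training document
    else:                                  # Contextual: labelled words plus neighbours
        keep = set()
        for pos in labelled_positions:
            keep.update(range(max(0, pos - word_distance_limit),
                              min(len(doc_words), pos + word_distance_limit + 1)))
        candidates = [doc_words[i] for i in sorted(keep)]
    counts = Counter(w for w in candidates if w not in stop_words)
    # keep the most frequent words, capped at the configured vocabulary size
    return [w for w, _ in counts.most_common(vocabulary_size)]

# The word at index 4 ("due") is labelled; with distance limit 1, indices 3-5 survive.
words = "the invoice total amount due 2024-04-01".split()
print(build_vocabulary(words, labelled_positions=[4], word_distance_limit=1))
# -> ['amount', 'due', '2024-04-01']
```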
Deep Extract
Deep Extract is a deep neural network optimized for accuracy. It requires more time for training and evaluation, but delivers more accurate predictions.
For setting details, refer to the table above.
Deep Extract v2
Deep Extract v2 is optimized for prediction quality. It requires more time for training and evaluation, but delivers more accurate predictions. Deep Extract v2 supersedes Deep Extract; the main improvements are a more powerful model architecture and support for multilingual text.
For setting details, refer to the table above.
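The LoRA-related settings in the table (Rank, Alpha, Scaling Factor, Dropout) combine into a single low-rank weight update. Below is a minimal NumPy sketch of the standard LoRA formulation, y = W·x + s·(α/r)·B·A·x, using the defaults from the table; the product's internal implementation is not documented here, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

d_out, d_in = 64, 64
rank, alpha, scaling_factor, dropout_p = 16, 16, 1.0, 0.1   # table defaults

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight matrix
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable low-rank factor A
B = np.zeros((d_out, rank))                # trainable low-rank factor B (starts at zero)

def lora_forward(x, training=True):
    """y = W x + scaling_factor * (alpha / rank) * B (A x); only A and B are trained."""
    h = x
    if training and dropout_p > 0:
        mask = rng.random(x.shape) >= dropout_p        # Dropout on the adapter input
        h = np.where(mask, x, 0.0) / (1.0 - dropout_p)
    return W @ x + scaling_factor * (alpha / rank) * (B @ (A @ h))

y = lora_forward(rng.normal(size=(d_in,)))
print(y.shape)   # (64,) -- same output shape as the original layer
```

Because B starts at zero, the adapter initially contributes nothing, and training only gradually moves the model away from its pretrained weights; Rank bounds the capacity of the update, while Alpha and Scaling Factor control its magnitude.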
Prerequisites
This feature is available to users in the following roles:
- Administrators
- Workflow Managers