Data Point Settings

Available since: v.26.1.0, Apr 1, 2024

Overview

 

Description

The settings dialog of the data points step offers configuration of the models used for training.

The data points step supports three different model types (also known as “Classifier Types”):

  • Lightweight

  • DeepExtract

  • DeepExtract v2

For each of these, a different set of parameters is available to adjust the model’s behavior. The complete overview of all settings can be found in the table below:

Name

Only Available for

Possible Values

Description

Default

Label Name As Custom Tag

 

Yes, No, Auto

Defines whether, in the Export Step, the class attributes of tagged elements carry the export key values or the label names of the data point schema.

Auto → No

Classifier Type

 

Lightweight, DeepExtract, DeepExtract v2

Specify which machine learning model type is used for predicting the data points.
Choose

Lightweight, to select a model optimized for speed, resulting in fast training and evaluation.

DeepExtract, to select a model optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions with a focus on semantic understanding.

DeepExtract v2, to select a model optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions. DeepExtract v2 supersedes DeepExtract, with the main improvements being a more powerful model architecture and support for multilingual text.

DeepExtract v2

Learning Rate

Classifier Type = Lightweight

Numeric

The learning rate of the algorithm. There is no learning rate scheduler; the same value is used in every epoch.

0.08

Sigmoid Parameter

Classifier Type = Lightweight

Numeric

Parameter for the sigmoid function.

0.5
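The table does not spell out what this parameter controls; as an illustrative assumption only, a sigmoid with a steepness parameter k maps a raw model score to a value in (0, 1):

```python
import math

def sigmoid(score: float, k: float = 0.5) -> float:
    """Sigmoid with steepness parameter k (a hypothetical reading of the
    "Sigmoid Parameter" setting): squashes a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k * score))
```

A larger k makes the transition between 0 and 1 sharper around a score of zero.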

Seed

Classifier Type = Lightweight

Numeric

A seed to shuffle the training set.

42
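Fixing the seed makes the shuffle of the training set reproducible; a minimal sketch (the function name is illustrative, not the product’s API):

```python
import random

def shuffled_training_set(samples, seed=42):
    """Return a reproducibly shuffled copy of the training samples."""
    rng = random.Random(seed)   # same seed -> same shuffle order
    out = list(samples)
    rng.shuffle(out)
    return out
```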

Unbalanced Sets

Classifier Type = Lightweight

Yes, No

Tells the model that the class frequencies in the training data are unbalanced.

Yes

Featurization Type

Classifier Type = Lightweight

Table Aware,
Basic

This describes the type of featurization, i.e. the way text is translated into a machine-readable format.

Choose

Table Aware, if it is relevant whether a word is inside or outside a table.

Basic, if it is irrelevant whether a word is inside or outside a table. The model then cannot distinguish the two cases, but it typically needs less training data to reach good accuracy levels.

Table Aware

Use Coordinates Only

Classifier Type = Lightweight

Featurization Type = Basic

Yes, No, Auto

Specify whether the model should rely on nothing but the position of a word, which makes it even faster.

Choose

Yes, to use only the position of the word

No, to use the standard set of features of a word

Auto, to use the recommended setting

No

Use Training Set Vocabulary

Classifier Type = Lightweight

Yes, No

Specify whether the model assumes a fixed vocabulary.

Choose

Yes, if only words from the training set are to be predicted. E.g., if the given label can only have the values “true” or “false” and both cases are annotated in the training set, the algorithm will use the frequencies of each word. This results in very efficient training times.

No, if the above assumption is not guaranteed.

No

Use Stop Words

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Yes, No

Specify whether to use the standard set of stop words during tokenization. That is, words like “a”, “the”, … with little semantic value are ignored during algorithm training.

Choose

Yes, if the standard set of stop words is to be used

No, if no stop words are to be used

Yes
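The effect can be sketched as a filter during tokenization (the stop-word list here is a tiny illustrative subset, not the product’s actual list):

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}  # illustrative subset only

def tokenize(text, use_stop_words=True):
    """Lowercase whitespace tokenization that optionally drops stop words."""
    tokens = text.lower().split()
    if use_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```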

Vocabulary Size

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Numeric

Specify the maximum size of the vocabulary

5000

Vocabulary Source

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Contextual, Global

Specify how the vocabulary of words is built.

Contextual creates a vocabulary for each label from the labelled words and the words within the word distance limit around them.

Global creates a vocabulary from all the words in the training documents.

Contextual

Word Distance Limit

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Vocabulary Source = Contextual

Numeric

Specify the maximum word distance from the labelled words for a word to be considered in the vocabulary

3
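A sketch of how a Contextual vocabulary could be collected under these settings (function and variable names are illustrative): every word within the distance limit of a labelled word enters that label’s vocabulary.

```python
def contextual_vocabulary(words, labelled_indices, distance_limit=3):
    """Collect all words within `distance_limit` positions of any labelled word."""
    vocab = set()
    for i in labelled_indices:
        lo = max(0, i - distance_limit)
        hi = min(len(words), i + distance_limit + 1)
        vocab.update(words[lo:hi])
    return vocab
```

With Vocabulary Source = Global, the vocabulary would instead be built from all words in the training documents.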

Number Of Iterations

Classifier Type = Lightweight

Numeric

Specify the number of iterations of the ensemble learning algorithm

100

Samples Per Leaf

Classifier Type = Lightweight

Numeric

A tree node can branch only if each resulting branch contains more than this threshold number of examples.

1

Early Stop Max Rounds

Classifier Type = Lightweight

Numeric

If the loss function does not improve for this number of iterations, stop the algorithm regardless of the configured number of iterations. If no value is given, there is no early stopping.

0
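The stopping rule can be sketched as a simplified loop over per-iteration losses (the default of 0 disables early stopping):

```python
def iterations_run(losses, early_stop_max_rounds=0):
    """Count iterations until the best loss stops improving for
    `early_stop_max_rounds` consecutive rounds (0 = never stop early)."""
    best = float("inf")
    stale_rounds = 0
    for i, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale_rounds = loss, 0
        else:
            stale_rounds += 1
        if early_stop_max_rounds and stale_rounds >= early_stop_max_rounds:
            return i
    return len(losses)
```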

Epochs

Classifier Type = DeepExtract

Numeric

Specify the number of epochs (i.e. complete iterations over the training set) used to create the model. The more epochs, the better the model is optimized to the training set, but with a risk of overfitting.

Suggested values are below 30.

3

Learning Rate

Classifier Type = DeepExtract

Numeric

Specify the learning rate of the algorithm.

 0.00006

Seed

Classifier Type = DeepExtract

Numeric

A seed to shuffle the training set.

 42

Language

Classifier Type = DeepExtract v2

Multilingual,
English

Specify the language that is expected in the documents.

Choose

Multilingual, if multiple languages are expected. Over 100 languages are supported.
English, if only English documents are expected.

Multilingual

Hyperparametrization

Classifier Type = DeepExtract v2

Low Rank Auto Adaptation,
Full Model Auto Adaptation,
Custom

Specify which set of hyperparameters should be used.

Choose

Low Rank Auto Adaptation, if you have a relatively small training set. This typically also performs well for large training sets.
Full Model Auto Adaptation never uses low rank adaptation; hence it is only effective with very large training sets (typically around 3000 pages).
Custom, if custom hyperparameters need to be specified.

Low Rank Auto Adaptation

Learning Rate

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Specify the learning rate for the first epoch.

 0.001

Learning Rate Scheduler

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Adaptive,
Constant,
Constant With Warmup,
Linear,
Cosine,
Polynomial

Specify the learning rate scheduler type. The scheduler defines how the learning rate varies from epoch to epoch.

Adaptive is determined from the other parameters; at the moment it uses Cosine for the low rank adaptation algorithm and Constant otherwise.
Constant: the learning rate is always the same.
Constant With Warmup: for the specified amount of warmup steps the learning rate is lowered, then it is restored to a constant value.
Linear: the learning rate changes linearly over training.
Cosine: the learning rate follows a cosine curve over training.
Polynomial: the polynomial learning rate scheduler from PyTorch; several of its parameters are not exposed.

Adaptive

Warmup

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Learning Rate Scheduler is one of: Constant With Warmup, Linear, Cosine, Polynomial

Numeric

For the specified amount of warmup steps, the learning rate is lowered.

0.1
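Whether the value 0.1 denotes a step count or a fraction of all steps is not stated here; assuming it is a fraction, a Constant With Warmup schedule could look like this (a sketch, not the product’s implementation):

```python
def lr_at_step(step, total_steps, base_lr, warmup=0.1):
    """Linear warmup followed by a constant learning rate.
    `warmup` is read as a fraction of `total_steps` (an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # ramp up from near zero
    return base_lr
```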

Epochs

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Specify the number of epochs (i.e. complete iterations over the training set) used to create the model. The more epochs, the better the model is optimized to the training set, but with a risk of overfitting.

Suggested values are below 30.

15

Weight Decay

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Weight decay parameter, which reduces the risk of overfitting.

0.01
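Weight decay shrinks every weight slightly toward zero at each update, which discourages large weights and thus overfitting. A decoupled (AdamW-style) sketch:

```python
def sgd_step(weights, grads, lr=0.001, weight_decay=0.01):
    """One gradient step with decoupled weight decay: each weight is
    additionally shrunk by lr * weight_decay * weight."""
    return [w - lr * g - lr * weight_decay * w
            for w, g in zip(weights, grads)]
```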

Chunk Distance Limit

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

The input is split into chunks that are recombined into pairs of chunks. This parameter controls how far apart two chunks can be in order to be recombined into a training sample.

5
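A sketch of the pairing rule: with a limit of 5, chunks i and j form a training sample only if they are at most 5 positions apart (names are illustrative):

```python
def chunk_pairs(num_chunks, distance_limit=5):
    """All chunk index pairs (i, j), i < j, with j - i <= distance_limit."""
    return [(i, j)
            for i in range(num_chunks)
            for j in range(i + 1, num_chunks)
            if j - i <= distance_limit]
```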

Gradient Accumulation Steps

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Optional[Numeric]

Accumulate gradients over n steps before performing backpropagation, i.e. to simulate a larger batch size.

N/A
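The idea can be sketched as averaging the gradients of n consecutive micro-batches into one update (a simplification of what a training loop with accumulation does):

```python
def accumulated_updates(micro_batch_grads, accumulation_steps=4):
    """Average gradients over groups of `accumulation_steps` micro-batches,
    emitting one (larger effective batch) update per group."""
    updates = []
    for start in range(0, len(micro_batch_grads), accumulation_steps):
        group = micro_batch_grads[start:start + accumulation_steps]
        updates.append(sum(group) / len(group))
    return updates
```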

Max Tokens Per Word

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Maximum number of tokens kept per word after tokenization.

30

Low Rank Adaptation

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

No,
Adaptive,
Custom

This controls the use of the low rank adaptation algorithm during model creation:

No does not use the low rank adaptation algorithm.
Adaptive adapts the rank of the low rank adaptation algorithm to the schema size and uses a scaling factor to determine alpha.
Custom lets you set some of the low rank adaptation algorithm parameters.

No

Scaling Factor

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Adaptive

Numeric

Select the scaling factor of the low rank adaptation, controlling the magnitude of the low-rank update.

1

Rank

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

The concept of "rank" in LoRA refers to the number of dimensions of the trainable parameters within the adapter. When a model is fine-tuned using LoRA, the weight update is represented not by a full-rank matrix, but by the product of two lower-rank matrices.

16

Alpha

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

It represents a scaling factor that influences how the outputs of the adapter matrices are combined with the original model.

16
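Rank and Alpha fit together as follows: the weight update is the product of two matrices of rank `rank`, scaled by `alpha / rank`. A self-contained sketch without any ML library (initializing B to zero, as in the original LoRA paper, so training starts from the unmodified base model):

```python
import random

def lora_delta(rank=16, alpha=16, d_out=4, d_in=4, seed=0):
    """LoRA-style low-rank weight update: delta = (alpha / rank) * B @ A,
    with A (rank x d_in) random and B (d_out x rank) initialized to zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero init -> initial delta is 0
    scale = alpha / rank
    return [[scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(d_out)]
```

Only A and B are trained, i.e. rank * (d_in + d_out) parameters instead of d_in * d_out for a full-rank update.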

Dropout

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

It's the probability that a trainable parameter will be artificially set to zero for a given batch of training. It’s used to help prevent overfitting the model to your data.

0.1
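Inverted dropout, the common implementation, zeroes each trainable value with probability p and rescales the survivors so the expected sum is unchanged; a sketch:

```python
import random

def apply_dropout(values, p=0.1, seed=42, training=True):
    """Zero each value with probability p (training only) and scale
    survivors by 1 / (1 - p) to keep the expectation unchanged."""
    if not training:
        return list(values)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]
```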

Seed

Classifier Type = DeepExtract v2

Numeric

A seed to shuffle the training set.

42

 

Lightweight

This model type is optimized for speed, resulting in fast training and evaluation.

For setting details, refer to the table above

image-20240717-155323.png
The setting options for the lightweight model

 

DeepExtract

DeepExtract is a deep neural network optimized for accuracy. Requiring more time for training and evaluation, this model type delivers more accurate predictions.

For setting details, refer to the table above

image-20240717-135706.png
The setting options for the Deep Extract model

 

DeepExtract v2

DeepExtract v2 is optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions. DeepExtract v2 supersedes DeepExtract, with the main improvements being a more powerful model architecture and support for multilingual text.

For setting details, refer to the table above

image-20240717-142027.png
The setting options for the Deep Extract v2 model
image-20240717-144659.png
The setting options for the Deep Extract v2 model

Prerequisites

This feature is available to users in the Administrator or Workflow Manager role.

 

For any questions, you can contact our support team.