Data Point Settings

Available since: v.26.1.0, Apr 1, 2024

Overview

 

Description

The settings dialog of the data points step offers configuration of the models used for training.

The data points step supports three different model types (also known as “Classifier Types”):

  • Lightweight

  • DeepExtract

  • DeepExtract v2

For each of these, a different set of parameters is available to adjust the model’s behavior. The complete overview of all settings can be found in the table below:

Name

Only Available for

Possible Values

Description

Default

Label Name As Custom Tag

 

Yes, No, Auto

Defines whether, in the Export Step, the class attributes of tagged elements carry the export key values or the label names of the data point schema.

Auto → No

Classifier Type

 

Lightweight, DeepExtract, DeepExtract v2

Specify which machine learning model type is used for predicting the data points.
Choose

Lightweight, to select a model optimized for speed, resulting in fast training and evaluation.

DeepExtract, to select a model optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions with a focus on semantic understanding.

DeepExtract v2, to select a model optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions. DeepExtract v2 supersedes DeepExtract, with the main improvements being a more powerful model architecture and support for multilingual text.

DeepExtract v2

Learning Rate

Classifier Type = Lightweight

Numeric

The learning rate of the algorithm. There is no learning rate scheduler; the same value is used in every epoch.

0.08

Sigmoid Parameter

Classifier Type = Lightweight

Numeric

Parameter for the sigmoid function.

0.5
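The table does not spell out what this parameter controls; as an illustrative assumption only, a sigmoid with a steepness parameter k maps a raw model score to a value in (0, 1):

```python
import math

def sigmoid(score: float, k: float = 0.5) -> float:
    """Sigmoid with steepness parameter k (a hypothetical reading of the
    "Sigmoid Parameter" setting): squashes a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k * score))
```

A larger k makes the transition between 0 and 1 sharper around a score of zero.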

Seed

Classifier Type = Lightweight

Numeric

A seed to shuffle the training set.

42
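Fixing the seed makes the shuffle of the training set reproducible; a minimal sketch (the function name is illustrative, not the product’s API):

```python
import random

def shuffled_training_set(samples, seed=42):
    """Return a reproducibly shuffled copy of the training samples."""
    rng = random.Random(seed)   # same seed -> same shuffle order
    out = list(samples)
    rng.shuffle(out)
    return out
```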

Unbalanced Sets

Classifier Type = Lightweight

Yes, No

Tells the model that the class frequencies in the training data are unbalanced.

Yes

Featurization Type

Classifier Type = Lightweight

Table Aware,
Basic

This describes the type of featurization, i.e. the way text is translated into a machine-readable format.

Choose

Table Aware, if it is relevant whether a word is inside or outside a table.

Basic, if it is irrelevant whether a word is inside or outside a table. The model then cannot distinguish the two cases, but it typically needs less training data to reach good accuracy levels.

Table Aware

Use Coordinates Only

Classifier Type = Lightweight

Featurization Type = Basic

Yes, No, Auto

Specify whether the model should rely on nothing but the position of a word, which makes it even faster.

Choose

Yes, to use only the position of the word

No, to use the standard set of features of a word

Auto, to use the recommended setting

No

Use Training Set Vocabulary

Classifier Type = Lightweight

Yes, No

Specify whether the model assumes a fixed vocabulary.

Choose

Yes, if only words from the training set are to be predicted. E.g., if the given label can only have the values “true” or “false” and both cases are annotated in the training set, the algorithm will use the frequencies of each word. This results in very efficient training times.

No, if the above assumption is not guaranteed.

No

Use Stop Words

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Yes, No

Specify whether to use the standard set of stop words during tokenization. That is, words like “a”, “the”, … with little semantic value are ignored during algorithm training.

Choose

Yes, if the standard set of stop words is to be used

No, if no stop words are to be used

Yes
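The effect can be sketched as a filter during tokenization (the stop-word list here is a tiny illustrative subset, not the product’s actual list):

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}  # illustrative subset only

def tokenize(text, use_stop_words=True):
    """Lowercase whitespace tokenization that optionally drops stop words."""
    tokens = text.lower().split()
    if use_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```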

Vocabulary Size

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Numeric

Specify the maximum size of the vocabulary

5000

Vocabulary Source

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Contextual, Global

Specify how the vocabulary of words is built.

Contextual creates a vocabulary for each label from the labelled words and the words within the word distance limit around them.

Global creates a vocabulary from all the words in the training documents.

Contextual

Word Distance Limit

Classifier Type = Lightweight

Use Training Set Vocabulary = Yes

Vocabulary Source = Contextual

Numeric

Specify the maximum word distance from the labelled words for a word to be considered in the vocabulary

3
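A sketch of how a Contextual vocabulary could be collected under these settings (function and variable names are illustrative): every word within the distance limit of a labelled word enters that label’s vocabulary.

```python
def contextual_vocabulary(words, labelled_indices, distance_limit=3):
    """Collect all words within `distance_limit` positions of any labelled word."""
    vocab = set()
    for i in labelled_indices:
        lo = max(0, i - distance_limit)
        hi = min(len(words), i + distance_limit + 1)
        vocab.update(words[lo:hi])
    return vocab
```

With Vocabulary Source = Global, the vocabulary would instead be built from all words in the training documents.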

Number Of Iterations

Classifier Type = Lightweight

Numeric

Specify the number of iterations of the ensemble learning algorithm

100

Samples Per Leaf

Classifier Type = Lightweight

Numeric

A tree node can branch only if each resulting branch contains more than this threshold number of examples.

1

Early Stop Max Rounds

Classifier Type = Lightweight

Numeric

If the loss function does not improve for this number of iterations, stop the algorithm regardless of the configured number of iterations. If no value is given, there is no early stopping.

0
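The stopping rule can be sketched as a simplified loop over per-iteration losses (the default of 0 disables early stopping):

```python
def iterations_run(losses, early_stop_max_rounds=0):
    """Count iterations until the best loss stops improving for
    `early_stop_max_rounds` consecutive rounds (0 = never stop early)."""
    best = float("inf")
    stale_rounds = 0
    for i, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale_rounds = loss, 0
        else:
            stale_rounds += 1
        if early_stop_max_rounds and stale_rounds >= early_stop_max_rounds:
            return i
    return len(losses)
```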

Epochs

Classifier Type = DeepExtract

Numeric

Specify the number of epochs (i.e. complete iterations over the training set) used to create the model. The more epochs, the better the model is optimized to the training set, but with a risk of overfitting.

Suggested values are below 30.

3

Learning Rate

Classifier Type = DeepExtract

Numeric

Specify the learning rate of the algorithm.

 0.00006

Seed

Classifier Type = DeepExtract

Numeric

A seed to shuffle the training set.

 42

Language

Classifier Type = DeepExtract v2

Multilingual,
English

Specify the language that is expected in the documents.

Choose

Multilingual, if multiple languages are expected. Over 100 languages are supported.
English, if only English documents are expected.

Multilingual

Hyperparametrization

Classifier Type = DeepExtract v2

Low Rank Auto Adaptation,
Full Model Auto Adaptation,
Custom

Specify which set of hyperparameters should be used.

Choose

Low Rank Auto Adaptation, if you have a relatively small training set. This typically also performs well for large training sets.
Full Model Auto Adaptation never uses low rank adaptation; hence it is only effective with very large training sets (typically around 3000 pages).
Custom, if custom hyperparameters need to be specified.

Low Rank Auto Adaptation

Learning Rate

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Specify the learning rate for the first epoch.

 0.001

Learning Rate Scheduler

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Adaptive,
Constant,
Constant With Warmup,
Linear,
Cosine,
Polynomial

Specify the learning rate scheduler type. The scheduler defines how the learning rate varies from epoch to epoch.

Adaptive is determined from the other parameters; at the moment it uses Cosine for the low rank adaptation algorithm and Constant otherwise.
Constant: the learning rate is always the same.
Constant With Warmup: for the specified amount of warmup steps the learning rate is lowered, then it is restored to a constant value.
Linear: the learning rate changes linearly over training.
Cosine: the learning rate follows a cosine curve over training.
Polynomial: the polynomial learning rate scheduler from PyTorch; several of its parameters are not exposed.

Adaptive

Warmup

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Learning Rate Scheduler is one of: Constant With Warmup, Linear, Cosine, Polynomial

Numeric

For the specified amount of warmup steps, the learning rate is lowered.

0.1
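Whether the value 0.1 denotes a step count or a fraction of all steps is not stated here; assuming it is a fraction, a Constant With Warmup schedule could look like this (a sketch, not the product’s implementation):

```python
def lr_at_step(step, total_steps, base_lr, warmup=0.1):
    """Linear warmup followed by a constant learning rate.
    `warmup` is read as a fraction of `total_steps` (an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # ramp up from near zero
    return base_lr
```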

Epochs

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Specify the number of epochs (i.e. complete iterations over the training set) used to create the model. The more epochs, the better the model is optimized to the training set, but with a risk of overfitting.

Suggested values are below 30.

15

Weight Decay

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Weight decay parameter, which reduces the risk of overfitting.

0.01
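Weight decay shrinks every weight slightly toward zero at each update, which discourages large weights and thus overfitting. A decoupled (AdamW-style) sketch:

```python
def sgd_step(weights, grads, lr=0.001, weight_decay=0.01):
    """One gradient step with decoupled weight decay: each weight is
    additionally shrunk by lr * weight_decay * weight."""
    return [w - lr * g - lr * weight_decay * w
            for w, g in zip(weights, grads)]
```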

Chunk Distance Limit

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

The input is split into chunks that are recombined into pairs of chunks. This parameter controls how far apart two chunks can be in order to be recombined into a training sample.

5
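A sketch of the pairing rule: with a limit of 5, chunks i and j form a training sample only if they are at most 5 positions apart (names are illustrative):

```python
def chunk_pairs(num_chunks, distance_limit=5):
    """All chunk index pairs (i, j), i < j, with j - i <= distance_limit."""
    return [(i, j)
            for i in range(num_chunks)
            for j in range(i + 1, num_chunks)
            if j - i <= distance_limit]
```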

Gradient Accumulation Steps

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Optional[Numeric]

Accumulate gradients over n steps before performing backpropagation, i.e. to simulate a larger batch size.

N/A
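The idea can be sketched as averaging the gradients of n consecutive micro-batches into one update (a simplification of what a training loop with accumulation does):

```python
def accumulated_updates(micro_batch_grads, accumulation_steps=4):
    """Average gradients over groups of `accumulation_steps` micro-batches,
    emitting one (larger effective batch) update per group."""
    updates = []
    for start in range(0, len(micro_batch_grads), accumulation_steps):
        group = micro_batch_grads[start:start + accumulation_steps]
        updates.append(sum(group) / len(group))
    return updates
```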

Max Tokens Per Word

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Numeric

Maximum number of tokens kept per word after tokenization.

30

Low Rank Adaptation

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

No,
Adaptive,
Custom

This controls the use of the low rank adaptation algorithm during model creation:

No does not use the low rank adaptation algorithm.
Adaptive adapts the rank of the low rank adaptation algorithm to the schema size and uses a scaling factor to determine alpha.
Custom lets you set some of the low rank adaptation algorithm parameters.

No

Scaling Factor

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Adaptive

Numeric

Select the scaling factor of the low rank adaptation, controlling the magnitude of the low-rank update.

1

Rank

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

The concept of "rank" in LoRA refers to the number of dimensions of the trainable parameters within the adapter. When a model is fine-tuned using LoRA, the weight update is represented not by a full-rank matrix, but by the product of two lower-rank matrices.

16

Alpha

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

It represents a scaling factor that influences how the outputs of the adapter matrices are combined with the original model.

16
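Rank and Alpha fit together as follows: the weight update is the product of two matrices of rank `rank`, scaled by `alpha / rank`. A self-contained sketch without any ML library (initializing B to zero, as in the original LoRA paper, so training starts from the unmodified base model):

```python
import random

def lora_delta(rank=16, alpha=16, d_out=4, d_in=4, seed=0):
    """LoRA-style low-rank weight update: delta = (alpha / rank) * B @ A,
    with A (rank x d_in) random and B (d_out x rank) initialized to zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero init -> initial delta is 0
    scale = alpha / rank
    return [[scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(d_out)]
```

Only A and B are trained, i.e. rank * (d_in + d_out) parameters instead of d_in * d_out for a full-rank update.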

Dropout

Classifier Type = DeepExtract v2

Hyperparametrization = Custom

Low Rank Adaptation = Custom

Numeric

It's the probability that a trainable parameter will be artificially set to zero for a given batch of training. It’s used to help prevent overfitting the model to your data.

0.1
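Inverted dropout, the common implementation, zeroes each trainable value with probability p and rescales the survivors so the expected sum is unchanged; a sketch:

```python
import random

def apply_dropout(values, p=0.1, seed=42, training=True):
    """Zero each value with probability p (training only) and scale
    survivors by 1 / (1 - p) to keep the expectation unchanged."""
    if not training:
        return list(values)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]
```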

Seed

Classifier Type = DeepExtract v2

Numeric

A seed to shuffle the training set.

42

 

Lightweight

This model type is optimized for speed, resulting in fast training and evaluation.

For setting details, refer to the table above

image-20240717-155323.png
The setting options for the lightweight model

 

DeepExtract

DeepExtract is a deep neural network optimized for accuracy. Requiring more time for training and evaluation, this model type delivers more accurate predictions.

For setting details, refer to the table above

image-20240717-135706.png
The setting options for the Deep Extract model

 

DeepExtract v2

DeepExtract v2 is optimized for prediction quality. Requiring more time for training and evaluation, this model type delivers more accurate predictions. DeepExtract v2 supersedes DeepExtract, with the main improvements being a more powerful model architecture and support for multilingual text.

For setting details, refer to the table above

image-20240717-142027.png
The setting options for the Deep Extract v2 model
image-20240717-144659.png
The setting options for the Deep Extract v2 model

Prerequisites

This feature is available to users in the Administrator or Workflow Manager role.

 

For any questions, you can contact our support team.