Release 5.x.x

Initial release date: Oct 13, 2022

This release introduces Regex transformations and adds usability improvements.

Newly available beta features include automatic noise page removal and document splitting.

New Features

Regex Transformations

Description

You can now define regex-based data point transformations in the schema. A transformation matches the extracted content of a data point against a regular expression and rebuilds the data point's value from the captured groups of that match.

These transformations are applied to the data read from the document and affect the values displayed in the UI and written to the export. They run in both training and production and can be previewed immediately in the data extraction panel on the right-hand side.
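
As an illustration of the match-and-rewrite idea, here is a minimal Python sketch. It is not the product's implementation or schema syntax: the function name, the template notation, and the choice to leave non-matching values unchanged are all assumptions made for this example.

```python
import re


def apply_regex_transformation(value: str, pattern: str, template: str) -> str:
    """Rewrite `value` from the captured groups of `pattern`.

    Hypothetical helper. Returning the value unchanged when the pattern
    does not match is an assumption, not documented behavior.
    """
    match = re.fullmatch(pattern, value.strip())
    if match is None:
        return value
    # Match.expand substitutes \1, \2, ... with the captured groups.
    return match.expand(template)


# Example: rewrite a date extracted as "13.10.2022" into ISO order.
iso_date = apply_regex_transformation(
    "13.10.2022",
    r"(\d{2})\.(\d{2})\.(\d{4})",
    r"\3-\2-\1",
)
```

In this sketch the three captured groups (day, month, year) are reordered by the template, so the transformed value is what would appear in the UI and in the export.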

New Beta Features

Noise Pages Removal

Description

You can label pages as noise, indicating that these pages do not contain any relevant data and should be ignored by the system. This not only improves classification performance, but also simplifies training, processing and reviewing, since the scope of pages that need your attention can be limited to a minimum.

Training

Noise Page identification relies on a classifier, which requires training. In training, an option to exclude or include a page has been added at page level in the Thumbnails pane, allowing the user to label pages as noise.

Exclusion of pages

Once a page is excluded:

  • Existing labels (manual or automatic) are ignored.

  • The page thumbnail will show an excluded watermark.

  • The page itself will be disabled, which is shown with an excluded watermark.

  • It will not be possible to add manual labels to the page.

Re-inclusion of pages

Once an excluded page is re-included:

  • Assisted labeling will be applied to the included page (after the 3rd document).

  • The page thumbnail will not show an excluded watermark.

  • The page itself will be enabled and will not show an excluded watermark.

  • Labels added before excluding the page will be restored and it will be possible to add more labels to the page.
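
The exclusion and re-inclusion rules above can be pictured as a small state model. The following Python sketch is purely illustrative: the `Page` class and its method names are hypothetical and do not correspond to any product API. It only shows that labels on an excluded page are ignored rather than deleted, so they reappear on re-inclusion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical model of a page that can be labeled as noise."""

    labels: List[str] = field(default_factory=list)
    excluded: bool = False

    def visible_labels(self) -> List[str]:
        # Labels on an excluded page are ignored, not deleted.
        return [] if self.excluded else list(self.labels)

    def add_label(self, label: str) -> None:
        # Manual labels cannot be added while the page is excluded.
        if self.excluded:
            raise ValueError("cannot label an excluded page")
        self.labels.append(label)

    def exclude(self) -> None:
        self.excluded = True

    def include(self) -> None:
        self.excluded = False


page = Page()
page.add_label("invoice_number")
page.exclude()
ignored = page.visible_labels()   # labels are hidden while excluded
page.include()
restored = page.visible_labels()  # earlier labels are back
```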

Assisted Labeling

Assisted labeling for Noise Pages starts after the 3rd document, just as it does for normal labels.

Noise pages identified through Assisted Labeling do not require confirmation.

Production

Noise Pages will be identified in production documents based on training.

No accuracy is assigned to Noise Pages, so the prediction quality does not influence whether a document is processed automatically or stays in review.

Noise Pages review is possible in production:

  • False positives can be included again in the document (same UI as in training). Assisted labeling will apply for the re-included page.

  • False negatives can be excluded from the document (same UI as in training).

Production Document Splitting based on First Page

Description

This feature enables automatic splitting of uploaded production files containing multiple documents of the same category into individual documents based on their first page.

Splitting can be enabled or disabled per Category.

Enabling and disabling the Feature

  • Production Document Splitting is enabled for a given Category if the Category option Disable document splitting is FALSE.

Training

Document splitting is trained implicitly by looking at the differences between the first pages of each document of a category and all the other pages in that set. The user does not have to label any additional data for this feature to work.

If a Category’s option Disable document splitting is FALSE, document splitting will be trained based on the first page of each document of the training set.
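
To make "trained implicitly" concrete, the sketch below shows one way first-page training examples could be derived from an existing training set without extra labeling. It is an assumption-laden illustration: the function name and the representation of documents as lists of pages are hypothetical, and the actual training procedure is not described in these notes.

```python
from typing import List, Sequence, Tuple


def first_page_examples(
    training_documents: Sequence[Sequence[str]],
) -> List[Tuple[str, bool]]:
    """Pair every page with a flag telling whether it opens its document.

    Hypothetical helper: each training document is a sequence of pages,
    and only the page at index 0 is treated as a first page.
    """
    examples = []
    for pages in training_documents:
        for index, page in enumerate(pages):
            examples.append((page, index == 0))
    return examples


# Two training documents of the same category, with 2 and 3 pages.
docs = [["inv1-p1", "inv1-p2"], ["inv2-p1", "inv2-p2", "inv2-p3"]]
examples = first_page_examples(docs)
```

Each training document contributes exactly one positive example (its first page), and all remaining pages serve as negatives.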

There are no visual indications of this in the UI other than the Category’s Disable document splitting option.

Assisted Labeling

Assisted labeling neither affects nor is affected by Document Splitting.

Production

If Document Splitting is activated:

  • Uploaded files are analyzed for multiple occurrences of first pages.

  • If multiple first pages are found, the file is split into separate documents accordingly.

  • The split documents appear in the production view and are processed as if the user had uploaded them manually as separate files (each uses the original file name with an increasing numeric prefix).
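
The production steps above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the product's algorithm: the per-page first-page predictions are taken as given, the function names are hypothetical, and the "<n>_<file name>" pattern is only a guess at what "increasing numeric prefix" looks like.

```python
from typing import Dict, List


def split_on_first_pages(is_first_page: List[bool]) -> List[range]:
    """Group page indices into documents, one per detected first page."""
    starts = [i for i, first in enumerate(is_first_page) if first]
    if not starts or starts[0] != 0:
        # Assumption: pages before the first detected first page stay together.
        starts.insert(0, 0)
    ends = starts[1:] + [len(is_first_page)]
    return [range(start, end) for start, end in zip(starts, ends)]


def process_upload(
    file_name: str,
    is_first_page: List[bool],
    disable_document_splitting: bool,
) -> Dict[str, range]:
    """Map resulting document names to the page ranges they contain."""
    if disable_document_splitting:
        # Splitting is only active when the Category option is FALSE.
        return {file_name: range(len(is_first_page))}
    parts = split_on_first_pages(is_first_page)
    if len(parts) == 1:
        return {file_name: parts[0]}
    # Assumed naming pattern: "1_scan.pdf", "2_scan.pdf", ...
    return {f"{n}_{file_name}": part for n, part in enumerate(parts, start=1)}


result = process_upload("scan.pdf", [True, False, True, False, False], False)
```

Here a five-page upload with first pages detected at positions 0 and 2 yields two documents, covering pages 0 to 1 and pages 2 to 4. With the Category option set to TRUE, the same upload stays a single document.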

Apart from the file name, nothing in the UI indicates that a document was split from an initial file.

There is currently no user review for split documents.