Next Steps in Cashbook Data Extraction Techniques

Robotic Process Automation (RPA) in Data Extraction

Extracting and preparing data from digitized documents is now essential in any modern Cash Management System. Cashbook exemplifies this, with its software actively running at dozens of customer sites and handling a high volume of documents—especially Lockbox files, as well as formats like CSV, MT940, and EDI—all processed with the timing and accuracy today’s finance teams demand.

The system starts by analyzing unfamiliar documents, particularly image files (GIF, TIFF, PNG) or PDFs, where key data like invoice identifiers and amounts are distributed across the page according to an arbitrary scheme. An arbitrary scheme means there is no predefined location for key information in the document, so the system needs some kind of guidance to process the document effectively and on time.

When the system receives clear instructions on how to read a document, it quickly extracts clean and meaningful data. Processing a document without layout information leads to low-quality data, excessive noise, and poor semantic meaning.

Working with Layout Information in the Form of Templates

Cashbook includes an RPA-based system for extracting data from documents with a previously unknown structure.

In its first implementation, that automated system allowed the user to create document-reading heuristics that serve as guidance for identifying the fundamental areas of a document. Those heuristics consist of geospatial information that defines the parts of the document to be analyzed, so that all other areas are ignored. This type of document layout specification is known as a template, since it can be generalized to other documents with the same or similar layouts.

Each customer therefore defines a set of templates for their most frequent layouts. When a new document is presented, there is no initial information about its structure, and this is where the template matching process comes into play. One of the most important elements of a template is the “anchor”. The anchor contains the information used to decide whether a previously created template can be assigned to a newly provided document. This usually involves selecting text that is expected in the same position on every new document.
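The anchor check described above can be illustrated with a minimal Python sketch. It is not Cashbook's implementation: the `Word` and `Template` classes, the pixel `tolerance`, and the sample template names are all assumptions made for illustration, and the OCR output is hard-coded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """One OCR token with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Template:
    """Hypothetical template: a name, the anchor text, and where it is expected."""
    name: str
    anchor_text: str
    anchor_x: int
    anchor_y: int
    tolerance: int = 10  # assumed pixel tolerance around the expected position


def anchor_matches(template: Template, words: list[Word]) -> bool:
    """Return True if the anchor text appears close to its expected position."""
    return any(
        word.text == template.anchor_text
        and abs(word.x - template.anchor_x) <= template.tolerance
        and abs(word.y - template.anchor_y) <= template.tolerance
        for word in words
    )


def match_template(templates: list[Template], words: list[Word]) -> Optional[Template]:
    """Try every template in turn and return the first whose anchor matches."""
    for template in templates:
        if anchor_matches(template, words):
            return template
    return None


# Made-up templates and a made-up OCR result for one page.
templates = [
    Template("supplier_a_remittance", "REMITTANCE", 40, 25),
    Template("supplier_b_statement", "STATEMENT", 300, 60),
]
page = [Word("STATEMENT", 304, 58), Word("INV-1001", 50, 200), Word("125.50", 400, 200)]
matched = match_template(templates, page)
print(matched.name if matched else "no template matched")
```

Because this sketch tries each template in turn, the time spent on matching grows with the number of templates a customer has defined.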

Once the document has been matched with a template, the template’s area specification is fetched and applied to the document. Now the system knows exactly which parts of the document it needs to read, and which formats it needs to apply to the text once it has been extracted. Then the data is extracted and presented to the user for changes, validation, postprocessing, storage, etc.
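As a rough illustration of applying a template's areas, the Python sketch below reads only the words that fall inside each area and then applies that area's format. The `Area` class, the `parse_amount` helper, the field names, and the coordinates are hypothetical, and the amount parser assumes a dot as the decimal mark.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class Word:
    """One OCR token with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Area:
    """Hypothetical template area: a named rectangle plus a formatter for its text."""
    field: str
    left: int
    top: int
    right: int
    bottom: int
    fmt: Callable[[str], object]

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def parse_amount(text: str) -> Decimal:
    """Turn text such as '1,250.00' into a Decimal (assumes '.' as the decimal mark)."""
    return Decimal(text.replace(",", ""))


def extract(areas: list[Area], words: list[Word]) -> dict[str, object]:
    """Read only the words inside each area and apply that area's format."""
    result: dict[str, object] = {}
    for area in areas:
        inside = sorted((w for w in words if area.contains(w)), key=lambda w: (w.y, w.x))
        text = " ".join(w.text for w in inside)
        result[area.field] = area.fmt(text) if text else None
    return result


# Made-up areas and OCR words; anything outside the areas is ignored.
areas = [
    Area("invoice_id", 0, 180, 200, 220, str.strip),
    Area("amount", 350, 180, 500, 220, parse_amount),
]
words = [Word("STATEMENT", 304, 58), Word("INV-1001", 50, 200), Word("1,250.00", 400, 200)]
record = extract(areas, words)
print(record)
```

Keeping the formatter on the area itself means the extraction loop does not need to know which field it is reading.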

Issues in using templates

This template-based RPA system works quite well and gives the user robust, high-quality results. However, it is not without drawbacks. One problem is that, if the user creates a large number of templates, the template matching process does not scale as well as the rest of the process, so a non-trivial amount of time is spent on matching.

In addition, for a single template to recognize a large enough number of documents (which would be the ideal scenario), each of those documents would have to be scanned in the same way, meeting minimum image quality requirements so that the layout (and particularly the anchor) is similar across all scans. Unfortunately, this is not the case, and bad image scans are quite frequent.

To overcome these problems, the Cashbook R&D department planned a change to the heuristic specification used for reading documents. The new approach reduces user interaction to a minimum (or avoids it altogether), and it does not use templates.

Introducing the new RPA system for Data Extraction

The new implementation requires user interaction only in the later stages of the process, at the moment of validation. The system applies analysis rules to the image to extract and retain important data.

The rules that analyze the image are based on the Cashbook team’s years of experience in working with RPA for Data Extraction systems. They combine Machine Learning (ML) techniques (e.g. for column type identification) with Data Mining tools (for data identification, formatting, cleaning, etc.) that are run following a structured process. At the end of that process, the data is stored in JSON format for later processing.
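To give a sense of what storing extracted data as JSON could look like, here is a small Python sketch using the standard `json` module. The record layout, the field names, and the file name are assumptions made for illustration and are not Cashbook's actual schema. Amounts are kept as strings so that no precision is lost to floating-point rounding.

```python
import json

# Hypothetical shape for one processed document.
extracted = {
    "document": "scan_0001.png",  # made-up file name
    "columns": [
        {"name": "invoice_id", "type": "text"},
        {"name": "amount", "type": "decimal"},
    ],
    "rows": [
        {"invoice_id": "INV-1001", "amount": "125.50"},
        {"invoice_id": "INV-1002", "amount": "980.00"},
    ],
}

# Serialize for storage, then read it back as a later processing step would.
serialized = json.dumps(extracted, indent=2)
restored = json.loads(serialized)
print(serialized)
```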

From a high-level perspective, the rules-based system runs the following stages:

Image data fetching from an OCR library.
Filtering of noisy, useless, or ambiguous information.
Grid formatting.
Column specification.
Column identification.
User validation.
Final formatting and storage.
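The stages listed above can be sketched end to end in Python. This is a toy sketch under stated assumptions and not the Cashbook implementation: the OCR output is a hard-coded stub, rows and columns are found by simple position bucketing, and column identification uses regular expressions as a stand-in for the ML step. All function names, thresholds, and sample values are hypothetical. User validation is left out.

```python
import json
import re
from collections import defaultdict

# Image data fetching (stubbed): tokens as an OCR library might return them.
# Each token is (text, x, y, confidence); the values are made up.
OCR_TOKENS = [
    ("INV-1001", 50, 200, 0.98),
    ("125.50", 400, 201, 0.95),
    ("~", 220, 203, 0.20),  # speckle read as a character
    ("INV-1002", 51, 230, 0.97),
    ("980.00", 402, 229, 0.96),
]


def filter_noise(tokens, min_conf=0.5):
    """Filtering: drop low-confidence tokens and tokens with no letters or digits."""
    return [t for t in tokens if t[3] >= min_conf and re.search(r"[A-Za-z0-9]", t[0])]


def to_grid(tokens, row_height=15):
    """Grid formatting: group tokens into rows by vertical position."""
    rows = defaultdict(list)
    for text, x, y, _conf in tokens:
        rows[round(y / row_height)].append((x, text))
    return [sorted(rows[key]) for key in sorted(rows)]


def specify_columns(grid, col_width=100):
    """Column specification: assign each cell to a column by horizontal position."""
    columns = defaultdict(list)
    for row in grid:
        for x, text in row:
            columns[x // col_width].append(text)
    return [columns[key] for key in sorted(columns)]


def identify_column(values):
    """Column identification: label a column by pattern (regex stand-in for ML)."""
    if all(re.fullmatch(r"\d+\.\d{2}", v) for v in values):
        return "amount"
    if all(re.fullmatch(r"[A-Z]+-\d+", v) for v in values):
        return "invoice_id"
    return "unknown"


def run_pipeline(tokens):
    """Run the stages in order and return the result serialized as JSON."""
    columns = specify_columns(to_grid(filter_noise(tokens)))
    labelled = [{"type": identify_column(values), "values": values} for values in columns]
    return json.dumps(labelled, indent=2)  # final formatting and storage


output = run_pipeline(OCR_TOKENS)
print(output)
```

Fixed-size bucketing can split a row or column whose tokens straddle a bucket boundary, so a real system would need a more robust grouping method.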

The system requires user interaction only during the User Validation stage (e.g. fixing incorrect data or checking a column type), and only if it detects that this is required. The new RPA Data Extraction system processes images and documents quickly and cleanly, and it prepares data for post-processing tasks like allocation algorithms. This new Data Extraction system is part of Cashbook V6.
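One way to picture validation that involves the user only when needed is a check that returns the cells worth reviewing and returns nothing when the data looks clean. The Python sketch below is an assumption about how such a check might work, not Cashbook's rules. The `needs_review` function, the column type names, and the amount pattern are hypothetical.

```python
import re

# Assumed amount format: digits, a dot, and exactly two decimals.
AMOUNT = re.compile(r"\d+\.\d{2}")


def needs_review(column_type, values):
    """Return the indexes of the cells a user should check (empty if none)."""
    if column_type == "unknown":
        return list(range(len(values)))  # the whole column needs a type check
    if column_type == "amount":
        return [i for i, v in enumerate(values) if not AMOUNT.fullmatch(v)]
    return []


# The second value contains the letter O instead of a zero, a typical OCR slip.
flagged = needs_review("amount", ["125.50", "98O.00", "40.00"])
print(flagged)
```

An empty result means the document can move on to final formatting and storage without any user interaction.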

March 12th, 2021
