Extracting and preparing data from digitized documents is now essential in any modern Cash Management System. Cashbook exemplifies this, with its software actively running at dozens of customer sites and handling a high volume of documents—especially Lockbox files, as well as formats like CSV, MT940, and EDI—all processed with the timing and accuracy today’s finance teams demand.
The system starts by analyzing unfamiliar documents—particularly image files (GIF, TIFF, PNG) or PDFs—where key data like invoice identifiers and amounts are distributed across the page according to an arbitrary scheme. The term arbitrary scheme layout means there’s no predefined location for key information in the document. The system needs some kind of guidance for processing the document in an effective, timely way.
When the system receives clear instructions on how to read a document, it quickly extracts clean and meaningful data. Processing a document without layout information leads to low-quality data, excessive noise, and poor semantic meaning.
Cashbook includes an RPA-based system for extracting data from documents whose structure is not known in advance.
In its first implementation, that automated system let the user create document-reading heuristics that guide the identification of the fundamental areas of a document. Those heuristics consist of spatial (positional) information that defines the parts of the document to be analyzed, so that all other areas are ignored. This type of document layout specification is known as a template, since it can be generalized to other documents with the same or similar layouts.
Each customer therefore defines a set of templates for their most frequent layouts. When a new document is presented, the system has no initial information about its structure, and this is where the template matching process comes into play. One of the most important elements of a template is known as the “anchor”. The anchor contains the information used to decide whether a previously created template can be assigned to a newly provided document. Defining it usually involves selecting text that is expected to appear in the same position on every new document.
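The anchor check described above can be sketched as follows. This is a minimal illustration, not Cashbook's implementation: it assumes an OCR step has already produced words with page coordinates, and the names `Word`, `Template`, `anchor_matches`, `match_template` and the pixel `tolerance` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Word:
    """One OCR word with the position of its top-left corner (assumed input)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Template:
    """Hypothetical template: an anchor text and where it is expected."""
    name: str
    anchor_text: str
    anchor_x: float
    anchor_y: float
    tolerance: float = 10.0  # allowed drift in pixels (assumed value)


def anchor_matches(template: Template, words: Sequence[Word]) -> bool:
    """True if the anchor text appears close to its expected position."""
    for word in words:
        same_text = word.text.strip().lower() == template.anchor_text.lower()
        near_x = abs(word.x - template.anchor_x) <= template.tolerance
        near_y = abs(word.y - template.anchor_y) <= template.tolerance
        if same_text and near_x and near_y:
            return True
    return False


def match_template(templates: Sequence[Template],
                   words: Sequence[Word]) -> Optional[Template]:
    """Return the first template whose anchor matches, or None."""
    for template in templates:
        if anchor_matches(template, words):
            return template
    return None


templates = [
    Template("acme", "Invoice", 100, 50),
    Template("globex", "Remittance", 100, 50),
]
clean_scan = [Word("Remittance", 102, 48), Word("Advice", 180, 48)]
shifted_scan = [Word("Remittance", 142, 48), Word("Advice", 220, 48)]

matched = match_template(templates, clean_scan)
unmatched = match_template(templates, shifted_scan)
```

In this sketch every template is tried in turn, so the matching cost grows with the number of templates, and a scan that is shifted beyond the tolerance no longer matches any template.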
Once the document has been matched with a template, the template's area specification is fetched and applied to the document. The system now knows exactly which parts of the document it needs to read, and which formats it needs to apply to the text once it has been extracted. The data is then extracted and presented to the user for changes, validation, postprocessing, storage, etc.
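As a rough sketch of this step, the code below reads only the words that fall inside each template area and applies the format attached to that area. The `Region` structure, the two format names and the amount convention (comma as thousands separator, dot as decimal point) are assumptions made for the example, not details taken from Cashbook.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Word:
    """One OCR word with the position of its top-left corner (assumed input)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Region:
    """Hypothetical template area: a named rectangle plus a format to apply."""
    field: str
    left: float
    top: float
    right: float
    bottom: float
    fmt: str  # "text" or "amount" in this sketch


def parse_amount(raw: str) -> Decimal:
    # Assumes "1,250.00"-style amounts, optionally prefixed with "$".
    return Decimal(raw.replace("$", "").replace(",", "").strip())


FORMATTERS = {"text": str.strip, "amount": parse_amount}


def extract_fields(regions: Sequence[Region],
                   words: Sequence[Word]) -> Dict[str, Optional[object]]:
    """Read only the words inside each region and apply the region's format."""
    result: Dict[str, Optional[object]] = {}
    for region in regions:
        inside = [
            w for w in words
            if region.left <= w.x <= region.right
            and region.top <= w.y <= region.bottom
        ]
        inside.sort(key=lambda w: (w.y, w.x))  # top-to-bottom, left-to-right
        raw = " ".join(w.text for w in inside)
        # An empty region yields None instead of failing in the formatter.
        result[region.field] = FORMATTERS[region.fmt](raw) if raw else None
    return result


words = [
    Word("INV-1001", 50, 100),
    Word("$1,250.00", 300, 100),
    Word("Page", 50, 700),  # outside every region, so it is ignored
    Word("1", 80, 700),
]
regions = [
    Region("invoice_id", 0, 90, 200, 110, "text"),
    Region("amount", 250, 90, 400, 110, "amount"),
    Region("memo", 0, 300, 400, 320, "text"),
]
fields = extract_fields(regions, words)
```

The extracted `fields` dictionary is the kind of result that would then be shown to the user for validation.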
This template-based RPA system works well and gives the user robust, high-quality results. However, it is not without drawbacks. One problem is that, if the user creates a large number of templates, the template matching process does not scale as well as the rest of the process, so matching takes a non-trivial amount of time.
In addition, for a single template to recognize a large enough number of documents (which would be the ideal scenario), every one of those documents would have to be scanned in the same way, meeting minimum image-quality requirements so that the layout (and particularly the anchor) is similar across all scans. Unfortunately, this is not the case, and poor-quality scans are quite frequent.
To overcome these problems, the Cashbook R&D department decided to change the heuristic specification for reading documents. The new approach reduces user interaction to a minimum (or avoids it altogether), and it does not use templates.
The new implementation requires user interaction only in the later stages of the process, at the moment of validation. Before that point, the system applies analysis rules to the image to extract and retain the important data.
The rules that analyze the image are based on the Cashbook team's years of experience working with RPA for Data Extraction systems. They combine Machine Learning (ML) techniques (e.g. for column type identification) with Data Mining tools (for data identification, formatting, cleaning, etc.), and they are run following a structured process. At the end of that process, the data is stored in JSON format for later processing.
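To make the last two steps concrete, the sketch below tags each extracted column with a type and serializes the result as JSON. It is only an illustration under stated assumptions: the regular-expression heuristic in `guess_column_type` is a simple rule-based stand-in for the ML-based column type identification mentioned above, and the record layout (`document_id`, `columns`, `name`, `type`, `values`) is hypothetical.

```python
import json
import re
from typing import Dict, List, Sequence

# Assumed conventions: ISO dates and "1,250.00"-style amounts.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
AMOUNT_RE = re.compile(r"^\$?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?$")


def guess_column_type(values: Sequence[str]) -> str:
    """Rule-based stand-in for column type identification (not ML)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    if all(DATE_RE.match(v) for v in non_empty):
        return "date"
    if all(AMOUNT_RE.match(v) for v in non_empty):
        return "amount"
    return "text"


def build_record(document_id: str, columns: Dict[str, List[str]]) -> dict:
    """Assemble a JSON-serializable record (hypothetical layout)."""
    return {
        "document_id": document_id,
        "columns": [
            {"name": name, "type": guess_column_type(values), "values": values}
            for name, values in columns.items()
        ],
    }


def to_json(record: dict) -> str:
    """Serialize the record for later processing."""
    return json.dumps(record, indent=2)


record = build_record(
    "doc-1",
    {
        "invoice": ["INV-1001", "INV-1002"],
        "amount": ["1,250.00", "80.50"],
        "date": ["2024-01-15", "2024-02-01"],
    },
)
stored = to_json(record)
```

A purely numeric invoice column would be tagged as `amount` by this heuristic, which is one reason a learned classifier and a validation step are useful here.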
From a high-level perspective, the rules-based system runs the following stages:
Only during the User Validation stage does the system require user interaction (e.g. fixing incorrect data or checking a column type), and only if it detects that such interaction is needed. The new RPA Data Extraction system processes images and documents quickly and cleanly, and it prepares the data for post-processing tasks such as allocation algorithms. This new Data Extraction system is part of Cashbook V6.