March 12th, 2021

Next Steps in Cashbook Data Extraction Techniques


Robotic Process Automation (RPA) in Data Extraction

Getting the data out of a digitized document and having it ready for processing has become a key step in any cash management system. This is the case for Cashbook, whose software instances run at dozens of customers and receive a substantial flow of documents (especially in Lockbox format, but also in formats such as CSV, MT940, and EDI) that need to be processed on time.

At the beginning of the process, the structure of those documents is unknown, particularly in the case of an image (GIF, TIFF, PNG, etc.) or a PDF. The valuable data they contain (such as the invoice identifier and, of course, the amounts) is spatially distributed in an arbitrary layout, meaning that there is no prior information about where to find the important parts of the document. The system needs some kind of guidance to process the document effectively and on time.

When the system has robust instructions about how to read a document, it fetches clean and meaningful data in almost no time. On the other hand, processing a document with too little layout information, or none at all, produces low-quality data with lots of noise and little semantic meaning.

Working with Layout Information in the Form of Templates

Cashbook includes an RPA-based system for extracting data from documents whose structure is not known in advance.

In its first implementation, that automated system allowed the user to create document-reading heuristics that guide the identification of the fundamental areas of a document. Those heuristics consist of spatial information that defines the parts of the document to be analyzed (the other areas are ignored). This type of document layout specification is known as a template, since it can be generalized to other documents with the same or similar layouts.

Each customer therefore defines a set of templates for their most frequent layouts. When a new document arrives, there is no initial information about its structure, and this is where the template matching process comes into play. One of the most important elements of a template is the “anchor”. The anchor contains the information used to decide whether a previously created template can be assigned to a newly provided document. This is typically achieved by selecting some part of the document text that the user expects to be present in every new document, in the same position.
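As a rough illustration of anchor-based matching, the sketch below checks whether a template's anchor text shows up near its expected position in the OCR output. The `Word` and `Template` classes, the pixel tolerance, and the function names are assumptions made for this example; they are not Cashbook's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """One OCR token with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Template:
    """Hypothetical template: a name, the anchor text, and where it is expected."""
    name: str
    anchor_text: str
    anchor_x: int
    anchor_y: int


def anchor_matches(template: Template, words: List[Word], tolerance: int = 15) -> bool:
    """Return True if the anchor text appears within `tolerance` pixels of its expected spot."""
    for word in words:
        if word.text.lower() != template.anchor_text.lower():
            continue
        if (abs(word.x - template.anchor_x) <= tolerance
                and abs(word.y - template.anchor_y) <= tolerance):
            return True
    return False


def match_template(templates: List[Template], words: List[Word]) -> Optional[Template]:
    """Scan the templates in order and return the first one whose anchor matches."""
    for template in templates:
        if anchor_matches(template, words):
            return template
    return None
```

Because `match_template` tries every template in turn, the cost of matching grows with the number of templates a customer has defined.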

Once the document has been matched with a template, the template's area specification is fetched and applied to the document. Now the system knows exactly which parts of the document it needs to read and which formats it needs to apply to the text once it has been extracted. The data is then extracted and presented to the user for changes, validation, postprocessing, storage, etc.
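Applying a matched template can be pictured as reading only the words that fall inside each declared area and then running a per-area formatter over the text. The following is a minimal sketch under that assumption; the `Area` class, the `parse_amount` helper, and the US-style amount format are hypothetical choices for the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Word:
    """One OCR token with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Area:
    """Hypothetical template area: a named rectangle plus a formatter for its text."""
    name: str
    left: int
    top: int
    right: int
    bottom: int
    parse: Callable[[str], object]


def parse_amount(raw: str) -> Decimal:
    """Assumes US-style amounts such as '$1,234.50'; raises on empty or malformed text."""
    return Decimal(raw.replace("$", "").replace(",", "").strip())


def extract(areas: List[Area], words: List[Word]) -> Dict[str, object]:
    """Read only the words inside each area, in reading order, and format them."""
    result: Dict[str, object] = {}
    for area in areas:
        inside = [w for w in words
                  if area.left <= w.x <= area.right and area.top <= w.y <= area.bottom]
        inside.sort(key=lambda w: (w.y, w.x))  # top to bottom, then left to right
        raw = " ".join(w.text for w in inside)
        result[area.name] = area.parse(raw)
    return result
```

Words outside every area are never read, which is how the template lets the system skip the irrelevant parts of the page.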

Issues in Using Templates

This template-based RPA system works quite well and gives the user robust, high-quality results. However, it is not without drawbacks. One problem is that, if the user creates a large number of templates, the template matching process does not scale as well as the rest of the process, so matching takes a non-trivial amount of time.

In addition, for the same template to recognize a large enough number of documents (which would be the ideal scenario), each of those documents would have to be scanned in the same way, meeting minimum image-quality requirements so that the layout (and particularly the anchor) is similar across all scans. Unfortunately, this is not the case, and bad image scans are quite frequent.

To overcome these problems, the Cashbook R&D department came up with a plan to change the heuristic specification for reading documents. The new idea reduces user interaction to a minimum (or even avoids it) and does not use templates.

Introducing the New RPA System for Data Extraction

The new implementation requires user interaction only in the later stages of the process, at the moment of validation. The implementation runs a set of analysis rules over the image in order to select the important data and keep it available.

The rules that analyze the image are based on the Cashbook team's years of experience working with RPA for data extraction systems. They combine Machine Learning (ML) techniques (e.g. for column type identification) and Data Mining tools (for data identification, formatting, cleaning, etc.), run in a structured process. At the end of that process, the data is stored in JSON format for later processing.
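The article does not say which model is used for column type identification, so the sketch below swaps in a much simpler stand-in: each cell votes for a type using regular expressions, and the majority wins. The patterns, type labels, and JSON field names are assumptions for illustration only.

```python
import json
import re
from collections import Counter
from typing import List

# Hypothetical patterns; a real system would likely use a trained classifier instead.
PATTERNS = {
    "amount": re.compile(r"^\$?\d[\d,]*\.\d{2}$"),
    "date": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "invoice_id": re.compile(r"^[A-Z]{2,}-?\d+$"),
}


def classify_cell(cell: str) -> str:
    """Return the first type whose pattern matches the cell, or 'unknown'."""
    text = cell.strip()
    for label, pattern in PATTERNS.items():
        if pattern.match(text):
            return label
    return "unknown"


def classify_column(cells: List[str]) -> str:
    """Majority vote over the cells of one column."""
    if not cells:
        return "unknown"
    votes = Counter(classify_cell(cell) for cell in cells)
    return votes.most_common(1)[0][0]


def columns_to_json(columns: List[List[str]]) -> str:
    """Serialize each column with its inferred type, ready for later processing."""
    payload = [{"type": classify_column(cells), "values": cells} for cells in columns]
    return json.dumps(payload, indent=2)
```

A column typed as `unknown` is a natural candidate for the user to check at validation time.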

From a high-level perspective, the rules-based system runs the following stages:

  • Image data fetching from an OCR library.
  • Filtering of noisy, useless, or ambiguous information.
  • Grid formatting.
  • Column specification.
  • Column identification.
  • User validation.
  • Final formatting and storage.
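To make the early stages more concrete, here is a minimal sketch of noise filtering followed by grid formatting. It assumes the OCR library returns tokens with a position and a confidence score; the `Token` class, the confidence threshold, and the row tolerance are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One OCR token: text, top-left position in pixels, and a confidence in [0, 1]."""
    text: str
    x: int
    y: int
    confidence: float


def filter_noise(tokens: List[Token], min_confidence: float = 0.6) -> List[Token]:
    """Drop blank tokens and tokens the OCR engine was unsure about."""
    return [t for t in tokens if t.confidence >= min_confidence and t.text.strip()]


def to_grid(tokens: List[Token], row_tolerance: int = 8) -> List[List[str]]:
    """Group tokens into rows by vertical proximity, then order each row left to right."""
    rows: List[List[Token]] = []
    for token in sorted(tokens, key=lambda t: (t.y, t.x)):
        # A token joins the current row if it is close to that row's first token.
        if rows and abs(token.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(token)
        else:
            rows.append([token])
    return [[t.text for t in sorted(row, key=lambda t: t.x)] for row in rows]
```

Running `to_grid(filter_noise(tokens))` yields a list of rows, which gives the column specification and identification stages a tabular structure to work on.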

Only during the User Validation stage does the system require user interaction (e.g. fixing incorrect data or checking a column type), and only if it detects that this is required. With this new approach in the RPA Data Extraction system, images and documents are processed quickly and cleanly, leaving the data ready for any post-processing phase required, such as running allocation algorithms. This new Data Extraction system is part of Cashbook V6.
