Data research: Building the best receipt management app

March 20, 2018 The Sensibill Team

As a Data Research Analyst, I help train our model to read receipts the same way humans do. In many ways, this is no different from teaching a child how to read. Think of the machine as a human brain: both have a network of neurons that stores information along the way, and these pathways can be altered by experiences, but only if those experiences are remembered.

My job is to tell this “brain” how to work through a problem: reading receipts. I start by figuring out what the brain (machine) picks up, how it reads outside information (a receipt), and what it does with that information. If I notice that it’s reading something wrong, I correct it. Basically, I help the brain learn how to think so that it reads accurately on its own the next time.

Of course, supervising the machine’s learning this way is time-consuming. We have to research every receipt format and train the machine on all of them before it can read a receipt on its own with any degree of confidence. So why do we bother having Data Research Analysts? This isn’t an existential question! There are plenty of other methods. So why supervised machine learning? Why me? Let’s do a quick comparison.

What about manual intervention?

Manual intervention is the act of a human purposefully reviewing documents submitted by users and editing the extracted data to match what is found on the document. While it’s possible to have humans check every receipt, machine learning is much more scalable. Whether one receipt or 500 receipts are sent through production every day, the machine is much faster at recognition once it is trained. With manual intervention, each additional receipt costs additional time and money to extract data from. A trained machine learning model, on the other hand, takes the same amount of time per receipt to extract data and present it in a clean and concise view for the end user, regardless of how many receipts are sent through the system. With manual intervention, processing times would increase and, as you can imagine, the end user would get impatient with the product and move on to the next best thing.

Takeaway: Regardless of how many receipts are sent through the system, a trained machine learning model will take the same amount of time per receipt to extract data and present it in a clean and concise view for the end user.

What about rules-based systems?

A rules-based system is a more sophisticated approach than simple human intervention, but there are drawbacks to having it be the primary method of data extraction. Rules-based systems work with a set of facts and a set of rules. Take, for example, programming AI to suggest appropriate clothing to a user. The facts could be: 1) It is sunny; 2) It is humid; and, 3) It is a weekend. A rule could be created to say “If the conditions are sunny, humid, and the weekend, then the appropriate clothing to wear would be shorts.” For simple tools, rules-based systems are usable. But things can get tricky when special cases are constantly brought into the mix. For example, if a fact ends up being “4) It's a holiday” and there is no rule created for that condition, the end user will not get what they expect. Machine learning is based on models and training, which gives the computer a better chance of responding well to unanticipated circumstances. Like receipts it’s never seen before!
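To make the clothing example concrete, here is a minimal sketch in Python. It is only an illustration and not how our product works: the `RULES` table and the `suggest_clothing` function are hypothetical names, and the sketch assumes a rule fires only when the facts match it exactly.

```python
from typing import Iterable, Optional

# Hypothetical rule table: each rule maps an exact set of facts to a suggestion.
# Assumption for this sketch: a rule applies only when the facts match it exactly.
RULES = {
    frozenset({"sunny", "humid", "weekend"}): "shorts",
}


def suggest_clothing(facts: Iterable[str]) -> Optional[str]:
    """Return the suggestion of the matching rule, or None if no rule covers the facts."""
    return RULES.get(frozenset(facts))


# The anticipated case is covered by a rule.
print(suggest_clothing(["sunny", "humid", "weekend"]))

# An unanticipated fact ("holiday") has no rule, so the system has nothing to offer.
print(suggest_clothing(["sunny", "humid", "weekend", "holiday"]))
```

The first call finds the rule and suggests shorts. The second call returns nothing, because nobody wrote a rule for holidays. Every new special case means another hand-written rule, which is the same problem a rules-based system runs into with every new receipt format.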

Takeaway: For simple tools, rules-based systems are usable. But things can get tricky when special cases are constantly brought into the mix, like new receipt formats.

So why is our model the best?

Using a machine learning model lets us focus on efficiently gaining accurate, high-quality data. And even though training the machine this way is a lot of work, it's also the reason our product is the best tool for our end users (self-employed professionals and small business owners). Data Research Analysts make sure the data the end user submits is returned to them in the cleanest, easiest-to-understand form, cutting out advertisements and bloat. We tag and target the most important pieces of information on a receipt to tell the machine “Hey, this is pretty important. You should learn to pick this up when reading receipts moving forward.” Knowing what our users expect to receive from the data they submit is extremely important so that we can teach the machine to prioritize it. Understanding the users’ journeys is easier with a plethora of data, and it’s much more achievable when time is not spent on manual intervention. Receipts are always changing, so making sure our end users are getting all the information they need from them by using our receipt management app is our top priority.

Takeaway: Receipts are always changing, so making sure our end users are getting all the information they need from them by using our receipt management app is our top priority.

Especially when the information presented on a receipt can help self-employed professionals and business owners keep their books in order with little to no effort, and save them thousands of dollars a year on deductible expenses! (Shameless plug, I’m sorry!)