White Paper

Harvey Spencer Offers Valuable Synopsis of Informative OCE Article

OCR and ICR technologies are extensively used in forms processing solutions. Voting -- a technology which was initially implemented a number of years ago -- has been shown to substantially improve accuracy. The white paper, "Improving OCR's Accuracy Through Expert Voting," discusses in detail how the latest voting techniques work and how they reduce the incidence of expensive substituted characters. It makes extensive use of diagrams and graphs to illustrate the points.

The paper starts with the basic building blocks of OCR. These are:

  • Line Find and Extraction
  • Hand or Machine Print Detection
  • Character Segmentation
  • Feature Extraction
  • Classification of Character
  • Validation or Proving the Accuracy
  • Logical Context
  • Geometric Context
  • Formal Context
  • Trigram (3 continuous letter) Analysis (proprietary to Oce)
  • Dictionary Lookup

OCR (Optical Character Recognition) extracts and analyzes the shape of a bitmapped character and assigns a value to it. Different OCR products use different methodologies, usually based on either template matching or a mathematical analysis (consisting of feature analysis and feature extraction). These analyses usually produce a range of possible results, so they are supplemented by post-analysis validations that identify the most likely result, followed by the possible alternatives. Each candidate character carries a likelihood percentage. Within an engine, each character is tested against the validations iteratively, with the analysis and check sometimes performed as many as 10 times to derive the most likely result.
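As a rough illustration of "the most likely result followed by the possible alternatives," the sketch below ranks candidate characters by likelihood percentage. The Candidate type, the rank_candidates function, and the raw scores are hypothetical and are not taken from any particular engine. Normalizing raw scores into percentages is also an assumption, since real engines derive their likelihoods in their own ways.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    char: str          # proposed character value
    likelihood: float  # likelihood percentage, 0-100


def rank_candidates(raw_scores: Dict[str, float]) -> List[Candidate]:
    """Normalize hypothetical raw scores to percentages, best candidate first."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    ranked = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return [Candidate(char, 100.0 * score / total) for char, score in ranked]


# A shape that could be the letter O, the digit 0, or the letter D.
candidates = rank_candidates({"O": 3.0, "0": 6.0, "D": 1.0})
best, alternatives = candidates[0], candidates[1:]
```

Here best holds the most likely character and alternatives holds the remaining candidates in descending order of likelihood.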

Simple Fonts Give High Accuracy
Most of today's OCR software can produce highly accurate results with well-formed laser or good quality machine-printed text. This is particularly true of simple fonts, 10-16 point fixed text, and fixed character spacing on a plain background. However, variable widths, proportional fonts, kerning, large text, or exotic fonts will reduce accuracy, and hand-printed characters pose even larger challenges.

Differences between Full Text OCR and Forms Processing
Full text OCR is designed to convert a page of similar machine printed textual elements interrupted by photographs or diagrams, often formatted into two or more columns. The software needs to understand and decode this formatting as well as identify and capture the fonts used so as to enable easy editing. Forms processing OCR is designed to capture transactional data from a form in an ASCII format typically to update a back-end computer system.

Forms Processing Poses Challenges
Forms processing poses greater challenges. Data may come from carbon or carbonless forms; the printer may have been a dot matrix; the scanned original may have been a fax; and the background of the form may interfere with the foreground. Fields may not have dictionary entries to look up.

Some fields may be filled in with constrained handprint or, worse, with unconstrained handprint. Sometimes a field contains mixed data types, and frequently a field contains either machine print or handprint, varying from form to form.

Accuracy Statistics Do Not Catch Substitutions
While everyone wants accurate conversion, accuracy is a difficult concept because different vendors measure it differently. Some vendors define accuracy as the percentage of all characters output as "recognized" by the OCR engine, regardless of whether each character has in fact been correctly recognized.

One of the parameters available from a good OCR engine is the acceptance threshold, which allows the user to trade substitution rates against rejection rates. Generally speaking, a low acceptance threshold returns more "recognized" characters containing more errors, while a high acceptance threshold does the opposite. If the acceptance threshold is set too low, the engine will accept a very high percentage of characters, including some that it thinks are correct but that are in fact wrong. These are known as substitutions, and they are the most expensive errors to correct. Conversely, setting the acceptance threshold too high results in more rejected characters. Even though most of these rejected characters may have been recognized correctly, they must be verified in a very labor-intensive post-processing step. The true goal of a good OCR engine should be to eliminate substitutions while keeping the rejection rate low.

In Oce's world there are three categories of character recognition:

  • Characters recognized correctly (correctness is known because the output is checked against a pre-known "truth file")
  • Errors or substitutions, i.e. characters that have been output AND have been wrongly recognized
  • Low-confidence characters
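The threshold trade-off described above can be sketched in a few lines. The function name, the (character, confidence) input format, and the sample values are assumptions made for illustration. The sketch only shows how the same engine output falls into the three categories differently as the acceptance threshold moves.

```python
from typing import Dict, List, Tuple


def score_against_truth(
    results: List[Tuple[str, float]], truth: str, threshold: float
) -> Dict[str, int]:
    """Count correct, substituted, and rejected characters against a truth file.

    results holds one hypothetical (character, confidence percentage) pair
    per character position; truth is the known-correct text.
    """
    counts = {"correct": 0, "substitution": 0, "rejected": 0}
    for (char, confidence), expected in zip(results, truth):
        if confidence < threshold:
            counts["rejected"] += 1      # low confidence: sent to manual verification
        elif char == expected:
            counts["correct"] += 1       # accepted and right
        else:
            counts["substitution"] += 1  # accepted but wrong: the costly case
    return counts


# The engine read "17O4" where the truth file says "1704".
results = [("1", 95.0), ("7", 60.0), ("O", 80.0), ("4", 99.0)]
low = score_against_truth(results, "1704", threshold=50.0)
high = score_against_truth(results, "1704", threshold=90.0)
```

With the low threshold the wrong "O" is accepted as a substitution. With the high threshold the substitution disappears, but the correctly read "7" is rejected along with it and must be verified by hand.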

Voting Eliminates Errors and Improves Accuracy Rates
Voting takes the output from two or more recognition engines and compares the results, voting on the most likely. Voting is often cited as a method for improving recognition accuracy on difficult types of images; however, this is inaccurate. Voting is designed to eliminate errors, increase accuracy percentages, or both at the same time. Whether an OCR application favors fewer errors at the same accuracy level or higher accuracy at the same error level is controlled by various switches within the OCR engine. All OCR engines produce more than one result and assign a likelihood percentage to each. Voting takes the recognition results from multiple engines and compares them, in some cases eliminating an engine and in others combining them to improve the results. In forms processing and handprint applications, voting can be remarkably attractive.

How Voting Works
Voting uses the answers from more than one OCR engine to increase accuracy. The presumption is that different engines use different methodologies to recognize a character. Over the last few years it has evolved from simply using two or three separate engines with majority voting to drawing on an understanding of the internal processes of each engine. To appreciate this, it is useful to review the different voting techniques in use today.

Simple Voting
A simple voting algorithm determines a character based on the majority ranking alone, not on confidence factors. It needs at least two engines to work, but three engines produce better results. Depending on how many engines the system runs, the likelihood can be adjusted accordingly. It is a simple and effective way for manufacturers of forms processing systems to reduce errors, but performance can be improved further by using the engines' confidence levels.
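A minimal sketch of majority voting follows. It assumes every engine returns one character per position with identical segmentation, which real engines do not guarantee. The function names and the "?" reject mark are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the character chosen by a strict majority of engines, else None."""
    if not answers:
        raise ValueError("need at least one engine answer")
    char, votes = Counter(answers).most_common(1)[0]
    return char if votes * 2 > len(answers) else None


def vote_field(engine_outputs: List[str], reject_mark: str = "?") -> str:
    """Vote position by position across engine outputs of equal length."""
    voted = (majority_vote(chars) for chars in zip(*engine_outputs))
    return "".join(reject_mark if char is None else char for char in voted)


three_engines = vote_field(["1704", "17O4", "1704"])
two_engines = vote_field(["1704", "17O4"])
```

With three engines the lone "O" is outvoted. With two engines a disagreement can only be flagged for review, not resolved, which is one way to see why a third engine helps.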

Use of Confidence Levels
The next level of voting uses the confidence levels reported by the OCR engines. In this case no more than two engines are needed, as the system has much more information to work from.
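One possible confidence-based rule for two engines is sketched below. The rule itself (accept on agreement, otherwise accept the stronger answer only if it clears a threshold and leads by a margin) and the default values of 80 and 30 are illustrative assumptions, not the method described in the white paper.

```python
from typing import Optional, Tuple

Reading = Tuple[str, float]  # (character, confidence percentage)


def confidence_vote(
    a: Reading, b: Reading, threshold: float = 80.0, margin: float = 30.0
) -> Optional[str]:
    """Combine two engine readings of one character; None means reject."""
    (char_a, _), (char_b, _) = a, b
    if char_a == char_b:
        return char_a  # both engines agree
    # Engines disagree: order the readings by confidence, strongest first.
    (top_char, top_conf), (_, low_conf) = sorted(
        [a, b], key=lambda reading: reading[1], reverse=True
    )
    if top_conf >= threshold and top_conf - low_conf >= margin:
        return top_char  # one engine is clearly more certain
    return None  # too close to call: send to manual verification


clear_winner = confidence_vote(("8", 92.0), ("B", 40.0))
too_close = confidence_vote(("8", 85.0), ("B", 70.0))
```

Unlike majority voting, this rule can resolve a disagreement between only two engines, because the confidence values supply the extra information that a third vote would otherwise provide.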

Use of Internal Information
The latest level of voting uses two engines whose internal processes are different but known. Understanding the choices available at each stage in the building blocks of OCR lets the voting system determine the best choice. It is a multi-iterative process that relies on much of the processing power now available. The benefit is a higher level of true accuracy and fewer substitutions.

Voting systems reduce expensive errors. If the voting engine has access to the internal OCR processes, it can make the fine adjustments in its iterative process that are needed to reduce substitutions on the most problematic characters. If you are interested in more information on this subject please download the complete White Paper with many illustrations at http://www.odt-oce.com/usa/pdf/voting.pdf

Harvey Spencer