Harvey Spencer Offers Valuable Synopsis of Informative OCE Article
OCR and ICR technologies are extensively used in forms processing solutions. Voting -- a technology which was initially implemented a number of years ago -- has been shown to substantially improve accuracy. The white paper, "Improving OCR's Accuracy Through Expert Voting," discusses in detail how the latest voting techniques work and how they reduce the incidence of expensive substituted characters. It makes extensive use of diagrams and graphs to illustrate the points.
The paper starts with the basic building blocks of OCR. These are:
- Line Find and Extraction
- Hand or Machine Print Detection
- Character Segmentation
- Feature Extraction
- Classification of Character
- Validation or Proving the Accuracy
- Logical Context
- Geometric Context
- Formal Context
- Trigram (three consecutive letters) Analysis (proprietary to Oce)
- Dictionary Lookup
OCR (Optical Character Recognition) extracts and analyzes the shape of a bitmapped character and assigns a value to it. Different OCR products use different baseline methodologies, usually either template matching or a mathematical analysis consisting of feature analysis and feature extraction. These analyses usually produce a range of possible results, so they are supplemented by post-analysis validations that identify the most likely result, followed by the possible alternatives. Each possible character is assigned a likelihood percentage. Each character is tested against the validations iteratively within an engine, with the analysis and check performed multiple times -- sometimes as many as 10 times within one engine -- to derive the most likely result.
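To make the idea of ranked candidates plus a post-analysis validation concrete, here is a minimal sketch of a dictionary lookup applied to per-character likelihoods. It is only an illustration under stated assumptions: the function name `best_dictionary_word`, the likelihood values, and the rule of multiplying likelihoods are hypothetical and are not taken from the white paper or from any particular OCR engine.

```python
from itertools import product


def best_dictionary_word(candidates, dictionary):
    """Return the most likely candidate word that appears in the dictionary.

    candidates: one list per character position, each holding
                (character, likelihood) pairs reported by the engine.
    dictionary: set of valid words for this field (hypothetical).
    Returns None when no combination matches a dictionary entry.
    """
    best_word, best_score = None, 0.0
    for combo in product(*candidates):
        word = "".join(char for char, _ in combo)
        # Assumption: combine per-character likelihoods by multiplication.
        score = 1.0
        for _, likelihood in combo:
            score *= likelihood
        if word in dictionary and score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical engine output: the top choice at position 2 is the digit "1".
candidates = [
    [("c", 0.80), ("e", 0.20)],
    [("1", 0.55), ("l", 0.45)],
    [("a", 0.90), ("o", 0.10)],
    [("y", 0.95), ("v", 0.05)],
]
dictionary = {"clay", "cloy", "play"}

raw_top = "".join(position[0][0] for position in candidates)
validated = best_dictionary_word(candidates, dictionary)
print(raw_top, "->", validated)
```

In this toy case the raw top choices spell "c1ay", while the lookup prefers the alternative "l" because only that combination forms a dictionary word. As the article notes later, many form fields have no dictionary to check against, so this kind of validation is not always available.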
Simple Fonts Give High Accuracy
Most of today's OCR software can produce highly accurate results with well-formed laser or good quality machine printed text. This is particularly true of simple 10-16 point fonts with fixed text and fixed character spacing on a plain background. However, variable widths, proportional fonts, kerning, large text or exotic fonts will reduce accuracy, and hand printed characters pose even larger challenges.
Differences between Full Text OCR and Forms Processing
Full text OCR is designed to convert a page of similar machine printed textual elements interrupted by photographs or diagrams, often formatted into two or more columns. The software needs to understand and decode this formatting as well as identify and capture the fonts used so as to enable easy editing. Forms processing OCR is designed to capture transactional data from a form in an ASCII format, typically to update a back-end computer system.
Forms Processing Poses Challenges
The challenges in forms processing are greater. Data on forms can come from carbon or carbonless forms; the printer may be a dot matrix; the original scanned form may have been a fax; and the background of the form may interfere with the foreground. Fields may not have dictionary entries to look up. Some fields may be filled in with constrained handprint or, worse, with unconstrained handprint. Sometimes a field may have mixed data types, and frequently a field may contain either machine print or handprint, varying from form to form.
Accuracy Statistics Do Not Catch Substitutions
While everyone wants accurate conversion, accuracy is a difficult concept because different vendors measure it differently. Some vendors define accuracy as the percentage of all characters output as "recognized" by the OCR engine, regardless of whether the character has in fact been correctly recognized.
One of the parameters available from a good OCR engine is the acceptance threshold, which allows the user to trade substitution rates against rejection rates. Generally speaking, a low acceptance threshold will return more "recognized" characters containing more errors, while a high acceptance threshold will do the opposite.
So if the acceptance threshold is set too low, the engine will accept a very high percentage of characters and may include some characters which it thinks are correct, but which are in fact wrong. These are known as substitutions and they represent the most expensive errors to correct. On the other side, setting the acceptance threshold too high results in more rejected characters. Even though most of these rejected characters may have been recognized correctly, they need to be verified in a very labor-intensive post-processing step. Eliminating substitutions while keeping the rejection rate low should be the true goal of a good OCR engine.
In Oce's world there are three categories of character recognition: characters recognized correctly (Oce knows the correct characters because it checks the output against a pre-known 'truth file'); errors or substitutions, i.e. characters which have been output AND have been wrongly recognized; and low-confidence characters.
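The trade-off between substitutions and rejections can be illustrated with a short sketch that scores engine output against a truth file at two acceptance thresholds. The function `categorize`, the sample characters, and the confidence values are all hypothetical; real engines expose their thresholds and confidences in their own ways.

```python
from collections import Counter


def categorize(results, truth, threshold):
    """Count correct, substituted and rejected characters.

    results:   list of (character, confidence) pairs from one engine.
    truth:     the known-correct string (the 'truth file').
    threshold: acceptance threshold; lower confidences are rejected.
    """
    counts = Counter()
    for (char, confidence), expected in zip(results, truth):
        if confidence < threshold:
            counts["rejected"] += 1      # sent to manual verification
        elif char == expected:
            counts["correct"] += 1
        else:
            counts["substituted"] += 1   # accepted but wrong
    return counts


# Hypothetical engine output for the truth string "B8O0S5".
truth = "B8O0S5"
results = [("B", 0.99), ("B", 0.62), ("O", 0.95),
           ("O", 0.55), ("S", 0.97), ("5", 0.70)]

low = categorize(results, truth, threshold=0.50)
high = categorize(results, truth, threshold=0.90)
print("low threshold: ", dict(low))
print("high threshold:", dict(high))
```

With the low threshold every character counts as "recognized", yet two of the six are substitutions. With the high threshold the substitutions disappear, but three characters are rejected, one of which had actually been read correctly.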
Voting Eliminates Errors and Improves Accuracy Rates
Voting takes the output from two or more recognition engines and compares the results -- voting on the most likely. Voting is often cited as a method to try to improve the recognition accuracy from difficult types of images; however, this is inaccurate. Voting is designed to eliminate errors and/or increase accuracy percentages at the same time. Whether an OCR application gets fewer errors at the same accuracy level, or higher accuracy at the same error level, is controlled by various switches within the OCR engine. All OCR engines produce more than one result and assign likelihood percentages to each result. Voting takes the recognition results from multiple engines and compares them, in some cases eliminating an engine, in others combining them to improve the results. In forms processing applications and in handprint applications, voting can be remarkably attractive.
How Voting Works
Voting uses the answers from more than one OCR engine to increase accuracy. The presumption is that different engines use different methodologies to recognize a character. Voting has evolved over the last few years from simply using two or three separate engines with majority voting to drawing on an understanding of the internal processes of each engine. To appreciate this, it is useful to review the different voting techniques in use today.
Simple Voting
A simple voting algorithm will determine a character based on the majority ranking alone and not on confidence factors. It needs at least two engines to work, but three engines produce better results. The likelihood can be adjusted according to how many engines the system runs. It is a simple and effective way for manufacturers of forms processing systems to reduce errors, but it is possible to further improve performance by using the confidence levels.
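A minimal sketch of majority voting might look like the following. It assumes the engines' outputs are already aligned character by character, which real systems cannot take for granted, and the names `majority_vote` and `REJECT` are hypothetical.

```python
from collections import Counter

REJECT = "?"  # hypothetical marker for a character sent to manual review


def majority_vote(readings):
    """Vote character by character across engine outputs.

    readings: list of equal-length strings, one per engine.
    A character is accepted only when more than half of the engines
    agree on it; otherwise the position is rejected.
    """
    voted = []
    for chars in zip(*readings):
        top, count = Counter(chars).most_common(1)[0]
        voted.append(top if count > len(chars) / 2 else REJECT)
    return "".join(voted)


# Hypothetical readings of the same field: one engine confuses 0 and O.
two_engines = majority_vote(["1O45", "1045"])
three_engines = majority_vote(["1O45", "1045", "1045"])
print(two_engines, three_engines)
```

With two engines a disagreement can only be rejected, because neither reading has a majority. A third engine can break the tie, which is one way to see why three engines give better results than two.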
Use of Confidence Levels
The next level of voting uses the confidence levels reported by the OCR engines. In this case you do not need more than two engines, as the system has a lot more information to work from.
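One way to picture confidence-based voting is the sketch below, which sums the confidence each engine gives to a candidate and accepts the winner only if it leads the runner-up by a margin. The combination rule, the `margin` parameter, and the sample values are assumptions made for illustration; the white paper's actual method is not described here. For brevity the sketch uses only each engine's top candidate, although engines typically report several.

```python
from collections import defaultdict

REJECT = "?"  # hypothetical marker for a character sent to manual review


def confidence_vote(readings, margin=0.25):
    """Combine aligned (character, confidence) outputs from several engines.

    readings: one list per engine of (character, confidence) pairs.
    margin:   assumed minimum lead of the winning candidate over the
              runner-up; smaller leads are rejected.
    """
    voted = []
    for candidates in zip(*readings):
        scores = defaultdict(float)
        for char, confidence in candidates:
            scores[char] += confidence
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        best_char, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        voted.append(best_char if best_score - runner_up >= margin else REJECT)
    return "".join(voted)


# Hypothetical output from two engines reading the same field.
engine_a = [("1", 0.90), ("O", 0.55), ("4", 0.97), ("5", 0.60)]
engine_b = [("1", 0.80), ("0", 0.92), ("4", 0.95), ("S", 0.58)]
result = confidence_vote([engine_a, engine_b])
print(result)
```

Here two engines are enough to settle the second character, because one engine is far more confident than the other. The fourth character is rejected, since both engines are about equally unsure.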
Use of Internal Information
The latest level of voting uses two engines whose internal processes are different but known. It draws on an understanding of the choices available at each stage in the building blocks of OCR to determine the best choice. It is a multi-iterative process that requires much of the processor power now available. The benefit is a higher level of true accuracy and reduced substitutions.
Conclusion
Voting systems reduce expensive errors. If the voting engine has access to
the internal OCR processes, it can make the fine adjustments in its
iterative process that are needed to reduce substitutions on the most
problematic characters.
If you are interested in more information on this subject please download
the complete White Paper with many illustrations at
http://www.odt-oce.com/usa/pdf/voting.pdf