Optical Character Recognition - The Mature Technology With The Brilliant Future

A look at the role of recognology in electronic document management systems.

Many people think that OCR technology is relatively new, something that has been with us for only a few years. However, the roots of OCR can be traced back a long way. In fact, the first patents involving OCR were awarded in 1809. Around 1870, a Mr. Carey from Boston patented an image transmission system that used a mosaic of photocells. This was an early example of a "retina scanner." According to Jacob Rabinow, an inventor and scientist at the National Bureau of Standards, "The great American pioneer in the OCR art was David Shepard, who founded the Intelligent Machine Research Corporation in Washington, D.C. in 1950 to develop and build OCR equipment." Mr. Shepard is currently the Chairman of the Board of Cognitronics Imaging Systems and is still active in the business. He is credited with developing and patenting the first modern practical OCR system in 1951. In the decades that followed, new firms entered the automatic data capture field. Large-scale commercial development of OCR equipment began in the 1970s. During the past forty years, optical character recognition systems have come a long way, from one-of-a-kind special purpose readers to the multi-purpose production and interactive online systems of today. This progress has lowered data capture costs and has led to the development of more reliable and accurate OCR systems.

Electronic Document Management Systems
Electronic document management (EDM) systems are integrated information management systems enabled with modern imaging and information technologies, which increase the productivity and flexibility of the user's work processes. These systems, when enhanced with OCR technologies, usually provide a more complete system solution for users' work process automation requirements. They do this by using electronic imaging and OCR to enable and automate the work process.

How Scanning Works
The data to be entered into the system may be read by the OCR reader from paper documents that are ordinary typed, special font-typed, handprinted, laser printed, dot-matrix printed, line printed or magnetic ink encoded as well as from electronic documents and microfilm. At this time, data can be read from paper documents at throughputs of up to 2,400 documents per minute, with instantaneous character recognition rates in excess of 4,000 characters per second. This is the "real world" performance seen in current check processing and payment processing systems. When capturing data from pages, in real time, the character acceptance rate of these data capture systems has been recorded at 98 to 99% and the accuracy rate is better than 99.999% (i.e., one error in each 100,000 valid characters).
Most commercial electronic document management systems have scanning resolutions of 125 to 400 dots (pixels) per inch in each direction. Scanner gray-scale capabilities vary, with the ability to assign as many as 256 shades of gray to each picture element (pixel) in the digitized bit mapped image. The price of desktop EDM systems ranges from $35,000 to $190,000 depending on the speed, performance and amount of manual entry required of the system. The price of high volume, high speed EDM systems can range from $100,000 to $1 million.
An image scanner converts printed data in text or graphic form from paper documents or microfilm into electrical digital signals that can be processed or stored in a digital computer. When scanning (digitizing), the scanner illuminates the paper documents containing graphics or text with a high-intensity light of known spectral frequency. The reflected light from the media (paper or microfilm) is focused onto a precision light sensitive transducer which, in most cases, will be an array of charge-coupled device (CCD) cells.
The CCD cells in the scanner array convert the reflected light energy into electronic signals that are representative of the data being scanned. Each signal is proportional to the intensity of the reflected light received on the surface of the corresponding CCD cell in the linear array. A null signal, indicating no reflected light output by the CCD, represents a black pixel. A maximum signal output from a CCD cell indicates that the pixel being evaluated is pure white. Shades of gray are generated by intermediate analog signal levels between pure white and pure black. The number of shades of gray in a given scanner depends on the number of binary bits assigned to each digital output from each CCD element. Analog electronic signal levels are converted to digital levels in the analog-to-digital converter which is embedded in the scanning electronics. The resultant digital signals are then stored in the computer's random access memory for recognition in real time systems or for further processing in slow-speed off-line systems. EDM scanners have resolutions of up to 400 dots (pixels) per inch in each direction. This means that each scan across the width of an 8 inch by 10 inch document image can contain up to 3,200 dots or pixels. If the lengthwise resolution of the scanner is also 400 dpi, there will be 4,000 dots in each vertical column (the 10-inch length of the image). Therefore, at one bit per pixel, it requires about 12.8 million bits of digitized data to represent an uncompressed 8 inch by 10 inch image (8 x 10 x 400 x 400) digitized at 400 dpi.
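The digitizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, and the convention that the analog CCD level is expressed as a number between 0.0 (black) and 1.0 (pure white), are assumptions made for the example and do not describe any particular scanner.

```python
def quantize(level: float, bits: int) -> int:
    """Map an analog level in [0.0, 1.0] to an integer gray code.

    0.0 stands for a null signal (black) and 1.0 for the maximum
    signal (pure white). With `bits` bits per pixel there are
    2**bits possible codes, i.e. 2**bits shades of gray.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    max_code = (1 << bits) - 1
    return round(level * max_code)


def uncompressed_bits(width_in: int, height_in: int, dpi: int,
                      bits_per_pixel: int = 1) -> int:
    """Number of bits needed to hold an uncompressed scanned image."""
    pixels_per_scan_line = width_in * dpi   # dots across the width
    scan_lines = height_in * dpi            # dots down the length
    return pixels_per_scan_line * scan_lines * bits_per_pixel


if __name__ == "__main__":
    print(quantize(0.0, 8), quantize(1.0, 8))   # black and white codes
    print(uncompressed_bits(8, 10, 400))        # bitonal 8 x 10 page
```

For an 8 inch by 10 inch page at 400 dpi and one bit per pixel, `uncompressed_bits` returns the 12,800,000 bits quoted above.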

Storage Requirements
A bitonal (black and white) scanned image of an 8 x 10 inch page requires about 900,000 bits of memory storage when compressed with a 14:1 compression algorithm. This is a significant amount of storage – it would require about three conventional 5-1/4 inch floppy diskettes to store it. If the image were a photograph, containing 16 shades of gray, it would require four times more data storage to do an acceptable job of reproduction. Reproducing a photograph as a half-tone is like trying to paint a picture with only two colors: the white of the printing stock and the black of the ink. Minimizing the gray scale information for an image reduces the storage requirement in direct proportion to the number of binary bits used to define each pixel of the image. Thus, converting an image with 256 shades of gray, or 8 bits of data per pixel, to a bitonal image requires one-eighth as much storage, because a half-tone image is really a bitonal image that can be stored at one bit per pixel rather than the 8 bits per pixel required for 256 shades of gray. Data compression algorithms can also reduce the storage requirement by as much as 50 percent. However, the system data storage requirements of an electronic image database with only a few hundred photographs are enormous, and such a database is a candidate for an optical or digital storage system.
Initially, image scanners could handle either data or images. As microprocessors became lower in cost, faster and more powerful, suppliers and systems integrators offered effective software solutions for most text recognition applications. Now virtually all EDM systems use OCR recognology to convert scanned digital text from images automatically into digital text files for text recognition, document management, word processing and desktop publication.
Today, there is a need for more accurate, faster and more reliable OCR system components for automated data capture applications involving "real world" multi-font machine print, mark sense and unconstrained unsegmented alpha-numeric handprint.
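The storage figures earlier in this section follow from simple arithmetic, sketched below in Python. The function name and its parameters are illustrative assumptions, and the sketch assumes that the number of shades of gray is a power of two and that the compression ratio applies uniformly to the whole image.

```python
def stored_bits(width_in: int, height_in: int, dpi: int,
                gray_shades: int = 2,
                compression_ratio: float = 1.0) -> float:
    """Estimate the bits needed to store a scanned image.

    gray_shades=2 means bitonal (1 bit per pixel); 16 shades need
    4 bits per pixel and 256 shades need 8 bits per pixel.
    """
    if gray_shades < 2:
        raise ValueError("an image needs at least two shades")
    # Bits needed to number the shades 0 .. gray_shades - 1.
    bits_per_pixel = (gray_shades - 1).bit_length()
    raw_bits = (width_in * dpi) * (height_in * dpi) * bits_per_pixel
    return raw_bits / compression_ratio


if __name__ == "__main__":
    page = stored_bits(8, 10, 400, compression_ratio=14)
    print(f"bitonal page at 14:1: about {page:,.0f} bits")
    ratio = stored_bits(8, 10, 400, gray_shades=256) / stored_bits(8, 10, 400)
    print(f"256 shades versus bitonal: {ratio:.0f} times the storage")
```

With these assumptions a bitonal 8 x 10 inch page at 400 dpi compressed 14:1 comes to roughly 914,000 bits, which matches the "about 900,000 bits" quoted above, and an image with 16 shades of gray takes four times the storage of the bitonal one.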

Text Recognition
OCR "text-only" scanners are faster at reading characters than image scanners with auxiliary optical character recognition software. Dedicated OCR readers handle a wider variety of typefaces, line spacings and manual editing marks. Optical character recognition software recognizes patterns of dots (bits) from electronic bit-maps as complete characters and converts each character into ASCII code. OCR software can convert the resulting ASCII files to a format compatible with a word processor or spreadsheet. Text scanning software (machine print) recognition systems can use artificial intelligence and neural networks to attempt to "learn" typefaces and their ASCII equivalents. This requires a training phase in which the recognition system reads real live documents. Other reliable recognition systems employ prior knowledge about certain typefaces and character shapes. Some OCR text recognition system software uses dictionaries to improve text recognition read rates. However, the dictionary matrix provided can be biased for the typefaces and syntax scanned. This class of scanners represents the high end of text recognition OCR and is typically used for desktop publishing and litigation support applications. High-end text recognition software also includes spell checking and the flagging of suspicious characters or words for offline video reject re-entry. However, desktop text recognition OCR systems with "omnifont recognition" software usually lack the reliability, speed and precision required for automated data capture or automatic indexing applications. Text recognition software should not be used for data capture and automatic indexing applications where accuracy is paramount and there is no context available.
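The idea of recognizing a pattern of dots as a character, and flagging a suspicious one for re-entry, can be illustrated with a toy example in Python. This is not how any commercial product works: the 3 x 5 templates, the distance threshold and the "?" reject symbol are all assumptions chosen to keep the sketch short.

```python
# Hypothetical 3 x 5 dot templates; '#' is a black dot, '.' is white.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}


def dot_distance(a, b):
    """Count the dots that differ between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def recognize(bitmap, max_distance=2, reject="?"):
    """Return the best matching character, or `reject` if unsure."""
    scored = sorted((dot_distance(bitmap, template), char)
                    for char, template in TEMPLATES.items())
    best_distance, best_char = scored[0]
    if best_distance > max_distance:
        return reject                      # nothing matches well enough
    if len(scored) > 1 and scored[1][0] == best_distance:
        return reject                      # two candidates tie
    return best_char


if __name__ == "__main__":
    noisy_zero = ("###", "#.#", "#.#", "#.#", "##.")   # one dot missing
    blank = ("...", "...", "...", "...", "...")
    print(recognize(noisy_zero), recognize(blank))
    print(ord(recognize(TEMPLATES["7"])))              # ASCII code of '7'
```

A rejected character ("?") would be passed to an operator for re-entry rather than guessed, which trades read rate for fewer substitutions.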

Recognition Of "Real World" Characters
"Real world" characters are alpha and numeric characters which can be handprinted by people on tax forms, machine printed on change of address forms, medical and insurance claims and credit card invoices. Real world handprinted characters can be printed on paper using a #2 pencil, an ink pen, a ball point pen (red, black, or blue) or a soft (porous) tip pen. Real world handprinted characters do not always follow the printed guidelines half-toned inside the preprinted blind ink constraint boxes. In most cases, the real world characters are printed outside of the box or over the box and usually over each other (unsegmented characters). In tax reporting, insurance claims processing and change of address reporting applications, the input documents containing real world characters are usually received from a large number of people over a wide geographical area. In many cases the print quality of the characters is so poor that the data is almost unreadable by humans. Today's challenge is to accurately read real world characters automatically using OCR recognology. These characters contain a wide range of print quality and inks and are generated with a wide range of instruments. Characters are prepared by diverse groups of people who normally don't spend much time trying to print neatly or in the "pink handprint box" unless they anticipate a tax refund or some sort of financial return. The recognition systems employed on today's high-end, high-volume OCR systems are effectively reading these real world characters from real world paper (that is, paper which may be folded, spindled or bent). They read these characters accurately at high read rates within user acceptable error rates. Currently available OCR data-capture systems that effectively read real world handprint employ either a topological recognology or a neural network recognology system or both. 
The term "topological" is broadly defined as including curve tracing, feature analysis, matrix matching and topological recognologies. Real world characters in document processing applications must be read accurately and at an instantaneous character rate (characters per second) which supports the fast throughput speeds of the high performance scanning transports. That is, the recognology system must read real world characters better than a human can, faster than a human and at the speed of the document transport so as not to impact the throughput and effectiveness of the EDM system. These high-end, OCR enhanced systems usually have online "total data entry" (key from image workstations) to manually capture unreadable characters as well as manually correct OCR rejected characters. This is usually effected by using the video images of the unreadable or rejected characters displayed on a workstation monitor for an operator to view (heads-up) and key into the system (keying from image). Today, the U.S. "de facto" reading standard for the first pass OCR recognition of machine-readable numeric handprint characters is 90% ± 3% of the valid characters. This read rate will usually be associated with a "raw error rate" of not more than 1%. Thus, if we had 1000 valid handprint characters in the sample, we would expect the OCR system to read at least 900 and reject at most 100 characters. The accepted characters should contain no more than 9 errors (misreads or substitutes), that is, 1% of the 900 characters accepted. The rejected characters will be displayed in context and manually entered from the video image of the rejected characters displayed on the monitor of the reject re-entry workstation. If people print poorly, a human recognition system (the eye) is required to decide what was intended by using an image enhanced work station, which displays images of the rejected characters on the workstation monitor. However, the scanner/reader need not pause or be delayed, no matter how poor the input quality.
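The read rate and error rate arithmetic in the example above can be written out as a short Python sketch. The function name is hypothetical, and the rates are passed as whole percentages so that the calculation stays in exact integer arithmetic.

```python
def first_pass_expectations(valid_chars: int,
                            read_rate_pct: int = 90,
                            raw_error_pct: int = 1):
    """Return (accepted, rejected, max_errors) for a first OCR pass.

    accepted   - characters the reader is expected to accept
    rejected   - characters sent to reject re-entry
    max_errors - misreads tolerated among the accepted characters
    """
    accepted = valid_chars * read_rate_pct // 100
    rejected = valid_chars - accepted
    max_errors = accepted * raw_error_pct // 100
    return accepted, rejected, max_errors


if __name__ == "__main__":
    accepted, rejected, max_errors = first_pass_expectations(1000)
    print(accepted, rejected, max_errors)
```

For 1000 valid characters this gives 900 accepted, 100 rejected and at most 9 errors. Passing `read_rate_pct=87` or `read_rate_pct=93` shows the range implied by the ± 3% tolerance.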

User Choices
Today, users have two technologically different but valid methods available to effectively recognize "real world" characters – a proven reliable topological method and a new versatile accurate neural methodology. Both systems are cost effective, reading from digital images and accurately capturing data at the effective throughput speed of the scanner (real time). At this time, neural network recognition systems are being integrated into a wide range of automatic data-capture systems which read: credit applications, courtesy amount fields on checks, tax records, retail orders, insurance claims, service records, sales records and remittances. The two OCR recognologies (topological and neural network) are supplemental and facilitating technologies and not necessarily competing technologies. When used together they can sharply increase accuracy by minimizing substitutions while optimizing the OCR character read rate.
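One simple way two recognition systems could be used together is an agreement rule: a character is accepted only when both systems report the same result, and anything else is rejected for manual re-entry. The Python sketch below illustrates that rule. It is an assumption made for illustration, not a description of how any dual recognology product combines its results, and the "?" reject symbol is likewise hypothetical.

```python
REJECT = "?"   # hypothetical symbol for an unreadable character


def combine_char(topological: str, neural: str) -> str:
    """Accept a character only if both recognizers agree on it."""
    if topological == neural and topological != REJECT:
        return topological
    return REJECT


def combine_field(topological: str, neural: str) -> str:
    """Combine two readings of the same field, character by character."""
    if len(topological) != len(neural):
        raise ValueError("both readings must cover the same characters")
    return "".join(combine_char(t, n)
                   for t, n in zip(topological, neural))


if __name__ == "__main__":
    # The two readings disagree on the third character only.
    print(combine_field("12345", "12845"))
```

Under this rule a substitution reaches the output only when both systems make the same mistake, at the cost of sending more characters to reject re-entry. A less strict rule would recover some of that read rate.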

Topological Recognition
Topological recognition is a mature and proven recognition methodology used to capture characters quickly from images or source documents at an instantaneous character rate of hundreds of characters per second. This mature proven recognology has been successfully used for many years in data capture OCR systems. Until now, it has not been used extensively in combination with other recognologies. However, with users' current interest in whole page imaging using PCs in client/server architecture, this may be changing. Topological recognition is a character recognition methodology that relies primarily on the properties of printed characters (machine print or hand print) which endure when the character undergoes distortions.

Neural Network Recognition
A newer methodology for recognizing "real world" characters employs neural networks, known to the industry as recognition enhanced data entry (REDE), a high technology approach to optimize data entry accuracy. A basic REDE workstation consists of an electronic digital scanner, a microcomputer host, a neural network recognition co-processor and operating software. All data capture processing is performed from digital images of the document. The operating software extracts ("strips") the specific areas of the electronic digital image that contain the data to be captured. Images of these "parsed areas" are electronically processed through the REDE system as digital data. Currently, neural OCR readers can recognize 100 to 250 segmented handprinted characters per second. Tests using handprint characters read on a neural network have indicated that character read rates of 95% are possible with error rates of less than 1%. Using this methodology, paper documents can be scanned at resolutions of less than 120 dpi.

Market Demand
Scanner transport pricing is directly proportional to the availability and pricing of CCD devices. Market demand for imaging scanners with embedded OCR has increased by several orders of magnitude over the last five years primarily because of the increased demand for OCR in office automation and desktop publishing applications. Future EDM systems will be cost effective, faster and more reliable than current systems. The overall U.S. market size for electronic imaging products is pegged at $3.2 billion in 1996. This market segment is growing at a compound annual growth rate (CAGR) of 16% and is expected to exceed $5 billion in 1999. In 1996, the text recognition OCR segment of the EDMS market is estimated to be $60 million. The OCR segment is expected to grow at a CAGR of 29% and exceed $100 million in 1999. In 1996, the overall automatic document capture (ADC) segment of the EDMS market is expected to exceed $70 million. In 1999, the ADC segment of the market is expected to exceed $190 million. Overall, in 1999, the demand for OCR in document management applications is expected to exceed $290 million. The price for these OCR systems includes the pre-processing and post-processing of the images to be recognized. This indicates that OCR systems include much more than pure recognology. They include forms recognition, forms ID and image enhancement, for example.
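The growth figures above can be checked by compounding the 1996 estimates forward three years. The Python sketch below does this. The function name is an assumption, and the calculation assumes the quoted growth rate applies unchanged in each of the three years.

```python
def project(value: float, annual_growth: float, years: int) -> float:
    """Compound `value` forward at a constant annual growth rate."""
    return value * (1 + annual_growth) ** years


if __name__ == "__main__":
    imaging_1999 = project(3.2, 0.16, 3)    # $ billions
    ocr_1999 = project(60.0, 0.29, 3)       # $ millions
    print(f"electronic imaging, 1999: ${imaging_1999:.2f} billion")
    print(f"text recognition OCR, 1999: ${ocr_1999:.1f} million")
```

Under that assumption the electronic imaging market comes to roughly $5 billion in 1999 and the text recognition OCR segment to about $129 million, in line with the forecast of more than $100 million.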

OCR In 1996
Today, EDM systems use information technologies such as electronic imaging and OCR to enable the EDM system to automate the targeted work processes and meet the needs of users by increasing productivity and/or lowering costs. The market forecast above does not include text recognition systems applied to desktop publishing, word processing or retail applications. (The retail segment of the OCR-ICR market is at least twice as large as the form recognition segment involved with EDMS applications.) This paper is focused on OCR, an enabling technology, applied primarily to applications which include the data-capture of characters by OCR from tax returns, medical claims, payments, insurance claims and order entry.

The Future
Before 1996 is over, dual recognology systems that accurately read real world handprint and multi-font machine print will be widely accepted and used for data-capture applications in EDM systems. Both neural network and topological recognition systems will be commonplace in client/server based EDM systems. Such systems will be used to scan, automatic-index, store, process and retrieve electronic images in the postal, insurance, financial, medical and government markets. With dual recognition systems, substitutions will be minimized, resulting in accurate automatic indexing and more satisfied users. What will be the impact of neural OCR networks on automated data capture systems applications in the more distant future? Neural OCR networks, when packaged on high speed VLSI chips, will be able to process machine printed characters at instantaneous character rates up to 3,000 characters per second and handprinted characters at up to 1,800 characters per second. The recognition will be compatible for real time performance on high-speed, high volume, data capture EDM systems. Real time applications include payment processing, credit card processing and check processing as well as GIRO, postal and government applications. As neural OCR networks become more compatible with the mature recognologies now installed, we can expect to see neural OCR networks and topological OCR systems integrated into a common recognition system. This integrated reading system will contain the best of both worlds, much like banking and GIRO systems today which use OCR and MICR recognologies to accurately recognize characters on the same check (one error in each 500,000 valid characters). This is the world of automatic document capture today. By the year 2000, it will be better...much better.

Herbert F. Schantz