Magazine Article | February 1, 2000

Counting The Masses With Document Management And Scanning

Source: Field Technologies Magazine

The Federal Census Bureau is responsible for compiling data on the almost 300 million people in this country. Document management and scanning technology have helped to make that project a reality.

Integrated Solutions, February 2000
The decennial census has been called the largest peacetime mobilization in the world, and that description is probably accurate. From start to finish, the U.S. census takes years. It involves not only hundreds of thousands of employees, but also the participation of the United States' 274,201,482 citizens. And that's only on the human resource side. The information technology (IT) side is even more impressive: a database of every address in the United States, 12 regional collection centers, 520 local offices, a call center equipped to handle up to 11 million calls, and 115 million census forms.

If that's not pressure enough, just remember that census employees have to turn in their final report to the boss by December 31, 2000; the President can be somewhat more intimidating than your garden-variety manager. Governors also tend to get cranky when you don't deliver the information that determines their congressional districts. With all that pressure, and a job that gets bigger every decade, the Federal Census Bureau decided to contract with vendors for the 2000 decennial census. Although the Bureau knew that integrated technologies would expedite the census, the agency didn't have the expertise to implement the solutions itself. Partnering with the private sector turned out to be a prudent choice.

U.S. Address Database: A Starting Point For The Census
The entire census begins by compiling a complete address file of the United States. The Bureau's geographic support system is a digital map of the United States. All addresses reside on large VAX 8400 systems running OpenVMS. The information in the database is constantly changing, so field employees canvass areas with an address list to note changes or additions. If field employees find, say, three new addresses in a particular region, they then return to one of the 12 regional census centers (RCCs). The centers run three shifts to update information, and to compute and produce the census maps. SGI (Silicon Graphics) equipment running a UNIX operating system is used for all map updating. Each of the 12 RCCs takes the address information and sends it to the national processing center in Jeffersonville, IN. The processing center uses Dell PCs and servers running on a Windows NT platform. More than 1,000 PCs are staffed across two shifts just for the address changes. All addresses are captured by a large VAX system running VMS that feeds the information to the server. The center also scans the maps and puts them into digital format.

Managing 520 Local Offices
All 520 local census offices (LCOs) are connected to a telecom infrastructure. Cisco routers and switches are used throughout the decennial census. Each of the 12 RCCs will have up to 50 LCOs reporting to it. The centers are connected by telecom lines. The RCCs have 4,100 servers running UNIX local databases. Each center has an operational control system and a personnel payroll system. Each LCO connects to those systems, so all the information flows from those databases down to the LCOs.

All of the Dell network servers run Novell with Windows 95. The RCCs use SGI. The servers are backed up by a tape cartridge system every night. Any data that passes between the RCCs and headquarters uses T1 lines and Cisco routers and switches. Data comes through the Bowie computer center and is distributed to the census headquarters in Suitland, MD. The only contractor involved in this part of the operation is Unisys, which supplied the equipment for these offices. All the programming and integration for this equipment was done in-house by the Census Bureau.

Document Management And The Census
Verifying and compiling all the addresses in the United States is only the first step in the decennial census. The next crucial step is the dissemination and retrieval of the 2000 census survey forms. The Census Bureau sends the forms out in March 2000 via the United States Postal Service (USPS). Unfortunately, the Bureau expects only about a 61% return rate, or 70 million forms. For the remaining 45 million forms, it must hire hundreds of thousands of enumerators to collect information door-to-door. The Census Bureau has four data capture centers. This is the first stage in which the Census Bureau uses outside contractors: Lockheed Martin and TRW. With the exception of the facility in Jeffersonville, IN, the data capture centers have been built and staffed by TRW. Lockheed Martin is the technology provider for those centers.
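The return-rate arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in this article (115 million forms, a 61% expected return rate); the function and variable names are illustrative and not part of any Census Bureau system.

```python
# Figures quoted in the article; the names are illustrative only.
FORMS_MAILED = 115_000_000
EXPECTED_RETURN_RATE = 0.61


def split_by_response(forms_mailed, return_rate):
    """Return (forms expected back by mail, forms left for enumerators)."""
    returned = round(forms_mailed * return_rate)
    return returned, forms_mailed - returned


returned, follow_up = split_by_response(FORMS_MAILED, EXPECTED_RETURN_RATE)
print(f"Expected back by mail:           {returned:,}")
print(f"Left for door-to-door follow-up: {follow_up:,}")
```

The exact results, 70,150,000 and 44,850,000, round to the 70 million and 45 million cited above.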

When the forms arrive at a data capture center, they go through a two-stage operation: a high-speed check-in, and a data capture operation. Each form carries a unique bar code that can be wanded immediately through an envelope window. Capturing the full contents of every form by scanning at this stage would exceed the Bureau's mandated two-day turnaround time for forms, so check-in relies on the bar code alone. Once checked in, the forms are sorted and checked against an overall address list. This allows the Bureau to determine which recipients have not responded to the survey, which in turn lets the enumerators know which houses to visit.
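Conceptually, matching checked-in forms against the address list is a set difference: every address on the master list without a checked-in form is a candidate for an enumerator visit. The Python sketch below illustrates the idea only; the address IDs and the function name are hypothetical, and the article does not describe how the Bureau's systems actually implement this step.

```python
def addresses_needing_visit(master_list, checked_in):
    """Return, in sorted order, the address IDs with no checked-in form."""
    return sorted(set(master_list) - set(checked_in))


# Hypothetical address IDs, for illustration only.
master = ["A001", "A002", "A003", "A004"]
wanded = ["A002", "A004"]  # bar codes read at check-in

print(addresses_needing_visit(master, wanded))  # ['A001', 'A003']
```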

Lockheed Martin has provided docutronic sorters for this part of the process. The sorters can process up to 30,000 envelopes per hour, but the Bureau expects each sorter to average 15,000 to 25,000 envelopes per hour. The forms are then removed from the envelopes and prepared by TRW employees. The Bureau uses Kodak 923 dual-sided scanners, which image the front and back pages at 200 dpi (dots per inch). The forms have a yellow background, which, according to Michael Longini, chief of the Decennial Systems and Contracts Management Office (DSCMO) of the Federal Census Bureau, "drops out nicely," facilitating the scanning process and reducing file sizes. Once scanned, the images go through an automated image quality assessment (AIQA) process, which looks for defects: file sizes that are too large or too small, pages that have been folded over, and illegible or stray marks. Forms that pass the AIQA process go on to various other recognition stages. The Bureau even has a redesigned double-page detector to assist with document integrity on the Kodak scanners, which use optical character recognition (OCR) technology.
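To put those sorter rates in perspective, the sketch below converts them into total sorter-hours for the roughly 70 million forms expected back by mail. The article does not say how many sorters are installed, so the result is expressed as machine time rather than elapsed time; the function name is illustrative.

```python
EXPECTED_MAIL_RETURNS = 70_000_000  # approximate figure quoted in the article


def sorter_hours(envelopes, rate_per_hour):
    """Total machine-hours needed to sort `envelopes` at `rate_per_hour`."""
    return envelopes / rate_per_hour


# Low and high expected averages, then the rated maximum.
for rate in (15_000, 25_000, 30_000):
    hours = sorter_hours(EXPECTED_MAIL_RETURNS, rate)
    print(f"{rate:>6,} envelopes/hour -> {hours:,.0f} sorter-hours")
```

Dividing any of these totals by the number of sorters running in parallel (a figure not given here) would yield the elapsed time.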

The Bureau has also developed dictionaries to help improve the accuracy of the OCR-captured data. The Bureau scans two types of forms: a "short form" and a "pamphlet form." The "short form" is also referred to as the 100% form, since its questions are asked of every household. Although it can be up to 25½ inches long, it has fewer questions and is easier to scan than the pamphlet form. The pamphlet form contains the "short form," but also has 40 additional pages of demographic questions. The pages of the pamphlet form must be separated prior to scanning. For this job, the Bureau has developed "biscuit cutters," not unlike the ones your grandmother used to make breakfast, that actually punch out the staples in the booklets. This process turns the 40-page pamphlet into 10 separate sheets.

The last stage in the scanning process is called checkout, and it ensures that every form in the data center has been captured. Every form is bar code-wanded and sent to Census headquarters for confirmation of receipt. Once confirmed, the forms are sent to paper storage.

Turning Data Into Meaningful Information
Once scanned and stored, the census information is sent as ASCII data to the Federal Census Bureau headquarters in Suitland, MD. The Bureau uses a large VAX system running OpenVMS. This system is controlled using a Sun Solaris system. The incoming data goes through an "accuracy and coverage evaluation," which is an independent survey to measure the accuracy of the census.

The hard data is compared with the information that enumerators collected while canvassing addresses. Collating these data determines whether people have been overcounted or undercounted. Those results are integrated with the census processing system. That allows the Census Bureau to produce the "one number" census: it estimates how many people the actual count missed and uses that estimate to arrive at a single figure. Once this information is established, it is passed to the data access and dissemination system (DADS).

"We are trying to reduce the amount of printed information by giving people access to census data via the Internet," says Longini. The Census Bureau has awarded IBM the contract to build the Web site, and the company is now in the process of developing the system.

Assessing The Decision To Contract
When asked what prompted the Federal Census Bureau to contract out the work and whether it has been a success, Longini explains, "We defined our requirements and put contracts out for competition. We sent out a number of RFPs to different vendors. As in any government procurement, it was a best-value choice. In terms of hiring interviewers, we have no problems. But, we realized that we didn't have the technological expertise to implement many of the solutions we wanted. Partnering with the private sector has worked tremendously. The government has never really looked at contractors as partners working together. But, we approached it that way, and the Bureau has benefited from it."

Questions about this article? E-mail the author at DougC@corrypub.com.