Population and Housing Census 2001

Computer processing

Introduction

New technologies played a very important role in every stage of the Population and Housing Censuses carried out in 2001. In the stages prior to the collection of the census questionnaires, computer tools made it possible to design the different questionnaire models effectively and to process the information from the two pilot tests very quickly, so that the results could be used to improve the final operation. These tools also enabled the preparation, based on register information, of the files used to personalise the census questionnaires.

Computer processes also eliminated the need to visit the territory, as traditionally occurred one year before the Population and Housing Censuses, to create the Building and Commercial Premises Censuses and to prepare the itinerary notebooks the census agents would use (updates of the sections and the a-to-z). In the 2001 Census, this costly operation was eliminated by applying computer processes to administrative registers, mainly municipal registers and cadastres. Consequently, the itinerary notebooks could be printed beforehand, as could the personal data and addresses of the registered persons on the questionnaires.

The hiring of the temporary personnel needed to collect the information (census agents and group managers, over 40 thousand persons in all) was managed by each of the INE's Provincial Delegations, which processed not only the contracts but also the registrations and cancellations with the Social Security and managed the payment of wages.

New technologies also played an important role in the collection of the census data. The 500 regional offices that took part in the collection process were in contact with the Provincial Delegations and the Central Services of the INE via a special private network created for the census using mobile phones. Each regional office was provided with two personal computers (one of which was connected to the private census network), a printer and a bar code reader. This infrastructure was essential for controlling the collection operation, sending instructions quickly to all census offices, and reporting the workload carried out by each agent in order to calculate the variable part of their pay.

Although no previous census operation anywhere in the world had allowed respondents to complete the information via the Internet, the INE decided to take on the challenge of making Spain the first country to do so. To this end, the INE established a procedure with strict security measures for access to personal information. As a result, all persons who were registered at the address where they actually resided on the reference date used when pre-printing the personal information from the register were able to complete the questionnaires via the Internet. A further benefit of the Internet option was that it could be used by persons with visual disabilities. In the end, 13,818 households completed the census via the Internet, about one per thousand of all existing households. The second section of this chapter briefly summarises the technical characteristics of the systems used.

The computer systems installed in the regional offices allowed them to determine which households had completed the information via the Internet, so that no visit to the dwelling was needed to collect the paper questionnaire. They were also used to control the dispatch of questionnaires filled in on paper to the census production centre (in charge of capturing the information and of the computer processing). Consequently, processing could begin without waiting for the collection to be considered complete in each census section.

The INE's census production centre was created expressly for this function, with improvements in the building that accommodates it and the installation of the systems architecture and applications required. The third section describes the technical characteristics of the systems used.

This chapter ends with a brief description of the other computer processes that data will undergo, which will allow the INE to place the census information at the users' disposal.

Capturing questionnaires received via the Internet

The key ideas that defined the project were:

  • Spain was the first country in the world to offer the general population the possibility of completing the Census via the Internet, the general population being understood as all persons who were registered in the dwelling where they actually resided on the reference date used when pre-printing the questionnaires.

  • Completing the Census via the Internet guaranteed the confidentiality of the people taking part and was easy to use, with security procedures that depended on the information required.

  • Persons who completed the census via the Internet were rewarded with free statistical information on the geographical distribution of a surname of the respondent's choice (protecting statistical secrecy at all times).

  • Using the Internet allowed people with visual disabilities or other impairments to use the computer to complete the Population Census.

The process was approached as follows:

The design of the Spanish census operation included specific individual information on each citizen, obtained from the register database and printed beforehand on the paper questionnaires distributed among Spanish households.

The census questionnaire was hosted on a secure web server using SSL 3 (at http://censos2001.es). When the user did not need to modify the data from the register, the authentication mechanism involved the following identifiers: 1) PASSWORD1 (identity code included in each envelope containing the census questionnaires); 2) PASSWORD2 (password associated with the Census via the Internet, also included in each census envelope); 3) the ID number of the persons included on the register information sheet (ID number printed on the questionnaire beforehand) the name of the father and the mother as they appear on the household member's ID card (this information was not already printed on the census questionnaire). PASSWORD1 and PASSWORD2 were different for each dwelling.

When users had to modify the register information printed beforehand on the questionnaire, an electronic signature mechanism was used (class 2 X.509 certificates, by agreement with the Spanish Mint), complementing PASSWORD1 and PASSWORD2.
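The access rules described above can be illustrated with a minimal sketch. It is not the INE's implementation: the class and function names, the lockout threshold and the treatment of the personal identifiers as a single opaque value are all assumptions made for illustration (the text does not make clear how the identifiers in item 3 combine, nor how many failed attempts led to blocking).

```python
import hmac
from dataclasses import dataclass

MAX_FAILED_ATTEMPTS = 3  # hypothetical threshold; the source gives no figure


@dataclass
class DwellingAccess:
    """Credentials held by the server for one dwelling (hypothetical layout)."""
    password1: str        # identity code from the census envelope
    password2: str        # Internet census password from the same envelope
    personal_check: str   # personal identifier(s), treated here as one opaque value
    failed_attempts: int = 0
    blocked: bool = False


def _same(expected: str, given: str) -> bool:
    # Constant-time comparison to avoid leaking information through timing.
    return hmac.compare_digest(expected.encode(), given.encode())


def authenticate(record, password1, password2, personal_check,
                 modifies_register_data=False, has_valid_certificate=False):
    """Return True when access is granted under the assumed rules."""
    if record.blocked:
        return False
    ok = (_same(record.password1, password1)
          and _same(record.password2, password2)
          and _same(record.personal_check, personal_check))
    # Changing pre-printed register data additionally requires a certificate.
    if ok and modifies_register_data:
        ok = has_valid_certificate
    if not ok:
        # Every refused request counts towards blocking the questionnaire.
        record.failed_attempts += 1
        if record.failed_attempts >= MAX_FAILED_ATTEMPTS:
            record.blocked = True
    return ok
```

Certificate validation itself is reduced to a boolean flag here; a real system would verify the X.509 chain and signature.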

Closely linked to the authentication procedure, a series of measures was implemented to control incorrect accesses, fraud, the blocking or unblocking of questionnaires, etc.

The web server allowed the questionnaires to be completed in the different co-official languages in use in Spain and in certain foreign languages.

The regulations included a set of rules to follow when editing the questionnaire on the Internet, that is, the edits needed to guarantee the quality and consistency of each questionnaire completed via the Internet. The user was informed of any errors that would prevent the final acceptance of the information, so that they could be corrected immediately.

The system allowed users to interrupt the process of completing the questionnaire and resume it later. When the questionnaire had been completed correctly, the system provided the user with a number that served as a receipt, or proof that the questionnaire had been completed in full.

The mechanisms needed to communicate with the Regional Offices and the INE's Provincial Delegations were put in place, so that no census agent would claim questionnaires that had already been filled in via the Internet.

This communication with the Regional Offices allowed for different possibilities. The basic mechanism was a send-receive procedure which guaranteed that each Regional Office and Provincial Delegation had a weekly copy of a file containing the identification data of the questionnaires collected via the Internet. Query procedures by ranges of values were available as an alternative.
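A minimal sketch of how a regional office might use such a file follows. The file layout (one questionnaire identifier per line), the identifier format and the function names are hypothetical; the source does not describe them.

```python
from bisect import bisect_left, bisect_right


def load_internet_ids(lines):
    """Parse the periodic file: one questionnaire identifier per line (assumed)."""
    return sorted({line.strip() for line in lines if line.strip()})


def completed_online(sorted_ids, questionnaire_id):
    """True if the questionnaire appears in the list of Internet returns."""
    i = bisect_left(sorted_ids, questionnaire_id)
    return i < len(sorted_ids) and sorted_ids[i] == questionnaire_id


def ids_in_range(sorted_ids, low, high):
    """Query by range of values: all identifiers between low and high inclusive."""
    return sorted_ids[bisect_left(sorted_ids, low):bisect_right(sorted_ids, high)]
```

With a sorted list, an agent's office can check a single dwelling or pull every Internet return for a block of identifiers before planning visits.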

The following diagram shows the architecture of the systems and communications of the web housing services, which were provided by UTE INDRA/TELEFÓNICA.

[Image: architecture of the systems and communications of the web housing services]

Capturing questionnaires filled in on paper

The computer processing procedures used with census data were strongly conditioned by the enormous amount of information to be processed and by users' demand for a substantial reduction in the time needed to obtain census data. Both factors mean that, while the quality of the processes must be guaranteed, the census procedures have to be fast above all else.

The determining factors of the current computerised census production process are as follows:

  • Computerised processing of over 60,000,000 questionnaires (over 100,000,000 images between front and back pages of the questionnaires).
  • 5 types of questionnaires in 60 models (see note 1), giving a total of 120 different images for OCR (front and back pages of the questionnaires):
    • Register
    • Dwelling
    • Household
    • Individual
    • Itinerary Notebooks

  • Intelligent character recognition (ICR) of handwriting and marks (see note 2).
  • Use of digitalised images of the questionnaires for document management during production and post-production (see note 3).
  • Duration: production will be completed in less than 3 months.
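To give a sense of scale, the figures above can be turned into a rough sizing calculation. The 120 questionnaires per minute is the per-scanner rate quoted later in this section; the 90-day period and the 16 scanning hours per day are assumptions made only for illustration, and the result says nothing about the number of scanners actually installed.

```python
import math

QUESTIONNAIRES = 60_000_000   # "over 60,000,000" in the text
SCANNER_RATE_PER_MIN = 120    # per-scanner rate quoted in the text
PRODUCTION_DAYS = 90          # assumption: "less than 3 months" taken as 90 days
HOURS_PER_DAY = 16            # assumption: two scanning shifts per day


def scanner_hours(questionnaires, rate_per_min):
    """Total scanning time, in scanner-hours, at a constant rate."""
    return questionnaires / rate_per_min / 60


def scanners_needed(questionnaires, rate_per_min, days, hours_per_day):
    """Minimum number of scanners working in parallel to finish in time."""
    available_hours = days * hours_per_day
    return math.ceil(scanner_hours(questionnaires, rate_per_min) / available_hours)


if __name__ == "__main__":
    print(scanner_hours(QUESTIONNAIRES, SCANNER_RATE_PER_MIN))
    print(scanners_needed(QUESTIONNAIRES, SCANNER_RATE_PER_MIN,
                          PRODUCTION_DAYS, HOURS_PER_DAY))
```

The calculation ignores downtime, rescans and batch preparation, so it is a lower bound under the stated assumptions.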

As a result, the census production project involving questionnaires on paper is not only the largest project comprising document storage and management in Spain, but it is also the first project of this kind to be performed anywhere in the world.

The Census Production Centre (CPC) was located in San Fernando de Henares (Madrid). It covers 5,000 square metres and handles the production of the 2001 Population and Housing Censuses, except for the questionnaires completed via the Internet. Over 800 persons worked with the census data.

The diagram for the INE's census production involves the following areas of management:

  • Area A - Management of reception / dispatch of the census documentation: This unit controls and manages the areas of reception, control of the documentation, reception storage, dispatch storage and issuing. This area prepares the documentation in work batches and arranges its distribution for digitalisation. After completing the computer processes, the documentation is sent to the dispatch storage after verifying the integrity of the information.
  • Area B - Management of the digitalisation of the census documentation: This unit is in charge of correctly digitalising the census documentation and verifying the quality levels of the images. The section performs the preventive maintenance operations specified at the beginning of each working day and manages the incidents that appear during the digitalisation process.
  • Area C - Management of the census video-recording system: This unit is in charge of inputting the data of the characters that were not recognised by the census' computer system, and correcting the characters the system interprets erroneously. These processes are performed using a system that presents the image of the different census questionnaires on the screen.
  • Area D - Management of the Validation of census data: This area filters the census data after the corresponding data files are obtained, controls duplicates, false registrations, etc.
  • Area E - Management of the system used to process census itinerary notebooks: This unit is in charge of managing the specific process this type of document requires. The section is composed of digitalisation, control, video-recording and validation personnel.
  • Area F - Backup control: This unit is in charge of the control and operation of all the systems for the generation of backup copies to secure files.
  • Area G - Computer control management: This unit controls all processes, tasks and personnel working on the census production system. It controls all processes performed on work batches, following the sequence of documents, the coverage of the batches with the files from regional offices and the itinerary notebooks. This unit has to ensure that the productivity rates established are fulfilled.
  • Area H - Control management and administration, storage, communication and physical and logical security of the image files and the census data: This unit is in charge of the systems and media for storing images and data on the INE's general census production network. It is also in charge of the physical and logical security of the information (images and data). It is responsible for exporting the information and for communication with other census centres and/or INE centres. It has to solve computer problems that arise in the general census production network and must know the applications and the physical and logical systems well enough to resolve any incidents. The unit also has to maintain and optimise the physical and logical devices to ensure that production meets the established objectives.
  • Area I - Quality control: This unit is in charge of the video-correction processes required to test the reliability / efficiency rates of the census production and to determine whether they are appropriate or need to be improved. A work batch is not approved until it has been authorised by this unit.
  • Area J - Management of the incident system: This unit is in charge of solving incidents that appear regarding the census documentation (physical deterioration, incorrect identifications, control of the coverage with regional offices...). If a questionnaire presents physical deterioration that prevents correct digitalisation, the data must be entered traditionally to create a virtual questionnaire that will replace the damaged document. Other incidents are solved via the personalised digitalisation of each questionnaire and the subsequent video-recording. After solving the incidents regarding the questionnaires, the images and data are sent back to where the incident arose so that they can be integrated in the corresponding work batches.
  • Area K - Management of the control, monitoring and administration of the INE's general census production network: This unit is in charge of managing and controlling all the areas mentioned above so that census production is completed by the established deadline and in the best possible conditions. It is in charge of optimising devices, systems, etc., so as to maximise census production in line with the established plan. This section monitors the workflows in detail and reports to the INE on the results obtained and the planning foreseen. It is in continuous contact with the INE's Control Unit so as to achieve the expected quality rates.

[Image: diagram of the INE's census production areas]

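The physical and logical equipment needed to perform the 2001 Population and Housing Census, in compliance with the processing model established, is represented in the following diagrams:

[Image: physical and logical equipment, diagram 1 of 2]

[Image: physical and logical equipment, diagram 2 of 2]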

The capture operation is carried out via an OCR system that incorporates automatic coding, range control and inter- and intra-record coherence procedures. The operation includes the following steps:

  • Digitalisation via high-throughput optical scanners that process 120 questionnaires a minute. The model was a KODAK i810, a worldwide innovation, used for the first time in Europe for this operation.
  • Control of the coverage of the digitalisation
  • Intelligent recognition of handwritten characters
  • System for the improvement of literals and assisted coding
  • Video-correction associated to recognition and coherence controls
  • Control of the workflow
  • Quality control
  • Documentary management
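The range control and the intra- and inter-record coherence procedures mentioned at the start of this list can be illustrated with a small sketch. The field names and the specific rules below are hypothetical examples chosen for illustration; the source does not list the rules the capture system actually applied.

```python
def range_errors(person):
    """Range control: each value must fall inside its permitted interval."""
    errors = []
    if not 0 <= person["age"] <= 120:  # hypothetical bounds
        errors.append("age out of range")
    return errors


def intra_record_errors(person):
    """Intra-record coherence: fields of the same record must agree."""
    errors = []
    # Hypothetical rule: children should not report an occupation.
    if person["age"] < 16 and person["occupation"] is not None:
        errors.append("occupation reported for a person under 16")
    return errors


def inter_record_errors(household):
    """Inter-record coherence: records of one household must agree."""
    # Hypothetical rule: exactly one reference person per household.
    refs = sum(1 for p in household if p["relationship"] == "reference")
    return [] if refs == 1 else [f"expected 1 reference person, found {refs}"]
```

Records that fail any check would be routed to video-correction rather than accepted automatically.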

The applications developed to carry out the documentary management and the OCR processes are based on the Bellview Scan system (created by Pulse Train), incorporating dictionary-based systems for improving literals, as well as automatic coding, both developed and used by the company ODEC. This has resulted in recognition levels above 80% of the processed material, with the remainder completed by video-correction processes.

The architecture of the computer systems is designed using SAN (multiple servers sharing a secure storage system via Fiber Channel protocol), and incorporates strict physical and logical security measures, RAID 0+1 discs (mirrored discs), cluster switches and servers (duplicate servers working cooperatively), remote assistance via modem connected to security systems and notification to computer providers, chip cards to access systems, etc.

The work stations used in the digitalisation process required major processing capacity (SIEMENS Primergy B210 models, with two PIII Xeon processors at 1 GHz and 256 MB RAM), whilst the work stations used in the recognition process required a large memory (SIEMENS Scenic Di815E models with a PIII processor at 1 GHz and 512 MB RAM).

The Bellview application servers (two in cluster) and the management server for image files are quite similar: 4-way Primergy N400 machines with two Xeon processors at 700 MHz, the former two with 3 GB RAM and the image server with 1 GB RAM. The database servers (two in cluster) are 8-way Primergy N800 machines with two Xeon processors at 700 MHz and 4 GB RAM.

The PCs used to carry out the rest of the processes (warehouse management, video-filtering, quality control, etc.), which amount to over 200 units, are equipped with a Pentium III at 1 GHz or a Pentium IV at 1.2 GHz and 128 MB RAM.

The storage system is an EMC2 Symmetrix 8430 with 25 TB of capacity (a terabyte is a measure of computer data storage capacity equivalent to one thousand billion bytes), made up of 140 discs of 181 GB each.

The backup system is a Scalar 100 (LTO) with a 15 TB capacity, using 100/200 GB tapes, with a 15 MB/s transfer rate and a 324 Gb/h copy speed.

The equipment uses Windows 2000 Advanced Server as its operating system, which requires system units that can store up to 4.5 TB, a global record storage that exceeds the limits known in the Windows environment to date, and the Microsoft SQL Server 2000 database.

There is also a CD copy system, which will be used to send the images corresponding to each Municipal Register to the corresponding municipality.

Procedures performed after the information is processed at the Census Production Centre

When the Census production centre completes its tasks, the questionnaires have been scanned, recognised and validated. These processes use dictionaries that allow the coding of those questions that require a literal answer: province and municipality of birth or residence in 1991, activity, occupation, etc. However, the system does not achieve 100 per cent of the coding. Neither are the validations associated to coherence controls necessarily comprehensive, as they focus on eliminating the most important errors. Therefore, it is necessary to apply additional processes that allow the creation of final census files that can be used correctly from a statistical viewpoint.

The coding left incomplete by the Production centre is completed in two different ways. For records corresponding to Autonomous Communities that have signed a collaboration agreement, the statistics institutes of those communities are in charge of this coding and can use automatic, assisted or combined coding procedures, according to their possibilities, although these must always be coherent with those used by the INE for the rest of the State. This is especially relevant when the autonomous community has another official language, or when it has powerful dictionaries from other projects that aid this coding. For the records in the rest of the communities, the National Statistics Institute is in charge of this task. To do so, it will use an updated version of the procedures that proved successful in the 1991 Censuses, basically automatic coding via approximation, using dictionaries that are improved progressively.
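Automatic coding via approximation can be illustrated with a minimal sketch that looks up a literal answer in a dictionary and falls back to the closest spelling. This is a generic technique built on Python's standard `difflib` module, not the INE's procedure; the dictionary contents, the function name and the similarity cutoff are assumptions made for illustration.

```python
import difflib

# Hypothetical dictionary from normalised literal to municipality code.
MUNICIPALITY_CODES = {
    "madrid": "28079",
    "barcelona": "08019",
    "sevilla": "41091",
}


def code_literal(literal, dictionary=MUNICIPALITY_CODES, cutoff=0.8):
    """Return the code for a literal answer, or None if no close match exists."""
    key = literal.strip().lower()
    if key in dictionary:  # exact match after normalisation
        return dictionary[key]
    # Approximate match: closest dictionary entry above the similarity cutoff.
    close = difflib.get_close_matches(key, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[close[0]] if close else None
```

Literals that return `None` would go to assisted coding; confirmed spellings can then be added to the dictionary, which is one way a dictionary can improve progressively.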

Finally, both the records whose coding was completed by the statistics institutes of the Autonomous Communities and those coded by the INE are subjected to a single automatic imputation procedure carried out by the INE (like the automatic coding processes, this processing is carried out centrally in the Computer Science Department), aiming to eliminate inconsistencies. It comprises a probabilistic imputation process that preserves the original information as much as possible. The procedure thus produces the final file that will be used for statistical operations carried out by both the INE and the institutes of the Autonomous Communities. This considerably reduces the resources required (as a single process is used for all data in Spain) and creates a single final file, which prevents a single statistical source from providing different figures for the same phenomenon. This is also an important innovation.

Compared to previous Censuses, the level of use of automatic imputation is much lower, as the filtering and controls applied in the production centre have improved the quality of the data received. The INE will again implement the automatic imputation using the DIA system, developed by the INE and used previously in 1991 and in other surveys such as the APS.
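The idea of a probabilistic imputation that preserves the original information can be illustrated with a generic random hot-deck sketch. This is not the DIA system, whose internals are not described here; the function name, the record layout and the grouping variable are hypothetical.

```python
import random


def hot_deck_impute(records, field, group_key, rng=None):
    """Fill missing values of `field` with a value drawn at random from
    records in the same group; reported values are never modified."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    donors = {}
    for r in records:
        if r[field] is not None:
            donors.setdefault(r[group_key], []).append(r[field])
    result = []
    for r in records:
        r = dict(r)  # work on a copy so the input records stay untouched
        if r[field] is None and donors.get(r[group_key]):
            r[field] = rng.choice(donors[r[group_key]])
        result.append(r)
    return result
```

Because donors come from the same group, the imputed values follow the distribution observed among comparable respondents, and records with no suitable donor are left for other treatment.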


NOTES
1 Both register sheets and census questionnaires have bilingual models for each of the official Spanish languages (Castellano, Gallego, Mallorquín, Valenciano and Vasco).
2 The number of different types of handwriting that can be recognised in the questionnaire amounts to the total number of persons who complete them.
3 The images will be used to send all Spanish municipalities their corresponding register questionnaires.