Population and Housing Census 2001: Computer processing
New technologies played a very important role in every stage of the Population and Housing Censuses carried out in 2001. In the stages prior to the collection of the census questionnaires, computer tools made it possible to design the different questionnaire models effectively and to process the information from the two pilot tests very quickly, so that the results could be used to improve the final operation. These tools also enabled the preparation, based on register information, of the files used to personalise the census questionnaires.
Computer processes also eliminated the need to visit the territory one year before the Population and Housing Censuses, as was traditionally done, in order to create the Building and Commercial Premises Censuses and to prepare the itinerary notebooks the census agents would use (updates of the sections and the a-to-z). In the 2001 Census, this costly operation was eliminated thanks to the advantages provided by computer processes applied to administrative registers, mainly municipal registers and cadastres. Consequently, the itinerary notebooks, and the personal data and addresses of the registered persons shown on the questionnaires, could be printed beforehand.
The hiring of the temporary personnel needed to collect the information, namely census agents and group managers (over 40 thousand persons), was managed by each of the INE's Provincial Delegations, which processed not only the contracts but also the registrations and cancellations with the Social Security, and managed the payment of wages.
New technologies also played an important role in the collection of the census data. The 500 regional offices that took part in the collection process were in contact with the Provincial Delegations and the Central Services of the INE via a special private network, created for the census, that used mobile phones.
Each of the regional offices was provided with two personal computers (one of which was connected to the private census network), a printer and a bar code reader. This infrastructure was essential to make it easier to control the collection operation, to send instructions quickly to all census offices, and to send information on the workload carried out by each agent in order to calculate the variable part of their payment.
Although there was no precedent anywhere in the world of a census operation that allowed respondents to complete the information via the Internet, the INE decided to take on the challenge of making Spain the first country to carry out this operation. To that end, the INE established a procedure implementing strict security measures for access to personal information. As a result, all persons who, on the reference date used to pre-print the personal information from the register, were registered at the address where they actually resided were able to complete the questionnaires via the Internet. A further benefit of allowing respondents to complete the census via the Internet was that it could be used by persons with visual disabilities. In the end, 13,818 households completed the census via the Internet, which represents about one per thousand of all existing households. The second section of this chapter briefly summarises the technical characteristics of the systems used.
The computer systems with which the regional offices were equipped allowed them to determine which households had completed the information via the Internet, thus avoiding a visit to the dwelling to collect the paper questionnaire. They were also used to control the dispatches of questionnaires filled in on paper to the census production centre (in charge of capturing the information and of the computer processing).
Consequently, work commenced without having to wait for the collection to be considered complete in each of the census sections. The INE's census production centre was created expressly for this function, with improvements to the building that houses it and the installation of the required systems architecture and applications. The third section describes the technical characteristics of the systems used. This chapter ends with a brief description of the other computer processes that the data will undergo, which will allow the INE to place the census information at the users' disposal.
Capturing questionnaires received via the Internet
The key ideas that defined the project were:
The way the process was tackled is briefly described here.
The design of the Spanish census operation included specific individual information on each citizen, obtained from the register database and printed beforehand on the paper questionnaires that were distributed among Spanish households.
The census questionnaire was hosted on a secure web server using SSL 3 (at http://censos2001.es).
When the user did not need to modify the data from the register, the authentication mechanism involved the following identifiers: 1) PASSWORD1 (identity code included in each envelope containing the census questionnaires); 2) PASSWORD2 (password associated with the Census via the Internet, also included in each census envelope); 3) the ID number of the persons included on the register information sheet (ID number printed on the questionnaire beforehand); and the names of the father and the mother as they appear on the household member's ID card (this information was not printed on the census questionnaire). PASSWORD1 and PASSWORD2 were different for each dwelling.
When users had to modify the register information that was printed beforehand on the questionnaire, a mechanism involving an electronic signature was established (class 2 X.509 certificates, via agreement with the Spanish Mint), complementing passwords 1 and 2.
Closely linked to the authentication procedure, a series of measures were implemented to control incorrect accesses, fraud, the blocking or unblocking of questionnaires, etc.
The web server allowed the questionnaires to be completed in the different co-official languages in use in Spain and in certain foreign languages.
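The password-based authentication described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the INE's implementation: the `DwellingRecord` type, the `authenticate` function and the way names are normalised are all hypothetical, and the sample values are invented. It shows the idea that access is granted only when the two per-dwelling passwords, a pre-printed ID number and the parents' names all match the stored record.

```python
import hmac
from dataclasses import dataclass


def _norm(text: str) -> str:
    """Upper-case and collapse whitespace so trivial typing differences do not matter."""
    return " ".join(text.upper().split())


def _same(expected: str, supplied: str) -> bool:
    """Constant-time comparison of two strings (encoded to bytes for hmac)."""
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))


@dataclass(frozen=True)
class DwellingRecord:
    """Hypothetical stored data for one dwelling."""
    password1: str   # identity code printed in the envelope
    password2: str   # Internet census password printed in the envelope
    members: dict    # normalised ID number -> (father's name, mother's name)


def authenticate(record: DwellingRecord, password1: str, password2: str,
                 id_number: str, father: str, mother: str) -> bool:
    """Return True only if every identifier matches the stored record."""
    member = record.members.get(_norm(id_number))
    if member is None:
        return False
    checks = [
        _same(record.password1, password1),       # passwords compared exactly
        _same(record.password2, password2),
        _same(_norm(member[0]), _norm(father)),   # names compared after normalising
        _same(_norm(member[1]), _norm(mother)),
    ]
    return all(checks)
```

A caller would load the record for the dwelling identified by PASSWORD1 and pass the values typed by the respondent to `authenticate`. The electronic-signature path used when register data had to be modified is not covered by this sketch.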
The rules included a set of edits to be applied to questionnaires completed on the Internet; that is, the checks needed to guarantee the quality and consistency of each questionnaire, informing the user of any errors that would prevent final acceptance of the information, so that they could be corrected immediately.
The system allowed users to interrupt the process of completing the questionnaire and resume it later.
When the questionnaire had been completed correctly, the system provided the user with a number that acted as a receipt, or proof that the questionnaire had been fully completed.
The mechanisms needed to communicate with the Regional Offices and the INE's Provincial Delegations were installed, so that no census agent would ask for questionnaires that had already been filled in via the Internet. This communication with the Regional Offices allowed for different possibilities: the basic mechanism was a send-receive procedure that guaranteed that each Regional Office and Provincial Delegation had a weekly copy of a file containing the identification data of the questionnaires collected via the Internet; alternatively, query procedures by ranges of values were available.
The following graphs show the architecture of the systems and communications of the web hosting service, which was provided by the UTE INDRA/TELEFÓNICA.
Capturing questionnaires filled in on paper
The computer processing procedures used with census data were strongly conditioned by the enormous amount of information to be processed and by the substantially shorter time within which users demand census data. Both factors mean that, apart from guaranteeing the quality of the processes, the census procedures have to be fast above all else. The determining factors of the current computerised census production process are as follows:
As a result, the census production project involving questionnaires on paper is not only the largest document storage and management project in Spain, but also the first project of this kind to be performed anywhere in the world.
The Census Production Centre (CPC) was located in San Fernando de Henares (Madrid). It occupies 5,000 square metres and handles the production of the 2001 Population and Housing Censuses, except for the questionnaires completed via the Internet. Over 800 persons worked with the census data.
The diagram for the INE's census production involves the following areas of management:
The physical and logical equipment needed to perform the 2001 Population and Housing Census, in compliance with the processing model established, is represented in the following graphs:
The capture operation is carried out via an OCR system that incorporates automatic coding, range control and inter- and intra-record coherence procedures. The operation includes the following steps:
The applications developed to carry out the documentary management and the OCR processes are based on the Bellview Scan system (created by Pulse Train), incorporating dictionary-based systems to improve literals, as well as automatic coding, which have been developed and are being used by the company ODEC. This has resulted in recognition levels above 80% of the processed material, which are completed by video-correction processes.
The architecture of the computer systems is based on a SAN (storage area network: multiple servers sharing a secure storage system via Fiber Channel protocol), and incorporates strict physical and logical security measures, RAID 0+1 discs (mirrored discs), cluster switches and servers (duplicate servers working cooperatively), remote assistance via modem connected to security systems and notification to computer providers, chip cards to access systems, etc.
The work stations used in the digitalisation process required major processing capacity (SIEMENS Primergy B210 models, with two PIII Xeon processors at 1 GHz and 256 MB RAM), whilst the work stations used in the recognition process required a large memory (SIEMENS Scenic Di815E models with a PIII processor at 1 GHz and 512 MB RAM). The Bellview application servers (two in cluster) and the management server for image files are quite similar: 4-way Primergy N400 models with two Xeon processors at 700 MHz, the former two with 3 GB RAM and the image server with 1 GB RAM. The database servers (two in cluster) are 8-way Primergy N800 models with two Xeon processors at 700 MHz and 4 GB RAM. The PCs used to carry out the rest of the processes (warehouse management, video-filtering, quality control, etc.), which amount to over 200 units, are equipped with Pentium III at 1 GHz or Pentium IV at 1.2 GHz processors and 128 MB RAM.
The storage used is an EMC2 (Symmetrix 8430) with 25 TB (a terabyte being a measure of computer data storage capacity equivalent to one thousand billion bytes), made up of 140 discs of 181 GB each.
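As a quick check of the storage figure quoted above, the disc count and disc size can be multiplied out. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes, matching the text's "one thousand billion bytes") and computes raw capacity only; it does not model any overhead from the RAID 0+1 mirroring, which the text does not quantify. The function name is illustrative.

```python
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12  # "one thousand billion bytes", as defined in the text


def raw_capacity_tb(disc_count: int, disc_size_gb: int) -> float:
    """Raw capacity in decimal terabytes for a set of equally sized discs."""
    return disc_count * disc_size_gb * BYTES_PER_GB / BYTES_PER_TB


capacity = raw_capacity_tb(140, 181)
print(f"{capacity:.2f} TB")  # 25.34 TB, consistent with the quoted 25 TB
```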
The backup system is a Scalar 100 (LTO) with a 15 TB capacity, 100/200 GB tapes, a 15 MB/s transfer rate and a 324 Gb/h copy speed. The equipment uses Windows 2000 Advanced Server as its operating system and the Microsoft SQL 2000 database; this requires system units that can store up to 4.5 TB, a global record storage that exceeds the limits known in the Windows environment to date. There is also a CD copy system, which will be used to send the images corresponding to each Municipal Register to the corresponding municipality.
Procedures performed after the information is processed at the Census Production Centre
When the Census Production Centre completes its tasks, the questionnaires have been scanned, recognised and validated. These processes use dictionaries that allow the coding of those questions that require a literal answer: province and municipality of birth or of residence in 1991, activity, occupation, etc. However, the system does not code 100 per cent of the answers. Nor are the validations associated with coherence controls necessarily comprehensive, as they focus on eliminating the most important errors. Therefore, it is necessary to apply additional processes that allow the creation of final census files that can be used correctly from a statistical viewpoint.
The coding still pending after the production centre has completed its work is performed in two different ways. For records corresponding to Autonomous Communities that have signed a collaboration agreement, the statistics institutes of those communities are in charge of this coding, and can use automatic, assisted or combined coding procedures, according to their possibilities, although these must always be coherent with those used by the INE for the rest of the State. This is especially relevant when the Autonomous Community has another official language, or when it has powerful dictionaries from other projects that aid this coding.
For the records in the rest of the communities, the National Statistics Institute is in charge of this task. To do so, it will use an updated version of the procedures that proved successful in the 1991 Censuses, basically automatic coding via approximation, using dictionaries that are improved progressively.
Finally, both the records whose coding was completed by the statistics institutes of the Autonomous Communities and those coded by the INE are subjected to a single automatic imputation procedure carried out by the INE (like the automatic coding processes, this processing is carried out in a centralised manner in the Computer Science Department), aiming to eliminate inconsistencies. It comprises a probabilistic imputation process that preserves the original information as much as possible. The procedure thus produces the final file that will be used for statistical operations carried out by both the INE and the institutes of the Autonomous Communities. This reduces the resources required considerably (as a single process is used for all data in Spain), and creates a single final file that prevents the same statistical source from providing different figures for the same phenomenon, which is also an important innovation.
Compared to previous Censuses, automatic imputation is used far less, as the filtering and controls applied in the production centre have improved the quality of the data received. The INE will again implement the automatic imputation using the DIA system, developed by the INE and used previously in 1991 and in other surveys like the APS.
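The idea of automatic coding via approximation against a dictionary, mentioned above, can be sketched as follows. This is a minimal illustration and not the INE's procedure: it substitutes Python's standard `difflib` similarity matching for whatever approximation method was actually used, and the dictionary entries, the placeholder codes, the `normalise` and `code_literal` names and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib
import unicodedata


def normalise(text: str) -> str:
    """Strip accents, upper-case and collapse whitespace in a literal answer."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.upper().split())


def code_literal(answer: str, dictionary: dict, cutoff: float = 0.8):
    """Return the code of the closest dictionary entry, or None if nothing is close enough."""
    index = {normalise(literal): code for literal, code in dictionary.items()}
    key = normalise(answer)
    if key in index:  # exact match after normalisation
        return index[key]
    matches = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[matches[0]] if matches else None


# Placeholder codes, invented for illustration only.
MUNICIPALITIES = {
    "Madrid": "code-001",
    "Barcelona": "code-002",
    "Sevilla": "code-003",
    "Málaga": "code-004",
}

print(code_literal("Madird", MUNICIPALITIES))   # close enough to "MADRID"
print(code_literal("unknown", MUNICIPALITIES))  # no candidate above the cutoff
```

Answers that return `None` would be the ones left for assisted or manual coding; adding confirmed spellings to the dictionary over time corresponds to the progressive improvement the text describes.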