Results of DA

Network Schemes

Possible Network Schemes

Several NSIs in the European Member States offer researchers access to national microdata in RDCs. Eurostat currently makes microdata available for six European Community Statistics surveys. Detailed microdata are available at the Safe-Centre in Luxembourg for only two of these six; the others are released as highly anonymised licensed datasets. Consequently, there has so far been no need to standardise dissemination, access and workflow procedures in general.

When considering potential schemes for implementing networked access to European Community Statistics, a number of possible solutions were examined. The aim was to reduce the burden on Eurostat by finding a solution that respects the current legal constraints (especially for submitting and granting access to the data) and that makes use of existing access channels and data sources. The outcome covers the main workflow of an RDC, its special needs and characteristics, and hence the issues that need to be addressed.

Moving to a pilot implementation

The potential solutions were assessed against a range of criteria designed to take into account the simplicity of implementation. These criteria include practical implementation in terms of RDC administration, technical capability (existing IT environment) and legal constraints. After the solutions judged impractical or inadequate against these criteria had been dropped, three possibilities were shortlisted as an intermediate result. The likely costs and benefits of the three solutions were then considered, and a matrix was produced identifying setup and ongoing issues for Eurostat and NSIs and whether these were complex or simple. The final result of this work is a hybrid solution that combines the most feasible parts of each shortlisted solution: a single final solution, foreseen for practical implementation in a pilot study. The decision on the pilot study was based on the three solutions, which can be characterised as follows:

• The fat centre has little appeal. It is a potential short-term solution as the legal and administrative frameworks are in place, but it would require investment in IT infrastructure by Eurostat, and probably some investment by NSIs as well. More importantly, implementing the fat centre would require almost no changes to procedures. As a result, decentralised access could occur, but the incentives to consider better alternatives would be much lower than at present; hence, choosing this as the short-term solution would likely make it the long-term one as well. True international data sharing (including non-EU data) is unlikely to evolve in this environment. This approach has been discarded because it is less innovative and not future-proof.

• The thin centre is more appealing. The main downside is concern over the legal aspects: while Eurostat has the lawful authority to transfer data to NSIs in MSs, Member States might raise objections that block the solution. Similar concerns might be raised over NSIs making decisions on applications or on clearance of outputs. The view was that these ‘downsides’ are actually positive aspects. Identification of the key behaviours that could be challenged is one of the main outputs of this report. The thin centre model forces hard questions to be asked, and allows the answers to be tested in a real environment. It also has the advantage that it could be implemented relatively quickly using existing infrastructures. It was agreed that this model is the best short-term solution and therefore well suited to a pilot study.

• The decentralised network has several appealing features, but it was felt to require more experience and consideration to fully realise benefits – experience which could be gained through a well-designed pilot. It was agreed this should be considered as a potential model for a long-term solution.

Details on the pilot solution

Under the thin centre solution, RDCs would be responsible for handling applications, doing clearances, and managing IT for users (even if the physical IT system is at Eurostat). Eurostat’s involvement would be limited to ensuring that the data are in the RDC’s IT space, wherever that is located. The optimal solution is a remote access system at Eurostat, where the RDCs of the MSs can log in and manage their own IT space. The advantage of this solution is that the data remain at Eurostat; there is no need to transfer any data from Eurostat to the MSs. For this reason, presumably many more MSs will agree to researchers’ access to their data. A physical transfer of Community statistics to MSs would lead to decentralised data management, which is difficult to realise in practice. Sharing data via the electronic Data files Administration and Management Information System (eDAMIS) is seen as the second-best option and, at the same time, the worst acceptable one. It is only a stopgap that makes decentralised access possible in principle and should not be seen as an investment for the future. Data access will be confusing and unclear if the data are managed in a decentralised way through such a system. If eDAMIS is only the second-best choice for access in a decentralised system, there is no way around a remote access platform at Eurostat. Eurostat might also have a role to play in applications: the ideal would be for the RDC to have full delegated authority, but if MS approval is felt to be necessary then mechanisms need to be designed to minimise the decision-making at Eurostat.

The pilot – technical implementation

All three of the solutions on the shortlist have advantages and disadvantages. For most solutions there are practical barriers to overcome before they can be implemented. The realisation that all progress goes in small steps led to a search for a short-term solution that will point the ESS in the right direction and can be implemented in the near future.
The essential feature of the short-term solution is that it builds on existing infrastructures as far as possible, with most of the administrative and supporting work currently carried out by Eurostat decentralised to local RDCs. The features of this short-term solution are described below.

The application process

• The researcher applies for access at his local RDC. The local RDC then provides him with the standardised templates for access requests.
• Based on the completed access requests, the local RDC checks the admissibility of the institute against Eurostat’s list. It also makes a recommendation on the project proposal. In making this recommendation, the local RDC will follow rules and considerations that Eurostat imposes.
• In an ideal system, the RDC will make the decision itself, based upon agreed standards. However, under current legislation, MS need to be consulted for agreement to the proposal.
• The local RDC then takes care of the signing of the contract by the researcher. If necessary, it then sends the contracts to Eurostat to be signed by them.
• The local RDC explains the use of the facility to the researcher and instructs him on disclosure control issues.

The IT system

• Eurostat builds an IT system for remote access via thin client; it is envisaged that NSIs would be able to manage their own ‘areas’ of the central system and to set up and manage accounts for researchers. NSIs need to have in place methods to access the central system.
o In the absence of the above solution, the RDCs manage their own IT system. Existing centralised European infrastructures/processes would be used (CIRCA, eDAMIS, etc).

Data preparation

• Each country manages its own national area via remote access inside the European central data storage (imagine 27 different folders, each with national microdatasets). Each MS, at its own discretion, gives its own data to the researcher by copying them to the corresponding user data area.
• In the absence of the above solution, Eurostat transmits data to the RDC via eDAMIS; the RDC then moves the data into the researcher’s local work area

Clearances

• The local RDC checks the output, making use of European guidelines, and makes decisions on its own authority
• Researchers get the results emailed to them and a copy is kept at the RDC
• If Eurostat is not willing to give up its authority, then it needs to set up a system to approve recommendations from RDCs. RDCs should be able to reject output by themselves, without reference to Eurostat
• A system for peer review is set up and actively carried out to ensure trust among NSIs in the way that output is being checked. Some proportion of outputs is afterwards double checked by another NSI or Eurostat.
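The peer-review step above (double checking "some proportion" of outputs) could be sketched as a random sample drawn from the outputs an RDC has already cleared. This is only an illustration: the function name, the `proportion` parameter and the output identifiers are assumptions, since the text does not specify how the proportion is set or how outputs are selected.

```python
import math
import random


def select_for_double_check(output_ids, proportion, seed=None):
    """Pick a random subset of cleared outputs for a second check.

    output_ids: identifiers of outputs already cleared by the local RDC.
    proportion: share of outputs to re-check, between 0 and 1 (hypothetical
                parameter; the text does not fix a value).
    seed:       optional seed so that a selection can be reproduced in an audit.
    """
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must lie between 0 and 1")
    ids = list(output_ids)
    # Round up so that any non-zero proportion of a non-empty list
    # selects at least one output.
    sample_size = math.ceil(len(ids) * proportion)
    rng = random.Random(seed)
    return sorted(rng.sample(ids, sample_size))


# Hypothetical identifiers for ten cleared outputs.
cleared = [f"output-{n:03d}" for n in range(1, 11)]
to_recheck = select_for_double_check(cleared, proportion=0.5, seed=1)
print(to_recheck)
```

Fixing the seed makes the selection reproducible, which may help when the reviewing NSI or Eurostat needs to show how the sample was drawn.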

Keeping the administration up to date

• A simple central administrative system is available. Local RDC personnel can remotely log on to this system to add some basic information on the new contract (name of the institute, research aim, name of researchers, datasets used, start date, finishing date etc). This central administrative system could be placed on the central IT-system at Eurostat or for instance on CIRCA or a secure website.
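As a rough illustration of the basic contract information listed above, a record in such an administrative system might look like the following. The class and field names are hypothetical and the values are invented; the text only names the kinds of information to be stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ContractRecord:
    """One entry in a hypothetical central administrative system."""
    institute: str
    research_aim: str
    researchers: List[str]
    datasets: List[str]
    start_date: date
    finishing_date: date

    def __post_init__(self):
        # Reject records whose finishing date precedes the start date.
        if self.finishing_date < self.start_date:
            raise ValueError("finishing date precedes start date")

    def is_active(self, on: date) -> bool:
        """Return True if the contract covers the given day."""
        return self.start_date <= on <= self.finishing_date


# Invented example values, for illustration only.
record = ContractRecord(
    institute="Example University",
    research_aim="Income dynamics across Member States",
    researchers=["A. Researcher"],
    datasets=["ECHP"],
    start_date=date(2010, 1, 1),
    finishing_date=date(2010, 12, 31),
)
print(record.is_active(date(2010, 6, 1)))
```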

Supporting countries without RDCs

• Countries without RDCs should be encouraged to either set up low-cost local systems or to investigate joint access with countries that can provide internet access to a Safe-Centre.

Long-term solution

At its simplest level, the short-term solution is designed to be up and running quickly with minimal investment. The long-term solution is less specific on implementation and more concerned with standards and process. The short-term model is seen as testing a concept which would be necessary for any long-term model; for example, the willingness of MSs to delegate authority to Eurostat and/or RDCs.
If Eurostat chose to build a remote access system as part of the short-term pilot, this would be a useful indicator of how a long-term model might be implemented. However, given the range of experience across Europe in building thin-client systems, this seems a lower priority than resolving differences over process, which could put the brake on any implementation.
In the long-term model, other issues need to be considered. How can metadata be developed usefully? How can research results be shared effectively? How do we ensure that RDCs learn from each other? Is there a benefit in a pan-European research databank? Could some form of secure ‘cloud’ or ‘grid’ computing replace single data cores? A decentralised system could pose difficulties in some of these areas. This does not necessarily mean that these developments are impossible; but there is currently limited experience in this area. A pilot using ECHP should also include reviewing some of the ways to improve knowledge sharing.

Requirements for a long-term solution

The long-term solution aims to provide a generalised access system within which access to European datasets is one outcome, but not the only one. The long-term solution is defined by a set of standards for the operations of RDCs which allow decisions about access to be taken with reference to principles of security, not details of implementation. These security standards would cover all four aspects of the extended security model (safe projects, people, settings, outputs).
In the case of access to European microdata, an additional requirement is the need for an equivalent framework detailing the legal position of Eurostat and the policy position of MSs. Again, this is so that decisions about access can be taken at a dataset/researcher level without needing to consider the implementation every time.
This long-term solution does require a number of developments, particularly in the confidence of MSs to delegate authority. Accordingly, a number of items should be tested during the pilot phase. These do not necessarily lead to a specific long-term solution, but are necessary steps to deciding how a long-term solution might work in practice.

Decisions over the location of data

Authority to hold EU microdata at various places is not clear; even if EU microdata can be stored locally at NSIs, there is at present no clear mechanism for deciding the practicalities. This project proposes

• The advantages of the right for NSIs to hold EU microdata supplied to them by Eurostat need to be discussed and (where appropriate) established
• Distribution of data to researchers should be as delegated as possible; that is, NSIs should be holding data for distribution to researchers once an application for access is granted

Delegated authority to approve applications

One of the key concerns about using European microdata is the need for MS approval of all project applications. If a project application needs 27 reviews, then any general system for giving data access will be hard to manage.
The aim of this project is to make access to EU microdata easier. This, in our opinion, would be possible only by reducing the number of assessments required. This project therefore proposes that

• Irrespective of the particular solution chosen, access systems should have a much lighter approval regime with delegated authority
• The pilot should therefore develop and test models of delegated access: is the legal/procedural framework appropriate? Will it stand up to generalised access to datasets? How will standards for approval be set?

Approval of clearances

As for applications, approval of clearances requires delegated authority to prevent clearance becoming impossibly complex and slow. This project proposes

• Both the long-term and the short-term solutions require that clearances are carried out locally to an agreed standard
• The pilot should test processes for auditing, decision-making on the rejection of output, and the ‘comfort level’ of expanding the clearance model beyond ECHP to more controversial datasets

IT systems

If Eurostat chooses to set up a remote access system, then the pilot can usefully focus on both technical and procedural aspects:

• How does the IT perform? How expandable/flexible/reliable is it?
• How are user accounts etc efficiently managed – locally or at a distance? Does it make much difference?
• Do all users have to agree on how they manage ‘their’ user area? Can there be variations in software?

Test datasets

It has been suggested that ECHP will be used as the test dataset during the pilot phase. The advantage of ECHP is that it can be physically transferred to the MSs, even if formal agreement on its use/access is needed anyway (Reg. EC 233/09). However, the disadvantage of ECHP is that it does not challenge the distribution of data or project approval in the general case. A second disadvantage of ECHP is that usage is limited: it is unlikely to give a fair assessment of utility to researchers of decentralised access.
We therefore propose

• That the pilot include access to a dataset with a strong research interest and a difficult case to consider (at least under current arrangements), such as the European Labour Force Survey

Co-ordination with other bodies

An ESS Task Force has just begun a two-year project to study the ramifications of changes of the legislation. Much of the work of this Task Force overlaps with the functional issues this project recommends be tackled. The pilot could be a useful specific example within which the Task Force can develop and test some of its proposals. The group therefore proposes any pilot liaise closely with this ESS Task Force.

Accreditation System for a Safe-Centre

The intention to widen access also raises the question of standardising security aspects. Besides the need to guarantee the anonymity of results, the legal framework includes special restrictions and conditions that have to be considered when providing access. As the study initially concentrates on implementing a network for access to the ECHP, it seems reasonable to examine whether the guidelines of Eurostat and the project partners can be turned into standardised criteria for an accreditation system. We agree that there is a minimum legal bottom line; the specifications on allowances and restrictions therefore have to be defined.
The following accreditation system includes a definition of a Safe-Centre and the rules which have to be met to fulfil the criteria to provide access to European microdatasets.

A Safe-Centre is defined as a secure room in a MS/Eurostat, especially designed for researchers. It is a place where researchers can access detailed confidential data under contractual agreements which cover confidentiality. The Safe-Centre itself would consist of a secure working and data storage environment in which the security of the data for research can be ensured. Both the legal and the IT aspects of security are considered here. To ensure the security of the data, the Safe-Centre has to guarantee that:
It is not possible for the researcher to
• print documents
• copy data to removable media
• copy data to the local hard disk
• connect recording devices to the external interfaces
• connect a laptop to the network
• use e-mail or to connect to the internet. Exception:
► communication with outside via e-mail or internet (internet research) is permitted only via accounted desktops specially provided for this purpose (communication desktop) which ensure that (1) access to both confidential data and the internet at the same time is not possible (2) data cannot be transferred electronically between the access point and the communication desktop.
• install or remove hardware or software (the access point’s configuration is locked)
• boot the access point from floppy, CD-Rom, DVD-Rom or any other media
• access the internal production network of the MS/Eurostat
In addition, supervision should be possible at any time, but need not be continuous; telephones may be used to provide supervision.
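The list of guarantees above lends itself to a simple accreditation checklist. The sketch below assumes that each guarantee can be recorded as a yes/no answer for a candidate Safe-Centre; the identifiers and the function are hypothetical and are not part of any Eurostat specification.

```python
# One identifier per action the Safe-Centre must make impossible for the
# researcher, following the list above (identifiers are hypothetical).
REQUIRED_BLOCKS = (
    "print_documents",
    "copy_to_removable_media",
    "copy_to_local_hard_disk",
    "connect_recording_devices",
    "connect_laptop_to_network",
    "email_or_internet_from_access_point",
    "install_or_remove_hardware_or_software",
    "boot_from_external_media",
    "access_internal_production_network",
)


def unmet_requirements(blocked_actions, supervision_possible):
    """Return the requirements a candidate Safe-Centre does not yet meet.

    blocked_actions:      set of identifiers from REQUIRED_BLOCKS that the
                          centre has verifiably made impossible.
    supervision_possible: whether supervision is possible at any time.
    """
    missing = [name for name in REQUIRED_BLOCKS if name not in blocked_actions]
    if not supervision_possible:
        missing.append("supervision_possible")
    return missing


# Hypothetical centre that has not yet locked down printing.
candidate = set(REQUIRED_BLOCKS) - {"print_documents"}
print(unmet_requirements(candidate, supervision_possible=True))
```

An empty result would indicate that the centre meets every item on this checklist; the exception for dedicated communication desktops would need to be assessed separately.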

For the implementation of the short-term solution (“pilot”), guidelines on the security requirements for a Safe-Centre were developed during the project; these can serve as a model for the minimum specifications of a Safe-Centre.
Within the “strategic aim”, a safe way to access data held outside the environment of an RDC should be taken into account. Several MSs (e.g. Netherlands, Italy, Denmark) use specific programmes to set up a safe connection between the researcher’s desktop (in our case the PC in the Safe-Centre of the local RDC) and a protected server of the MS. The key issue is that the microdataset remains in the controlled environment of Eurostat, while the researcher can do the analysis in the RDC. The remote connection enables the researcher to run statistical packages/programmes on the server located at Eurostat. The researcher sees only the session on his screen, which shows the results of his analysis and also the microdata itself. For this reason the researchers work in the enclosed environment of the Safe-Centre. Only screen images are sent to the researcher’s PC; no data are transmitted. Even copying data from the screen to the hard disk is not possible. The MS/Eurostat has to check the output for disclosure risks, and the results are released once their anonymity has been confirmed.

Output Checking of ECHP

The assignability of the guidelines to the ECHP

A key prerequisite for providing access to microdata on a European scale is consistency in the way each MS checks output against disclosure of data at the individual level. A common set of guidelines is therefore needed.
The development of this set of guidelines is being dealt with by the “guideline group on output checking”, which is part of the ESSnet on Statistical Disclosure Control. The result of this guidelines project will be a set of guidelines that can be applied to all kinds of microdata (business, households, individuals).

First of all, it needs to be said that these guidelines have been developed in such a way that they should be applicable to all kinds of datasets. They contain no rules or guidelines that are specific for certain datafiles or variables.
It is nevertheless of interest whether output based on the ECHP fits into the framework set up by the guidelines (mainly the classification of outputs).
To establish this, publications based on the ECHP datafile were analysed. The EPUnet website provided a valuable source for this. In all, 57 scientific publications were reviewed, all of which are based upon data from the ECHP.
Before the output used in these publications is discussed, a short description is given of the institutes involved, the most popular terms of reference and the most commonly used variables.

Institutes, terms of reference and variables

The scientific users of the ECHP are mainly from the university sector: about two thirds of all researchers applied for the data via a university, the rest gained access via other scientific institutes.
Grouping the researchers’ topics into sectors shows clear favourites, which are examined from various angles: the most popular subjects are income, the job market and poverty, above all child poverty. In addition, education/training, the gender gap, family/family formation and health were often examined. Some research also dealt explicitly with methodological problems.
The variables used were similar across the different terms of reference. Income, sex, age and country were used almost always. Job status, children, household composition and educational level were also used very often. The number of variables used in the analysis is notable: while some cases used only two variables (income and country), the number could grow very large in others.

Type of output

The type of output used most frequently in the reviewed publications was tabular analysis with frequency tables, followed by cross-tabulations. The cross-tabulations were often constructed with country as the independent variable. Tabular analysis was used in most works, although the frequency tables were partly incorporated into the running text of the publication and so could not always be identified as such.
The second most used type of output, after the two types of tables, consists of various fairly simple statistical calculations. The most common were regressions and estimators, but means and measures of dispersion also appeared quite often.
Graphics were used less often. Within this group, the most common were frequency polygons, bar charts and scatter diagrams; histograms and other area diagrams were rarer.
More complicated statistical calculations, such as the bootstrap method or the calculation of different weighting factors, were also weakly represented.
To quantify the above analysis to some extent, all types of output that could be clearly distinguished in the 57 publications were counted. In total, 248 outputs were identified.

Concluding remarks

All output on ECHP that was clearly described in the publications can be assigned to one of the output classes of the guidelines project. This is a reassuring conclusion. Because the actual rules for these classes are not specific to certain statistics or variables, they will by definition apply to the ECHP. Since all output can be assigned to one of the classes, all examined output that was created using the ECHP can be checked with the general guidelines. The large volume of examined output makes it safe to extend this conclusion to future output based on ECHP data.

Costs

Based on the experience of existing RDCs, it should be possible to calculate the cost of the hardware that allows access to microdata, either on a national server or via remote access. However, implementing new ways of accessing Community data will surely increase demand, which in turn requires additional staff. An estimate of the costs involved should also be useful for NSIs that aim to set up an RDC.
A cost template has therefore been developed, covering the following categories:

1. staff planning (rates for each qualification/grade)
2. breakdown into strategy and operational costs
3. breakdown into fixed and variable costs
4. number of projects
5. IT costs

On this basis, a cost model can be estimated that gives information on staff unit costs, the scale of operations, the operating costs per project and, finally, the share of costs by category.
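To make the arithmetic of such a cost model concrete, the sketch below computes the cost per project and the cost shares from inputs of the kind listed in the template. It is a simplification under stated assumptions: the function, its parameters and all figures are hypothetical, and the split between strategy and operational costs is left out.

```python
def cost_summary(staff, it_costs, other_fixed_costs,
                 variable_cost_per_project, number_of_projects):
    """Summarise annual RDC costs from template-style inputs.

    staff: dict mapping grade -> (annual rate, full-time equivalents).
    All other arguments are annual amounts in the same currency unit.
    """
    if number_of_projects <= 0:
        raise ValueError("number_of_projects must be positive")
    # Staff cost is the sum of rate times full-time equivalents per grade.
    staff_cost = sum(rate * fte for rate, fte in staff.values())
    variable_cost = variable_cost_per_project * number_of_projects
    by_category = {
        "staff": staff_cost,
        "it": it_costs,
        "other_fixed": other_fixed_costs,
        "variable": variable_cost,
    }
    total = sum(by_category.values())
    return {
        "total": total,
        "cost_per_project": total / number_of_projects,
        "shares": {name: cost / total for name, cost in by_category.items()},
    }


# Invented figures, for illustration only.
summary = cost_summary(
    staff={"senior": (60000, 0.5), "junior": (40000, 1.0)},
    it_costs=20000,
    other_fixed_costs=5000,
    variable_cost_per_project=500,
    number_of_projects=10,
)
print(summary["cost_per_project"], summary["shares"])
```

Because the fixed costs are spread over all projects, the cost per project in this sketch falls as the number of projects rises, which is why the scale of operations matters for the estimate.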
With regard to further implementation projects, it is also necessary to discuss how the financial burden will be covered. It is still open whether Eurostat is willing to bear the costs or whether the NSIs will have to finance the service themselves. In the latter case, part of the costs could be recovered from users.

Metadata

Although many metadata formats currently exist and no single, homogeneous metadata system is expected to prevail in the future, the following summary concentrates on one of several conceivable suitable formats.

Two variants appear to have taken on the role of industry standards in the last few years, where they appear to play out their respective strengths in different areas of application. SDMX is a well-established metadata standard which is particularly amenable to the annotation of low-volume time series data, while DDI is an open-source metadata format which has first and foremost gained popularity among social researchers who are more concerned with the analysis of often voluminous panel-type microdata.

Since microdata are produced in large quantities within NSIs such as the ONS, Destatis or Eurostat, DDI is a suitable metadata candidate to annotate such raw microdata for purposes of preservation, dissemination and easy discoverability. Many modern features incorporated into recent standards such as DDI 3.1, emphasizing the microdata life cycle view, and abstract, flexible and highly granular meta model-based XML implementations, are powerful enablers in the much-needed metadata revolution. Set against this are organisational realities which partially act as inhibitors to the implementation of such advanced features which are rendered difficult if not outright impossible to put into practice especially when viewed in the context of an all-participatory and all-collaborative process involving all participants to the microdata production chain. Such organisational problems are further accentuated by the well-known confidentiality issues surrounding microdata and their safe and secure accessibility by interested parties.

All told, given the enablers and inhibitors associated with typical microdata infrastructures at NSIs, we take the view that high-quality metadata, exploiting the latest features and poised to take advantage of future ones, can only be produced in a centralised fashion by dedicated centres of microdata expertise, which have already been established in a number of NSIs across a large number of countries. The centralised view of metadata production emphasises first and foremost the generation of high-quality metadata, not its exchange, which can and arguably should occur in a decentralised and collaborative fashion. Since modern metadata standards are increasingly granular and also aim to describe complex inter-linkages between microdatasets, flexible metadata production systems will have to reflect the same or a greater degree of granularity and interrelatedness as the metadata standards for which they are built. Since this granularity and annotation of interrelatedness increasingly mimic the well-known structure of relational database systems, such database systems should form an integral part of any feasible metadata production system. The structure of DDI 3.1 and the possibilities it offers for generating complex human- and machine-readable metadata call for a metadata production architecture in which all critical elements of the process are concentrated in one physical location. Environmental realities and the aspiration to generate high-quality metadata make the centralised, secure production of metadata by expert groups within NSIs an inescapable design requirement.

Recommendations and Outlook

The recommendation for setting up a decentralised system for access to European microdatasets in fact amounts to a single solution. The ideal model is a remote access system at Eurostat which researchers in the MSs can access through approved Safe-Centres in the NSIs. The results describe two solutions that differ only in their time horizon. The first, the pilot solution, is a starting point for building a system for decentralised access. The second, the strategic aim, is the other side of the same coin. These are not competing systems: one can be the consequence or further development of the other. To reach a decentralised system like the strategic aim, experience, cooperation and trust among the MSs and between them and Eurostat are needed. This will take some time. For this reason the project team recommends starting with the pilot solution and developing from there in the direction of the strategic aim.

The final results of this feasibility study have been delivered. The project team was able to give a clear recommendation on a network solution which can be implemented very quickly.
The current legal framework already makes it possible to transfer Community statistics to MSs, where they can be used for scientific analysis. For this, the explicit approval of the originating national authority is necessary. Whether or not the MSs agree to such a request can only be shown by a practical implementation of this feasibility study in a follow-up project.
From the project team’s point of view, the technical, legal and administrative requirements are in place to start with the recommended pilot study.
That does not mean that the first decentralised access achieved by the pilot will be convenient, or that the European microdatasets will immediately be usable by everybody. But decentralised access in the researcher’s own MS is possible in principle. Enabling decentralised access is part of the pilot study; improving that access is part of the strategic aim.
The question whether additional ways of accessing Community data in NSIs are possible has thus been answered. The debate on solutions for remote access is becoming more and more important and needs to be considered in future developments. For quick, safe and easily manageable access to EU microdata, a modern channel such as remote access is indispensable for scientific use. A remote access system should therefore be pursued as soon as possible, to provide a future-proof tool for European empirical science.