
Table 3 Overview of challenges, considerations and recommendations in linking survey data with administrative data; HISlink, Belgium

From: Linking health survey data with health insurance data: methodology, challenges, opportunities and recommendations for public health research. An experience from the HISlink project in Belgium

Category

Description

HISlink-specific experience

Recommendations

Technical, practical challenges

Data quality

Availability, completeness and discriminatory power of identifiers

National register number available and used as linkage key

Use unique identifier when available.

Otherwise, carefully select linkage variables to construct the linkage keys. Ensure that these variables are as complete as possible (fewer missing values, fewer errors) and that no duplicate records exist in each data source.

Linkage errors

Linkage errors usually arise during data linkage, typically when ‘imperfect identifiers’ are used, and can result in substantially biased results. False matches (i.e., when records from different individuals link erroneously) and missed matches (i.e., when records from the same individual fail to link) [45, 46] are of greatest concern.

The number of false matches and missed matches can directly affect the estimation of prevalence or incidence rates. False matches (low specificity) lead to overestimates of prevalence whilst missed matches (low sensitivity) lead to underestimates. The impact of linkage error depends on the underlying prevalence of the target condition: analyses of rare conditions are more severely affected by linkage error compared with more common conditions, as overestimation is inversely related to the underlying prevalence [46].
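The dependence on prevalence described above can be illustrated with a minimal sketch. It assumes the standard misclassification identity, observed = sensitivity × p + (1 − specificity) × (1 − p), which is a simplification and not a formula taken from the HISlink project; the function names and the numeric values are hypothetical.

```python
def observed_prevalence(true_prev, sensitivity, specificity):
    """Expected prevalence in linked data under imperfect linkage.

    Assumption: linkage error acts like simple misclassification, so
    true cases are found with probability `sensitivity` and non-cases
    are wrongly flagged with probability (1 - specificity).
    """
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


def relative_bias(true_prev, sensitivity, specificity):
    """Relative over- or underestimation compared with the true prevalence."""
    observed = observed_prevalence(true_prev, sensitivity, specificity)
    return (observed - true_prev) / true_prev


if __name__ == "__main__":
    # Hypothetical scenario: perfect sensitivity, 1% false matches.
    for p in (0.01, 0.05, 0.20):
        print(p, round(relative_bias(p, sensitivity=1.0, specificity=0.99), 3))
```

With the same false match rate, the relative overestimation is much larger for the rare condition than for the common one, while lowering sensitivity alone pulls the estimate below the true prevalence.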

Negligible false matches because of the accuracy of the linkage key. However, up to 8% of matches were missed (see section 4.1 for possible explanations). The comparison of linked and unlinked records identified subgroups that are more prone to linkage errors (see Table 1).

Evaluate linkage quality and assess the impact of linkage errors on the results [17, 35, 46]. The evaluation of linkage quality is vital to producing reliable results from studies using the linked data. Several methods can be used to assess linkage quality and errors:

- comparing linked data with reference or ‘gold-standard’ datasets where the true match status is known;

- structured sensitivity analyses where a number of linked datasets are produced using different linkage criteria;

- comparisons of characteristics of linked and unlinked data to identify any potential sources of bias;

- statistical methods accounting for linkage uncertainty within analysis (e.g. using missing data methods);

- quality control checks (e.g. flagging implausible scenarios);

- computing sensitivity (proportion of matches that are correctly identified as links), specificity (proportion of non-matches that are correctly identified as non-links), match rate and false match rate.
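When a reference dataset with known match status is available, the measures in the last item can be computed from four counts. The sketch below is illustrative only: sensitivity and specificity follow the definitions given above, whereas the definitions of match rate (share of records declared links) and false match rate (share of declared links that are wrong) are assumptions, since conventions differ between studies. All names and counts are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def linkage_quality(tp, fp, fn, tn):
    """Summarise linkage quality against a gold standard.

    tp: true matches declared links
    fp: non-matches declared links (false matches)
    fn: true matches not declared links (missed matches)
    tn: non-matches not declared links
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        # Assumed definition: proportion of all records declared links.
        "match_rate": _ratio(tp + fp, total),
        # Assumed definition: proportion of declared links that are false.
        "false_match_rate": _ratio(fp, tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts, not HISlink results.
    print(linkage_quality(tp=900, fp=10, fn=80, tn=10))
```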

The Trusted Third Parties (TTPs) should enhance the linkage methods by combining deterministic linkage in the first steps, using the national register number (NRN), with probabilistic approaches afterwards for unlinked persons, using an algorithm based on other personal data. Identify subgroups of records that are more prone to linkage error and are potential sources of bias. Comparisons of linked and unlinked records can be useful for identifying where modified linkage strategies may be required for specific groups of records.

Use the NRN of all individuals included in the survey, regardless of the composition of the household at one time, instead of that of the reference person first and then the other family members, in order to improve the linkage rate.
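The recommended two-step strategy (deterministic linkage on the NRN first, then a fallback for records that remain unlinked) could be sketched as follows. This is not the procedure used by the TTPs: the second pass uses a simple agreement score as a stand-in for a proper probabilistic model, and the record layout (`id`, `nrn`, `birth_date`, `sex`, `postcode`), the function name and the threshold are all hypothetical.

```python
def link_two_pass(survey, admin, fields=("birth_date", "sex", "postcode"),
                  threshold=1.0):
    """Return {survey_id: (admin_id, method)} for linked survey records."""
    # Index administrative records by NRN, skipping missing identifiers.
    admin_by_nrn = {r["nrn"]: r for r in admin if r.get("nrn")}
    links = {}
    used_admin = set()

    # Pass 1: deterministic linkage on the unique identifier.
    for rec in survey:
        hit = admin_by_nrn.get(rec.get("nrn"))
        if hit is not None:
            links[rec["id"]] = (hit["id"], "deterministic")
            used_admin.add(hit["id"])

    # Pass 2: score remaining records on agreement of other personal data.
    for rec in survey:
        if rec["id"] in links:
            continue
        scored = []
        for cand in admin:
            if cand["id"] in used_admin:
                continue
            agree = sum(
                1 for f in fields
                if rec.get(f) is not None and rec.get(f) == cand.get(f)
            )
            scored.append((agree / len(fields), cand["id"]))
        if not scored:
            continue
        scored.sort(reverse=True)
        best_score, best_id = scored[0]
        # Accept only a single best candidate, to limit false matches.
        unique_best = len(scored) == 1 or scored[1][0] < best_score
        if best_score >= threshold and unique_best:
            links[rec["id"]] = (best_id, "probabilistic")
            used_admin.add(best_id)
    return links
```

Keeping the linkage method alongside each link makes it possible to compare deterministically and probabilistically linked records later, for example in a sensitivity analysis.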

Costs

Data linkage can be expensive in terms of financial and human resources.

Government-sponsored (NIHDI) linked datasets

Make the system cost-effective by avoiding the ‘linked and destroyed’ philosophy and making the linked data available to other researchers under certain conditions.

Respect for the principle of proportionality

Only data that are relevant to the purpose of the study should be included, to limit the risk of re-identification of individuals.

Help from the BHIS team for the selection of BHIS variables and help from the IMA’s SPOC for the selection of IMA variables.

Variable selection requires in-depth knowledge of the data sources. Involve people with good experience of the data sources to be linked in the variable selection phase.

An alternative and more effective approach could be to ask for authorization to link both datasets completely in a first step. In a second step, each research project requests, through a simplified procedure, access to the relevant variables of the fully linked dataset in accordance with the proportionality principle. Such an approach is applied at Statistics Netherlands [49–51].

Infrastructures

Infrastructure needed to store and access the linked data.

The linked data was stored on the IMA server. Researchers access it through a secure remote connection using a token.

Identify where linked data can be stored securely and how it can be accessed (remote session, data extraction).

Statistical issues

Analysing linked data raises a number of statistical challenges for researchers.

Experts’ advice during the statistical analysis plan, data analysis and interpretation of results.

Experts’ advice useful for the statistical analysis plan, data analysis and results interpretation.

Apply appropriate statistical methods to adjust analyses for linkage bias, e.g., an extension to standard multiple imputation methods that can handle ‘partially observed’ (or partially linked) data, or population weights that account for groups or people who are more or less likely to be linked [46].
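As an illustration of the weighting idea, the sketch below derives simple inverse-probability weights from the observed linkage rate within each subgroup, so that linked records from poorly linked groups count for more. It is one possible approach under stated assumptions, not the method of [46]; the keys `group` and `linked` and the function name are hypothetical.

```python
from collections import Counter


def linkage_weights(records, group_key="group", linked_key="linked"):
    """Return {group: weight}, where weight = 1 / linkage rate in that group.

    Groups without any linked record are omitted, because no linked
    record exists to carry a weight for them.
    """
    total = Counter(r[group_key] for r in records)
    linked = Counter(r[group_key] for r in records if r[linked_key])
    return {g: total[g] / linked[g] for g in total if linked[g] > 0}


if __name__ == "__main__":
    # Hypothetical records: group A links half the time, group B always.
    sample = (
        [{"group": "A", "linked": True}] * 2
        + [{"group": "A", "linked": False}] * 2
        + [{"group": "B", "linked": True}] * 5
    )
    print(linkage_weights(sample))
```

Applied to the linked records only, these weights restore the subgroup composition of the full survey sample; in practice they would be combined with the survey's own design weights.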

Ethical, legal and societal aspects

Approval processes

Privacy concerns have led to policies that prevent records from being easily linked. Usually, approval from an institutional/ethical review board (IRB) is needed, which is a long and cumbersome process.

The linkage was approved by the Information Security Committee (ISC). The approval process took three months for HISlink 2013 and five months for HISlink 2018.

Consider the IRB process in the timeline for the project.

Concerns about privacy led to policies that prevent records from being easily linked. Therefore, a strong case for using the data and a detailed description of how it will be protected is required when obtaining IRB approval.

Since HISlink is a government-sponsored linkage project that is repeated at every BHIS wave, a solution to avoid an ad hoc approval process would be to set up an “umbrella” agreement protocol for public institutions such as Sciensano, covering several years and several waves of BHIS-BCHI linkages.

Privacy and confidentiality issues: actual linkage process and principle of separation (Trusted Third Party linkage)

Once the IRB approval has been obtained, the actual linkage is itself a time-consuming process.

The separation principle means that the linkage and analysis processes are kept separate. Although this principle preserves confidentiality and avoids disclosing sensitive information, it makes the quality of the linked data harder to assess.

Trusted Third Party linkage was a lengthy process, mainly due to the signing of an agreement between all parties involved. The whole linkage procedure took 12 months for HISlink 2013 and 15 months for HISlink 2018.

Full separation of identifiers and attribute data has been argued to reduce the risk of re-identification, and is a valuable tool in reassuring data providers about the security of sharing their data. However, allowing linkage and analysis to take place together provides opportunities for both in-depth evaluation of linkage quality and methodological advances in linkage techniques [76, 77].

Consent form

To comply with the GDPR, effective opt-in consent for linkage has to be obtained.

HISlink 2013 and 2018 were not consent-based (exemptions, linkage planned before the implementation of the GDPR).

However, for the next HISlink (2023), BHIS participants were asked for consent to link their data with existing administrative data.

For planned linkage, ask survey participants for linkage consent, preferably at the beginning of the survey to maximise the consent rate [55, 68, 69].

For historical data linkage, certain exemptions exist. Check if the project falls under these exemptions.

Assess consent bias if applicable

Outcomes

Opportunities / limitations of linked data

The linked data is an important source for population health research and can bring enormous benefits in providing a more complete picture of the health of the population. A whole range of research possibilities exists.

Limitations of both BHIS and BCHI data remain, for instance lack of diagnostic information in the BCHI data

Include other data sources such as hospital discharge data

Consider replacing HIS information with administrative data where appropriate (e.g., for cancer screening, reimbursed healthcare use or reimbursed drug use).

Linkage type and sustainability

Ad hoc linkages vs. systematic linkages

Ad hoc linkage (and ad hoc approval) can threaten the sustainability of the project. HISlink is based on the ‘linked and destroyed’ philosophy (because the IRB approval limits data retention by researchers to five years after the linkage). As a result, the return on investment in linked data may be limited.

Establish clear data use agreements for governmental institutions, administrations and universities that allow the linked databases to be shared and used securely for at least several years, or even in perpetuity. Such strategies would allow the full potential of the linked data to be exploited in other research.

Think about systematic linkage.

Access to the linked data

 

HISlink data is currently accessible to Sciensano researchers only.

Make de-identified data available to other researchers upon approval

Sample size

Small sample can prevent some analyses

Limited sample size for rare events, specific subgroup analysis

Consider subsamples for specific subgroups, such as individuals with a low sociodemographic profile or those with specific conditions, if possible.