Table 1 Methodological guidelines using linked data and/or machine learning techniques to estimate population-based indicators, a study performed under the InfAct project, May 2021

From: Methodological guidelines to estimate population-based health indicators using linked data and/or machine learning techniques

Item number

Checklist item

Description

 

1

Rationale and objective of the study (i.e., research question)

Define the rationale and objective of the study by applying the PICO (Population, Intervention, Comparison, Outcome) criteria to research studies focused on population health.

2

Study design

Select the study design that best addresses the proposed research question.

3

Linked data sources

Select the required linked data sources to answer the proposed research question.

4

Study population

  

4.1

 

Define the inclusion and exclusion criteria of the study population by taking into account age, sex and period of data collection.

4.2

Sample size

State the significance level (alpha) and the statistical power, chosen according to the defined research question, that are used to calculate the sample size.

5

Study outcomes

  

5.1

Main outcomes

Define the main outcomes by taking into account the study population, the health condition to be studied, the exposure (intervention or risk factors, if relevant) and the defined study period.

5.2

Level of estimation

Describe the level at which health outcomes are estimated, using the finest possible granularity (e.g., community, metropolitan, departmental or regional level).

6

Data preparation

  

6.1

A. Data extraction

Extract the required input variables from the linked data set into a single file or spreadsheet that can be converted to the format required by the statistical software used for data analysis.

6.2

Coding of variables

Code the input variables that are common to the different linked data sets as continuous, categorical or binary variables, as required for the data analysis.

B. Data preparation to develop and apply an ML algorithm

  

6.3

 

Identify and define the target groups for a defined time window based on the outcome of interest.

6.4

 

Code the input variables that are common to the different linked data sets as continuous, categorical or binary variables for a defined time window.

6.5

 

Split the final data set into a training data set (80%) and a test data set (20%).

7

Data analysis

  

7.1

A. Variables selection

Remove all variables with a variance equal to zero, then select variables from those that remain.

7.2

 

Estimate the ReliefExp score for each variable based on its relevance to the outcome of interest.

B. Statistical techniques

  

7.3

I. Classical statistical techniques

Select an appropriate statistical technique to address the proposed research question according to the study objectives and the available data.

II. ML-techniques

  

7.4

 

Train several models and compare their performance in terms of the area under the ROC curve (AUC; binary classifiers only).

7.5

 

Validate the model performance using k-fold cross-validation on the training data set first, and then assess the model performance on the test data set.

7.6

 

Select the final model based on specific performance metrics including sensitivity, specificity, PPV*, NPV*, F1-score and kappa.

C. Sensitivity/uncertainty analysis

  

7.7

 

Perform a sensitivity analysis to identify the most influential parameters for a given output of a model.

7.8

 

Select an appropriate method to perform the sensitivity analysis.

7.9

 

Quantify the uncertainty in the estimates using 95% CIs* and describe the sources of uncertainty (if relevant).

D. Potential issues during data analysis

  

I. Missing data

  

7.10

 

Identify the missing data in the given dataset.

7.11

 

Apply an appropriate technique for the imputation of missing values in the given data set.

7.12

II. Imbalanced target group in a given dataset

Apply an appropriate technique to create a balanced data set, using either a down-sampling or an over-sampling approach.

7.13

III. Bias and variance tradeoff

Select the most generalizable model, i.e., the one that best balances bias against variance.

8

Study limitations

Describe the study limitations related to data sources (i.e., linkage, quality, access and privacy), study design, study population and statistical method used (if relevant).

*PPV: positive predictive value; NPV: negative predictive value; CI: confidence interval
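
Several of the ML-related checklist items above are mechanical enough to illustrate in code: removing zero-variance variables, the 80%/20% split, k-fold cross-validation, down-sampling of an imbalanced target group, the performance metrics (sensitivity, specificity, PPV, NPV, F1-score, kappa) and a 95% CI. The following is a minimal sketch in plain Python, not the method used in the source study. All function names are hypothetical, the data layout (a list of numeric rows with binary labels, 1 = case) is an assumption, and the Wilson score interval is only one possible choice for the 95% CI, since the guideline does not prescribe one.

```python
import math
import random
from statistics import pvariance


def drop_zero_variance(rows, names):
    """Keep only the columns whose variance across records is non-zero."""
    keep = [j for j in range(len(names)) if pvariance([r[j] for r in rows]) > 0]
    return [[r[j] for j in keep] for r in rows], [names[j] for j in keep]


def split_train_test(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle record indices once, then hold out test_fraction as the test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    pick = lambda data, ids: [data[i] for i in ids]
    return pick(rows, train), pick(labels, train), pick(rows, test), pick(labels, test)


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, validation_idx) pairs; each record is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, fold in enumerate(folds):
        train = [j for m, other in enumerate(folds) if m != i for j in other]
        yield train, fold


def downsample_majority(rows, labels, seed=0):
    """Randomly drop majority-class records until both classes have equal size."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([pos, neg], key=len)
    kept = sorted(minority + random.Random(seed).sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(y_true, y_pred):
    """Compute the selection metrics from the 2x2 confusion matrix (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = tp + tn + fp + fn
    observed = _ratio(tp + tn, n)  # observed agreement
    expected = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)  # chance agreement
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "kappa": _ratio(observed - expected, 1 - expected),
    }


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives a 95% CI)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(binary_metrics(truth, predicted))
    print(wilson_ci(successes=2, n=3))  # CI for the sensitivity: 2 of 3 cases detected
```

In practice, `k_fold_indices` would be applied to the training portion returned by `split_train_test`, with the test portion used only once to assess the selected model, which matches the order described in the checklist.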