HIV survey in Mozambique: analysis with simultaneous model in contrast to separate hierarchical models

Background The analysis of correlated survey responses one at a time is not as informative or as useful as modeling them simultaneously. Simultaneous modeling makes it possible to evaluate the system as a whole rather than treating the responses as if they originated in isolation. Methods This research uses the Mozambique National Survey data to demonstrate the benefits of simultaneous modeling on blood test results, knowledge of HIV/AIDS, and awareness of an HIV/AIDS campaign. This simultaneous modeling also addresses the correlation inherent in the hierarchical structure of the data collection. Results Employment and self-perceived risk of HIV/AIDS have a different impact on blood test results, awareness of an HIV/AIDS campaign, and knowledge of HIV/AIDS when examined simultaneously than when modeled separately. Conclusion Simultaneous modeling of correlated responses improves the reliability of the estimates. More importantly, it provides an opportunity to make cost-saving decisions when designing future surveys and to make better health policies.


Background
It is common in national health research to use survey data to advance health policies. Survey results provide a measure of policy priorities. For example, the Demographic and Health Survey (DHS) is conducted in over 90 nations globally to obtain representative data on population health, nutrition, and HIV/AIDS. Data from these surveys are analyzed to identify trends and to advance global health research agendas and national programs and policies [1,2].
National health surveys are often used to generate information that is critical in describing national and regional trends and in identifying gaps in knowledge. However, suboptimal analytic practices threaten the evidence base used for programmatic and policy decisions. Although national surveys capture multiple outcomes of interest, these outcomes are often modeled separately, thereby ignoring the correlation among outcomes or the interplay between outcomes. Also, the hierarchical design of national surveys, which results in correlated observations, is often ignored. This omission leads to incorrect conclusions because incorrect standard errors are computed [3,4]. The problem is best addressed with simultaneous modeling of responses while accounting for the hierarchical structure of the survey data.
Mozambique is an example of a nation in sub-Saharan Africa that is severely impacted by the HIV/AIDS epidemic. The disease is one of the single largest global health priorities of the past two decades, with $562.6 billion spent globally between 2000 and 2015 as reported by the Global Burden of Disease Health Financing Collaborator Network 2018. Innumerable analyses have characterized the HIV/AIDS epidemic and its drivers within and across contexts [5,6]. Many such policy decisions are made based on the Mozambique National Survey.

Mozambique survey data
This research utilizes a nationally representative, random sample of edited and cleaned data from the Mozambique health data website. These data represent 270 clusters (primary units) distributed and sampled across Mozambique's 11 provinces. The data consist of 6232 households (secondary units) eligible for sampling. Men and women aged 15-64 living in these households are at the observational level and are eligible to participate by giving blood samples. There are 9311 adult participants. These data are available to estimate the prevalence of HIV/AIDS in the general population and to determine the impact of factors.

Outcomes of interest
This research concentrates on simultaneous modeling through the demonstration of three binary outcome measures of interest: blood test (positive or negative result), knowledge of HIV/AIDS (from community sources; a composite of a binary measure of participant awareness of HIV/AIDS from five sources: community meetings, school/teachers, conference in hospitals, community health workers, and church or mosque), and awareness of a campaign to combat HIV/AIDS (yes or no).

Covariates
The covariates include the following demographics: the continuous variables age and years of education; and the categorical variables gender, religion, marital status, employment in the past 12 months (not working, worked in the past year, currently working), family wealth index measured on a 5-point ordinal scale (poorest, poorer, middle, rich, richest), and self-perceived risk of contracting HIV/AIDS (no risk, small risk, moderate risk, great risk, respondent HIV-infected). Binary factors include electricity in the household (yes/no) and received any support or social assistance (yes/no).

Statistical model

Separate binary model
The modeling of binary outcomes often makes use of a standard logistic regression model. The standard logistic regression model belongs to the family of generalized linear models. It operates on the assumption that the observations are independent. However, when analyzing hierarchical data the independence assumption is no longer acceptable, so the researcher uses a generalized linear mixed model rather than a generalized linear model. In the generalized linear mixed logistic regression model, one must account for the intraclass correlation at the different levels of the hierarchical structure, most commonly through the use of random effects.
The intraclass correlation coefficient (ICC) indicates how much of the total variation in the probability is accounted for by a given hierarchical level of the data. In fitting binary models, it often appears that there is no error at the lowest level of the hierarchy (level 1); that appearance is misleading and must be addressed. Therefore, a slight modification is needed to calculate the ICC. This modification assumes the dichotomous outcome comes from an unknown latent continuous variable with a level-1 residual that follows a logistic distribution with a mean of 0 and a variance of π²/3 ≈ 3.29. Therefore, 3.29 is used as the level-1 error variance in calculating the ICC [7]. In these data, there are three levels, and thus two random effects are identified. One random effect represents household effects and the other represents cluster effects [4]. The value 3.29 then represents the variance of the residuals at the observational level (resident) [8].
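The latent-variable ICC described above can be sketched in a few lines. This is a minimal illustration, assuming a three-level random-intercept model with independent cluster and household effects; the function name `latent_icc` and the variance components passed to it are hypothetical and are not the estimates reported in Table 3.

```python
import math

# Level-1 residual variance of the standard logistic distribution (pi^2 / 3).
LEVEL1_VAR = math.pi ** 2 / 3  # approximately 3.29


def latent_icc(var_cluster: float, var_household: float) -> dict:
    """Latent-variable ICCs for a three-level random-intercept logistic model.

    Assumes the cluster and household random intercepts are independent,
    so the total latent variance is the sum of the three components.
    """
    total = var_cluster + var_household + LEVEL1_VAR
    return {
        # Two residents in the same cluster but in different households.
        "cluster": var_cluster / total,
        # Two residents in the same household (and therefore the same cluster).
        "household": (var_cluster + var_household) / total,
    }


# Hypothetical variance components, chosen only for illustration.
icc = latent_icc(var_cluster=0.50, var_household=0.30)
print(f"cluster ICC = {icc['cluster']:.3f}, household ICC = {icc['household']:.3f}")
```

Because residents of the same household share both random intercepts, the household-level ICC in this sketch is never smaller than the cluster-level ICC.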
Thus, for modeling one binary response, a generalized linear mixed model with the clusters and the households incorporated as random effects is

$$\operatorname{logit}(p_{ihc}) = \log\left(\frac{p_{ihc}}{1 - p_{ihc}}\right) = \beta_0 + \beta_1 X_{1ihc} + \cdots + \beta_k X_{kihc} + u_{oc} + u_{ohc},$$

where $p_{ihc}$ is the probability of a favorable outcome for the $i$th resident within the $h$th household within the $c$th cluster; $\beta_j$ is the regression coefficient associated with the $j$th predictor $X_{jihc}$; residents are indexed by $i = 1, 2, \ldots, n_{hc}$, households by $h = 1, 2, \ldots, n_c$, and clusters by $c = 1, \ldots, n$; the random intercept $u_{oc}$ measures the unobserved variance attributable to the $c$th cluster; and the random intercept $u_{ohc}$ measures differences at the level of household $h$ within cluster $c$. These two random effects are assumed to be normally distributed with $u_{oc} \sim N(0, \sigma^2_{u_c})$ and $u_{ohc} \sim N(0, \sigma^2_{u_{hc}})$. Further, this research assumes that the covariance of the random effects, $\sigma_{u_{oc}, u_{ohc}}$, is zero. Households as random effects represent the differences in the residents' responses that are attributable to households but were not captured by any of the covariates at the household level. Similarly, clusters as random effects represent the differences in the residents' responses that are attributable to the clusters but were not captured by any of the covariates at the cluster level.
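To make the roles of the two random intercepts concrete, the following sketch simulates binary responses from a three-level random-intercept logistic model of the kind described above. It is an illustration only: the single standard-normal covariate, the parameter values, and the function name `simulate_three_level` are assumptions made for the example and do not come from the Mozambique data.

```python
import math
import random


def simulate_three_level(n_clusters, households_per_cluster, residents_per_household,
                         beta0, beta1, sd_cluster, sd_household, seed=0):
    """Simulate (cluster, household, resident, x, p, y) rows.

    logit(p) = beta0 + beta1 * x + u_cluster + u_household, with independent
    normal random intercepts at the cluster and household levels.
    """
    rng = random.Random(seed)
    rows = []
    for c in range(n_clusters):
        u_c = rng.gauss(0.0, sd_cluster)           # shared by everyone in cluster c
        for h in range(households_per_cluster):
            u_hc = rng.gauss(0.0, sd_household)    # shared by everyone in household h
            for i in range(residents_per_household):
                x = rng.gauss(0.0, 1.0)            # hypothetical resident-level covariate
                eta = beta0 + beta1 * x + u_c + u_hc
                p = 1.0 / (1.0 + math.exp(-eta))   # inverse logit
                y = 1 if rng.random() < p else 0   # Bernoulli draw
                rows.append((c, h, i, x, p, y))
    return rows


# Hypothetical settings, chosen only for illustration.
data = simulate_three_level(n_clusters=20, households_per_cluster=5,
                            residents_per_household=3, beta0=-1.0, beta1=0.5,
                            sd_cluster=0.7, sd_household=0.5, seed=42)
print(len(data), "simulated residents;",
      sum(row[5] for row in data), "favorable outcomes")
```

Residents in the same household receive the same two intercept draws, which is what induces the within-household and within-cluster correlation that a standard logistic regression ignores.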

Simultaneous models
In public health research, it is common to find subjects providing information on a set of health responses together with a set of covariates. The correlation among these responses is helpful to public health officials and decision makers. Identifying the overlap helps with the distribution of resources and helps avoid duplication. Thus, it is advantageous to model the responses simultaneously.
This research demonstrates the use of three simultaneous binary outcomes $Y_{1ihc}$, $Y_{2ihc}$, and $Y_{3ihc}$, denoting the response of the $i$th individual in the $h$th household of the $c$th cluster for outcomes $q = 1, 2, 3$, with $h = 1, \ldots, n_c$ and $c = 1, \ldots, 270$. A simultaneous model of these binary outcomes, $f(Y_{1hc}, Y_{2hc}, Y_{3hc})$, includes a shared parameter that measures the correlation among the outcomes [8]. For $q = 1$, the response $Y_{1ihc}$ for blood test follows a Bernoulli distribution with mean $p_{1hc}$, with random effects $u_{oc}$ for clusters and random effects $u_{ohc}$ for households, so that, as in the separate model, the logit of the mean is linear in the covariates plus the cluster and household random intercepts. Similar models are available for heard of HIV/AIDS campaign [2] ($q = 2$) and heard of HIV/AIDS ($q = 3$). The joint model of these three outcomes has a vector of random effects $u$ distributed as normal with mean vector 0 and covariance matrix $D_{123}^{H \cap C}$, whose elements $d_{qp}$ are the covariances between the random effects for outcomes $q$ and $p$ at the levels of the hierarchy. If the covariances $d_{qp} = 0$ for $q \neq p$, then the random effects are uncorrelated, and the resulting model is equivalent to modeling the three outcomes separately [9,10]. The joint log-likelihood is obtained by integrating over these random effects. Through the use of a modification of the expectation-maximization (EM) algorithm, the researcher is able to obtain maximum-likelihood estimates for the model parameters when there are unobserved (hidden) latent variables.
The maximum likelihood estimates for the correlated logistic regression model are obtained in this way [11]. The iteration process in the EM algorithm provides convergence to the true ML estimates [12]. It is an iterative way to approximate the maximum of the likelihood function.
This research presents simultaneous generalized linear mixed models for binary responses (knowledge of HIV/AIDS, awareness of an HIV/AIDS campaign, and blood testing for HIV/AIDS) using shared joint random effects. This research uses these survey data to demonstrate the advantages of simultaneous modeling of these responses. These data are obtained based on a hierarchical structure. The SAS procedure PROC QLIM fits simultaneous binary models, among other models. This procedure is designed to analyze mainly cross-sectional data.
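The statement that zero covariance among the random effects reduces the joint model to three separate models can be checked numerically on a deliberately reduced case. The sketch below is an assumption-laden simplification, not the estimation procedure used in this research: it considers a single respondent, replaces the full covariance matrix with one normal random intercept shared by all three outcomes, and integrates it out with a trapezoidal rule. The function names and the linear predictors are hypothetical.

```python
import math


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_lik(y, eta):
    """Likelihood of one binary response given its linear predictor."""
    p = inv_logit(eta)
    return p if y == 1 else 1.0 - p


def joint_loglik(ys, etas, sigma, n_grid=2001, width=8.0):
    """Log marginal likelihood of several binary outcomes sharing one
    random intercept u ~ N(0, sigma^2), integrated out numerically.

    Uses the substitution u = sigma * z with z standard normal and a
    trapezoidal rule on [-width, width].
    """
    step = 2.0 * width / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -width + k * step
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        lik = 1.0
        for y, eta in zip(ys, etas):
            lik *= bernoulli_lik(y, eta + sigma * z)
        end_weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += end_weight * density * lik * step
    return math.log(total)


def separate_loglik(ys, etas):
    """Sum of the three log-likelihoods when outcomes are modeled separately."""
    return sum(math.log(bernoulli_lik(y, eta)) for y, eta in zip(ys, etas))


# Hypothetical responses and linear predictors for one respondent.
ys = (1, 1, 0)
etas = (-1.5, 0.8, 0.2)
print("no shared effect:", joint_loglik(ys, etas, sigma=0.0), separate_loglik(ys, etas))
print("shared effect   :", joint_loglik(ys, etas, sigma=1.5))
```

With `sigma=0.0` the shared effect vanishes and the joint log-likelihood matches the sum of the separate log-likelihoods up to integration error; with a positive `sigma` the two differ, which is the dependence the simultaneous model is meant to capture.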

Results
Of the survey respondents, 58% are female and nearly 70% are married or living with a partner. The average age of the respondents is 31 years, and the average number of years of education is three. Of the respondents, 54.08% are Catholic or Muslim. Approximately 25% of households have electricity. Approximately 16% of the respondents did not work in the last 12 months. About 30.7% of the respondents are classified into the richest category, and 12.2% of the respondents are classified into the poorest category. About 31% of the respondents perceived that they had no risk of contracting HIV/AIDS. The blood tests reveal that 13.4% of respondents are HIV-infected. Of the respondents, 77.2% are aware of HIV from community organizations and other institutions (schools, hospitals, religious institutions), and about 55% are aware of a campaign to combat HIV/AIDS. These results are summarized in Table 1.
There are no respondents in the survey who tested positive and did not hear about HIV/AIDS, but heard about the campaign. In addition, there are no respondents who tested negative and did not hear about the disease, but heard about the campaign, as shown in Table 2.
The data are collected in a hierarchical structure. Respondents are nested within households, and households are nested within clusters. The correlation due to this structure is addressed by treating households and clusters as random effects. The variances of the random effects at the household level and at the cluster level are shown in Table 3. The estimates suggest that the variance of the random effects due to clusters is significant for all three outcomes (blood test, aware of HIV/AIDS, and aware of campaign) and too large to ignore in any model [4]. The variance of the random household effects is significant in modeling the blood test, but not significant when modeling awareness of HIV and awareness of campaigns.

Simultaneous hierarchical logistic models
The simultaneous modeling of the three binary outcomes provides an opportunity to address interplay among the responses. The estimates for this simultaneous model of these three binary outcomes (knowledge of HIV/AIDS, awareness of HIV/AIDS campaign, and blood testing for HIV/AIDS) are given in Table 4.
The model shows that having electricity in the house increases the likelihood of hearing about the HIV/AIDS campaign and decreases the HIV-infected rate (p < 0.0074). The wealthiest Mozambicans are more likely to have a positive blood test, knowledge of HIV/AIDS, and awareness of the HIV/AIDS campaign in all models (p < 0.0024). Respondents with more years of education are more likely to be aware of the HIV/AIDS campaign (p < 0.001). Respondents who perceived any risk (small, moderate, great) are more likely to have HIV-infected test results compared to those perceiving no risk (p < 0.001). Those residents who are married or living together are more likely to be HIV-infected (p < 0.001). Males are more likely to hear about the HIV/AIDS campaign (p < 0.001). Support or social assistance is a significant factor only for knowledge of HIV/AIDS (p = 0.039). Marital status has no effect on knowledge but has an impact on blood test and awareness (p < 0.001 and p = 0.034, respectively). Self-perceived risk of HIV/AIDS and greater wealth are significant for all three responses. Similar covariates are significant in modeling awareness of campaigns to combat HIV/AIDS and in modeling awareness of HIV/AIDS (Table 4). There are marked differences between separate modeling of these responses and simultaneous modeling of the responses. The simultaneous modeling accounts for the other responses in determining the impact of a covariate on a particular response. A separate response model is compared to the simultaneous model, as shown in Table 4.
Note: Total respondents were fewer than 9331 for social support (n = 9295) and for self-perceived risk of HIV/AIDS (n = 6630) due to selection of "I don't know".

Conclusion
The survey data are correlated due to the hierarchical structure of the data. Statistical methods for the analysis of correlated data have become more accessible as statistical programs include the opportunity to use such models. The fit of correlated data with a generalized linear mixed model is common. However, it is important to note that an analysis of correlated data does not have the same interpretation as an analysis in which the data are assumed independent. A model for correlated data with random effects is referred to as a subject-specific model. Modeling simultaneous responses allows researchers to address correlation and explain the interplay. Such information results in cost-saving measures in the design of future surveys. The advantage of simultaneous modeling lies in its ability to address one response while controlling for another. It is typical in survey data to have the respondents provide responses to a series of outcomes. More importantly, the simultaneous modeling of responses on hierarchical data provides policymakers and researchers with results on which to base the allocation of resources at a time when funding is a scarce commodity.
Researchers are often faced with data with a complicated structure but often choose to forgo complex models and rely on two-at-a-time modeling, one response and one covariate, with independent observations. There are also multivariable methods (one response and several covariates) based on independent observations. When the observations are not independent, however, a correlated model is necessary to identify the pattern of association. Such a model typically provides larger standard errors, which affect the significance of the covariates.
The 2018 Mozambique survey data, like most survey data, present simultaneous responses [5,6]. Modeling simultaneous responses allows for the interpretation of interplay, which can lead to cost savings in future surveys. This approach is unique in that it simultaneously addresses the factors and the extra variation, as well as the interplay usually seen in survey data [13].

Abbreviations
DHS: Demographic and Health Survey; HIV: Human immunodeficiency virus; AIDS: Acquired immunodeficiency syndrome