Some of the cohorts are community-based, while others focus specifically on clinical populations. Furthermore, some cohorts focus on young men who have sex with men, while others focus on persons who inject drugs. To allow for cross-cohort analyses, we implemented a rigorous data harmonization process for a core set of data elements. The specifics of the process have been described elsewhere, but briefly, we requested data dictionaries from each of the cohorts and identified a core set of variables including sociodemographic factors, clinical characteristics, and substance use behaviors. Common data elements were first reviewed by the consortium data team both qualitatively and quantitatively, and the resulting harmonized data sets were then reviewed with each cohort data team to ensure fidelity in the harmonization process. Given the consortium’s focus on substance use, we were particularly interested in retaining as much information and specificity as possible related to substance use. While each cohort used standardized measures of substance use, the choice of measures differed across cohorts. Even when measures overlapped, most studies used variations, making it challenging to harmonize data across studies. For instance, substance use was assessed with various time frames, including 30-day, 3-month, and 6-month recall periods. Combining these data to obtain substance use in the past six months could lead to misclassification bias, particularly for occasional users who may not have used a given drug in the shorter recall periods. This challenge to harmonizing substance use data, a key variable for the consortium, resulted in a patchwork pattern of missing data. There are a number of strategies to deal with missing data resulting from harmonization processes in which disparate measures cannot be collapsed into one variable.
One common strategy is to ignore the missingness and use only participants with complete data in the analyses, an approach well known for its potential for bias and inefficiency. A widely used strategy to overcome this issue is imputation, followed by analysis of the full data set as if the imputed values had been observed. In recent years, as a result of significant advances in computing power, a wide array of techniques for producing imputations has emerged, including regression-based techniques that allow for specification of multivariable models, hot-deck techniques, and multiple imputation methods.
Additionally, strategies to evaluate the statistical properties of imputation techniques have also been explored, though few studies have taken a more applied and translational approach. The objective of this study was to move beyond consideration of the statistical properties of these methods and present an applied overview of the performance of different imputation strategies when used for data harmonization. We used data from one of the cohorts participating in the consortium as a validation set and created missing data in such a way as to mimic the missingness that results during the harmonization process. We then applied three imputation strategies that vary in complexity (logistic regression, single hot-deck imputation, and multiple imputation) and evaluated the performance of each strategy.

At baseline and subsequent follow-up visits, which occurred at least six months apart, participants completed a self-administered, computer-based questionnaire. The questionnaire covered a number of domains, including sociodemographic characteristics, sexual risk behaviors, and an extensive battery of questions related to substance use. In this analysis we used substance use data collected as part of a modified version of the Alcohol, Smoking and Substance Involvement Screening Test. Specifically, for each substance participants were asked how often they had used it in the past six months. Substances of interest were cocaine, crack, ecstasy, heroin, cannabis, methamphetamine, poppers, and prescription drugs. Response options included never, once, monthly, weekly, and daily/almost daily. For the purpose of this analysis, all those who reported using a given substance at least once in the past six months were categorized as having used the particular drug, with all others being categorized as non-users.
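The dichotomization described above can be illustrated with a minimal sketch. Python is used here for illustration only; the function name `used_past_six_months` and the exact response labels are assumptions based on the response options listed in the text, not the coding used in the study.

```python
# Hypothetical response labels, following the options listed in the text.
USE_RESPONSES = {"once", "monthly", "weekly", "daily/almost daily"}


def used_past_six_months(response):
    """Return 1 for any reported use, 0 for 'never', None if unanswered."""
    if response is None:
        return None  # keep unanswered items as missing
    answer = response.strip().lower()
    if answer == "never":
        return 0
    if answer in USE_RESPONSES:
        return 1
    raise ValueError(f"unexpected response: {response!r}")


# Example: four questionnaire responses for one substance.
responses = ["never", "Weekly", None, "once"]
flags = [used_past_six_months(r) for r in responses]
```

Keeping unanswered items as missing, rather than coding them as non-use, is a design choice of this sketch so that genuine non-response is not confused with reported non-use.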
We selected drugs that were reported at low, medium, and high prevalence of use: heroin, methamphetamine, and cannabis, respectively. This allowed us to evaluate the performance of the imputation strategies under various prevalence estimates of the outcome.

Data collected from August 2014 through June 2019 from 528 participants, and the resulting 2,389 study visits, were used in this analysis.
A Monte Carlo simulation study with 500 iterations was run to assess the relative performance of each imputation method. At each iteration, a proportion of the data was first set to missing; this step was intended to mimic the missingness that results when we attempt to harmonize disparate measures of substance use across studies. Second, using the amputated data, three strategies (logistic regression scoring, single hot-deck imputation, and multiple imputation) were used to impute the missing data. Each imputation generated an estimated prevalence and confidence interval, which were stored until 500 iterations were completed. Finally, summary statistics across the 500 iterations allowed us to compare the performance of each strategy against the prevalence from the original data. Details of each of the steps in the process are described below.

Data amputation, the process of generating the missing data, involved simulations in which the original dataset was sampled with replacement and amputated, giving consideration to several factors including the missing data mechanism, the amount of missing data, and the pattern of missingness. The primary consideration for the missing data mechanism was whether the missingness was related to the underlying value of that variable. This is relevant because strategies to handle missing data rely largely on correct assumptions about the mechanisms that caused the missingness. For the purpose of this analysis we considered three different missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR indicates there is no relationship between the missing data and any observed or unobserved variables. In this scenario, the probability of being missing is the same for all cases in a given data set.
MAR indicates a missing data mechanism in which there is a systematic relationship between the probability of being missing and some observed data, but not the missing data itself. More specifically, under MAR the missingness is conditionally independent of unobserved outcomes but depends on observed outcomes. The premise of MAR is that once the analyst controls for these auxiliary variables, the missingness is ignorable. Finally, MNAR suggests that there is a relationship between missingness and unobserved outcomes, which makes it the most difficult mechanism to handle properly. The level of missingness used in the amputation was set at 10%, 30%, and 50% in order to assess low to high rates of missing data. Additionally, the pattern of missingness was varied by substance in order to allow for any one of the following scenarios: missing heroin only; missing cannabis only; missing methamphetamine only; or missing all three drugs simultaneously.
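To make the three mechanisms concrete, the following is a minimal Python sketch of how the probability of missingness could be defined under each one. It is an illustration under stated assumptions, not the algorithm used in the study: the function names `missing_probability` and `ampute_rows`, the toy records, and all weights are hypothetical, and the realized proportion of missing values under MAR and MNAR only approximates the nominal rate.

```python
import math
import random


def missing_probability(row, target, mechanism, rate):
    """Probability that `target` is set to missing for one record.

    All weights are arbitrary illustration values, not estimates.
    """
    if mechanism == "MCAR":
        # Same probability for every record.
        return rate
    if mechanism == "MAR":
        # Depends only on observed auxiliary variables (age, employment).
        score = 0.05 * (row["age"] - 30) + 0.5 * row["employed"]
        return min(1.0, 2.0 * rate / (1.0 + math.exp(-score)))
    if mechanism == "MNAR":
        # Depends on the value that will itself become unobserved.
        return min(1.0, rate * (1.5 if row[target] == 1 else 0.5))
    raise ValueError(f"unknown mechanism: {mechanism}")


def ampute_rows(rows, target, mechanism, rate, rng):
    """Return copies of `rows` with `target` set to None at random."""
    amputed = []
    for row in rows:
        new_row = dict(row)  # leave the original record untouched
        if rng.random() < missing_probability(row, target, mechanism, rate):
            new_row[target] = None
        amputed.append(new_row)
    return amputed


# Toy data: 2,000 simulated records (not study data).
rng = random.Random(2024)
cohort = [
    {"age": rng.randint(18, 45),
     "employed": rng.randint(0, 1),
     "cannabis": rng.randint(0, 1)}
    for _ in range(2000)
]
mcar = ampute_rows(cohort, "cannabis", "MCAR", 0.30, rng)
share_missing = sum(r["cannabis"] is None for r in mcar) / len(mcar)
```

The only difference between the three branches is which information the probability is allowed to depend on: nothing (MCAR), observed auxiliary variables (MAR), or the value being removed (MNAR).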
The ampute package in R was used to generate the missingness. In addition to the three drug use variables, age and employment status were used to generate missingness in the substance use data. Age and employment status were chosen as auxiliary variables because, in the context of this project, they serve as a proxy for the specific characteristics of the cohorts across which we intend to harmonize data, and they help replicate the most plausible missing data pattern in the context of our work.

After the missing data were generated in such a way as to simulate ‘real world’ missing data scenarios that may arise during the data harmonization process, various imputation strategies were used to impute the missing data. The imputation methods comprised two single imputation strategies as well as multiple imputation: logistic regression; single hot-deck imputation; and multiple imputation with five and twenty imputations. These strategies were chosen because they reflect a range from simple to complex, both in the technical expertise required to implement them and in the computational resources needed to execute them. Specifics of each of the imputation strategies are described below. Imputation with logistic regression is a single imputation strategy that produces predicted probabilities obtained by regressing the missing variable on other variables. In this case, the specific drug was the outcome variable, and age, employment status, and cannabis and/or heroin use served as predictor or auxiliary variables. This strategy is technically relatively simple, preserves relationships among the variables involved in the imputation model, and may provide a more informed estimate of the missing value than a strategy that ignores auxiliary variables.
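A minimal Python sketch of single imputation with logistic regression follows. It rests on several assumptions that are not taken from the study: the model is fitted by plain gradient descent rather than a statistical package, missing values are filled by thresholding the predicted probability at 0.5, and the helper names (`fit_logistic`, `predict_proba`, `impute_logistic`) and the toy records are hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    Returns weights as [intercept, coef_1, ..., coef_k].
    """
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, x):
    """Predicted probability of use for one predictor vector."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def impute_logistic(rows, target, predictors, threshold=0.5):
    """Fill missing `target` values from a model fitted on observed rows.

    Thresholding the probability is an assumption of this sketch.
    """
    observed = [r for r in rows if r[target] is not None]
    X = [[r[p] for p in predictors] for r in observed]
    y = [r[target] for r in observed]
    w = fit_logistic(X, y)
    completed = []
    for r in rows:
        filled = dict(r)
        if filled[target] is None:
            p = predict_proba(w, [filled[q] for q in predictors])
            filled[target] = 1 if p >= threshold else 0
        completed.append(filled)
    return completed


# Toy data (not study data): methamphetamine use mostly tracks cannabis use.
rows = []
for i in range(40):
    cannabis, employed = i % 2, (i // 2) % 2
    meth = 1 - cannabis if i < 4 else cannabis  # four discordant records
    rows.append({"employed": employed, "cannabis": cannabis, "meth": meth})
rows.append({"employed": 1, "cannabis": 1, "meth": None})
rows.append({"employed": 0, "cannabis": 0, "meth": None})

completed = impute_logistic(rows, "meth", ["employed", "cannabis"])
```

Binary predictors are used in the toy data to keep gradient descent stable; a continuous predictor such as age would typically be rescaled first.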
Hot-deck imputation is a computationally simple imputation strategy that uses data from an individual in the sample who has similar values on other variables to impute the missing values. Observations that are imputed are labeled recipients, and observations drawn from a pool of matching candidates are labeled donors. For this analysis, donors were matched based on age, employment status, and other substance use information. For example, if a recipient with missing data on methamphetamine was 25 years old, employed, and reported cannabis use, then all 25-year-old, employed participants other than the recipient who reported cannabis use were considered donors; a random observation was taken from this pool, and the methamphetamine use status of the selected donor was used for the recipient.

Instead of using actual observed values from a donor pool, multiple imputation uses a stochastic logistic regression model to generate n sets of data (in this analysis, n was either five or twenty) given pre-specified auxiliary variables. The auxiliary variables used were the same as those described above. For example, five predicted data sets were generated for missing cannabis data using a stochastic logistic regression model composed of age, employment status, and reported methamphetamine and/or heroin use. Multiple imputation is expected to result in lower bias; however, this strategy is computationally intensive and requires technical expertise that may make its regular application less practical.

Finally, in order to allow for direct comparison between the various imputation strategies, the auxiliary variables were the same in all strategies. The Monte Carlo simulation study, from amputation to imputation, was conducted using R.
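The donor matching used in hot-deck imputation, as described above, can be sketched in a few lines of Python. This is an illustration only: the function name `hot_deck_impute` and the toy records are hypothetical, exact matching on every listed variable is assumed, and leaving a value missing when no donor is found is a choice made for this sketch rather than a rule stated in the study.

```python
import random


def hot_deck_impute(rows, target, match_on, rng):
    """Fill missing `target` values from a randomly chosen matching donor.

    A donor is any other record with an observed `target` value and
    identical values on every variable named in `match_on`.
    """
    completed = []
    for recipient in rows:
        filled = dict(recipient)
        if recipient[target] is None:
            donors = [
                d for d in rows
                if d is not recipient
                and d[target] is not None
                and all(d[k] == recipient[k] for k in match_on)
            ]
            if donors:  # assumption: stay missing if the pool is empty
                filled[target] = rng.choice(donors)[target]
        completed.append(filled)
    return completed


# Toy data (not study data), mirroring the example in the text.
records = [
    {"age": 25, "employed": 1, "cannabis": 1, "meth": None},  # recipient
    {"age": 25, "employed": 1, "cannabis": 1, "meth": 1},     # donor
    {"age": 25, "employed": 1, "cannabis": 1, "meth": 1},     # donor
    {"age": 25, "employed": 1, "cannabis": 0, "meth": 0},     # no match
    {"age": 40, "employed": 0, "cannabis": 1, "meth": None},  # no donors
]
imputed = hot_deck_impute(
    records, "meth", ["age", "employed", "cannabis"], random.Random(7)
)
```

Because the imputed value is copied from an observed record, hot-deck imputation can only produce values that actually occur in the data.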
The data amputation and subsequent imputation were repeated 500 times in order to generate a simulated distribution that allowed for calculations to assess the performance of each strategy. We calculated the prevalence estimate resulting from the simulations as the average estimate across the 500 simulations. First, we report prevalence estimates for each of the substances given 10%, 30%, and 50% missingness based on listwise deletion. Listwise deletion, also known as complete case analysis, is the default strategy in most analytic software and provides an estimate of the prevalence and of the potential magnitude of bias if imputation is not employed. Next, we estimated the magnitude of the potential bias as the difference between the prevalence estimate from the original data and the mean of the prevalence estimates across the 500 simulation replicates. We also provide calculations for the root mean squared error (RMSE) as well as coverage of the 95% confidence interval, calculated as the proportion of times the 95% confidence interval of the summary estimate contained the prevalence estimate from the original data.

Comparable to the scenarios with low and medium prevalence outcomes, both single and multiple imputation strategies performed well at lower levels of missingness under an MCAR missing data mechanism. Additionally, none of the strategies were effective under circumstances where data were MNAR. For all levels of missingness and assuming data were MAR, multiple imputation outperformed all other strategies, with five and twenty imputed data sets producing comparable outcomes.
For instance, with 50% missingness, MI with five and twenty data sets resulted in a prevalence estimate of 52% with minimal bias, and the two were otherwise comparable in terms of coverage and RMSE.

We evaluated the performance of different imputation strategies used to address missingness in key variables that thwarts efforts to harmonize data collected as part of HIV cohort studies. Our findings suggest that while multiple imputation is an effective tool for re-creating unbiased prevalence rates of substance use under MAR, single imputation strategies may also be effective if the missing data mechanism is MCAR. Furthermore, we demonstrate that when the missing data mechanism is MAR, ignoring the missingness can result in underestimation of the prevalence estimates, and that single imputation strategies are ineffectual in correcting this bias, especially in cases where the prevalence of the outcome is low. Finally, we demonstrate that none of the imputation strategies are effective if missingness is not at random.