Completed Research Projects 2008-2013

Harmonizing Data Sets for Genomic Research

Investigators: Jane Costello, Rick Hoyle, Madeline Carrig, Jerry Reiter, Krista Ranby, and Daniel Manrique-Vallier

Overview

In the effort to understand the causes of many diseases, the search for genes has moved from the identification of rare high-risk variants to that of common low-risk variants.  The size of samples required for adequate power has correspondingly increased.  In the study of gene-by-environment interaction (G-E) models of disease risk, pooling data from different completed or ongoing studies is viewed as a time- and cost-effective alternative to conducting large new investigations designed to collect detailed phenotypic and “envirotypic” information.  Unfortunately, attempts to pool cases from existing data sets have faced significant challenges to date, in part because studies lack consistent rules and methods for making diagnoses and for defining environmental risk.

Our project sought to develop and test a new methodology for pooling data from studies that used different measures to assess the same or similar constructs.  In the present investigation, data were pooled from the National Longitudinal Study of Adolescent Health (Add Health), Great Smoky Mountains Study (GSMS), and Child Development Project (CDP) data sets.  The proposed data harmonization methodology involved the creation of a calibration data set, in which two or more measures of the same or similar constructs, obtained from the same participants, are compared and the scores on each measure are mapped onto the other.  Calibration samples may be internal to the primary samples of scientific interest (if both measures were used in an existing data set), or may be external (obtained de novo); our work involved both types of samples.  We hope our investigation will provide an important tool for many areas of genomic research.

Activities

  1. C-StARR personnel involved in the “harmonization” project included Rick Hoyle, Jerry Reiter (Department of Statistical Science and SSRI), Madeline Carrig, Daniel Manrique-Vallier (Postdoctoral Associate, SSRI), Krista Ranby (Postdoctoral Associate, C-StARR), and Rose Wilson and Jenny Park, who were hired to work as full-time (Ms. Wilson) and half-time (Ms. Park) data technicians on the project.  We also collaborated with Danielle Dick (Department of Psychiatry and Virginia Institute for Psychiatric and Behavioral Genetics, VCU), who brought to the team special expertise on gene-environment interplay in alcohol use and dependence.  On-site team members met weekly to discuss project progress.
  2. Members of the team acquired the needed permissions and computing/networking infrastructure for access to the Add Health, GSMS, and CDP data sets (IRB approval had already been secured for the secondary data analysis of the Add Health data set).  We reviewed theoretical and empirical literature pertinent to the study of gene-environment interactions in the development of substance use disorders and built the theoretical models to be tested using the harmonized data.
  3. Items from the Add Health, GSMS, and CDP data sets that tap constructs implicated in prior work as environmental/contextual risk factors (especially in the presence of genetic susceptibility) for the development of alcohol use disorders were identified.  This work involved the grouping of individual items that tap the same or similar constructs across waves of data collection and across data sets.  A calibration survey, which included both web-based and telephone interview components, was developed using these items and an additional set of “gold standard” alcohol use items.
  4. A protocol was approved by the IRB for the collection of the external calibration sample data.  Calibration sample data were collected by an outside vendor, Knowledge Networks, between December 2010 and February 2011.  A total of 242 participants aged 19 to 20 completed both the online and telephone interview portions of the survey.  Complete data sets were delivered to C-StARR on March 1, 2011.
  5. Demographic covariates were recoded so that they share common scales and coding schemes across the three existing data sets.  Upon receipt of the calibration sample data, the three existing data sets and the external calibration data set were combined.
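
The recoding and stacking described in step 5 can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the source labels (study_a, study_b, calibration), the codings of the sex covariate, and the item names item_a and item_b are invented for illustration and are not taken from the Add Health, GSMS, or CDP codebooks.

```python
# Hypothetical study-specific codings of one demographic covariate.
# The real codebooks are not reproduced here.
SEX_RECODE = {
    "study_a": {1: "male", 2: "female"},
    "study_b": {"M": "male", "F": "female"},
    "calibration": {0: "female", 1: "male"},
}

# Union of items across all sources (hypothetical names).
ALL_ITEMS = ["item_a", "item_b"]


def harmonize(record, source):
    """Recode one record to the common scheme and tag it with its source."""
    out = {"source": source, "sex": SEX_RECODE[source][record["sex"]]}
    for item in ALL_ITEMS:
        # Items a study never administered become None (an empty cell).
        out[item] = record.get(item)
    return out


raw = {
    "study_a": [{"sex": 1, "item_a": 3}, {"sex": 2, "item_a": 0}],
    "study_b": [{"sex": "F", "item_b": 7}],
    "calibration": [{"sex": 0, "item_a": 2, "item_b": 5},
                    {"sex": 1, "item_a": 4, "item_b": 9}],
}

# Stack every source into one data set with a shared set of columns.
combined = [harmonize(rec, source)
            for source, records in raw.items()
            for rec in records]
```

In the stacked result, only the calibration records carry both items; every other record has at least one empty cell, which is the situation the imputation approach in the Findings section addresses.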

Findings

We treated the problem of harmonizing the data sets as a missing data problem.  The objective was to produce a single data set in which all records have non-missing scores for all items. Treating the empty cells as “missing data,” we framed harmonization as the multivariate imputation of those cells conditional on the observed cells from all data sets. To that end, we employed a flexible iterative framework based on classification and regression trees (CART; see Burgette and Reiter, 2010). This approach is simple to implement, yet flexible enough to accommodate very complex multivariate dependencies without overfitting. Additionally, CART-based algorithms handle continuous, ordinal, and discrete outcomes well (see Friedman et al., 2009).
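
As a rough illustration of tree-based imputation, the following Python sketch grows a small regression tree on the records where a target item is observed, then fills each empty cell with a value drawn from the observed values in the matching leaf. It is a simplified, single-pass sketch under stated assumptions, not the project's implementation and not the full procedure of Burgette and Reiter (2010): the data are simulated, the names item_a and item_b are hypothetical, and the tree-growing code is a bare-bones stand-in for a real CART library.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, target, predictors, min_leaf=5):
    """Grow a small regression tree; each leaf stores observed target values."""
    donors = [r[target] for r in rows]
    best = None
    for p in predictors:
        for cut in sorted({r[p] for r in rows})[1:]:
            left = [r for r in rows if r[p] < cut]
            right = [r for r in rows if r[p] >= cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([r[target] for r in left]) + sse([r[target] for r in right])
            if best is None or score < best[0]:
                best = (score, p, cut, left, right)
    if best is None or best[0] >= sse(donors):
        return {"donors": donors}  # no useful split: make a leaf
    _, p, cut, left, right = best
    return {"var": p, "cut": cut,
            "left": grow_tree(left, target, predictors, min_leaf),
            "right": grow_tree(right, target, predictors, min_leaf)}


def leaf_donors(tree, row):
    """Follow the splits down to the leaf that the row falls into."""
    while "donors" not in tree:
        tree = tree["left"] if row[tree["var"]] < tree["cut"] else tree["right"]
    return tree["donors"]


def impute(rows, target, predictors, rng, min_leaf=5):
    """Return a copy of rows with missing target cells drawn from leaf donors."""
    observed = [r for r in rows
                if r[target] is not None and all(r[p] is not None for p in predictors)]
    tree = grow_tree(observed, target, predictors, min_leaf)
    completed = []
    for r in rows:
        r = dict(r)
        if r[target] is None:
            r[target] = rng.choice(leaf_donors(tree, r))
        completed.append(r)
    return completed


# Simulated data: the calibration sample has both items, each study only one.
rng = random.Random(0)
calibration = [{"item_a": i % 5, "item_b": 2 * (i % 5) + rng.randint(0, 1)}
               for i in range(60)]
study_1 = [{"item_a": i % 5, "item_b": None} for i in range(30)]
study_2 = [{"item_a": None, "item_b": rng.randint(0, 9)} for _ in range(30)]
stacked = calibration + study_1 + study_2

# Repeat the random draws to obtain several completed data sets.
imputations = []
for _ in range(5):
    done = impute(stacked, "item_b", ["item_a"], rng)
    done = impute(done, "item_a", ["item_b"], rng)
    imputations.append(done)
```

Drawing from leaf donors, rather than plugging in the leaf mean, keeps imputed values on the item's original scale and lets repeated runs differ, which is what makes the completed data sets usable for multiple imputation.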

The results of applying these techniques to our data sets are encouraging. We have applied the algorithms to create 100 imputed data sets that can be analyzed using multiple imputation techniques. To assess the quality of the results, we also created several fully synthetic data sets generated from the same models used to impute the missing values. We then compared statistics computed from the original data with the same statistics computed from the synthetic data. In general, relationships between variables within each data set appeared to be well preserved.
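
The kind of comparison described above can be sketched in Python as follows. This is an illustrative sketch only: the data are simulated, the generative step (resampling y among records that share the same x) is a simple stand-in for the fitted imputation models, and the two summary statistics are chosen for illustration rather than taken from the project's diagnostics.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(pairs):
    """Statistics to compare between the original and synthetic data."""
    xs, ys = zip(*pairs)
    return {"mean_y": mean(ys), "corr_xy": pearson(xs, ys)}


rng = random.Random(1)

# Simulated "original" data: y depends strongly on x.
original = [(i % 5, 2 * (i % 5) + rng.randint(0, 1)) for i in range(60)]

# Stand-in generative model: draw y from observed values sharing the same x.
by_x = {}
for x, y in original:
    by_x.setdefault(x, []).append(y)
synthetic = [(x, rng.choice(by_x[x])) for x, _ in original]

orig_stats = summarize(original)
synth_stats = summarize(synthetic)

# Small gaps suggest the model preserves these features of the data.
gaps = {name: abs(orig_stats[name] - synth_stats[name]) for name in orig_stats}
```

Large gaps for a given statistic would point to a relationship that the imputation model fails to capture.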

Next steps include more comprehensive tests, as well as using the harmonized data sets to draw relevant scientific conclusions. The extra uncertainty introduced by the imputation procedure can be handled using standard multiple-imputation techniques (Rubin, 1987).  If the harmonization sample is not to be released, we must develop appropriate techniques to handle that situation (see Reiter, 2008).
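
For reference, the standard combining rules for a scalar estimate (Rubin, 1987) can be written as a short Python sketch. The point estimates and variances below are made-up numbers, and the function name pool is hypothetical; only the formulas themselves are standard.

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Combine m completed-data estimates of a scalar using Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Made-up results from five imputed data sets.
example_estimates = [0.42, 0.47, 0.39, 0.45, 0.44]
example_variances = [0.010, 0.012, 0.011, 0.009, 0.010]
pooled_estimate, pooled_variance = pool(example_estimates, example_variances)
```

The between-imputation term is what carries the extra uncertainty introduced by imputation: the more the estimates disagree across imputed data sets, the larger the pooled variance.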