Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks

Vivek Appadurai*, Jonas Bybjerg-Grauholm, Morten Dybdahl Krebs, Anders Rosengren, Alfonso Buil, Andrés Ingason, Ole Mors, Anders D. Børglum, David M. Hougaard, Merete Nordentoft, Preben B. Mortensen, Olivier Delaneau, Thomas Werge, Andrew J. Schork

*Corresponding author for this work

Publication: Contribution to journal › Journal article › Research › Peer-reviewed

Abstract

Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial, and the impact of integration on haplotype estimation (phasing) and whole genome imputation, both necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals genotyped in two stages on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance the accuracy and replicability of complex trait analyses in complex biobanks.

Original language: English
Article number: 101
Journal: Communications Biology
Volume: 6
ISSN: 2399-3642
DOI
Status: Published - 2023

Bibliographical note

Funding Information:
Vivek Appadurai is supported by the Lundbeck Foundation postdoctoral grant: R380-2021-1465. Andrew Schork is supported by the Lundbeck Foundation fellowship: R335-2019-2318. The iPSYCH team was supported by grants from the Lundbeck Foundation (R102-A9118, R155-2014-1724, and R248-2017-2003), NIMH (1R01MH124851-01 to A.D.B.) and the Universities and University Hospitals of Aarhus and Copenhagen. The Danish National Biobank resource was supported by the Novo Nordisk Foundation. High-performance computer capacity for handling and statistical analysis of iPSYCH data on the GenomeDK HPC facility was provided by the Center for Genomics and Personalized Medicine and the Centre for Integrative Sequencing, iSEQ, Aarhus University, Denmark (grant to A.D.B.).

Publisher Copyright:
© 2023, The Author(s).
