TY - UNPB
T1 - SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method
AU - Andersen, Mikkel Meyer
AU - Christiansen, Steffan
AU - Andersen, Jeppe Dyrberg
AU - Eriksen, Poul Svante
AU - Morling, Niels
PY - 2022/1/20
Y1 - 2022/1/20
N2 - We introduce the “butterfly method” for SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method and uses neither other samples nor population frequencies to call SNPs. The butterfly method is based on a three-component mixture of normal distributions, whose parameters are easily estimated using the open-source statistical software R. This makes the method transparent, makes it straightforward to change parameters according to the user’s needs, and makes it easy to analyse the data within R after the SNPs have been called. We contribute two open-source R packages that make SNP calling easy by helping with bookkeeping and by giving easy access to meta-information about the SNPs on the Illumina Infinium Omni5-4 BeadChip kit (including chromosome, probe type, and SNP bases). We test our method on more than 4 million SNPs and compare the results with those obtained with the GenTrain method used by Illumina GenomeStudio as well as with SNPs obtained by PCR-free whole genome sequencing (WGS). We demonstrate two variants of our method: one where we account for potential probe type bias by estimating a separate model for each probe type (type I and type II) and another that uses a general model such that the model’s parameter estimates do not depend on the sample that is being analysed. We focus on varying the no-call rate and show how it changes the concordance with WGS. This is done by using a threshold on the a posteriori probability of belonging to a SNP cluster and by using the number of beads to adjust the stringency of the no-call mechanism. With the butterfly method, we achieve a SNP call rate of around 99% and a SNP concordance of around 99% with the WGS data. By lowering the a posteriori probability threshold for no-calls, we can get a higher call rate than GenomeStudio, and by using a higher a posteriori probability threshold, we can achieve a higher concordance with the WGS data than GenomeStudio.
AB - We introduce the “butterfly method” for SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method and uses neither other samples nor population frequencies to call SNPs. The butterfly method is based on a three-component mixture of normal distributions, whose parameters are easily estimated using the open-source statistical software R. This makes the method transparent, makes it straightforward to change parameters according to the user’s needs, and makes it easy to analyse the data within R after the SNPs have been called. We contribute two open-source R packages that make SNP calling easy by helping with bookkeeping and by giving easy access to meta-information about the SNPs on the Illumina Infinium Omni5-4 BeadChip kit (including chromosome, probe type, and SNP bases). We test our method on more than 4 million SNPs and compare the results with those obtained with the GenTrain method used by Illumina GenomeStudio as well as with SNPs obtained by PCR-free whole genome sequencing (WGS). We demonstrate two variants of our method: one where we account for potential probe type bias by estimating a separate model for each probe type (type I and type II) and another that uses a general model such that the model’s parameter estimates do not depend on the sample that is being analysed. We focus on varying the no-call rate and show how it changes the concordance with WGS. This is done by using a threshold on the a posteriori probability of belonging to a SNP cluster and by using the number of beads to adjust the stringency of the no-call mechanism. With the butterfly method, we achieve a SNP call rate of around 99% and a SNP concordance of around 99% with the WGS data. By lowering the a posteriori probability threshold for no-calls, we can get a higher call rate than GenomeStudio, and by using a higher a posteriori probability threshold, we can achieve a higher concordance with the WGS data than GenomeStudio.
U2 - 10.1101/2022.01.17.476594
DO - 10.1101/2022.01.17.476594
M3 - Preprint
BT - SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method
PB - bioRxiv
ER -