TY - JOUR
T1 - Dealing with dimensionality
T2 - the application of machine learning to multi-omics data
AU - Feldner-Busztin, Dylan
AU - Nisantzis, Panos Firbas
AU - Edmunds, Shelley Jane
AU - Boza, Gergely
AU - Racimo, Fernando
AU - Gopalakrishnan, Shyam
AU - Limborg, Morten Tønsberg
AU - Lahti, Leo
AU - de Polavieja, Gonzalo G.
PY - 2023
Y1 - 2023
N2 - Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next-generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: or in Zenodo: . Supplementary information: Supplementary data are available at Bioinformatics online.
AB - Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next-generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: or in Zenodo: . Supplementary information: Supplementary data are available at Bioinformatics online.
KW - CAUSAL INFERENCE
KW - GENE
KW - EXPRESSION
KW - MODELS
U2 - 10.1093/bioinformatics/btad021
DO - 10.1093/bioinformatics/btad021
M3 - Journal article
C2 - 36637211
VL - 39
JO - Bioinformatics (Online)
JF - Bioinformatics (Online)
SN - 1367-4811
IS - 2
M1 - btad021
ER -