A Novel Framework for the Identification of Reference DNA Methylation Libraries for Reference-Based Deconvolution of Cellular Mixtures

Bibliographic Details
Published in: Frontiers in Bioinformatics
Main Authors: Bell-Glenn, Shelby; Thompson, Jeffrey A.; Salas, Lucas A.; Koestler, Devin C.
Format: Text
Language: English
Published: 2022
Subjects:
DML
Online Access: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9004796/
http://www.ncbi.nlm.nih.gov/pubmed/35419567
https://doi.org/10.3389/fbinf.2022.835591
Description
Summary: Reference-based deconvolution methods use reference libraries of cell-specific DNA methylation (DNAm) measurements as a means toward deconvoluting cell proportions in heterogeneous biospecimens (e.g., whole-blood). As the accuracy of such methods depends highly on the CpG loci comprising the reference library, recent research efforts have focused on the selection of libraries to optimize deconvolution accuracy. While existing approaches for library selection work extremely well, the best performing approaches require a training dataset consisting of both DNAm profiles over a heterogeneous cell population and gold-standard measurements of cell composition (e.g., flow cytometry) in the same samples. Here, we present a framework for reference library selection without a training dataset (RESET) and benchmark it against the Legacy method (minfi:pickCompProbes), where libraries are constructed based on a pre-specified number of cell-specific differentially methylated loci (DML). RESET uses a modified version of the Dispersion Separability Criteria (DSC) for comparing different libraries and has four main steps: (1) identify a candidate set of cell-specific DMLs, (2) randomly sample DMLs from the candidate set, (3) compute the Modified DSC of the selected DMLs, and (4) update the selection probabilities of DMLs based on their contribution to the Modified DSC. Steps 2–4 are repeated many times and the library with the largest Modified DSC is selected for subsequent reference-based deconvolution. We evaluated RESET using several publicly available datasets consisting of whole-blood DNAm measurements with corresponding measurements of cell composition. We computed the RMSE and R² between the predicted cell proportions and their measured values. RESET outperformed the Legacy approach in selecting libraries that improve the accuracy of deconvolution estimates. Additionally, reference libraries constructed using RESET resulted in cellular composition estimates that explained more variation in DNAm as compared to the ...
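Illustrative sketch: The sample, score, and update loop described in steps 2 to 4 of the summary can be outlined in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names (separability_score, reset_sketch), the parameters (library_size, n_iter, step), the leave-one-out definition of a locus's "contribution", and the multiplicative probability update are all hypothetical. The score is a plain ratio of between-cell-type to within-cell-type dispersion, used here as a stand-in because this record does not define the Modified DSC. Step 1 (identifying candidate DMLs) is assumed to have been done already and is passed in as candidates.

```python
import numpy as np

def separability_score(ref, labels):
    """Between/within dispersion ratio (a stand-in, NOT the paper's Modified DSC).

    ref: (n_samples, n_loci) methylation values from purified cell samples.
    labels: (n_samples,) cell type of each sample.
    """
    types = np.unique(labels)
    centroids = np.stack([ref[labels == t].mean(axis=0) for t in types])
    counts = np.array([(labels == t).sum() for t in types])
    overall = ref.mean(axis=0)
    # Dispersion of the cell-type centroids around the overall centroid.
    between = np.sqrt((counts * ((centroids - overall) ** 2).sum(axis=1)).sum() / len(labels))
    # Dispersion of the samples around their own cell-type centroid.
    within = np.sqrt(
        sum(((ref[labels == t] - centroids[i]) ** 2).sum() for i, t in enumerate(types))
        / len(labels)
    )
    return between / within

def reset_sketch(ref, labels, candidates, library_size, n_iter=200, step=0.1, seed=0):
    """Hypothetical sample/score/update loop; requires library_size >= 2."""
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidates)
    probs = np.full(len(candidates), 1.0 / len(candidates))
    best_score, best_library = -np.inf, None
    for _ in range(n_iter):
        # Step 2: sample a library from the candidate set.
        picked = rng.choice(len(candidates), size=library_size, replace=False, p=probs)
        loci = candidates[picked]
        # Step 3: score the sampled library.
        score = separability_score(ref[:, loci], labels)
        if score > best_score:
            best_score, best_library = score, np.sort(loci)
        # Step 4 (assumed rule): raise the probability of loci whose removal
        # lowers the score, and lower it for loci whose removal raises it.
        for j, idx in enumerate(picked):
            rest = np.delete(loci, j)
            contribution = score - separability_score(ref[:, rest], labels)
            probs[idx] *= np.exp(step * np.sign(contribution))
        probs /= probs.sum()
    return best_library, best_score
```

Calling reset_sketch(ref, labels, candidates, library_size=4) returns the highest-scoring sampled library and its score. The returned loci would then serve as the reference library for whichever deconvolution routine is in use.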