Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models

Bibliographic Details
Main Authors: Belitz, Kenneth; Stackelberg, Paul E.; Sharpe, Jennifer B.
Format: Dataset
Language: unknown
Published: U.S. Geological Survey 2021
Subjects:
Online Access: https://dx.doi.org/10.5066/p9lctyi2
https://www.sciencebase.gov/catalog/item/5f61fd2482ce38aaa235c07a
Description
Summary: Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias: (1) empirical distribution matching (EDM); (2) regression of observed on estimated values (ROE); (3) linear transfer function (LTF); (4) linear equation based on Z-score transform (ZZ); (5) a second machine learning model used to estimate residuals (ML2-RES); and (6) Duan smearing estimate applied after ROE is implemented (ROE-Duan).

The performance of the methods was evaluated using four previously published ML case studies of groundwater quality: (1) pH in the glacial aquifer system; (2) pH in the North Atlantic Coastal Plain; (3) nitrate in the Central Valley of California; and (4) iron in the Mississippi Embayment.

This data release includes nine tables. Eight are data tables: one of training data and one of holdout data for each of the four case studies. Each data table includes observed values and ML estimates, which were obtained from previously published reports (Ransom and others, 2017; DeSimone and others, 2020; Knierim and others, 2020; Stackelberg and others, 2020), along with bias-corrected values for each data point. The methods for obtaining the bias-corrected values are described in the primary related publication (Belitz and Stackelberg, 2021). The ninth table includes coefficients of the equations associated with selected bias-correction methods for each case study. Not all of the methods were applied to all of the case studies.
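To give a concrete sense of what three of the methods named in the summary do, the following is a minimal Python sketch using NumPy and synthetic data. It is not the procedure from the related publication: the function names, the synthetic pH-like values, the use of quantile mapping by interpolation for EDM, and the natural-log form of the Duan smearing factor are all assumptions made for illustration.

```python
import numpy as np


def fit_roe(observed, estimated):
    """Least-squares fit of observed = intercept + slope * estimated (assumed form of ROE)."""
    slope, intercept = np.polyfit(estimated, observed, deg=1)
    return slope, intercept


def apply_roe(estimated, slope, intercept):
    """Apply the fitted line to ML estimates to get bias-corrected values."""
    return intercept + slope * np.asarray(estimated, dtype=float)


def edm_correct(estimated, train_observed, train_estimated):
    """Generic empirical distribution matching: map each estimate to the
    training observation at the same empirical rank. Estimates outside the
    training range are clamped to the end points by np.interp."""
    sorted_est = np.sort(train_estimated)
    sorted_obs = np.sort(train_observed)
    return np.interp(estimated, sorted_est, sorted_obs)


def duan_smearing_factor(log_observed, log_fitted):
    """Mean of back-transformed residuals, assuming a natural-log transform.
    Multiply exp(log-scale estimate) by this factor when retransforming."""
    residuals = np.asarray(log_observed) - np.asarray(log_fitted)
    return float(np.mean(np.exp(residuals)))


# Synthetic illustration (hypothetical values, not from the data release):
# the "ML estimates" have a compressed range, so low values are
# overestimated and high values are underestimated.
rng = np.random.default_rng(0)
observed = rng.normal(7.0, 1.0, size=500)
estimated = 7.0 + 0.6 * (observed - 7.0) + rng.normal(0.0, 0.2, size=500)

slope, intercept = fit_roe(observed, estimated)
roe_corrected = apply_roe(estimated, slope, intercept)
edm_corrected = edm_correct(estimated, observed, estimated)
```

In this sketch a fitted ROE slope greater than 1 indicates that the estimates span a narrower range than the observations, which is the compression pattern the summary describes. In practice the coefficients would be fit on training data and then applied to holdout data.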