Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models

Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to...


Bibliographic Details
Main Authors:	Belitz, Kenneth; Stackelberg, Paul E; Sharpe, Jennifer B
Format:	Dataset
Language:	unknown
Published:	U.S. Geological Survey 2021
Subjects:	Water Quality; North Atlantic
Online Access:	https://dx.doi.org/10.5066/p9lctyi2 https://www.sciencebase.gov/catalog/item/5f61fd2482ce38aaa235c07a

id	ftdatacite:10.5066/p9lctyi2
record_format	openpolar
spelling	ftdatacite:10.5066/p9lctyi2 2023-05-15T17:35:19+02:00 Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models Belitz, Kenneth Stackelberg, Paul E Sharpe, Jennifer B 2021 https://dx.doi.org/10.5066/p9lctyi2 https://www.sciencebase.gov/catalog/item/5f61fd2482ce38aaa235c07a unknown U.S. Geological Survey https://dx.doi.org/10.1016/j.envsoft.2021.105006 Water Quality dataset Dataset 2021 ftdatacite https://doi.org/10.5066/p9lctyi2 https://doi.org/10.1016/j.envsoft.2021.105006 2021-11-05T12:55:41Z Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias: (1) empirical distribution matching (EDM); (2) regression of observed on estimated values (ROE); (3) linear transfer function (LTF); (4) linear equation based on Z-score transform (ZZ); (5) second machine learning model used to estimate residuals (ML2-RES); and (6) Duan smearing estimate applied after ROE is implemented (ROE-Duan). The performance of the methods was evaluated using four previously published ML case studies of groundwater quality: (1) pH in the glacial aquifer system; (2) pH in the North Atlantic Coastal Plain; (3) nitrate in the Central Valley of California; and (4) iron in the Mississippi Embayment. This data release includes nine tables. For each of the four case studies, there are training data and holdout data; hence there are eight data tables. Each of the data tables includes observed values and ML estimates; these were obtained from previously published reports (Ransom and others, 2017; DeSimone and others, 2020; Knierim and others, 2020; Stackelberg and others, 2020). Each of the tables also includes bias-corrected values for each of the data points. The methods for obtaining the bias-corrected values are described in the primary related publication (Belitz and Stackelberg, 2021). The ninth table includes coefficients of equations associated with selected bias-correction methods for each of the case studies. Not all of the methods were applied to all of the case studies. Dataset North Atlantic DataCite Metadata Store (German National Library of Science and Technology)
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	unknown
topic	Water Quality
spellingShingle	Water Quality Belitz, Kenneth Stackelberg, Paul E Sharpe, Jennifer B Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
topic_facet	Water Quality
description	Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias: (1) empirical distribution matching (EDM); (2) regression of observed on estimated values (ROE); (3) linear transfer function (LTF); (4) linear equation based on Z-score transform (ZZ); (5) second machine learning model used to estimate residuals (ML2-RES); and (6) Duan smearing estimate applied after ROE is implemented (ROE-Duan). The performance of the methods was evaluated using four previously published ML case studies of groundwater quality: (1) pH in the glacial aquifer system; (2) pH in the North Atlantic Coastal Plain; (3) nitrate in the Central Valley of California; and (4) iron in the Mississippi Embayment. This data release includes nine tables. For each of the four case studies, there are training data and holdout data; hence there are eight data tables. Each of the data tables includes observed values and ML estimates; these were obtained from previously published reports (Ransom and others, 2017; DeSimone and others, 2020; Knierim and others, 2020; Stackelberg and others, 2020). Each of the tables also includes bias-corrected values for each of the data points. The methods for obtaining the bias-corrected values are described in the primary related publication (Belitz and Stackelberg, 2021). The ninth table includes coefficients of equations associated with selected bias-correction methods for each of the case studies. Not all of the methods were applied to all of the case studies.
format	Dataset
author	Belitz, Kenneth Stackelberg, Paul E Sharpe, Jennifer B
author_facet	Belitz, Kenneth Stackelberg, Paul E Sharpe, Jennifer B
author_sort	Belitz, Kenneth
title	Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
title_short	Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
title_full	Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
title_fullStr	Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
title_full_unstemmed	Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
title_sort	data release for evaluation of six methods for correcting bias in estimates from ensemble tree machine learning regression models
publisher	U.S. Geological Survey
publishDate	2021
url	https://dx.doi.org/10.5066/p9lctyi2 https://www.sciencebase.gov/catalog/item/5f61fd2482ce38aaa235c07a
genre	North Atlantic
genre_facet	North Atlantic
op_relation	https://dx.doi.org/10.1016/j.envsoft.2021.105006
op_doi	https://doi.org/10.5066/p9lctyi2 https://doi.org/10.1016/j.envsoft.2021.105006
_version_	1766134445785481216
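
The description names several bias-correction methods without giving their equations. The sketch below illustrates one plausible reading of four of them (ROE, ZZ, EDM, and a Duan smearing factor) on synthetic data. It is not taken from the data release or from Belitz and Stackelberg (2021): the function names, the exact form of the ZZ line, the use of natural logarithms for the smearing step, and the synthetic pH-like data are all assumptions made for illustration.

```python
import numpy as np


def fit_roe(obs_train, est_train):
    """ROE: ordinary least squares fit of observed on estimated values."""
    slope, intercept = np.polyfit(est_train, obs_train, deg=1)
    return intercept, slope


def fit_zz(obs_train, est_train):
    """ZZ (assumed form): line that equates Z-scores of estimates and observations."""
    obs = np.asarray(obs_train, dtype=float)
    est = np.asarray(est_train, dtype=float)
    slope = obs.std(ddof=1) / est.std(ddof=1)
    intercept = obs.mean() - slope * est.mean()
    return intercept, slope


def apply_linear(est, intercept, slope):
    """Apply a fitted linear correction to ML estimates."""
    return intercept + slope * np.asarray(est, dtype=float)


def edm_correct(est, obs_train, est_train):
    """EDM (assumed form): map each estimate to the observed value that sits
    at the same empirical rank in the training data, interpolating between ranks."""
    est_sorted = np.sort(np.asarray(est_train, dtype=float))
    obs_sorted = np.sort(np.asarray(obs_train, dtype=float))
    return np.interp(est, est_sorted, obs_sorted)


def duan_smearing_factor(log_obs_train, log_est_train):
    """Duan smearing factor for a natural-log transform: mean of exp(residuals)."""
    resid = np.asarray(log_obs_train, dtype=float) - np.asarray(log_est_train, dtype=float)
    return float(np.mean(np.exp(resid)))


# Synthetic training data (hypothetical): estimates are compressed toward the mean,
# mimicking overestimated small values and underestimated large values.
rng = np.random.default_rng(0)
obs = rng.normal(loc=7.0, scale=1.0, size=500)
est = 7.0 + 0.5 * (obs - 7.0) + rng.normal(scale=0.2, size=500)

roe = apply_linear(est, *fit_roe(obs, est))
zz = apply_linear(est, *fit_zz(obs, est))
edm = edm_correct(est, obs, est)

# Treating obs and roe as if they were natural-log values, the smearing factor
# rescales back-transformed estimates: exp(roe) * smear.
smear = duan_smearing_factor(obs, roe)
back_transformed = np.exp(roe) * smear
```

In this sketch every correction is fitted on training data only. Applying the fitted coefficients, or the sorted training arrays for EDM, to holdout estimates would mirror the training and holdout split described in the record.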


Similar Items