A machine learning approach to predict ethnicity using personal name and census location in Canada.

Background Canada is an ethnically-diverse country, yet its lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap but its performance in Can...


Bibliographic Details
Published in: PLOS ONE
Main Authors: Kai On Wong, Osmar R Zaïane, Faith G Davis, Yutaka Yasui
Format: Article in Journal/Newspaper
Language: English
Published: Public Library of Science (PLoS) 2020
Subjects: Medicine (R); Science (Q)
Online Access: https://doi.org/10.1371/journal.pone.0241239
https://doaj.org/article/1e1aee1f79ad46c9b2893f5de0ea196b
id ftdoajarticles:oai:doaj.org/article:1e1aee1f79ad46c9b2893f5de0ea196b
record_format openpolar
institution Open Polar
collection Directory of Open Access Journals: DOAJ Articles
op_collection_id ftdoajarticles
language English
topic Medicine
R
Science
Q
topic_facet Medicine
R
Science
Q
description Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.
Methods: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy.
Results: The census contained 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classification of Chinese, French, Italian, Japanese, Russian, and others, F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved Aboriginal classification (F1 increased from 50% to 84%).
Conclusions: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by ...
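The performance metrics named in the description above all derive from the four cells of a binary confusion matrix. The following is a minimal Python sketch of those standard definitions, not the authors' code: the names `ConfusionCounts` and `binary_metrics` are hypothetical, and the counts in the usage example are invented for illustration and are not taken from the study. Area under the ROC curve is left out because it needs ranked prediction scores rather than counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix cells for one binary task (e.g. 'French' vs. 'not French')."""
    tp: int  # predicted positive, truly positive
    fp: int  # predicted positive, truly negative
    tn: int  # predicted negative, truly negative
    fn: int  # predicted negative, truly positive


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the count-based metrics listed in the abstract."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall, true positive rate
    specificity = _ratio(c.tn, c.tn + c.fp)  # true negative rate
    ppv = _ratio(c.tp, c.tp + c.fp)          # positive predictive value (precision)
    npv = _ratio(c.tn, c.tn + c.fn)          # negative predictive value
    # F1 is the harmonic mean of PPV and sensitivity.
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Invented counts for a rare positive class (100 positives in 1,000 records).
    example = ConfusionCounts(tp=80, fp=20, tn=880, fn=20)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

With these invented counts, accuracy comes out higher than F1 because the many true negatives count toward accuracy but not toward F1. This is one general reason the two figures can diverge when a class is rare.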
format Article in Journal/Newspaper
author Kai On Wong
Osmar R Zaïane
Faith G Davis
Yutaka Yasui
author_facet Kai On Wong
Osmar R Zaïane
Faith G Davis
Yutaka Yasui
author_sort Kai On Wong
title A machine learning approach to predict ethnicity using personal name and census location in Canada.
title_sort machine learning approach to predict ethnicity using personal name and census location in canada.
publisher Public Library of Science (PLoS)
publishDate 2020
url https://doi.org/10.1371/journal.pone.0241239
https://doaj.org/article/1e1aee1f79ad46c9b2893f5de0ea196b
geographic Canada
geographic_facet Canada
genre First Nations
inuit
genre_facet First Nations
inuit
op_source PLoS ONE, Vol 15, Iss 11, p e0241239 (2020)
op_relation https://doi.org/10.1371/journal.pone.0241239
https://doaj.org/toc/1932-6203
1932-6203
doi:10.1371/journal.pone.0241239
https://doaj.org/article/1e1aee1f79ad46c9b2893f5de0ea196b
op_doi https://doi.org/10.1371/journal.pone.0241239
container_title PLOS ONE
container_volume 15
container_issue 11
container_start_page e0241239
_version_ 1766002795608014848