Identifying the optimal number of topics in text mining: a case study on reindeer pastoralism literature

Text mining and topic analysis algorithms, which group textual contents in the most efficient way, are becoming increasingly useful for summarising the main information contained in the large data corpora of complex scientific fields. Using the literature about reindeer pastoralism as a case study, this meth...

Bibliographic Details
Published in: Italian Journal of Animal Science
Main Authors: Contiero, Barbara, Holand, Øystein, Cozzi, Giulio
Format: Article in Journal/Newspaper
Language: English
Published: Taylor and Francis Ltd. 2024
Subjects:
Online Access: https://hdl.handle.net/11577/3540862
https://doi.org/10.1080/1828051x.2024.2398168
author Contiero, Barbara
Holand, Øystein
Cozzi, Giulio
author_sort Contiero, Barbara
collection Padua Research Archive (IRIS - Università degli Studi di Padova)
container_issue 1
container_start_page 1348
container_title Italian Journal of Animal Science
container_volume 23
description Text mining and topic analysis algorithms, which group textual contents in the most efficient way, are becoming increasingly useful for summarising the main information contained in the large data corpora of complex scientific fields. Using the literature about reindeer pastoralism as a case study, this methodological investigation addressed how to identify the number of topics that provides the best in-depth interpretation of a large data corpus. A total of 2,875 documents on reindeer pastoralism extracted from Scopus® were used. Four simulations with 8, 10, 12, and 20 topics were carried out to define the number of topics that best explained the issues related to reindeer husbandry. The results showed that a reasonable trade-off between the number of articles and the number of topics, based on the reduction of the variance explained within the group, leads to an optimal choice in the search for the most meaningful simulation. Adopting too large a number of topics excessively fragments the data corpus into small aggregations of documents and encourages the emergence of topics without any technical or practical meaning, solely as a result of the unsupervised iterative process.
HIGHLIGHTS
Text mining for insight into vast and complex scientific fields: a case study on reindeer pastoralism.
Optimising topic identification to strike a balance between the size of the article corpus and the number of topics and achieve the most insightful results.
Too many topics can lead to fragmentation and irrelevant results, while too few may oversimplify the complexity of the dataset.
format Article in Journal/Newspaper
genre reindeer husbandry
id ftunivpadovairis:oai:www.research.unipd.it:11577/3540862
institution Open Polar
language English
op_collection_id ftunivpadovairis
op_container_end_page 1357
op_doi https://doi.org/10.1080/1828051x.2024.2398168
op_relation info:eu-repo/semantics/altIdentifier/wos/WOS:001304463800001
volume:23
issue:1
firstpage:1348
lastpage:1357
numberofpages:10
journal:ITALIAN JOURNAL OF ANIMAL SCIENCE
https://hdl.handle.net/11577/3540862
doi:10.1080/1828051x.2024.2398168
publishDate 2024
publisher Taylor and Francis Ltd.
record_format openpolar
title Identifying the optimal number of topics in text mining: a case study on reindeer pastoralism literature
topic Number of topic
reindeer pastoralism
simulation
text mining
topic analysis