Software Framework for Topic Modelling with Large Corpora

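Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.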
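
The framework described in this record rests on document streaming, i.e. processing a corpus one document at a time so that memory use does not grow with corpus size. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The names StreamedCorpus, document_frequencies and open_stream, as well as the sample text and vocabulary, are hypothetical and chosen for illustration.

```python
import io
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

BagOfWords = List[Tuple[int, int]]  # sparse vector: (term id, count) pairs


class StreamedCorpus:
    """Hypothetical corpus that yields one bag-of-words vector at a time."""

    def __init__(self, open_stream: Callable[[], Iterable[str]],
                 vocabulary: Dict[str, int]) -> None:
        # open_stream returns a fresh iterable of lines (one document per
        # line) on every call, so the corpus can be read more than once.
        self.open_stream = open_stream
        self.vocabulary = vocabulary

    def __iter__(self) -> Iterator[BagOfWords]:
        for line in self.open_stream():
            # Only the current document is held in memory.
            counts = Counter(
                self.vocabulary[token]
                for token in line.lower().split()
                if token in self.vocabulary
            )
            yield sorted(counts.items())


def document_frequencies(corpus: Iterable[BagOfWords]) -> Counter:
    """Count, in a single pass, how many documents contain each term id."""
    df: Counter = Counter()
    for bow in corpus:
        df.update(term_id for term_id, _ in bow)
    return df


# Illustrative data; in practice open_stream would open a file on disk.
TEXT = (
    "topic models scale\n"
    "large corpora need streaming\n"
    "topic models need streaming\n"
)
VOCAB = {"topic": 0, "models": 1, "streaming": 2, "need": 3}

corpus = StreamedCorpus(lambda: io.StringIO(TEXT), VOCAB)
df = document_frequencies(corpus)
```

Because the consumer only ever sees an iterator, the statistic is accumulated without loading the whole corpus; only the accumulator and the current document occupy memory.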

Bibliographic Details
Main Authors: Radim Řehůřek, Petr Sojka
Format: Conference Object
Language: English
Published: 2010
Subjects:
DML
Online Access: https://zenodo.org/record/1034483
https://doi.org/10.13140/2.1.2393.1847
id ftzenodo:oai:zenodo.org:1034483
record_format openpolar
spelling ftzenodo:oai:zenodo.org:1034483 2023-05-15T16:01:54+02:00 Software Framework for Topic Modelling with Large Corpora Radim Řehůřek Petr Sojka 2010-05-17 https://zenodo.org/record/1034483 https://doi.org/10.13140/2.1.2393.1847 eng eng https://zenodo.org/record/1034483 https://doi.org/10.13140/2.1.2393.1847 oai:zenodo.org:1034483 info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/4.0/legalcode Natural Language Processing info:eu-repo/semantics/conferencePaper publication-conferencepaper 2010 ftzenodo https://doi.org/10.13140/2.1.2393.1847 2023-03-11T02:11:08Z Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ. Conference Object DML Zenodo
institution Open Polar
collection Zenodo
op_collection_id ftzenodo
language English
topic Natural Language Processing
description Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
format Conference Object
author Radim Řehůřek
Petr Sojka
title Software Framework for Topic Modelling with Large Corpora
publishDate 2010
url https://zenodo.org/record/1034483
https://doi.org/10.13140/2.1.2393.1847
genre DML
op_relation https://zenodo.org/record/1034483
https://doi.org/10.13140/2.1.2393.1847
oai:zenodo.org:1034483
op_rights info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/4.0/legalcode
op_doi https://doi.org/10.13140/2.1.2393.1847
_version_ 1766397585053974528