Software Framework for Topic Modelling with Large Corpora

Large corpora are ubiquitous in today's world, and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations: their scalability and ease of use. We describe a Natural Language Processing software...

Bibliographic Details
Main Authors:	Řehůřek Radim, Sojka Petr
Format:	Article in Journal/Newspaper
Language:	English
Published:	University of Malta 2010
Subjects:	document similarity NLP software vector space model topical modelling software framework topical document similarity Python IR LSA LDA gensim DML-CZ document vector model DML
Online Access:	https://is.muni.cz/publication/884893

id	ftmasarykis:oai:is.muni.cz:884893
record_format	openpolar
institution	Open Polar
collection	Masaryk University: Open Services of Information System
op_collection_id	ftmasarykis
language	English
topic	document similarity NLP software vector space model topical modelling software framework topical document similarity Python IR LSA LDA gensim DML-CZ document vector model
description	Large corpora are ubiquitous in today's world, and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations: their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora one document after another, in a memory-independent fashion. In this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ. [Translated from the Czech abstract:] Large corpora are ubiquitous today. When their full text is processed in a vector representation (document similarity), memory size soon becomes the limiting factor. We identified and filled a gap in well-scalable implementations of several popular algorithms. We describe an easy-to-use NLP software framework based on the idea of streamed document processing, that is, processing one document after another, in constant memory with respect to the number of documents. We implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that is independent of corpus size. Emphasis is placed on a straightforward and intuitive design, so that modifying and extending the methods and using them in practice is as simple as possible. We demonstrate the usefulness of our approach by deploying the software to compute document similarities in the existing digital mathematics library DML-CZ.
format	Article in Journal/Newspaper
author	Řehůřek Radim Sojka Petr
title	Software Framework for Topic Modelling with Large Corpora
publisher	University of Malta
publishDate	2010
url	https://is.muni.cz/publication/884893
genre	DML
genre_facet	DML
op_source	Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks
op_relation	https://is.muni.cz/publication/884893
op_rights	info:eu-repo/semantics/restrictedAccess
_version_	1810441273881919488

Software Framework for Topic Modelling with Large Corpora

Similar Items