HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup

Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are re...

Full description

Bibliographic Details
Main Author: Miikka Silfverberg
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language:English
Subjects:
Online Access:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753
http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf
id ftciteseerx:oai:CiteSeerX.psu:10.1.1.541.1753
record_format openpolar
spelling ftciteseerx:oai:CiteSeerX.psu:10.1.1.541.1753 2023-05-15T18:12:25+02:00 HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup Miikka Silfverberg The Pennsylvania State University CiteSeerX Archives application/pdf http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753 http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf en eng http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753 http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf Metadata may be used without restrictions as long as the oai identifier remains attached to it. http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf text ftciteseerx 2016-01-08T11:04:21Z Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are represented also plays a large role, since this de-termines how eciently transitions can be accessed. We consider three principal ways to represent transition sets leading to three dierent trans-ducer formats. We call them the linked list model, the ordered array model and the random access model. As test material for evaluating the speed allowed by the formats, we use two full- edged morphologi-cal descriptions: Morphalou for French, which is a full-form morpholog-ical lexicon resulting in an acyclic transducer, and the Divvun lexicon for Northern Sami, which is a morphological description with a pro-ductive compounding mechanism resulting in a cyclic transducer. Both have sucient coverage to be used in real applications. The random ac-cess model using Liang compaction implemented in the HFSTHelsinki Finite-State Technology runtime is the fastest and achieves a through-put of 119 000-408 000 w/s on running text with a moderate memory consumption. 1 Text sami Unknown
institution Open Polar
collection Unknown
op_collection_id ftciteseerx
language English
description Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are represented also plays a large role, since this de-termines how eciently transitions can be accessed. We consider three principal ways to represent transition sets leading to three dierent trans-ducer formats. We call them the linked list model, the ordered array model and the random access model. As test material for evaluating the speed allowed by the formats, we use two full- edged morphologi-cal descriptions: Morphalou for French, which is a full-form morpholog-ical lexicon resulting in an acyclic transducer, and the Divvun lexicon for Northern Sami, which is a morphological description with a pro-ductive compounding mechanism resulting in a cyclic transducer. Both have sucient coverage to be used in real applications. The random ac-cess model using Liang compaction implemented in the HFSTHelsinki Finite-State Technology runtime is the fastest and achieves a through-put of 119 000-408 000 w/s on running text with a moderate memory consumption. 1
author2 The Pennsylvania State University CiteSeerX Archives
format Text
author Miikka Silfverberg
spellingShingle Miikka Silfverberg
HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
author_facet Miikka Silfverberg
author_sort Miikka Silfverberg
title HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
title_short HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
title_full HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
title_fullStr HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
title_full_unstemmed HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
title_sort hfst runtime format- a compacted transducer format allowing for fast lookup
url http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753
http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf
genre sami
genre_facet sami
op_source http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf
op_relation http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753
http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf
op_rights Metadata may be used without restrictions as long as the oai identifier remains attached to it.
_version_ 1766184948240220160