HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup
Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are re...
Main Author: | |
---|---|
Other Authors: | |
Format: | Text |
Language: | English |
Subjects: | |
Online Access: | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753 http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf |
Summary: | Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are represented also plays a large role, since this de-termines how eciently transitions can be accessed. We consider three principal ways to represent transition sets leading to three dierent trans-ducer formats. We call them the linked list model, the ordered array model and the random access model. As test material for evaluating the speed allowed by the formats, we use two full- edged morphologi-cal descriptions: Morphalou for French, which is a full-form morpholog-ical lexicon resulting in an acyclic transducer, and the Divvun lexicon for Northern Sami, which is a morphological description with a pro-ductive compounding mechanism resulting in a cyclic transducer. Both have sucient coverage to be used in real applications. The random ac-cess model using Liang compaction implemented in the HFSTHelsinki Finite-State Technology runtime is the fastest and achieves a through-put of 119 000-408 000 w/s on running text with a moderate memory consumption. 1 |
---|