HFST Runtime Format- A Compacted Transducer Format Allowing for Fast Lookup

Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are re...

Full description

Bibliographic Details
Main Author: Miikka Silfverberg
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language:English
Subjects:
Online Access:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.541.1753
http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/hfst-runtime/fsmnlp2009lookup.pdf
Description
Summary:Abstract. Lexical transducers form part of most language-aware appli-cations, which means that less time spent on lexical lookup will have wide-ranging eects. The eciency of a morphological analyzer stems mainly from the properties of the underlying transducer, but the way its transition sets are represented also plays a large role, since this de-termines how eciently transitions can be accessed. We consider three principal ways to represent transition sets leading to three dierent trans-ducer formats. We call them the linked list model, the ordered array model and the random access model. As test material for evaluating the speed allowed by the formats, we use two full- edged morphologi-cal descriptions: Morphalou for French, which is a full-form morpholog-ical lexicon resulting in an acyclic transducer, and the Divvun lexicon for Northern Sami, which is a morphological description with a pro-ductive compounding mechanism resulting in a cyclic transducer. Both have sucient coverage to be used in real applications. The random ac-cess model using Liang compaction implemented in the HFSTHelsinki Finite-State Technology runtime is the fastest and achieves a through-put of 119 000-408 000 w/s on running text with a moderate memory consumption. 1