Low-Resource Active Learning of Morphological Segmentation

Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for man...


Bibliographic Details
Published in: Northern European Journal of Language Technology
Main Authors: Grönroos, Stig-Arne; Hiovain, Katri; Smit, Peter; Rauhala, Ilona; Jokinen, Kristiina; Kurimo, Mikko; Virpioja, Sami
Format: Article in Journal/Newspaper
Language: unknown
Published: Linköping University Electronic Press 2016
Online Access: http://dx.doi.org/10.3384/nejlt.2000-1533.1644
https://nejlt.ep.liu.se/article/download/1662/1005
id crlinkopinguep:10.3384/nejlt.2000-1533.1644
record_format openpolar
institution Open Polar
collection LiU Electronic Press (Linköping University)
op_collection_id crlinkopinguep
language unknown
description Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
format Article in Journal/Newspaper
author Grönroos, Stig-Arne
Hiovain, Katri
Smit, Peter
Rauhala, Ilona
Jokinen, Kristiina
Kurimo, Mikko
Virpioja, Sami
title Low-Resource Active Learning of Morphological Segmentation
publisher Linköping University Electronic Press
publishDate 2016
url http://dx.doi.org/10.3384/nejlt.2000-1533.1644
https://nejlt.ep.liu.se/article/download/1662/1005
genre North Sámi
Sámi
op_source Northern European Journal of Language Technology
volume 4, page 47-72
ISSN 2000-1533
op_doi https://doi.org/10.3384/nejlt.2000-1533.1644
container_title Northern European Journal of Language Technology
container_volume 4
container_start_page 47
op_container_end_page 72