Low-Resource Active Learning of Morphological Segmentation
Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for man...
Published in: | Northern European Journal of Language Technology |
---|---|
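Main Authors: | Grönroos, Stig-Arne; Hiovain, Katri; Smit, Peter; Rauhala, Ilona; Jokinen, Kristiina; Kurimo, Mikko; Virpioja, Sami |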
Format: | Article in Journal/Newspaper |
Language: | unknown |
Published: | Linkoping University Electronic Press, 2016 |
Online Access: | http://dx.doi.org/10.3384/nejlt.2000-1533.1644 https://nejlt.ep.liu.se/article/download/1662/1005 |
id |
crlinkopinguep:10.3384/nejlt.2000-1533.1644 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
LiU Electronic Press (Linköping University) |
op_collection_id |
crlinkopinguep |
language |
unknown |
description |
Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection. |
format |
Article in Journal/Newspaper |
author |
Grönroos, Stig-Arne; Hiovain, Katri; Smit, Peter; Rauhala, Ilona; Jokinen, Kristiina; Kurimo, Mikko; Virpioja, Sami |
author_sort |
Grönroos, Stig-Arne |
title |
Low-Resource Active Learning of Morphological Segmentation |
publisher |
Linkoping University Electronic Press |
publishDate |
2016 |
genre |
North Sámi; Sámi |
op_source |
Northern European Journal of Language Technology, volume 4, pages 47-72, ISSN 2000-1533 |
op_doi |
https://doi.org/10.3384/nejlt.2000-1533.1644 |
container_title |
Northern European Journal of Language Technology |
container_volume |
4 |
container_start_page |
47 |
op_container_end_page |
72 |