Language Technology Tools for Low-Resource Languages — Five Cases for Sakha, Norwegian, and Finnish



Bibliographic Details
Main Author: Ivanova, Sardana
Other Authors: Laippala, Veronika; Toivonen, Hannu; Granroth-Wilding, Mark; Helsingin yliopisto, matemaattis-luonnontieteellinen tiedekunta, Tietojenkäsittelytieteen tohtoriohjelma; Helsingfors universitet, matematisk-naturvetenskapliga fakulteten, Doktorandprogrammet i datavetenskap; University of Helsinki, Faculty of Science, Doctoral Programme in Computer Science
Format: Doctoral or Postdoctoral Thesis
Language: English
Published: Helsingin yliopisto 2024
Online Access: http://hdl.handle.net/10138/572878
Description
Summary: This dissertation develops language technology tools for low-resource languages. Language technology can greatly improve communication and information access for speakers of these languages, so it is important that they are not left behind in the rapidly evolving digital landscape. Supporting low-resource languages through technology development and revitalisation efforts is essential for preserving linguistic diversity and maintaining the richness of cultural heritage. The dissertation presents five case studies for three languages, ranging from the truly low-resource Sakha to the better-resourced Finnish and Norwegian, which still lack many of the resources available for English. Sakha is a Turkic language spoken in the Republic of Sakha in Siberia by 0.5 million people. Finnish is a Uralic language of the Finnic branch, spoken by 5.8 million people in Finland and by ethnic Finns outside Finland. Norwegian is a North Germanic language, spoken mainly in Norway by 5.32 million people. The five cases range from essential tools for Sakha, such as a morphological analyser, to higher-level tools for Norwegian and Finnish. The contributions of the dissertation are as follows. We developed a morphological analyser and generator for Sakha within the framework of two-level morphology; it achieves coverage above 90% and precision of 99%. While developing the analyser, we expanded linguistic knowledge about Sakha and devised strategies for handling complex grammatical patterns. We implemented a language-learning environment for Sakha in the Revita computer-assisted language-learning platform, using the morphological analyser we developed. We created a Turkic Interlingua corpus and trained Russian-Sakha, Sakha-Russian, English-Sakha, and Sakha-English machine translation models, as well as a multi-way neural machine translation model. We performed an extensive analysis using automatic metrics as well as human evaluations. We created NorQuAD, the ...