Neural Transfer Learning for Truly Low-Resource Natural Language Processing

Bibliographic Details
Main Author: Soisalon-Soininen, Eliel
Other Authors: Dinu, Liviu; University of Helsinki, Faculty of Science, Department of Computer Science, Doctoral Programme in Computer Science; Toivonen, Hannu; Granroth-Wilding, Mark
Format: Doctoral or Postdoctoral Thesis
Language: English
Published: Helsingin yliopisto 2023
Online Access: http://hdl.handle.net/10138/359240
Description
Summary: The vast majority of the world's languages are low-resource, lacking the data resources required by advanced natural language processing (NLP) based on data-intensive deep learning. Furthermore, annotated training data can be insufficient in some domains even within resource-rich languages. Low-resource NLP is crucial both for including language communities in the NLP sphere and for extending applications over a wider range of domains. The objective of this thesis is to contribute to this long-term goal, especially with regard to truly low-resource languages and domains.

We address truly low-resource NLP in the context of two tasks. First, we consider the low-level task of cognate identification, since cognates are useful for the cross-lingual transfer of many lower-level tasks into new languages. Second, we examine the high-level task of document planning, a fundamental task in data-to-text natural language generation (NLG), where many domains are low-resource; domain-independent document planning thus supports the transfer of NLG across domains. Following recent encouraging results, we propose neural network models for these tasks, using transfer learning methods in three low-resource scenarios. We divide our high-level objective into three research tasks characterised by different resource conditions.

In our first research task, we address cognate identification in endangered Sami languages of the Uralic family, given scarce labelled training data. Because the Sami languages lack high-resource close relatives, we propose a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), which we pre-train on data from unrelated Indo-European languages. We find that the S-CNN performs best at direct transfer to Sami, and adapts quickly when fine-tuned on a small amount of Sami data. In our second research task, we address a scenario with only unlabelled data for adapting the S-CNN from Indo-European to Uralic data. We propose both discriminative adversarial networks and pre-trained symbol embeddings, finding that adversarial adaptation ...
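The summary describes a Siamese convolutional neural network (S-CNN) trained on cognate pairs and transferred to Sami. As a rough illustration of that general technique, and not the thesis's actual architecture, the sketch below pairs a weight-shared character-level CNN encoder with a contrastive loss in PyTorch; the alphabet, dimensions, hyperparameters, and toy word pairs are all assumptions made here for illustration.

# Minimal Siamese CNN sketch for cognate identification (illustrative only;
# the architecture, alphabet, and hyperparameters are assumptions, not the thesis's).
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = "abcdefghijklmnopqrstuvwxyz'- "            # assumed character inventory
CHAR2IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # index 0 is reserved for padding
MAX_LEN = 16

def encode_word(word: str) -> torch.Tensor:
    """Map a word to a fixed-length tensor of character indices (0-padded)."""
    idx = [CHAR2IDX.get(c, 0) for c in word.lower()[:MAX_LEN]]
    idx += [0] * (MAX_LEN - len(idx))
    return torch.tensor(idx, dtype=torch.long)

class CharCNNEncoder(nn.Module):
    """Character-level CNN that embeds a word into a fixed-size vector."""
    def __init__(self, vocab_size=len(ALPHABET) + 1, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x):                      # x: (batch, MAX_LEN)
        h = self.emb(x).transpose(1, 2)        # (batch, emb_dim, MAX_LEN)
        h = F.relu(self.conv(h))               # (batch, hidden, MAX_LEN)
        h = h.max(dim=2).values                # global max pooling over character positions
        return self.out(h)                     # (batch, hidden)

def contrastive_loss(z1, z2, label, margin=1.0):
    """label = 1 for cognate pairs, 0 for non-cognates."""
    d = F.pairwise_distance(z1, z2)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

# Toy training step on made-up word pairs (for illustration only).
encoder = CharCNNEncoder()                     # one encoder, shared weights = "Siamese"
optim = torch.optim.Adam(encoder.parameters(), lr=1e-3)

pairs = [("night", "nacht", 1), ("water", "vatten", 1), ("dog", "kissa", 0)]
x1 = torch.stack([encode_word(a) for a, _, _ in pairs])
x2 = torch.stack([encode_word(b) for _, b, _ in pairs])
y = torch.tensor([l for _, _, l in pairs], dtype=torch.float)

loss = contrastive_loss(encoder(x1), encoder(x2), y)
optim.zero_grad(); loss.backward(); optim.step()
print(f"contrastive loss: {loss.item():.4f}")

In the transfer setting the summary describes, an encoder of this kind would first be trained on plentiful Indo-European cognate pairs and then either applied directly to Sami or fine-tuned on the small labelled Sami set.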
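The second research task mentions discriminative adversarial networks for adapting the S-CNN with only unlabelled Uralic data. One common realisation of this idea, shown here purely as an assumed sketch rather than the thesis's formulation, is domain-adversarial training: a discriminator learns to tell source (Indo-European) encodings from target (Uralic) ones, while a gradient-reversal layer pushes the encoder to make them indistinguishable. The tensor sizes and random stand-in batches below are illustrative.

# Sketch of domain-adversarial adaptation with a gradient-reversal layer
# (a generic DANN-style setup; the thesis's exact method may differ).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts whether an encoded word comes from the source or target language family."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, z, lam=1.0):
        return self.net(GradReverse.apply(z, lam)).squeeze(-1)

# Toy usage with random vectors standing in for encoder outputs
# (in practice these would come from a character-level encoder as sketched above).
disc = DomainDiscriminator()
z_src = torch.randn(8, 64, requires_grad=True)   # e.g. encodings of an Indo-European batch
z_tgt = torch.randn(8, 64, requires_grad=True)   # e.g. encodings of an unlabelled Uralic batch
z = torch.cat([z_src, z_tgt])
domain = torch.cat([torch.zeros(8), torch.ones(8)])
adv_loss = nn.functional.binary_cross_entropy_with_logits(disc(z), domain)
adv_loss.backward()   # the reversed gradient pushes the encoder toward family-invariant features

Training the discriminator to separate the two families while the reversed gradient flows back into the encoder requires no cognate labels on the target side, which is how this kind of adaptation can exploit purely unlabelled Uralic data.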