Named entity recognition for Icelandic: comparing and combining different machine learning methods


Bibliographic Details
Main Author: Ásmundur Alma Guðjónsson 1990-
Other Authors: Háskólinn í Reykjavík
Format: Thesis
Language: English
Published: 2021
Subjects:
NER
Online Access: http://hdl.handle.net/1946/37548
Description
Summary: Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities like dates, amounts of money, and percentages. NER is often an important step in other Natural Language Processing tasks, such as question answering or machine translation, and is a subtask of Information Extraction. A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is, as far as we know, the only previous Machine Learning (ML) model for the task in the Icelandic language. The goal of this project was to develop other ML methods that could then be compared with the neural model. The purpose of this was to provide better knowledge of the status of NER for Icelandic, to help the task move forward in the future. The first model chosen was a semi-supervised model that combined shallow language features with unsupervised word clusters (ixa-pipes). The second model was a Conditional Random Field (CRF) model that used word features but also made use of gazetteers. These models, in addition to the neural model, were then combined in a single NER system, where a vote between the three decided the output (CombiTagger). We trained these methods on training sets of varying sizes, but the evaluation was done on a fixed and identical set throughout all the experiments. These methods were then tested on a dataset we created with texts provided by Nasdaq Iceland. These texts mostly included news announcements and corporate reports, and are suitable for testing how the models perform in a real-world scenario. Moreover, the texts can be used to see how well the models generalize what they have learned by measuring their performance on data that differs considerably from the training data.
Our evaluation shows that it is possible to come very close to the performance of a neural model like NeuroNER with non-neural models like the CRF and the ixa-pipes models, when tested on a dataset ...
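The combination step described in the abstract can be illustrated with a short sketch: three taggers each produce a per-token NE label, and a majority vote decides the output. The abstract does not specify CombiTagger's exact tie-breaking rule, so the fallback to one designated model on a three-way disagreement is an assumption of this sketch, as are the example tag sequences.

```python
from collections import Counter

def combine_votes(predictions, fallback_index=0):
    """Combine per-token NE tags from several taggers by majority vote.

    predictions: list of tag sequences, one per tagger, all the same length.
    On a full disagreement, fall back to the tagger at fallback_index
    (an assumed tie-breaking rule, not necessarily CombiTagger's).
    """
    combined = []
    for token_tags in zip(*predictions):
        tag, votes = Counter(token_tags).most_common(1)[0]
        # Accept the tag only if at least two taggers agree on it.
        combined.append(tag if votes > 1 else token_tags[fallback_index])
    return combined

# Hypothetical outputs of the three models on the same four tokens,
# in the standard IOB2 tagging scheme.
neuroner  = ["B-PER", "I-PER", "O", "B-ORG"]
crf       = ["B-PER", "I-PER", "O", "O"]
ixa_pipes = ["B-PER", "O",     "O", "B-ORG"]

print(combine_votes([neuroner, crf, ixa_pipes]))
# → ['B-PER', 'I-PER', 'O', 'B-ORG']
```

For the second token, two of the three taggers vote `I-PER`, so the disagreement of the third is outvoted; for the fourth token, `B-ORG` likewise wins two to one.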