Named entity recognition for Icelandic: comparing and combining different machine learning methods

Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities like dates, amounts of money and percentages. NER is often an important step in other Natural Language Processing tasks, like...

Bibliographic Details
Main Author: Ásmundur Alma Guðjónsson 1990-
Other Authors: Háskólinn í Reykjavík
Format: Thesis
Language: English
Published: 2021
Subjects:
Ner
Online Access: http://hdl.handle.net/1946/37548
id ftskemman:oai:skemman.is:1946/37548
record_format openpolar
spelling ftskemman:oai:skemman.is:1946/37548 2023-05-15T16:52:55+02:00 Named entity recognition for Icelandic: comparing and combining different machine learning methods Nafnakennsl fyrir íslensku: samanburður og samsetning mismunandi vélnámsaðferða. Ásmundur Alma Guðjónsson 1990- Háskólinn í Reykjavík 2021-01 application/pdf http://hdl.handle.net/1946/37548 en eng http://hdl.handle.net/1946/37548 Máltækni Meistaraprófsritgerðir Vélrænt nám Líkanagerð Language technology Machine learning Modeling Thesis Master's 2021 ftskemman 2022-12-11T06:51:02Z Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities such as dates, amounts of money, and percentages. NER is often an important step in other Natural Language Processing tasks, such as question answering or machine translation. NER is a subtask of Information Extraction. A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is, as far as we know, the only previous Machine Learning (ML) model for the task in the Icelandic language. The goal of this project was to develop other ML methods that could then be compared with the neural model. The purpose was to provide better knowledge of the status of NER in the Icelandic language and to help the task move forward in the future. The first model picked was a semi-supervised model that combined shallow language features with unsupervised word clusters (ixa-pipes). The second model was a Conditional Random Field (CRF) model that used word features but also made use of gazetteers. These models, in addition to the neural model, were then combined in a single NER system, where a vote between the three decided the output (CombiTagger). We trained these methods on training sets of varying sizes, but the evaluation was done on a fixed and identical set throughout all the experiments. 
These methods were then tested on a dataset we created with texts provided by Nasdaq Iceland. These texts mostly included news announcements and corporate reports, and are suitable for testing how the models perform in a real-world scenario. Moreover, the texts can be used to see how well the models generalize what they have learned by measuring their performance on data that differs considerably from the training data. Our evaluation shows that it is possible to come very close to the performance of a neural model like NeuroNER with non-neural models like the CRF and the ixa-pipes models, when tested on a dataset ... Thesis Iceland Skemman (Iceland) Ner ENVELOPE(6.622,6.622,62.612,62.612)
institution Open Polar
collection Skemman (Iceland)
op_collection_id ftskemman
language English
topic Máltækni
Meistaraprófsritgerðir
Vélrænt nám
Líkanagerð
Language technology
Machine learning
Modeling
description Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities such as dates, amounts of money, and percentages. NER is often an important step in other Natural Language Processing tasks, such as question answering or machine translation. NER is a subtask of Information Extraction. A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is, as far as we know, the only previous Machine Learning (ML) model for the task in the Icelandic language. The goal of this project was to develop other ML methods that could then be compared with the neural model. The purpose was to provide better knowledge of the status of NER in the Icelandic language and to help the task move forward in the future. The first model picked was a semi-supervised model that combined shallow language features with unsupervised word clusters (ixa-pipes). The second model was a Conditional Random Field (CRF) model that used word features but also made use of gazetteers. These models, in addition to the neural model, were then combined in a single NER system, where a vote between the three decided the output (CombiTagger). We trained these methods on training sets of varying sizes, but the evaluation was done on a fixed and identical set throughout all the experiments. These methods were then tested on a dataset we created with texts provided by Nasdaq Iceland. These texts mostly included news announcements and corporate reports, and are suitable for testing how the models perform in a real-world scenario. Moreover, the texts can be used to see how well the models generalize what they have learned by measuring their performance on data that differs considerably from the training data. 
Our evaluation shows that it is possible to come very close to the performance of a neural model like NeuroNER with non-neural models like the CRF and the ixa-pipes models, when tested on a dataset ...
author2 Háskólinn í Reykjavík
format Thesis
author Ásmundur Alma Guðjónsson 1990-
title Named entity recognition for Icelandic: comparing and combining different machine learning methods
publishDate 2021
url http://hdl.handle.net/1946/37548
long_lat ENVELOPE(6.622,6.622,62.612,62.612)
geographic Ner
genre Iceland
op_relation http://hdl.handle.net/1946/37548
_version_ 1766043425486929920
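
The description above says the three models were combined so that "a vote between the three decided the output". As a rough illustration of that idea only, here is a minimal per-token majority vote in Python. It is not CombiTagger's actual algorithm: the function name `majority_vote`, the `fallback` tie-breaking rule, and the example tags are all assumptions made for this sketch.

```python
from collections import Counter


def majority_vote(predictions, fallback=0):
    """Combine label sequences from several taggers, one token at a time.

    predictions: list of label sequences, one per tagger, all the same length.
    fallback: index of the tagger to trust when no label has a strict
              majority (an assumption of this sketch, not taken from the thesis).
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all taggers must label the same number of tokens")

    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Accept the top label only if more than half of the taggers agree.
        combined.append(label if count > len(labels) / 2 else labels[fallback])
    return combined


# Hypothetical outputs of three taggers for a three-token sentence.
neural = ["B-PER", "O", "B-LOC"]
crf = ["B-PER", "O", "O"]
clusters = ["O", "B-ORG", "B-LOC"]
print(majority_vote([neural, crf, clusters]))
```

Voting token by token is the simplest possible scheme; it can produce label sequences that no single tagger would emit, so a real system may need a consistency check on the combined tags.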