Named entity recognition for Icelandic: comparing and combining different machine learning methods
Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities like dates, amounts of money and percentages. NER is often an important step in other Natural Language Processing tasks, like...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | |
Online Access: | http://hdl.handle.net/1946/37548 |
id |
ftskemman:oai:skemman.is:1946/37548 |
---|---|
record_format |
openpolar |
spelling |
ftskemman:oai:skemman.is:1946/37548 2023-05-15T16:52:55+02:00 Named entity recognition for Icelandic: comparing and combining different machine learning methods Nafnakennsl fyrir íslensku: samanburður og samsetning mismunandi vélnámsaðferða. Ásmundur Alma Guðjónsson 1990- Háskólinn í Reykjavík 2021-01 application/pdf http://hdl.handle.net/1946/37548 en eng http://hdl.handle.net/1946/37548 Máltækni Meistaraprófsritgerðir Vélrænt nám Líkanagerð Language technology Machine learning Modeling Thesis Master's 2021 ftskemman 2022-12-11T06:51:02Z Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other Named Entities in text. This can also include some numerical entities like dates, amounts of money and percentages. NER is often an important step in other Natural Language Processing tasks, like in question answering or machine translation. NER is a subtask of Information Extraction. A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is as far as we know, the only previous Machine Learning (ML) model for the task in the Icelandic language. The goal of this project was to develop other ML methods that could then be compared with the neural model. The purpose of this was to provide a better knowledge on the status of NER in the Icelandic language, for helping the task move forward in the future. The first model that was picked was a semi-supervised model that combined both shallow language features with unsupervised word clusters (ixa-pipes). The second model was a Conditional Random Field (CRF) model that used word features, but also made use of gazetteers. These models, in addition to the neural model, were then combined in a single NER system, where a vote between the three decided the output (CombiTagger). We trained these methods on training sets of varying sizes, but the evaluation was done on a fixed and identical set throughout all the experiments. 
These methods were then tested on a dataset we created with texts provided by Nasdaq Iceland. These texts mostly included news announcements and corporate reports, and are suitable for testing how the models perform in a real world scenario. Moreover, the texts can be used to see how well the models generalize what they have learned by measuring their performance on data that is of considerable difference from the training data. Our evaluation shows that it is possible to come very close to the performance of a neural model like NeuroNER with non-neural models like the CRF and the ixa-pipes models, when tested on a dataset ... Thesis Iceland Skemman (Iceland) Ner ENVELOPE(6.622,6.622,62.612,62.612) |
institution |
Open Polar |
collection |
Skemman (Iceland) |
op_collection_id |
ftskemman |
language |
English |
topic |
Máltækni Meistaraprófsritgerðir Vélrænt nám Líkanagerð Language technology Machine learning Modeling |
description |
Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other named entities in text. This can also include some numerical entities such as dates, amounts of money and percentages. NER is often an important step in other Natural Language Processing tasks, such as question answering or machine translation. NER is a subtask of Information Extraction. A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is, as far as we know, the only previous Machine Learning (ML) model for the task in the Icelandic language. The goal of this project was to develop other ML methods that could then be compared with the neural model. The purpose was to provide better knowledge of the status of NER for Icelandic, to help the task move forward in the future. The first model picked was a semi-supervised model that combined shallow language features with unsupervised word clusters (ixa-pipes). The second model was a Conditional Random Field (CRF) model that used word features and also made use of gazetteers. These models, in addition to the neural model, were then combined in a single NER system, where a vote between the three decided the output (CombiTagger). We trained these methods on training sets of varying sizes, but the evaluation was done on a fixed and identical set throughout all the experiments. These methods were then tested on a dataset we created with texts provided by Nasdaq Iceland. These texts mostly included news announcements and corporate reports, and are suitable for testing how the models perform in a real-world scenario. Moreover, the texts can be used to see how well the models generalize what they have learned, by measuring their performance on data that differs considerably from the training data. Our evaluation shows that it is possible to come very close to the performance of a neural model like NeuroNER with non-neural models like the CRF and the ixa-pipes models, when tested on a dataset ... |
author2 |
Háskólinn í Reykjavík |
format |
Thesis |
author |
Ásmundur Alma Guðjónsson 1990- |
title |
Named entity recognition for Icelandic: comparing and combining different machine learning methods |
publishDate |
2021 |
url |
http://hdl.handle.net/1946/37548 |
long_lat |
ENVELOPE(6.622,6.622,62.612,62.612) |
geographic |
Ner |
genre |
Iceland |
op_relation |
http://hdl.handle.net/1946/37548 |
_version_ |
1766043425486929920 |
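
The abstract describes combining three taggers (NeuroNER, ixa-pipes and the CRF model) through a vote. The record does not say how the vote is computed, so the following is only a minimal sketch of per-token plurality voting, not the CombiTagger implementation. The function name `majority_vote`, the BIO-style labels, and the tie-breaking rule (fall back to the first tagger in the list) are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several taggers by plurality vote.

    `predictions` holds one label sequence per tagger, all over the same
    tokens. Ties are resolved in favour of the first tagger in the list;
    this tie-breaking rule is an assumption, not taken from the thesis.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    combined: List[str] = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best_label, best_count = counts.most_common(1)[0]
        # Collect every label that shares the highest vote count.
        leaders = [lab for lab, c in counts.items() if c == best_count]
        # A unique leader wins; otherwise fall back to the first tagger.
        combined.append(best_label if len(leaders) == 1 else labels[0])
    return combined


# Hypothetical outputs of three taggers for a three-token sentence.
neural = ["B-PER", "O", "B-LOC"]
ixa = ["B-PER", "O", "O"]
crf = ["O", "B-ORG", "B-LOC"]
print(majority_vote([neural, ixa, crf]))
```

Listing the tagger that is trusted most first means three-way disagreements default to its label, which is one simple way to handle ties when the voters are of unequal strength.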