Automated analysis of Norwegian text
In this thesis we look at how we can develop automated analysis tools for Norwegian text. We look at 3 different tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For our work on PoS tagging, we extend the work done on the OBT+Stat tagger by traini...
Main Author: | Johansen, Bjarte |
---|---|
Format: | Doctoral or Postdoctoral Thesis |
Language: | English |
Published: | The University of Bergen, 2019 |
Online Access: | http://hdl.handle.net/1956/20906 |
id |
ftunivbergen:oai:bora.uib.no:1956/20906 |
---|---|
record_format |
openpolar |
spelling |
ftunivbergen:oai:bora.uib.no:1956/20906 2023-05-15T17:08:18+02:00 Automated analysis of Norwegian text Johansen, Bjarte 2019-06-28 application/pdf http://hdl.handle.net/1956/20906 eng eng The University of Bergen container/a5/57/5a/6c/a5575a6c-2df9-4aec-8a16-d82833d7c5c8 urn:isbn:9788230848753 urn:isbn:9788230866757 http://hdl.handle.net/1956/20906 cristin:1708253 Attribution-NonCommercial (CC BY-NC) http://creativecommons.org/licenses/by-nc/4.0/ Copyright the author. Doctoral thesis 2019 ftunivbergen 2023-03-14T17:38:49Z Doctoral or Postdoctoral Thesis Lofoten Vesterålen University of Bergen: Bergen Open Research Archive (BORA-UiB) Lofoten Ner ENVELOPE(6.622,6.622,62.612,62.612) Senja ENVELOPE(16.803,16.803,69.081,69.081) Vesterålen ENVELOPE(14.939,14.939,68.754,68.754) |
institution |
Open Polar |
collection |
University of Bergen: Bergen Open Research Archive (BORA-UiB) |
op_collection_id |
ftunivbergen |
language |
English |
description |
In this thesis we look at how to develop automated analysis tools for Norwegian text. We address three tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For PoS tagging, we extend the OBT+Stat tagger by training a new model that allows it to also disambiguate Nynorsk. We also train Google's SyntaxNet for PoS tagging of Bokmål and Nynorsk, showing state-of-the-art results at the time of the research. We train a Support Vector Machine for NEC of Bokmål, the task of extracting names from text. Next, we develop a NER model using deep learning and provide a NER sequence tagger for Bokmål and Nynorsk. The Nynorsk tagger is the first NER model for Nynorsk that we are aware of. The best-performing model is trained on both language forms; it performs better on both Bokmål and Nynorsk than the models we trained on each language form individually. Finally, we show how NEC and NER can be used together with Social Network Analysis tools in two case studies of the news story about the consequence study of drilling for oil in Lofoten, Vesterålen, and Senja. In the first case study we show that it is possible to find the thematic structures of a news story by analysing the relationships between the entities in the text. In the second case study, using topic modelling, we find the topics and the most important persons for each topic. |
format |
Doctoral or Postdoctoral Thesis |
author |
Johansen, Bjarte |
title |
Automated analysis of Norwegian text |
publisher |
The University of Bergen |
publishDate |
2019 |
url |
http://hdl.handle.net/1956/20906 |
long_lat |
ENVELOPE(6.622,6.622,62.612,62.612) ENVELOPE(16.803,16.803,69.081,69.081) ENVELOPE(14.939,14.939,68.754,68.754) |
geographic |
Lofoten Ner Senja Vesterålen |
genre |
Lofoten Vesterålen |
op_relation |
container/a5/57/5a/6c/a5575a6c-2df9-4aec-8a16-d82833d7c5c8 urn:isbn:9788230848753 urn:isbn:9788230866757 http://hdl.handle.net/1956/20906 cristin:1708253 |
op_rights |
Attribution-NonCommercial (CC BY-NC) http://creativecommons.org/licenses/by-nc/4.0/ Copyright the author. |
_version_ |
1766064037267767296 |