Automated analysis of Norwegian text

In this thesis we look at how we can develop automated analysis tools for Norwegian text. We look at three different tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For our work on PoS tagging, we extend the work done on the OBT+Stat tagger by traini...

Bibliographic Details
Main Author: Johansen, Bjarte
Format: Doctoral or Postdoctoral Thesis
Language: English
Published: The University of Bergen 2019
Subjects:
Ner
Online Access: http://hdl.handle.net/1956/20906
id ftunivbergen:oai:bora.uib.no:1956/20906
record_format openpolar
spelling ftunivbergen:oai:bora.uib.no:1956/20906 2023-05-15T17:08:18+02:00 Automated analysis of Norwegian text Johansen, Bjarte 2019-06-28 application/pdf http://hdl.handle.net/1956/20906 eng The University of Bergen Doctoral thesis 2019 ftunivbergen 2023-03-14T17:38:49Z
institution Open Polar
collection University of Bergen: Bergen Open Research Archive (BORA-UiB)
op_collection_id ftunivbergen
language English
description In this thesis we look at how we can develop automated analysis tools for Norwegian text. We look at three different tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For our work on PoS tagging, we extend the work done on the OBT+Stat tagger by training a new model that allows it to also disambiguate Nynorsk. We work with Google's SyntaxNet and train it for PoS tagging of Bokmål and Nynorsk, showing state-of-the-art results at the time of the research. We train a Support Vector Machine for NEC of Bokmål, the task of extracting names from text. Next, we develop a NER model using deep learning and provide a NER sequence tagger for Bokmål and Nynorsk. The Nynorsk tagger is the first NER model for Nynorsk that we are aware of. The best-performing model is trained on both language forms and shows better performance on both Bokmål and Nynorsk than the models we trained on each language form individually. Finally, we show how NEC and NER can be used together with Social Network Analysis tools in two case studies of the news story discussing the consequence study of drilling for oil in Lofoten, Vesterålen, and Senja. In the first case study we show that it is possible to find the thematic structures of a news story by analysing the relationships between the entities in the text. In the second case study, using topic modelling, we find the topics and the most important persons for each topic.
format Doctoral or Postdoctoral Thesis
author Johansen, Bjarte
title Automated analysis of Norwegian text
publisher The University of Bergen
publishDate 2019
url http://hdl.handle.net/1956/20906
long_lat ENVELOPE(6.622,6.622,62.612,62.612)
ENVELOPE(16.803,16.803,69.081,69.081)
ENVELOPE(14.939,14.939,68.754,68.754)
geographic Lofoten
Ner
Senja
Vesterålen
genre Lofoten
Vesterålen
op_relation container/a5/57/5a/6c/a5575a6c-2df9-4aec-8a16-d82833d7c5c8
urn:isbn:9788230848753
urn:isbn:9788230866757
http://hdl.handle.net/1956/20906
cristin:1708253
op_rights Attribution-NonCommercial (CC BY-NC)
http://creativecommons.org/licenses/by-nc/4.0/
Copyright the author.