Automated analysis of Norwegian text

Bibliographic Details
Main Author: Johansen, Bjarte
Format: Doctoral or Postdoctoral Thesis
Language: English
Published: The University of Bergen 2019
Subjects:
NER
Online Access: http://hdl.handle.net/1956/20906
Description
Summary: In this thesis we look at how to develop automated analysis tools for Norwegian text. We consider three tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For PoS tagging, we extend the OBT+Stat tagger by training a new model that allows it to also disambiguate Nynorsk. We also train Google's SyntaxNet for PoS tagging of Bokmål and Nynorsk, showing state-of-the-art results at the time of the research. We train a Support Vector Machine for NEC of Bokmål, the task of extracting names from text. Next, we develop a NER model using deep learning and provide a NER sequence tagger for Bokmål and Nynorsk. The Nynorsk tagger is the first NER model for Nynorsk that we are aware of. The best-performing model is trained on both language forms and performs better on both Bokmål and Nynorsk than the models trained on each form individually. Finally, we show how NEC and NER can be used together with Social Network Analysis tools in two case studies of the news story about the consequence study of drilling for oil in Lofoten, Vesterålen, and Senja. In the first case study we show that it is possible to find the thematic structures of a news story by analysing the relationships between the entities in the text. In the second case study we use topic modelling to find the topics and the most important persons for each topic.