Aligning language models to professional domains using preference training



Bibliographic Details
Main Author: Þórir Hrafn Harðarson, 1981-
Other Authors: Háskólinn í Reykjavík
Format: Master Thesis
Language: English
Published: 2024
Online Access: http://hdl.handle.net/1946/47687
Description
Summary: Recent research has shown that preference training methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) can significantly improve the alignment of models with user intent and linguistic requirements. With these methods, smaller models have also been shown to produce outputs that are preferred over those of larger models that have not been aligned. Implementing preference training requires domain-specific data in which humans rank generated outputs by preference, a process that can be both costly and time-consuming. However, by assuming that a model instruction fine-tuned on labelled data will not outperform a human domain expert, a pairwise comparison dataset can be created from the model's output and the human-generated label, thereby simplifying the training process. These approaches were applied to domain-specific datasets created by collecting court rulings from the Supreme Court of Iceland, along with summaries of those rulings. Models were then trained on the downstream task of generating summaries of court rulings, challenging their ability to produce comprehensive, legally sound texts in Icelandic. Preliminary results suggest that preference training on data created with this method improves a model's generative output beyond the capabilities gained from supervised fine-tuning alone. Further research is needed to obtain more conclusive results on the potential performance gains of preference training for domain-specific downstream tasks.
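
The dataset-construction step described in the summary can be illustrated with a short sketch. The code below is not the thesis implementation: the model path, prompt wording, and record schema ("ruling", "human_summary") are hypothetical placeholders. It pairs each expert-written summary (treated as the preferred response) with the instruction fine-tuned model's own output (treated as the rejected response).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/instruction-finetuned-model"  # placeholder, not the thesis checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def summarize(ruling_text, max_new_tokens=512):
    # Prompt wording is illustrative; the thesis prompt format is not specified here.
    prompt = f"Summarize the following court ruling:\n\n{ruling_text}\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def build_preference_pairs(records):
    # records: iterable of dicts with "ruling" and "human_summary" keys (assumed schema).
    pairs = []
    for rec in records:
        pairs.append({
            "prompt": rec["ruling"],
            "chosen": rec["human_summary"],        # expert-written summary, assumed preferred
            "rejected": summarize(rec["ruling"]),  # SFT model output, assumed less preferred
        })
    return pairs

The resulting prompt/chosen/rejected records match the preference format expected by common DPO implementations (for example, TRL's DPOTrainer), so pairs built this way could feed the preference-training stage the summary describes.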