Automatic Pair Construction for Contrastive Post-training

Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we propose an automatic way to construct contrastive data for LLMs, using preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continued SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, our automatic contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to outperform ChatGPT.

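To make the construction described in the abstract concrete, here is a minimal sketch of the kind of procedure it outlines: responses to the same prompt are collected from models of known relative strength, and each stronger model's output is labeled "chosen" while each weaker model's is labeled "rejected", so no human annotation or reward model is needed. All names here (MODEL_RANKING, generate, PreferencePair) are illustrative assumptions, not the authors' actual code; the easy-to-hard sort mirrors the curriculum idea, on the assumption that a larger strength gap makes a pair easier to learn from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed strength ordering, strongest first (the models named in the abstract).
MODEL_RANKING = ["gpt-4", "chatgpt", "instructgpt"]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response from the stronger model
    rejected: str  # response from the weaker model
    gap: int       # rank distance; a larger gap ~ an "easier" pair

def build_pairs(prompts: List[str],
                generate: Callable[[str, str], str]) -> List[PreferencePair]:
    """Construct contrastive pairs automatically: for every prompt, pair each
    stronger model's output (chosen) with each weaker model's output (rejected)."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        responses: Dict[str, str] = {m: generate(m, prompt) for m in MODEL_RANKING}
        for i, strong in enumerate(MODEL_RANKING):
            for j in range(i + 1, len(MODEL_RANKING)):
                weak = MODEL_RANKING[j]
                pairs.append(PreferencePair(prompt, responses[strong],
                                            responses[weak], gap=j - i))
    return pairs

def curriculum_order(pairs: List[PreferencePair]) -> List[PreferencePair]:
    """Data curriculum from the abstract: "easier" pairs first (large strength
    gap, e.g. GPT-4 vs. InstructGPT), then "harder" ones (small gap,
    e.g. GPT-4 vs. ChatGPT)."""
    return sorted(pairs, key=lambda p: -p.gap)
```

Under this ordering, widely separated pairs are served early in training and near-peer pairs late, matching the easy-to-hard schedule the abstract reports as further improving alignment.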

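The abstract credits DPO with a step-function improvement over SFT baselines. For reference, the standard DPO objective (Rafailov et al., 2023, which the paper adopts rather than defines) trains the policy to prefer the chosen response over the rejected one by more than a frozen reference model does. The sketch below is a scalar, single-pair version, assuming per-response log-probabilities have already been computed elsewhere.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    margin = beta * (policy_logratio - ref_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); adequate for a sketch, though a
    # production implementation would use a numerically stable softplus.
    return math.log1p(math.exp(-margin))
```

A lower loss means the policy separates chosen from rejected more sharply than the reference does; beta controls how strongly the pair is pushed apart.
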
Bibliographic Details
Main Authors: Xu, Canwen, Rosset, Corby, Chau, Ethan C., Del Corro, Luciano, Mahajan, Shweti, McAuley, Julian, Neville, Jennifer, Awadallah, Ahmed Hassan, Rao, Nikhil
Format: Preprint (arXiv)
Language: English
Published: arXiv, 2023
Note: NAACL 2024 (Findings)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); FOS: Computer and information sciences
Rights: arXiv.org perpetual, non-exclusive license (http://arxiv.org/licenses/nonexclusive-distrib/1.0/)
Online Access: https://dx.doi.org/10.48550/arxiv.2310.02263
https://arxiv.org/abs/2310.02263