Automatic Pair Construction for Contrastive Post-training

Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we propose an automatic way to construct contrastive data for LLMs, using preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continuing SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, our automatic contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to outperform ChatGPT. (NAACL 2024, Findings)

Main Authors: Xu, Canwen; Rosset, Corby; Chau, Ethan C.; Del Corro, Luciano; Mahajan, Shweti; McAuley, Julian; Neville, Jennifer; Awadallah, Ahmed Hassan; Rao, Nikhil
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv, 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2310.02263 | https://arxiv.org/abs/2310.02263
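The recipe the abstract describes can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the authors' code: the function names (`build_pairs`, `dpo_loss`, `curriculum_order`) and the gap-based notion of pair difficulty are hypothetical; only the DPO loss itself follows the standard published formulation.

```python
import math

def build_pairs(prompts, strong_responses, weak_responses):
    """Automatic pair construction: for each prompt, treat the stronger
    model's output (e.g., GPT-4) as "chosen" and a weaker model's output
    (e.g., InstructGPT) as "rejected"."""
    return [
        {"prompt": p, "chosen": s, "rejected": w}
        for p, s, w in zip(prompts, strong_responses, weak_responses)
    ]

def dpo_loss(logp_chosen_policy, logp_rejected_policy,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    """DPO loss for one pair; inputs are summed response-token log-probs.
    The loss shrinks as the policy prefers the chosen response more
    strongly (relative to the frozen reference model)."""
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

def curriculum_order(pairs, gap_fn):
    """Easy-to-hard data curriculum: a larger quality gap between chosen
    and rejected makes a pair "easier", so train on large gaps first."""
    return sorted(pairs, key=gap_fn, reverse=True)
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; any relative preference for the chosen response pulls it below that.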
id: ftdatacite:10.48550/arxiv.2310.02263
record_format: openpolar
institution: Open Polar
collection: DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id: ftdatacite
language: unknown
topic: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); FOS: Computer and information sciences
genre: Orca
op_rights: arXiv.org perpetual, non-exclusive license (http://arxiv.org/licenses/nonexclusive-distrib/1.0/)
op_doi: https://doi.org/10.48550/arxiv.2310.02263